Abstract
Many bioimage analysis projects produce quantitative descriptors of regions of interest in images. Associating these descriptors with the visual characteristics of the objects they describe is a key step in understanding the data at hand. However, as bioimage data and their analysis workflows increasingly move to the cloud, addressing interactive data exploration in remote environments has become a pressing issue. To address it, we developed the Image Data Explorer (IDE), a web application that integrates interactive linked visualization of images and derived data points with exploratory data analysis methods, annotation, classification and feature selection functionalities. The IDE is written in R using the shiny framework. It can be easily deployed on a remote server or on a local computer. The IDE is available at https://git.embl.de/heriche/image-data-explorer and a cloud deployment is accessible at https://shiny-portal.embl.de/shinyapps/app/01_image-data-explorer.
Introduction
A typical bioimage informatics workflow for cellular phenotyping [1] involves segmentation (the identification of objects of interest in images, e.g. cells or nuclei) and feature extraction (the computation of measurements for each object), often followed by the application of a classifier to annotate the objects into biologically relevant classes. The use of a classifier requires the availability of already annotated objects to form the training set from which the classifier algorithm will build a model. However, the generation of such a training set is often an iterative process that requires prior understanding of several aspects of the data. Such understanding is usually gained through exploratory data analysis techniques such as plotting, dimensionality reduction and clustering. In addition, exploratory techniques are also useful for quality control (e.g. outlier identification) and curation of the data (e.g. identification of mis-annotated data points). Many image analysis software packages are available to perform segmentation and feature extraction, some of which also allow annotation and classification [2]. Exploratory analysis of image-derived data is, however, generally left out of such software. Combinations of plugins for the popular image analysis software ImageJ [3] can be used for exploratory analysis; for example, the BAR plugin [4] contains routines for coloring regions of interest according to measurements. More extensive data exploration tools were developed in the context of high content screening. Phaedra [5] and CellProfiler Analyst [6] are desktop-based applications that address exploratory needs by drawing interactive plots allowing visualization of the images associated with data points. In addition, CellProfiler Analyst offers dimensionality reduction and classification functionalities, while Phaedra can delegate machine learning functionalities to external scripts and workflow management systems.
Similarly, the Image Data Explorer (IDE) integrates linked visualization of images and data points with additional exploratory methods such as dimensionality reduction and clustering, as well as classification and feature selection. Its light data input requirements allow flexibility in data provenance. Another distinguishing feature is that it can be deployed as an online service as well as a desktop application.
Design and implementation
The Image Data Explorer is written in the R programming language [7] using the shiny framework [8]. This choice was motivated by two considerations. First, R gives access to a wide range of powerful statistical and machine learning methods with which the IDE functionalities could be extended should the need arise. Second, the shiny framework allows building web applications in R, which, compared to desktop-bound applications, facilitates deployment and accessibility in a remote environment while still allowing easy installation for local use. The IDE uses a modular architecture to facilitate the addition of new functionalities.
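For readers unfamiliar with shiny modules, the minimal sketch below illustrates the general pattern such an architecture relies on: each section pairs a UI function with a server function sharing a namespace. The names and content here are hypothetical and do not reproduce the IDE code base.

```r
library(shiny)

# Hypothetical module illustrating the shiny module pattern: a UI function and a
# server function share a namespace id, so sections can be added independently.
explorerUI <- function(id) {
  ns <- NS(id)
  tagList(
    selectInput(ns("feature"), "Feature", choices = NULL),
    plotOutput(ns("plot"))
  )
}

explorerServer <- function(id, data) {
  moduleServer(id, function(input, output, session) {
    # Populate the feature selector with the numeric columns of the data table
    observe(updateSelectInput(session, "feature",
                              choices = names(Filter(is.numeric, data()))))
    output$plot <- renderPlot({
      req(input$feature)
      hist(data()[[input$feature]], main = input$feature, xlab = input$feature)
    })
  })
}

# Assembling an app from the module (run interactively):
# shinyApp(
#   ui = fluidPage(explorerUI("explore")),
#   server = function(input, output, session) {
#     explorerServer("explore", data = reactive(iris))
#   }
# )
```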
The application requires two types of input: a set of images and a character-delimited text file of tabular data derived from the images. When the IDE runs on a user’s computer (or a server), the images can be on any file system the user has access to from that computer; the only requirement is that the images must be under one common top-level directory (as opposed, for example, to being distributed across multiple file systems). Alternatively, the images can be put in an S3-compatible object store for remote access. The IDE can in principle read all image formats supported by the Bio-Formats library [9] but has only been extensively tested with multi-page TIFF files. The IDE can currently display images of at most three dimensions, with the third dimension typically representing depth (z coordinate), time or channel. Images of higher dimensions are supported, but in this case the IDE merges the channels and the user has to decide whether depth or time should be displayed. In the tabular data, rows represent either images or regions of interest (ROIs) in images, and columns represent features and other associated data such as annotations of class membership. For the application to link data points to the corresponding images, the table should include at least one column with the path to the image files relative to the image root directory or to the bucket of an S3-compatible object store. ROIs are identified by the coordinates of an anchor point (e.g. the ROI centre), so there should also be a column for each of the relevant coordinates.
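For illustration, a minimal table of this kind could look as follows when read into R. The column names (file, pos_x, pos_y, etc.) and values are hypothetical; the actual column roles are declared by the user in the input section.

```r
# Hypothetical example of the expected tabular input: one row per ROI (here, per
# nucleus), a column with the image path relative to the image root directory or
# S3 bucket, and columns with the ROI anchor point coordinates.
tab <- read.csv(text =
"file,pos_x,pos_y,area,nucleoli_count,label
plate1/well_A01.tiff,120,334,5632,2,control
plate1/well_A01.tiff,412,87,4890,1,control
plate2/well_B03.tiff,233,198,7120,3,phenotype")
str(tab)
```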
The user interface is a web page composed of different sections corresponding to the different functionalities, accessible from a list on the left-hand side of the screen. The entry point is the input section, where the user can upload a tabular data file, configure the application to locate the images (e.g. by selecting the image root directory on the server or giving object store connection details) and provide information on the table content (e.g. specify which column contains the relative path to the image files or which columns contain the coordinates of ROI anchor points). The input configuration parameters can be saved to a file which can be uploaded to the application to restore the configuration in a subsequent session.
The explore section (Fig 1) is where the interactive visualization takes place. This section is divided into three areas: a plot viewer, an image viewer and a data table. These areas are linked in three-way interactions such that selecting an element in any area highlights the corresponding data point in the others. The data table can be replaced by a second image viewer when two images need to be visualized for a given data point (for example, the original image and the corresponding segmentation mask). The plot viewer can be switched between a scatter plot, a histogram, a bar plot and a multi-well plate representation. The interactive plots are implemented using the ggplot2 [10] and plotly [11] libraries.
Fig 1. The explore workspace.
The top left panel is used for plotting, the top right panel is an interactive image viewer and the bottom section shows the interactive data table, with a tab to access a second image viewer. The red dot in the image viewer indicates the position of the segmented object corresponding to the selected data point, shown in black in the scatter plot and highlighted in the data table.
The image viewer is a modified version of the display function from the EBImage library [12].
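For reference, reading and displaying an image with EBImage looks like the sketch below; the file name is hypothetical and the IDE ships its own modified viewer rather than this stock function.

```r
library(EBImage)

# Read a multi-page TIFF (hypothetical file name) and display it interactively
# in the browser; the IDE's viewer is a modified version of this display function.
img <- readImage("plate1/well_A01.tiff")
display(img, method = "browser")
```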
The annotate section allows users to define labels and either create a new data table column or select an existing one as the annotation-containing column. Selected data points can then be batch-annotated with one of the available labels using a button in the explore section.
The dimensionality reduction and clustering sections allow the application of various methods useful for finding patterns in the data. The methods currently implemented are principal component analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP) [13] for dimensionality reduction (including supervised metric learning with UMAP), and k-means [14] and HDBSCAN [15] for clustering. The list is easily extensible to other methods. Dimensionality reduction methods add two new coordinate columns to the data table, which are then used for plotting. Similarly, clustering methods add a column with cluster membership.
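As an indication of how this behaviour maps onto standard R functions, the following sketch appends the same kinds of columns to a data table. The package choices (uwot for UMAP, dbscan for HDBSCAN), parameter values and column names are illustrative assumptions, not the IDE's actual implementation, and the synthetic data stands in for the uploaded feature table.

```r
library(uwot)    # UMAP implementation
library(dbscan)  # HDBSCAN implementation

# Synthetic stand-in for the numeric feature columns of the uploaded data table
set.seed(1)
feats <- scale(matrix(rnorm(200 * 5), ncol = 5,
                      dimnames = list(NULL, paste0("feature", 1:5))))
tab <- as.data.frame(feats)

# Dimensionality reduction: two new coordinate columns are appended for plotting
pca <- prcomp(feats)
tab$PC1 <- pca$x[, 1]
tab$PC2 <- pca$x[, 2]
um <- umap(feats, n_neighbors = 15, min_dist = 0.1)
tab$UMAP1 <- um[, 1]
tab$UMAP2 <- um[, 2]

# Clustering: one new column with cluster membership
tab$kmeans_cluster  <- kmeans(feats, centers = 3)$cluster
tab$hdbscan_cluster <- hdbscan(feats, minPts = 10)$cluster
```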
The classification and feature selection section implements gradient boosting classification via the XGBoost library [16]. It outputs various summary statistics on the model performance as well as a plot of feature importance, which can be used to inform feature selection. A column with class predictions from the model can also be added to the data table. A button allows users to download the data table with all modifications made to it.
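The sketch below shows how such a classifier and its gain-based feature importance can be obtained with the xgboost R package; the parameter values and synthetic data are illustrative and do not necessarily match the IDE's defaults.

```r
library(xgboost)

# Synthetic stand-in for the IDE's data table: numeric features + a class label
set.seed(1)
X <- matrix(rnorm(200 * 5), ncol = 5, dimnames = list(NULL, paste0("feature", 1:5)))
label <- ifelse(X[, 1] + rnorm(200) > 0, "phenotype", "control")

dtrain <- xgb.DMatrix(data = X, label = as.integer(label == "phenotype"))
model <- xgboost(data = dtrain, objective = "binary:logistic",
                 nrounds = 50, max_depth = 4, eta = 0.1, verbose = 0)

# Feature importance: 'Gain' measures the accuracy improvement a feature brings
# to the branches it is used in, normalised so that the values sum to 1
print(xgb.importance(model = model))

# Class predictions that could be added back to the data table
predicted <- ifelse(predict(model, dtrain) > 0.5, "phenotype", "control")
```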
The statistics section gives access to one-way ANOVA and post hoc tests. The IDE assumes independence of the samples but otherwise tries to validate the assumptions underlying ANOVA. For sample sizes below 30, the normality assumption is tested using the Shapiro-Wilk test and, if the data are deemed not normally distributed, the parametric ANOVA is replaced by the non-parametric Kruskal-Wallis test, with pairwise Wilcoxon rank sum tests as post hoc tests. The homogeneity of variance is tested using Bartlett’s test and, when unequal variances are detected, the IDE performs Welch’s ANOVA. Post hoc tests are then carried out with the Games-Howell test.
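This decision logic can be sketched in base R as follows. The Games-Howell implementation shown here (rstatix package) and the use of Tukey's HSD in the equal-variance case are assumptions for illustration, not necessarily the IDE's actual choices.

```r
# Example data: a numeric feature grouped by siRNA (synthetic)
set.seed(1)
df <- data.frame(value = c(rnorm(20, 10), rnorm(20, 12), rnorm(20, 10, 3)),
                 group = factor(rep(c("scrambled", "siNPM1_1", "siNPM1_2"), each = 20)))

run_group_test <- function(df) {
  # Test normality per group only for small samples (< 30 observations)
  normal <- if (min(table(df$group)) < 30) {
    all(tapply(df$value, df$group, function(x) shapiro.test(x)$p.value) > 0.05)
  } else TRUE

  if (!normal) {
    # Non-parametric route: Kruskal-Wallis + pairwise Wilcoxon rank sum tests
    list(omnibus = kruskal.test(value ~ group, data = df),
         posthoc = pairwise.wilcox.test(df$value, df$group))
  } else if (bartlett.test(value ~ group, data = df)$p.value < 0.05) {
    # Unequal variances: Welch's ANOVA + Games-Howell post hoc test
    list(omnibus = oneway.test(value ~ group, data = df, var.equal = FALSE),
         posthoc = rstatix::games_howell_test(df, value ~ group))
  } else {
    # Standard one-way ANOVA; Tukey's HSD shown as an illustrative post hoc choice
    fit <- aov(value ~ group, data = df)
    list(omnibus = summary(fit), posthoc = TukeyHSD(fit))
  }
}

run_group_test(df)
```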
Results
Mitotic chromosome condensation and nucleolus formation are two processes thought to be driven by phase separation [17, 18]. Proteins involved in these two processes could be candidate regulators of phase separation. To explore this idea, we sought to analyse nucleoli in images of gene knockdowns known to affect chromosome condensation. To this end, we processed publicly available images from a published RNAi screen for chromosome condensation [19] with CellProfiler using a Galaxy [20] workflow to characterize nucleoli (workflow available on the WorkflowHub at https://workflowhub.eu/workflows/41). In brief, chromosome condensation hits known to localize to the nucleolus were selected, together with NPM1, a nucleolar protein whose knockdown is known to disrupt nucleolar structure [21], and the corresponding images were accessed from the Image Data Resource [22] (https://idr.openmicroscopy.org/webclient/?show=screen-102). Nuclei were segmented from the DNA channel at time point 170 and nucleoli were segmented as “holes” in each nucleus. Features were extracted for both nuclei and nucleoli, and nucleoli features were averaged for each nucleus. This resulted in a data table in which each row represents a segmented nucleus with summary descriptors of its nucleoli. We use the IDE to explore whether any gene knockdown resulted in a nucleolar phenotype.
We start by generating a scatter plot of nucleus size versus nucleoli number. This reveals that larger nuclei tend to have more nucleoli. Selecting outlying points allows visual inspection of the corresponding nuclei in the original image as well as in the segmentation mask (Fig 2) and reveals that, although nuclei have generally been adequately segmented, very large nuclei correspond to segmentation artefacts. Selecting the siRNA column as group variable and switching to a box plot representation of nucleoli count (Fig 3A) shows that control cells typically have 1–3 nucleoli with a median of 2, while some knockdowns seem to increase the proportion of cells with 0 or 1 nucleolus. Among these are the two siRNAs targeting NPM1. Nuclei with a high number of nucleoli are consistent with the larger number of nucleoli observed after mitosis [23].
We next asked whether other features could prove useful to detect nuclei with nucleolar phenotypes. To explore this, we switch to the annotate section and create an annotation column with two labels, control and phenotype. Back in the explore section, we select NPM1 nuclei from the siRNA with the strongest effect on nucleoli number and annotate them as phenotype. To find negative controls from neighbouring wells, we use the table search functions to filter the rows by plate, well and/or siRNA ID and annotate negative control nuclei as control. We then switch to the classification and feature selection section to train an XGBoost classifier on all features and produce a plot of relative feature importance (Fig 3B), in which the values over all features sum to 1. The feature importance reported by the IDE is the gain, a measure of the improvement in accuracy contributed by a feature [16]; when comparing two features, the one with the higher value matters more for the classifier performance. However, before checking feature importance, we ensure that the classes can be distinguished. For this, the IDE reports several statistics on the classifier performance. In this case, the resulting classifier has an accuracy of 74%, which can be compared to the no information rate, i.e. the fraction of the most abundant class, corresponding to the best performance obtainable by a naive classifier that always assigns the most common label. Confidence that the accuracy is above the no information rate is given by the low associated p-value of 8e-9. Here, the most important feature corresponds to nucleolus size. A box plot of nucleolus size distribution by siRNA confirms this: negative controls and some siRNAs show a median nucleolus size around 40 pixels, while other siRNAs, including those targeting NPM1, have a median around 20.
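For reference, the comparison of accuracy against the no information rate described above can be reproduced, for example, with the caret package; whether the IDE relies on caret is an assumption, and the synthetic labels below merely stand in for annotated and predicted classes.

```r
library(caret)

# Synthetic predicted vs annotated labels standing in for a held-out set
set.seed(1)
truth     <- factor(sample(c("control", "phenotype"), 200, replace = TRUE, prob = c(0.6, 0.4)))
predicted <- factor(ifelse(runif(200) < 0.8, as.character(truth),
                           sample(c("control", "phenotype"), 200, replace = TRUE)))

# confusionMatrix reports Accuracy, the No Information Rate (fraction of the
# most abundant class) and a one-sided binomial test p-value for Accuracy > NIR
cm <- confusionMatrix(predicted, truth)
cm$overall[c("Accuracy", "AccuracyNull", "AccuracyPValue")]
```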
Fig 2. Use of a second image viewer to simultaneously visualize the original image and the segmentation mask corresponding to a selected data point.
Fig 3.
A—Box plot of nucleolar number per siRNA as produced by the IDE. Each colour corresponds to data from one siRNA. B—Output of the XGBoost classifier. Left panel: plot of relative feature importance; colours indicate clusters of features with similar importance. Right panel: information to assess the performance of the trained classifier.
Our quick exploration with the IDE thus indicates that images from this data set contain relevant information and already identifies a few knockdowns that may warrant further analysis. However, it also suggests that derived features such as the fraction of nuclei with certain properties (e.g. nucleolar size above or below some threshold) could be useful to more robustly identify nucleolar phenotypes.
Discussion
The IDE couples exploratory data analysis methods with interactive visualization of images and associated data. The light input requirements make it ideally suited for quick exploration of image data from different sources because many image analysis tools can produce character-delimited tabular data. In its current implementation, it is particularly suitable for small to medium-sized data sets (up to hundreds of thousands of data points). The modular architecture makes it easy to add new tools and its implementation as a web application makes it easily deployable in cloud environments for remote access.
Availability and future directions
The IDE is licensed under the GNU General Public License v3.0. The code is managed in a GitLab instance at https://git.embl.de/heriche/image-data-explorer, from which it can be downloaded and where documentation can be found in the wiki section. A cloud-based deployment is available at https://shiny-portal.embl.de/shinyapps/app/01_image-data-explorer. Future work will include support for images in OME-NGFF, the next-generation file format for bioimaging [24], to improve interoperability and facilitate more flexible data access, for example by allowing direct access to images from repositories such as the IDR or the BioImage Archive [25]. With a foreseen increase in the use of the Galaxy platform for image analysis, we also plan to deploy the IDE as a Galaxy interactive environment to give it direct access to data produced by Galaxy workflows.
Supporting information
Alternatively, the associated images can be accessed by the IDE by manually selecting the S3 bucket named screens at endpoint s3.embl.de as the image root directory.
Acknowledgments
We thank Aliaksandr Halavatyi, Kimberly Meechan and Hugo Botelho for discussions and suggestions and Valerie Petegnieff for test data.
Data Availability
The code is available in a GitLab instance at https://git.embl.de/heriche/image-data-explorer. Documentation can be found in the wiki section of the repository. Data used to produce the figures are included in the Supporting information files, and the associated images are available in an S3 bucket named screens at endpoint s3.embl.de; they can also be retrieved from the Image Data Resource (https://idr.openmicroscopy.org/) using the list of image IDs provided in the Supporting information.
Funding Statement
Work by BSS and YS was supported by EOSC-Life under grant agreement H2020-EU.1.4.1.1. EOSC-Life 824087. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Sommer C, Gerlich DW. Machine learning in cell biology—teaching computers to recognize phenotypes. J Cell Sci. 2013; 126(Pt 24):5529–39. doi: 10.1242/jcs.123604
- 2. Smith K, Piccinini F, Balassa T, Koos K, Danka T, Azizpour H, et al. Phenotypic Image Analysis Software Tools for Exploring and Understanding Big Image Data from Cell-Based Assays. Cell Syst. 2018; 6(6):636–653. doi: 10.1016/j.cels.2018.06.001
- 3. Rueden CT, Schindelin J, Hiner MC, DeZonia BE, Walter AE, Arena ET, et al. ImageJ2: ImageJ for the next generation of scientific image data. BMC Bioinformatics. 2017; 18:529. doi: 10.1186/s12859-017-1934-z
- 4. Ferreira T, Miura K, Bitdeli Chef, Eglinger J. Scripts: BAR 1.1.6. Zenodo; 2015.
- 5. Cornelissen F, Cik M, Gustin E. Phaedra, a protocol-driven system for analysis and validation of high-content imaging and flow cytometry. J Biomol Screen. 2012; 17(4):496–506. doi: 10.1177/1087057111432885
- 6. Stirling DR, Carpenter AE, Cimini BA. CellProfiler Analyst 3.0: Accessible data exploration and machine learning for image analysis. Bioinformatics. 2021; btab634. doi: 10.1093/bioinformatics/btab634
- 7. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
- 8. Chang W, Cheng J, Allaire JJ, Sievert C, Schloerke B, Xie Y, et al. shiny: Web Application Framework for R. R package version 1.7.1. https://CRAN.R-project.org/package=shiny
- 9. Linkert M, Rueden CT, Allan C, Burel JM, Moore W, Patterson A, et al. Metadata matters: access to image data in the real world. J Cell Biol. 2010; 189(5):777–82. doi: 10.1083/jcb.201004104
- 10. Wickham H. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag, New York; 2016.
- 11. Sievert C. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC, Florida; 2020.
- 12. Pau G, Fuchs F, Sklyar O, Boutros M, Huber W. EBImage—an R package for image processing with applications to cellular phenotypes. Bioinformatics. 2010; 26(7):979–81. doi: 10.1093/bioinformatics/btq046
- 13. McInnes L, Healy J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426 [Preprint]. 2018. https://arxiv.org/abs/1802.03426
- 14. Hartigan JA, Wong MA. Algorithm AS 136: A K-Means Clustering Algorithm. Applied Statistics. 1979; 28(1):100–108. doi: 10.2307/2346830
- 15. Hahsler M, Piekenbrock M, Doran D. dbscan: Fast Density-Based Clustering with R. Journal of Statistical Software. 2019; 91(1):1–30.
- 16. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. KDD ’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. p. 785–794.
- 17. Strom AR, Brangwynne CP. The liquid nucleome—phase transitions in the nucleus at a glance. J Cell Sci. 2019; 132(22). doi: 10.1242/jcs.235093
- 18. Cuylen S, Blaukopf C, Politi AZ, Müller-Reichert T, Neumann B, Poser I, et al. Ki-67 acts as a biological surfactant to disperse mitotic chromosomes. Nature. 2016; 535(7611):308–12. doi: 10.1038/nature18610
- 19. Hériché JK, Lees JG, Morilla I, Walter T, Petrova B, Roberti MJ, et al. Integration of biological data by kernels on graph nodes allows prediction of new genes involved in mitotic chromosome condensation. Mol Biol Cell. 2014; 25(16):2522–2536. doi: 10.1091/mbc.E13-04-0221
- 20. Afgan E, Baker D, Batut B, van den Beek M, Bouvier D, Cech M, et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 2018; 46(W1):W537–W544. doi: 10.1093/nar/gky379
- 21. Holmberg Olausson K, Nistér M, Lindström MS. Loss of nucleolar histone chaperone NPM1 triggers rearrangement of heterochromatin and synergizes with a deficiency in DNA methyltransferase DNMT3A to drive ribosomal DNA transcription. J Biol Chem. 2014; 289(50):34601–19. doi: 10.1074/jbc.M114.569244
- 22. Williams E, Moore J, Li SW, Rustici G, Tarkowska A, Chessel A, et al. The Image Data Resource: A Bioimage Data Integration and Publication Platform. Nat Methods. 2017; 14(8):775–781. doi: 10.1038/nmeth.4326
- 23. Savino TM, Gébrane-Younès J, De Mey J, Sibarita JB, Hernandez-Verdun D. Nucleolar assembly of the rRNA processing machinery in living cells. J Cell Biol. 2001; 153(5):1097–1110. doi: 10.1083/jcb.153.5.1097
- 24. Moore J, Allan C, Besson S, Burel JM, Diel E, Gault D, et al. OME-NGFF: a next-generation file format for expanding bioimaging data-access strategies. Nat Methods. 2021; 18(12):1496–1498. doi: 10.1038/s41592-021-01326-w
- 25. Ellenberg J, Swedlow JR, Barlow M, Cook CE, Sarkans U, Patwardhan A, et al. A call for public archives for biological image data. Nat Methods. 2018; 15(11):849–854. doi: 10.1038/s41592-018-0195-8