PLOS Computational Biology. 2020 Oct 19;16(10):e1008349. doi: 10.1371/journal.pcbi.1008349

PyHIST: A Histological Image Segmentation Tool

Manuel Muñoz-Aguirre 1,2,*,#, Vasilis F Ntasis 1,#, Santiago Rojas 3, Roderic Guigó 1,4
Editor: Dina Schneidman-Duhovny
PMCID: PMC7647117  PMID: 33075075

Abstract

The development of increasingly sophisticated methods to acquire high-resolution images has led to the generation of large collections of biomedical imaging data, including images of tissues and organs. Many of the current machine learning methods that aim to extract biological knowledge from histopathological images require several data preprocessing stages, creating an overhead before the proper analysis. Here we present PyHIST (https://github.com/manuel-munoz-aguirre/PyHIST), an easy-to-use, open source whole slide histological image tissue segmentation and preprocessing command-line tool aimed at tile generation for machine learning applications. From a given input image, the PyHIST pipeline i) optionally rescales the image to a different resolution, ii) produces a mask for the input image which separates the background from the tissue, and iii) generates individual image tiles with tissue content.

Author summary

Histopathology images are an essential tool to assess and quantify tissue composition and its relationship to disease. The digitization of slides and the decreasing costs of computation and data storage have fueled the development of new computational methods, especially in the field of machine learning. These methods seek to make use of the histopathological patterns encoded in these slides with the aim of aiding clinicians in healthcare decision-making, as well as researchers in tissue biology. However, to prepare digital slides for use in machine learning applications, researchers usually need to develop custom scripts from scratch to reshape the image data into a form suitable for training a model, slowing down the development process. With PyHIST, we provide a toolbox for researchers who work at the intersection of machine learning, biology and histology to effortlessly preprocess whole slide images into image tiles in a standardized manner for use in external applications.


This is a PLOS Computational Biology Software paper.

Introduction

In histopathology, Whole Slide Images (WSI) are high-resolution images of tissue sections obtained by scanning conventional glass slides [1]. Currently, these glass slides of fixed tissue samples are the preferred method in pathology laboratories around the world to make clinical diagnoses [2], notably in cancer [3]. However, the increasing automation of WSI acquisition has led to the development of computational methods to process the images with the goal of helping clinicians and pathologists in diagnosis and disease classification [4]. As increasingly large WSI datasets have become available, methods have been developed for a wide array of tasks, such as the classification of breast cancer metastases, Gleason scoring for prostate cancer, tumor segmentation, nuclei detection and segmentation, bladder cancer diagnosis, and mutated gene prediction, among others [5–10]. Besides being important diagnostic tools, histopathological images capture endophenotypes (of organs and tissues) that, when correlated with molecular and cellular data on the one hand, and higher-order phenotypic traits on the other, can provide crucial information on the biological pathways that mediate between the sequence of the genome and the biological traits of organisms (including diseases) [11].

Because of the complexity of the information typically contained in WSIs, Machine Learning (ML) methods that can infer, without prior assumptions, the relevant features that they encode are becoming the preferred analytical tools [12]. These features may be clinically relevant but challenging to spot even for expert pathologists, and thus, ML methods can prove valuable in healthcare decision-making [13].

In most ML tasks, data preprocessing remains a fundamental step. In the domain of histological images, several issues arise when preprocessing the data before an analysis: due to the large dimensions of WSIs, many deep learning applications have to break them down into smaller square pieces called tiles [14]. Furthermore, a significant fraction of the area in a WSI is often uninformative background that is not meaningful for the majority of downstream analyses. To circumvent this, some applications apply a series of image transformations to separate the foreground from the background (see, for example, [15]), and perform relevant operations only over regions with tissue content. However, this process is not standardized, and customized scripts frequently have to be developed for the data preparation stages (see, for example, [10,15]). This is cumbersome and may introduce dataset-specific biases, which can prevent integration across multiple datasets.
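To make the scale concrete, a back-of-the-envelope calculation (using a hypothetical slide size, not a figure from the paper) shows why tiling is unavoidable:

```python
# Illustrative arithmetic only: why WSIs must be broken into tiles.
# Assume a hypothetical slide of 80,000 x 60,000 pixels at native resolution.
width, height = 80_000, 60_000
tile = 512

# Number of full 512x512 tiles in the grid covering the slide.
n_tiles = (width // tile) * (height // tile)

# Raw memory for the full RGB image at 1 byte per channel, in gigabytes:
# far too large to feed to a model in one piece.
gb = width * height * 3 / 1024**3

print(n_tiles)       # 18252 tiles
print(round(gb, 1))  # 13.4 GB uncompressed
```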

Currently available tools for WSI processing focus mostly on the analysis of human-interpretable features by means of nuclei segmentation, object quantification and region-of-interest annotation [16–18]; however, WSI preparation into tiles for external ML applications has not yet been directly addressed. To systematize the WSI preprocessing procedure for these applications, and to streamline the data preparation stage at the initial phase of an ML project by avoiding the need to create custom image preprocessing scripts, we developed PyHIST, a command-line pipeline that segments the regions of a histological image into tiles with relevant tissue content (foreground) with little user intervention. PyHIST was developed to process Aperio SVS/TIFF WSIs because this format is supported by large slide databases such as The Cancer Genome Atlas (TCGA), with approximately 31,000 WSIs [19], and The Genotype-Tissue Expression (GTEx) project, with approximately 25,000 WSIs [20]. PyHIST currently has experimental support for other image formats (see S1 Text).

Design and implementation

PyHIST is a command-line Python tool based on OpenSlide [21], a library to read high-resolution histological images in a memory-efficient way. PyHIST's input is a WSI encoded in SVS format (Fig 1A), and the main output is a series of image tiles retrieved from regions with tissue content (Fig 1E).

Fig 1. PyHIST pipeline.


(a) The input to the pipeline is a Whole Slide Image (WSI). Within PyHIST, the user can decide to scale down the image to perform the segmentation and tile extraction at lower resolutions. The WSI shown is of a skin tissue sample (GTEX-1117F-0126) from the Genotype-Tissue Expression (GTEx) project [20]. (b) An alternative version of the input image is generated, where the tissue edges are highlighted using a Canny edge detector. A graph segmentation algorithm is employed over this image in order to generate the mask shown in (c). PyHIST extracts tiles of specific dimensions from the masked regions, and provides an overview image to inspect the output of the segmentation and masking procedure, as shown in (d), where the red lines indicate the grid generated by tiling the image at user-specified tile dimensions, while the blue crosses indicate the selected tiles meeting a certain user-specified threshold of tissue content with respect to the total area of the tile. In (e), examples of selected tiles are shown.

The PyHIST pipeline involves three main steps: 1) produce a mask for the input WSI that differentiates the tissue from the background, 2) create a grid of tiles on top of the mask and evaluate each tile to see whether it meets the minimum content threshold to be considered foreground, and 3) extract the selected tiles from the input WSI at the requested resolution. By default, PyHIST uses a graph-based segmentation method to produce the mask. In this method, tissue edges inside the WSI are first identified using a Canny edge detector (Fig 1B), generating an alternative version of the image with diminished noise and an enhanced distinction between the background and the tissue foreground. These edges are then processed by a graph-based segmentation algorithm [22], which is used here to identify tissue content. In short, this step evaluates the boundaries between different regions of the image as defined by the edges: different parts of the image are represented as connected components of a graph, and the "within" and "in-between" variations of neighboring components are assessed to decide whether the examined image regions should be merged into a single component. From this, a mask is obtained in which the background and the different tissue slices are separated and marked as distinct objects using different colors (Fig 1C). Finally, the mask is divided into a tile grid with a user-specified tile size. Tiles that meet a minimum foreground (tissue) threshold with respect to the total tile area are kept; the rest are discarded. Optionally, the user can also decide to save all the tiles in the image.
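The tile-selection logic of steps 2 and 3 can be sketched as follows, assuming a precomputed boolean foreground mask. PyHIST's actual mask comes from the Canny/graph-based procedure described above, and the function name here is illustrative, not part of PyHIST's API:

```python
import numpy as np

def select_tiles(mask: np.ndarray, tile_size: int, content_threshold: float):
    """Return (row, col) grid indices of tiles whose foreground fraction
    meets `content_threshold`. `mask` is a boolean array (True = tissue).
    Sketch of PyHIST steps 2-3; the real tool derives the mask from
    Canny edges + graph-based segmentation."""
    h, w = mask.shape
    selected = []
    for r in range(h // tile_size):
        for c in range(w // tile_size):
            tile = mask[r*tile_size:(r+1)*tile_size,
                        c*tile_size:(c+1)*tile_size]
            if tile.mean() >= content_threshold:
                selected.append((r, c))
    return selected

# Toy example: an 8x8 mask tiled into 4x4 tiles; only the top-left
# quadrant is tissue, so only tile (0, 0) passes a 50% threshold.
mask = np.zeros((8, 8), dtype=bool)
mask[:4, :4] = True
print(select_tiles(mask, 4, 0.5))  # [(0, 0)]
```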

Of note, tile generation can be performed at the native resolution of the WSI, but downsampling factors can also be specified to generate tiles at lower resolutions. Additionally, edge detection and mask generation can also be performed on downsampled versions of WSIs, reducing segmentation runtimes (S1 Fig, S1 Text). A segmentation overview image is generated at the end of the segmentation procedure for the user to visually inspect the selected tiles (Fig 1D). With the set of parameters available in PyHIST (S2 Text), the user can specify regions to ignore when performing the masking and segmentation (S2 Fig), and has fine-grained control over specific use cases.
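Extracting tiles defined on a downsampled grid requires mapping grid coordinates back to native resolution, since OpenSlide's read_region takes its top-left corner in level-0 (native) coordinates. A minimal sketch of that bookkeeping; the helper is hypothetical, not PyHIST's API:

```python
def tile_region_at_level(row, col, tile_size, downsample):
    """Map a (row, col) tile of a grid defined at an integer `downsample`
    factor back to level-0 pixel coordinates. Hypothetical helper for
    illustration: OpenSlide's read_region expects the top-left corner
    in level-0 space regardless of the pyramid level being read."""
    x0 = col * tile_size * downsample
    y0 = row * tile_size * downsample
    return (x0, y0), (tile_size, tile_size)

# Tile (2, 3) of a 512-px grid at 4x downsampling starts at
# level-0 coordinates (3*512*4, 2*512*4) = (6144, 4096).
print(tile_region_at_level(2, 3, 512, 4))
```

With an openslide.OpenSlide object `slide`, the returned origin would then be passed as `slide.read_region((x0, y0), level, (tile_size, tile_size))`.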

By default, PyHIST uses the graph-based segmentation method described previously due to its robustness in detecting tissue foreground in WSIs that do not have a homogeneous composition. However, alternative tile-generation methods based on thresholding, which tend to work well on homogeneous WSIs, are also implemented (S3–S5 Figs; see S1 Text for details and benchmarking information). PyHIST also has a random tile sampling mode for those applications that do not necessarily need to distinguish the background from the foreground. In this mode, tiles of a user-specified size and resolution are extracted from random starting positions in the WSI.
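The random sampling mode can be approximated as drawing tile origins uniformly within the slide bounds; this is an illustrative sketch, not PyHIST's exact sampler:

```python
import random

def sample_tile_origins(width, height, tile_size, n, seed=None):
    """Sample n random top-left corners for tiles that fit fully inside
    a width x height slide. Sketch of a random-sampling mode; PyHIST's
    actual implementation may differ."""
    rng = random.Random(seed)
    return [(rng.randrange(0, width - tile_size + 1),
             rng.randrange(0, height - tile_size + 1))
            for _ in range(n)]

origins = sample_tile_origins(10_000, 8_000, 512, n=5, seed=0)
# Every sampled tile lies entirely within the slide.
assert all(x + 512 <= 10_000 and y + 512 <= 8_000 for x, y in origins)
```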

Results

To demonstrate how PyHIST can be used to preprocess WSIs for an ML application, we generated a use case example with the goal of building a tile-level classifier that determines the cancer-affected tissue of origin based on the histological patterns encoded in the tiles. To this end, we first retrieved a total of 36 publicly available WSIs, six from each of the following human tissues hosted in The Cancer Genome Atlas (TCGA) [23]: Brain (glioblastoma), Breast (infiltrating ductal carcinoma), Colon (adenocarcinoma), Kidney (clear cell carcinoma), Liver (hepatocellular carcinoma), and Skin (malignant melanoma). Slides within each tissue have the same cancer primary diagnosis as established by TCGA. Second, these WSIs were preprocessed with PyHIST, generating a total of 7,163 tiles of 512×512 pixels. These tiles were then partitioned into training and test sets (constraining all the tiles of a given WSI to be in only one of the two sets), and we then fit a deep convolutional neural network model over these tiles with weighted sampling at training time (S6 Fig), achieving a classification accuracy of 95% (Fig 2A, S1 Table, S2 Table; see S3 Text for data preparation and model details, and a detailed assessment of Fig 2A).
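Weighted sampling at training time typically means drawing tiles with probability inversely proportional to their class frequency; a minimal sketch of such weights (the paper's exact training setup is described in S3 Text), which would be the usual input to, e.g., PyTorch's WeightedRandomSampler:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-sample weights inversely proportional to class frequency, so
    that every class is drawn roughly equally often within an epoch.
    Illustrative sketch, not the paper's exact code."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

# Imbalanced toy set: three "brain" tiles, one "liver" tile.
labels = ["brain"] * 3 + ["liver"]
weights = inverse_frequency_weights(labels)
# Each class's total weight is equal (3 * 1/3 == 1 * 1/1), so sampling
# with these weights balances the classes in expectation.
print(weights)
```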

Fig 2. TCGA use case.


(a) Examples of the top 5 most accurately predicted tiles per cancer-affected tissue (rows) from the TCGA use case test set. The label above each tile shows the predicted cancer-affected tissue type (GB: glioblastoma, DC: infiltrating ductal carcinoma, AC: adenocarcinoma, CC: clear cell carcinoma, HC: hepatocellular carcinoma, MM: malignant melanoma), followed by the probability of the ground truth label. All of these tiles were correctly classified. (b) Dimensionality reduction of TCGA tiles. t-SNE performed with the feature vectors of each tile that were derived from the deep learning classifier model. Each dot corresponds to an image tile.

We also inspected the feature vectors generated by the deep learning model: for each tile, we retrieved the features corresponding to the linear layer of the last (fully connected) sequential container of the model, and performed dimensionality reduction (t-SNE) over the stacked matrix of these vectors. From here, we infer that the learned features recapitulate tissue morphology since tile clusters corresponding to each tissue are formed (Fig 2B, S7 Fig). We note that this classifier is only an exercise to show end-users how to quickly prepare WSI data using PyHIST to generate tiles, reducing the overhead to start performing downstream analyses: further tuning of the model with more data is desirable to ensure that the classifier is robust enough to generalize to different types of unseen WSIs for a real application.
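A lightweight way to probe whether feature vectors cluster by tissue is to compare within-tissue and between-tissue cosine similarity. This is shown here on synthetic data purely as an illustration; the paper's analysis used t-SNE on the classifier's penultimate-layer features:

```python
import numpy as np

def within_between_similarity(features, labels):
    """Mean cosine similarity of feature vectors within the same tissue
    vs. across tissues. If learned features recapitulate tissue
    morphology, within-tissue similarity should exceed between-tissue.
    Illustrative check, not the paper's analysis."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    return sim[same & off_diag].mean(), sim[~same].mean()

# Synthetic "tissues": two tight clusters around orthogonal directions.
rng = np.random.default_rng(0)
a = rng.normal([5, 0], 0.1, size=(20, 2))
b = rng.normal([0, 5], 0.1, size=(20, 2))
feats = np.vstack([a, b])
within, between = within_between_similarity(feats, [0]*20 + [1]*20)
assert within > between  # clusters are recovered by the features
```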

Availability and future directions

The example use case described above is documented and fully available at https://pyhist.readthedocs.io/en/latest/testcase/, and divided into three Jupyter notebooks: 1) Data preprocessing with PyHIST, 2) Constructing a deep learning tissue classifier, and 3) Dimensionality reduction. The TCGA WSIs in the use case were downloaded from the Genomic Data Commons (GDC) repository (https://gdc.cancer.gov/) using the GDC Data-transfer tool (https://gdc.cancer.gov/access-data/gdc-data-transfer-tool).

PyHIST is a generic tool to segment histological images automatically: it allows for easy and rapid WSI cleaning and preprocessing with minimal effort to generate image tiles geared towards usage in ML analyses. The tool is available at https://github.com/manuel-munoz-aguirre/PyHIST and released under a GPL license. Updated documentation and a tutorial can be found at https://pyhist.readthedocs.io/. PyHIST is highly customizable, enabling the user to tune the segmentation process in order to suit the needs of any particular application that relies on histological image tiles. The software and all of its dependencies have been packaged in a Docker image, ensuring portability across different systems. PyHIST can also be used locally within a regular computing environment with minimal requirements. Future directions and improvements include adding support for more histological image formats and features to save tiles into specialized data structures, as well as the inclusion of a graphical user interface to ease the learning curve for users who are new to the field of image processing for ML analyses. Finally, PyHIST is open source software: all the code and reproducible notebooks for the example use case are available in GitHub and will continue to be improved based on user feedback.

Supporting information

S1 Text. PyHIST overview.

General description of the pipeline: supported file formats, tile generation methods, and execution times.

(PDF)

S2 Text. Parameter description.

Description of supported arguments in PyHIST.

(PDF)

S3 Text. TCGA tissue classification use case.

Description of data preprocessing, model training and analysis for the TCGA tissue classification use case.

(PDF)

S1 Fig. WSI scaling steps in PyHIST.

(a) WSI at its original resolution (1x). (b) The mask can be generated and processed at a given downsampling factor. A smaller resolution will lead to a faster segmentation. (c) The output can be requested at a given downsampling factor. (d) The segmentation overview image can also be generated at a given downsampling factor. The dimensions in all steps are matched to ensure that the tile sizes and grid are consistent. The downsampling choices for all the steps are independent of each other.

(PNG)

S2 Fig. Image in graph-based segmentation test mode.

Test mode allows the user to see what the image mask will look like with the chosen segmentation parameters and tile dimension configuration, before proceeding to generate the individual tile files. The black border defines the region of exclusion for tissue content placed within the edges of the slide (see the --borders and --corners arguments, and section 2.2 in S2 Text).

(PNG)

S3 Fig. Comparison of mask generation methods.

(a) Adipose tissue WSI from the GTEx project, from sample GTEX-111CU-1826. Thresholding-based masks (b-d) are generated by first converting (a) into grayscale and then applying the corresponding thresholding method. Note that simple thresholding is shown here for completeness but only Otsu and adaptive are implemented in PyHIST due to their overall better performance when compared to simple thresholding. In the graph-based method, an image with highlighted edges is first generated through a Canny edge detector (e, left) and then the connected components are labeled through graph-based segmentation (e, right).

(PNG)

S4 Fig. Runtime benchmarks for random sampling and graph-based segmentation.

(a) Execution time to perform random sampling (y-axis) of a varying number of tiles (x-axis) at different downsampling factors for the WSI shown in S1 Fig. For each combination of number of tiles and downsampling factor, the sampling was repeated 30 times. Each dot represents the average running time across the 30 runs, while the interval shows the range between the maximal and minimal running time. (b) Execution time to perform random sampling of 1000 tiles (y-axis) at different tile dimensions (x-axis) at different downsampling factors for the same WSI in (a). Each combination was repeated 50 times, with each dot showing the average runtime. (c) Segmentation runtime of 50 Stomach WSIs from the GTEx project, at different downsampling factors, at a tile size of 256x256. Each dot represents the average execution time. Each interval shows the range between the fastest and slowest segmentations, while the labels show the dimensions of the corresponding WSIs. (d) Segmentation runtime (y-axis) at 1x resolution for the 50 Stomach WSIs, with respect to the number of pixels in the WSI (x-axis).

(PNG)

S5 Fig. Runtime comparison of mask-generating methods.

Tile extraction was evaluated for the three different methods at four different settings of tile size. Each method + tile size combination was repeated ten times to show runtime variability.

(PNG)

S6 Fig. Tile distribution per class in a training epoch in the TCGA example use case.

Within each training epoch, weighted random sampling is performed to create batches with a fair distribution of tiles among the classes. Even if the sample sizes in the training dataset are different among the classes, the balance in the number of tiles per epoch is obtained through data augmentation.

(PNG)

S7 Fig. Correlation matrix of TCGA tiles based on their feature vectors.

Heatmap of Pearson’s correlation matrix between the feature vectors obtained for each TCGA tile. Rows and columns are reordered with hierarchical agglomerative clustering.

(PNG)

S1 Table. Tile distribution across classes in the TCGA use case training and test sets.

(PNG)

S2 Table. Confusion matrix for the tiles in the test set of the TCGA use case.

(PNG)

Acknowledgments

We acknowledge Kaiser Co and Valentin Wucher for testing PyHIST, and the colleagues at the lab for useful feedback; Ferran Marqués, Verónica Vilaplana and Marc Combalia for useful discussions about image processing. All authors acknowledge the support of the Spanish Ministry of Science, Innovation and Universities to the EMBL partnership, the Centro de Excelencia Severo Ochoa, and the CERCA Programme / Generalitat de Catalunya.

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

The authors received no specific funding for this work. M.M.-A. performs his research with support of pre-doctoral fellowship FPU15/03635 from Ministerio de Educación, Cultura y Deporte (URL: http://www.mecd.gob.es/). Support from Agencia Estatal de Investigación (AEI) and FEDER under project PGC2018-094017-B-I00 is also acknowledged. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Abels E, Pantanowitz L, Aeffner F, Zarella MD, van der Laak J, Bui MM, et al. Computational pathology definitions, best practices, and recommendations for regulatory guidance: a white paper from the Digital Pathology Association. J Pathol. 2019;249: 286–294. doi: 10.1002/path.5331
2. Parwani AV. Next generation diagnostic pathology: use of digital pathology and artificial intelligence tools to augment a pathological diagnosis. Diagn Pathol. 2019;14: 138. doi: 10.1186/s13000-019-0921-2
3. Mulrane L, Rexhepaj E, Penney S, Callanan JJ, Gallagher WM. Automated image analysis in histopathology: a valuable tool in medical diagnostics. Expert Rev Mol Diagn. 2008;8: 707–725. doi: 10.1586/14737159.8.6.707
4. Niazi MKK, Parwani AV, Gurcan MN. Digital pathology and artificial intelligence. Lancet Oncol. 2019;20: e253–e261. doi: 10.1016/S1470-2045(19)30154-8
5. Ehteshami Bejnordi B, Veta M, Johannes van Diest P, van Ginneken B, Karssemeijer N, Litjens G, et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA. 2017;318: 2199–2210. doi: 10.1001/jama.2017.14585
6. Nagpal K, Foote D, Liu Y, Chen P-HC, Wulczyn E, Tan F, et al. Development and validation of a deep learning algorithm for improving Gleason scoring of prostate cancer. npj Digital Med. 2019;2: 48. doi: 10.1038/s41746-019-0112-2
7. Qaiser T, Tsang Y-W, Taniyama D, Sakamoto N, Nakane K, Epstein D, et al. Fast and accurate tumor segmentation of histology images using persistent homology and deep convolutional features. Med Image Anal. 2019;55: 1–14. doi: 10.1016/j.media.2019.03.014
8. Hou L, Nguyen V, Kanevsky AB, Samaras D, Kurc TM, Zhao T, et al. Sparse autoencoder for unsupervised nucleus detection and representation in histopathology images. Pattern Recognit. 2019;86: 188–200. doi: 10.1016/j.patcog.2018.09.007
9. Zhang Z, Chen P, McGough M, Xing F, Wang C, Bui M, et al. Pathologist-level interpretable whole-slide cancer diagnosis with deep learning. Nat Mach Intell. 2019;1: 236–245. doi: 10.1038/s42256-019-0052-1
10. Coudray N, Ocampo PS, Sakellaropoulos T, Narula N, Snuderl M, Fenyö D, et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat Med. 2018;24: 1559–1567. doi: 10.1038/s41591-018-0177-5
11. Banna GL, Olivier T, Rundo F, Malapelle U, Fraggetta F, Libra M, et al. The promise of digital biopsy for the prediction of tumor molecular features and clinical outcomes associated with immunotherapy. Front Med (Lausanne). 2019;6: 172. doi: 10.3389/fmed.2019.00172
12. Madabhushi A, Lee G. Image analysis and machine learning in digital pathology: Challenges and opportunities. Med Image Anal. 2016;33: 170–175. doi: 10.1016/j.media.2016.06.037
13. Serag A, Ion-Margineanu A, Qureshi H, McMillan R, Saint Martin M-J, Diamond J, et al. Translational AI and deep learning in diagnostic pathology. Front Med (Lausanne). 2019;6: 185. doi: 10.3389/fmed.2019.00185
14. Srinidhi CL, Ciga O, Martel AL. Deep neural network models for computational histopathology: A survey. arXiv. 2019.
15. Gertych A, Swiderska-Chadaj Z, Ma Z, Ing N, Markiewicz T, Cierniak S, et al. Convolutional neural networks can accurately distinguish four histologic growth patterns of lung adenocarcinoma in digital slides. Sci Rep. 2019;9: 1483. doi: 10.1038/s41598-018-37638-9
16. Stritt M, Stalder AK, Vezzali E. Orbit Image Analysis: An open-source whole slide image analysis tool. PLoS Comput Biol. 2020;16: e1007313. doi: 10.1371/journal.pcbi.1007313
17. Bankhead P, Loughrey MB, Fernández JA, Dombrowski Y, McArt DG, Dunne PD, et al. QuPath: Open source software for digital pathology image analysis. Sci Rep. 2017;7: 16878. doi: 10.1038/s41598-017-17204-5
18. Marée R, Rollus L, Stévens B, Hoyoux R, Louppe G, Vandaele R, et al. Collaborative analysis of multi-gigapixel imaging data using Cytomine. Bioinformatics. 2016;32: 1395–1401. doi: 10.1093/bioinformatics/btw013
19. Gupta R, Kurc T, Sharma A, Almeida JS, Saltz J. The emergence of pathomics. Curr Pathobiol Rep. 2019;7: 73–84. doi: 10.1007/s40139-019-00200-x
20. Aguet F, Barbeira AN, Bonazzola R, Brown A, Castel SE, Jo B, et al. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369: 1318–1330. doi: 10.1126/science.aaz1776
21. Goode A, Gilbert B, Harkes J, Jukic D, Satyanarayanan M. OpenSlide: A vendor-neutral software foundation for digital pathology. J Pathol Inform. 2013;4: 27. doi: 10.4103/2153-3539.119005
22. Felzenszwalb PF, Huttenlocher DP. Efficient Graph-Based Image Segmentation. Int J Comput Vis. 2004;59: 167–181. doi: 10.1023/B:VISI.0000022288.19776.77
23. Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45: 1113–1120. doi: 10.1038/ng.2764
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008349.r001

Decision Letter 0

Dina Schneidman-Duhovny

17 Jul 2020

Dear Muñoz-Aguirre,

Thank you very much for submitting your manuscript "PyHIST: A Histological Image Segmentation Tool" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Dina Schneidman-Duhovny

Software Editor

PLOS Computational Biology


***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The paper introduces a simple and easy to use open-source tool for slide tiling, which is useful for histopathology image analysis. I have to acknowledge there are currently inadequate tools available that make tiling of digital slides simple, and I had to write my own scripts to tile my slides when analyzing whole slide images. Overall, my experience with the application was positive; I was able to create tiles from svs and TIFF files in a short amount of time, with a short learning curve.

The setup and installation was relatively easy; the Docker version of the software worked without problems on Ubuntu and Windows 10 based machines. A few problems were encountered when the program was installed through Anaconda using the program’s accompanying installation instructions: in the Ubuntu machine, ‘cv2’ was reported to be missing; in Windows 10, several libraries were missing. I was able to get the program running in Ubuntu and Anaconda after manually installing OpenCV. I did not attempt to manually fix the Anaconda installation in Windows due to the large number of missing libraries. I am not sure if this is due to the unique setup of my computer, or the program was not tested with Windows 10 and Anaconda.

Suggested Correction:

Lines 22 - 23 states “Histopathological images are routinely used in the diagnosis of many diseases, notably cancer.” This can be misinterpreted as saying that pathologists make their diagnoses predominantly through whole slide images (WSIs). Although WSI is becoming more widespread in pathology departments, most pathologists still render their diagnoses by examining glass slides under a microscope. This statement has to be corrected/modified to reflect that whole slide images are still not being used by majority of pathologists to sign out their cases, although there is an increasing adoption of whole slide scanning technologies in pathology departments.

Future direction:

There is more potential in this software, which can accommodate additional features in the future while retaining its simplicity. Aside from adding new features, I believe adding a Graphical user interface (GUI) version of the program would increase the application’s user base, and be helpful for those who are less computer savvy and have no experience in using the command line.

Reviewer #2: The manuscript submitted by Muñoz-Aguirre and colleagues aims to describe the development of PyHIST, which is a histological image segmentation tool. Overall, this manuscript presents results that would be of interest to the community of scientists and computational biologists concerned with this problem. However, there are major issues in this manuscript that prevent us from recommending that this manuscript be accepted in its current state.

Major:

1) Abstract: highlights that preprocessing enabled by PyHIST involves image scaling, segmentation, and eventually tile extraction to clearly mention the utility of PyHIST.

2) Introduction: The paper correctly addresses the need for standardization of the tiling and patch-creating pipeline for researchers working in this area to prevent dataset-specific biases. Although, as far as saving research time is concerned, currently, WSI preprocessing requires developing custom scripts, but once a process is established researchers can typically use similar code for subsequent tiling for all projects. Therefore, PyHIST may only save a significant amount of time at the initial phase.

3) Facts have been mentioned without references – we mention a few examples below but urge the authors to add extensive references:

- line 22 (citation for the WSI acquisition process required),

- line 23 (citation for use in cancer),

- line 25 (citation to support the claim regarding the development of computational methods for disease diagnosis and classification), and

- line 33 (cite literature to support the claim that histopathological images capture endophenotypes that provide crucial information when correlated with molecular and cellular data).

- In a similar way, kindly provide references at lines 37, 46, and 50.

4) Design and Implementation:

- It is not clear why the authors are interested in highlighting edges within tissue fragments rather than outlining the entire fragment. Figure 1b resembles a grayscaled WSI. A result similar to Figure 1b can be reached with less computation by simply binarizing the WSI using a threshold to separate background from foreground. Does edge detection provide any unique benefits over binarizing the WSI?

- The graph-based segmentation algorithm can perform unsupervised segmentation on complex images, but in this case the algorithm just needs to detect the connected objects. If the input image is a binary mask (foreground and background), there are many simple functions to label contiguous/interconnected objects and produce an output similar to Figure 1c. Is graph-based segmentation used because it works well with edge-detection inputs? How does it compare computationally to other connected-component labeling techniques, such as scikit-image’s measure.label function?

- Why are steps (b) and (c) needed in the PyHIST pipeline in Figure 1? Red gridlines still appear to tile the entire WSI, and then some tiles are not stored based on a background threshold. How are the tissue fragment labels from (c) used?
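As a reading aid for the alternative the reviewer sketches above (not PyHIST’s actual implementation), labeling contiguous objects in a binary mask needs only a generic connected-component pass, which scikit-image’s measure.label performs in optimized code. A minimal pure-Python 4-connectivity sketch on a toy mask, for illustration only:

```python
# Illustrative sketch of connected-component labeling on a binary mask
# (the simple alternative the reviewer describes; not PyHIST's method).
from collections import deque

def label_components(mask):
    """Assign an integer label to each 4-connected foreground region."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    current = 0
    for i in range(h):
        for j in range(w):
            if mask[i][j] and labels[i][j] == 0:
                current += 1  # start a new component; flood-fill it
                labels[i][j] = current
                queue = deque([(i, j)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and labels[ny][nx] == 0:
                            labels[ny][nx] = current
                            queue.append((ny, nx))
    return labels, current

# Toy binary mask with two separate "tissue fragments":
mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
]
labels, n_fragments = label_components(mask)
print(n_fragments)  # → 2
```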

5) Results:

- Details of the deep learning model have not been provided – correctly detected patches have vague histology, as shown in Figure 2A (explained below). We suggest a pathologist review of the deep learning model results. Additionally, the connection between better model accuracy on the dataset and the validity of the pre-processing steps has not been made.

- Partitioning the training and test sets can be the most time-consuming pre-processing step of the ML process. Tiles from the same WSI should be constrained to either the training or the test set. It is difficult to satisfy this constraint while also managing the percentage of tiles in the test set and class imbalances. This process is not a built-in feature of PyHIST, and it is unclear from the paper whether PyHIST assists with this aspect of the ML pipeline at all.

- The deep learning results are an example that tiles processed with PyHIST can achieve high prediction performance, but they do not necessarily prove that it is better than other baseline or competing approaches. WSIs from different parts of the body can be quite distinguishable, so many different tiling approaches could produce similar results. The Results section could include comparisons of performance and computation time for several tiling methodologies. How does PyHIST stack up against other techniques?

6) Availability and future directions:

- The SVS limitation is mentioned here but should also be addressed earlier, in the Introduction or Design sections. For example, “PyHIST is currently limited to the SVS format due to/because…”.

7) Figures:

- Figure 2A: the histology is ambiguous, since the top panel for ‘T-brain’ shows artefactual tissue rather than brain tissue with cell bodies of neurons or glia, etc. The same issue appears in the 3rd, 4th, and 5th (from left) T-breast panels, and the 1st (from left) T-colon panel.

8) Supplementary Materials:

- Section S3: cropping the image tiles is mentioned – what is the size of these crops, and are they kept uniform each time? An explanation is required for the user’s clarity.

- Section S2: the segmentation parameters seem to be an important part of tiling, but it is still unclear how they work. Is this a way of capturing tiles that have background in a certain orientation? How do the parameters for borders and corners interact with the background percentage, and how does this influence segmentation?

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Jerome Cheng

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008349.r003

Decision Letter 1

Dina Schneidman-Duhovny

17 Sep 2020

Dear Muñoz-Aguirre,

We are pleased to inform you that your manuscript 'PyHIST: A Histological Image Segmentation Tool' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Dina Schneidman

Software Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In the revised and significantly improved version of the manuscript, the authors addressed each reviewer's concerns, and all of my previous comments have been satisfactorily addressed. I do not have any new recommendations.

Reviewer #2: The revised manuscript submitted by Muñoz-Aguirre and colleagues extensively addresses the comments raised by the reviewers.

We commend them for adding detailed methods regarding pre-processing, including tile extraction, additional relevant references, and mask comparisons in Supplementary Text S1 and Supplementary Figure S3. Further, the edits made to Figure 2 have made the message clearer, and the authors have done a remarkable job.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Jerome Cheng

Reviewer #2: Yes: Sana Syed

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008349.r004

Acceptance letter

Dina Schneidman-Duhovny

9 Oct 2020

PCOMPBIOL-D-20-00862R1

PyHIST: A Histological Image Segmentation Tool

Dear Dr Muñoz-Aguirre,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Matt Lyles

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Text. PyHIST overview.

    General description of the pipeline: supported file formats, tile generation methods, and execution times.

    (PDF)

    S2 Text. Parameter description.

    Description of supported arguments in PyHIST.

    (PDF)

    S3 Text. TCGA tissue classification use case.

    Description of data preprocessing, model training and analysis for the TCGA tissue classification use case.

    (PDF)

    S1 Fig. WSI scaling steps in PyHIST.

    (a) WSI at its original resolution (1x). (b) The mask can be generated and processed at a given downsampling factor. A smaller resolution will lead to a faster segmentation. (c) The output can be requested at a given downsampling factor. (d) The segmentation overview image can also be generated at a given downsampling factor. The dimensions in all steps are matched to ensure that the tile sizes and grid are consistent. The downsampling choices for all the steps are independent of each other.

    (PNG)

    S2 Fig. Image in graph-based segmentation test mode.

Test mode allows the user to preview what the image mask will look like with the chosen segmentation parameters and tile dimension configuration, before proceeding to generate the individual tile files. The black border defines the region of exclusion for tissue content placed within the edges of the slide (see the --borders and --corners arguments, and section 2.2 in S2 Text).

    (PNG)

    S3 Fig. Comparison of mask generation methods.

    (a) Adipose tissue WSI from the GTEx project, from sample GTEX-111CU-1826. Thresholding-based masks (b-d) are generated by first converting (a) into grayscale and then applying the corresponding thresholding method. Note that simple thresholding is shown here for completeness but only Otsu and adaptive are implemented in PyHIST due to their overall better performance when compared to simple thresholding. In the graph-based method, an image with highlighted edges is first generated through a Canny edge detector (e, left) and then the connected components are labeled through graph-based segmentation (e, right).

    (PNG)
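As an illustration of the thresholding-based masks compared in S3 Fig (a sketch only; PyHIST relies on library implementations of Otsu and adaptive thresholding), Otsu’s method selects the grayscale threshold that maximizes the between-class variance of the intensity histogram:

```python
# Minimal Otsu-threshold sketch on a toy grayscale distribution
# (illustrative only; not PyHIST's actual implementation).
def otsu_threshold(pixels, levels=256):
    """Return the threshold maximizing between-class variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_bg = 0.0   # running intensity sum of the "background" class
    w_bg = 0       # running pixel count of the "background" class
    best_t, best_var = 0, -1.0
    for t in range(levels):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Two well-separated populations: dark tissue (10-12) vs bright background (200).
pixels = [10] * 50 + [12] * 50 + [200] * 100
t = otsu_threshold(pixels)  # threshold falls between the two populations
```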

    S4 Fig. Runtime benchmarks for random sampling and graph-based segmentation.

    (a) Execution time to perform random sampling (y-axis) of a varying number of tiles (x-axis) at different downsampling factors for the WSI shown in S1 Fig. For each combination of number of tiles and downsampling factor, the sampling was repeated 30 times. Each dot represents the average running time across the 30 runs, while the interval shows the range between the maximal and minimal running time. (b) Execution time to perform random sampling of 1000 tiles (y-axis) at different tile dimensions (x-axis) at different downsampling factors for the same WSI in (a). Each combination was repeated 50 times, with each dot showing the average runtime. (c) Segmentation runtime of 50 Stomach WSIs from the GTEx project, at different downsampling factors, at a tile size of 256x256. Each dot represents the average execution time. Each interval shows the range between the fastest and slowest segmentations, while the labels show the dimensions of the corresponding WSIs. (d) Segmentation runtime (y-axis) at 1x resolution for the 50 Stomach WSIs, with respect to the number of pixels in the WSI (x-axis).

    (PNG)

    S5 Fig. Runtime comparison of mask-generating methods.

    Tile extraction was evaluated for the three different methods at four different settings of tile size. Each method + tile size combination was repeated ten times to show runtime variability.

    (PNG)

    S6 Fig. Tile distribution per class in a training epoch in the TCGA example use case.

    Within each training epoch, weighted random sampling is performed to create batches with a fair distribution of tiles among the classes. Even if the sample sizes in the training dataset are different among the classes, the balance in the number of tiles per epoch is obtained through data augmentation.

    (PNG)

    S7 Fig. Correlation matrix of TCGA tiles based on their feature vectors.

    Heatmap of Pearson’s correlation matrix between the feature vectors obtained for each TCGA tile. Rows and columns are reordered with hierarchical agglomerative clustering.

    (PNG)

    S1 Table. Tile distribution across classes in the TCGA use case training and test sets.

    (PNG)

    S2 Table. Confusion matrix for the tiles in the test set of the TCGA use case.

    (PNG)

    Attachment

    Submitted filename: Reviewers_comments_R1.pdf

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting Information files.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS
