Journal of Medical Imaging. 2020 Jul 16;7(4):047501. doi: 10.1117/1.JMI.7.4.047501

Automatic cancer detection on digital histopathology images of mid-gland radical prostatectomy specimens

Wenchao Han a,b,c, Carol Johnson a, Andrew Warner a, Mena Gaed d, Jose A Gomez d, Madeleine Moussa d, Joseph Chin e,f, Stephen Pautler e,f, Glenn Bauman b,c,e, Aaron D Ward a,b,c,e,*
PMCID: PMC7363935  PMID: 32715024

Abstract.

Purpose: Automatic cancer detection on radical prostatectomy (RP) sections facilitates graphical and quantitative surgical pathology reporting, which can potentially benefit postsurgery follow-up care and treatment planning. It can also support imaging validation studies using a histologic reference standard and pathology research studies. This problem is challenging due to the large sizes of digital histopathology whole-mount whole-slide images (WSIs) of RP sections and staining variability across different WSIs.

Approach: We proposed a calibration-free adaptive thresholding algorithm that compensates for staining variability and yields consistent tissue component maps (TCMs) of the nuclei, lumina, and other tissues. We compared three machine learning methods for classifying each cancer versus noncancer region of interest (ROI) throughout each WSI: (1) conventional machine learning with 14 texture features extracted from TCMs, (2) transfer learning with pretrained AlexNet fine-tuned with TCM ROIs, and (3) transfer learning with pretrained AlexNet fine-tuned with raw image ROIs.

Results: The three methods yielded areas under the receiver operating characteristic curve of 0.96, 0.98, and 0.98, respectively, in leave-one-patient-out cross validation using 1.3 million ROIs from 286 mid-gland whole-mount WSIs from 68 patients.

Conclusion: Transfer learning with the use of TCMs demonstrated state-of-the-art overall performance and was more stable with respect to sample size across different tissue types. For the tissue types involving Gleason 5 (most aggressive) cancer, it achieved the best performance among the tested methods. Upon further multicenter validation, this tool could be translated into the clinical workflow to support graphical and quantitative pathology reporting for surgical specimens.

Keywords: prostate cancer detection, whole-slide histopathology imaging, machine learning, transfer learning, tissue component segmentation, radical prostatectomy pathology

1. Introduction

Radical prostatectomy (RP), the removal of the prostate gland, is the most common treatment for organ-confined prostate cancer (PCa) and is performed on 40% of PCa patients annually in the United States.1 Approximately 17% to 29% of patients experience cancer recurrence after surgery, portended by serum prostate-specific antigen relapse.2,3 Adjuvant and salvage therapy, including radiation therapy to the prostate bed, can be life-saving for patients with recurrence, preventing or delaying mortality due to metastatic disease.3

The surgical pathology report provides valuable information for postsurgery prognosis, recurrence risk management, and selection and guidance of adjuvant therapy. The primary and secondary tumors in the specimen are reported in terms of location, volume, and the aggressiveness of differentiation based on the Gleason grading system, which classifies tumors into five grades based on their morphological patterns; Gleason grades 1 to 3 are considered low grade and grades 4 to 5 high grade.3

Advances in pathology have included establishing standards for important report elements and a move toward synoptic summaries.4 However, current clinical pathology reporting is still primarily text-based, qualitative, and subject to interobserver variability, which leads to challenges in the quantitative and repeatable interpretation of lesion size, location, and spread. Methods of measuring and reporting tumor volume vary, and no generally accepted standard has been established.5 Large interobserver variability has been reported for identifying extraprostatic extension (EPE) (i.e., where the tumor extends outside of the prostate into the surrounding region) without anatomic landmarks.6 In addition, although assessing the degree of differentiation of the tumor using the Gleason score7 (i.e., assigning a total Gleason score as the sum of the Gleason grades of the primary and secondary tumors) has been established for decades, reporting the total Gleason score remains problematic. For cases with multiple nodules, reporting the overall score may underestimate the tumor aggressiveness.8

Currently, pathology reports include or seek to incorporate accurate and detailed information to maximize clinical utility.9,10 Bettendorf et al.9 proposed a hand-drawn tumor map of the prostatectomy specimen for pathology reporting, which provides a way to visually estimate tumor size and location. To quantify the hand-drawn loci on the anatomical maps of prostatectomy specimens, Eminaga et al.10 developed an extensible markup language-based document architecture. However, this method requires substantial effort from the pathologist, with potential to slow the workflow and increase fatigue. Also, the quantitative reporting based on these approaches is derived from approximate hand-drawn representations. Thus, the clinical challenges remain. Annotating each cancerous region of interest (ROI) at high resolution (20× or higher) on whole-mount RP sections enables quantitative reporting, giving measurements of tumor size, location, and grade. This would resolve the clinical challenges mentioned above and would also benefit research studies investigating the relationship between quantitative pathologic parameters, such as tumor volume, and clinical outcomes. Moreover, PCa is challenging to detect and localize on imaging, such as multiparametric magnetic resonance imaging. This has motivated the undertaking of imaging validation studies that use annotated prostatectomy histology as the reference standard.11,12 Adequate target delineation of the tumor through such studies can potentially improve disease control by allowing safe boosting of radiation dose (or targeting of other ablative therapies) to corresponding areas of the prostate bed and reduction of side effects by reducing the treatment margins.13

The time required to conduct such contouring manually precludes its use within a clinical pathology workflow and adds substantial time and expense to imaging validation studies. Therefore, there is an unmet need for a software tool that can provide accurate and fast automatic contouring of cancerous regions on whole-mount digital histopathology images of RP specimens.

Distinguishing cancer from noncancer tissue on histology is challenging since their appearance is visually similar [e.g., in Fig. 1, Gleason grade 3 (G3), small gland G4, and benign prostatic hyperplasia (BPH) (noncancer) are visually similar]. Previous studies have demonstrated the potential for computational approaches to address this problem (see the recent review article14). Most studies involved training and validation on a small subset (typically hundreds) of selected ROIs. In comparison, detecting cancerous regions throughout entire whole-mount whole-slide images (WSIs) of RP sections from different patients is more challenging for three primary reasons. First, the high resolution of digital histopathology images leads to a large number of pixels in each WSI (e.g., a WSI of a mid-gland prostate section often includes >4 gigapixels), requiring an efficient approach. Second, there is substantial heterogeneity of appearance in cancerous and noncancerous prostate tissue. Noncancerous tissue includes normal parenchyma, high-grade prostatic intraepithelial neoplasia (PIN), cancer-mimicking atrophy, and BPH, all of which have different appearances and some of which (e.g., PIN) share many common features with cancerous tissue. For cancerous tissue, there are different morphological patterns across different Gleason grades (e.g., different appearances for cancer samples G3 and G4 and noncancer samples benign, PIN, and BPH in Fig. 1). Even within the same grade, the tissue appearance can be quite heterogeneous.3 For instance, Gleason grade 4 has several subtypes with different morphological patterns (e.g., small glands, large cribriform glands, mucinous glands, poorly formed glands, etc.) (Fig. 1). Testing a system on all of the tissue throughout the WSIs thus introduces more heterogeneity into the samples, resulting in a more challenging classification problem. Third, staining variability is a substantial issue, with contributing factors such as slide preparation and batch effects.15,16 Although many color normalization methods have been proposed,17–19 color normalization of all images to a template did not reduce the instability of texture features and therefore has not completely resolved the problem.15 Other challenges arise from factors including the presence of tissue marking dye due to the inking of the specimen surfaces as part of pathology processing, artifacts (e.g., tissue folding, out-of-focus regions, and tissue tearing), the presence of red blood cells (RBCs), and within-ROI heterogeneity (i.e., ROIs containing both cancerous and noncancerous tissue). Validation using all available tissue avoids bias from ROI selection and tests the system against the full variability in staining and tissue appearance.

Fig. 1. Tissue samples: (a) cancer (G3) and noncancer tissue samples (benign, PIN, and BPH) and (b) subtypes of cancer (G4).

This paper presents a system for cancer detection and localization on WSIs of mid-gland RP sections, yielding state-of-the-art accuracies using three different methods with fast processing times. To the best of our knowledge, this is the first study to develop and validate a system for detecting cancerous regions on whole-mount WSIs using all tissue throughout the slides, including all clinically relevant grade groups, with a practical processing time. Also, we have validated the system on the largest reported annotated data set (i.e., we used the largest amount of tissue from the largest set of whole-mount tissue sections), with highly detailed reference standard contours provided by a pathologist. This is very important for translating the technology toward clinical application. Although many previous studies have demonstrated the capability of machine learning methods for distinguishing cancerous from noncancerous tissue, most used preselected ROIs or biopsy WSIs, which contain a much smaller amount of tissue than whole-mount prostatectomy WSIs. Because PCa is highly heterogeneous, such studies do not demonstrate that machine learning methods can be implemented on whole-mount prostatectomy WSIs from large data sets for cancer detection for clinical translation. Only two studies20,21 used materials from whole-mount prostatectomy WSIs without preselected ROIs, but they reported a limitation of their methods regarding detection of high-grade cancer tissues, which are of high importance in clinical practice.

In addition, our study provides the first direct comparison of cancer detection performance using conventional and deep learning approaches. We also found that the three-class tissue components provide the major cues for cancer detection and yield better performance for detecting high-grade cancer when used with pretrained AlexNet.

The primary contributions of this paper are as follows.

1. A calibration-free adaptive thresholding algorithm for fast nuclei segmentation, which yields consistent three-class tissue component maps (TCMs) despite staining variability across WSIs, for PCa detection on whole-mount WSIs.

2. Validation of a conventional machine learning approach with three classifiers and a deep learning approach using raw images and TCMs as inputs, for cancer detection on 286 WSIs of mid-gland RP sections from 68 patients (4 to 5 WSIs/patient). Cross validation (CV) was performed with data grouped on a per-patient basis (i.e., tissues from a single patient never appeared in both the training and testing sets).

3. Direct comparisons of the five machine learning methods using our proposed pipeline for PCa detection on whole-mount WSIs.

4. Identification of 14 TCM-based texture features for effective cancer detection with the conventional machine learning approach, yielding better-than-state-of-the-art performance compared with other non-deep-learning methods.

5. Achievement of the best overall performance across all methods using the deep learning approaches; the use of the three-class TCM demonstrated more robust performance across the different types of prostate tissue.

2. Related Work

Substantial work has been published on the problem of PCa detection on digital histology images of hematoxylin and eosin (H&E)-stained specimens.14 Most previous research has focused on the classification of cancerous versus noncancerous tissue using preselected ROIs, with a few studies focused on cancer identification and localization on whole slides (biopsy and RP).

Several studies22–30 performed cancer versus noncancer classification using preselected ROIs. Their overall positive results are valuable in that they point to the potential utility of the methods employed. However, the total amount of tissue processed in each study was <4 cm2, which is about half the size of a single WSI of a mid-gland tissue section. Considering the heterogeneity of PCa tissue and the staining variability among WSIs, the limited extent of the processed tissue points to the need for a study with more comprehensive validation throughout entire WSIs. Also, these studies generally did not prioritize computational cost or scalability to the larger tissue samples that would be encountered in clinical practice.

Several proposed approaches have demonstrated the potential for detection and localization of cancerous regions on WSIs of biopsy tissues. Doyle et al.31 used a multiresolution approach to detect PCa with a boosted Bayesian classifier and reported areas under the receiver operating characteristic curve (AUCs) of 0.84, 0.83, and 0.76 at the lowest, intermediate, and highest resolution levels, respectively, on 100 biopsy WSIs from 58 patients, with an average processing time of approximately 3 min per 1000×1000 pixel region. Litjens et al.32 used a super-pixel-based approach and reported an AUC of 0.96 at the per-slide level with a sensitivity of 1.0 and a specificity of 0.4, and an AUC of 0.92 at the super-pixel patch level, using 10-fold CV (data were stratified on a per-patient basis) on 204 WSIs of biopsy tissues from 163 patients. Approximately 4 min of processing time per WSI was reported. Their more recent work33 used a convolutional neural network and reported AUCs of 0.99 (90th percentile) and 0.98 (median); the processing time was not reported. Overall, it is not clear that these systems would scale up to the data sizes involved in RP specimens; for instance, a computation time of 4 min per WSI of biopsy tissue32 would imply approximately 17 days of computation per mid-gland RP WSI. Recently, Campanella et al.34 demonstrated that a weakly supervised approach using multiple instance learning was able to classify prostate biopsy slides, at the slide level, as containing cancer (or not) with AUCs of 0.99, 0.96, and 0.93, testing on 1784, 1274, and 12,727 biopsy WSIs, respectively. Those data sets were from the same scanner, a different scanner, and different institutes, respectively. Their work showed that when the classification of interest is at the whole-slide level, curated data sets with detailed ground-truth contours are not necessary for training.

Several studies reported system designs for detecting PCa on WSIs of prostate tissue sections. Monaco et al.20 and Rashid et al.21 classified each individual gland as malignant or benign by extracting gland features. Monaco et al. reported a sensitivity and specificity of 0.87 and 0.90, respectively, using 40 RP tissue sections from 20 patients at low resolution (8 μm/pixel), with an average processing time of 2.75 min per whole-mount tissue section. Rashid et al. reported a sensitivity of 90%, a specificity of 93%, and an accuracy of 93%, validating on 20 WSIs from 11 patients with a training data set of 50 WSIs from 19 patients, without reporting system processing time. Both works reported the limitation that their systems were unable to detect poorly differentiated cancer due to the methods' dependence on gland classification. This is an issue for Gleason grade 5 (G5) cancer, in which the glands are disrupted. Their data sets included no G5-inclusive tissue samples (i.e., pure G5 or mixed grades including G5, such as G4+5). However, the presence of G5 cancer is highly negatively prognostic,35 and it is therefore critical to detect G5 cancer.

DiFranco et al.36 used a tile-based approach, classifying each 512×512 pixel tile of the WSI from RP specimens, and reported an AUC of 0.95 using 15 WSIs from 14 different patients. Texture features were calculated in different color channels of the RGB and CIE L*a*b* color spaces. The experiments were conducted on a 16-core server; computation time was not reported, but the need for parallel computing was described.

Nguyen et al.37 reported a false positive rate (FPR) of 6% with 78% sensitivity, incorporating nucleus-related cytology features with texture features, using 6 images (approximate size of 4000×7000 pixels) for training and 11 WSIs (approximate size of 5000×23,000 pixels) for testing, without reporting the processing time.

Nir et al.38 used multiple pathologists' annotations with hand-crafted features and reported an AUC of 0.85 with an accuracy of 90.5%, a sensitivity of 91.5%, and a specificity of 85.2% at the optimal operating point. They tested on 333 tissue microarray (TMA) cores sampled from 231 RP specimens for classifying TMA patches as cancerous versus noncancerous. The validation was conducted using leave-one-patient-out (LOPO) CV, with a processing time of approximately 14 h for feature extraction for the 333 TMAs and approximately 4.5 h for 231 LOPO simulations for training and classification using parallel computing on two 12-core computers. They reported an overall AUC of 0.75 when testing their system, trained using all of the TMAs, on an external data set of 230 WSIs of RP sections from 56 patients using all of the cancerous tissue and 10% of the noncancerous tissue selected by random sampling.

To the best of our knowledge, no system for PCa detection has been reported and validated for performance throughout all tissue on RP WSIs, including all clinically relevant grade groups, with a practical processing time.

3. Methods

Figure 2 shows a block diagram depicting the training of conventional and deep learning-based classifiers for automatic cancer detection on WSIs of RP sections, building on our previously reported prototype system.39 In training, cancerous and noncancerous ROIs were determined using gold-standard histopathology annotations. We computed a three-class TCM for each ROI using our proposed segmentation method. The three machine learning approaches used in our study were as follows. (1) We computed 14 selected first- and second-order statistical features40,41 from TCMs and trained three classifiers [a Fisher linear discriminant classifier (FisherC), a logistic linear classifier (LoglC), and a support vector machine (SVM) classifier]. (2) Pretrained AlexNet42 (pretrained on the ImageNet database43) was fine-tuned using the TCMs. (3) Pretrained AlexNet was fine-tuned using raw image ROIs without any image processing. During testing, the trained system classified all of the ROIs covering each WSI as cancerous or noncancerous. We validated the classification results against gold standard histopathology annotations using CV, with data grouped on a per-patient basis, ensuring that samples from the same patient never appeared in both the training and testing sets in any fold. We performed the validation using each of the trained systems, and the results were collected for analysis and comparison. Our implementation used MATLAB 2018a (The MathWorks, Natick, MA), including for AlexNet; OpenCV 3.1 for the SVM implementation; and PRTools 5.0 (Delft Pattern Recognition Research, Delft, The Netherlands) for the implementation of FisherC and LoglC. The code for tune-AlexNet-TCM is publicly available in the GitHub repository: https://github.com/WenchaoHanSRI/WenchaoProstate.

Fig. 2. Method overview for system training using three different machine learning methods. (a) WSIs with our expert annotations. Different colored annotations represent different types of tissue based on the Gleason grading system. (b), (c) Zoomed views from the black square regions in (a) and (b), respectively.

3.1. Data

3.1.1. Manual annotation

This study was approved by our institutional Human Subjects Research Ethics Board with informed consent of all patients. We obtained 299 WSIs of H&E-stained, 4-μm-thick, paraffin-embedded mid-gland tissue sections from 71 RP specimens from patients with biopsy-confirmed PCa (clinical stage T1 or T2). All tissue sections were prepared and scanned in our hospital pathology laboratory following the same protocol.28 Two different scanners were used: an Aperio ScanScope GL (Leica Biosystems, Wetzlar, Germany) for sections from 46 patients and an Aperio ScanScope AT Turbo (Leica Biosystems, Wetzlar, Germany) for sections from the other 25 patients. Sections were scanned at 20× magnification (0.5 μm/pixel) in BigTIFF pyramid format without compression. The resulting 24-bit RGB color images have a pixel size of 0.5 μm/pixel.

Each WSI was contoured and graded by a trained physician (Gaed) at 20× magnification using different colors for different Gleason grades and verified by one of two genitourinary pathologists (Moussa or Gomez). Contouring was performed using a Cintiq 12WX pen-enabled display (Wacom Co. Ltd., Saitama, Japan) with the ScanScope ImageScope v11.0.2.725 image viewing software (Aperio Technologies, Vista, CA, USA). Contouring was conducted at high precision [Figs. 2(b) and 2(c)], taking about 70 h per case. Where Gleason grades were intermingled to a degree at which they could not be readily separated, foci were given a grade such as 4+3, indicating that the majority of the focus contained grade 4 cancer and the remainder contained grade 3 cancer.

3.1.2. Ground truth ROI labeling

Each WSI was separated into a set of square 960×960  pixel ROIs. ROIs containing at least 50% cancerous tissue according to the manual pathology annotations were considered cancerous; all other ROIs were considered noncancerous.
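To make the tiling and labeling rule concrete, the following minimal Python sketch (the published implementation was in MATLAB; the function and variable names here are ours, not from the paper) tiles a binary cancer-annotation mask into 960×960 pixel ROIs and applies the 50% rule:

```python
import numpy as np

def label_rois(cancer_mask, roi_size=960, min_fraction=0.5):
    """Tile a WSI-level cancer-annotation mask into square ROIs and label each.

    cancer_mask: boolean array (H, W), True where the pathologist marked cancer.
    Returns a list of ((row, col), is_cancerous) tuples, one per full ROI.
    """
    labels = []
    h, w = cancer_mask.shape
    for r in range(0, h - roi_size + 1, roi_size):
        for c in range(0, w - roi_size + 1, roi_size):
            tile = cancer_mask[r:r + roi_size, c:c + roi_size]
            # Cancerous if at least 50% of the ROI area is annotated cancer.
            labels.append(((r, c), tile.mean() >= min_fraction))
    return labels
```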

3.1.3. Data separation for system tuning and feature selection

We performed classifier hyper-parameter tuning and feature selection on a separate tuning data set comprising 13 WSIs from 3 different patients. The patients in the tuning data set were not used for CV; only the 68 remaining patients were used for CV. The tuning and CV data sets each contained a mixture of WSIs from both scanners.

3.2. Tissue Component Mapping

The purpose of tissue staining is to assist in identifying the different types of tissue components, which carry semantic meaning for the pathologist in identifying abnormalities. We developed an algorithm that assigns a label to each image pixel to generate a TCM for further analysis. We labeled each pixel as one of three tissue components: nuclei, lumen, and stroma/other tissue. The main steps of our TCM generation algorithm are: (1) segmentation of nuclei using color deconvolution and a proposed adaptive thresholding method, (2) segmentation of luminal areas by global thresholding in the red-green-blue (RGB) color space, and (3) designation of all other pixels as stroma/other tissue. These steps are described in more detail as follows.

3.2.1. Nucleus mapping

We segmented cell nuclei by adaptive thresholding of the hematoxylin channel after color deconvolution.44 We then used morphological operations to eliminate RBC pixels that were falsely labeled as nuclei, by relabeling them as stroma/other tissue. The details of these steps are described below.

Color deconvolution. We used a color deconvolution algorithm44 to separate the H&E stains into three image channels corresponding to the hematoxylin stain, eosin stain, and the background.

We used the standard deconvolution matrix of Ruifrok and Johnston44 and applied the algorithm to each ROI independently. This separated each ROI into three gray-level images representing the amounts of hematoxylin [e.g., Fig. 3(b); darker regions correspond to larger amounts of hematoxylin stain], eosin, and background, respectively. Since most substances within nuclei are basophilic, they bind to hematoxylin. We, therefore, used the hematoxylin channel to segment nuclei.
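As an illustration, scikit-image's rgb2hed implements the Ruifrok and Johnston deconvolution with the standard stain matrix; a minimal sketch (our own, not the paper's code) for extracting the hematoxylin channel of an ROI:

```python
from skimage.color import rgb2hed

def hematoxylin_channel(roi_rgb):
    """Return the hematoxylin stain density for an RGB ROI.

    rgb2hed applies Ruifrok & Johnston color deconvolution with the standard
    stain matrix; channel 0 is hematoxylin.
    """
    return rgb2hed(roi_rgb)[:, :, 0]   # larger values = more hematoxylin
```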

Fig. 3. Plot of the number of connected components and the binarized hematoxylin channel at the corresponding thresholds. (a) Sample ROI. (b) Gray-level image representation of the hematoxylin channel after color deconvolution. (c)–(g) Binary maps after thresholding at the thresholds marked in the plot by the blue triangle, purple circle, black square, red arrow, and blue square, respectively. The region of the plot highlighted by the red square is shown in a zoomed-in view in (h). (i), (j) Zoomed-in views of the regions highlighted by the red squares in (e) and (f), respectively.

Cell nucleus segmentation using adaptive thresholding. Staining variability results in variations in hematoxylin channel intensity across different WSIs [see the middle images for the two sample ROIs in Fig. 4(a)]. The left case has much more darkly hematoxylin-stained stroma/fibromuscular tissue than the right, where the nuclei are more prominent. This makes global thresholding inapplicable for nucleus segmentation. We, therefore, propose an adaptive thresholding method.

Fig. 4. Tissue component segmentation. Top row: two cases of H&E-stained WSIs with their TCMs to the right. (a) Left: ROI samples from each of the cases; middle: gray-level hematoxylin channel images; right: nuclei maps after adaptive thresholding. (b) Left: ROI samples from each of the cases; right: TCMs computed using our segmentation method. (c) Left: ROI samples from the two cases with RBCs included; right: TCMs computed using our segmentation method.

Our proposed algorithm is based on the following observation. As the threshold on the hematoxylin channel increases, more hematoxylin-stained tissue is excluded from the thresholded image [Figs. 3(c)–3(g)]. Initially, components of background tissue are excluded, producing a sharp decrease in the number of connected components [Figs. 3(c) and 3(d)] and a nuclei map that still contains some background tissue. With a further increase of the threshold, the remaining background tissue is excluded to form a nuclei-only map [Fig. 3(e)]. At this point of inflection, the threshold separates nuclei from the other tissue components [Fig. 3(e)]. As the threshold increases further, the number of connected components increases slightly [shown qualitatively in Figs. 3(e) and 3(f) and quantitatively in Fig. 3(h) between the black square and the red arrow] because pixels within a single nucleus fall below the threshold, splitting that nucleus into multiple independent connected components [Figs. 3(e), 3(f), 3(i), and 3(j)].

To detect this inflection point, we first fit a high-order polynomial curve to the data depicting the number of connected components as a function of the threshold. The hyper-parameter of curve order was tuned manually to 20 using our tuning data set, with two objectives in mind: (1) making the polynomial order as low as possible to avoid noise generated from oscillation due to the Runge phenomenon45 and (2) minimizing the squared error between the curve and the original points. The resulting continuous and differentiable plane curve was fit using least-squares approximation, and the coefficients were calculated using the Vandermonde method46 with the chosen curve order.

Using the fit curve, we computed the desired inflection point in three steps. First, we computed the threshold \tilde{x} giving the largest number of connected components [blue triangle in Figs. 3 and 5(a)]:

\tilde{x} = \arg\max_{x \in X} F(x),  (1)

where x is a threshold on the hematoxylin channel, X is the domain of thresholds on the hematoxylin channel, and F is the fit polynomial curve. Second, we computed the threshold \ddot{x} > \tilde{x} corresponding to the sharpest decrease in connected components [purple circle in Figs. 3 and 5(a) and 5(c)]:

\ddot{x} = \arg\max_{x \in X,\, x > \tilde{x}} \left[ -F'(x) \right].  (2)

Third, the threshold x_T \in X_T is defined as the rising inflection point closest to \ddot{x} [black square in Figs. 3 and 5(a) and 5(b)]:

x_T = \arg\min_{x \in X_T} |x - \ddot{x}|,  (3)

X_T = \{ x > \ddot{x} \mid (\exists \delta \in \mathbb{R}^+)(\forall \epsilon < \delta)\, [ F'(x) > F'(x + \epsilon) \wedge F'(x) > F'(x - \epsilon) ] \},  (4)

where X_T is the set of isolated local maxima of the first derivative of the curve F (and therefore the set of rising inflection points) such that x > \ddot{x}.

Fig. 5. Plots of (a) the number of connected components versus the amount of hematoxylin stain, with the fitted curve in red; (b) the first derivative of the curve in (a); and (c) the second derivative of the curve in (a).

This threshold was found for each WSI via a cumulative assessment of 2000 randomly selected 120  μm×120  μm samples lying within the prostate (i.e., avoiding clear slide areas) and not containing tissue marking dye (i.e., avoiding areas of artifact).
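A minimal Python sketch of this threshold-selection procedure is shown below (the paper's implementation was in MATLAB; the threshold range and sampling density here are our assumptions, and the comments reference Eqs. (1)–(4) above):

```python
import numpy as np
from scipy.ndimage import label

def adaptive_nucleus_threshold(hema_samples, thresholds=np.linspace(0.0, 2.0, 200), order=20):
    """Select a per-WSI nucleus threshold from sampled hematoxylin-channel tiles.

    hema_samples: list of 2-D hematoxylin-density arrays (the 2000 randomly
    sampled 120 um x 120 um tiles described in the text).
    """
    # F(x): total number of connected components over all tiles at each threshold.
    counts = [sum(label(tile >= t)[1] for tile in hema_samples) for t in thresholds]

    # Fit a degree-20 polynomial to the counts-versus-threshold data (least squares).
    F = np.poly1d(np.polyfit(thresholds, counts, order))
    dF = F.deriv(1)

    x = np.linspace(thresholds[0], thresholds[-1], 2000)
    x_tilde = x[np.argmax(F(x))]                        # Eq. (1): peak of F
    after_peak = x[x > x_tilde]
    x_ddot = after_peak[np.argmin(dF(after_peak))]      # Eq. (2): sharpest decrease

    # Eqs. (3)-(4): closest rising inflection point (local max of F') above x_ddot.
    cand = after_peak[after_peak > x_ddot]
    if cand.size < 3:
        return x_ddot                                   # no inflection point found
    d = dF(cand)
    is_local_max = np.r_[False, (d[1:-1] > d[:-2]) & (d[1:-1] > d[2:]), False]
    rising = cand[is_local_max]
    return rising[0] if rising.size else x_ddot
```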

RBC removal. Since RBCs [Figs. 6(a) and 6(f)] also stain with hematoxylin, the adaptive thresholding process erroneously labels them as nuclei [Fig. 6(b)]. However, RBCs have higher saturation than nuclei in a red-pink hue, allowing us to distinguish them from nuclei. We selected and applied hue-saturation-intensity RBC thresholds (hue ≥ 0.95, saturation ≥ 0.72, and intensity ≥ 0.6, on a normalized [0, 1] scale) based on a cumulative histogram from 100 40 μm × 40 μm RBC ROIs selected from our tuning data set [see a sample thresholding result in Fig. 6(c)]. We then applied morphological dilation with a disk-shaped structuring element of radius = 4 μm (the approximate radius of a human RBC) [Fig. 6(f)]. This resulted in an RBC mask [Fig. 6(e)] that was subtracted from the nucleus map [Fig. 6(b)] to eliminate falsely detected RBCs [Fig. 6(d)].
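A sketch of this RBC-removal step under stated assumptions (scikit-image's HSV conversion standing in for the hue-saturation-intensity space, and the threshold comparisons read as lower bounds):

```python
import numpy as np
from skimage.color import rgb2hsv
from skimage.morphology import binary_dilation, disk

def remove_rbcs(roi_rgb, nuclei_mask, um_per_pixel=0.5):
    """Relabel RBC pixels that were falsely detected as nuclei.

    The HSV thresholds follow the values quoted in the text (assumed here to
    be lower bounds on a [0, 1] scale); the 4-um dilation radius approximates
    a human RBC.
    """
    hsv = rgb2hsv(roi_rgb)
    rbc = (hsv[..., 0] >= 0.95) & (hsv[..., 1] >= 0.72) & (hsv[..., 2] >= 0.6)
    rbc = binary_dilation(rbc, disk(int(round(4 / um_per_pixel))))  # 4 um -> 8 px
    return nuclei_mask & ~rbc   # subtract the RBC mask from the nuclei map
```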

Fig. 6. RBC removal for an example ROI (a). (b), (d) Nuclei maps before and after RBC removal, respectively. (c) Binary mask covering the RBCs, generated by thresholding in HSV color space. (e) Binary mask created from (c) after a morphological operation with a disk-shaped structuring element of radius = 4 μm [the approximate radius of a human RBC (f)].

3.2.2. Lumen and stroma/other tissue component segmentation

Luminal regions are consistently nearly white on each WSI. We used global thresholding to segment luminal pixels with threshold values of red ≥ 0.86, green ≥ 0.71, and blue ≥ 0.82 (on a normalized [0, 1] scale; the values were chosen using a cumulative histogram calculated from luminal ROIs sampled from the tuning data set). All pixels that were not labeled as nuclei or lumen were labeled as stroma/other.
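Putting the pieces together, a minimal sketch of the final three-class labeling (the 0/1/2 label encoding and the nuclei-over-lumen precedence are our choices, not stated in the paper):

```python
import numpy as np

def tissue_component_map(roi_rgb, nuclei_mask):
    """Assemble the three-class TCM: 0 = stroma/other, 1 = nuclei, 2 = lumen.

    roi_rgb: float RGB image scaled to [0, 1]; nuclei_mask: boolean map from
    the adaptive thresholding and RBC removal steps above.
    """
    r, g, b = roi_rgb[..., 0], roi_rgb[..., 1], roi_rgb[..., 2]
    lumen = (r >= 0.86) & (g >= 0.71) & (b >= 0.82)    # nearly white pixels
    tcm = np.zeros(roi_rgb.shape[:2], dtype=np.uint8)  # default: stroma/other
    tcm[lumen] = 2
    tcm[nuclei_mask] = 1    # nuclei take precedence over lumen
    return tcm
```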

3.2.3. Tuning ROI size and down-sampling ratio

The ROI size used in the experiment affects the resolution of the resulting cancer map, and the down-sampling ratio of the TCM affects computation time. We selected an ROI size of 480 μm × 480 μm (960×960 pixels) and a nearest-neighbor down-sampling ratio of 0.25 by experimentation with our tuning data set. We tested ROI sizes from 120 μm × 120 μm to 720 μm × 720 μm in 120-μm steps and down-sampling ratios from 0.25 to 1 in increments of 0.25. These two parameters were selected manually by evaluating cancer detection performance (AUC, using FisherC) in leave-one-patient-out cross validation (LOPO CV) on the tuning data set.

3.3. Feature Extraction and Selection

We calculated 24 first-order and 132 second-order statistical features40,41 from the TCM of each ROI, giving a total of 156 features. The second-order statistical features were based on the gray-level co-occurrence matrix (GLCM)40 and the gray-level run length matrix (GLRLM).41 GLCMs and GLRLMs were calculated using neighbors in four directions [offsets (0, 1), denoted direction 1 in Table 1; (−1, 1), denoted 2; (−1, 0), denoted 3; and (−1, −1), denoted 4] without aggregation over the directions. In total: (22 GLCM + 11 GLRLM features) × 4 directions + 24 first-order features = 156 features.

Table 1. Selected features used in cross validation.

Mean gradient value
GLRLM short-run low gray-level emphasis-1
GLRLM short-run low gray-level emphasis-3
GLRLM short-run low gray-level emphasis-4
GLCM energy-1
GLCM energy-2
GLCM information measure of correlation-1
GLCM information measure of correlation-2
GLCM inverse difference moment-2
GLCM inverse difference moment-3
GLCM cluster shade-3
GLCM correlation-1
GLCM entropy-2
GLRLM short-run emphasis-3

We selected the 14 top-ranked features using backward feature selection, ranking by the AUCs from LOPO CV of a Fisher linear classifier with the TCMs on the tuning data set. The number of features was chosen through iterative experiments from 1 to 156 using the feature selection procedure described. The chosen texture features are listed in Table 1.
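For illustration, GLCM features analogous to some of those in Table 1 can be computed with scikit-image; note that graycoprops exposes only a subset of the 22 GLCM statistics used here, and the GLRLM features would require a separate implementation:

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(tcm: np.ndarray) -> dict:
    """Example second-order features from a three-level TCM (labels 0-2).

    The four angles correspond to the four distance-1 offsets in the text;
    features are kept per direction, without aggregation.
    """
    glcm = graycomatrix(tcm, distances=[1],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=3, symmetric=False, normed=True)
    feats = {}
    for prop in ("energy", "correlation", "homogeneity"):  # homogeneity ~ inverse difference moment
        feats[prop] = graycoprops(glcm, prop).ravel()      # one value per direction
    return feats
```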

3.4. Cancer Detection Using Machine Learning

We classified each ROI as cancerous versus noncancerous using the calculated features. We performed supervised machine learning on the TCMs using (1) FisherC, (2) LoglC, and (3) SVM (NU-SVC with a radial basis function kernel; parameters tuned to cost = 12.5 and gamma = 0.50625 using our tuning data set). These classifiers are henceforth denoted (1) TCM-texture-FisherC, (2) TCM-texture-LoglC, and (3) TCM-texture-SVM, respectively.

We also used transfer learning by fine-tuning pretrained AlexNet with our TCMs and raw image ROIs, which are denoted as: (4) tune-AlexNet-TCM and (5) tune-AlexNet-RawIM. The methodological details of using transfer learning are described in the following paragraphs.

We used TCMs as input images to fine-tune the pretrained AlexNet. The TCMs were converted into RGB color images, in which red, green, and blue represent nuclei, stroma/other, and lumen, respectively. All of the ROIs, of size 240×240×3, were resized to 227×227×3 (to conform to the required input size for AlexNet) using bilinear interpolation. For comparison, the same method as that used for tune-AlexNet-TCM was repeated using the "raw" unmodified H&E images instead of TCMs.
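A sketch of this input preparation (the specific 8-bit color values assigned to each class are our assumption; only the red/green/blue-to-component correspondence is stated in the text):

```python
import numpy as np
from skimage.transform import resize

# Map each TCM label to the RGB encoding described above:
# nuclei -> red, stroma/other -> green, lumen -> blue.
PALETTE = np.array([[0, 255, 0],    # label 0: stroma/other
                    [255, 0, 0],    # label 1: nuclei
                    [0, 0, 255]],   # label 2: lumen
                   dtype=np.uint8)

def tcm_to_alexnet_input(tcm: np.ndarray) -> np.ndarray:
    """Convert a (H, W) TCM into a 227x227x3 image for AlexNet."""
    rgb = PALETTE[tcm]                                # (H, W, 3) uint8
    return resize(rgb, (227, 227), order=1,           # order=1: bilinear
                  preserve_range=True).astype(np.uint8)
```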

We replaced the final fully connected layer of AlexNet with a fully connected layer having a two-way output, followed by a two-way softmax producing a two-class label output (cancerous versus noncancerous). We calculated the loss function using cross entropy. The weights and biases of the replaced layer were initialized with random numbers. We set the initial learning rate α = 0.0001 for all of the other layers and α = 0.002 for the output layer so that the weights and biases of the pretrained layers remained almost unchanged while those of the output layer learned faster. We used the adaptive moment estimation ("Adam") optimizer47 for gradient descent, with hyper-parameters β1 = 0.9, β2 = 0.9999, and ε = 10^−8.47 Other training parameters were set as minibatch size = 200 and maximum number of epochs = 10; these (including α) were chosen using our tuning data set.
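For readers working outside MATLAB, the following PyTorch sketch mirrors the fine-tuning setup described above (the layer replacement, per-layer learning rates, and Adam hyper-parameters come from the text; everything else, including the torchvision weights used, is an assumption):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained AlexNet and replace the final fully connected
# layer with a two-way output (cancerous versus noncancerous).
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, 2)

# Per-layer learning rates from the text: 0.0001 for the pretrained layers,
# 0.002 for the new output layer, with the stated Adam hyper-parameters.
pretrained = [p for n, p in model.named_parameters()
              if not n.startswith("classifier.6")]
optimizer = torch.optim.Adam(
    [{"params": pretrained, "lr": 1e-4},
     {"params": model.classifier[6].parameters(), "lr": 2e-3}],
    betas=(0.9, 0.9999), eps=1e-8)
loss_fn = nn.CrossEntropyLoss()   # cross entropy over the two-way softmax
```

The training loop itself (minibatch size 200, at most 10 epochs) would iterate over this optimizer in the usual way.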

3.5. Experimental Design and Evaluation Methods

3.5.1. Cross validation

We performed LOPO CV for each classifier, classifying every ROI covering each WSI as cancerous versus noncancerous. In each CV fold, data were grouped on a per-patient basis such that no same-patient samples were used in both the training and testing sets. During training, the positive (cancerous) and negative (noncancerous) samples were balanced by random subsampling of the negative samples. We performed testing on all ROIs covering each WSI in our 68-patient set (i.e., all tissue on all slides was classified; we did not use only selected ROIs in our experiments).
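A minimal sketch of this per-patient grouping and class balancing using scikit-learn (helper names are ours):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def lopo_splits(features, labels, patient_ids, seed=0):
    """Yield LOPO train/test indices with class-balanced training data.

    labels: np.ndarray of {0, 1}; patient_ids: one group label per ROI.
    Negative training samples are randomly subsampled to match the positives.
    """
    rng = np.random.default_rng(seed)
    for train_idx, test_idx in LeaveOneGroupOut().split(features, labels,
                                                        groups=patient_ids):
        pos = train_idx[labels[train_idx] == 1]
        neg = train_idx[labels[train_idx] == 0]
        neg = rng.choice(neg, size=len(pos), replace=False)  # balance classes
        yield np.concatenate([pos, neg]), test_idx
```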

We calculated the error metrics (at a fixed operating point with a confidence level of 0.5) of error rate, FPR, false negative rate (FNR), AUC, and area under the precision-recall curve (AU-PRC)48 by validating against our expert manual annotations. For the AU-PRC, we also calculated the baseline as a reference, which reflects the AU-PRC that would result from random guessing.48 The overall performance was measured by averaging each of the error metrics across all of the patients. The sample sizes for each tissue type are shown in Table 2. We also calculated the AUCs for each tissue type using each of the methods from the LOPO CV.
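The per-patient metrics can be computed as in the following sketch (average_precision_score is used here as a standard approximation of the AU-PRC; the baseline is the positive-class prevalence):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def per_patient_metrics(y_true, scores, threshold=0.5):
    """Error metrics at the fixed 0.5 operating point, plus AUC and AU-PRC."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(scores) >= threshold
    return {
        "error_rate": np.mean(y_pred != y_true),
        "FNR": np.mean(y_pred[y_true == 1] == 0),   # missed cancerous ROIs
        "FPR": np.mean(y_pred[y_true == 0] == 1),   # false alarms
        "AUC": roc_auc_score(y_true, scores),
        "AU-PRC": average_precision_score(y_true, scores),
        "baseline": np.mean(y_true),                # prevalence = random-guess AU-PRC
    }
```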

Table 2. Sample size (number of ROIs) for each tissue type (480 μm × 480 μm).

Cancerous tissue types:
  G3: 14,718; G4: 3,949; G5: 37; EPE: 272; G3+4: 6,008; G4+3: 3,839; G4+5: 727; G5+4: 8,216; G5+3: 16
Noncancerous tissue types:
  Atrophy: 5,433; PIN: 26,449; Healthy: 1,178,814

We performed statistical tests to compare the AUCs and AU-PRCs obtained from the different methods. We grouped the AUC and AU-PRC values into three groups: group 1 (TCM-texture-FisherC, TCM-texture-LoglC, and TCM-texture-SVM); group 2 (tune-AlexNet-TCM); and group 3 (tune-AlexNet-RawIM). We tested the AUCs for normality using the Shapiro-Wilk test. We compared the AUCs of the three groups (i.e., group 1 versus group 2 versus group 3) using the Kruskal-Wallis test. We then compared each pair of groups using the Wilcoxon rank sum test. We performed the same tests for the AU-PRCs.

We also performed a two-split validation on data from the two different scanners, where the WSIs from 43 patients were from scanner 1, and the WSIs from the remaining 25 patients were from scanner 2. We used tune-AlexNet-TCM and tune-AlexNet-RawIM to perform this experiment.

3.5.2. Training sample size experiment

We conducted an experiment to investigate how the number of patients in the training set influences the system performance for each machine learning approach. We randomly selected 34 patients as the testing set, then iteratively trained the classifiers using training set sizes ranging from 1 patient to 33 patients. At each iteration, we trained all of the classifiers described in Sec. 3.4, tested the trained systems on the entire 34-patient testing set, and computed error metrics as described in Sec. 3.5.1.

4. Results

4.1. Tissue Component Segmentation

The average adaptive threshold on the hematoxylin channel for the WSIs within each patient ranged from 0.5 to 1.1. The standard deviation ranged from 0 to 0.4, with 42% of patients having a standard deviation ≤0.1. The average ± standard deviation threshold over all patients was 0.7 ± 0.2.

Qualitative results are shown in Fig. 4, where staining variability can be observed between the two WSIs shown [Fig. 4(a)]. The hematoxylin channel thresholds differ between the samples after color deconvolution [Fig. 4(a), middle image for each case]. The nuclei were segmented by our adaptive thresholding method to generate the nuclei maps [Fig. 4(a), right image for each case].

Figure 4 illustrates the TCM for each of the two WSIs, shown beside each case. Note the consistency of the maps despite the observed staining variability. Figures 4(b) and 4(c) show zoomed views of samples taken from the two cases and their TCMs after segmentation. Also note that the confounding RBCs, circled in Fig. 4(c) left and indicated by the yellow arrows in Fig. 4(c) right, were correctly classified into the "stroma/other" category and not as nuclei.

4.2. Cancer versus Noncancer Classification

The quantitative results for cancer versus noncancer classification from our LOPO CV are reported in Table 3.

Table 3. Average ± standard deviation of error metrics for cancer versus noncancer classification from LOPO CV.

                      Error rate (%)  FNR (%)    FPR (%)   AUC [0, 1]  AU-PRC     Baseline
TCM-texture-FisherC   13.7±6.7        12.6±16.9  13.6±7.1  0.94±0.05   0.36±0.26  0.03±0.26
TCM-texture-LoglC     12.3±6.0        12.4±15.6  13.6±7.1  0.95±0.05   0.38±0.26  0.03±0.26
TCM-texture-SVM       8.5±4.6         13.1±13.7  8.2±4.6   0.96±0.04   0.44±0.25  0.03±0.26
Tune-AlexNet-TCM      6.2±4.0         10.7±12.6  6.0±4.2   0.98±0.03   0.52±0.24  0.03±0.26
Tune-AlexNet-RawIM    5.5±6.9         8.5±13.8   5.2±6.7   0.98±0.02   0.52±0.23  0.03±0.26

From the normality test, neither the AUCs nor the AU-PRCs were normally distributed. The Wilcoxon rank sum test results for each pair of groups are shown in Table 4. For the AUCs, groups 1 and 2, groups 1 and 3, and groups 2 and 3 were significantly different. For the AU-PRCs, groups 1 and 2 and groups 1 and 3 differed significantly; groups 2 and 3 did not.

Table 4. Wilcoxon rank sum test results for each pair of groups.

AUC
  Group 1 vs. Group 2: p < 0.0001
  Group 1 vs. Group 3: p < 0.0001
  Group 2 vs. Group 3: p < 0.006
AU-PRC
  Group 1 vs. Group 2: p < 0.0002
  Group 1 vs. Group 3: p < 0.0004
  Group 2 vs. Group 3: p = 0.8903

Note: p values < 0.05 indicate a significant difference between the pair.

The results from the two-split validation using data from two different scanners are reported in Table 5.

Table 5. Average ± standard deviation of error metrics for cancer versus noncancer classification for WSIs from each scanner.

                      Error rate (%)  FNR (%)    FPR (%)   AUC [0, 1]  AU-PRC     Baseline
Results for WSIs from scanner 1
Tune-AlexNet-TCM      14.0±6.4        5.1±7.0    14.4±7.0  0.97±0.03   0.37±0.20  0.03±0.06
Tune-AlexNet-RawIM    3.7±3.7         35.8±25.9  2.3±2.1   0.96±0.04   0.39±0.23  0.03±0.06
Results for WSIs from scanner 2
Tune-AlexNet-TCM      7.8±3.4         8.0±8.1    7.8±3.4   0.97±0.02   0.50±0.24  0.02±0.02
Tune-AlexNet-RawIM    4.4±2.6         6.4±5.8    4.4±2.6   0.99±0.01   0.50±0.18  0.02±0.02

From the LOPO CV experiments, we calculated the AUCs for cancerous and noncancerous tissue types. The results are shown in Table 6.

Table 6. AUCs for cancer versus noncancer classification for each tissue type from the LOPO CV. The highest AUC in each row is marked with an asterisk.

          TCM-texture-FisherC  TCM-texture-LoglC  TCM-texture-SVM  Tune-AlexNet-TCM  Tune-AlexNet-RawIM
AUCs for each cancerous tissue type
G3        0.89±0.07            0.90±0.07          0.90±0.07        0.95±0.04         0.96±0.03*
G4        0.93±0.07            0.93±0.07          0.92±0.11        0.94±0.12         0.95±0.09*
G5        0.86±0.00            0.85±0.00          0.80±0.00        0.88±0.00*        0.87±0.00
G3+4      0.93±0.07            0.93±0.06          0.93±0.07        0.97±0.03         0.98±0.03*
G4+3      0.94±0.05            0.94±0.05          0.95±0.05        0.97±0.03*        0.97±0.03*
G4+5      0.88±0.00            0.87±0.00          0.84±0.00        0.93±0.00*        0.90±0.00
G5+4      0.92±0.00            0.91±0.00          0.87±0.00        0.94±0.00*        0.90±0.00
G5+3      1.00±0.00*           1.00±0.00*         0.99±0.00        1.00±0.00*        1.00±0.00*
EPE       0.87±0.06*           0.82±0.11          0.73±0.22        0.86±0.20         0.86±0.14
AUCs for each noncancerous tissue type
Atrophy   0.97±0.04            0.98±0.03*         0.94±0.05        0.96±0.07         0.96±0.06
PIN       0.80±0.12            0.83±0.11          0.84±0.11        0.92±0.07         0.93±0.06*
Healthy   0.91±0.06            0.92±0.06          0.92±0.05        0.96±0.03         0.97±0.03*

Confusion matrices are shown in Table 7 for each classification method, using the fixed operating point corresponding to the confidence level of 0.5.

Table 7. Confusion matrix for each method from LOPO CV; each sample is a 480 μm × 480 μm ROI. Columns give the pathologist annotation; rows give the system-predicted label.

                                 Cancerous    Noncancerous
TCM-texture-FisherC
  Predicted cancerous            32,134       162,974
  Predicted noncancerous         5,647        1,047,722
TCM-texture-LoglC
  Predicted cancerous            31,385       145,563
  Predicted noncancerous         6,396        1,065,133
TCM-texture-SVM
  Predicted cancerous            30,260       99,523
  Predicted noncancerous         7,521        1,111,173
Tune-AlexNet-TCM
  Predicted cancerous            32,092       70,530
  Predicted noncancerous         5,689        1,140,166
Tune-AlexNet-RawIM
  Predicted cancerous            30,537       66,983
  Predicted noncancerous         7,244        1,143,713

Figure 7 shows the quantitative results from the training sample size experiment.

Fig. 7. Plots of error metrics versus the number of training patients for cancer detection. Red and black dashed lines: reference points using 12 and 32 patients for training, respectively.

Figure 8 illustrates the capability of our system to map cancer throughout entire WSIs. The upper and lower cases in Fig. 8 depict average and below-average performance, respectively (see the error metrics below each map in Fig. 8 in comparison to Table 3), for LOPO CV using each of the three methods shown. Our implementation is not optimized for speed and consists primarily of MATLAB code.

Fig. 8. Cancer maps for two example WSIs. (a), (e) Example WSIs; (b), (f) cancer maps from TCM-texture-SVM; (c), (g) cancer maps from tune-AlexNet-TCM; (d), (h) cancer maps from tune-AlexNet-RawIM. Colored contours in each image are the pathologist's annotations. The error metrics are given below each cancer map. Labels in the cancer maps: dark gray, true positives; light gray, true negatives; black, false positives; white, false negatives; ER, error rate.

The non-deep learning methods require approximately 45 min to map cancer throughout an entire WSI using an Intel i5 workstation at 3.10 GHz with 24 GB of random access memory. For the deep learning methods, the tune-AlexNet-TCM method requires approximately 15 min, and the tune-AlexNet-RawIM method requires approximately 3 min.

5. Discussion

In this paper, we proposed an approach to tissue component mapping that finds the loci of nuclei, luminal regions, and other tissue components (including stroma) based on our adaptive thresholding algorithm, which compensates for staining variability across WSIs. This algorithm is amenable to fast implementation and yields consistent TCMs supporting cancer detection using machine learning algorithms. We validated conventional and deep learning-based approaches for classifying 480 μm × 480 μm ROIs as cancer or noncancer throughout digitized mid-gland WSIs of RP sections. We did not subsample the tissue available in our data set. We cross validated using all of the tissue on every slide, ensuring that tissues from the same patient never appeared in both the training and testing sets. For CV, we used 286 WSIs from 68 patients comprising 1.3 million 480 μm × 480 μm ROIs, which is 3588 cm2 of prostate tissue in total. To the best of our knowledge, this represents the largest validation data set presented thus far in the literature for this problem. All of the validated methods achieved similar or better performance with respect to comparable previously published approaches. The measured processing times suggest the potential of an optimized, parallel implementation of the algorithms to yield processing speeds compatible with the clinical pathology workflow, upon further multicenter validation.

5.1. Tissue Component Mapping

The proposed nuclei segmentation method demonstrated robustness to staining variability. Even in our single-center data set, where robotic staining systems were used, we observed substantial staining variability across and within patients. This was evidenced by the substantial WSI-to-WSI variability in the optimal thresholds on the hematoxylin channel for nuclei segmentation. This suggests that the concentration of hematoxylin stain for nuclei in WSIs from different patients varies substantially. This is illustrated qualitatively by two example WSIs that were stained very differently (Fig. 4) and the corresponding hematoxylin channel images [Fig. 4(a), middle image sets for both cases]. The hematoxylin channel intensity of the nuclei in the more lightly stained image [Fig. 4(a), right case] is similar to that of the stroma in the more darkly stained image [Fig. 4(a), left case]. Figure 4 shows the proposed method's segmentation of nuclei from the background tissue despite the staining variation; compare the loci and shapes of the nuclei in the example output label maps with the original stained images shown in Figs. 4(a) and 4(b). In addition, our method successfully assigned RBCs to stroma/other tissue [Fig. 4(c)] for the samples in Fig. 4. Although we did not evaluate the segmentation algorithm via quantitative measurement, the validation results for cancer detection (Table 3) indirectly reflect the utility of the segmentation algorithm.

Compared with machine learning- or normalization-based approaches,21,28,49 our proposed algorithm does not use other images for calibration/training. Therefore, this approach could reduce differences between validation performance and performance on an unknown population. In addition, this algorithm, whose implementation has not been optimized for speed, has a processing time of approximately 45 min/whole-mount section using conventional machine learning and approximately 15 min/whole-mount section using AlexNet on a standard workstation for cancer detection throughout entire slides. In clinical practice, it is important for the system to be able to process whole-mount sections within a reasonable time frame.

5.2. Cancer versus Noncancer Classification

5.2.1. Overall system evaluation

Our system yielded state-of-the-art overall performance (all metrics except AU-PRC in Table 3) compared to literature-reported performance on this problem. Since our data set is highly imbalanced (0.037 million cancerous ROIs versus 1.2 million noncancerous ROIs), we also computed the AU-PRC to evaluate system performance. Because we have a large number (1.2 million) of negative samples, our FPRs (Table 3) imply a large absolute number of false positives in detection. However, the FPR values (Table 3) are small, indicating that the number of false positives is small relative to the total number of negatives. In addition, our FNRs (Table 3) imply high recall. More intuitively, in interpreting the confusion matrices (Table 7), it is important to note that the total number of negatives in our data set is far larger than the total number of positives. This leads to what appears, on first impression, to be a large number of false positive classifications. However, this concern is tempered when comparing the number of false positives to the generally much larger number of true negatives in each case (i.e., there is a large amount of negative tissue, and the classifiers are correctly identifying it as such most of the time). This is helpful in the clinical scenario where this system assists the pathologist in reviewing slides, since a high-recall system minimizes the chance that a cancerous region will be missed and improves pathologist efficiency by drawing attention to most of the cancer on each slide. Since previous studies in this area, as discussed in Sec. 2, did not use the AU-PRC metric, direct comparison of our work to previous studies using this metric is not possible. However, the system performance on this metric can be evaluated by comparing the system AU-PRC to the baseline,48 which reflects the expected AU-PRC arising from random guessing.48 All of the methods yielded AU-PRCs much higher than the baselines, indicating performance superior to random guessing.

This is the first study to use all tissue covering WSIs of whole-mount RP sections, including all clinically relevant grade groups. This performance suggests the potential to use our proposed pipeline for cancer detection in a clinical setting after multicenter validation. The deep learning approaches overall outperformed the conventional machine learning-based approaches (AUCs and AU-PRCs in Table 3 and the statistical tests in Table 4). Also, tune-AlexNet-RawIM yielded superior overall performance compared with tune-AlexNet-TCM on the AUC metric (Tables 3 and 4), but no significant difference was observed on the AU-PRC metric (Tables 3 and 4). The significant differences given by the Wilcoxon rank sum test indicate that the AUC distributions are different, even though their means are similar. Tune-AlexNet-RawIM yielded higher AUCs for patients without G5 cancer involvement, whereas tune-AlexNet-TCM yielded higher AUCs for patients with G5 cancer involvement.

Our proposed pipeline using the deep learning approaches was not substantially affected by scanner type, as the overall results from the two-split validation (AUCs in Table 5) were close to the LOPO CV overall results (AUCs in Table 3). Although testing on the data from scanner 1 yielded inferior performance compared with the LOPO CV on the AU-PRC metric (AU-PRC in Tables 3 and 5), a much smaller sample size was used for training (i.e., training used the 25 patients from scanner 2). In particular, the training sample size for each cancerous tissue type was much smaller (baseline for data from scanner 2 in Table 5).

All of the methods were affected by tissue type and the corresponding sample size of each tissue type (Tables 2 and 6). Higher AUCs were often associated with larger sample sizes for the different tissue types, whereas lower AUCs were associated with smaller sample sizes, with one anomaly: although G5+4 has a relatively large sample size of 8,216 ROIs, its effective training sample size was small because most of the G5 cancer was concentrated in very few patients, and no tissue from the same patient was used in both training and testing.

The classifier performance for different tissue types varies according to the machine learning method used. We speculate this may relate to the different sample sizes of the different tissue types. Tune-AlexNet-RawIM and TCM-texture-SVM are the most sensitive to sample size, performing worse than the other methods for G5, G4+5, G5+4, and EPE, all of which have relatively small sample sizes (Tables 2 and 6; bottom case in Fig. 8). Because G5 is the most aggressive cancer, it is important not to miss any G5-involved tissue in clinical practice. In comparison, the performance of tune-AlexNet-TCM and TCM-texture-FisherC was the least sensitive to the smaller sample sizes of these tissue types. Despite the small sample size, tune-AlexNet-TCM demonstrated the best performance for these tissue types (Tables 2 and 6; bottom case in Fig. 8). This is reinforced by the observation that, in the training sample size experiment, the FNRs of tune-AlexNet-RawIM and TCM-texture-SVM decreased much more with increasing sample size than those of the other methods (Fig. 7, FNR metric). Tune-AlexNet-TCM and tune-AlexNet-RawIM showed larger fluctuations in FNR than the TCM-texture-based methods when the training sample size was small (Fig. 7, FNR before the red dashed line). We speculate that this may arise from the inherent randomness associated with fine-tuning AlexNet. As the number of training patients increases, the amplitude of the fluctuation decreases (all error metrics in Fig. 7, especially FNR), suggesting that larger training sample sizes may increase the stability of the tune-AlexNet methods. In addition, we found that the amplitude of the FNR fluctuations was larger than that of the FPR fluctuations. This may be due to the heterogeneity of PCa tissue: there are many different cancerous tissue types with relatively small sample sizes, compared with the larger samples of noncancerous tissue.

In the LOPO CV, we also found that, for PIN, the FPRs were smaller when using the more complex models (tune-AlexNet-TCM and -RawIM) (Table 6). We speculate that this could be due to PIN's resemblance to cancerous tissue, requiring more complex models or more image information (i.e., raw images or TCMs rather than 14 calculated texture features) to differentiate it from cancerous tissue. From a clinical perspective, high-grade PIN is considered a putative precursor lesion, as PIN shares features with cancerous tissue.50

5.2.2. Performance comparison by methods

Tune-AlexNet-RawIM yielded the highest AUC and AU-PRC in the LOPO CV and in the training sample size experiment, but it is sensitive to sample size. Although Shin et al.51 demonstrated efficient (i.e., smaller sample size) training by fine-tuning pretrained AlexNet, and this method generally performs better than or equal to tune-AlexNet-TCM and the TCM-texture-based methods (all of the error metrics in Fig. 7 and the LOPO CV in Table 3), tune-AlexNet-RawIM yielded inferior performance in terms of FNR with training sets containing fewer than 12 patients (Fig. 7, before the red dashed line). This may be because of the heterogeneity of cancerous tissue samples; 12 patient data sets may not provide enough positive samples of each cancerous tissue type with which to train the system. This is also reflected by the inferior performance of tune-AlexNet-RawIM on cancerous tissue types involving G5 (see the AUCs for tissue types G5, G4+5, and G5+4 in Table 6). This could be due to the fact that only two patients had G5 cancer in our data set, combined with the fact that we performed LOPO CV. To illustrate, note the large portion of G5+4 cancerous regions that were missed by tune-AlexNet-RawIM [white regions in Fig. 8(h)]. This is of particular concern because missing G5 cancer could result in failure to apply adjuvant therapy after surgery or could increase the pathologist's necessary editing time to correct the cancer maps. Although tune-AlexNet-RawIM had an error rate of 0% on G5+3 using a large training sample of G5 from tissues of G5, G4+5, and G5+4, the tested sample size was too small (9 ROIs from one patient) to be considered representative.

In comparison, tune-AlexNet-TCM yielded more stable performance in the training sample size experiment and demonstrated superior performance on tissue types with smaller sample sizes. With fewer than 12 training patients, tune-AlexNet-TCM yielded a lower FNR than tune-AlexNet-RawIM in nearly every case (Fig. 7, before the red dotted line). In the LOPO CV, even though the samples of G5-involved tissue types were restricted to two patients, the highest AUCs were achieved by tune-AlexNet-TCM (Table 6). For example, in Fig. 8(g), most of the cancerous regions of G4+5 and G5+4 were captured by tune-AlexNet-TCM. This suggests that higher-order tissue features (e.g., TCMs) can enhance the performance of tune-AlexNet when the sample size is small. Moreover, comparing tune-AlexNet-TCM and tune-AlexNet-RawIM in the CVs (Table 3), the performance differences are negligible. Given the large reduction from the raw image representation (227 pixels × 227 pixels × 3 color channels, with 256 intensities per channel) to the TCM (227 pixels × 227 pixels × 3 tissue component labels), the computed TCM effectively reduces the dimensionality while making salient the key visual cues for the cancer detection problem.
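As a rough illustration of this reduction (a simplified stand-in, not our calibration-free adaptive thresholding algorithm), the following sketch collapses a raw RGB ROI to a three-label TCM using color deconvolution44 with fixed, assumed thresholds.

    import numpy as np
    from skimage.color import rgb2hed  # Ruifrok-Johnston color deconvolution44

    NUCLEI, LUMINA, OTHER = 0, 1, 2

    def roi_to_tcm(rgb_roi, hema_thresh=0.05, lumen_thresh=0.85):
        """Map a 227 x 227 x 3 uint8 RGB ROI to a 227 x 227 label image.

        The fixed thresholds here are illustrative; our actual algorithm adapts
        the nuclei threshold per WSI to compensate for staining variability.
        """
        hed = rgb2hed(rgb_roi)                    # separate hematoxylin/eosin stains
        gray = rgb_roi.mean(axis=2) / 255.0
        tcm = np.full(rgb_roi.shape[:2], OTHER, dtype=np.uint8)
        tcm[gray > lumen_thresh] = LUMINA         # bright, unstained regions
        tcm[hed[..., 0] > hema_thresh] = NUCLEI   # strong hematoxylin response
        return tcm                                # one of 3 component labels per pixel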

The comparable results among TCM-texture-SVM, tune-AlexNet-TCM, and tune-AlexNet-RawIM in the CVs (Table 3) suggest that extracting features from TCMs can effectively reduce the high-dimensional image information to 14 TCM-based texture features and that the resulting feature set is appropriate for our problem. This could support a better understanding of the key visual cues for our problem. Among the TCM-texture-based methods, the SVM was more sensitive to sample size than FisherC and LoglC because (1) its performance improved monotonically and substantially as the training sample size increased, exceeding the performance of the FisherC and LoglC classifiers at a training sample size of 33 in the training sample size experiment (Fig. 7, after the black dotted line), and (2) higher FNRs and FPRs were found for tissue types having smaller sample sizes (G5-involved tissue types, excepting PIN as discussed previously) in the LOPO CV (Tables 2 and 6).
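A hedged sketch of such a pipeline is shown below: gray-level co-occurrence texture features40 computed on a TCM label image and fed to an SVM. The four co-occurrence properties used here are illustrative and do not reproduce our exact 14-feature set.

    import numpy as np
    from skimage.feature import graycomatrix, graycoprops
    from sklearn.svm import SVC

    def tcm_texture_features(tcm, levels=3):
        """Co-occurrence texture features from a label image with values 0..levels-1."""
        glcm = graycomatrix(tcm, distances=[1], angles=[0, np.pi / 2],
                            levels=levels, symmetric=True, normed=True)
        props = ["contrast", "correlation", "energy", "homogeneity"]
        return np.hstack([graycoprops(glcm, p).ravel() for p in props])

    # X: one feature vector per ROI; y: 1 = cancer, 0 = noncancer
    # clf = SVC(kernel="rbf", probability=True).fit(X, y)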

Finally, comparing our method to the more recently published method of Campanella et al.,34 our work is complementary to theirs in that our classifiers map the locations of cancer throughout RP slides with tile-level classifications. This enables quantification of tumor sizes and locations and, coupled with future work automating the segmentation of the surgical margin and prostatic capsule, would enable the quantification of negative prognostic factors such as extraprostatic extension and positive surgical margins. An interesting avenue of future work fusing these two approaches would be to evaluate the accuracy of the weak tile-level labels from the method of Campanella et al. against our manual annotations, to determine the accuracy of their method for whole-slide cancer mapping.
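The tile-level mapping proceeds as sketched below; the tile geometry, function names, and per-tile classifier interface are assumptions for illustration, not our implementation.

    import numpy as np

    def cancer_map(wsi, classify_tile, tile=227, stride=227):
        """Assemble a coarse cancer probability map from per-tile classifications.

        classify_tile(tile_rgb) -> probability that the tile contains cancer.
        """
        h, w = wsi.shape[:2]
        rows, cols = h // stride, w // stride
        heat = np.zeros((rows, cols), dtype=np.float32)
        for r in range(rows):
            for c in range(cols):
                y, x = r * stride, c * stride
                heat[r, c] = classify_tile(wsi[y:y + tile, x:x + tile])
        return heat  # threshold (e.g., at 0.5) for a binary cancer mask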

6. Limitations

Our results should be interpreted in the context of several limitations of our study. All of our tissues were processed in the same clinical pathology laboratory, and the cancerous annotations were done by one physician and verified by one of the two pathologists. These aspects of our study limit the variability of the material (i.e., the tissues and resulting images) and of the observers’ contours. Although our adaptive thresholding algorithm demonstrated the capability of compensating for staining variability, its performance needs to be further tested on an external data set in which the WSIs are processed and collected differently. This algorithm is based on finding an inflection point in the plot of the number of connected components versus the threshold. In theory, this inflection point may not always exist, although this situation did not occur in any of our cases; it does not exist in the extreme case where the background tissue is so darkly stained by hematoxylin that it is similar in appearance to the nuclei. However, this situation can be detected by our algorithm because the solution of our function set does not exist. In addition, validations were conducted at the 480 μm × 480 μm ROI level, with regions containing more than 50% cancer considered positive. All of the conventional machine learning approaches used feature selection on an independent tuning data set, such that the selected features may be overfit to the tuning data set. These factors may lead to suboptimal performance of the system. Future work may include using classifiers that do not require feature selection (e.g., L1- or L2-regularized linear classifiers, random forests, and gradient boosting algorithms). We used only one type of deep learning approach, with pretrained AlexNet; many deep learning architectures (e.g., ResNet and GoogLeNet) have demonstrated superior performance to AlexNet in other image classification tasks, may yield better performance for our problem, and could be important avenues of future work. Although we used an adaptive threshold for nuclei segmentation, it was a global threshold for each WSI; our informal experiments (not reported here) suggested that locally adaptive thresholding did not improve performance, but such an approach could be straightforwardly adapted from our methods if needed. Finally, it must be acknowledged that all CV studies may be subject to positive bias in their results; validation using an external data set is required to support clinical translation of this tool.
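To make the inflection-point idea above concrete, the following conceptual sketch (an illustration, not our exact formulation) counts connected components over candidate thresholds on the hematoxylin channel, returns the threshold at the strongest change in curvature, and flags the degenerate case in which no inflection exists.

    import numpy as np
    from scipy import ndimage

    def component_counts(hema, thresholds):
        """Number of connected components in the binarized hematoxylin channel."""
        return np.array([ndimage.label(hema > t)[1] for t in thresholds])

    def inflection_threshold(hema, thresholds=np.linspace(0.01, 0.5, 50)):
        counts = component_counts(hema, thresholds)
        curvature = np.diff(counts, n=2)  # discrete second difference
        if not curvature.any():           # no inflection: flag the WSI for review
            return None
        return thresholds[np.argmax(np.abs(curvature)) + 1]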

7. Conclusion

In conclusion, our proposed methods demonstrated state-of-the-art performance for automatically mapping PCa on whole-mount WSIs, with validation on the largest high-precision, pathologist-annotated data set reported thus far. Transfer learning with pretrained AlexNet demonstrated performance superior to the tested conventional machine learning approaches. Compared to using raw image ROIs as inputs, the use of TCMs yielded similar overall performance, more stable performance across different tissue types with varying sample sizes, and substantially superior performance for G5, the most aggressive cancer tissue type. Upon successful multicenter validation, this system could support imaging validation studies using annotated histopathology as the gold standard and could also facilitate quantitative and graphical pathology reporting after RP, with the potential to support better prognosis, recurrence risk management, and adjuvant therapy planning.

Acknowledgments

Dr. Ward gratefully acknowledges support from the Canadian Institutes of Health Research (Funding Reference No. CTP 87515), the Ontario Institute for Cancer Research, Prostate Cancer Canada, and the Natural Sciences and Engineering Research Council of Canada.

Biographies

Wenchao Han received his BSc degree in optical and electronic information from Huazhong University of Science and Technology in 2010 and his MEng degree in communication systems and data networking from Western University in 2014. He is a PhD candidate in the Department of Medical Biophysics, Western University, Canada. He is also a research assistant at Baines Imaging Laboratory, London Regional Cancer Program, Canada. He is a member of SPIE.

Carol Johnson received her BSc degree in physics from the University of Toronto. She is a research associate at Baines Imaging Laboratory, London Regional Cancer Program, Canada.

Andrew Warner received his MSc degree in epidemiology and biostatistics from Western University. He is a research associate in the Department of Radiation Oncology, London Regional Cancer Program, Canada.

Mena Gaed received his bachelor of medicine and bachelor of surgery degrees from Cairo University and his MSc degree in pathology and laboratory medicine from Western University, Canada. He is a research associate in the Department of Pathology and Laboratory Medicine at Western University.

Jose A. Gomez received his MD from the Pontifical Xavierian University, Colombia. He is an associate professor in the Department of Pathology and Laboratory Medicine at Western University. He is also an anatomical pathologist at London Health Sciences Center, specializing in the genitourinary tract, head and neck, and breast.

Madeleine Moussa received her MBBCh degree from Ain Shams University, Egypt, and her master’s degree in anatomy from Queen’s University in 1991. She is a professor in the Department of Pathology and Laboratory Medicine at Western University. She is also a genitourinary pathologist at London Health Sciences Center.

Joseph Chin received his MD from the University of Toronto and completed a urologic oncology fellowship at Roswell Park Cancer Institute. He is a professor in the Departments of Urology and Oncology at Western University, London, Ontario, Canada. He is a urologic oncologist at London Health Sciences Center and is active in clinical trials and translational research. He has published 250 research papers and book chapters.

Stephen Pautler received his MD from the University of Calgary in 1995. He completed his urology residency at Western University. He is an associate professor in the Departments of Surgery and Oncology at Western University. He is a surgeon at St. Joseph’s Health Care, London, Canada.

Glenn Bauman is a radiation oncologist and a professor in the Departments of Oncology and Medical Biophysics, Western University, London, Ontario, Canada, specializing in genitourinary and central nervous system malignancies. His primary areas of research interest are multimodality image-guided radiotherapy and cancer imaging.

Aaron Ward received his PhD in computing science from Simon Fraser University, Canada. He is an associate professor in the Departments of Medical Biophysics and Oncology, Western University, Canada. He is also a scientist at Lawson Health Research Institute, Canada. His research interests focus on computer-assisted diagnosis and therapy guidance based on medical imaging and digital pathology, with applications in oncology.

Disclosures

No conflicts of interest, financial or otherwise, are declared by the authors.

Contributor Information

Wenchao Han, Email: whan25@uwo.ca.

Carol Johnson, Email: carol.Johnson@lhsc.on.ca.

Andrew Warner, Email: andrew.warner@lhsc.on.ca.

Mena Gaed, Email: mgaed@uwo.ca.

Jose A. Gomez, Email: Jose.GomezLemus@lhsc.on.ca.

Madeleine Moussa, Email: madeleine.moussa@lhsc.on.ca.

Joseph Chin, Email: joseph.chin@lhsc.on.ca.

Stephen Pautler, Email: stephen.pautler@sjhc.london.on.ca.

Glenn Bauman, Email: glenn.bauman@lhsc.on.ca.

Aaron D. Ward, Email: aaron.ward@uwo.ca.

References

1. Stephenson A. J., et al., “Predicting the outcome of salvage radiation therapy for recurrent prostate cancer after radical prostatectomy,” J. Clin. Oncol. 25(15), 2035–2041 (2007). 10.1200/JCO.2006.08.9607
2. Izawa J. I., “Salvage radiotherapy after radical prostatectomy,” Can. Urol. Assoc. J. 3(3), 245–250 (2009).
3. Gleason D. F., Mellinger G. T., Veterans Administration Cooperative Urological Research Group, “Prediction of prognosis for prostatic adenocarcinoma by combined histological grading and clinical staging,” J. Urol. 111(1), 58–64; discussion 959 (1974). 10.1016/S0022-5347(17)59889-4
4. Egevad L., Srigley J. R., Delahunt B., “International society of urological pathology consensus conference on handling and staging of radical prostatectomy specimens,” Adv. Anatomic Pathol. 18(4), 301–305 (2011). 10.1097/PAP.0b013e3182211ce0
5. van der Kwast T. H., et al., “International Society of Urological Pathology (ISUP) consensus conference on handling and staging of radical prostatectomy specimens. Working group 2: T2 substaging and prostate cancer volume,” Mod. Pathol. 24(1), 16–25 (2011). 10.1038/modpathol.2010.156
6. Evans A. J., et al., “Interobserver variability between expert urologic pathologists for extraprostatic extension and surgical margin status in radical prostatectomy specimens,” Am. J. Surg. Pathol. 32(10), 1503–1512 (2008). 10.1097/PAS.0b013e31817fb3a0
7. Epstein J. I., et al., “The 2005 International Society of Urological Pathology (ISUP) consensus conference on Gleason grading of prostatic carcinoma,” Am. J. Surg. Pathol. 29(9), 1228–1242 (2005). 10.1097/01.pas.0000173646.99337.b1
8. Sun M., et al., “Insights of modern pathology reports originating from prostate biopsy and radical prostatectomy specimens,” Eur. Urol. 62(1), 40–41 (2012). 10.1016/j.eururo.2012.03.042
9. Bettendorf O., et al., “Implementation of a map in radical prostatectomy specimen allows visual estimation of tumor volume,” Eur. J. Surg. Oncol. 33(3), 352–357 (2007). 10.1016/j.ejso.2006.11.004
10. Eminaga O., et al., “CMDX©-based single source information system for simplified quality management and clinical research in prostate cancer,” BMC Med. Inf. Decis. Making 12, 141 (2012). 10.1186/1472-6947-12-141
11. Gibson E., et al., “Registration of prostate histology images to ex vivo MR images via strand-shaped fiducials,” J. Magn. Reson. Imaging 36(6), 1402–1412 (2012). 10.1002/jmri.23767
12. Soetemans D. J., “Computer-assisted characterization of prostate cancer on magnetic resonance imaging,” Electronic Thesis and Dissertation Repository 4504 (2017).
13. Croke J., et al., “Proposal of a post-prostatectomy clinical target volume based on pre-operative MRI: volumetric and dosimetric comparison to the RTOG guidelines,” Radiat. Oncol. 9, 303 (2014). 10.1186/s13014-014-0303-6
14. Mosquera-Lopez C., et al., “Computer-aided prostate cancer diagnosis from digitized histopathology: a review on texture-based systems,” IEEE Rev. Biomed. Eng. 8, 98–113 (2015). 10.1109/RBME.2014.2340401
15. Leo P., et al., “Evaluating stability of histomorphometric features across scanner and staining variations: prostate cancer diagnosis from whole slide images,” J. Med. Imaging 3(4), 047502 (2016). 10.1117/1.JMI.3.4.047502
16. Boyce B. F., “Whole slide imaging: uses and limitations for surgical pathology and teaching,” Biotech. Histochem. 90(5), 321–330 (2015). 10.3109/10520295.2015.1033463
17. Magee D., et al., “Colour normalisation in digital histopathology images,” in Proc. Opt. Tissue Image Anal. Microsc., Histopathol. and Endosc. (MICCAI Workshop) (2009).
18. Basavanhally A., Madabhushi A., “EM-based segmentation-driven color standardization of digitized histopathology,” Proc. SPIE 8676, 86760G (2013). 10.1117/12.2007173
19. Mosquera-Lopez C., Agaian S., “Iterative local color normalization using fuzzy image clustering,” Proc. SPIE 8755, 875518 (2013). 10.1117/12.2016051
20. Monaco J. P., et al., “High-throughput detection of prostate cancer in histological sections using probabilistic pairwise Markov models,” Med. Image Anal. 14(4), 617–629 (2010). 10.1016/j.media.2010.04.007
21. Rashid S., et al., “Automatic pathology of prostate cancer in whole mount slides incorporating individual gland classification,” Comput. Methods Biomech. Biomed. Eng. 7(3), 336–347 (2019). 10.1080/21681163.2018.1514280
22. Tabesh A., et al., “Multifeature prostate cancer diagnosis and Gleason grading of histological images,” IEEE Trans. Med. Imaging 26(10), 1366–1378 (2007). 10.1109/TMI.2007.898536
23. Farjam R., et al., “An image analysis approach for automatic malignancy determination of prostate pathological images,” Cytometry Part B 72(4), 227–240 (2007). 10.1002/cyto.b.20162
24. Tahir M. A., Bouridane A., “Novel Round-Robin Tabu search algorithm for prostate cancer classification and diagnosis using multispectral imagery,” IEEE Trans. Inf. Technol. Biomed. 10(4), 782–793 (2006). 10.1109/TITB.2006.879596
25. Bouatmane S., et al., “Round-Robin sequential forward selection algorithm for prostate cancer classification and diagnosis using multispectral imagery,” Mach. Vision Appl. 22(5), 865–878 (2011). 10.1007/s00138-010-0292-x
26. Sun X., et al., “Automatic diagnosis for prostate cancer using run-length matrix method,” Proc. SPIE 7260, 72603H (2009). 10.1117/12.811414
27. Yu E., et al., “Detection of prostate cancer on histopathology using color fractals and probabilistic pairwise Markov models,” in Annu. Int. Conf. IEEE Eng. Med. and Biol. Soc., pp. 3427–3430 (2011). 10.1109/IEMBS.2011.6090927
28. Gorelick L., et al., “Prostate histopathology: learning tissue component histograms for cancer detection and classification,” IEEE Trans. Med. Imaging 32(10), 1804–1818 (2013). 10.1109/TMI.2013.2265334
29. Peyret R., et al., “Automatic classification of colorectal and prostatic histologic tumor images using multiscale multispectral local binary pattern texture features and stacked generalization,” Neurocomputing 275, 83–93 (2018). 10.1016/j.neucom.2017.05.010
30. Kwak J. T., Hewitt S. M., “Nuclear architecture analysis of prostate cancer via convolutional neural networks,” IEEE Access 5, 18526–18533 (2017). 10.1109/ACCESS.2017.2747838
31. Doyle S., et al., “A boosted Bayesian multiresolution classifier for prostate cancer detection from digitized needle biopsies,” IEEE Trans. Biomed. Eng. 59(5), 1205–1218 (2012). 10.1109/TBME.2010.2053540
32. Litjens G., et al., “Automated detection of prostate cancer in digitized whole-slide images of H and E-stained biopsy specimens,” Proc. SPIE 9420, 94200B (2015). 10.1117/12.2081366
33. Litjens G., et al., “Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis,” Sci. Rep. 6, 26286 (2016). 10.1038/srep26286
34. Campanella G., et al., “Clinical-grade computational pathology using weakly supervised deep learning on whole slide images,” Nat. Med. 25(8), 1301–1309 (2019). 10.1038/s41591-019-0508-1
35. Epstein J. I., et al., “The 2014 International Society of Urological Pathology (ISUP) consensus conference on Gleason grading of prostatic carcinoma: definition of grading patterns and proposal for a new grading system,” Am. J. Surg. Pathol. 40(2), 244–252 (2016). 10.1097/pas.0000000000000530
36. DiFranco M. D., et al., “Ensemble based system for whole-slide prostate cancer probability mapping using color texture features,” Comput. Med. Imaging Graphics 35(7–8), 629–645 (2011). 10.1016/j.compmedimag.2010.12.005
37. Nguyen K., Jain A. K., Sabata B., “Prostate cancer detection: fusion of cytological and textural features,” J. Pathol. Inf. 2, 3 (2011). 10.4103/2153-3539.92030
38. Nir G., et al., “Automatic grading of prostate cancer in digitized histopathology images: learning from multiple experts,” Med. Image Anal. 50, 167–180 (2018). 10.1016/j.media.2018.09.005
39. Han W., et al., “Automatic cancer detection and localization on prostatectomy histopathology images,” Proc. SPIE 10581, 105810Q (2018). 10.1117/12.2292450
40. Haralick R. M., Shanmugam K., Dinstein I., “Textural features for image classification,” IEEE Trans. Syst. Man Cybern. SMC-3(6), 610–621 (1973). 10.1109/TSMC.1973.4309314
41. Galloway M. M., “Texture analysis using grey level run lengths,” Comput. Graphics Image Process. 4, 172–179 (1975). 10.1016/S0146-664X(75)80008-6
42. Krizhevsky A., Sutskever I., Hinton G. E., “ImageNet classification with deep convolutional neural networks,” in Adv. Neural Inf. Process. Syst., pp. 1097–1105 (2012).
43. Deng J., et al., “ImageNet: a large-scale hierarchical image database,” in IEEE Conf. Comput. Vision and Pattern Recognit., pp. 248–255 (2009). 10.1109/CVPR.2009.5206848
44. Ruifrok A. C., Johnston D. A., “Quantification of histochemical staining by color deconvolution,” Anal. Quant. Cytol. Histol. 23(4), 291–299 (2001).
45. Runge C., “Über empirische Funktionen und die Interpolation zwischen äquidistanten Ordinaten,” Z. Math. Phys. 46, 224–243 (1901).
46. Horn R. A., Johnson C. R., Matrix Analysis, Cambridge University Press, New York (2012).
47. Kingma D. P., Ba J., “Adam: a method for stochastic optimization,” arXiv:1412.6980 [cs.LG] (2014).
48. Saito T., Rehmsmeier M., “The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets,” PLoS One 10(3), e0118432 (2015). 10.1371/journal.pone.0118432
49. Khan A. M., et al., “A nonlinear mapping approach to stain normalization in digital histopathology images using image-specific color deconvolution,” IEEE Trans. Biomed. Eng. 61(6), 1729–1738 (2014). 10.1109/TBME.2014.2303294
50. Epstein J. I., Lotan T. L., “The lower urinary tract and male genital system,” in Robbins and Cotran Pathologic Basis of Disease, 9th ed., Kumar V., Abbas A. K., Fausto N., Eds., pp. 985–990, Elsevier, Philadelphia (2015).
51. Shin H. C., et al., “Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning,” IEEE Trans. Med. Imaging 35(5), 1285–1298 (2016). 10.1109/TMI.2016.2528162
