Author manuscript; available in PMC: 2020 Jul 15.
Published in final edited form as: Conf Proc IEEE Eng Med Biol Soc. 2018 Jul;2018:4644–4647. doi: 10.1109/EMBC.2018.8513157

Analysis of Cellular Feature Differences of Astrocytomas with Distinct Mutational Profiles Using Digitized Histopathology Images

Mousumi Roy 1, Fusheng Wang 2, George Teodoro 3, Jose Velazquez Vega 4, Daniel Brat 5, Jun Kong 6
PMCID: PMC7363437  NIHMSID: NIHMS1584835  PMID: 30441386

Abstract

Cellular phenotypic features derived from histopathology images are the basis of pathologic diagnosis and are thought to be related to underlying molecular profiles. Due to overwhelming cell numbers and population heterogeneity, it remains challenging to quantitatively compute and compare features of cells with distinct molecular signatures. In this study, we propose a self-reliant and efficient analysis framework that supports quantitative analysis of cellular phenotypic differences across distinct molecular groups. To demonstrate efficacy, we quantitatively analyze astrocytomas that are molecularly characterized as either Isocitrate Dehydrogenase (IDH) mutant (MUT) or wildtype (WT) using imaging data from The Cancer Genome Atlas database. Representative cell instances that are phenotypically different between these two groups are retrieved after segmentation, feature computation, data pruning, dimensionality reduction, and unsupervised clustering. Our analysis is generic and can be applied to a wide set of cell-based biomedical research.

I. INTRODUCTION

Large-scale microscopic pathology images have been used in the practice of diagnostic pathology and in the study of disease mechanisms. Cells are fundamental pathology objects, as they capture rich information on disease characteristics that have been used for diagnostic purposes for over a century. Cell size and shape are controlled by many factors including the genomic, biochemical, and metabolic status, signaling networks engaged, physical properties of the plasma membrane, and underlying cytoskeletal properties [1]. Therefore, quantitative morphometric analysis may provide information on the genetic or molecular properties of cell populations. Although the prognostic significance of subjective cell features is well known in some diseases, it is often challenging to manually segment and quantitatively analyze cell population features due to their overwhelmingly large numbers in histologic sections. As a result, numerous automated image analysis methods for cell analytics have been proposed [2], [3], [4]. However, there is a lack of complete, self-reliant, modularized processing functions to facilitate the study of cellular phenotypic differences in distinct disease states. To address this, we present a complete and automated cellular feature analysis framework that includes cell segmentation, feature computation, and feature analytics for distinguishing cell populations with distinct molecular signatures.

As a driving use case for our study, we quantitatively analyze tumor cells from Grade III astrocytomas that are molecularly characterized as either Isocitrate Dehydrogenase (IDH)-mutant (MUT) or wildtype (WT). IDH mutations in infiltrating gliomas identify biologically distinct disease subsets with substantially younger age at presentation, slower clinical progression, and longer overall survival compared to those that are IDH wildtype. IDH mutations represent an early, and likely initiating, event that is present in more than 80% of grades II and III astrocytomas and secondary glioblastomas [5]. These genetically and clinically distinct forms of astrocytoma were only recently recognized and their morphologic differences have not been described, providing an excellent test case for framework development.

II. METHOD

Our cell analysis workflow consists of a sequence of steps, including tumor region annotation, tumor region image extraction, image stain normalization, stain color deconvolution, cell segmentation, cellular feature computation, and feature comparison analysis with representative cell retrieval.

A. Tumor Cell Segmentation

The overall schema for cell segmentation is presented in Figure 1. As not all tissue regions in a whole-slide microscopy image are tumor regions of interest, neuropathologists manually select Regions of Interest (ROIs) from each slide [6]. The resulting annotations are captured and exported as XML files. Next, our analysis module programmatically reads the XML files and uses the OpenSlide API to retrieve the annotated image regions from the whole-slide images [7].

Fig. 1:

Overall schema of cell analysis is presented.

To begin the cell segmentation analysis, we normalize the colors of all images to mitigate the impact of stain variation on the subsequent segmentation. To minimize color channel correlation, we convert each image from the RGB to the LAB color space, where each channel is mapped to the mean and standard deviation of the corresponding channel of a target image [8]. The transformed color channels are then mapped back to RGB space [9]. All slides are stained with Hematoxylin and Eosin (H&E), rendering nuclei in purple and cytoplasm in pink. To isolate the signals useful for segmentation, we decouple these two distinct stains from the original color image by the Lambert-Beer law [10]. As nuclei are highlighted by Hematoxylin, we next process the deconvolved Hematoxylin image channel for further segmentation.
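The Lambert-Beer stain separation above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the stain vectors below are the commonly quoted Ruifrok-Johnston H&E values [10] and are an assumption, since the paper does not list its stain matrix.

```python
import numpy as np

# H&E stain vectors (rows: Hematoxylin, Eosin, residual); values are the
# commonly quoted Ruifrok-Johnston ones -- an assumption for this sketch.
STAINS = np.array([
    [0.650, 0.704, 0.286],   # Hematoxylin
    [0.072, 0.990, 0.105],   # Eosin
    [0.268, 0.570, 0.776],   # residual
])
STAINS = STAINS / np.linalg.norm(STAINS, axis=1, keepdims=True)

def deconvolve_he(rgb):
    """Separate an RGB H&E image into per-stain density maps via the
    Lambert-Beer law: optical density is linear in stain concentration."""
    od = -np.log(np.clip(rgb.astype(float), 1, 255) / 255.0)  # optical density
    conc = od.reshape(-1, 3) @ np.linalg.inv(STAINS)
    return conc.reshape(rgb.shape)  # channel 0 is the Hematoxylin density

# Sanity check: a pixel synthesized from the pure Hematoxylin vector should
# load almost entirely on the Hematoxylin channel.
pure_h = np.round(255.0 * np.exp(-STAINS[0])).reshape(1, 1, 3)
dens = deconvolve_he(pure_h)
```

The deconvolved Hematoxylin channel (`dens[..., 0]`) is what the segmentation steps below would operate on.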

Exploring the resulting Hematoxylin image channel, we find that pixels tend to have gradually decreasing Hematoxylin stain signal intensity as they approach the nuclear centroid. To enhance signal contrast for more accurate segmentation, we determine the likelihood that a given pixel lies in a cell region by analyzing the eigenvalues of the Hessian matrix computed over a local image neighborhood [11]. Given the prior knowledge about cell shape, we search for circular structures in the Hematoxylin channel and compute the likelihood of cellular pixels based on geometric structures characterized by the neighboring pixel intensity profiles. For an arbitrary pixel at (x0, y0), its local image intensity change can be represented by the Taylor expansion:

hσ(x0 + δx, y0 + δy) = hσ(x0, y0) + (δx, δy) D(hσ)(x0, y0) + (δx, δy) D²(hσ)(x0, y0) (δx, δy)^T + O(δ³)   (1)

where hσ is the convolution of the deconvolved Hematoxylin image channel h with a Gaussian filter Gσ of standard deviation σ; D²(hσ) is the Hessian of hσ, which is symmetric and thus diagonalizable, with two resulting eigenvalues λi, i = 1, 2. Due to the Hematoxylin channel property, it is straightforward to have 0 ≪ λ1 ≤ λ2 for pixels within cells. To improve the intensity contrast between cells and background, we use the following cell enhancement function [11]: f(hσ(x, y), α, β) = (1 − exp(−(λ1/λ2)² / (2α²))) · (1 − exp(−(λ1² + λ2²) / (2β²))), where α and β are sensitivity parameters for the two product terms. As cell size is variable, we take the maximum response at the optimal scale σ*. The resulting enhanced likelihood map is further processed with hysteresis thresholding [12]. In the post-processing step, we exclude candidates that are either too small or too bright to be cells, fill internal holes in the remaining candidates, and finally smooth each cellular boundary with a low-pass filter.
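A numpy-only sketch of the Hessian-eigenvalue enhancement is given below. It is illustrative only: it omits the Gaussian scale space and multi-scale maximum described above, approximates the Hessian with finite differences, and assumes the ratio form (λ1/λ2)² of the first product term, which follows the Frangi-style filter [11] the paper adapts.

```python
import numpy as np

def hessian_blob_response(img, alpha=0.5, beta=0.5, eps=1e-12):
    """Blob likelihood from Hessian eigenvalues (single-scale sketch of the
    Sec. II-A enhancement function; finite differences stand in for Gσ)."""
    hy, hx = np.gradient(img)
    hxx = np.gradient(hx, axis=1)
    hxy = np.gradient(hx, axis=0)
    hyy = np.gradient(hy, axis=0)
    # Eigenvalues of the symmetric 2x2 Hessian: mean +/- discriminant.
    tr = (hxx + hyy) / 2.0
    disc = np.sqrt(((hxx - hyy) / 2.0) ** 2 + hxy ** 2)
    l1, l2 = tr - disc, tr + disc            # l1 <= l2
    ratio2 = (l1 / (l2 + eps)) ** 2          # isotropy ("blobness") term
    s2 = l1 ** 2 + l2 ** 2                   # structure-strength term
    f = (1 - np.exp(-ratio2 / (2 * alpha ** 2))) * \
        (1 - np.exp(-s2 / (2 * beta ** 2)))
    f[l2 <= 0] = 0.0   # keep only dark-blob responses (positive eigenvalues)
    return f

# Synthetic "Hematoxylin channel": one dark Gaussian blob on bright background.
yy, xx = np.mgrid[0:41, 0:41]
img = 1.0 - np.exp(-(((yy - 20) ** 2 + (xx - 20) ** 2) / (2 * 4.0 ** 2)))
resp = hessian_blob_response(img)
```

At the blob center both eigenvalues are large and positive, so the response is high; in the flat background it vanishes, which is what hysteresis thresholding then exploits.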

B. Feature Computation

To remove erroneous results generated by machine analysis, the resulting cellular boundaries are manually corrected by domain experts before they are used for feature computation. Manual corrections included annotations for mitotic figures, apoptotic nuclei, corpora amylacea, neurons, and endothelial cells. Abnormally segmented cells, such as clumped cells and partial cells on image edges, are removed. After this validation and correction process by human experts, we proceed with cellular feature analysis by computing quantitative features related to cell size, shape, intensity, and hyper-chromaticity. Specifically, for each segmented cell, Area, Perimeter, Max Distance, and Equivalent Diameter are computed to represent cell size; Eccentricity, Circularity, and Extent are derived to describe cell shape; Intensity Standard Deviation and Intensity Entropy are used to characterize hyper-chromaticity; Intensity Mean is used to represent cell pixel intensity.
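A subset of these per-cell features can be computed directly from a binary mask and the intensity image, as in the numpy-only sketch below. The feature definitions (Extent as area over bounding-box area, Max Distance as the largest pairwise pixel distance, entropy over a 256-bin histogram) are standard conventions assumed for illustration, not necessarily the paper's exact formulas.

```python
import numpy as np

def cell_features(mask, intensity):
    """Compute a few of the Sec. II-B features for one segmented cell.
    `mask` is a boolean segmentation; `intensity` the grayscale image."""
    ys, xs = np.nonzero(mask)
    area = int(mask.sum())
    equiv_diameter = np.sqrt(4.0 * area / np.pi)
    # Extent: cell area divided by its bounding-box area.
    extent = area / ((np.ptp(ys) + 1) * (np.ptp(xs) + 1))
    # Max Distance: largest pairwise distance between cell pixels.
    pts = np.stack([ys, xs], axis=1).astype(float)
    d = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
    vals = intensity[mask]
    hist = np.bincount(vals.astype(int), minlength=256) / vals.size
    entropy = -np.sum(hist[hist > 0] * np.log2(hist[hist > 0]))
    return dict(Area=area, EquivalentDiameter=equiv_diameter, Extent=extent,
                MaxDistance=d.max(), IntensityMean=vals.mean(),
                IntensityStd=vals.std(), IntensityEntropy=entropy)

# Toy example: a 5x5 square "cell" of uniform intensity 100.
mask = np.zeros((9, 9), bool); mask[2:7, 2:7] = True
img = np.full((9, 9), 100.0)
feats = cell_features(mask, img)
```

For a uniform square, Extent is exactly 1 and the intensity entropy is 0, which makes the toy case easy to verify by hand.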

C. Feature Analysis

To visually perceive cellular feature differences, we project data onto a lower-dimensional subspace with Multi-Dimensional Scaling (MDS) [13] after feature normalization.

MDS is a non-linear dimensionality reduction method that recovers the underlying dimensions of data from the pairwise similarities or dissimilarities between data points. Given the dissimilarity matrix Δ = [δi,j = ∥yi − yj∥] derived from the original feature space of dimensionality p, our goal is to find {xi ∈ R^d} with d < p such that ∥xi − xj∥ ≈ δi,j, where xi ∈ R^d is the low-dimensional representation of the original data point yi ∈ R^p. MDS finds this embedding by minimizing the stress cost function: S = Σ_{i≠j} (δi,j − ∥xi − xj∥)². For visual assessment, we project data to a lower-dimensional space (i.e., d = 3) with MDS and present the result in Figure 2 (Left), where data from MUT and WT are observed to overlap severely. As each image contains a large number of cells from multiple cell populations with large phenotypic and biological variation, substantial noise is embedded in the data representation, as suggested by this nonlinear projection. Therefore, it is more promising in this scenario to find true cell phenotypic feature signatures by data pruning than by exhaustively formulating additional cell feature representations. Following this idea, we measure the discriminating power of cells in an ensemble classification manner and robustly find representative instances that are distinctive across the two groups.
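For intuition, the embedding step can be sketched with classical (Torgerson) MDS, which solves the Euclidean case in closed form via double centering. This is a stand-in for the stress-minimizing MDS of [13], chosen here because it is a few lines of numpy; for Euclidean dissimilarities the two agree.

```python
import numpy as np

def classical_mds(delta, d=3):
    """Embed a dissimilarity matrix `delta` into R^d by classical MDS:
    double-center the squared dissimilarities, then take the top-d
    eigenpairs of the resulting Gram matrix."""
    n = delta.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (delta ** 2) @ J            # double-centered Gram matrix
    w, v = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:d]              # top-d eigenpairs
    return v[:, idx] * np.sqrt(np.clip(w[idx], 0, None))

# Toy check: 10-D feature vectors that truly live on a 3-D subspace
# embed into d = 3 with all pairwise distances preserved.
rng = np.random.default_rng(0)
y = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 10))
delta = np.sqrt(((y[:, None] - y[None]) ** 2).sum(-1))
x = classical_mds(delta, d=3)
dx = np.sqrt(((x[:, None] - x[None]) ** 2).sum(-1))
```

When, as in the paper's data, the features do not lie on a clean low-dimensional subspace, the approximation error shows up as the overlap seen in Figure 2 (Left).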

Fig. 2:

MDS-derived scatter plots (Left) before and (Right) after data pruning are demonstrated.

1). Data Pruning:

We prune data with a mixed collection of generative and discriminative classifiers at this stage: Logistic Regression (LR), Random Forest (RF), AdaBoost (AB), Naive Bayes (NB), Quadratic Discriminant Analysis (QDA), and a Neural Network (NN) [15], [18]. Generative models describe the distribution of each individual class: after modeling p(y) and p(x∣y), they derive the posterior probability of class label y given observation x using Bayes' rule: arg max_y p(y∣x) = arg max_y p(x∣y) p(y). By contrast,

Algorithm 1: Algorithm for representative cell retrieval

Input: D: data feature matrix; N: number of rows in D; C: cluster number
Output: Pi: cell panel for cluster i of each group
1:  for all groups G ∈ {0, 1} do
2:    D0 ← D[G]; count ← 0
3:    for all c ∈ {0, 1, …, C} do
4:      for all i ∈ {0, 1, …, N − 1} do
5:        count[c] ← count[c] + 1
6:      c̄ ← avg(D0, count)            ▷ c̄: cluster centroid
7:      for all i ∈ {0, 1, …, N − 1} do
8:        δ[i] ← euclidDist(c̄, D0[i])
9:      append(D0, δ)
10:     D0s ← sort(D0, δ)
11:     ▷ Find the wmax and hmax of the MBRs of the first 100 cells
12:     wmax ← 0; hmax ← 0
13:     for all j ∈ {1, 2, …, 100} do
14:       xmin ← min(NB[0]); ymin ← min(NB[1])
15:       xmax ← max(NB[0]); ymax ← max(NB[1])
16:       w ← xmax − xmin; h ← ymax − ymin
17:       wmax ← max(w, wmax); hmax ← max(h, hmax)
18:     wfinal ← wmax + wmargin; hfinal ← hmax + hmargin
19:     for all j ∈ {1, 2, …, 100} do
20:       xmin ← min(NB[0]); ymin ← min(NB[1])
21:       NMBR[j] ← crop(I, [xmin, ymin, wfinal, hfinal])
22:     for all j ∈ {1, 2, …, 100} do
23:       Pi ← image panel generated by adding NMBR[j]

discriminative methods predict the class label y directly from the training example x by evaluating f(x) = arg max_y p(y∣x).

We use ensemble classifiers to vote for discriminating cell instances in a robust way [16]. We train all of these classifiers and classify the data into two groups: MUT and WT. Five-fold cross validation is used to mitigate overfitting on the training data. We prune data iteratively over a few successive steps. First, instances correctly classified by at least one of the above classifiers are kept. In the next iteration, we train the classifiers on the remaining data and keep only those instances correctly classified by at least two of the updated classifiers. We increment the number of classifiers required to produce correct classification results at each step. Following this method, the data retained in the last iteration are correctly classified by at least five classifiers. We stop increasing the number of required correct classifiers at this point, as the classification accuracies of all classifiers saturate by this iteration. In this way, the confusing instances that are misclassified by the classifiers are removed.
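The iterative pruning schedule above can be sketched as follows. For self-containment this sketch substitutes two toy classifiers (a nearest-centroid rule and a decision stump) for the paper's six, and uses a small hand-made dataset with one injected label-noise point; the paper additionally uses five-fold cross validation, which is omitted here.

```python
import numpy as np

def centroid_clf(X, y):
    """Toy stand-in classifier: nearest class centroid."""
    c0, c1 = X[y == 0].mean(0), X[y == 1].mean(0)
    return lambda Z: (np.linalg.norm(Z - c1, axis=1) <
                      np.linalg.norm(Z - c0, axis=1)).astype(int)

def stump_clf(X, y, j=0):
    """Toy stand-in classifier: decision stump on feature j at pooled mean."""
    t = X[:, j].mean()
    flip = X[y == 1, j].mean() < t
    return lambda Z: ((Z[:, j] < t) if flip else (Z[:, j] >= t)).astype(int)

def prune(X, y, makers, max_votes):
    """Round k retrains all classifiers on the surviving data and keeps
    only instances correctly classified by at least k of them."""
    keep = np.ones(len(y), bool)
    for k in range(1, max_votes + 1):
        Xk, yk = X[keep], y[keep]
        votes = sum((m(Xk, yk)(Xk) == yk).astype(int) for m in makers)
        idx = np.flatnonzero(keep)
        keep[idx[votes < k]] = False
    return keep

# Two well-separated clusters; point 0 is deliberately mislabeled.
base = np.array([[0.0, 0.0], [0.2, 0.1], [-0.1, 0.3], [0.1, -0.2],
                 [0.3, 0.2], [-0.2, -0.1], [0.2, 0.3], [0.0, 0.2]])
X = np.vstack([base, base + 3.0])
y = np.array([0] * 8 + [1] * 8)
y[0] = 1                                   # inject label noise
kept = prune(X, y, [centroid_clf, stump_clf], max_votes=2)
```

The mislabeled point is misclassified by both classifiers and is dropped in the first round, while every clean instance survives, mirroring how the full ensemble isolates confusing cells.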

2). Data Clustering Analysis:

With the retained data, we next analyze the hidden structure of groups MUT and WT and find the most representative cell instances of each group. Specifically, we recover the structure of the unlabeled data of each group and partition them into K sets by K-means clustering [17]. We vary K from 1 to 10 and, for each value, compute the Sum of Squared Errors (SSE). Overall, SSE decreases as K increases. We aim to choose as small a K as possible that still produces a sufficiently low SSE. The "elbow" point represents the value beyond which increasing the cluster number does not reduce SSE by much. We detect the "elbow" point where the curve begins to flatten significantly (i.e., the percentage decrease of SSE falls below 15% in our case) [19]. Following this approach, we find K = 3 and K = 4 as the optimal cluster numbers for the MUT and WT groups, respectively. Clustering analysis for each group is then performed with its optimal K value.
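The elbow selection can be sketched as below. Two assumptions are made for the sketch: a deterministic farthest-first seeding for K-means (the paper uses standard K-means [17]), and the 15% SSE-decrease criterion measured relative to the K = 1 SSE, which is one reasonable reading of the criterion above.

```python
import numpy as np

def lloyd_kmeans(X, k, iters=50):
    """Plain Lloyd's K-means with deterministic farthest-first seeding."""
    centers = [X[0]]
    for _ in range(1, k):                     # farthest-first initialization
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    C = np.array(centers)
    for _ in range(iters):
        lab = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        C = np.array([X[lab == j].mean(0) for j in range(k)])
    return lab, ((X - C[lab]) ** 2).sum()     # labels and SSE

def elbow(sse, thresh=0.15):
    """First K after which the SSE drop falls below `thresh` of the
    K = 1 SSE (an assumed normalization of the <15% criterion)."""
    for k in range(1, len(sse)):
        if (sse[k - 1] - sse[k]) / sse[0] < thresh:
            return k
    return len(sse)

# Three well-separated toy clusters: SSE should stop improving after K = 3.
off = np.array([[0.1, 0.0], [-0.1, 0.1], [0.0, -0.1], [0.1, 0.1], [-0.1, -0.1]])
X = np.vstack([off, off + [10, 0], off + [0, 10]])
sse = [lloyd_kmeans(X, k)[1] for k in range(1, 6)]
```

On this data the SSE curve drops sharply until K = 3 and then flattens, so the elbow rule recovers the true cluster count.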

3). Retrieval of Representative Cells:

With the data clusters from each group, representative cells are found with Algorithm 1, which computes the centroid of each cluster and identifies the 100 most representative cells based on their Euclidean distances to the cluster centroid. In Algorithm 1, variable δ represents a distance vector that captures the Euclidean distances from data points to the cluster centroid; I is the image for cell analysis; D0 and NB capture the cellular features and boundary coordinates, respectively. Representative cells capturing group-difference signatures are extracted from the corresponding images {I} by fitting minimum bounding rectangles from {NB} and are assembled into a 10 × 10 cellular panel for each group, respectively.
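The retrieval in Algorithm 1 can be condensed for one cluster as below. For a self-contained example the sketch uses a tiny synthetic image, 4 cells on a 2 × 2 grid instead of 100 on 10 × 10, and hypothetical per-cell boundary arrays of (row, col) points; the logic (nearest-to-centroid ranking, common bounding-box size plus margin, tiling) follows the algorithm.

```python
import numpy as np

def representative_panel(img, feats, boundaries, n=4, grid=(2, 2), margin=1):
    """Pick the n cells nearest the feature centroid, crop each with a
    common-size bounding box, and tile them into a grid panel."""
    centroid = feats.mean(axis=0)
    order = np.argsort(np.linalg.norm(feats - centroid, axis=1))[:n]
    # Common crop size: the largest minimum bounding rectangle plus a margin.
    sizes = [(np.ptp(boundaries[i][:, 0]) + 1,
              np.ptp(boundaries[i][:, 1]) + 1) for i in order]
    h = max(s[0] for s in sizes) + margin
    w = max(s[1] for s in sizes) + margin
    panel = np.zeros((grid[0] * h, grid[1] * w), img.dtype)
    for slot, i in enumerate(order):
        y0, x0 = boundaries[i][:, 0].min(), boundaries[i][:, 1].min()
        crop = img[y0:y0 + h, x0:x0 + w]
        r, c = divmod(slot, grid[1])
        panel[r * h:r * h + crop.shape[0], c * w:c * w + crop.shape[1]] = crop
    return panel

# Toy data: four rectangular "cells" with made-up boundary corner points.
img = np.arange(400).reshape(20, 20)
boundaries = [np.array([[2, 2], [2, 5], [6, 2], [6, 5]]),
              np.array([[10, 10], [10, 13], [13, 10], [13, 13]]),
              np.array([[1, 12], [1, 15], [4, 12], [4, 15]]),
              np.array([[12, 2], [12, 4], [15, 2], [15, 4]])]
feats = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [1.0, 1.2]])
panel = representative_panel(img, feats, boundaries, n=4, grid=(2, 2))
```

The top-left slot of the panel holds the crop of the cell whose features are closest to the cluster centroid, exactly as the paper's 10 × 10 panels are assembled.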

III. EXPERIMENTAL RESULTS

As a driving use case, we test the developed analysis pipeline with histopathology images of astrocytoma tissue (a malignant brain tumor) from The Cancer Genome Atlas database. This dataset includes 200 images (1024 × 1024 pixels) of tumor regions from 50 patients, divided equally between two molecular groups: Isocitrate Dehydrogenase (IDH) mutant (MUT) and wildtype (WT). In aggregate, 50,588 nuclei are automatically analyzed for feature computation, with 26,871 and 23,717 from MUT and WT, respectively.

To visualize data separability, we project data onto a lower, three-dimensional feature space with MDS and produce scatter plots of the data before and after pruning in Figure 2. The mapped feature data from the two groups are highly clumped before data pruning, whereas the post-pruning data present substantially improved separability between the two groups. In Figure 4, we further visualize the QDA decision boundary for the projected data from the two groups. The mode of each group's distribution is indicated by a confidence ellipse, with blue and red regions for MUT and WT, respectively. The dataset after the data pruning process includes 20,653 nuclei instances (i.e., about 41% of the original data), of which 8,257 and 12,396 cells are from groups MUT and WT, respectively.

Fig. 4:

Decision boundaries from quadratic discriminant analysis (Left) before and (Right) after data pruning are presented.

Table I presents the mean and standard deviation of the most discriminating features of cells between MUT and WT after data pruning. After statistical analysis and a feature histogram study, we find that Area and Perimeter are the most discriminating cellular features between groups MUT and WT.

TABLE I:

Mean and standard deviation of most discriminating cellular features for distinguishing cells in group MUT from WT are presented.

Group | Area | Perimeter | Mean Intensity | Max Distance
MUT | 448.28 ± 162.88 | 76.36 ± 14.96 | 52.50 ± 16.00 | 28.33 ± 6.16
WT | 198.97 ± 82.12 | 54.09 ± 13.61 | 71.63 ± 24.29 | 22.09 ± 6.43

To visualize representative cells that carry discriminating information from each group, we cluster each group with the K-means algorithm using the optimal cluster number. We present the scatter plot for each group in Figure 3. Relatively compact, well-separated clusters are formed for each group. Additionally, cells from the same cluster have very similar feature representations, while those from different clusters present distinct feature signatures.

Fig. 3:

Scatter plots of clustered data from (Left) MUT and (Right) WT group are presented.

After clustering the data of each group, we retrieve the 100 most representative cells based on Euclidean distance to each cluster centroid. A minimum bounding rectangle is found for each cell. The resulting cell image regions are extracted from the associated images and assembled into a 10 × 10 image panel. We repeat this process for each cluster of each group. In this way, we facilitate the visualization of representative cells that capture discriminating information for differentiating cell populations with distinct genetic or molecular properties. In Figure 5, we present such cell panels for one randomly selected cluster each from MUT and WT. From these panels, we observe that cells from MUT are more circular in shape and larger in size than those from the WT group. Additionally, cells from group MUT tend to have lower intensity than those from WT. The substantial difference in cellular phenotypic features related to shape, size, and intensity between the two groups suggests the efficacy of our complete analysis pipeline for comparing cell phenotypic features across large-scale, molecularly distinct cell populations from histopathology images.

Fig. 5:

Two representative panels of retrieved cells from (Left) MUT and (Right) WT group are presented.

IV. CONCLUSIONS

We present a complete workflow that facilitates quantitative histopathology imaging investigations comparing phenotypic features of cells from different molecular groups. The developed analysis framework consists of image color normalization, stain deconvolution, cell segmentation, feature computation, and feature comparison analysis with representative cell retrieval. As our driving use case, we test our method on histopathology microscopy images of astrocytoma brain tumors from The Cancer Genome Atlas database. With the derived cellular features, we identify and retrieve representative cell instances that are phenotypically different between the Isocitrate Dehydrogenase (IDH)-mutant (MUT) and wildtype (WT) groups by data pruning, dimensionality reduction, and unsupervised clustering. Our analysis is generic and applicable to a wide set of cell-based biomedical research.

Acknowledgments

Funded by NIH K25CA181503, NSF ACI 1443054 and IIS 1350885, and CNPq; The studies involving human subjects were approved by the Emory University IRB.

Contributor Information

Mousumi Roy, Stony Brook University, Dept. of Computer Science, Stony Brook, NY 11794.

Fusheng Wang, Stony Brook University, Dept. of Computer Science, Stony Brook, NY 11794.

George Teodoro, University of Brasília, Dept. of Computer Science, Brasília, DF, Brazil.

Jose Velazquez Vega, Emory University, Dept. of Biomedical Informatics, Atlanta, GA 30322.

Daniel Brat, Northwestern University, Dept. of Pathology, Chicago, IL 60611.

Jun Kong, Emory University, Dept. of Biomedical Informatics, Atlanta, GA 30322.

References

  • [1] Rangamani P, Lipshtat A, Azeloglu EU, Calizo RC, Hu M, Ghassemi S, Hone J, Scarlata S, Neves SR, Iyengar R, "Decoding Information in Cell Shape," Cell, vol. 154(6), pp. 1356–1369, 2013.
  • [2] Veta M, Kornegoor R, Huisman A, Anoek HJ, et al., "Prognostic value of automatically extracted nuclear morphometric features in whole slide images of male breast cancer," Modern Pathology, vol. 25, pp. 1559–1565, 2012.
  • [3] Xing FY, Yang L, "Robust Nucleus/Cell Detection and Segmentation in Digital Pathology and Microscopy Images: A Comprehensive Review," IEEE Reviews in Biomedical Engineering, vol. 9, pp. 234–263, 2016.
  • [4] Fathi A, Jamela B, Rabia M, Hussein H, Abdelbaset B, Yrjo C, "Correlation of Nuclear Morphometry of Breast Cancer in Histological Sections with Clinicopathological Features and Prognosis," Anticancer Research, vol. 29, no. 5, pp. 1771–1776, May 2009.
  • [5] Ohgaki H, Kleihues P, "The definition of primary and secondary glioblastoma," Clinical Cancer Research, vol. 19(4), pp. 764–772, 2013.
  • [6] Aperio ImageScope, http://www.leicabiosystems.com
  • [7] Goode A, Gilbert B, Harkes J, Jukic D, Satyanarayanan M, "OpenSlide: A Vendor-Neutral Software Foundation for Digital Pathology," Journal of Pathology Informatics, pp. 4–27, 2013.
  • [8] Magee D, Treanor D, Crellin D, Shires M, Smith K, Mohee K, Quirke P, "Colour Normalisation in Digital Histopathology Images," Optical Tissue Image Analysis in Microscopy, Histopathology and Endoscopy, pp. 100–111, 2009.
  • [9] Reinhard E, Ashikhmin M, Gooch B, Shirley P, "Color Transfer between Images," IEEE Computer Graphics and Applications, 21(5), pp. 34–41, 2001.
  • [10] Ruifrok AC, Johnston DA, "Quantification of histochemical staining by color deconvolution," Anal Quant Cytol Histol, 23(4), pp. 291–299, 2001.
  • [11] Frangi AF, Niessen WJ, Vincken KL, Viergever MA, "Multiscale vessel enhancement filtering," Medical Image Computing and Computer-Assisted Intervention, Lecture Notes in Computer Science, vol. 1496, pp. 130–137, 1998.
  • [12] Hancock ER, Kittler J, "Adaptive estimation of hysteresis thresholds," Proceedings IEEE CVPR, pp. 196–201, 1991.
  • [13] Borg I, Groenen P, Modern Multidimensional Scaling: Theory and Applications (2nd ed.), New York: Springer-Verlag, pp. 207–212, 2005.
  • [14] Wickelmaier F, "An Introduction to MDS," Sound Quality Research Unit, Aalborg University, Denmark, May 4, 2003.
  • [15] Peng CYJ, Lee KL, Ingersoll GM, "An Introduction to Logistic Regression Analysis and Reporting," Journal of Educational Research, 96(1), pp. 3–14, 2002.
  • [16] Rokach L, "Ensemble-based classifiers," Artificial Intelligence Review, 33(1–2), pp. 1–39, 2010.
  • [17] Celebi ME, Kingravi HA, Vela PA, "A comparative study of efficient initialization methods for the k-means clustering algorithm," Expert Systems with Applications, 40(1), pp. 200–210, 2013.
  • [18] Kegl B, "The return of AdaBoost.MH: multi-class Hamming trees," arXiv:1312.6086, 2013.
  • [19] Kodinariya TM, Makwana PR, "Review on determining number of cluster in K-means clustering," International Journal of Advance Research in Computer Science and Management Studies, 1(6), pp. 90–95, 2013.
