Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2009 Dec 10.
Published in final edited form as: Proc IEEE Int Symp Biomed Imaging. 2009 Jun 28;6:791–794. doi: 10.1109/ISBI.2009.5193168

MORPHOMETRIC SUBTYPING FOR A PANEL OF BREAST CANCER CELL LINES

Ju Han 1, Hang Chang 1, Gerald Fontenay 1, Nicholas J Wang 1, Joe W Gray 1, Bahram Parvin 1
PMCID: PMC2791417  NIHMSID: NIHMS125205  PMID: 20011457

Abstract

A panel of cell lines of diverse molecular background offers an improved model system for high-content screening, comparative analysis, and cell systems biology. A computational pipeline has been developed to collect images from cell-based assays, segment individual cells and colonies, represent segmented objects in a multidimensional space, and cluster them for identifying distinct subpopulations. While each segmentation strategy can vary for different imaging assays, representation and subpopulation analysis share a common thread. Application of this pipeline to a library of 41 breast cancer cell lines is demonstrated. These cell lines are grown in 2D and imaged through immunofluorescence microscopy. Subpopulations in this panel are identified and shown to correlate with previous subtyping literature that was derived from transcript data.

1. INTRODUCTION

An emerging trend in high-content screening has been to use a panel of cell lines for evaluating therapeutic responses. The panel introduces the necessary molecular diversity to collect and analyze heterogeneous responses that lead to subpopulation identification. While the current method for subtyping has been limited to transcriptome data [1], we introduce a method to identify potential subpopulations through morphological and localization properties. The main advantage of the cell-based assays is in a large number of readouts, where every individual cell can be considered a single sensor responding to the environmental perturbation. These readouts embed biological and technical variations on a single cell basis in context. The main disadvantage is the absence of multiple molecular readouts. However, from a computational perspective, a number of challenges have to be addressed, i.e., segmentation, representation, and identifying subpopulations. The method is applied to a data set derived from a panel of breast cancer cell lines. Forty-one different breast cancer cell lines are grown in 2D, stained for nuclear and proliferation state, and imaged through florescence microscopy.

We have developed a computational pipeline that quantifies cellular morphology through segmentation and multidimensional representations. Subsequently, this representation enables the identification of subpopulations among all cell lines. The pipeline consists of three major steps: (i) segmentation, (ii) morphological feature extraction, and (iii) consensus clustering of morphological features.

2. MULTIDIMENSIONAL PROFILING OF CELLULAR MORPHOLOGIES

The first step in multivariate profiling is segmentation, which separates individual nuclei from their background, and decomposes touching nuclei. Segmentation enables multivariate profiling and subsequent subtyping.

2.1. Nuclear segmentation

In a typical 2D cell culture assay that is stained for nuclear compartment, some nuclei are isolated and others are clustered together to form clumps. As a part of our operational base, we have developed a system that first detects isolated nuclei, and then decomposes a clump of nuclei based on convexity analysis [2]. This system has recently been improved with a significantly better computational profile, and is being integrated into our high-content screening platform [3]. In this implementation, segmentation of nuclear regions is realized by detecting elliptic features [4] corresponding to potential bright regions. These regions are further filtered for their intensities and morphological features.

Let the linear scale-space representation of the original image I0(x, y) at scale σ be given by:

I(x,y;σ)=I0(x,y)G(x,y;σ) (1)

where G(x, y; σ) is the Gaussian kernel with a standard deviation of σ. For simplicity, I(x, y; σ) is also denoted as I(x, y) below. At each point (x, y), the iso-intensity contour is defined by:

I(x+Δx,y+Δy)=I(x,y) (2)

where (Δx, Δy) is the displacement vector. Expanding and truncating the above equation using Taylor’s series, we have the following estimation:

12(Δx,Δy)H(x,y)(Δx,Δy)T+(Ix,Iy)(Δx,Δy)T=0 (3)

where

H(x,y)=(IxxIxyIxyIyy)

is the Hessian matrix of I(x, y). The entire image domain is divided by Equation (2) into two parts:

I(x+Δx,y+Δy)>I(x,y) (4)
I(x+Δx,y+Δy)<I(x,y) (5)

or locally

12(Δx,Δy)H(x,y)(Δx,Δy)T+(Ix,Iy)(Δx,Δy)T>0 (6)
12(Δx,Δy)H(x,y)(Δx,Δy)T+(Ix,Iy)(Δx,Δy)T<0 (7)

If H(x, y) is positive definite, then the region defined by Equation (4) is locally convex. Similarly, if H(x, y) is negative definite, then the region defined by Equation (5) is locally convex. To determine whether H(x, y) > 0, or whether H(x, y) < 0, we analyze this feature in both cases: (I) H(x, y) > 0. Then Ixx > 0, Iyy > 0, and hence Ixx + Iyy > 0, and positive Laplacian means that (x, y) is a “dark point,” i.e., a point that is darker than its neighbors; and (II) H(x, y) < 0. Then Ixx < 0, Iyy < 0, and hence Ixx + Iyy < 0, and negative Laplacian means that (x, y) is a “bright point,” i.e., a point that is brighter than its neighbors. Finally, computed masks are refined against intrinsic gradient-based features. Figure 1 shows two examples of segmentation from two fields corresponding to two different breast cancer cell lines. These images clearly indicate that (i) cellular responses are heterogeneous, and (ii) morphological features are not distinguishable to the naked eye. These samples have been fixed after 48 hours in media without any perturbation (e.g., therapeutic agents).

Fig. 1.

Fig. 1

Segmentation of two fields of cells grown in 2D shows that clumped cells can be partitioned. The secondary stain is BrdU, which labels for proliferation. The top and bottom rows correspond to morphological characterization of MCF12A and SUM185PE, which have been labeled as basal and luminal in the literature, respectively. The significance of this data is that these lines appear quite similar; however, through morphometric analysis, they can be differentiated.

2.2. Phenotypic feature extraction

Phenotypic signatures are computed from every imaging probe that labels an organelle or expression of a specific protein, and the details can be found in our previous research [5]. Regardless of the assay, our system computes three distinct groups of features corresponding to morphology, structure, and fluorescence signals from each imaging probe. For example, in the case of a marker associated with the nuclear region, morphological features of shape such as area, aspect ratio, and orientation are computed. Some of these features are computed by fitting an ellipse to the shape features for a more accurate representation. Other shape features correspond to bending energy at multiple scales, which are computed from bounding contours. Structural features correspond to textural attributes that are detected from first-, second-, and third-order derivatives of oriented Gaussian filters [5]. The fluorescence signal is then quantified at global and local scales. While global representation relies on the average signal within the organelle of interest, local representation characterizes how the fluorescence signal is spatially distributed within the nuclear mask. An example of this representation is the change in the fluorescence signal along the radial direction originating from the center of the mass. Since the texture feature vector is rather large, its dimensionality is reduced through principal component analysis (PCA) for subsequent analysis. We opted not to apply the PCA to the entire representation, since the physical meaning of the feature set will be lost during the projection operation. A total of 324 texture features are computed and, through PCA dimensionality reduction, 10 projected features that account for 99% of the total variance are retained for further analysis. Finally, computed features are normalized with a zero mean value and a variance of one across all samples.

2.3. Phenotypic clustering

The clustering of phenotypic signatures contributes to the categorization of morphological features, and the subsequent correlation analysis with the expression data. However, three issues need to be addressed: (i) objects have heterogeneous morphologies for the same cell line, and labeling can be inconsistent; (ii) the number of objects for each cell line is various; and (iii) there is no prior knowledge of the number of clusters. An important aspect of clustering has to do with validation, since most clustering methods are sensitive to the initial conditions. A proven method is consensus clustering, which is widely used for class discovery and the visualization of gene expression microarray data [6]. This iterative method is based on resampling, and is designed to partition the observed gene expression profiles into a set of exhaustive and nonoverlapping clusters. In each iteration, clustering is performed on a random subset of the data, and the consensus across repeated runs is aggregated in a matrix whose elements represent the probability for a pair of cell lines to be in the same cluster. Further visualization of the consensus matrix enables discovery and validation of the clusters in the observed samples. Our goal is to partition phenotypic fingerprints of objects associated with a panel of breast cancer cell lines into a set of exhaustive and non-overlapping clusters. The consensus clustering method is slightly modified, as follows:

  1. Select the number of clusters, n;

  2. Construct an equal number of samples from each of the cell lines through random sampling;

  3. Cluster randomly selected samples using the K-means method;

  4. Construct a probability distribution function by assigning samples for each cell line to the assigned cluster;

  5. Construct the similarity matrix whose elements are computed from the Kolmogorov-Smirnov (KS) test of the probability distributions between every pair of cell lines;

  6. Repeat steps 2-5 for a fixed number of k iterations, and compute the consensus matrix from similarity matrices;

  7. Repeat steps 1-6 for different n.

The KS test is nonparametric, makes no assumption about the distribution of the data, and outputs a p value between two distributions (e.g., pij). Each element of the similarity matrix M is represented as Mij = 1 – pij, and the final consensus matrix is constructed by averaging all similarity matrices for all k iterations. Subsequent visualization of the consensus matrix enables visual feedback for the performance of the clustering results.

3. EXPERIMENTAL RESULTS

A panel of 41 breast cancer cell lines was cultured in 2D, and samples were imaged with fluorescent microscopy at 10X. The data set includes 3,648 fluorescent images from the 41 breast cancer cell lines. This data set has produced 861,906 cells from all 41 cell lines. After segmentation and feature extraction, each cell is represented with a 34-dimensional index. In the modified consensus clustering algorithm, parameters are set at k = 100 and n = 2, 3, 4, 5.

In order to visualize clustering results, the consensus matrix is treated as a similarity matrix and reordered using hierarchical clustering. In this reordered consensus matrix, cell lines with similar morphologies are adjacent to each other; however, the map is inverted for visualization. Ideally, for a perfect consensus matrix, the displayed heat map should have crisp boundaries. These matrices are generated for a number of clusters ranging from 2 to 5, as shown in Figure 2. Consensus clustering assesses stability for the identification of potential subpopulations, and provides visual feedback as a potential component for the decision-making process. By computing a cumulative distribution from consensus matrices, the shape of cumulative distribution function (CDF) and its progression as a function of number of clusters suggest the presence of desirable subpopulations. An earlier paper by [6] evaluated this method, proposed a new measure of “concentration histogram” computed from the change in the shape of CDF, and suggested that the peak in the concentration histogram corresponds to a preferred number of clusters. The concentration histogram of Figure 3 suggests that two clusters best represent the content of subpopulations. In comparison with the transcript-based clustering results of similar data (Table 1 in [1]), the majority of cell lines in the first subpopulation in Table 1 are luminal (12 luminal, 4 basal, and 4 unknown cell lines), and the majority of cell lines in the second subpopulation are basal-like (14 luminal, 5 basal, and 2 unknown cell lines). Additionally, the majority of lines in the first and second clusters appear to be estrogen receptor (ER) positive and negative, respectively. These results indicate a strong association between subpopulations identified through transcript data and morphometric analysis. As a final step, we also examined cellular responses to CI1040, which is an anticancer drug designed to enforce cell cycle arrest in G1. Again, the same subpopulations persisted after the incubation of each line in the panel, which indicates that morphometric subpopulations remain stationary following a drug treatment.

Fig. 2.

Fig. 2

The consensus matrices for different numbers of clusters based on morphological representations are shown. A darker block indicates higher morphological similarity between two cell lines.

Fig. 3.

Fig. 3

Confidence in the number of clusters: (a) CDF for each cluster and (b) change in CDF as a function of cluster size indicates that three is the optimum number of subpopulations.

Table 1.

Two subpopulations of the the 41 breast cancer cell lines grown in 2D cell culture assay are revealed through consensus clustering. The cell lines listed on the left are mostly luminal, while the cell lines on the right are mostly basal. Luminal and BasalA/B are subpopulations in previous transcript-based clustering results [1].

Cluster 1 Cluster 2
Cell line Type Cell line Type
SUM185PE Luminal AU565 Luminal
ZR751 Luminal HCC1143 BasalA
MDAMB415 Luminal MCF12A BasalB
BT474 Luminal SUM229PE unknown
HCC70 BasalA SUM159PT BasalB
MDAMB453 Luminal BT549 BasalB
LY2 Luminal HCC38 BasalB
MDAMB175VII Luminal HBL100 BasalB
MCF7 Luminal HCC3153 BasalA
MDAMB361 Luminal MDAMB157 BasalB
ZR75B Luminal MDAMB231 BasalB
HCC1419 Unknown SUM52PE Luminal
HCC1500 BasalB HCC1937 BasalA
HCC1806 Unknown SUM1315MO2 BasalB
184A1 Unknown HCC1428 Luminal
184B5 Unknown HCC1954 BasalA
MDAMB436 BasalB T47D Luminal
MDAMB468 BasalA HS578T BasalB
CAMA1 Luminal SUM149PT BasalB
UACC812 Luminal HCC1395 Unknown
SKBR3 Luminal

4. CONCLUSION

In this paper, morphometric properties for a panel of breast cell lines were computed, and subpopulations were identified as a result of multidimensional representations of cellular features. The method has been applied to a data set of 41 cell lines grown in 2D cell culture models. We have shown that computed subpopulations correlate with an earlier analysis of transcript data. Our continued research is to couple morphometric data with array-based data (e.g., transcript, methylation), and to compute molecular predictors of each subpopulation.

Acknowledgments

THIS WORK WAS SUPPORTED BY THE DIRECTOR, OFFICE OF SCIENCE, OFFICE OF BIOLOGICAL AND ENVIRONMENTAL RESEARCH, OF THE U.S. DEPARTMENT OF ENERGY UNDER CONTRACT NO. DE-AC02-05CH11231, AND BY THE NATIONAL INSTITUTES OF HEALTH, NATIONAL CANCER INSTITUTE GRANTS P50 CA 58207 AND THE U54 CA 112970.

5. REFERENCES

  • [1].Neve R, Chin K, Fridlyand J, Yeh J, Baehner F, Fevr T, Clark L, Bayani N, Coppe J, Tong F. A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes. Cancer Cell. 2006;vol. 10(no 6):515–527. doi: 10.1016/j.ccr.2006.10.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [2].Raman S, Maxwell CA, Barcellos-Hoff MH, Parvin B. Geometric approach to segmentation and protein localization in cell culture assays. J Microscopy. 2007;vol. 225(no 1):22–30. doi: 10.1111/j.1365-2818.2007.01712.x. [DOI] [PubMed] [Google Scholar]
  • [3].Quan W, Chang H, Parvin B. A delaunay triangulation approach for segmentation clumps of nuclei. Proc IEEE Int Symp on Biomedical Imaging.2009. [Google Scholar]
  • [4].Yang Q, Parvin B. Harmonic cut and regularized centroid transform for localization of subcelular structures. IEEE Transaction on Biomedical Engineering. 2003;vol. 50(no 4):469–476. doi: 10.1109/TBME.2003.809493. [DOI] [PubMed] [Google Scholar]
  • [5].Han J, Chang H, Andarawewa K, Yaswen P, Barcellos-Hoff MH, Parvin B. Multidimensional profiling of cell surface proteins and nuclear markers. IEEE/ACM Transactions on Computational Biology and Bioinformatics. doi: 10.1109/TCBB.2008.134. in press. [DOI] [PubMed] [Google Scholar]
  • [6].Monti S, Tamayo P, Mesirov J, Golub T. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning. 2003;vol. 52(no 12):91–118. [Google Scholar]

RESOURCES