Abstract
Recent advances in tissue microarray technology have allowed immunohistochemistry to become a powerful medium-to-high throughput analysis tool, particularly for the validation of diagnostic and prognostic biomarkers. However, as study size grows, the manual evaluation of these assays becomes a prohibitive limitation; it vastly reduces throughput and greatly increases variability and expense. We propose an algorithm—Tissue Array Co-Occurrence Matrix Analysis (TACOMA)—for quantifying cellular phenotypes based on textural regularity summarized by local inter-pixel relationships. The algorithm can be easily trained for any staining pattern, has no sensitive tuning parameters, and can report the salient pixels in an image that contribute to its score. Pathologists' input via informative training patches is an important aspect of the algorithm and allows training for any specific marker or cell type. With co-training, the error rate of TACOMA can be reduced substantially for a very small training sample (e.g., of size 30). We give theoretical insights into the success of co-training via thinning of the feature set in a high-dimensional setting when there is "sufficient" redundancy among the features. TACOMA is flexible, transparent and provides a scoring process that can be evaluated with clarity and confidence. In a study based on an estrogen receptor (ER) marker, we show that TACOMA is comparable to, or outperforms, pathologists in terms of accuracy and repeatability.
1 Introduction
Tissue microarray (TMA) technology was first described by Wan et al [50] in 1987 and substantially improved by Kononen et al [34] in 1998 as a high-throughput technology for the assessment of protein expression in tissue samples. As shown in the top panel of Figure 1, the construction of a TMA begins with cylindrical cores extracted from a donor block of formalin-fixed and paraffin-embedded tissues. The cores are transferred to the grid of the recipient block. This grid is generated by punching cylindrical holes at equal distances into a precast rectangular mold of solid paraffin wax. Once all the holes are filled with donor cores, the block is heated to fuse the cores to the wax of the block. Recipient blocks normally contain 360 to 480 tissue cores from donor blocks, often as triplicate samples from each block, and are thus called tissue microarrays (TMAs). They are sectioned transversely and each section is captured on a glass slide, such that slides display a cross-section of each core in a grid-like fashion. More than 100 slides can be generated from each TMA block for analysis with a separate probe. This standardizes the hybridization process of the probe across hundreds of tissue samples. The use of TMAs in cancer biology has increased dramatically in recent years [9, 23, 27, 47] for the rapid evaluation of DNA, RNA and protein expression on large numbers of clinical tissue samples; they remain the most efficient method for validating proteomics data and tissue biomarkers. We limit our discussion to high-density immunohistochemistry (IHC) staining, a method used for the measurement of protein expression and the most common method for subcellular localization on a per-cell basis.
Figure 1. Illustration of the TMA technology and example TMA images.
The top panel shows the steps involved in producing a TMA image. The bottom panel displays example TMA images, with the numbers on the left indicating the score of the images in the same row. The top panel is reprinted from Kononen et al. (1998) by courtesy of the Nature Publishing Group.
The evaluation of protein expression requires the quantification, or scoring, of a TMA image. The scores can be used for the validation of biomarkers, assessment of therapeutic targets, analysis of clinical outcomes, etc. [27]. The bottom panel of Figure 1 gives an example of several TMA images with scores assigned on a 4-point scale (see Section 4 for details).
Although the construction of TMAs has been automated for large-scale interrogation of markers in tissue samples, several factors limit the use of the TMA as a high-throughput assay. These include the variability, subjectivity and time-intensive effort inherent in the visual scoring of staining patterns [9, 48]. Indeed, a pathologist’s score relies on subjective judgments about colors, textures, intensities, densities and spatial relationships. As noted in [23], however, the human eye cannot provide an objective quantification that can be normalized to a reference. In general, problems stemming from the subjective and inconsistent scoring by pathologists are well known and have been highlighted by several studies [3, 4, 19, 33, 44, 49]. Thus, as study size grows, the value of TMAs in a rigorous statistical analysis may actually decrease without a consistent and objective scoring process.
These concerns have motivated the recent development of a variety of tools for automated scoring, ranging from sophisticated image-enhancement tools and tissue segmentation to computer-assisted pathologist-based scoring. Many are focused on a particular cellular pattern, with HER2 (exhibiting membranous staining) being the most commonly targeted marker; see, e.g., [25, 31, 36, 42, 43]. For a survey of commercial systems, we refer to [37] or [41], and also the review by Cregger et al. [16], which acknowledges that, given the rapid changes in this field, this information may become outdated as devices are abandoned, improved or newly developed. A property of most automated TMA scoring algorithms is that they rely on various forms of background subtraction, feature segmentation and thresholds for pixel intensity. Tuning of these algorithms can be difficult and may result in models sensitive to several variables including staining quality, background antibody binding, counterstain intensity, and the color and hue of chromogenic reaction products used to detect antibody binding. Moreover, such algorithms typically require tuning by the vendors with parameters specific to the marker's staining pattern (e.g., nuclear versus cytoplasmic), or even a dedicated person to operate the system.
To address the further need for scoring of TMAs in large biomarker studies, we propose a framework—called Tissue Array Co-Occurrence Matrix Analysis (TACOMA)—that is trainable for any staining pattern or tissue type. By seeking texture-based patterns that are invariant across the images, TACOMA does not rely on intensity thresholds, color filters, image segmentation or shape recognition. It recognizes specific staining patterns based on expert input via a preliminary set of image patches. In addition to providing a score or categorization, TACOMA allows one to see which pixels in an image contribute to its score. This clearly enhances interpretability and confidence in the results.
It should be noted that TACOMA is not designed for clinical diagnosis; rather, it is a tool for use in large clinical studies that involve a range of potential biomarkers. Since many thousands of samples may be required, the cost and time required for pathologist-based scoring may be prohibitive, and so an efficient automated alternative to human scoring can be essential. TACOMA is a framework for such a purpose.
An important concern in biomedical studies is that of the limited training sample size1. The size of the training set may necessarily be small due to the cost, time or human effort required to obtain labeled examples. We adopt co-training [53, 6] in the context of TACOMA to substantially reduce the training sample size. We explore the thinning of the feature set for co-training when a 'natural' split is not readily available but the features are fairly redundant; this is supported by our theory that a thinned slice carries about the same classification power as the whole feature set under some conditions.
The organization of the remainder of this paper is as follows. We describe the TACOMA algorithm in Section 2; this is followed by a discussion of co-training to reduce the training sample size, with some theoretical insights on the thinning scheme, in Section 3. Then, in Section 4, we present our experimental results. We conclude with a discussion in Section 5.
2 The TACOMA algorithm
The primary challenge TACOMA addresses is the lack of easily quantified criteria for scoring: features of interest are not localized in position or size. Moreover, within any region of relevance—one containing primarily cancer cells—there is no well-defined (quantifiable) shape that characterizes a pattern of staining (see, for example, the bottom panel of Figure 1 for an illustration). The key insight that underlies TACOMA is that, despite the heterogeneity of TMA images, they exhibit strong statistical regularity in the form of visually observable textures or staining patterns (see, for example, Figure 2(b)). And, with the guidance of pathologists, TACOMA can be trained for this pattern regardless of the cancer cell type (breast, prostate, etc.) or marker type (e.g., nuclear, cytoplasmic, etc.).
Figure 2. Example images and their GLCMs.
(a) Generating the GLCM from an image. This toy "image" (left) has 3 gray levels, {Dark, Grey, White}. Here, under the spatial relationship (↗, 1), the transition from Grey to White (indicated by ↗) occurs three times; accordingly, the entry of the GLCM corresponding to the Grey row and White column has a value of 3. (b) Example TMA images. Images of tissue samples (left panel) and the heatmaps (right panel) of their GLCMs (on a log scale). The GLCM matrices are all 51 × 51; each GLCM cell in the heatmap indicates the frequency of the corresponding transition. The color scale is illustrated by the color bar on the right.
TACOMA captures the texture patterns exhibited by TMA images through a matrix of counting statistics, the Gray Level Co-occurrence Matrix (GLCM). Through a small number of representative image patches, TACOMA constructs a feature mask so that the algorithm focuses on biologically relevant features (i.e., a subset of GLCM entries). Besides scoring, TACOMA also reports salient image pixels (i.e., those that contribute to the scoring of an image), which are useful for training, the comparison of multiple TMA images, the estimation of staining intensity, etc. In the rest of this section, we briefly discuss these individual building blocks, followed by an algorithmic description of TACOMA.
2.1 The gray level co-occurrence matrix
The GLCM was originally proposed by Haralick [26] and has proven successful in a variety of remote-sensing applications [51]. The GLCM of an image is a matrix whose entries count the frequency of transitions between pixel intensities across neighboring pixels with a particular spatial relationship; see Figure 2. The description here is essentially adapted from [51]. We start by defining the spatial relationship between a pair of pixels in an image.
Definition
A spatial relationship has two elements, the direction and the distance of interaction. The set of all possible spatial relationships is defined as

$$\{(d, \ell) : d \in D, \ \ell \in L\},$$

where D is the set of potential directions and L is the set of possible distances of interaction between the pair of pixels involved in a spatial relationship. The distance of interaction is the minimal number of steps required to move from one pixel to the other along a given direction. The particular spatial relationships used in our application are (↗, 3), (↘, 1) and (↗, 1). Details about the choice of these spatial relationships can be found in Section 4.
Although the definition of spatial relationships can be extended to involve more pixels [51], we have focused on pairwise relationships which appear to be sufficient. Next we define the GLCM.
Definition
Let Ng be the number of gray levels in an image2. For a given image (or a patch) and a fixed spatial relationship ~, the GLCM is defined as

the Ng × Ng matrix whose (a, b)-entry counts the number of pairs of pixels, with gray values a and b, respectively, having the spatial relationship ~, for a, b ∈ {1, 2, …, Ng}.
This definition is illustrated in Figure 2(a) with more realistic examples in Figure 2(b). Figure 2(b) gives a clear indication as to how the GLCM distinguishes between TMA images having different staining patterns.
Our use of the GLCM is nonstandard in that we do not use any of the common scalar-valued summaries of a GLCM (see [26] and [13]), but instead employ the entire matrix (with masking) in a classification algorithm (see also [51]). A GLCM may have a large number of entries, typically thousands; however, the exceptional capability of Random Forests [7] in feature selection allows us to directly use all (or a masked subset of) the GLCM entries to determine a final score or classification.
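To make the counting concrete, the following is a minimal sketch in R (the language used for our experiments in Section 4). It is illustrative only and not the authors' implementation; the helper names glcm and quantize, as well as the row/column-offset encoding of the direction, are assumptions made here.

```r
# Minimal sketch: compute the GLCM of a gray-scale image stored as an integer
# matrix `img` with values in 1..Ng, for a spatial relationship given by a
# (row, column) offset direction and a distance of interaction.
glcm <- function(img, direction = c(-1, 1), distance = 1, Ng = max(img)) {
  nr <- nrow(img); nc <- ncol(img)
  dr <- direction[1] * distance
  dc <- direction[2] * distance
  G <- matrix(0L, Ng, Ng)
  for (i in seq_len(nr)) {
    for (j in seq_len(nc)) {
      i2 <- i + dr; j2 <- j + dc
      if (i2 >= 1 && i2 <= nr && j2 >= 1 && j2 <= nc) {
        a <- img[i, j]; b <- img[i2, j2]
        G[a, b] <- G[a, b] + 1L      # one more (a, b) transition observed
      }
    }
  }
  G
}

# Uniform quantization of 256 gray levels down to Ng levels (cf. Section 4);
# img256 is assumed to be an integer matrix with gray levels in 1..256.
quantize <- function(img256, Ng = 51) {
  pmin(Ng, floor((img256 - 1) * Ng / 256) + 1)
}
```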
2.2 Image patches for domain knowledge
In order to incorporate prior knowledge about the staining pattern, we mask the GLCM so that the scoring focuses on biologically pertinent features. This is realized by first choosing a set of image patches from regions that consist predominantly of cancer cells and represent the staining patterns of interest; see Figure 3. The collection of GLCMs from these patches is then used to define a template of "significant entries" (cf. the TACOMA algorithm in Section 2.5) for all future GLCMs: when the GLCM of a new image is formed, only the entries that correspond to this template are retained. This masking step enforces the idea that features used in a classifier should not be based on stromal, arterial or other non-pertinent tissue, which may exhibit non-specific or background staining. Note that only one small set of image patches is required; these image patches are used to produce a common feature mask that is used throughout for all images in both training and scoring.
Figure 3. Representative image patches and the induced feature mask.
Four pathologist-chosen patches (left panel) and the feature mask as determined by all patches (right panel, see algorithmic description of TACOMA). Non-white entries in this matrix indicate the corresponding GLCM entries to be used in scoring. Note that one and only one feature mask is required throughout.
In this manner, feature selection is initiated by expert biological knowledge. This involves little human effort but gains substantially in both interpretability and accuracy. The underlying philosophy is that no machine learning algorithm beats domain knowledge. Since, by using image patches, we do not indicate which individual features to select but instead specify their desired effect, we achieve the benefits of manual feature selection while avoiding its difficulty. This is a novel form of nonparametric, or implicit, feature selection that is applicable to settings beyond TMAs.
2.3 Random Forests
TACOMA uses Random Forests (RF) [7] as the underlying classifier. RF was proposed by Breiman and is considered one of the best classifiers in high-dimensional settings [10]. In our experience, RF achieves significantly better performance than SVM and Boosting on the TMA images we use (see Table 1). Additionally, Holmes et al [29] argue that RF is superior to other methods in dealing with tissue images. The fundamental building block of RF is a tree-based classifier, which can be unstable and sensitive to noise. RF takes advantage of such instability and creates a strong ensemble by bagging a large number of trees [7]. Each individual tree is grown on a bootstrap sample from the training set. For the splitting of tree nodes, RF randomly selects a number of candidate features, or linear combinations of features, and splits the node with the one that achieves the greatest reduction in node impurity, as measured by the Gini index (other measures, such as the out-of-bag (oob) estimate of generalization error, can also be used). The Gini index is defined as
$$\text{Gini}(p) = \sum_{i=1}^{C} p_i (1 - p_i) = 1 - \sum_{i=1}^{C} p_i^2, \qquad (1)$$
where p = (p1, …, pC) denotes the proportions of examples from the different classes and C is the number of classes. RF grows each tree to maximum depth and no pruning is required. For an illustration of RF, see Figure 4.
Table 1.
Performance comparison of RF, SVM, and Boosting of a naive Bayes classifier. The result for RF is taken from Section 4.1. For SVM, we vary the choice of kernel over {Gaussian, polynomial, sigmoid} with the best tuning parameters for Ng ∈ {7, 9, 13, 26, 37, 51, 64, 85}, and the best result is reported. For Boosting, the best result is reported over Ng ∈ {7, 9, 13, 26, 37, 51, 64, 85} and the number of boosting iterations in {1, 2, 3, 5, 10, 50, 100}.
| Classifier | Accuracy |
|---|---|
| RF | 78.57% |
| SVM | 65.24% |
| Boosting | 61.28% |
Figure 4. Random Forests classification.
In this illustration the data points reside in a unit square (left panel). The two classes are indicated by red and blue dots. The true decision boundary is the diagonal line shown. RF (center panel) grows many trees. Each tree corresponds to a recursive partition of the data space. These partitions are represented in the right panel by a sequence of horizontal and vertical lines; the data space shown here is overlaid with many such partitions. The RF classifier eventually leads to a decision boundary (solid black curve) for this two-class classification problem.
To classify a future example X, X is dropped down each tree and receives a vote for the class of the terminal node it reaches in that tree. The final class membership of X is obtained by a majority vote over the classes. The features are ranked by their respective reductions of node impurity as measured by the Gini index. Alternatives include the permutation-based measure, that is, permuting the variables one at a time and ranking them according to the respective decrease in accuracy (as estimated on the oob observations).
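As a concrete illustration of this workflow, here is a hedged sketch using the R package 'randomForest' (the package used in Section 4); X_train, Y_train and X_test are hypothetical names for the masked GLCM feature matrices and their scores.

```r
library(randomForest)

# Train an RF on masked GLCM features; the parameter values shown are the
# package defaults or common choices, not necessarily those used in the paper.
rf <- randomForest(x = X_train, y = factor(Y_train),
                   ntree = 500,                        # number of trees in the ensemble
                   mtry  = floor(sqrt(ncol(X_train))), # features tried at each split
                   importance = TRUE)

# Scoring by majority vote over the trees.
pred  <- predict(rf, X_test)                 # predicted class labels
votes <- predict(rf, X_test, type = "vote")  # per-class vote fractions

# Gini-based variable importance used to rank the GLCM entries.
imp <- importance(rf, type = 2)              # MeanDecreaseGini
head(sort(imp[, 1], decreasing = TRUE))
```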
2.4 Salient pixels detection
A valuable property of TACOMA is its ability to report salient pixels in an image that determine its score (see Figure 8). This is based on a correspondence between the position of pixels in an image and entries in its GLCM, and made possible by the remarkable variable-ranking capability of RF. Here we use the importance measure (Gini index-based) provided by RF to rank the variables (i.e., entries of the GLCM) and then collect relevant image pixels associated with the important entries.
Figure 8. The salient pixels (highlighted in white).
The left panel displays the top features (indices of GLCM entries) from the classifier, where the x-axis and y-axis indicate the row and column of the GLCM entries. The middle and right panels display images having scores 3 and 0, respectively; the pixels highlighted in white are those that correspond to the GLCM entries shown in the left panel. Note that highlighted pixels are notably absent in the right panel. For visualization, only parts of the images are shown (see Appendix for larger images).
Since each entry of a GLCM is a counting statistic involving pairs of pixels, we can associate the (a, b)-entry of a GLCM with the pixels that contribute to it. The set of image pixels associated with the (a, b)-entry of a GLCM is formally represented as

$$S_{a,b} = \{(x, y) : x \sim y, \ I(x) = a, \ I(y) = b\},$$

where x and y represent the positions of image pixels and we treat an image I as a map from the position of a pixel to its gray value. Note that not all pairs of pixels x ~ y with I(x) = a and I(y) = b correspond to salient spots in a TMA image. However, if the (a, b)-feature is 'important' (e.g., as determined by RF), then typically most pixels in this set are relevant.
Whereas RF appears to be a black box—taking a large number of GLCM features and producing a score—salient pixels provide a quick peek into its internals. Effectively, RF works in roughly the same manner as a pathologist: both use salient pixels to score the TMA images; the seemingly mysterious image features are merely a representation that a computer algorithm can understand.
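A minimal sketch of this pixel-level mapping is given below; the function name and the default spatial relationship are illustrative assumptions, and the image is assumed to use the same quantized gray levels as its GLCM.

```r
# Minimal sketch: mark the pixels associated with an "important" GLCM entry
# (a, b) under the spatial relationship (direction, distance).
salient_pixels <- function(img, a, b, direction = c(-1, 1), distance = 3) {
  nr <- nrow(img); nc <- ncol(img)
  dr <- direction[1] * distance; dc <- direction[2] * distance
  mask <- matrix(FALSE, nr, nc)
  for (i in seq_len(nr)) {
    for (j in seq_len(nc)) {
      i2 <- i + dr; j2 <- j + dc
      if (i2 >= 1 && i2 <= nr && j2 >= 1 && j2 <= nc &&
          img[i, j] == a && img[i2, j2] == b) {
        mask[i, j] <- TRUE; mask[i2, j2] <- TRUE   # both pixels of the pair
      }
    }
  }
  mask   # TRUE entries can be painted white for display
}
```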
2.5 An algorithmic description of TACOMA
Denote the training sample by (I1, Y1), …, (In, Yn) where Ii’s are images and Yi’s are scores. Additionally, assume there is a set of L ‘representative’ image patches. The training of TACOMA is described as Algorithm 1.
In the above description, τi is chosen as the median of entries of matrix Zi for i = 1, …, L. Then, for a new image, TACOMA will: (i) derive the GLCM matrix; (ii) select the entries with indices in M; (iii) apply the trained classifier on the selected entries and output the score. The training and scoring with TACOMA are illustrated in Figure 5.
Algorithm 1.
The training in TACOMA
1: for i = 1 to L do
2:   compute the GLCM for the ith image patch and denote it by Zi;
3:   Mi ← the index set of entries of Zi that survive thresholding at level τi;
4: end for
5: M ← M1 ∪ M2 ∪ … ∪ ML;
6: for k = 1 to n do
7:   compute the GLCM of image Ik and keep only the entries in index set M;
8:   denote the resulting matrix by Xk;
9: end for
10: Feed {(Xk, Yk) : k = 1, …, n} into the RF classifier and obtain a classification rule.
Figure 5. TACOMA illustrated.
The left and right panels illustrate, respectively, model training and the use of the model on future data.
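For concreteness, the following is a compact R sketch of Algorithm 1 and the subsequent scoring step, under stated assumptions: patches, images and scores are hypothetical variables, glcm() is the helper sketched in Section 2.1, and the median threshold and the union over patch masks follow the description above.

```r
# Sketch of TACOMA training (Algorithm 1) and scoring; illustrative only.
train_tacoma <- function(patches, images, scores, Ng = 51,
                         direction = c(-1, 1), distance = 3) {
  # Steps 1-5: build the common feature mask M from the patch GLCMs,
  # thresholding each patch GLCM at the median of its entries.
  mask <- Reduce(`|`, lapply(patches, function(P) {
    Z <- glcm(P, direction, distance, Ng)
    Z >= median(Z)
  }))
  M <- which(mask)                       # index set M of retained GLCM entries

  # Steps 6-9: masked GLCM features for the training images.
  X <- t(sapply(images, function(I) as.vector(glcm(I, direction, distance, Ng))[M]))
  colnames(X) <- paste0("f", M)          # name features by their GLCM index

  # Step 10: feed (X_k, Y_k) into Random Forests.
  rf <- randomForest::randomForest(x = X, y = factor(scores))
  list(model = rf, mask = M)
}

score_tacoma <- function(fit, image, Ng = 51, direction = c(-1, 1), distance = 3) {
  x <- as.vector(glcm(image, direction, distance, Ng))[fit$mask]
  newX <- matrix(x, nrow = 1, dimnames = list(NULL, paste0("f", fit$mask)))
  predict(fit$model, newX)               # the TACOMA score for the new image
}
```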
3 Co-training with RF
The sample size is an important issue in the scoring of TMA images, mainly because of the high cost and human effort involved in obtaining a large sample of high-quality labels. For instance, it may take several hours for a well-trained pathologist to score 100 TMA images. Unfortunately, classification performance often drops sharply when the training sample size is reduced. For example, Figure 6 shows the error rate of TACOMA as the sample size varies. Our aim is to achieve reasonable accuracy for small sample sizes, and co-training is adopted for this purpose.
Figure 6. Error rate of TACOMA as the training sample size varies.
There are 328 TMA images in the test sample.
Co-training was proposed in the landmark papers by Yarowsky [53] and Blum and Mitchell [6]. It is an effective way to train a model with an extremely small labeled sample and has been successfully applied in many applications [6, 38]. The idea is to train two separate classifiers (called coupling classifiers), each on a different set of features, using a small number of labeled examples. The two classifiers then iteratively transfer their confidently classified examples, along with the assigned labels, to the labeled set. This process is repeated until all unlabeled examples have been labeled. For an illustration of the idea of co-training, see Figure 7. Co-training is relevant here due to the natural redundancy that exists among features based on GLCMs corresponding to different spatial relationships.
Figure 7.
An illustration of co-training.
A learning mode that is closely related to co-training is self-learning [38], where a single classifier is used in the 'label → transfer → label' loop (cf. Figure 7). However, empirical studies have shown that co-training is often superior [38]; the intuition is that co-training allows the two coupling classifiers to progressively expand each other's 'knowledge boundary', which is absent in self-learning.
Previous works on co-training almost exclusively use Expectation-Maximization or naive Bayes based classifiers, where the posterior probability serves as the "confidence" required by co-training. Here we use RF [7], where the margin (defined shortly) provided by RF serves as a natural proxy for this "confidence". The margin is defined through the votes received by an observation. For an observation x in the test set, let the number of votes it receives for the ith class be denoted by Ni(x), i = 1, …, C, where C is the number of classes. The margin of x is defined as

$$m(x) = \max_{1 \le i \le C} N_i(x) \;-\; \operatorname{second}_{1 \le i \le C} N_i(x),$$

where "second" indicates the second-largest element in a list.
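In code, the margin is a one-liner; the sketch below assumes the votes are supplied as a numeric vector.

```r
# Minimal sketch: the margin of an observation computed from its RF votes,
# i.e., the gap between the largest and second-largest vote counts.
rf_margin <- function(votes) {     # votes: numeric vector (N_1(x), ..., N_C(x))
  s <- sort(votes, decreasing = TRUE)
  s[1] - s[2]
}
```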
To give an algorithmic description of co-training, let the two subsets of features be denoted by $\mathcal{F}_1$ and $\mathcal{F}_2$, respectively, and let the sets of labeled and unlabeled examples be denoted by $\mathcal{L}$ and $\mathcal{U}$, respectively. The co-training process proceeds as in Algorithm 2. The final error rate and class membership are determined by a fixed coupling classifier, say f1. We set m1 = m2 = 2 in our experiments, following [6].
3.1 Feature split for co-training
Co-training requires two subsets of features (or a feature split). However, co-training algorithms rarely provide a recipe for obtaining these feature splits. There are several possibilities one can explore.
The first is called a "natural" split, often resulting from an understanding of the problem structure.
Algorithm 2.
The co-training algorithm
1: while the unlabeled set $\mathcal{U}$ is not empty do
2:   for k = 1, 2 do
3:     train RF classifier fk on the labeled examples in $\mathcal{L}$ using feature set $\mathcal{F}_k$;
4:     classify the examples in $\mathcal{U}$ with fk;
5:     under fk, calculate the margin for each observation in $\mathcal{U}$;
6:     pick the mk observations, denoted by Tk, with the largest margins;
7:   end for
8:   $\mathcal{L}$ ← $\mathcal{L}$ ∪ T1 ∪ T2, with each transferred example keeping the label assigned by its coupling classifier;
9:   $\mathcal{U}$ ← $\mathcal{U}$ \ (T1 ∪ T2);
10: end while
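The following R sketch mirrors Algorithm 2 under stated assumptions: X is the combined feature matrix, y a character vector of scores with NA marking unlabeled examples, F1 and F2 the two feature subsets, and m the number of examples each coupling classifier transfers per round. It is illustrative rather than the authors' implementation.

```r
library(randomForest)

cotrain_rf <- function(X, y, F1, F2, m = 2) {
  feats <- list(F1, F2)
  labeled   <- which(!is.na(y))
  unlabeled <- which(is.na(y))
  while (length(unlabeled) > 0) {
    for (k in 1:2) {
      if (length(unlabeled) == 0) break
      fk <- randomForest(x = X[labeled, feats[[k]], drop = FALSE],
                         y = factor(y[labeled]))
      v  <- predict(fk, X[unlabeled, feats[[k]], drop = FALSE], type = "vote")
      marg <- apply(v, 1, function(p) { s <- sort(p, decreasing = TRUE); s[1] - s[2] })
      pos  <- order(marg, decreasing = TRUE)[seq_len(min(m, length(unlabeled)))]
      pick <- unlabeled[pos]
      # transfer the most confidently classified examples, with their labels
      y[pick]   <- colnames(v)[apply(v[pos, , drop = FALSE], 1, which.max)]
      labeled   <- c(labeled, pick)
      unlabeled <- setdiff(unlabeled, pick)
    }
  }
  # final classifier, e.g. the first coupling classifier refit on all labels
  randomForest(x = X[, F1, drop = FALSE], y = factor(y))
}
```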
A rule of thumb as to what constitutes a natural split is that each of the two feature subsets alone allows one to construct an acceptable classifier and that the two subsets somehow complement each other (e.g., conditional independence given the labels). Fortunately, TMA images represented by GLCMs naturally have such properties. For a given problem, there often exist several spatial relationships (for example, (↗, 3) and (↘, 1) for the TMA images studied in this work), each inducing a GLCM sufficient to construct a classifier, while the "dependence" among the induced GLCMs is usually low. Thus it is ideal to apply co-training on TMA images using such natural splits.
When there is no natural split readily available, one has to find two proper subsets of features. One way is via random splitting. Co-training via a random split of features was initially considered by Nigam and Ghani [38] but has since been largely overlooked in the machine learning literature. Here we extend the idea of random splits to "thinning", which is more flexible and may potentially lead to better co-training performance. Specifically, rather than randomly splitting the original feature set $\mathcal{F}$ into two halves, we select two disjoint subsets of $\mathcal{F}$ with sizes not necessarily equal but non-vanishing compared to p. This generally leads to feature subsets smaller than the halves of a random split, hence the name "thinning". One concrete implementation is to divide $\mathcal{F}$ into a number of, say J, equal-sized partitions (each partition is also called a thinned slice of $\mathcal{F}$); a code sketch is given below. In the following discussion, unless otherwise stated, thinning always refers to this concrete implementation, which clearly includes random splits as a special case. Thinning allows one to construct a self-learning classifier (the features are taken from one of the J partitions), co-training (randomly pick 2 out of the J partitions), and so on. For a given problem, one can explore various alternatives associated with thinning, but here we shall focus on co-training.
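A minimal sketch of this thinning step follows; the function name is ours, and the partition sizes differ by at most one.

```r
# Minimal sketch: thinning -- randomly partition the p features into J
# disjoint, roughly equal-sized slices; J = 2 reduces to a random split.
thin_features <- function(p, J = 2) {
  split(sample(seq_len(p)), rep(seq_len(J), length.out = p))
}

# Example: two thinned slices used as F1 and F2 in the co-training sketch above.
# slices <- thin_features(ncol(X), J = 2); F1 <- slices[[1]]; F2 <- slices[[2]]
```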
The extension of random splits to thinning may lead to improved co-training performance, as thinning may make features from different partitions less dependent while preserving the classification power well in a high-dimensional setting when there is sufficient redundancy among features (see Section 3.2). The optimal number of partitions can be selected by heuristics such as the kernel independence test [2, 24], which we leave for future work.
3.2 Some theoretical insights on thinning
According to Blum and Mitchell [6], one essential ingredient of co-training is the "high" confidence of the two coupling classifiers in labeling the unlabeled examples. This is closely related to the strength of the two coupling classifiers, which is in turn determined by the feature subsets involved. In this section, we study how much a thinned slice of the feature set preserves its classification power. This provides insight into the nature of thinning and is interesting in its own right due to its close connection to several lines of interesting work [28, 17] in machine learning (see Section 3.2.2). We present our theoretical analysis in Section 3.2.1 and list related work in Section 3.2.2. In Appendix 6.3, we provide additional simulation results related to our theoretical analysis.
3.2.1 Thinning “preserves” the ratio of separation
Our theoretical model is the Gaussian mixture specified as

$$X \mid \Pi \sim N(\mu_\Pi, \Sigma), \qquad (2)$$

where Π ∈ {0, 1} indicates the label of an observation and N(μ, Σ) stands for the Gaussian distribution with mean μ and covariance matrix Σ. For simplicity, we consider equal class priors, P(Π = 0) = P(Π = 1) = 1/2, and the 0-1 loss. We will define the ratio of separation as a measure of the fraction of "information" carried by the subset of features obtained by thinning with respect to that of the original feature set, and show that this quantity is "preserved" upon thinning. For simplicity, we take J = 2 (i.e., a random split of $\mathcal{F}$); a similar discussion applies to J > 2.
Let the feature set $\mathcal{F}$ be decomposed as

$$\mathcal{F} = \mathcal{F}_1 \cup \mathcal{F}_2, \qquad \mathcal{F}_1 \cap \mathcal{F}_2 = \emptyset. \qquad (3)$$

We will show that each of the two subsets of features, $\mathcal{F}_1$ and $\mathcal{F}_2$, carries a substantial fraction of the "information" contained in the original data when p is large, assuming the data are generated from the Gaussian mixture (2).
A quantity that is crucial in our inquiry is

$$\delta_{\mathcal{F}}^2 = u^T \Sigma^{-1} u, \qquad (4)$$

where u = μ1 − μ0 is the difference between the two mixture centers. We call $\delta_{\mathcal{F}}^2$ the separation of the Gaussian mixture induced by the feature set $\mathcal{F}$. The separation is closely related to the Bayes error rate for classification through a well-known result in multivariate statistics.
Lemma 1 ([1])
For Gaussian mixture (2) and 0-1 loss, the Bayes error rate is given by $\Phi(-\delta_{\mathcal{F}}/2)$, where $\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\, dt$ is the standard normal distribution function.
Let the covariance matrix Σ be written in block form as

$$\Sigma = \begin{pmatrix} A & B \\ B^T & C \end{pmatrix},$$

where we assume block A corresponds to the features in $\mathcal{F}_1$ after a permutation of rows and columns. Accordingly, write u as $u = (u_1, u_2)^T$ and define $\delta_{\mathcal{F}_1}^2 = u_1^T A^{-1} u_1$ (called the separation induced by $\mathcal{F}_1$) similarly to (4). Now we can define the ratio of separation for the feature subset $\mathcal{F}_1$ as

$$\gamma = \frac{\delta_{\mathcal{F}_1}^2}{\delta_{\mathcal{F}}^2}. \qquad (5)$$
To see why definition (5) is useful, we give a numerical example. Consider a Gaussian mixture defined by (2) in which Σ is a 100 × 100 tri-diagonal matrix with diagonal entries all equal to 1 and off-diagonal entries equal to 0.6. Suppose one picks the first 50 variables and forms a new Gaussian mixture with covariance matrix A and mixture center distance u1; we wish to see how much the Bayes error rate is affected. In this example γ = 0.5084, and the difference between the feature set $\mathcal{F}$ and the subset $\mathcal{F}_1$ is very small in terms of classification power. In general, if the dimension is sufficiently high and γ is non-vanishing, then using a subset of features will not incur much loss in classification power. In Theorem 2, we will show that, under certain conditions, γ does not vanish (i.e., γ > c for some positive constant c), so a feature subset is about as good as the whole feature set in terms of classification power.
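The computation behind (5) and Lemma 1 can be made explicit with the short R sketch below; it is illustrative only, and the exact Σ and u of the numerical example above are not reproduced here.

```r
# Minimal sketch (assumed, not the authors' code): the ratio of separation (5)
# for a feature subset `idx` of the Gaussian mixture (2) with mean difference u
# and covariance Sigma, together with the Bayes error rate of Lemma 1.
ratio_of_separation <- function(Sigma, u, idx) {
  delta2_full <- drop(t(u) %*% solve(Sigma) %*% u)                      # delta^2 on all features
  delta2_sub  <- drop(t(u[idx]) %*% solve(Sigma[idx, idx]) %*% u[idx])  # delta^2 on the subset
  delta2_sub / delta2_full
}

bayes_error <- function(Sigma, u) {
  pnorm(-sqrt(drop(t(u) %*% solve(Sigma) %*% u)) / 2)                   # Phi(-delta/2)
}
```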
Our main assumption is a technical one, related to the "local" dependency among the components of u after a variable transformation; the exact context will become clear in the proof of Theorem 2. For now, let Σ have a Cholesky decomposition Σ = HHᵀ for some lower triangular matrix H. A variable transformation of the form y = H⁻¹u will be introduced. The idea is that we desire Σ = HHᵀ to possess a structure such that the components of y are "locally" dependent, so that some form of law of large numbers may be applied. To avoid technical details in the main text, we discuss the assumption in the appendix.
Our main result is the following theorem.
Theorem 2
Assume the data are generated from Gaussian mixture (2). Further assume the smallest eigenvalue of Σ⁻¹, denoted by λmin(Σ⁻¹), is bounded away from 0 under permutations of rows and columns of Σ. Then, under the assumptions of Lemma 3, the separation induced by the feature set $\mathcal{F}_1$ satisfies

$$\frac{\delta_{\mathcal{F}_1}^2}{\delta_{\mathcal{F}}^2} \;\ge\; \Big(\frac{1}{2}\Big)^{-}$$

in probability as p → ∞, where (a)⁻ indicates any constant smaller than a. When the number of partitions J > 2, the right-hand side is replaced by 1/J.
Proof
See appendix for proof.
3.2.2 Related work
There are mainly two lines of work closely related to ours. One is the Johnson-Lindenstrauss lemma and related [30, 17]. The Johnson-Lindenstrauss (or J-L) lemma states that, for Gaussian mixtures in high-dimensional space, upon a random projection to a low-dimensional subspace, the separation between the mixture centers in the projected space is “comparable” to that in the original space with high probability. The difference is that the random projection in J-L is carried out via a nontrivial linear transformation and the separation is defined in terms of the Euclidean distance whereas, in our work, random projection is performed coordinate-wise in the original space and we define the separation with the Mahalanobis distance.
The other is the random subspace method [28], an early ensemble method closely related to RF that is comparable to bagging and AdaBoost in terms of empirical performance. The random subspace method grows each tree-based classifier on a randomly selected half of the features. However, beyond simulations there has been no formal justification for the random selection of half of the features; our result provides support in this respect. In a high-dimensional setting where the features are "redundant", our result shows that a randomly selected half of the features can lead to a tree comparable, in terms of classification power, to a classifier that uses all the features; meanwhile, the random nature of the feature set used in each tree keeps the correlation between trees small, so good performance can be expected.
Our theoretical result, when used in co-training, can be viewed as a manifestation of the "blessings of dimensionality" [20]. For high-dimensional data analysis, the conventional wisdom is to perform dimension reduction or projection pursuit. As a result, the "redundancy" among the features is typically not used and, in many cases, even becomes the nuisance one strives to get rid of. This is clearly a waste. When the "redundancy" among features is complementary, it actually allows one to construct two coupling learners from which co-training can be carried out. It should be emphasized that the splitting of the feature set works because of this redundancy. We believe the exploration of this type of redundancy will have important impact in high-dimensional data analysis.
4 Applications on TMA images
To assess the performance of TACOMA, we evaluate a collection of TMA images from the Stanford Tissue Microarray Database, or STMAD (see [35] and http://tma.stanford.edu/). TMAs corresponding to the potential expression of the estrogen receptor (ER) protein in breast cancer tissue are used, since ER is a histologically well-studied marker that is expressed in the cell nucleus. Example TMA images can be seen in Figure 2. There are 641 TMA images in this set and each image has been assigned a score from {0, 1, 2, 3}. The scoring criteria are: '0' representing a definite negative (no staining of cancer cells), '3' a definite positive (a majority of cancer cells show dark nuclear staining) and '2' positive (a minority of cancer cells show nuclear staining or a majority show weak nuclear staining). The score '1' indicates ambiguous weak staining in a minority of cancer cells. The class distribution of the scores is (65.90%, 2.90%, 7.00%, 24.20%). Such an unbalanced class distribution makes the scoring task even more challenging. In our experiments, we assess accuracy by the proportion of images in the test sample that receive the same score from a scoring algorithm as that given by the reference set (e.g., STMAD). We split the images into a training and a test set of sizes 313 and 328, respectively. In the following, we first describe our choice of various design options and parameters, and then report the performance of TACOMA on a large training set (313 TMA images) in Section 4.1 and on a small training set (30 TMA images) with co-training in Section 4.2.
We use three spatial relationships, (↗, 3), (↘, 1) and (↗, 1), in our experiments. In particular, (↗, 3) is used in our experiment on TACOMA with the large training set, while (↘, 1) is used along with (↗, 3) in our co-training experiments and (↗, 1) in the additional experiments (see Table 3). Although (↗, 1) is often the default choice in applications, we use (↗, 3) here to reflect the granularity of the visual pattern seen in the images. Indeed, as the staining patterns in TMA images for the ER marker occur only in the nucleus, (↗, 3) leads to slightly better scoring performance than (↘, 1) according to our experiments on TACOMA; moreover, no significant difference is observed for TACOMA when simply concatenating the features derived from (↗, 3) and (↘, 1). (↘, 1) is used in co-training in the hope that its features are less correlated with those derived from (↗, 3) than other choices, as these two relationships are along orthogonal directions.
Table 3.
Accuracy of TACOMA on TMA images corresponding to the protein markers CD117, CD34, NMB and ER. Except for ER (which has a fixed training and test set), we use (↗, 1) and 80% of the instances for training and the rest for testing; this is repeated for 100 runs and the results are averaged.
| Marker | Staining | #Instances | Accuracy |
|---|---|---|---|
| ER | Nucleus | 641 | 78.57% |
| CD117 | Cell surface | 1063 | 81.08% |
| NMB | Cytoplasmic | 1036 | 84.17% |
| CD34 | Cytoplasmic and cell surface | 908 | 76.44% |
For a good balance of computational efficiency, discriminative power and ease of implementation, we take Ng = 51 (our experiments are not particularly sensitive to the choice of Ng in the range of 40 to 60) and apply uniform quantization over the 256 gray levels in our application. One can, of course, explore this further, but we would not expect a substantial gain in performance given the limited ability of a computer algorithm (or even the human eye) to distinguish subtle differences in gray levels with a moderate sample size.
We use the R package 'randomForest'3 in this work. There are two important parameters, the number of trees in the ensemble and the number of features to explore at each node split. These are searched over {50, 100, 200, 500} and over values around the package default of roughly $\sqrt{p}$ (for p the number of features fed to RF), respectively, and the best test-set error rates are reported. More information on RF can be found in [7].
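A hedged sketch of this parameter search is given below; the exact mtry grid is an assumption, centered on the package default of roughly √p, and X_train, Y_train, X_test, Y_test are hypothetical names.

```r
# Minimal sketch of the tuning described above; illustrative only.
p <- ncol(X_train)
grid <- expand.grid(ntree = c(50, 100, 200, 500),
                    mtry  = unique(pmax(1, round(c(0.5, 1, 2) * sqrt(p)))))
errs <- apply(grid, 1, function(g) {
  rf <- randomForest::randomForest(x = X_train, y = factor(Y_train),
                                   ntree = g["ntree"], mtry = g["mtry"])
  mean(predict(rf, X_test) != Y_test)   # test-set error rate, as reported in the paper
})
grid[which.min(errs), ]                 # best (ntree, mtry) combination
```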
4.1 Performance on large training set
The full set of 313 TMA images in the training set is used in this case. We run TACOMA on the training set (with scores given by STMAD) and apply the trained classifier to the test-set images to obtain TACOMA scores. Then, we blind the STMAD scores in the test set of 328 images (100 of which are duplicated, so there are 428 images in total) and have them re-evaluated by two experienced pathologists from two different institutions. The 100 duplicates allow us to evaluate the self-consistency of the pathologists.
Although the scores from STMAD do not necessarily represent ground truth, they serve as a fixed standard with respect to which accuracy and reproducibility can be examined. On the test set of 328 TMA images, TACOMA achieves a classification accuracy of 78.57% (accuracy defined as the proportion of images receiving the same score as STMAD). We argue this is close to optimal. The Bayes rate is estimated for this particular data example (represented as GLCMs) with a simulation using a 1-nearest neighbor (1NN) classifier; the Bayes rate refers to the theoretically best classification rate given the data distribution. With the same training and test sets as for RF classification, the accuracy achieved by 1NN is around 60%. According to a celebrated theorem of Cover and Hart [15], the error rate of 1NN is at most twice that of the Bayes rule. This implies an estimate of the Bayes rate at around 80%, subject to small-sample variation (the estimated Bayes rates on the original images and their quantized versions are all bounded above by this number according to our simulation). Thus TACOMA is close to optimal.
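The 1NN check can be reproduced with a few lines of R; the sketch below uses the knn() function from the 'class' package, with hypothetical variable names for the GLCM feature matrices and STMAD scores.

```r
library(class)   # provides knn()

# 1-nearest-neighbor error rate on the same train/test split.
pred_1nn <- knn(train = X_train, test = X_test, cl = factor(Y_train), k = 1)
err_1nn  <- mean(pred_1nn != Y_test)

# Cover-Hart: Bayes error <= 1NN error <= 2 * Bayes error (asymptotically),
# so the Bayes accuracy is roughly bounded above by 1 - err_1nn / 2.
bayes_accuracy_upper_bound <- 1 - err_1nn / 2
```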
In the above experiments, we use 4 image patches. To see whether TACOMA is sensitive to the choice of image patches, we conduct experiments over a range of different patch sets and achieve an average accuracy of 78.66 ± 0.52%. This indicates that TACOMA is robust to the choice of image patches.
Salient spots detection
The ability of TACOMA to detect salient pixels is demonstrated in Figure 8 where image pixels are highlighted in white if they are associated with a significant scoring feature. These highlighted pixels are verified by the pathologists to be indicative. With relatively few exceptions, these locations correspond to areas of stained nuclei in cancer cells. We emphasize that these highlighted pixels indicate features most important for classification as opposed to identifying every property indicative of ER status. The highlighted pixels facilitate interpretation and the comparison of images by pathologists.
The experiments with pathologists
The superior classification performance of TACOMA is also demonstrated by the scores provided by the two pathologists. These two sets of scores, along with STMAD, provide three independent pathologist-based scores. Among these, 142 images receive a unanimous score. Consequently, these may be viewed as a reference set of "true" scores against which the accuracy of TACOMA can be evaluated (accuracy being defined as the proportion of images receiving the same score as the reference set). Here, TACOMA achieves an accuracy of 90.14%; see Figure 9.
Figure 9. Classification performance of TACOMA.
On the STMAD test set, TACOMA achieves an accuracy of 78.57%. On the 142 images assigned a unanimous score by the two pathologists and STMAD, TACOMA agrees on about 90%.
The scores provided by the two pathologists are also used to assess their self-consistency. Here self-consistency is defined as the proportion of repeated images receiving an identical score from the same pathologist. Consensus among different pathologists is an issue of concern [40, 49]. In order to obtain information about the self-consistency of pathologist-based scores, 100 images are selected from the set of 328 images. These 100 images are rotated and/or inverted, and then mixed at random with the 328 images to avoid recognition (so each pathologist actually scored a total of 428 TMA images). The self-consistency rates of the two pathologists are found to be in the range 75–84%. Of course, one desirable feature of any automated algorithm such as TACOMA is its complete (i.e., 100%) self-consistency.
Performance comparison of RF to SVM and Boosting
We choose to use RF in TACOMA. Some popular alternatives include support vector machines (SVM) [14], Boosting [22], Bayesian networks [39], etc. Using the same training and test sets as for RF, we conduct experiments with SVM as well as Boosting of a naive Bayes classifier. The input features for both SVM and Boosting are the entries of the GLCM. We use the LIBSVM [11] software for SVM. The naive Bayes classifier is adopted from [51]. The idea is to assign a new observation to the class that maximizes the posterior probability,

$$\hat{c} = \arg\max_{c} \, P\big(c \mid I(1,1), \ldots, I(N_g, N_g)\big), \qquad (6)$$

under the assumption that (I(1, 1), …, I(Ng, Ng)) follows a multinomial distribution given the class. More details can be found in [51]. The results are shown in Table 1. We can see that RF outperforms SVM and Boosting by a large margin. This is consistent with the observations made by Holmes et al [29].
4.2 Experiments on small training sets
We conduct experiments on co-training with natural splits and thinning. For natural splits, we use GLCM’s corresponding to two spatial relationships, (↗, 3) and (↘, 1), as features. For thinning, we combine features corresponding to (↗, 3) and (↘, 1) and then split this combined feature set.
The number of labeled examples is fixed at 30 (compared to 313 in the experiments with the large training set). This choice is made so that class 1 (which carries only about 2.90% of the cases) is likely to be nonempty. We suspect this number can be further reduced without much loss in learning accuracy. The test set is the same as that in Section 4.1. The results are shown in Table 2. One interesting observation is that co-training by thinning achieves an accuracy very close to that of the natural split. Additionally, Table 2 lists error rates given by RF on features corresponding to (↗, 3) ∪ (↘, 1) and its thinned subsets. Here thinning of the feature set does not cause much loss in RF performance, consistent with our discussion in Section 3.2.
Table 2.
Performance of RF and co-training by thinning on TMA images. The unlabeled set is taken as the test set in Section 4.1 and the labeled set is randomly sampled from the corresponding training set. The subscript for “thinning” indicates the number of partitions. The results are averaged over 100 runs and over the two coupling classifiers for co-training.
| Scheme | Error rate |
|---|---|
| RF on (↗, 3) ∪ (↘, 1) | 34.36% |
| Thinning2 on (↗, 3) ∪ (↘, 1) | 34.21% |
| Thinning3 on (↗, 3) ∪ (↘, 1) | 34.18% |
| Co-training by natural split on (↗, 3) and (↘, 1) | 27.49% |
| Co-training by thinning2 on (↗, 3) ∪ (↘, 1) | 27.89% |
| Co-training by thinning3 on (↗, 3) ∪ (↘, 1) | 27.62% |
4.3 Experiments of TACOMA on additional datasets
This study focuses on the ER marker, for which the staining is nuclear. However, the TACOMA algorithm can be applied with equal ease to markers that exhibit cell-surface, cytoplasmic or other staining patterns. Additional experiments are conducted on the Stanford TMA images corresponding to three additional protein markers: CD117, CD34 and NMB. These three sets of TMA images are selected for their large sample sizes and relatively few missing scores (which are excluded from the experiment). The results are shown in Table 3. In contrast, the automated scoring of cytoplasmic markers is often viewed as more difficult, and refined commercial algorithms for these were reportedly not available in a recent evaluation [9] of commercial scoring methods.
5 Discussion
We have presented a new algorithm that automatically scores TMA images in an objective, efficient, and reproducible manner. Our contributions include: 1) the use of co-occurrence counting statistics to capture the spatial regularity inherent in a heterogeneous and irregular set of TMA images; 2) the ability to report the salient pixels in an image that determine its score; 3) the incorporation of pathologists' input via informative training patches so that our algorithm can easily adapt to specific markers and cell types; 4) a very small training sample made achievable by co-training, together with theoretical insights into co-training via thinning of the feature set. Our experiments show that TACOMA can achieve performance comparable to well-trained pathologists. It uses a set of pixels for scoring similar to that which would be used by a pathologist, and it is not adversely sensitive to the choice of image patches. The theory we have developed on the thinning scheme gives insight into why thinning may rival the performance of a natural split in co-training: a thinned slice may be as good as the whole feature set in terms of classification power, hence thinning can lead to two strong coupling classifiers for co-training, which is what a natural split achieves.
The utility of TACOMA lies in large population-based studies that seek to evaluate potential markers using IHC in large cohorts. Such a study may be compromised by a scoring process that is protracted, prohibitively expensive or poorly reproducible. Indeed, manual scoring for such a study could require hundreds of hours of pathologists' time without achieving a reproducibly consistent set of scores. Experiments with several IHC markers demonstrate that our approach has the potential to be as accurate as manual scoring while providing a fast, objective, inexpensive and highly reproducible alternative. Even more generally, TACOMA may be adapted to other types of textured images, such as those appearing in remote-sensing applications. These properties provide obvious advantages for any subsequent statistical analysis in determining the validity or clinical utility of a potential marker. Regarding reproducibility, we note that the scores provided by the two pathologists in our informal pilot study revealed an intra-observer agreement of around 80% and an accuracy only in the range of 70%, as defined by the STMAD reference set (excluding all images deemed unscorable by the pathologists). This relatively low agreement may be attributed to a variety of factors, including the lack of objective criteria used for scoring or the lack of training against an established standard. This performance could surely be improved upon, but it highlights the inherent subjectivity and variability of human-based scoring.
In summary, TACOMA provides a transparent scoring process that can be evaluated with clarity and confidence. It is also flexible with respect to the patterns of cellular localization of markers: although the ER marker is characterized by staining of the cell nucleus, TACOMA applies with comparable ease and success to cytoplasmic or other marker staining patterns (see Table 3 in Section 4.3).
A software implementation of TACOMA is available upon request, and the associated R package will be made available through the R project.
6 Appendix
This appendix consists of four parts, corresponding to the assumption (A) required by our theory, the proof of Theorem 2, simulation results for the ratio of separation, and additional large TMA images for salient pixel detection, respectively.
6.1 The assumption
Let b = H⁻¹u and write b = (Y1, …, Yp)ᵀ. It is known that H⁻¹ is also a lower triangular matrix. The components of b can be expressed as

$$Y_i = \sum_{j=1}^{i} h_{i,j} u_j, \qquad (7)$$

for i = 1, …, p, where h_{i,j} denotes the (i, j) entry of H⁻¹. The local dependence we are looking for is that the Yi's have the same distribution and, for any pair (i, j), Yi and Yj are independent if |i − j| > W for some W that is either constant or grows sublinearly with p; a law of large numbers then applies to averages of the Yi² over any sufficiently large index set.

We can express our assumption, referred to as (A), as follows. For each i = 1, …, p, h_{i,j} = 0 for |i − j| > W for some constant W (referred to as the bandwidth). Further, we require that the multiset of in-band entries

$$\{h_{i,j} : |i - j| \le W\} \qquad (8)$$

be constant (possibly excluding the first and last few rows) across i ∈ {1, …, p}. Moreover, there exists a universal constant M > 0 such that |h_{i,j}| ≤ M for all i, j.
6.2 Proofs
An important part of the proof of Theorem 2 concerns the sum of many "locally" dependent quadratic terms. We first prove a useful tool for this. For an index set $\mathcal{S} \subseteq \{1, \ldots, p\}$, define

$$T_{\mathcal{S}} = \frac{1}{|\mathcal{S}|} \sum_{i \in \mathcal{S}} Y_i^2.$$

We will show that $T_{\mathcal{S}}$ is highly concentrated around its mean in the following lemma.
Lemma 3
Suppose assumption (A) is true for the covariance matrix Σ under all possible permutations of rows and columns, for some universal constants W and M. Further assume U1 has a bounded fourth moment. Then, for any instance of the index set $\mathcal{S}$ defined by a random split,

$$T_{\mathcal{S}} - \mathbb{E}\, T_{\mathcal{S}} \;\to\; 0 \qquad (9)$$

in probability as p → ∞.
Proof
The proof is based on Chebyshev's inequality [12]. For simplicity, details are omitted for the handling of the finite number of Yi's that have fewer than the full number of non-zero terms in expression (7) (the sum of these terms tends to 0 in probability). Without loss of generality, let the index set be {1, …, m}. Fix α > 0; then
We will show that both of the two terms in (9) vanish as m grows. We have
which clearly vanishes as m grows assuming is bounded. Next we bound the covariance terms.
Thus
and the conclusion follows.
Remarks
In random matrix theory, it is typical to assume that the eigenvalues of the covariance matrix are uniformly bounded (in p) from below, from above or both (See, for example, [5, 32, 21, 8]). So our assumption on the eigenvalues of ∑−1 is not unreasonable.
An interesting special case of (A) is when H⁻¹ is a banded Toeplitz matrix, a subject of extensive study in time series, numerical analysis and covariance matrix estimation (see [45, 46, 18, 5, 8] and references therein). In this case, H⁻¹ has constant sub-diagonals, so (A) holds. Clearly, if we permute the entries in each row of H⁻¹ (the resulting matrix is then a permuted Toeplitz matrix) such that H⁻¹ remains lower triangular and banded (the bandwidth can grow sub-linearly with p), then it is easy to see that (A) still holds.
From the proof of Lemma 3, we see that similar conclusions follow when we allow W, M, or both to grow sub-linearly with p (the exact rate can be determined by inspecting the proof of Lemma 3). Under such conditions, since the mean may be infinite, we can state the result in terms of suitably normalized quantities.
The assumptions required by Lemma 3 may be too restrictive; it seems possible to relax them by requiring that the resulting covariance matrix under permutations be approximable by (H + Δ)(H + Δ)ᵀ with ||Δ|| ≤ ζ||H|| for some small constant ζ, where ||·|| denotes the operator norm. Accordingly, the lower bound becomes c for a constant c such that 0 < c < 0.5 (see Section 6.3).
Proof of Theorem 2
Let u be written as u = (u1, u2)T. To facilitate our proof, we introduce the following auxiliary matrix
where is a diagonal matrix with entries to be determined. We will use in place of A. Since
if we can make dp sufficiently large, then effectively we would have . The exact order of growth for dp to control the errors resulting from this approximation will be determined later in the proof.
We wish to obtain a lower bound for
We have
which vanishes as dp → ∞. To obtain a bound for γ1, note that it is equivalent to (i.e., via linear transformation y = H−1u)
(10) |
where lower triangular matrix H is defined in the Cholesky decomposition ∑ = HHT. Write H in the following block form
Then . We can compute the following
It follows that
Repeated application of the matrix norm inequality yields the following
where ||.||F denotes the Frobenius norm. Simply setting make the term Q(y)/yTy vanish. So
Let y be written as y = (Y1, …, Yp). By Lemma 3, we can get and when p is large. Thus
as p grows.
6.3 Simulations for thinning
We conduct simulations on Gaussian mixtures corresponding to different types of covariance matrices. The aim is to examine the generality of Theorem 2 when there is a departure from assumption (A). Note that we are not trying to model the TMA images with a Gaussian mixture; rather, we choose it for simplicity.
In all cases, the covariance matrix Σ has dimension 2000 and the components of u are generated i.i.d. uniform on [0, 1]. Σ is defined in one of the following four ways.

Σ1: Σ is banded with bandwidth 200 and its nonzero (i, j)th entry is defined by Σij = ρ^{|i−j|} for ρ ∈ {0.1, 0.3, 0.6}.

Σ2: Σ is banded with its nonzero (i, j)th entry proportional to 1/(|i − j| + 0.5).

Σ3: Σ is of the form (X + Xᵀ)/2 with the entries of matrix X generated i.i.d. uniform on [0, μ] for μ ∈ {0.1, 0.5, 1.0}, except that the diagonal entries are set to the smallest number such that λmin(Σ⁻¹) ≥ 10⁻⁵.

Σ4: Σ is of the form (X + Xᵀ)/2 with the entries of matrix X generated i.i.d. from a zero-mean Gaussian with standard deviation σ ∈ {0.1, 0.5, 1.0}, except that the diagonal entries are set as for Σ3.
Table 4 shows the results: in all cases, the ratio of separation is greater than 0.3 and fairly close to 0.5 (the value expected when assumption (A) holds). As the Bayes rate of the 2-class classification problem on the Gaussian mixtures with the original feature set is close to 1, so is that on a thinned slice of the feature set.
Table 4.
Average ratio of separation for Gaussian mixtures generated with different types of covariance matrices. Results are averaged on the two subsets of features and over 100 runs. In all cases, the Bayes rate is close to 1.
| Covariance | Parameter | Average ratio of separation |
|---|---|---|
| Σ1 | ρ = 0.1 | 0.5355 ± 0.0100 |
| Σ1 | ρ = 0.3 | 0.5799 ± 0.0123 |
| Σ1 | ρ = 0.6 | 0.5090 ± 0.0128 |
| Σ2 |  | 0.4104 ± 0.0123 |
| Σ2 |  | 0.4034 ± 0.0124 |
| Σ2 |  | 0.3990 ± 0.0131 |
| Σ3 | μ = 0.1 | 0.5315 ± 0.0103 |
| Σ3 | μ = 0.5 | 0.4088 ± 0.0121 |
| Σ3 | μ = 1.0 | 0.3624 ± 0.0200 |
| Σ4 | σ = 0.1 | 0.4038 ± 0.0125 |
| Σ4 | σ = 0.5 | 0.3922 ± 0.0136 |
| Σ4 | σ = 1.0 | 0.3376 ± 0.0188 |
6.4 TMA images with salient pixels marked
Two additional example TMA images with salient pixels marked (highlighted in white) after scoring are included; see Figure 10 and Figure 11.
Figure 10. Salient pixels illustrated.
Salient pixels (highlighted in white) as detected by TACOMA on one TMA image for the study of ER staining in breast cancer tissue. This is the full image and a larger view of the middle panel of Figure 8 in the main text. This TMA image receives a score of 3 according to STMAD.
Figure 11. Salient pixels illustrated.
Salient pixels (highlighted in white, but notably absent) as determined by TACOMA on one TMA image for the study of ER staining in breast cancer tissue. This is the full image and a larger view of the right panel of Figure 8 in the main text. This TMA image receives a score of 0 according to STMAD.
Footnotes
1. The additional issue of label noise was studied elsewhere [52].
2. Often, before computing the GLCM, the gray level of each pixel in an image is scaled linearly from [1, No] to [1, Ng], where No is the predefined number of gray levels in the image, typically 256.
3. Originally written in Fortran by Leo Breiman and Adele Cutler, and later ported to R by Andy Liaw and Matthew Wiener.
References
- [1] Anderson TW. An Introduction to Multivariate Statistical Analysis. John Wiley; New York: 1958.
- [2] Bach F, Jordan MI. Kernel independent component analysis. Journal of Machine Learning Research. 2003;3:1–48.
- [3] Bentzen S, Buffa F, Wilson G. Multiple biomarker tissue microarrays: bioinformatics and practical approaches. Cancer and Metastasis Reviews. 2008;27(3):481–494. doi: 10.1007/s10555-008-9145-8.
- [4] Berger A, Davis D, Tellez C, Prieto V, Gershenwald J, Johnson M, Rimm D, Bar-Eli M. Automated quantitative analysis of activator protein-2α subcellular expression in melanoma tissue microarrays correlates with survival prediction. Cancer Research. 2005;65(23):11185. doi: 10.1158/0008-5472.CAN-05-2300.
- [5] Bickel PJ, Levina E. Regularized estimation of large covariance matrices. Annals of Statistics. 2008;36(1):199–227.
- [6] Blum A, Mitchell T. Combining labeled and unlabeled data with co-training. Proceedings of the Eleventh Annual Conference on Computational Learning Theory. 1998. pp. 92–100.
- [7] Breiman L. Random Forests. Machine Learning. 2001;45(1):5–32.
- [8] Cai T, Zhang C-H, Zhou H. Optimal rates of convergence for covariance matrix estimation. Annals of Statistics. 2010;38(4):2118–2144.
- [9] Camp R, Neumeister V, Rimm D. A decade of tissue microarrays: progress in the discovery and validation of cancer biomarkers. Journal of Clinical Oncology. 2008;26(34):5630–5637. doi: 10.1200/JCO.2008.17.3567.
- [10] Caruana R, Karampatziakis N, Yessenalina A. An empirical evaluation of supervised learning in high dimensions. Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML). 2008. pp. 96–103.
- [11] Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
- [12] Chung KL. A Course in Probability Theory. 2nd edition. Academic Press; 1974.
- [13] Conners R, Harlow C. A theoretical comparison of texture algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1980;2:204–222. doi: 10.1109/tpami.1980.4767008.
- [14] Cortes C, Vapnik V. Support vector networks. Machine Learning. 1995;20:273–297.
- [15] Cover T, Hart P. Nearest neighbor pattern classification. IEEE Transactions on Information Theory. 1967;13(1):21–27.
- [16] Cregger M, Berger A, Rimm D. Immunohistochemistry and quantitative analysis of protein expression. Archives of Pathology & Laboratory Medicine. 2006;130(7):1026. doi: 10.5858/2006-130-1026-IAQAOP.
- [17] Dasgupta S, Gupta A. An elementary proof of the Johnson-Lindenstrauss lemma. Random Structures & Algorithms. 2002;22(1):60–65.
- [18] Dickinson BW. Efficient solution of linear equations with banded Toeplitz matrices. IEEE Transactions on Acoustics, Speech and Signal Processing. 1979;ASSP-27:421–423.
- [19] DiVito K, Camp R. Tissue microarrays – automated analysis and future directions. Breast Cancer Online. 2005;8(7).
- [20] Donoho DL. High-dimensional data analysis: the curses and blessings of dimensionality. American Math Society Conference on Math Challenges of the 21st Century. 2000.
- [21] Fan J, Fan Y, Lv J. High dimensional covariance matrix estimation using a factor model. Journal of Econometrics. 2008;147:186–197.
- [22] Freund Y, Schapire RE. Experiments with a new boosting algorithm. Proceedings of the Thirteenth International Conference on Machine Learning (ICML). 1996. pp. 148–156.
- [23].Giltnane J, Rimm D. Technology insight: identification of biomarkers with tissue microarray technology. Nature Clinical Practice Oncology. 2004;1(2):104–111. doi: 10.1038/ncponc0046. [DOI] [PubMed] [Google Scholar]
- [24].Gretton A, Fukumizu K, Teo CH, Song L, Schölkopf B, Smola AJ. Advances in Neural Information Processing Systems (NIPS) 2007. A kernel statistical test of independence; pp. 585–592. [Google Scholar]
- [25].Hall B, Ianosi-Irimie M, Javidian P, Chen W, Ganesan S, Foran D. Computer-assisted assessment of the human epidermal growth factor receptor 2 immunohistochemical assay in imaged histologic sections using a membrane isolation algorithm and quantitative analysis of positive controls. BMC Medical Imaging. 2008;8(1):11. doi: 10.1186/1471-2342-8-11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [26].Haralick R. Statistical and structural approaches to texture. Proceedings of IEEE. 1979;67(5):786–803. [Google Scholar]
- [27].Hassan S, Ferrario C, Mamo A, Basik M. Tissue microarrays: emerging standard for biomarker validation. Current Opinion in Biotechnology. 2008;19(1):19–25. doi: 10.1016/j.copbio.2007.10.009. [DOI] [PubMed] [Google Scholar]
- [28].Ho TK. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1998;20:832–844. [Google Scholar]
- [29].Holmes S, Kapelner A, Lee P. An interactive Java statistical image segmentation system: Gemident. Journal of Statistical Software. 2009;30(10):1–20. [PMC free article] [PubMed] [Google Scholar]
- [30].Johnson W, Lindenstrauss J. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics. 1984;26:189–206. [Google Scholar]
- [31].Joshi AS, Sharangpani GM, Porter K, Keyhani S, Morrison C, Basu AS, Gholap GA, Gholap AS, Barsky SH. Semi-automated imaging system to quantitate Her-2/neu membrane receptor immunoreactivity in human breast cancer. Cytometry Part A. 2007;71(5):273–285. doi: 10.1002/cyto.a.20374. [DOI] [PubMed] [Google Scholar]
- [32].El Karoui N. Spectrum estimation for large dimensional covariance matrices using random matrix theory. 2006. arXiv:math/0609418v1.
- [33].Kirkegaard T, Edwards J, Tovey S, McGlynn LM, Krishna SN, Mukherjee R, Tam L, Munro AF, Dunne B, Bartlett J. Observer variation in immunohistochemical analysis of protein expression, time for a change? Histopathology. 2006;48(7):787–794. doi: 10.1111/j.1365-2559.2006.02412.x. [DOI] [PubMed] [Google Scholar]
- [34].Kononen J, Bubendorf L, Kallionimeni A, Bärlund M, Schraml P, Leighton S, Torhorst J, Mihatsch M, Sauter G, Kallionimeni O. Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nature Medicine. 1998;4(7):844–847. doi: 10.1038/nm0798-844. [DOI] [PubMed] [Google Scholar]
- [35].Marinelli R, Montgomery K, Liu C, Shah N, Prapong W, Nitzberg M, Zachariah Z, Sherlock G, Natkunam Y, West R, et al. The Stanford tissue microarray database. Nucleic Acids Research. 2007;36:D871–D877. doi: 10.1093/nar/gkm861. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [36].Masmoudi H, Hewitt S, Petrick N, Myers K, Gavrielides M. Automated Quantitative Assessment of HER-2/neu Immunohistochemical Expression in Breast Cancer. IEEE transactions on medical imaging. 2009 doi: 10.1109/TMI.2009.2012901. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [37].Mulrane L, Rexhepaj E, Penney S, Callanan JJ, Gallagher WM. Automated image analysis in histopathology: a valuable tool in medical diagnostics. Expert Review of Molecular Diagnostics. 2008;8(6):707–725. doi: 10.1586/14737159.8.6.707. [DOI] [PubMed] [Google Scholar]
- [38].Nigam K, Ghani R. Analyzing the effectiveness and applicability of co-training. Proceedings of the ninth international conference on Information and knowledge management.2000. pp. 86–93. [Google Scholar]
- [39].Pearl J. Bayesian networks: A model of self-activated memory for evidential reasoning. Proceedings of the seventh Conference of the Cognitive Science Society.1985. pp. 329–334. [Google Scholar]
- [40].Penna A, Grilli R, Filardo G, Mainini F, Zola P, Mantovani L, Liberati A. Do different physicians’ panels reach similar conclusions? A case study on practice guidelines for limited surgery in breast cancer. European Journal of Public Health. 1997;7:436–440. [Google Scholar]
- [41].Rojo MG, Bueno G, Slodkowska J. Review of imaging solutions for integrated quantitative immunohistochemistry in the pathology daily practice. Folia Histochemica et Cytobiologica. 2009;47(3):349–354. doi: 10.2478/v10042-008-0114-4. [DOI] [PubMed] [Google Scholar]
- [42].Skaland I, Øvestad I, Janssen EAM, Klos J, Kjellevold KH, Helliesen T, Baak J. Digital image analysis improves the quality of subjective HER-2 expression scoring in breast cancer. Applied Immunohisto-chemistry & Molecular Morphology. 2008;16(2):185. doi: 10.1097/PAI.0b013e318059c20c. [DOI] [PubMed] [Google Scholar]
- [43].Tawfik OW, Kimler BF, Davis M, Donahue JK, Persons DL, Fan F, Hagemeister S, Thomas P, Connor C, Jewell W, et al. Comparison of immunohistochemistry by automated cellular imaging system (ACIS) versus fluorescence in-situ hybridization in the evaluation of HER-2/neu expression in primary breast carcinoma. Histopathology. 2005;48(3):258–267. doi: 10.1111/j.1365-2559.2005.02322.x. [DOI] [PubMed] [Google Scholar]
- [44].Thomson T, Hayes M, Spinelli J, Hilland E, Sawrenko C, Phillips D, Dupuis B, Parker R. HER-2/neu in breast cancer: interobserver variability and performance of immunohistochemistry with 4 antibodies compared with fluorescent in situ hybridization. Modern Pathology. 2001;14(11):1079–1086. doi: 10.1038/modpathol.3880440. [DOI] [PubMed] [Google Scholar]
- [45].Trench WF. Weighting coefficients for the prediction of stationary time series from the finite past. SIAM Journal on Applied Mathematics. 1967;15:1502–1510. [Google Scholar]
- [46].Trench WF. Inversion of Toeplitz band matrices. Mathematics of Computation. 1974;28:1089–1095. [Google Scholar]
- [47].Voduc D, Kenney C, Nielsen T. Tissue microarrays in clinical oncology. Seminars in radiation oncology. 2008;18(2):89–97. doi: 10.1016/j.semradonc.2007.10.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [48].Vrolijk H, Sloos W, Mesker W, Franken P, Fodde R, Morreau H, Tanke H. Automated Acquisition of Stained Tissue Microarrays for High-Throughput Evaluation of Molecular Targets. Journal of Molecular Diagnostics. 2003;5(3):160–167. doi: 10.1016/S1525-1578(10)60468-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [49].Walker R. Quantification of immunohistochemistry - issues concerning methods, utility and semiquantitative assessment I. Histopathology. 2006;49(4):406–410. doi: 10.1111/j.1365-2559.2006.02514.x. [DOI] [PubMed] [Google Scholar]
- [50].Wan WH, Fortuna MB, Furmanski P. A rapid and efficient method for testing immunohistochemical reactivity of monoclonal antibodies against multiple tissue samples simultaneously. Journal of Immunological Methods. 1987;103:121–129. doi: 10.1016/0022-1759(87)90249-3. [DOI] [PubMed] [Google Scholar]
- [51].Yan D, Bickel PJ, Gong P. A discrete log density expansion based approach to Ikonos image classification. American Society for Photogrammetry and Remote Sensing Fall Speciality Conference.2006. [Google Scholar]
- [52].Yan D, Gong P, Chen A, Zhong L. Classification under data contamination with application to image mis-registration in remote sensing. 2011. arXiv:1101.3594.
- [53].Yarowsky D. Unsupervised word sense disambiguation rivaling supervised methods. in Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics.1995. pp. 189–196. [Google Scholar]