Author manuscript; available in PMC: 2016 Mar 2.
Published in final edited form as: Stat Appl Genet Mol Biol. 2014 Aug;13(4):477–496. doi: 10.1515/sagmb-2013-0053

Multiclass cancer classification based on gene expression comparison

Sitan Yang *, Daniel Q Naiman
PMCID: PMC4775275  NIHMSID: NIHMS757203  PMID: 24918456

Abstract

As the complexity and heterogeneity of cancer are increasingly appreciated through genomic analyses, microarray-based cancer classification comprising multiple discriminatory molecular markers is an emerging trend. Such multiclass classification problems pose new methodological and computational challenges for developing novel and effective statistical approaches. In this paper, we introduce a new approach for classifying multiple disease states associated with cancer based on gene expression profiles. Our method focuses on detecting small sets of genes in which the relative comparison of their expression values leads to class discrimination. For an m-class problem, the classification rule typically depends on a small number of m-gene sets, which provide transparent decision boundaries and allow for potential biological interpretations. We first test our approach on seven common gene expression datasets and compare it with popular classification methods including support vector machines and random forests. We then consider an extremely large leukemia cohort to further assess its effectiveness. In both experiments, our method yields results comparable or even superior to those of benchmark classifiers. In addition, we demonstrate that our approach can integrate pathway analysis of gene expression to provide accurate and biologically meaningful classification.

Keywords: Multiclass cancer classification, Biomarker discovery, Gene expression analysis

1 Introduction

In recent years, microarray-based gene expression profiling has become a widespread approach for identifying biomarkers associated with cancer. In particular, expression patterns of genes have been sought extensively through statistical learning techniques to classify molecular subtypes and to predict clinical outcomes and chemotherapy responses (Quackenbush, 2006). This has resulted in a proliferation of such methods introduced and developed in the literature (Statnikov et al., 2008). While many early studies have focused on distinguishing two disease classes, relatively few approaches have been designed specifically for classification in the presence of multiple tumor classes (see e.g., Statnikov et al., 2005). In fact, as the heterogeneity of cancer has become clearer in recent studies (Burgess, 2011), new cancer subtypes are expected to continue to be discovered, leading to a growing number of multiclass problems. Moreover, clinical experiments investigating tumor stage, grade, survival time, and drug sensitivity are also likely to produce multiclass microarray datasets, see e.g., Dyrskjot et al. (2007) and Shah et al. (2011). Therefore, there is an increasing need for developing multiclass methods.

However, the methodological development of gene expression classifiers often suffers from two limitations. First, the sample size available for most microarray datasets is small, but at the same time, these datasets involve a large number of gene transcripts, leading to what is commonly referred to as the “small n, large p” dilemma. As a result, classification performance of complex models can be degraded by the large variance resulting from parameter estimation, and it is often necessary to restrict attention to classifiers of limited complexity. Second, although well-established and advanced machine learning techniques such as support vector machines (Cortes and Vapnik, 1995) and neural networks (Khan et al., 2001) can be immediately introduced as gene expression classifiers, their decision rules in most cases behave as “black boxes” that do not lend themselves easily to mechanistic biological understanding.

To address these limitations, many statistical methods have been proposed. Tibshirani et al. (2002) developed “Prediction Analysis of Microarrays” (PAM) that modifies the diagonal linear discriminant analysis method by introducing a shrinkage parameter, which creates a “de-noised” version of the class centroids (i.e., mean expression levels of classes) used in the discriminant function. Also, Grate (2005) investigated the discriminatory power of small gene subsets. Each gene set with size three or less is analyzed as a candidate for constructing a parameterized linear hyper-plane for distinguishing cancer classes, which is similar to the traditional separating hyper-plane classifiers. In addition, Leban et al. (2005) proposed the “VizRank” method that focuses on visualizing different cancer classes through data projections, which was further used by Mramor et al. (2007) for cancer classification. In general, these methods provide simplified decision rules that achieve comparable classification performance to traditional techniques.

One approach attempting to take both classifier complexity and biological interpretability into consideration was proposed by Geman et al. (2004). Here, a concept (later called “Relative Expression Analysis” in Eddy et al., 2010) was introduced to construct classifiers using the relative orderings (instead of raw values) of gene expression within each sample. In view of extensive preprocessing required for gene expression data, these relative orderings seem to be reliable pieces of information: they are likely to be preserved under slight perturbations of gene expression values and are robust against effects that shift expression values in the same direction. For example, they have been proved (Lin, 2008) to be invariant under commonly used preprocessing techniques such as convolution and quantile normalization of RMA (Irizarry et al., 2003). Based on this concept, “Top Scoring Pair” (TSP) was introduced as a new binary classification approach by simply comparing expression levels in one or more pairs of genes (i.e., top scoring pairs) for class prediction (Figure 1). As shown by Geman et al., the TSP approach provides transparent but powerful decision rules that compete with many sophisticated machine learning methods. In addition, gene pairs selected by TSP in various subsequent studies have been found to be biologically informative, see, e.g., Edelman et al. (2009), Zhao et al. (2010) and Patnaik et al. (2010).

Figure 1.


Gene expression patterns for a top scoring pair of genes. The figure displays the expression levels of gene SPTAN1 and CD33 on 72 patient samples in Golub et al. (1999), which are grouped according to two types of leukemia cancer: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), with 25 and 47 samples respectively. For classifying a sample, the decision rule predicts ALL if SPTAN1 has higher expression level than that of CD33 in the sample, and AML otherwise.

In the multiclass case, cancer classification becomes considerably more challenging. First, the “small n, large p” problem is especially compounded when subdivision of an already small set of samples into subclasses leads to dramatically smaller sample sizes for subclasses. Second, multiclass methods typically require significantly more computation, and decision rules generated can become substantially more complex (Shen and Tan, 2006). Many classifiers developed in binary problems do not naturally apply to the multiclass case and have to rely on decomposition of the problem into many binary sub-classification problems, together with an aggregation scheme for combining various sub-classifications, which is likely to increase the computation time and decrease the interpretability of the final decision rule.

Motivated by these considerations, in this paper, we introduce a new approach called “Top Scoring Set” (TSS) for multiclass cancer classification based on gene expression microarrays. TSS is a generalization of the TSP classifier in the multiclass case. It is parameter-free, purely data driven and robust to some common microarray preprocessing transformations. For an m-class problem, the class prediction is determined by a relatively small number of m-gene sets, namely, top scoring sets. Each top scoring set votes for a class based on the ordering of expression levels of its genes. The final prediction is the class that receives the majority of votes. In principle, TSS makes specific statistical hypotheses about gene expression comparison that could have biological interpretations, and even without the potential interpretability, the decision rule itself can be easily appreciated by non-specialists.

An example of a TSS classifier is illustrated in Figure 2, where we consider a more difficult task than that in Figure 1: distinguishing three cancer subtypes in the leukemia data of Golub et al. (1999). Figure 2 depicts a top scoring set consisting of genes PLCB2, MB-1 and LCK. Class prediction for a particular sample is determined by the gene in this set whose expression level is the highest. As shown later, TSS yields 95.83% prediction accuracy on this dataset using leave-one-out cross-validation (see Methods). To demonstrate the effectiveness of our approach, we evaluate its predictive performance on seven common human cancer gene expression datasets and compare it with popular benchmark classifiers including PAM, support vector machines (SVMs) and random forests (Breiman, 2001). In most cases, TSS achieves comparable or better classification performance. Moreover, we validate TSS on an extremely large multiclass cohort of leukemia cancer (Haferlach et al., 2010) containing 14 subclasses, where its predictive ability is demonstrated to compete with that of a large ensemble of SVMs.

Figure 2.


Gene expression patterns for a top scoring set of genes. The set consists of three genes: MB-1, LCK and PLCB2. The figure shows the expression levels of these genes on 72 patient samples, which are ordered according to three subtypes of leukemia cancer: AML, B-cell ALL (B-ALL) and T-cell ALL (T-ALL), with 25, 38 and 9 samples respectively.

There is no question that some gene expression patterns that are potentially useful for classification may be dismissed by TSS, and the assumptions under which TSS would likely prove the most useful might seem overly simplistic to reflect biological conditions in complex diseases. However, TSS provides a practical approach to modeling the statistical dependency structure among genes given the amount of data available. Also, as multiple top scoring sets are often found on a particular dataset, our results demonstrate that the information in the ordering of gene expression values is sufficient to reliably perform classification. Furthermore, we will show that TSS can also integrate biological information from functional pathway analysis of genes. Publicly available databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) in Kanehisa et al. (2004) can be used by TSS to provide accurate and biologically meaningful classification.

2 Methods

Let us consider G genes measured using DNA microarrays and their expression levels X = {X1, X2, ..., XG} regarded as a random vector. Each observed gene profile x is a realization of X and has a true label y representing its class. A microarray dataset is a collection of many, say N, observed gene profiles and can be represented as a matrix {xij} with G rows of genes and N columns of samples (typically G >> N). In this section, we first provide a brief review of the TSP method. Then the TSS approach is introduced in the multiclass setting. In addition, we discuss some implementation details of TSS.

2.1 A short review of TSP

As discussed in Geman et al. (2004), for a two-class problem (with classes denoted by 1 and 2), TSP aims to find each “marker” gene pair (i, j) (i, j ∈ {1, 2, ..., G}) that has a simple relation whose probability distribution changes significantly from one class to the other. The simple relation considered here is the comparison between the expression levels of genes i and j, and the relevant quantity of interest is the conditional probability P(Xi > Xj | y), where y is the class variable, y ∈ {1, 2}. So if P(Xi > Xj | y = 1) is high while P(Xi > Xj | y = 2) is low, it is very likely to observe Xi > Xj in class 1 but not in class 2, where Xi < Xj is more likely to happen. As a result, this property of (i, j) leads to the ability to distinguish between the two classes simply by determining which gene has the higher expression value, a simple decision rule for predicting class labels. In TSP, a score is defined for each distinct gene pair (i, j) as |P̂(Xi > Xj | y = 1) − P̂(Xi > Xj | y = 2)| in order to estimate the probability change from class to class, where P̂(Xi > Xj | y) is the frequency observed from the data. The pairs that achieve the highest score among all possible gene pairs (i.e., top scoring pairs) are involved in the decision rule. A top scoring pair with P̂(Xi > Xj | y = 1) > P̂(Xi > Xj | y = 2) predicts the class label ŷ of a new sample x as

ŷ = class 1, if xi > xj;  class 2, if xi < xj.  (1)

Then the predictions for each class are summed up over all top scoring gene pairs, and the majority rule is applied to produce the final prediction. From (1), we can see that the decision rule of TSP is only based on simple comparisons of gene pairs. However, as mentioned earlier, it has been shown as an effective classifier on many cancer datasets, and some top gene pairs from these studies are shown to be informative. In addition, several extensions of TSP have also been developed. Xu et al. (2007) considered the average ranks in two groups of genes (rather than a pair of genes) for constructing the decision rule. Tan et al. (2005) introduced the k-TSP classifier where the top k scoring pairs are involved using the majority rule in the decision process. Also, Lin et al. (2009) proposed the “Top Scoring Triplet” method in which relative orderings in each triplet (i.e., three genes) are investigated using a similar approach in TSP. Recently, Kaur et al. (2012) introduced the “ProtPair” method that uses TSP for human disease prognosis based on protein expression data. Thus far, all of these derivations have been aimed at the binary classification problem.
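The TSP score described above can be sketched in a few lines. The following Python snippet is a hypothetical illustration of the score computation, not the authors' implementation; the toy matrix and labels are made up.

```python
import numpy as np

def tsp_scores(X, y):
    """Score every gene pair (i, j) as |P̂(Xi > Xj | y=1) - P̂(Xi > Xj | y=2)|.

    X is a (G, N) expression matrix (genes x samples) and y a length-N
    array of labels in {1, 2}.  Returns the (G, G) matrix of pair scores.
    """
    X1, X2 = X[:, y == 1], X[:, y == 2]
    # P̂(Xi > Xj | y=c): fraction of class-c samples where gene i beats gene j
    p1 = (X1[:, None, :] > X1[None, :, :]).mean(axis=2)
    p2 = (X2[:, None, :] > X2[None, :, :]).mean(axis=2)
    return np.abs(p1 - p2)

# toy data: gene 0 exceeds gene 1 in class 1 only
X = np.array([[5.0, 6.0, 1.0, 2.0],
              [3.0, 4.0, 7.0, 8.0]])
y = np.array([1, 1, 2, 2])
print(tsp_scores(X, y)[0, 1])  # 1.0: a perfect top scoring pair
```

The broadcasting over the third axis computes all pairwise comparisons at once; for genome-scale G a loop over pre-filtered candidate pairs would be used instead.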

2.2 Top Scoring Set

In this section, we introduce TSS as a new multiclass classification approach. The motivation of TSS comes from the relative comparison idea used in TSP. As discussed in Geman et al. (2004), such relative comparison of mRNA concentrations indicated by gene expression levels provides a natural link with biochemical activity, and proposes concrete hypotheses for a small list of genes. Therefore, our goal here is to discover valuable information for multiclass separation by comparing expression patterns of a few genes. In particular, for an m-class problem (with classes denoted by 1, 2, ..., m), we are interested in finding m “marker” genes S = {i1, i2, ..., im} ⊂ {1, 2, ..., G} such that the presence of some simple relations among these genes, with high conditional probability depending on the class, leads to class separability. Specifically, we aim for a high expression level of gene ic relative to the other m − 1 genes in S to be indicative of a sample coming from class c. To be precise, the desired statistical property for S is that for all c ∈ {1, 2, ..., m},

P[arg max{Xr, r ∈ S} = ic | y = c] ≫ P[arg max{Xr, r ∈ S} = ic | y ≠ c].  (2)

In other words, gene ic is much more likely to have the maximum expression level among the m genes in S for class c than for any other class. In this case, a classification rule can be constructed by determining which gene is most expressed in S, via a simple “arg max” function. Therefore, it is essential to find gene sets for which (2) holds as strongly as possible. For this purpose, we define a score for each m-gene set to estimate how well it satisfies (2). The sets with the highest score are referred to as the top scoring sets and will be used for classification.

In general, TSS searches for gene sets exhibiting a particular pattern that may be suitable for classification. There are, of course, many other patterns that one might consider with the potential for effective classification. Still, it is important to be mindful that increasing the size of the pattern search space would result in significant increases in already substantial computational costs, and is more likely to produce over-fitting.

2.2.1 Gene set score

To illustrate the score calculation for gene sets, we start with a previous example of the leukemia data in Golub et al. (1999), which consists of 7,129 genes and three leukemia subtypes identified as AML, B-ALL and T-ALL, with 25, 38 and 9 samples respectively. For this three-class problem, we score a particular gene set consisting of PLCB2, MB-1 and LCK. As described in Figure 2, this is a top scoring set identified by TSS. Here, we denote three genes as i1, i2 and i3 respectively, and we calculate the observed class conditional frequencies of their expression comparison in Table 1.

Table 1.

Observed frequencies of expression comparison within a three-gene set.

Leukemia
AML B-ALL T-ALL
Xi1 > max(Xi2, Xi3) 1 0 0
Xi2 > max(Xi1, Xi3) 0 0.9737 0
Xi3 > max(Xi1, Xi2) 0 0.0263 1

Interestingly, we observe that for AML, gene i1 has the highest expression level among the three genes in 100% of the samples, as indicated by the first column of Table 1. Similarly, for B-ALL, gene i2 has the highest expression level in 97.4% of the samples, and for T-ALL, gene i3 has the highest expression level in 100% of the samples. Therefore, based on the information provided by these three genes, a natural way to classify a sample with expression levels x would be to predict AML, B-ALL, or T-ALL by determining which of the three expression levels xi1, xi2 and xi3 is highest. Accordingly, we define a score for {i1, i2, i3} based on Table 1 as the sum of the row maxima, i.e.

max{1,0,0}+max{0,0.9737,0}+max{0,0.0263,1}=2.9737.

If the score is 3, clearly, the rule described above obtains a zero apparent error rate on the dataset. Furthermore, if the underlying probability distributions of gene expression comparison are well reflected by the observed frequencies, a higher score can indicate that the rule is more likely to be effective for new samples. Therefore, our goal is to search for three-gene sets with the highest possible score.
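The score calculation in this example reduces to summing the row maxima of Table 1. A minimal Python check, with the frequencies copied from the table:

```python
import numpy as np

# Rows: which gene of {i1, i2, i3} has the highest expression in a sample;
# columns: AML, B-ALL, T-ALL (frequencies reproduced from Table 1).
freq = np.array([[1.0, 0.0,    0.0],
                 [0.0, 0.9737, 0.0],
                 [0.0, 0.0263, 1.0]])

score = freq.max(axis=1).sum()  # sum of the row maxima
print(round(score, 4))          # 2.9737, matching the text
```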

In general, for an m-class problem, each m-gene set S = {i1, i2, ..., im} can produce a similar table (Table 2), where p̂rj is the frequency that, given class y = j, gene ir has the highest expression level in S. The score for S is then defined as

∑_{r=1}^{m} max_{j=1,2,…,m} p̂rj.  (3)
Table 2.

Observed frequencies of expression comparison associated with S.

Class
y = 1 y = 2 ... y = m
Xi1 > max{Xr, r ∈ S\i1}   p̂11   p̂12   ...   p̂1m
Xi2 > max{Xr, r ∈ S\i2}   p̂21   p̂22   ...   p̂2m
...
Xim > max{Xr, r ∈ S\im}   p̂m1   p̂m2   ...   p̂mm

Under certain assumptions, equation (3) has a Bayesian decision-theoretic interpretation, in which a Bayes optimal rule is chosen among a set of possible decision rules by minimizing the Bayes risk. To define the Bayes risk, one must introduce a prior distribution for the classes and a loss function specifying the penalty for each misclassification. In the absence of class priors and information about relative losses associated with various types of misclassification, it is natural to use 0-1 loss and assume equal prior probabilities for classes. The resulting Bayes rule turns out to be (3). However, different loss functions or class priors can also be considered; for example, one could use the empirical class prior nc/N, where nc is the sample size of class c and N is the total sample size. Details of this interpretation and the optimal Bayesian classifier with a general loss function and class priors can be found in Appendix A.

In practice, for further breaking ties among gene sets with the highest score, we also considered a secondary score based on Table 2 as

−∑_{j=1}^{m} ∑_{r=1}^{m} p̂rj ln p̂rj,  (4)

i.e., the sum of estimated class conditional entropies. As a result, each top scoring set that is finally chosen is required to minimize the secondary score (4) as well.
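Both the primary score (3) and the entropy tie-breaker (4) can be computed directly from a frequency table of the form in Table 2. The following sketch follows the definitions above (with 0·ln 0 taken as 0); it is an illustration under those definitions, not the authors' implementation.

```python
import numpy as np

def set_scores(freq):
    """Return the primary score (3) and the secondary entropy score (4).

    freq[r, j] is the observed frequency that gene i_r has the highest
    expression in the set, given class j; each column sums to 1 when
    there are no expression ties.
    """
    primary = freq.max(axis=1).sum()           # Eq. (3): sum of row maxima
    nz = freq[freq > 0]                        # convention: 0 * ln 0 = 0
    secondary = -(nz * np.log(nz)).sum()       # Eq. (4): class conditional entropies
    return primary, secondary

perfect = np.eye(3)                 # each class has its own maximal gene
p, s = set_scores(perfect)          # p == 3.0, s == 0 (no ambiguity to break)
```

A set with a perfect primary score has deterministic columns and hence zero entropy, so the tie-breaker only matters when several sets share a non-perfect top score.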

2.2.2 Decision rule

As mentioned earlier, the TSS classifier is built from top scoring sets. For each top scoring set S̃, the prediction for sample x is

ŷ = arg max_{c=1,2,…,m} x_{ic},  ic ∈ S̃.  (5)

Here we use the same notation as in (1). We can see that when m = 2, (5) reduces to (1). Therefore, TSS is essentially a generalization of TSP to the multiclass case.

Although it rarely happens, due to expression level ties the decision rule for a single top scoring set can produce multiple classes associated with genes whose expression is the highest. In this situation, we consider a randomized decision where each associated class is assigned a vote of 1/T, where T is the number of genes producing the tie. For the final prediction, these votes are summed over the top scoring sets and the majority rule is applied.
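The voting scheme, including the 1/T split for expression ties, can be sketched as follows. This is a hypothetical Python illustration; the gene names and expression values are made up, and only the set's class ordering matters.

```python
import numpy as np
from collections import defaultdict

def tss_predict(x, top_sets):
    """Majority-vote TSS prediction following Eq. (5).

    x maps gene names to expression values; top_sets is a list of m-tuples
    in which position c holds the marker gene for class c.  A tie in
    expression splits that set's vote equally (1/T per tied class).
    """
    votes = defaultdict(float)
    for genes in top_sets:
        vals = np.array([x[g] for g in genes])
        winners = np.flatnonzero(vals == vals.max())  # classes tied for the max
        for c in winners:
            votes[c] += 1.0 / len(winners)
    return max(votes, key=votes.get)                  # class with most votes

x = {"PLCB2": 2.1, "MB-1": 7.5, "LCK": 3.0}
# one hypothetical top scoring set, marker order (class 0, class 1, class 2)
print(tss_predict(x, [("PLCB2", "MB-1", "LCK")]))  # 1: the second position wins
```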

2.2.3 Greedy search

In theory, TSS finds top scoring sets among all possible gene sets. For a dataset with G genes and m classes, an exhaustive search has complexity O(G^m), which grows exponentially with the number of classes m. Hence, it is necessary to relax the global optimality requirement, which can be done in various ways. One idea would be to reduce the search space a priori, typically by pre-selecting a small number of genes based on a univariate multiclass criterion such as one-way ANOVA or the Kruskal-Wallis test. Also, as illustrated in the “Results” section, it is possible to use pathway information to restrict the search to naturally defined groups of genes. In this section, however, we propose a different idea that adopts a greedy search algorithm to select gene sets that are top scoring in each of several stages, leading to what could be called locally optimal scoring sets.

For an m-class problem, the greedy search algorithm takes m − 1 steps to form the m-gene sets used in the final decision rule. It is initialized by finding the collection of gene pairs with the highest score for each of the m(m − 1)/2 possible two-class (1-vs-1) sub-problems. Next, each two-class sub-problem is augmented by a single class, and for every such augmentation, a collection of three-gene sets with the highest score is found based on the top scoring gene pairs obtained in that two-class sub-problem. In particular, a distinct gene is added to each such gene pair to yield a group of three-gene sets, among which the ones with the highest score are sought. The algorithm then iterates until the size of the sub-problems reaches m, and a collection of m-gene sets with the highest score is obtained from each sub-problem of size m. Finally, the (locally optimal) top scoring sets are found among all such collections for building a TSS classifier.

The greedy search process is illustrated in Figure 3. Since the first step involves two classes and each subsequent step deals with one more class, the formation of an m-gene set requires m − 1 steps. Importantly, all possible sequences of sub-problems in which we start with a two-class problem and augment by one class at a time until reaching m classes are considered, so that ultimately we arrive at m!/2 collections of m-gene sets to be compared. The complexity of the first step is O(G^2), and each subsequent step only requires O(G·l) additional computations, where l is the maximum size of the collections of highest-scoring gene sets generated in the previous step. Because the number of such sets for a given sub-problem is expected to be small, l is typically small, so the fully implemented algorithm has O(G^2) complexity, significantly lower than the O(G^m) complexity of an exhaustive search for m > 2.

Figure 3.


Schematic diagram of the greedy search algorithm. The workflow of the algorithm is illustrated for a four-class problem. Blue arrows represent the initialization step where each possible two-class sub-problems are considered. Each red arrow denotes an augmentation of the current problem by a single class. One possible sequence of augmentations is shown in the graph.
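To make the procedure concrete, here is a simplified Python sketch of the greedy search on a toy matrix. It enumerates class orderings explicitly (all m! orderings rather than the m!/2 distinct ones) and keeps every tied top scorer at each stage; it illustrates the idea only and is not the authors' R/Rcpp implementation.

```python
import numpy as np
from itertools import combinations, permutations

def set_score(X, y, classes, genes):
    """Eq. (3) restricted to `classes`: for each gene in the set, take its
    maximal class conditional frequency of being the top-expressed gene."""
    total = 0.0
    for r in range(len(genes)):
        freqs = [np.mean(np.argmax(X[np.ix_(genes, np.flatnonzero(y == c))],
                                   axis=0) == r) for c in classes]
        total += max(freqs)
    return total

def greedy_tss(X, y, classes):
    G = X.shape[0]
    best_score, best_sets = -1.0, []
    for order in permutations(classes):
        # stage 1: score all gene pairs on the first two-class sub-problem
        sets = [list(p) for p in combinations(range(G), 2)]
        for step in range(2, len(order) + 1):
            sub = list(order[:step])
            if step > 2:
                # augment every kept set by one extra gene
                sets = [s + [g] for s in sets for g in range(G) if g not in s]
            scored = [(set_score(X, y, sub, s), s) for s in sets]
            top = max(v for v, _ in scored)
            sets = [s for v, s in scored if v == top]  # keep only ties for best
        if top > best_score:
            best_score, best_sets = top, sets
    return best_score, best_sets

# toy example: genes 0, 1, 2 are perfect markers for classes 0, 1, 2
X = np.array([[9.0, 9.0, 1.0, 1.0, 1.0, 1.0],
              [1.0, 1.0, 9.0, 9.0, 1.0, 1.0],
              [1.0, 1.0, 1.0, 1.0, 9.0, 9.0],
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
y = np.array([0, 0, 1, 1, 2, 2])
score, sets = greedy_tss(X, y, [0, 1, 2])
print(score, sorted(sets[0]))  # 3.0 [0, 1, 2]: the perfect-score set is found
```

The initial stage dominates the cost at O(G^2) pair evaluations; each augmentation stage only scores the small carried-forward collections, mirroring the O(G·l) analysis above.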

The TSS classifier built by the greedy algorithm is typically validated by cross-validation. Normally, the greedy algorithm must be performed in each iteration of the cross-validation loop, which can lead to relatively extensive computation. To address this difficulty, we have further developed an acceleration algorithm that extends the pruning algorithm introduced by Tan et al. (2005) for the TSP classifier to the multiclass case (see Appendix B). The acceleration algorithm applies the greedy search method only once on the entire dataset and generates a small list of gene sets. Then, the top scoring sets identified from the list in each iteration of the cross-validation are guaranteed to be the same as those obtained by applying greedy search on the reduced training set.

The greedy algorithm is not guaranteed to find gene sets with the globally maximal score. However, because it iterates through the highest-scoring gene sets in each possible sub-classification problem, it still tends to produce high-scoring gene sets for the original problem, and the resulting efficiency gain is substantial enough to compensate for the restricted search space. Also, since all possible sizes of sub-problems are investigated sequentially and only gene sets with the highest score are kept in each iteration, the final top scoring sets are globally optimal whenever the global solution is also optimal in each of these sub-problems. In particular, this happens when the top scoring sets obtained by the algorithm have a perfect score.

3 Results

3.1 Gene expression data

The TSS classifier introduced in this paper has been tested on seven human gene expression datasets retrieved from public databases or authors’ websites (Table 3). They are related to human cancers including leukemia (Leukemia), mixed lineage leukemia (MLL), lung adenocarcinoma (Lung), small round blue cell tumors (SRBCT), bladder carcinoma (Bladder), childhood acute lymphoblastic leukemia (ChildALL) and non-small cell lung cancer (NSCLC). The classification problems include cancer subtypes (Leukemia, SRBCT and NSCLC), tumor stages (Lung and Bladder) and treatment responses (ChildALL). Some of these datasets have been investigated previously (Tan et al., 2005 and Mramor et al., 2007) for evaluating gene expression classifiers. Additional information can be obtained from the references included in Table 3.

Table 3.

Seven gene expression datasets for evaluating classification performance.

Dataset No. of classes No. of genes No. of samples Reference
Leukemia 3 7129 72 Golub et al. (1999)
MLL 3 12582 72 Armstrong et al. (2002)
Lung 3 7129 96 Beer et al. (2002)
SRBCT 4 2308 83 Tibshirani et al. (2002)
Bladder 3 7129 40 Dyrskjot et al. (2003)
ChildALL 4 12625 60 Cheok et al. (2003)
NSCLC 3 12599 33 Dehan et al. (2007)

3.2 Top scoring gene sets

The top scoring gene sets found on the seven datasets are summarized in Table 4. For each dataset, top scoring sets have been obtained by applying the greedy search algorithm on all samples. The corresponding top score and re-substitution error have also been calculated. The algorithm has been implemented in R 3.0.0 (http://www.r-project.org/) using the package Rcpp (http://cran.r-project.org/web/packages/Rcpp/) and the code is available at our website (http://jshare.johnshopkins.edu/dnaiman1/public_html/tss).

Table 4.

Top scoring gene sets identified on seven gene expression datasets. The table includes the number of top scoring sets and genes involved, the top score and the re-substitution error for each dataset.

Dataset No. of sets No. of genes Score Error
Leukemia 73 36 2.97/3.00 1/72
MLL 7 12 2.96/3.00 1/72
Lung 1 3 2.72/3.00 16/96
SRBCT 2 5 3.93/4.00 2/83
Bladder 3 5 3.00/3.00 0/40
ChildALL 1 4 3.09/4.00 13/60
NSCLC 2 6 2.88/3.00 1/33

We have identified 73 top scoring sets for Leukemia, seven sets for MLL, three sets for Bladder, two sets for SRBCT and NSCLC, and one for Lung and ChildALL. Only a few genes are actually involved in these sets and most genes appear in multiple sets. It is interesting to note that some of these genes would not be regarded as differentially expressed based on their individual expression values, but the relative comparison of expression levels in each top scoring set produces enhanced class separability. The observed frequencies of some top scoring sets are displayed in Table 5. For each set, the table gives the relative frequency at which the maximum expression value appears among genes in the set for every class. In each case, these relative frequencies provide good evidence for discriminability of the set, which indicates the potential for class prediction.

Table 5.

Observed frequencies of gene expression comparison in two top scoring sets from (a) NSCLC and (b) SRBCT respectively. Each class conditional frequency is equal to the proportion of times a certain gene achieves the maximum expression value in the class.

(a)
Max gene
Class KRT14 CNGB1 GDF10
SCC 1 0 0
ADCA 0 1 0
N 0 0.12 0.88
(b)
Max gene
Class GYG2 EST CDH2 HCLS1
EWS 0.93 0 0.07 0
RMS 0 1 0 0
NB 0 0 1 0
BL 0 0 0 1

3.3 Classification accuracy

To validate the greedy search-based TSS (G-TSS) in the previous section, we assessed its classification accuracy on seven microarray datasets. As a comparison to the greedy search algorithm, we considered a common differential expression technique based on the Mann-Whitney test and the “1-vs-all” strategy to select top n genes for separating each class from the union of other classes. To save computation time, n was chosen to be 50 for three-class and 25 for four-class problems. The resulting TSS classifier (denoted as “MW-TSS”) is compared to G-TSS in terms of classification accuracy. Furthermore, we considered five popular machine learning techniques as benchmarks to the TSS approach: k-nearest neighbors (kNN), naive Bayes (NB), random forests (RF), support vector machines with a linear kernel (l-SVM) and PAM. All analyses have been performed using packages in R 3.0.0. LIBSVM (Chih-Chung and Chih-Jen, 2011) was used as the implementation for SVMs. There are a variety of model choices provided by LIBSVM and the linear kernel SVM is suggested for microarray data. In particular, multiclass problems are handled by LIBSVM using the “1-vs-1” approach.

The accuracy of a classifier has been estimated using leave-one-out cross-validation (LOOCV), a common procedure to evaluate classifiers on datasets with a small sample size (see e.g., Geman et al., 2004 and Lin et al., 2009). For classifiers with parameters (e.g., the number of nearest neighbors k in kNN and the cost factor C in l-SVM), the performance evaluation was realized by a double LOOCV loop, in which the inner loop is responsible for model optimization that usually involves parameter tuning, and the outer loop is used for calculating accuracy by averaging classification results. To avoid over-optimistic evaluation results, each step of the outer loop is carried out so that the training data on which the model optimization is performed is fully independent of the left out testing sample.
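The outer loop of this protocol can be sketched as follows, with hypothetical `fit`/`predict` callables; any inner tuning loop lives inside `fit`, so it never sees the held-out sample. The 1-nearest-neighbour rule in the usage example is made up for illustration.

```python
import numpy as np

def loocv_accuracy(X, y, fit, predict):
    """Outer LOOCV loop: each sample is held out once.

    X is a (G, N) expression matrix and y a length-N label array.  `fit`
    receives only the N-1 training samples (and may run its own inner
    LOOCV for parameter tuning); `predict` classifies the left-out sample.
    """
    N = X.shape[1]
    hits = 0
    for i in range(N):
        keep = np.arange(N) != i              # training set excludes sample i
        model = fit(X[:, keep], y[keep])      # tuning sees training data only
        hits += predict(model, X[:, i]) == y[i]
    return hits / N

# usage with a hypothetical 1-nearest-neighbour rule
fit = lambda Xtr, ytr: (Xtr, ytr)
predict = lambda m, x: m[1][np.argmin(((m[0] - x[:, None]) ** 2).sum(axis=0))]
X = np.array([[0.0, 0.1, 5.0, 5.1],
              [0.0, 0.2, 5.0, 4.9]])
y = np.array([0, 0, 1, 1])
print(loocv_accuracy(X, y, fit, predict))  # 1.0 on this separable toy data
```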

Table 6 provides a comparison of classification accuracies for different methods on seven datasets. In general, G-TSS has achieved comparable or better performance on most datasets. For Leukemia and MLL, it competes with the highest accuracies obtained by PAM and NB respectively. For Bladder, ChildALL and NSCLC, it turns out to be the most accurate classifier. In contrast, MW-TSS only yields comparable results on Leukemia and NSCLC and has the lowest accuracies on four out of seven datasets, which seems to indicate the inappropriateness of using traditional differential expression methods that focus on individual expression values to search for relative expression patterns. These results demonstrate the superiority of the greedy search algorithm for building the TSS classifier.

Table 6.

Comparison of classification accuracies estimated using LOOCV. The highest accuracy for each dataset is highlighted in boldface.

Method Leukemia MLL Lung SRBCT Bladder ChildALL NSCLC
G-TSS 95.83 94.44 70.83 90.36 100.00 48.33 90.91
MW-TSS 95.83 76.39 62.50 89.15 57.50 36.67 87.87
kNN 81.94 91.67 75.00 95.18 77.50 45.00 72.72
NB 94.44 95.83 75.00 98.80 82.50 46.67 66.67
RF 93.06 94.44 78.13 100.00 90.00 48.33 72.72
PAM 97.22 93.06 70.83 100.00 85.00 31.67 69.70
l-SVM 93.06 94.44 83.33 100.00 92.50 41.67 81.82

In our study, kNN, NB, RF and l-SVM make use of all available genes for classification. Although their performance could often be improved by feature selection or ensemble approaches, investigating such improvements is beyond the scope of this paper, as our goal is not merely to develop a more accurate classifier. Instead, the competitive performance of TSS across all datasets demonstrates its stability. More importantly, our approach discovers small informative subsets of genes, and the TSS decision rule is much simpler than those of the benchmark classifiers, hence more likely to provide improved biological interpretability without a concomitant sacrifice in performance.

3.4 Leukemia study

While the predictive ability of the TSS classifier has been demonstrated across seven gene expression datasets, these datasets generally have a very limited sample size and a small number of classes. To address these limitations, we applied the TSS approach to one extreme case: the Microarray Innovations in Leukemia (MILE) study program (Haferlach et al., 2010). MILE is reported to be one of the largest gene expression microarray profiling studies in hematology and oncology. The expression profiles were collected from 11 laboratories in seven countries across three continents and cover leukemia subtypes of myeloid and lymphoid malignancies. MILE is a two-stage study: a retrospective stage I generated expression profiles for 2,143 patients and was designed for biomarker discovery, while a prospective stage II produced an independent cohort of 1,152 patients used for validation. Stage I used commercially available whole-genome microarrays (Affymetrix HG-U133 Plus 2.0), whereas stage II was performed using a newly designed custom chip (Roche AmpliChip). The microarray data have been deposited in the Gene Expression Omnibus database (http://www.ncbi.nlm.nih.gov/geo/) under series accession number GSE13204.

MILE provides a unique opportunity for validating microarray-based classification models, especially multiclass approaches. Each of the 2,143 samples in stage I contains 54,675 gene expression measurements (45 missing values). Samples are classified into 18 diagnostic gold-standard categories, including eight ALL subtypes, six AML subtypes, two chronic leukemia subtypes, myelodysplastic syndromes and normal bone marrow. Stage II contains only 1,480 (1,457 disease-related and 23 housekeeping) genes, and samples are classified into the same 18 classes defined in stage I. In the initial MILE study, a classification model was trained and tested for distinguishing all 18 classes. The multiclass model consists of binary classifiers formed by support vector machines with a linear kernel (l-SVM), each separating a pair of classes. High accuracies were observed for most classes, indicating the robustness of microarray-based classification. To compare predictive performance, we trained a classification model using the TSS approach on stage I samples and tested it independently on samples from stage II. Since the original MILE paper reported independent validation results only for an acute leukemia diagnostic classifier, all 14 acute leukemia subtypes (Table 7) are considered in our study. Also, both the training and test sets contain only the 1,457 genes common to the microarray datasets from the two stages.

Table 7.

Samples of acute leukemia subtypes used for classification. Three major leukemia classes consist of 14 subtypes. The class labels (C1 to C14) are the same as defined in the MILE study.

Class Diagnosis No. of samples
Training Test
- B-ALL 576 357
C1     Mature B-ALL with t(8;14) 13 5
C2     Pro-B-ALL with t(11q23)/MLL 70 23
C3     c-ALL/Pre-B-ALL with t(9;22) 122 62
C5     ALL with t(12;21) 58 64
C6     ALL with t(1;19) 36 10
C7     ALL with hyperdiploid karyotype 40 35
C8     c-ALL/Pre-B-ALL without t(9;22) 237 158
C4 T-ALL 174 79
- AML 542 257
C9     AML with t(8;21) 40 16
C10     AML with t(15;17) 37 20
C11     AML with inv(16)/t(16;16) 28 20
C12     AML with t(11q23)/MLL 38 17
C13     AML with normal kt./other abn. 351 160
C14     AML complex aberrant karyotype 48 24

Although in principle the TSS approach can be applied to any number of classes, the predictive power of top scoring sets based on relative comparison is expected to decrease as the number of classes increases. Therefore, for this large multiclass problem, a two-step decision tree (Figure 4) was introduced based on three TSS classifiers. The hierarchy of the tree follows the structure of the data, in which the 14 acute leukemia subtypes can be grouped into three major lineage leukemias (B-ALL, T-ALL and AML); the B-ALL class is further divided into seven subtypes, and the AML class contains six subtypes. Accordingly, three TSS classifiers were built, for a three-class, a six-class and a seven-class problem respectively. The final prediction for a sample follows the decision tree. In addition, to further improve predictive performance, we used a procedure similar to the one introduced by Tan et al. (2005) to construct an ensemble of TSS classifiers for each of the three multiclass problems. Specifically, the top k scoring gene sets were selected at each step of the greedy search process, and the final prediction was the class receiving the majority of votes from the k chosen gene sets. Here k was treated as a model parameter, and the best k ∈ {1, 2, ..., 50} was determined by LOOCV on the training set. The acceleration algorithm was used to expedite the cross-validation process (the case of top k scoring sets is discussed in Appendix B).
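The routing logic of the two-step tree and the top-k majority vote can be sketched as follows. This is an illustrative skeleton under the assumption that the three classifiers and the per-gene-set prediction are supplied as black-box functions; none of the names below come from the paper's implementation.

```python
from collections import Counter

def tree_predict(x, lineage_clf, ball_clf, aml_clf):
    """Two-step decision tree: route a sample through the lineage
    classifier first, then to the matching subtype classifier.
    The three classifiers are placeholders for the paper's TSS models."""
    lineage = lineage_clf(x)          # three-class: 'B-ALL', 'T-ALL' or 'AML'
    if lineage == 'T-ALL':
        return 'C4'                   # T-ALL is a single subtype (class C4)
    if lineage == 'B-ALL':
        return ball_clf(x)            # seven B-ALL subtypes
    return aml_clf(x)                 # six AML subtypes

def vote(x, gene_sets, classify_with):
    """Majority vote over the top k scoring gene sets, as in the
    ensemble variant; ties broken by first occurrence."""
    votes = Counter(classify_with(x, s) for s in gene_sets)
    return votes.most_common(1)[0][0]
```

In the ensemble variant, each of the three node classifiers would internally call `vote` over its k top scoring gene sets before the tree routing is applied.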

Figure 4.


Two-step decision tree for classification of acute leukemia samples.

Prediction accuracies of G-TSS are shown in Table 8 and compared with those achieved by l-SVM as presented in Haferlach et al. (2010). The optimal k for the ensemble of TSS classifiers in the three-, six- and seven-class problems (Figure 4) is 12, 7 and 42 respectively. G-TSS achieved 100% correct predictions for two classes (C1 and C6) and > 90% accuracy for three classes (C4, C5 and C10). It outperforms l-SVM in three classes and yields equal results in four classes. Overall, comparable accuracies were observed for both methods in at least 10 of the 14 classes. For G-TSS, low accuracies are mainly observed for C8, C12 and C13. For C12, its intrinsically heterogeneous nature has been discussed in Haferlach et al. (2010). For C8 and C13, this could be due to the imbalanced training sample sizes, which may violate the equal class prior probabilities assumed by TSS. In addition, as the confusion matrices for these three classification problems suggest (see Appendix C), some samples in the poorly scoring classes are actually classified as closely related subclasses; for example, 28 samples in C8 are misclassified as the closely related C3. Finally, G-TSS uses only three TSS classifiers, as compared with the multiclass model in the MILE study, which contains $\binom{14}{2} = 91$ SVM classifiers. Far fewer genes are involved in making predictions through G-TSS, and the decision process is transparent and potentially interpretable.

Table 8.

Comparison of acute leukemia classification methods. The number of correct classifications is followed by the corresponding accuracy (in percentage) for each class.

Class G-TSS l-SVM
C1 5 (100.0) 4 (80.0)
C2 20 (87.0) 23 (100.0)
C3 51 (82.3) 53 (85.5)
C4 75 (94.9) 75 (94.9)
C5 62 (96.9) 59 (92.2)
C6 10 (100.0) 10 (100.0)
C7 30 (85.7) 22 (62.9)
C8 76 (48.1) 141 (89.2)
C9 14 (87.5) 16 (100.0)
C10 19 (95.0) 19 (95.0)
C11 17 (85.0) 20 (100.0)
C12 11 (64.7) 15 (88.2)
C13 127 (79.4) 148 (92.5)
C14 17 (70.8) 17 (70.8)

3.5 Pathway-based classification

As it is well recognized that many functionally related genes are typically involved in the mechanisms of complex diseases such as cancer, one popular approach in gene expression analysis is to investigate these naturally defined sets of genes rather than all genes at once. Pathway-based classification using expression profiles has been shown in recent studies (Gatza et al., 2010 and Kim et al., 2012) to provide results that are more biologically meaningful. In this section, we demonstrate the ability of TSS to integrate biological information from traditional pathway analysis for cancer classification.

Pathways are collections of related genes, and they can be ranked according to their ability to discriminate the phenotypes under study, a process often referred to as enrichment analysis. The pathways identified through enrichment analysis can be useful in a variety of ways. In particular, a natural and efficient feature selection for TSS is to restrict the search for top scoring sets to genes within the significant pathways recognized in the enrichment analysis step. One advantage of integrating TSS with pathway analysis is the ability to detect subtle but consistent changes in the expression of a small group of functionally related genes, which may not reach statistical significance in a conventional univariate analysis based on all genes.
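The feature-selection idea above, restricting the search for top scoring sets to enriched pathways, can be sketched as follows. The scoring function and the FDR cutoff are placeholders, and the exhaustive inner search stands in for the paper's search algorithms.

```python
from itertools import combinations

def pathway_restricted_search(pathways, fdr, score, m, threshold=0.2):
    """Search for the best m-gene set separately within each pathway
    that passes the FDR cutoff, instead of over all genes at once.
    `score` is a placeholder for the TSS scoring function; `pathways`
    maps pathway names to gene lists, `fdr` maps names to FDR values."""
    best = (float('-inf'), None)
    for name, genes in pathways.items():
        if fdr[name] > threshold:            # keep only enriched pathways
            continue
        for s in combinations(genes, m):     # exhaustive search within the pathway
            best = max(best, (score(s), s))
    return best
```

Because each pathway typically contains tens to a few hundred genes (Table 9), the within-pathway search is far cheaper than a genome-wide one.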

The pathway-based TSS was tested on Leukemia using the same training (38 samples) and testing (34 samples) protocol as in the original paper (Golub et al., 1999). Pathway information was collected from KEGG (Kanehisa et al., 2004), although other databases such as BioCarta (http://www.biocarta.com/) or the Broad Institute (http://www.broad.mit.edu/gsea/) could also be used. The enrichment method used was the GSA approach proposed by Efron and Tibshirani (2006), which is based on the well-known GSEA procedure (Subramanian et al., 2005) and also supports multiclass analysis.

The Bioconductor package (Gentleman et al., 2004) in R was used, and 226 KEGG pathways were found for Leukemia. Table 9 lists all differentially expressed pathways identified by GSA at the default false discovery rate threshold of 0.2. We built a TSS classifier by searching within each of these pathways for three-gene sets with the maximum score. The final classifier contained five top scoring sets with a score of 2.94. The classification result on the independent testing set is compared with those of the benchmark classifiers in Table 10.

Table 9.

Significant KEGG pathways identified on Leukemia. The (default) false discovery rate (FDR) is 0.2 and the (default) size (i.e., the number of genes) of a pathway is limited to between 15 and 500.

Gene set Size p-value FDR
Primary immunodeficiency 32 < 0.001 < 0.001
T cell receptor signaling pathway 96 < 0.001 < 0.001
B cell receptor signaling pathway 68 < 0.001 < 0.001
Hematopoietic cell lineage 104 < 0.001 < 0.001
Rheumatoid arthritis 88 0.005 0.178

Table 10.

Comparison of classification methods on Leukemia.

Method Test error No. of genes
Golub et al. (1999) 4/34 50
kNN 4/34 7129
NB 4/34 7129
RF 4/34 1273
PAM 1/34 47
l-SVM 5/34 7129
Pathway TSS 1/34 10

The pathway-based TSS achieved the highest accuracy while using the smallest number of genes. The final decision rule consists of five top scoring sets whose genes come from two pathways, primary immunodeficiency and hematopoietic cell lineage, both statistically significant (p < 0.001 by GSA) and both of clear biological relevance. The former relates to the disruption of cellular immunity observed in patients with defects in T cells or in both T and B cells; the latter is linked to blood-cell development, which progresses from hematopoietic stem cells.

4 Discussion

The study of cancer via microarray analysis is producing an ever-growing set of multiclass classification problems. As the limitations (e.g., reproducibility and interpretability) of traditional machine learning techniques have become increasingly appreciated, we anticipate the need for new and effective methodologies to address these problems. In this article, we have introduced a new approach for the classification of multiple cancer classes based on microarray data. The main advantage of our method lies in the simplicity and power of the top scoring gene sets, which provide clear decision boundaries and allow for potential biological interpretations. Such simple models have begun to gain favor in recent studies, see e.g., Kaur et al. (2012) and Haibe-Kains et al. (2012).

Our classification approach makes specific hypotheses about the predictive significance of relative gene expression comparisons within top scoring sets. Although such comparisons may not represent the actual mechanisms of complex diseases, this does not diminish the usefulness of our method for identifying diagnostic or prognostic biomarkers associated with cancer. In fact, given the amount of data typically available for microarray analysis (especially for some types of cancer), the empirical distribution of relative comparisons appears to be one of the statistics that can be robustly estimated. In addition, such comparisons within a group of genes can be viewed as a highly simplified model of a genetic network, which is well recognized to be involved in various diseases.

The predictive ability of our approach has been demonstrated on a variety of gene expression datasets. The seven common and publicly available microarray datasets provide a good opportunity to test our method in typical small-sample learning situations. In this case, we have shown the robust performance of our classifier across different datasets. Moreover, we have explored the ability of our method to perform classification on an extremely large dataset from the MILE study, which is quite challenging due to the large number of classes and unbalanced sample sizes. In this situation, our approach has also produced comparable predictive accuracies to a large ensemble of SVMs on an independent test set. We used only three TSS classifiers for class prediction as compared to 91 SVM classifiers used in the MILE study. This result demonstrates the potential of our method for handling large multiclass problems.

It is also possible to extend our method when sufficient data are available. Considering gene sets whose “max” gene changes over classes is one of many ways to investigate possible perturbations in genetic networks through relative expression comparison. In fact, the number of complete orderings (permutations) of even a few genes is large (e.g., 4! = 24, 5! = 120), making it impractical to estimate the distribution of the orderings. Our approach provides a way to combine some of these orderings to gain statistical significance. As the sample size grows, more orderings can be considered in the modeling process, giving a more accurate estimate of the statistical dependency structures among genes.

As recent studies on pathway-based classification have made good progress, combining TSS with gene pathway analysis is a promising way to incorporate biological information while reducing the data dimension. Our results indicate that subtle but reliable changes in expression among genes within pathways do exist and can be useful for classification, even though such signals are often missed when the whole set of genes is studied. At the same time, pathway-based TSS allows one to identify components of complex genetic networks in which genes differentially contribute to the phenotypes of interest, and these components can serve as important targets for further investigation.

Acknowledgements

The authors would like to thank Donald Geman and the reviewers for their insightful comments and suggestions. This work was partially supported by NIH-NCRR Grant UL1 RR 025005.

Appendix

A Bayesian decision-theoretic interpretation

In this section, we provide an interpretation of the scoring equation of the TSS classifier (see (3) in Methods) using Bayesian decision theory, and derive the optimal Bayes classifier under a general loss function and class priors. Consider an m-class classification problem; TSS aims to find an m-gene set $S = \{i_1, i_2, \ldots, i_m\}$ such that

$$P\big[\arg\max\{X_r,\, r \in S\} = i_c \,\big|\, y = c\big] \gg P\big[\arg\max\{X_r,\, r \in S\} = i_c \,\big|\, y \neq c\big], \quad \forall\, c \in \{1, 2, \ldots, m\}.$$

Now suppose the class conditional probability distribution associated with gene expression comparisons in S is given by

Class
	y = 1	y = 2	...	y = m
$X_{i_1} > \max\{X_r, r \in S \setminus i_1\}$	$p_{11}$	$p_{12}$	...	$p_{1m}$
$X_{i_2} > \max\{X_r, r \in S \setminus i_2\}$	$p_{21}$	$p_{22}$	...	$p_{2m}$
......	......
$X_{i_m} > \max\{X_r, r \in S \setminus i_m\}$	$p_{m1}$	$p_{m2}$	...	$p_{mm}$

A decision procedure δ can be constructed in which the outcome of each comparison in the table above is taken as indicative of a sample from a distinct class. In this situation, m classes lead to m! possible decision procedures for a given gene set; one such decision procedure is illustrated below.

X	δ(X)
$X_{i_1} > \max\{X_r, r \in S \setminus i_1\}$	3
$X_{i_2} > \max\{X_r, r \in S \setminus i_2\}$	5
......	......
$X_{i_m} > \max\{X_r, r \in S \setminus i_m\}$	2

Next, a loss function can be introduced for δ by specifying the penalties for misclassification as follows

	δ = 1	δ = 2	...	δ = m
y = 1	$l_{11}$	$l_{12}$	...	$l_{1m}$
y = 2	$l_{21}$	$l_{22}$	...	$l_{2m}$
...	......
y = m	$l_{m1}$	$l_{m2}$	...	$l_{mm}$

Based on the tables above, R(i, δ), the risk function of δ for class y = i can be written as

$$R(i, \delta) = \sum_{j=1}^{m} l_{ij}\, p(\delta = j \mid y = i).$$

Consequently, r(δ), the Bayes risk of δ is given by

$$r(\delta) = \sum_{i=1}^{m} \pi_i\, R(i, \delta)$$

where πi is the prior probability for class y = i. Therefore, the Bayes risk associated with δ is given by

$$r(\delta) = \sum_{i=1}^{m} \pi_i \sum_{j=1}^{m} l_{ij}\, p(\delta = j \mid y = i), \tag{A-1}$$

and the decision rule δ* satisfying

$$\delta^* = \arg\min_{\delta}\, r(\delta)$$

is the optimal rule, referred to as the Bayes rule. As mentioned earlier, for a given gene set there are m! possible decision procedures, and since the number of possible gene sets is also finite, the optimal Bayes rule δ* for the problem can be found by searching for the rule that minimizes r(δ) over all gene sets.

It is important to note that the Bayesian optimality of the decision rule described above only applies when the gene set used for classification has been determined. Otherwise, the development of the Bayes rule requires the joint probability distribution of all genes. In fact, no Bayesian theory is directly related to the choice of gene set for classification.

Equation (A-1) uses a general loss function and general class prior probabilities. In practice, the choice of loss function and class priors depends on the problem. For example, the empirical estimate nc/N can be used, where nc is the sample size of class c and N is the total sample size. In the context of this paper, the sample sizes of microarray datasets are quite limited and the sample proportions in a particular dataset may not reflect the actual distribution in the population. Therefore, we used equal class priors in our approach. In addition, without further information about the relative importance of various misclassifications, we assumed a 0-1 loss function, i.e.,

$$l_{ij} = \begin{cases} 0, & \text{if } i = j, \\ 1, & \text{otherwise.} \end{cases}$$

In this case, R(i, δ) becomes

$$R(i, \delta) = \sum_{j=1}^{m} l_{ij}\, p(\delta = j \mid y = i) = \sum_{j \neq i} p(\delta = j \mid y = i) = 1 - p(\delta = i \mid y = i),$$

and r(δ) is

$$r(\delta) = \sum_{i=1}^{m} \pi_i\, R(i, \delta) = 1 - \frac{1}{m} \sum_{i=1}^{m} p(\delta = i \mid y = i).$$

Then minimizing r(δ) is equivalent to finding

$$\max_{\delta}\, \sum_{i=1}^{m} p(\delta = i \mid y = i).$$

The optimal rule for the equation above can be found heuristically using equation (3) in Section 2.2, and it turns out to be given by the top scoring sets.
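For a fixed gene set, the maximization above amounts to choosing, among the m! decision procedures, the assignment of comparisons to classes with the largest diagonal sum. A brute-force sketch, feasible only for small m; the probability matrix is hypothetical input, in practice estimated from training data:

```python
from itertools import permutations

def best_assignment(P):
    """For a fixed gene set with conditional-probability matrix P,
    where P[r][c] = p(comparison r wins | y = c), find the assignment
    of comparisons to classes that maximizes sum_i p(delta = i | y = i).
    Brute force over all m! decision procedures."""
    m = len(P)
    best_score, best_delta = -1.0, None
    for perm in permutations(range(m)):   # perm[r] = class assigned to comparison r
        score = sum(P[r][perm[r]] for r in range(m))
        if score > best_score:
            best_score, best_delta = score, perm
    return best_score, best_delta
```

For larger m this is an assignment problem, solvable in polynomial time, but the brute force mirrors the m! enumeration described in the text.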

B The acceleration algorithm

The acceleration algorithm described here generalizes the pruning algorithm introduced by Tan et al. (2005) for the TSP classifier to the multiclass case. As in the binary TSP method, an important step of the multiclass TSS approach is the search for top scoring gene sets; once the search is completed, the decision rule follows immediately. However, given the large number of genes in microarray data, the search is often computationally expensive. We previously introduced two methods for finding gene sets with high scores. They are significantly more efficient than exhaustive search, but may still be slow when combined with schemes such as cross-validation. The algorithm here therefore aims to accelerate the search process within the cross-validation loop.

In TSS, different methods can be employed in the search process, but only top scoring sets are kept. Gene sets that cannot achieve the top score can therefore be excluded from the search, although identifying them would ordinarily require a complete comparison among all gene sets, and in a typical cross-validation loop one such comparison is needed at each iteration. However, we show below that the acceleration algorithm produces a small list of gene sets such that a comparison among these gene sets alone suffices to find the top scoring sets.

Let $r_g(n)$ denote the score obtained for a given gene set g when a subset of n samples is left out of the N training samples during cross-validation. The lower bound $L_g(n)$ and the upper bound $U_g(n)$ are defined as

$$L_g(n) \triangleq \min\{r_g(n) : \text{any size-}n\text{ subset left out}\}, \qquad U_g(n) \triangleq \max\{r_g(n) : \text{any size-}n\text{ subset left out}\}.$$

Now suppose the lower and upper bounds have been obtained for all possible gene sets {gi, i = 1, 2, ...}. Rank all lower bounds from largest to smallest and let L denote the largest lower bound. Without loss of generality, assume L = Lg1(n). Then the following claim holds:

Claim

If $U_{g_i}(n) < L$, then the gene set $g_i$ cannot be a top scoring set on the N − n retained samples, for any size-n subset left out.

Proof

By the definition of $U_{g_i}(n)$, we have $r_{g_i}(n) \le U_{g_i}(n)$. Since $U_{g_i}(n) < L$ and $L = L_{g_1}(n) \le r_{g_1}(n)$, the following inequalities hold for any size-n subset left out:

$$r_{g_i}(n) \le U_{g_i}(n) < L \le r_{g_1}(n).$$

Therefore, at least one gene set, namely $g_1$, scores higher than $g_i$ regardless of the choice of the size-n subset, and the claim follows.

The reduced list Ω (see Table A.1) typically contains only a few gene sets, so identifying top scoring sets from Ω is extremely fast. The significant improvement in efficiency is achieved by reusing Ω in each iteration of the cross-validation. The lower and upper bounds for a given gene set are obtained by calculating all possible scores when any size-n subset is left out; unless a large n and a large number of classes are considered simultaneously, this step is also efficient.
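A minimal sketch of the pruning step, assuming the score of a gene set on any retained training subset is available as a black-box function; this illustrates the bound computation and the construction of the reduced list Ω, not the actual implementation.

```python
from itertools import combinations

def reduced_list(gene_sets, score, samples, n):
    """Acceleration pruning: compute L_g(n) and U_g(n) for each gene set
    by scoring it on every size-(N-n) retained subset, then keep only the
    sets whose upper bound reaches the largest lower bound L.
    `score(g, kept)` is a placeholder for the TSS score on `kept` samples."""
    bounds = {}
    for g in gene_sets:
        vals = [score(g, [s for s in samples if s not in set(out)])
                for out in combinations(samples, n)]    # all size-n holdouts
        bounds[g] = (min(vals), max(vals))
    L = max(lo for lo, _ in bounds.values())            # largest lower bound
    return [g for g in gene_sets if bounds[g][1] >= L]  # Omega: surviving sets
```

During cross-validation, only the survivors in Ω need to be rescored at each iteration, which is where the speedup comes from.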

In practice, G in Table A.1 can be any gene set collection considered in the search process. For example, the greedy search generates a number of sub-classification problems from the original problem, and in each of these sub-problems only top scoring sets are stored. Therefore, the acceleration algorithm can be applied at each step of the greedy search to yield a reduced list of gene sets that could possibly be identified as top scoring sets during cross-validation. As a result, the full greedy search needs to be applied only once, on the training set.

The acceleration algorithm extends immediately to the k-TSS classifier, which uses the top k scoring sets as its decision rule. In this case, only step 2 in Table A.1 needs to change: L is set to the k-th largest lower bound, because any gene set whose upper bound is less than this L clearly cannot be among the top k scoring sets during cross-validation. The search for the top k scoring sets is therefore also quite efficient.

C Confusion matrices for acute leukemia subtypes

Section 3.4 provides the classification results of applying the greedy search-based TSS classifier on a large cohort of acute leukemia samples from the MILE project. The classification uses a two-step decision tree (Figure 4) to decompose the original problem into three sub-classification problems: a three-class problem at the top level, and a seven- and six-class problem at the bottom level. The following confusion matrices are for these three problems respectively. True/gold standard (GS) classifications of samples are presented in rows and predictions are in columns.

Table A.1.

Description of the acceleration algorithm.

Acceleration Algorithm
Input: N training samples, gene set collection G={g1,g2,}
Output: The reduced gene set list Ω.
1. For each gene set $g_i$, compute the lower bound $L_{g_i}(n)$ and the upper bound $U_{g_i}(n)$ over all possible ways of leaving n training samples out.
2. Rank all $L_{g_i}(n)$ in descending order and set $L = \max_i L_{g_i}(n)$.
3. Output the list Ω consisting of all $g_i$ for which $U_{g_i}(n) \ge L$.

Table A.2.

Confusion matrices for classification of acute leukemia subtypes.

Predicted
GS B-ALL T-ALL AML
B-ALL 352 2 3
T-ALL 2 75 2
AML 3 16 238
Predicted
GS C1 C2 C3 C5 C6 C7 C8
C1 5 0 0 0 0 0 0
C2 0 20 0 0 0 2 1
C3 0 0 51 0 0 2 9
C5 0 0 0 62 0 1 1
C6 0 0 0 0 10 0 0
C7 0 0 2 1 0 30 2
C8 2 2 28 9 3 33 76
Predicted
GS C9 C10 C11 C12 C13 C14
C9 14 0 0 0 0 0
C10 0 19 0 0 0 0
C11 0 0 17 0 1 0
C12 0 0 0 11 4 0
C13 0 1 1 3 127 16
C14 1 0 0 1 5 17

References

  1. Armstrong SA, Staunton JE, Silverman LB, Pieters R, den Boer ML, Minden MD, Sallan SE, Lander ES, Golub TR, Korsmeyer SJ. Mll translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics. 2002;30(1):41–7. doi: 10.1038/ng765. [DOI] [PubMed] [Google Scholar]
  2. Beer DG, Kardia SL, Huang CC, Giordano TJ, Levin AM, Misek DE, Lin L, Chen G, Gharib TG, Thomas DG, Lizyness ML, Kuick R, Hayasaka S, Taylor JM, Iannettoni MD, Orringer MB, Hanash S. Gene-expression profiles predict survival of patients with lung adeno-carcinoma. Nature Medicine. 2002;8(8):816–24. doi: 10.1038/nm733. [DOI] [PubMed] [Google Scholar]
  3. Breiman L. Random forests. Machine Learning. 2001;45(1):5–32. [Google Scholar]
  4. Burgess DJ. Cancer genetics: Initially complex, always heterogeneous. Nature Reviews Cancer. 2011;11:153. doi: 10.1038/nrc3019. [DOI] [PubMed] [Google Scholar]
  5. Cheok MH, Yang W, Pui CH, Downing JR, Cheng C, Naeve CW, Relling MV, Evans WE. Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells. Nature Genetics. 2003;34(1):85–90. doi: 10.1038/ng1151. [DOI] [PubMed] [Google Scholar]
  6. Chih-Chung C, Chih-Jen L. Libsvm: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011;2(3) [Google Scholar]
  7. Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20(3):273–297. [Google Scholar]
  8. Dehan E, Ben-Dor A, Liao W, Lipson D, Frimer H, Rienstein S, Simansky D, Krupsky M, Yaron P, Friedman E, Rechavi G, Perlman M, Aviram-Goldring A, Izraeli S, Bittner M, Yakhini Z, Kaminski N. Chromosomal aberrations and gene expression profiles in non-small cell lung cancer. Lung Cancer. 2007;56(2):157–84. doi: 10.1016/j.lungcan.2006.12.010. [DOI] [PubMed] [Google Scholar]
  9. Dyrskjot L, Thykjaer T, Kruhoffer M, Jensen JL, Marcussen N, Hamilton DS, Wolf H, Orntoft TF. Identifying distinct classes of bladder carcinoma using microarrays. Nature Genetics. 2003;33(1):90–6. doi: 10.1038/ng1061. [DOI] [PubMed] [Google Scholar]
  10. Dyrskjot L, Zieger K, Real FX, Malats N, Carrato A, Hurst C, Kotwal S, Knowles M, Malmstrom PU, de la Torre M, Wester K, Allory Y, Vordos D, Caillault A, Radvanyi F, Hein AM, Jensen JL, Jensen KM, Marcussen N, Orntoft TF. Gene expression signatures predict outcome in non-muscle-invasive bladder carcinoma: a multicenter validation study. Clinical Cancer Research. 2007;13(12):3545–51. doi: 10.1158/1078-0432.CCR-06-2940. [DOI] [PubMed] [Google Scholar]
  11. Eddy JA, Sung J, Geman D, Price ND. Relative expression analysis for molecular cancer diagnosis and prognosis. Technology in Cancer Research and Treatment. 2010;9(2):149–59. doi: 10.1177/153303461000900204. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Edelman LB, Toia G, Geman D, Zhang W, Price ND. Two-transcript gene expression classifiers in the diagnosis and prognosis of human diseases. BMC Genomics. 2009;10:583. doi: 10.1186/1471-2164-10-583. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Efron B, Tibshirani R. Technical report. Stanford University; 2006. On testing the significance of sets of genes. http://www-stat.stanford.edu/~tibs/GSA/ [Google Scholar]
  14. Gatza ML, Lucas JE, Barry WT, Kim JW, Wang Q, Crawford MD, Datto MB, Kelley M, Mathey-Prevot B, Potti A, Nevins JR. A pathway-based classification of human breast cancer. Proceedings of the National Academy of Sciences of the United States of America. 2010;107(15):6994–9. doi: 10.1073/pnas.0912708107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Geman D, d'Avignon C, Naiman DQ, Winslow RL. Classifying gene expression profiles from pairwise mrna comparisons. Statistical Applications in Genetics and Molecular Biology. 2004;3 doi: 10.2202/1544-6115.1071. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Gentleman RC, Carey VJ, Bates DM. Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology. 2004;5:R80. doi: 10.1186/gb-2004-5-10-r80. others. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Golub T, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286(5439):531–7. doi: 10.1126/science.286.5439.531. [DOI] [PubMed] [Google Scholar]
  18. Grate LR. Many accurate small-discriminatory feature subsets exist in microarray transcript data: biomarker discovery. BMC Bioinformatics. 2005;6:97. doi: 10.1186/1471-2105-6-97. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Haferlach T, Kohlmann A, Wieczorek L, Basso G, Kronnie GT, Bene MC, De Vos J, Hernmandez JM, Hofmann WK, Mills KI, Gilkes A, Chiaretti S, Shurtle SA, Kipps TJ, Rassenti LZ, Yeoh AE, Papenhausen PR, Liu WM, Williams PM, Foa R. Clinical utility of microarray-based gene expression profiling in the diagnosis and subclassification of leukemia: report from the international microarray innovations in leukemia study group. Journal of Clinical Oncology. 2010;28(15):2529–37. doi: 10.1200/JCO.2009.23.4732. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Haibe-Kains B, Desmedt C, Loi S, Culhane AC, Bontempi G, Quackenbush J, Sotiriou C. A three-gene model to robustly identify breast cancer molecular subtypes. Journal of the National Cancer Institute. 2012;104(4):311–25. doi: 10.1093/jnci/djr545. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Irizarry R, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed T. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4(2):249–64. doi: 10.1093/biostatistics/4.2.249. [DOI] [PubMed] [Google Scholar]
  22. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M. The kegg resource for deciphering the genome. Nucleic Acids Research. 2004;32:D277–80. doi: 10.1093/nar/gkh063. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Kaur P, Schlatzer D, Cooke K, Chance MR. Pairwise protein expression classifier for candidate biomarker discovery for early detection of human disease prognosis. BMC Bioinformatics. 2012;13:191. doi: 10.1186/1471-2105-13-191. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, Meltzer PS. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine. 2001;7(6):673–9. doi: 10.1038/89044. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Kim S, Kon M, DeLisi C. A pathway-based classification of human breast cancer. Biology Direct. 2012;7:21. doi: 10.1186/1745-6150-7-21. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Leban G, Bratko I, Petrovic U, Curk T, Zupan B. VizRank: finding informative data projections in functional genomics by machine learning. Bioinformatics. 2005;21(3):413–4. doi: 10.1093/bioinformatics/bti016.
  27. Lin X. Ph.D. thesis. The Johns Hopkins University; 2008. Rank-based methods for statistical analysis of gene expression microarray data.
  28. Lin X, Afsari B, Marchionni L, Cope L, Parmigiani G, Naiman DQ, Geman D. The ordering of expression among a few genes can provide simple cancer biomarkers and signal BRCA1 mutations. BMC Bioinformatics. 2009;10:256. doi: 10.1186/1471-2105-10-256.
  29. Mramor M, Leban G, Demsar J, Zupan B. Visualization-based cancer microarray data classification analysis. Bioinformatics. 2007;23(16):2147–54. doi: 10.1093/bioinformatics/btm312.
  30. Patnaik SK, Kannisto E, Knudsen S, Yendamuri S. Evaluation of microRNA expression profiles that may predict recurrence of localized stage I non-small cell lung cancer after surgical resection. Cancer Research. 2010;70(1):36–45. doi: 10.1158/0008-5472.CAN-09-3153.
  31. Quackenbush J. Microarray analysis and tumor classification. New England Journal of Medicine. 2006;354(23):2463–72. doi: 10.1056/NEJMra042342.
  32. Shah MA, Khanin R, Tang L, Janjigian YY, Klimstra DS, Gerdes H, Kelsen DP. Molecular classification of gastric cancer: a new paradigm. Clinical Cancer Research. 2011;17(9):2693–701. doi: 10.1158/1078-0432.CCR-10-2203.
  33. Shen L, Tan EC. Reducing multiclass cancer classification to binary by output coding and SVM. Computational Biology and Chemistry. 2006;30(1):63–71. doi: 10.1016/j.compbiolchem.2005.10.008.
  34. Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics. 2005;21(5):631–43. doi: 10.1093/bioinformatics/bti033.
  35. Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics. 2008;9:319. doi: 10.1186/1471-2105-9-319.
  36. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America. 2005;102(43):15545–50. doi: 10.1073/pnas.0506580102.
  37. Tan AC, Naiman DQ, Xu L, Winslow RL, Geman D. Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics. 2005;21(20):3896–904. doi: 10.1093/bioinformatics/bti631.
  38. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences of the United States of America. 2002;99(10):6567–72. doi: 10.1073/pnas.082099299.
  39. Xu L, Geman D, Winslow RL. Large-scale integration of cancer microarray data identifies a robust common cancer signature. BMC Bioinformatics. 2007;8:275. doi: 10.1186/1471-2105-8-275.
  40. Zhao H, Logothetis CJ, Gorlov IP. Usefulness of the top-scoring pairs of genes for prediction of prostate cancer progression. Prostate Cancer and Prostatic Diseases. 2010;13(3):252–9. doi: 10.1038/pcan.2010.9.