Biostatistics (Oxford, England)
. 2016 May 5;17(4):722–736. doi: 10.1093/biostatistics/kxw018

Study design in high-dimensional classification analysis

Brisa N Sánchez 1,*, Meihua Wu 2, Peter X K Song 3, Wen Wang 3
PMCID: PMC5031947  PMID: 27154835

Abstract

Advances in high throughput technology have accelerated the use of hundreds to millions of biomarkers to construct classifiers that partition patients into different clinical conditions. Prior to classifier development in actual studies, a critical need is to determine the sample size required to reach a specified classification precision. We develop a systematic approach for sample size determination in high-dimensional (large p, small n) classification analysis. Our method utilizes the probability of correct classification (PCC) as the optimization objective function and incorporates the higher criticism thresholding procedure for classifier development. Further, we derive the theoretical bound of maximal PCC gain from feature augmentation (e.g. when molecular and clinical predictors are combined in classifier development). Our methods are motivated and illustrated by a study using proteomics markers to classify post-kidney transplantation patients into stable and rejecting classes.

Keywords: Design, Higher criticism threshold, Large p small n, Linear discrimination, Sample size

1. Introduction

In recent years, high-dimensional classification analysis has received heightened attention due to its importance for personalized medicine: if validated classifiers (e.g. diagnostic tests) are available, clinicians can use them to design effective treatment plans for individual patients (Hamburg and Collins, 2010). Several approaches to deriving classifiers based on high-dimensional biomarkers have been developed in the literature, and when applied to real world experiments, some promising results have been reported, e.g. Clarke and others (2008), Simon (2008), and Wang and others (2008). However, rapid technological advances enabling the collection of hundreds to millions of biomarkers from a single patient give rise to study design challenges (Mardis, 2008; Schuster, 2008), including how to determine adequate sample size to train classifiers.

We address two key study design issues for classification studies with high-dimensional predictors (i.e. “Inline graphic” scenarios); namely, how to: (i) determine, in advance, a sample size that accounts for the planned data analyses and (ii) assess the gain in classification precision associated with feature augmentation. Given space constraints, we focus only on study design issues, an area that has received less attention. The current design literature in this area focuses on classifiers that are constructed by first screening biomarkers that may be differentially expressed across disease groups (i.e. assuming biomarkers important for classification are sparse), and subsequently combining the selected biomarkers into a classification rule. These types of classifiers rely on threshold cutoffs for selecting important features, and the estimation of these thresholds needs to be accounted for at the design stage.

This work is motivated by two collaborative projects. The first is our work with the Nephrotic Syndrome Study Network (NEPTUNE), which studies molecular mechanisms for rare renal diseases. One of NEPTUNE's goals is to identify tissue-based mRNA biomarkers to classify patients into risk groups and predict disease remission. In practice, a comprehensive generalization of the study design methods used in NEPTUNE (Gadegbeku and others, 2013) is frequently needed.

A second collaboration is joint work with a clinician at The University of Michigan Kidney Transplantation Center, who aimed to predict patients' graft survival status (stable vs. rejecting), a measure of treatment effectiveness, after kidney transplant. The proposed study will proceed in two stages (Figure 1). First, the investigator would like to know how many transplant patients are sufficient to derive and validate a powerful classifier based on protein biomarkers. Second, the investigator would like to know if the classification prediction can be improved by adding clinical predictors such as routine measures of patients' laboratory tests (e.g. albumin and hemoglobin) and demographic characteristics—i.e. the gain in prediction accuracy due to feature augmentation.

Fig. 1.

Study flowchart for constructing classifiers of graft survival after kidney transplant. In Stage I, investigators will collect Inline graphic proteomics biomarkers using microarrays for each patient in the stable and rejecting groups, and a classifier of graft survival status will be developed. In Stage II, investigators will consider adding other clinical characteristics and patient demographics in the hope of improving classification precision.

Aside from sample size determination methods that optimize hypothesis testing criteria in high-dimensional data settings (e.g. Hwang and others, 2002), few sample size methods for building classifiers are available. One groundbreaking method for classification analysis was proposed by Dobbin and Simon (2007) (hereafter DS2007), which is based on optimizing the probability of correct classification (PCC, Mukherjee and others, 2003). The classifier's PCC (or sensitivity or specificity) is a more appropriate target for sample size determination in classification studies than the classical concepts of Type I and Type II errors for testing differences across groups. One limitation of DS2007's method is that the threshold for feature selection is optimized for the given design parameters (e.g. number of important features and their effect size), and this threshold is treated as known in the sample size calculation. As a result, this design approach has no counterpart in the data analysis stage, because during analyses the true differences between groups, and thus the threshold, are unknown. Liu and others (2012) develop sample size determination methods for classifiers based on single nucleotide polymorphisms, which Liu and others (2014) extend to multi-class classifiers. de Valpine and others (2009) develop a simulation-approximation approach to determine sample size.

The benefits of feature augmentation in terms of the receiver operating characteristic curve have been investigated (Pepe and others, 2006; Cai and Cheng, 2008; Pfeiffer and Bura, 2008; Lin and others, 2011). However, no theoretical work specifically quantifies the amount of PCC gain due to feature augmentation, nor identifies the scenarios under which PCC gain is maximized.

Section 2 describes the model formulation for the features, the PCC definition, and two thresholding techniques used to select important features with which the classification rule is constructed. One is the higher criticism threshold (HCT) proposed by Donoho and Jin (2009), which is particularly relevant when important features are rare and weak, and the other is a method based on cross validation (CV). Section 3 presents our proposed methods for sample size determination which incorporate thresholding techniques. We introduce a new simulation method to efficiently evaluate the PCC of HCT-based classifiers. In Section 4, we establish a novel inequality with both the upper and lower bounds for PCC gain due to feature augmentation. Section 5 illustrates the performance of three sample size determination strategies and their use in the second motivating example of predicting kidney graft status, followed by a discussion.

2. Model, PCC, and feature selection

In this section, we review existing work and the modeling setup that serve as the context for our proposed sample size determination methods. Suppose the study population can be divided into two groups: Group Inline graphic and Group Inline graphic. The design question is how many subjects, Inline graphic, should be enrolled so that a set of training data Inline graphic can be collected to construct a classifier, where Inline graphic is the group label for subject Inline graphic; population group prevalences are Inline graphic and Inline graphic, respectively; and Inline graphic is a high-dimensional vector of features for subject Inline graphic (e.g. proteomics biomarkers). For brevity of exposition, in the rest of the paper we assume the sample size collected from each group is equal by design (e.g. stratified sampling is used), irrespective of the group prevalences in the population. Supplementary material available at Biostatistics online describes modifications needed when sample sizes are unequal for the groups.

We assume that features follow the multivariate normal distribution within each group with equal variances: Inline graphic; and Inline graphic, where the vector Inline graphic, with elements Inline graphic, Inline graphic, represents the signal strengths of the features. Setting Inline graphic is purely for notational convenience and is not needed in practice; this notation allows us to write the mean differences of features between groups in terms of a single vector, namely Inline graphic. A higher value of Inline graphic suggests a better separation between two groups by feature Inline graphic, and consequently feature Inline graphic would be important for classification. The assumption of equal variances is needed to construct a linear classification rule (Johnson and Wichern, 2002), and we adopt it at the design stage. Without loss of generality, we assume the diagonal elements of Inline graphic equal 1, which enables us to refer to Inline graphic as the vector of effect sizes. In practice, this is achieved by dividing each feature by its pooled standard deviation calculated with the training data.

The dimension Inline graphic of Inline graphic may be very high; hence it is commonly assumed that only a small number of features, say, Inline graphic, have non-zero effect sizes. The Inline graphic features are considered essential to construct a classifier, while the other Inline graphic features are noise (Inline graphic). For ease of exposition, we reorder features such that the important Inline graphic features are listed first, i.e. Inline graphic, where Inline graphic is a zero vector of length Inline graphic. The values of effect sizes, Inline graphic, are unknown at the design stage and Inline graphic is supplied by subject-matter scientists. Assume that it is possible to specify a lower bound Inline graphic for Inline graphic based on some prior research results or a certain scientific hypothesis, and replace Inline graphic by Inline graphic, where Inline graphic is a vector of ones of length Inline graphic. Then, we will simply state the effect size as Inline graphic. Under the linear classification rule and given the weighting scheme defined in (2.1), using a lower bound in the design will lead to a conservative estimate of PCC and thus sample size, which is acceptable in practice when no reliable pilot data are available to estimate Inline graphic satisfactorily.

In this paper, we consider a linear classifier in the design stage. Constructing a linear classifier is equivalent to using training data Inline graphic to derive a certain weighting scheme Inline graphic that allocates weights Inline graphic. Let Inline graphic and Inline graphic denote the inner product of two vectors. The classification rule for a new subject is: if Inline graphic, subject Inline graphic is assigned to Group Inline graphic; otherwise to Group Inline graphic. In general, the weighting scheme Inline graphic can assign non-zero weights Inline graphic to all available features; however, this can harm PCC if most of them are not important. Instead, when Inline graphic, regularized feature selection allows us to include only important features in the classifier, thus enhancing the classifier's PCC. Feature selection is primarily driven by pairwise associations between features Inline graphic and group membership Inline graphic. Let Inline graphic be the vector of test statistics derived from training data Inline graphic. Then Inline graphic for unimportant features, and Inline graphic for Inline graphic where Inline graphic is the signal strength. A natural strategy for feature selection is to choose an appropriate threshold Inline graphic such that we only include features satisfying Inline graphic, Inline graphic. Given threshold Inline graphic, we incorporate this feature selection mechanism into the definition of the weighting scheme:

graphic file with name M71.gif (2.1)

The threshold Inline graphic is determined empirically given Inline graphic; Section 2.2 describes procedures to select it.

2.1. Objective function and connection to sample size

Following DS2007, we use PCC as the primary objective function for sample size determination. With two groups, the PCC is the weighted average of the classifier's sensitivity and specificity, with weights equal to the group prevalences. Under the assumed model and for fixed weights Inline graphic, it can be easily shown that Inline graphic, where Inline graphic is the standard normal CDF. The weights Inline graphic, however, are random and depend on Inline graphic (and thus Inline graphic) and the sample size Inline graphic of Inline graphic.
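Since the expression referenced here is rendered as an image in this version, it may help to record the standard form of this quantity. For fixed weights $w$, mean-difference vector $\delta$, common covariance $\Sigma$, and equal group prevalences, the PCC of a linear rule under the assumed Gaussian model takes the form below (our reconstruction, consistent with the denominator $\sqrt{w^{\top}\Sigma w}$ discussed in Section 3):

```latex
\mathrm{PCC}(w) \;=\; \Phi\!\left(\frac{w^{\top}\delta}{2\sqrt{w^{\top}\Sigma w}}\right)
```

so the PCC increases with the projected signal $w^{\top}\delta$ and decreases with the variance $w^{\top}\Sigma w$ of the linear predictor.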

To make the connection between PCC and sample size, it is useful to think of the PCC as dependent on sample size and the selected threshold Inline graphic, hence defining Inline graphic as the PCC of a classifier built using training data on Inline graphic subjects. Further, to define the optimal PCC, it is useful to think of the upper bound of the PCC among linear classifiers. A linear classifier can reach the upper bound if it is the oracle classifier, or if it is constructed from a study with infinite sample size (DS2007). This optimal classifier has Inline graphic, which simplifies to Inline graphic when Inline graphic, Inline graphic and Inline graphic is replaced by Inline graphic at the design stage.

Clearly, a practically achievable PCC will be lower than the upper bound, and its exact value depends on how much relevant information can be extracted from the training data. At the design stage, a PCC target is set lower than Inline graphic; for instance, DS2007 set Inline graphic as the smallest PCC that satisfies Inline graphic. The sample size requirement is then defined as the smallest Inline graphic such that Inline graphic. If the inverse function of Inline graphic could be analytically derived, then the sample size would be easily determined by Inline graphic. Since a closed form of Inline graphic rarely exists, we employ numerical algorithms (e.g. the binary search algorithm) to determine sample size.
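As a concrete sketch of the numerical inversion, the following example applies a binary search to a monotone PCC curve. The curve `pcc` here is a hypothetical smooth stand-in, not the paper's PCC function, and the search range is an assumption.

```python
import math

def smallest_n(pcc_of_n, target, n_lo=4, n_hi=4096):
    """Numerically invert a monotone PCC(n) curve: return the smallest
    sample size n with pcc_of_n(n) >= target, assuming the target is
    reachable within the search range."""
    if pcc_of_n(n_hi) < target:
        raise ValueError("target PCC not reached within search range")
    while n_lo < n_hi:
        mid = (n_lo + n_hi) // 2
        if pcc_of_n(mid) >= target:
            n_hi = mid          # mid is feasible; look for a smaller n
        else:
            n_lo = mid + 1      # mid is infeasible; search above it
    return n_lo

# Illustration with a hypothetical smooth PCC curve (not the paper's):
pcc = lambda n: 0.9 * (1 - math.exp(-n / 30))
n_star = smallest_n(pcc, 0.80)
```

Because PCC is nondecreasing in the sample size, the bisection is guaranteed to return the smallest feasible n whenever the target is attainable within the range; the same wrapper applies when `pcc_of_n` is an MC estimate, at the cost of MC noise near the target.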

Finally, one must consider how to calculate PCC at the design stage. DS2007 consider an approximation of PCC to warrant its fast computation. First they compute an optimal, fixed threshold Inline graphic using information on the assumed Inline graphic, and show with Monte Carlo (MC) simulations that for a fixed Inline graphic the approximation Inline graphic is accurate. However, at the analysis stage, optimizing Inline graphic depends only on the data Inline graphic, since Inline graphic is unknown. We describe two procedures that determine Inline graphic based only on simulated data; they are advantageous because they have an actual parallel in the analysis stage.

2.2. Feature selection procedures

2.2.1. CV threshold.

We consider the following straightforward K-fold CV thresholding method to determine Inline graphic given only training data Inline graphic, denoted by Inline graphic. Such Inline graphic is chosen to maximize the apparent PCC, Inline graphic, which is a function of both threshold Inline graphic and training data Inline graphic. The apparent PCC can be computed via the following steps:

  1. Follow the sampling strategy for Inline graphic to divide it into Inline graphic equal-sized subsets, Inline graphic (e.g. divide cases and controls separately if stratified sampling is used to collect Inline graphic).

  2. For each Inline graphic, treat Inline graphic, which has sample size Inline graphic, as a CV testing set and the rest of the data Inline graphic as a CV training set; given a threshold value Inline graphic, which is one of many threshold values on a dense grid, use (2.1) to obtain the weighting Inline graphic from the training set Inline graphic, where Inline graphic-scores are calculated from dataset Inline graphic only.

  3. For Inline graphic, calculate the apparent PCC based on testing set Inline graphic, as Inline graphic.

  4. Calculate the overall apparent PCC: Inline graphic.

  5. Calculate the apparent PCC on a dense grid of values for Inline graphic, and select the optimal threshold Inline graphic that maximizes the overall apparent PCC: Inline graphic.
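The five steps above can be sketched as follows. This is an illustrative implementation under simplifying assumptions (a random rather than stratified fold split, balanced groups, hard-thresholded two-sample z-score weights, and a midpoint score cutoff), not the authors' exact procedure.

```python
import numpy as np

def cv_threshold(X, y, grid, K=5, rng=None):
    """Pick the hard threshold t maximizing K-fold apparent accuracy.
    X: n x p feature matrix (standardized), y in {0,1}. Hypothetical
    sketch: weights are z-scores zeroed below t in absolute value, and
    a point is assigned to class 1 when its weighted score exceeds the
    midpoint of the two training-class mean scores."""
    rng = np.random.default_rng(rng)
    folds = rng.permutation(len(y)) % K          # random (not stratified) K-fold split
    best_t, best_acc = grid[0], -np.inf
    for t in grid:
        correct = 0
        for k in range(K):
            tr, te = folds != k, folds == k
            m1 = X[tr & (y == 1)].mean(0)
            m0 = X[tr & (y == 0)].mean(0)
            s = X[tr].std(0, ddof=1) + 1e-12
            z = (m1 - m0) / (s * np.sqrt(4 / tr.sum()))  # approx. two-sample z-scores
            w = np.where(np.abs(z) >= t, z, 0.0)         # feature selection at t
            score = X @ w
            cut = 0.5 * (score[tr & (y == 1)].mean() + score[tr & (y == 0)].mean())
            correct += np.sum((score[te] > cut) == (y[te] == 1))
        acc = correct / len(y)                   # overall apparent PCC at t
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Toy illustration: two strong features among 20, 30 subjects per group.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 20))
X[y == 1, :2] += 2.0
t_hat = cv_threshold(X, y, grid=[0.5, 1.5, 3.0], rng=1)
```

In the paper's setting the fold split would follow the stratified sampling design and the grid would be dense; both are coarsened here for brevity.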

For an analysis employing the CV threshold, the expected PCC is calculated over the distribution of Inline graphic, Inline graphic This procedure optimizes the threshold Inline graphic without relying on knowledge of Inline graphic; when embedded in sample size calculations, it accounts for uncertainty in Inline graphic. When features are independent and important features have the same effect size, CV thresholding with weights in (2.1) results in the optimal Bayes classification rule, up to uncertainty in Inline graphic, and thus the optimal sample size.

2.2.2. Higher criticism threshold.

Proposed by Donoho and Jin (2009), HCT provides a data-driven approach to determine Inline graphic in a high-dimensional classification analysis. HCT determines a suitable threshold Inline graphic based on the distribution of Inline graphic-values obtained from univariate tests for associations of individual features with the group assignment, and then the weighting scheme (2.1) can be applied. Let Inline graphic denote the HCT procedure when applied to training data Inline graphic. The association test for feature Inline graphic with group Inline graphic results in a two-sided Inline graphic-value Inline graphic. For an unimportant feature, Inline graphic; for an important feature the resulting Inline graphic does not follow Inline graphic and tends to be smaller than those of the unimportant features. HCT only focuses on the smallest Inline graphic Inline graphic-values sorted in increasing order: Inline graphic; a typical choice is Inline graphic. Donoho and Jin (2009) showed that the Inline graphicth ordered Inline graphic-value with Inline graphic provides an appropriate cutoff for feature selection: features whose Inline graphic-values are less than Inline graphic are considered important for classification. The Inline graphic-score threshold is thus Inline graphic. The resulting PCC is Inline graphic.
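A minimal sketch of the HCT computation, using the standard higher-criticism objective from Donoho and Jin applied to the smallest fraction of p-values; details such as the p-value clipping are our own numerical guard, not part of the published procedure.

```python
import numpy as np
from scipy.stats import norm

def hct_threshold(z, alpha=0.10):
    """Higher-criticism threshold from feature z-scores.
    Compute two-sided p-values, sort them, evaluate the HC objective
    HC(i) = sqrt(p) * (i/p - p_(i)) / sqrt(p_(i) * (1 - p_(i)))
    over the smallest alpha*p ordered p-values, and return the |z| of
    the maximizing order statistic as the selection threshold."""
    p = len(z)
    pvals = 2 * norm.sf(np.abs(z))              # two-sided p-values
    order = np.argsort(pvals)
    k = max(1, int(alpha * p))                  # focus on the k smallest
    i = np.arange(1, k + 1)
    ps = np.clip(pvals[order[:k]], 1e-12, 1 - 1e-12)  # guard the denominator
    hc = np.sqrt(p) * (i / p - ps) / np.sqrt(ps * (1 - ps))
    i_star = np.argmax(hc)
    return np.abs(z[order[i_star]])

# Toy illustration: 5 strong features among 1000.
rng = np.random.default_rng(0)
z = rng.normal(size=1000)
z[:5] += 6.0
t_hct = hct_threshold(z)
```

Features with |z-score| at or above the returned value would then receive non-zero weight under (2.1).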

As detailed by Donoho and Jin (2009), the theory of HCT brings new insights to the asymptotic properties of linear classifiers under the so-called rare-and-weak model, which is of interest in the context of high-dimensional classification because it gives a structure under which the number of important features Inline graphic and signal strength Inline graphic vary with the total number of features Inline graphic. This structure enables study of asymptotic classification feasibility. In this rare-and-weak model, Inline graphic increases with Inline graphic according to Inline graphic (Donoho and Jin, 2009), where Inline graphic controls the sparsity. Similarly, instead of Inline graphic, which becomes arbitrarily large with increasing sample size, in this model the signal strength follows Inline graphic; Inline graphic controls the signal strength, and Inline graphic for some Inline graphic. This implies that important features become rarer and their effect size becomes weaker when the total number of features Inline graphic increases, which is regarded as a more realistic mechanism (NCI-NHGRI, 2007; Jin, 2009) than a mechanism where PCC always increases as Inline graphic increases. It has been shown (Donoho and Jin, 2009; Jin, 2009) that as the number of features Inline graphic the PCC of any linear classifier is characterized only by Inline graphic through a certain function Inline graphic given in Section B of supplementary material available at Biostatistics online: (i) when Inline graphic, the classification analysis is asymptotically feasible, in the sense that the PCC of the HCT linear classifier approaches 1 as Inline graphic and (ii) when Inline graphic, the classification analysis is asymptotically infeasible. The asymptotic result of feasibility is critical to guide the design of classification analysis.
Verifying the inequality Inline graphic can help investigators make a timely decision on the feasibility of a study at the planning stage (see Section 5 for an illustration).

3. Implementation

Given an approach to evaluate PCC, sample size can be determined by inverting the PCC function numerically. We thus focus on PCC estimation approaches that incorporate thresholding procedures so the resulting PCC would more closely reflect what can be achieved in practice. Section C of supplementary material available at Biostatistics online describes the evaluation of PCC for CV-based classifiers. Our primary contributions here focus on approaches needed for HCT-based classifiers.

Since Inline graphic is a data-driven thresholding procedure, MC simulation can be applied to evaluate Inline graphic. However, because Inline graphic depends on the training data Inline graphic exclusively through the Inline graphic smallest Inline graphic-values, we propose a computationally fast MC algorithm that directly simulates the Inline graphic smallest Inline graphic-values from the distribution of order statistics instead of simulating Inline graphic. The algorithm takes the following steps:

  1. Simulate Inline graphic-scores for Inline graphic important features from Inline graphic, Inline graphic.

  2. Convert the above Inline graphic-scores to two-sided Inline graphic-values by Inline graphic, Inline graphic.

  3. Simulate a random variable Inline graphic.

  4. Simulate variables Inline graphic independently from Inline graphic.

  5. Sort vector Inline graphic in an ascending order.

The Inline graphic smallest values in Inline graphic have the same joint distribution as the Inline graphic smallest Inline graphic-values Inline graphic derived from Inline graphic (see supplementary material available at Biostatistics online, Section D for the proof). As a by-product, the above algorithm also supplies the Inline graphic-scores Inline graphic for the Inline graphic important features. This proposed algorithm has a clear computational benefit: instead of generating Inline graphic random variables needed for Inline graphic (and calculating all Inline graphic-values), only Inline graphic variables are generated. The computational efficiency ratio is Inline graphic (since Inline graphic is much smaller than Inline graphic); e.g. for Inline graphic and Inline graphic, the algorithm is Inline graphic 1000 times more efficient. Furthermore, the algorithm below used to calculate PCC does not require generating test data.
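Steps 3–5 rest on a standard order-statistic fact: among N iid Uniform(0,1) p-values, the k-th smallest follows a Beta(k, N-k+1) distribution, and conditional on its value b, the k-1 values below it are iid Uniform(0, b). A sketch of the null-feature part of the algorithm follows (merging with the important-feature p-values, as in Steps 1, 2, and 5, is then a sort); this illustrates the order-statistic device, and the paper's exact algorithm may differ in detail.

```python
import numpy as np

def smallest_null_pvalues(N, k, rng):
    """Simulate the k smallest of N iid Uniform(0,1) p-values without
    generating all N: draw the k-th order statistic B ~ Beta(k, N-k+1),
    then k-1 iid Uniform(0, B) draws for the values below it, and sort."""
    b = rng.beta(k, N - k + 1)
    small = np.sort(rng.uniform(0, b, size=k - 1))
    return np.append(small, b)

# Toy illustration: 25 smallest p-values out of 10,000 null features,
# generated from 25 random draws instead of 10,000.
rng = np.random.default_rng(0)
pv = smallest_null_pvalues(10_000, 25, rng)
```

This is where the quoted efficiency ratio comes from: only on the order of k draws are needed rather than N.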

To evaluate Inline graphic, we repeat for Inline graphic iterations:

  1. For given Inline graphic, Inline graphic and Inline graphic, use Steps 1–5 above to generate the Inline graphic smallest Inline graphic-values Inline graphic and the corresponding Inline graphic-scores for Inline graphic important features.

  2. Determine the optimal threshold Inline graphic.

  3. Use (2.1) to calculate weights Inline graphic of important features using their Inline graphic-scores and Inline graphic.

  4. Calculate Inline graphic, where Inline graphic is the number of elements in Inline graphic that are smaller than Inline graphic.

Then, the fast MC estimate of Inline graphic is given by Inline graphic.

Correlated features. When features are correlated, estimating PCC is more challenging. One difficulty pertains to computing the denominator, Inline graphic, within the PCC formula, Inline graphic. DS2007 proposed replacing this quantity with an upper bound Inline graphic, where Inline graphic is the largest eigenvalue of Inline graphic. This bound could potentially be applied at Step 4 of the HCT-based algorithm above when calculating Inline graphic. However, this does not work for the HCT method since Step 1 relies on independence to prove that the Inline graphic used to generate the Inline graphic smallest Inline graphic-values follows a Beta distribution. Hence, we instead evaluate the expected PCC using an alternative MC simulation strategy. The simulation strategy follows Steps 1–4 as above, with the following modifications. First, we specify an assumed working correlation structure Inline graphic for the features. In Step 1, we now use this assumed Inline graphic to generate Inline graphic correlated features on Inline graphic subjects, and compute Inline graphic statistics and accompanying Inline graphic-values. The choice of structure for Inline graphic (e.g. block diagonal) may be informed by substantive knowledge, if available, and the magnitude of the correlations should be varied to assess its impact on the sample size calculation. Steps 2 and 3 of the algorithm remain unchanged to reflect that at the analysis stage the features are screened using pairwise associations and treated as independent. In Step 4, we evaluate Inline graphic, where Inline graphic is an estimate of Inline graphic based on the working correlation and is defined as Inline graphic and Inline graphic denotes the Inline graphic entry of Inline graphic. As with the eigenvalue approach, this approximation relies on the working correlation structure Inline graphic; however, this approach is less conservative than using the eigenvalue-based bound (see Illustrations section). 
The MC approach for the CV method with correlated features similarly relies on generating correlated data (see supplementary material available at Biostatistics online).
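To make the two design-stage approximations of the denominator concrete, the toy example below compares the eigenvalue-based bound with the working-correlation plug-in for an assumed block-exchangeable structure; the structure, correlation value, and weights are hypothetical.

```python
import numpy as np

# Compare two design-stage approximations of the PCC denominator
# sqrt(w' R w) for selected features under a working correlation R:
# the eigenvalue bound sqrt(lambda_max) * ||w|| (DS2007-style) versus
# plugging the working correlation into the quadratic form directly.
rho = 0.5
R = np.kron(np.eye(3), np.array([[1, rho], [rho, 1]]))   # 3 exchangeable 2x2 blocks
w = np.array([1.0, 1.0, 0.8, 0.0, 0.5, 0.0])             # hypothetical weights

direct = np.sqrt(w @ R @ w)                  # working-correlation plug-in
lam_max = np.linalg.eigvalsh(R).max()
bound = np.sqrt(lam_max) * np.linalg.norm(w) # eigenvalue-based upper bound
```

Since Inline graphic (the standard normal CDF) decreases as the denominator grows, the larger eigenvalue-based value yields a lower PCC, consistent with the text's remark that the plug-in approach is less conservative.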

4. Feature augmentation

Because in practice multiple sources of features are typically collected, a key study design question is to investigate the potential PCC gain resulting from adding new sources of features to the classification analysis. For simplicity, let us focus on two sources of features (e.g. molecular biomarkers and clinical variables). Denote features already in the study by Type A and the new set by Type B, with respective dimensions Inline graphic and Inline graphic. For subject Inline graphic, we collect measurements Inline graphic and Inline graphic. As in Section 2, assume that the joint conditional distribution for features Inline graphic and Inline graphic is:

graphic file with name M276.gif

where Inline graphic and Inline graphic and Inline graphic, Inline graphic, and Inline graphic are the respective effect size vectors, variance, and covariance matrices. Here we do not place any restrictions on either the effect size vectors (e.g. sparsity is not assumed) or the variance matrices.

To study PCC gain, we need to consider the PCC of linear classifiers in three cases, including (i) Type A features only; (ii) Type B features only; and (iii) Type A features augmented with Type B features. Denote the respective weights in Cases (i) and (ii) by Inline graphic and Inline graphic; we do not place any assumptions on these weights, e.g. they may be derived from any thresholding procedure for feature selection. Conditioning on the weights Inline graphic and Inline graphic, and assuming group prevalence is Inline graphic, the PCC of the classifier in Case (i) is Inline graphic; in Case (ii) is Inline graphic; and finally, in Case (iii) is Inline graphic. When Inline graphic, the term Inline graphic drops from the denominator and we obtain Inline graphic. In Section E of supplementary material available at Biostatistics online, we prove that:

graphic file with name M293.gif (4.1)

The first equality holds when the relative variance of the linear predictors goes to 0, i.e. Inline graphic where Inline graphic if Inline graphic and Inline graphic if Inline graphic. In the latter case, the equality is reached when Inline graphic and the linear predictors have equal variance Inline graphic. Inequality (4.1) provides the upper bound of the PCC of the classifier when the linear predictors are combined into a new classification rule. If either Inline graphic or Inline graphic approach 1, the upper bound will approach 1.

In practice, the features may be correlated, Inline graphic, thus the linear predictors Inline graphic and Inline graphic will too. Given the monotonicity of Inline graphic and that the covariance of the linear predictors, Inline graphic, appears in the denominator Inline graphic, the PCC gain will depend on the sign of the correlation: Inline graphic. In Section E of supplementary material available at Biostatistics online, we also prove that Inline graphic, and that Inline graphic where Inline graphic is the relative standard deviation of the linear predictors. Finally, in Section E of supplementary material available at Biostatistics online we also prove inequality (4.1) for any proportion Inline graphic when optimal weights are used in classifier construction.
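For intuition about the upper bound's behavior, the sketch below combines two uncorrelated linear predictors under equal prevalence, where the standardized separations add in quadrature and PCC = Phi(d/2). This illustrates the form of the bound under these assumptions rather than reproducing inequality (4.1) exactly.

```python
import math
from scipy.stats import norm

def combined_pcc_upper(pcc_a, pcc_b):
    """Upper bound on the PCC after augmenting two uncorrelated linear
    predictors: recover each separation d = 2 * Phi^{-1}(PCC), combine
    in quadrature, and map back through PCC = Phi(d/2). A sketch of the
    bound's form, not the paper's exact statement."""
    d_a = 2 * norm.ppf(pcc_a)
    d_b = 2 * norm.ppf(pcc_b)
    return norm.cdf(math.sqrt(d_a**2 + d_b**2) / 2)

# E.g. augmenting a PCC-0.80 classifier with an uncorrelated PCC-0.70 one:
pab = combined_pcc_upper(0.80, 0.70)
```

Consistent with the inequality, the combined value always lies between max(PCC_A, PCC_B) and 1, and adding an uninformative source (PCC = 0.5) leaves the bound unchanged.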

5. Illustrations

PCC estimation and sample size determination given effect size and number of important features. For a given effect size, the PCC evaluated at the design stage will depend on a pre-specified thresholding procedure and in turn impact the sample size. Thus, we first illustrate the estimated PCC using the DS2007 method, and using the CV and HCT thresholding methods. Figure 2 shows the PCC as a function of sample size when the number of available features is Inline graphic or Inline graphic. The PCC estimated by the DS method is always the highest, primarily because it uses the true effect size to choose the optimal threshold Inline graphic. However, due to its reliance on Inline graphic to obtain the optimal threshold at the design stage, the DS method has no counterpart in actual data analysis. On the other hand, the PCC estimated by the CV and the HCT methods rely only on the simulated data to estimate the threshold, which introduces uncertainty in the feature selection threshold, and thus yields lower PCC estimates. When the classification problem is more difficult (e.g. Inline graphic), HCT yields higher PCC than CV as expected (Donoho and Jin, 2008). Since PCC estimates from CV and HCT reflect more closely the achievable performance of the corresponding classifiers in real applications, the sample size estimates from these methods would better approximate the sample size required in practice.

Fig. 2.

Fig. 2.

PCC estimates as a function of sample size estimated by DS, CV, and HCT methods, assuming the minimal effect size of important features is Inline graphic and group prevalences are Inline graphic. (a) Inline graphic, (b) Inline graphic, (c) Inline graphic, and (d) Inline graphic. In (a) and (b), important features are rarer (Inline graphic important feature) compared with (c) and (d) Inline graphic, which results in a marked difference in Inline graphic (gray horizontal line). The PCC estimated using the DS method is always higher for a given sample size, leading to lower sample size estimates. When features are rarer, (a) and (b), HCT gives a higher PCC, leading to lower sample size requirements. When features are less rare Inline graphic, selecting features using HCT (CV) leads to lower sample size requirements for lower (higher) PCC targets compared with CV (HCT), given the crossing of the PCC curves. PCC estimates for CV and HCT are obtained using MC simulations using the algorithms described in Section 3, with 500 replicates for CV and 1000 replicates for HCT.

Figure 3 shows the sample size requirements for a range of effect sizes, Inline graphic or 10, and Inline graphic with Inline graphic. In general, sample sizes obtained from the DS method are consistently lower than with the CV or HCT methods, as expected from Figure 2. Sample sizes become comparable (difference Inline graphic2) when the effect size is large. However, when the features are relatively weaker, the DS method will tend to underestimate the needed sample size. Figure 2 also explains the fact that, for a fixed Inline graphic, the sample sizes from HCT shown in Figure 3 are lower in scenarios where features are rarer (i.e. Inline graphic), and lower for CV when features are less rare (Inline graphic). Hence, only the proposed HCT approach gives a sufficient sample size when features are relatively weaker and rarer, without being overly conservative (the CV method is conservative in those cases, since the HCT classifier can achieve the target PCC with a smaller sample size). These assertions are verified with MC simulations shown in Table 1 (top rows, where Inline graphic).

Fig. 3.

Fig. 3.

Sample size requirements estimated using DS, CV, and HCT design methods for a range of effect sizes Inline graphic. (a) Inline graphic, (b) Inline graphic, (c) Inline graphic, and (d) Inline graphic. For each effect size and combination of Inline graphic and Inline graphic, the Inline graphic is shown as the inset value on the Inline graphic-axis, and the target PCC is set as Inline graphic. Sample size estimates for CV and HCT are obtained by numerically inverting the PCC function and selecting the smallest Inline graphic that satisfies Inline graphic; Inline graphic is estimated using the MC algorithms described in Section 3 with 500 replicates for CV and 1000 for HCT. The sample size required decreases as effect sizes of important features increase, even in high target PCC cases. Sample sizes obtained with the DS method are lower, but, as shown in Table 1, underestimate the required sample size particularly for rare-and-weak features.
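The numerical inversion described in the legend (select the smallest Inline graphic whose estimated PCC meets the target) is a simple grid search; the toy PCC curve below is an illustrative stand-in for the Monte Carlo estimator, not the paper's model.

```python
import math

def smallest_n(pcc_fn, target, n_grid):
    """Invert a (possibly Monte Carlo) PCC-vs-n curve: return the smallest
    sample size in n_grid whose estimated PCC reaches the target."""
    for n in sorted(n_grid):
        if pcc_fn(n) >= target:
            return n
    return None  # target unattainable on this grid

# Toy monotone PCC curve standing in for an MC estimate (illustrative only).
toy_pcc = lambda n: 1.0 - 0.5 * math.exp(-n / 50.0)
n_req = smallest_n(toy_pcc, 0.90, range(10, 301, 10))  # -> 90 for this toy curve
```

In practice `pcc_fn` would wrap the CV or HCT Monte Carlo routine, so each evaluation is itself an average over replicates.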

Table 1.

Sample size Inline graphic calculated by DS, CV, and HCT design methods using the specified Inline graphic and differences between the target and what can be achieved in practice, Inline graphic

Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic DS CV HCT CV CV HCT HCT CV HCT
Inline graphic 1 500 0.8 0.605 98 0.035 0.018 134 -0.001 -0.014 118 0.019 -0.007
1.2 0.676 56 0.045 0.023 76 -0.003 -0.020 68 0.020 -0.002
1.6 0.738 36 0.047 0.028 50 -0.011 -0.023 46 0.008 -0.008
Inline graphic 1 10 000 0.8 0.605 138 0.033 0.026 182 -0.001 -0.013 170 0.011 -0.001
1.2 0.676 76 0.034 0.032 96 -0.005 -0.005 94 0.005 0.000
1.6 0.738 50 0.043 0.042 62 0.000 -0.005 62 0.000 -0.005
Inline graphic 10 500 0.8 0.847 94 0.014 0.031 102 0.000 0.025 146 -0.029 -0.002
1.2 0.921 38 0.015 0.027 44 -0.003 0.011 52 -0.015 -0.005
1.6 0.944 20 0.025 0.005 24 -0.004 -0.010 24 -0.004 -0.010
Inline graphic 10 10 000 0.8 0.847 138 0.010 0.032 152 0.001 0.021 202 -0.034 -0.001
1.2 0.921 58 0.021 0.032 66 -0.006 0.014 76 -0.019 0.000
1.6 0.944 30 0.031 0.027 36 -0.008 -0.001 36 -0.008 -0.001
Inline graphic 1 500 1.6 0.738 NA 44 -0.000 0.041 62 -0.034 -0.006
Inline graphic 0.738 NA 46 -0.005 0.060 168 -0.048 0.004
Inline graphic 0.738 NA 46 -0.002 0.054 160 -0.044 0.004
Inline graphic 10 500 1.6 0.762 26 -0.002 -0.037 26 -0.002 -0.037 18 0.064 -0.013
Inline graphic 0.762 26 0.029 -0.015 30 0.000 -0.030 24 0.032 -0.010
Inline graphic 0.762 NA 28 0.001 0.028 44 -0.039 -0.010
Inline graphic 0.762 NA 26 0.002 0.026 42 0.044 -0.006

The scenarios and respective sample sizes shown here are a subset of those shown in Figure 3 (see the Figure 3 legend for details on how sample sizes are obtained). Given the computed sample size Inline graphic, Inline graphic was computed by generating 1000 training datasets of size Inline graphic and test datasets of size 100; training and test datasets were generated according to the model defined by Inline graphic, Inline graphic, Inline graphic, and Inline graphic. The average of the Inline graphic across the 1000 replicates is shown. The Inline graphic are block diagonal as follows: Inline graphic is a Inline graphic identity; Inline graphic has its first block being a Inline graphic compound symmetry structure with correlation 0.80, denoted by Inline graphic, and its second block Inline graphic; Inline graphic has 50 blocks of Inline graphic; Inline graphic has first block Inline graphic, second block Inline graphic, and third block Inline graphic; Inline graphic has first block Inline graphic and second block Inline graphic. NA indicates that sample size estimates from the DS2007 eigenvalue method were prohibitive, Inline graphic.

It is evident from Figures 2 and 3 and Table 1 that it is important not only to calculate the sample size based on the PCC estimates achievable by the statistical methods used at the analysis stage, but also to select an analysis approach that can more efficiently attain the target PCC under a given parameter space. In rare-and-weak cases, for example, the HCT-based classifier has been shown to perform better (Donoho and Jin, 2008), and thus we recommend determining sample sizes using our proposed HCT sample size calculator in these scenarios.

Sample size calculations when features are correlated. The bottom part of Table 1 gives the sample sizes computed from each method under different structures for Inline graphic (Inline graphic). The DS2007 method using the correction based on the largest eigenvalue sometimes yields prohibitive sample sizes (Inline graphic, denoted as Inline graphic), because of the excessively large maximum eigenvalue of Inline graphic. Both CV- and HCT-based methods give sample sizes at which the target PCC is achieved. It is worth noting that when important features are very rare (e.g. Inline graphic), the HCT-based method yields conservative sample sizes whereas the CV method can achieve the same target PCC with lower sample sizes.
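The working covariance structures used in the bottom of Table 1 can be assembled from compound-symmetry blocks. The sketch below is illustrative (the block dimensions and correlation are examples matching the Inline graphic notation in the table note, not a definitive reconstruction); it also shows why an eigenvalue-based correction can explode, since the largest eigenvalue of a Inline graphic-dimensional CS block grows linearly in its dimension.

```python
import numpy as np

def cs_block(dim, rho):
    """Compound-symmetry block: 1 on the diagonal, rho off-diagonal."""
    return np.full((dim, dim), rho) + (1.0 - rho) * np.eye(dim)

def block_diag_cov(blocks):
    """Assemble a block-diagonal covariance matrix from square blocks."""
    p = sum(b.shape[0] for b in blocks)
    sigma = np.zeros((p, p))
    start = 0
    for b in blocks:
        d = b.shape[0]
        sigma[start:start + d, start:start + d] = b
        start += d
    return sigma

# One 10x10 CS(0.8) block plus an independent remainder; the largest
# eigenvalue of a d-dimensional CS(rho) block is 1 + (d - 1) * rho = 8.2 here,
# which is what inflates eigenvalue-based sample size corrections.
sigma = block_diag_cov([cs_block(10, 0.8), np.eye(490)])
rng = np.random.default_rng(0)
x = rng.multivariate_normal(np.zeros(500), sigma, size=50)  # correlated features
```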

Feature augmentation. Figure 4 illustrates the upper bound of the PCC (left column) and the PCC gain (right column) due to feature augmentation, as discussed in Section 4. First, in the case when features are independent (Figure 4(a)), we note that if both Inline graphic and Inline graphic are small (or one is large), then Inline graphic will be small (or large). Hence, the upper bound of the PCC of classifiers with both Type A and Type B features is only slightly higher than Inline graphic: combining two sets of features where both are very good or both are poor does not greatly improve the PCC. If both types of features are of medium quality (e.g. both Inline graphic and Inline graphic are in the medium range around Inline graphic), then feature augmentation yields the highest gain (at most 10%) in PCC. When features are negatively correlated, the PCC gain can be substantial (Figure 4(b)).
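The shape of this bound can be seen numerically under simplifying assumptions. The sketch below is an illustrative reconstruction, not the paper's exact Section 4 expression: it assumes equal group prevalences and independent normal linear predictors with equal variance, so that each set's PCC implies a Mahalanobis distance and the combined distance adds in quadrature.

```python
import numpy as np
from scipy.stats import norm

def pcc_bound_combined(pcc_a, pcc_b):
    """Upper bound on the PCC after combining two independent feature sets,
    assuming equal group prevalences and normal linear predictors with equal
    variance (an illustrative reconstruction, not the paper's exact bound)."""
    d_a = 2.0 * norm.ppf(pcc_a)   # Mahalanobis distance implied by PCC_A
    d_b = 2.0 * norm.ppf(pcc_b)
    return float(norm.cdf(np.hypot(d_a, d_b) / 2.0))

# The gain is largest for medium-quality feature sets and small when both
# sets are already strong or both are weak:
gains = {pa: pcc_bound_combined(pa, pa) - pa for pa in (0.55, 0.8, 0.99)}
```

Under these assumptions the maximal gain from combining two equally informative sets is below 10%, consistent with the behavior described for Figure 4(a).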

Fig. 4.

Fig. 4.

The upper bounds of Inline graphic when two sets of features, A and B, are combined, compared with using a single type of features. In (a), features are independent, Inline graphic (all Inline graphic), or Inline graphic and have Inline graphic. The upper bound is attained when the linear predictors have equal variance. In (b), Inline graphic, group 1 prevalence is Inline graphic, and Inline graphic gives the relative standard deviation of the linear predictors. The upper bound shown in (b) is attained when the linear predictors constructed from features A and B are perfectly negatively correlated.

An application. We demonstrate the proposed methods using the kidney transplant study (Figure 1). In Stage I, the investigator hypothesizes that among Inline graphic proteins, at least Inline graphic of them are likely informative for predicting graft survival status. Pilot data, Inline graphic, showed an effect size of approximately Inline graphic (i.e. Inline graphic). Given these design parameters, the signal strength is Inline graphic; the sparsity parameter is Inline graphic; and the strength parameter Inline graphic lies above the feasibility boundary: Inline graphic. Hence, the classification problem is feasible (see Section B of the supplementary material available at Biostatistics online), and we can proceed to calculate sample size requirements for given PCC targets (Figure 5(a)).

Fig. 5.

Fig. 5.

Application: study design for predicting graft survival after kidney transplant. (a) As expected, a larger sample size is required when a higher PCC target is chosen and other design parameters are held constant (Inline graphic). The DS method yields the lowest sample size requirements, but these may underestimate the needed sample size (see Table 1); whether HCT or CV requires the larger sample size depends on the target PCC. (b) PCC with the proteomics markers (Inline graphic) is fixed at 0.7 (dashed line). If the new features are not as informative as the proteomics markers (Inline graphic), combining both sets of features leads to a limited improvement of the classifier, and in some cases data augmentation might actually degrade the classifier (shaded area below 0.7) due to the noise introduced by low-quality features in the new data source (e.g. some proteins can be measured with substantial error if urine samples are not stored under stringent conditions). If the new features are more informative (Inline graphic), incorporating them can substantially enhance the PCC.

For Stage II, the investigator considers improving the PCC of the classifier built with proteomics biomarkers only, say Inline graphic, by incorporating an additional set of features, including proteinuria, GFR, hematuria, albumin, and cholesterol. Figure 5(b) shows a region describing the achievable PCC with both types of features (Inline graphic) for various values of the PCC with the additional features alone (i.e. Inline graphic). Substantial enhancements to the PCC occur when the second set of features is at least as informative as the proteomics biomarkers (Inline graphic).

6. Discussion

We addressed two study design questions for studies using high-dimensional features for classification. First, we developed sample size determination strategies for CV- and HCT-based classifiers. Our strategies incorporate uncertainty of feature selection thresholds within the PCC calculation, which is particularly relevant when important features are hypothesized to be rare and weak. We proposed a computationally efficient algorithm based on order statistics to compute the PCC, and thus the sample size requirements, for the HCT-based classifier. Second, we established an inequality for the upper and lower bounds of the achievable PCC associated with feature augmentation. The approaches were illustrated with numerical examples and a practical study, and are implemented in our R package HDDesign (available at https://cran.r-project.org/).

Our proposed methods can be improved in the following directions. Classification into more than two groups commonly arises in clinical studies, so extensions in this direction are of great importance. Strong deviations from linearity (e.g. U-shaped associations) may undermine the applicability of the proposed approaches; in this case, it may be possible to categorize the predictors and apply and/or extend the study design methods of Liu and others (2012) to the case of rare-and-weak features. It is also of interest to further investigate how correlations among features may be effectively incorporated into the sample size determination. We proposed to directly plug an assumed working correlation matrix into the CV- and HCT-based approaches. As expected, positive correlations among features result in larger required sample sizes, although not as large as those produced by DS2007's preliminary eigenvalue-based approach. Nevertheless, our approach requires specifying sensible working correlation structures at the design stage, which may be difficult to do in practice; varying the structure and magnitude of the correlations based on available scientific knowledge is therefore needed with our proposed approach. Further improvements in this direction may be possible by using the innovated HCT suggested by Hall and Jin (2010), or by developing sample size determination methods based on regularized regression approaches that do not require pre-filtering and hence do not rely on the marginal effects of the features. However, developing sample size calculations using regression-based procedures (e.g. LASSO) would require specifying the adjusted effect sizes and, importantly, quantifying the uncertainty in feature selection, which remains an open problem in high-dimensional inference.

In summary, we advocate the use of sample size determination methods that match, as closely as possible, the analytic approaches that will actually be applied at the data analysis stage and that capitalize on prior knowledge of the underlying mechanism of interest. If the important features have strong signals, both HCT- and DS-based approaches provide adequate sample size calculations, with little difference between them; given that the HCT method is computationally fast and accounts for uncertainty in the feature selection threshold, it is recommended in practice. If the important features are relatively abundant but weak, we recommend the CV approach, as it gives the least conservative sample size, albeit at greater computational cost. If the important features are rare and weak, we recommend the HCT-based approach, since it provides the desired sample sizes with little conservatism and is computationally efficient. Overall, our work builds upon and further advances the pioneering work of DS2007 for sample size determination in high-dimensional classification problems.

Supplementary material

Supplementary Material is available at http://biostatistics.oxfordjournals.org.

Funding

M.W. and B.N.S. acknowledge NIH grant R21DA024273 for salary support during the initial conduct of this study. P.X.K.S.'s research is funded in part by NIH U54-DK-083912-05 and NSF DMS-1513595. This work was also partially funded by grants NIH/EPA P20 ES018171/RD83480001 and P01 ES022844/RD83543601.


Acknowledgments

The publication's contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH or US EPA. Conflict of Interest: None declared.

References

  1. Cai T., Cheng S. (2008). Robust combination of multiple diagnostic tests for classifying censored event times. Biostatistics 9(2), 216–233.
  2. Clarke R., Ressom H. W., Wang A., Xuan J., Liu M. C., Gehan E. A., Wang Y. (2008). The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nature Reviews Cancer 8(1), 37–49.
  3. de Valpine P., Bitter H. M., Brown M. P. S., Heller J. (2009). A simulation-approximation approach to sample size planning for high-dimensional classification studies. Biostatistics 10(3), 424–435.
  4. Dobbin K. K., Simon R. M. (2007). Sample size planning for developing classifiers using high-dimensional DNA microarray data. Biostatistics 8(1), 101–117.
  5. Donoho D., Jin J. (2008). Higher criticism thresholding: optimal feature selection when useful features are rare and weak. Proceedings of the National Academy of Sciences 105(39), 14790–14795.
  6. Donoho D., Jin J. (2009). Feature selection by higher criticism thresholding achieves the optimal phase diagram. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 367(1906), 4449–4470.
  7. Gadegbeku C. A., Gipson D. S., Holzman L., Ojo A. O., Song P. X., Barisoni L., Sampson M. G., Kopp J. B., Lemley K. V., Nelson P. J. and others (2013). Design of the Nephrotic Syndrome Study Network (NEPTUNE): a multi-disciplinary approach to understanding primary glomerular nephropathy. Kidney International 83(4), 749–756.
  8. Hamburg M. A., Collins F. S. (2010). The path to personalized medicine. New England Journal of Medicine 363(4), 301–304.
  9. Hall P., Jin J. (2010). Innovated higher criticism for detecting sparse signals in correlated noise. The Annals of Statistics 38(3), 1686–1732.
  10. Hwang D., Schmitt W. A., Stephanopoulos G., Stephanopoulos G. (2002). Determination of minimum sample size and discriminatory expression patterns in microarray data. Bioinformatics 18(9), 1184–1193.
  11. Jin J. (2009). Impossibility of successful classification when useful features are rare and weak. Proceedings of the National Academy of Sciences 106(22), 8859–8864.
  12. Johnson R. A., Wichern D. W. (2002). Applied Multivariate Statistical Analysis, 5th edition. Upper Saddle River, NJ: Prentice-Hall.
  13. Lin H., Zhou L., Peng H., Zhou X. H. (2011). Selection and combination of biomarkers using ROC method for disease classification and prediction. Canadian Journal of Statistics 39(2), 324–343.
  14. Liu X., Wang Y., Rekaya R., Sriram T. N. (2012). Sample size determination for classifiers based on single-nucleotide polymorphisms. Biostatistics 13(2), 217–227.
  15. Liu X., Wang Y., Sriram T. N. (2014). Determination of sample size for a multi-class classifier based on single-nucleotide polymorphisms: a volume under the surface approach. Journal of Biomedical Informatics 15, 190.
  16. Mardis E. R. (2008). The impact of next-generation sequencing technology on genetics. Trends in Genetics 24(3), 133–141.
  17. Mukherjee S., Tamayo P., Rogers S., Rifkin R., Engle A., Campbell C., Golub T. R., Mesirov J. P. (2003). Estimating dataset size requirements for classifying DNA microarray data. Journal of Computational Biology 10(2), 119–142.
  18. NCI-NHGRI Working Group on Replication in Association Studies (2007). Replicating genotype-phenotype associations. Nature 447(7145), 655–660.
  19. Pepe M. S., Cai T., Longton G. (2006). Combining predictors for classification using the area under the receiver operating characteristic curve. Biometrics 62(1), 221–229.
  20. Pfeiffer R. M., Bura E. (2008). A model free approach to combining biomarkers. Biometrical Journal 50(4), 558–570.
  21. Schuster S. C. (2008). Next-generation sequencing transforms today's biology. Nature Methods 5(1), 16–18.
  22. Simon R. (2008). Development and validation of biomarker classifiers for treatment selection. Journal of Statistical Planning and Inference 138(2), 308–320.
  23. Wang Y., Miller D. J., Clarke R. (2008). Approaches to working in high-dimensional data spaces: gene expression microarrays. British Journal of Cancer 98(6), 1023–1028.
