Biostatistics (Oxford, England)
. 2016 May 5;17(4):722–736. doi: 10.1093/biostatistics/kxw018

Study design in high-dimensional classification analysis

Brisa N Sánchez 1,*, Meihua Wu 2, Peter X K Song 3, Wen Wang 3
PMCID: PMC5031947  PMID: 27154835

Abstract

Advances in high throughput technology have accelerated the use of hundreds to millions of biomarkers to construct classifiers that partition patients into different clinical conditions. Prior to classifier development in actual studies, a critical need is to determine the sample size required to reach a specified classification precision. We develop a systematic approach for sample size determination in high-dimensional (large p, small n) classification analysis. Our method utilizes the probability of correct classification (PCC) as the optimization objective function and incorporates the higher criticism thresholding procedure for classifier development. Further, we derive the theoretical bound of maximal PCC gain from feature augmentation (e.g. when molecular and clinical predictors are combined in classifier development). Our methods are motivated and illustrated by a study using proteomics markers to classify post-kidney transplantation patients into stable and rejecting classes.

Keywords: Design, Higher criticism threshold, Large p small n, Linear discrimination, Sample size

1. Introduction

In recent years, high-dimensional classification analysis has received heightened attention due to its importance for personalized medicine: if validated classifiers (e.g. diagnostic tests) are available, clinicians can use them to design effective treatment plans for individual patients (Hamburg and Collins, 2010). Several approaches to deriving classifiers based on high-dimensional biomarkers have been developed in the literature, and when applied to real world experiments, some promising results have been reported, e.g. Clarke and others (2008), Simon (2008), and Wang and others (2008). However, rapid technological advances enabling the collection of hundreds to millions of biomarkers from a single patient give rise to study design challenges (Mardis, 2008; Schuster, 2008), including how to determine adequate sample size to train classifiers.

We address two key study design issues for classification studies with high-dimensional predictors (i.e. “Inline graphic” scenarios); namely, how to: (i) determine, in advance, a sample size that accounts for the planned data analyses and (ii) assess the gain in classification precision associated with feature augmentation. Given space constraints, we focus only on study design issues, an area that has received less attention. The current design literature in this area focuses on classifiers that are constructed by first screening biomarkers that may be differentially expressed across disease groups (i.e. assuming biomarkers important for classification are sparse), and subsequently combining the selected biomarkers into a classification rule. These types of classifiers rely on threshold cutoffs for selecting important features, and the estimation of these thresholds needs to be accounted for at the design stage.

This work is motivated by two collaborative projects. The first is our work with the Nephrotic Syndrome Study Network (NEPTUNE), which studies molecular mechanisms for rare renal diseases. One of NEPTUNE's goals is to identify tissue-based mRNA biomarkers to classify patients into risk groups and predict disease remission. In practice, a comprehensive generalization of the study design methods used in NEPTUNE (Gadegbeku and others, 2013) is frequently needed.

A second collaboration is joint work with a clinician at The University of Michigan Kidney Transplantation Center, who aimed to predict patients' graft survival status (stable vs. rejecting), a measure of treatment effectiveness, after kidney transplant. The proposed study will proceed in two stages (Figure 1). First, the investigator would like to know how many transplant patients are sufficient to derive and validate a powerful classifier based on protein biomarkers. Second, the investigator would like to know if the classification prediction can be improved by adding clinical predictors such as routine measures of patients' laboratory tests (e.g. albumin and hemoglobin) and demographic characteristics—i.e. the gain in prediction accuracy due to feature augmentation.

Fig. 1.

Study flowchart for constructing classifiers of graft survival after kidney transplant. In Stage I, investigators will collect Inline graphic proteomics biomarkers using microarrays for each patient in the stable and rejecting groups, and a classifier of graft survival status will be developed. In Stage II, investigators will consider adding other clinical characteristics and patient demographics in the hope of improving classification precision.

Aside from sample size determination methods that optimize hypothesis testing criteria in high-dimensional data settings (e.g. Hwang and others, 2002), few sample size methods for building classifiers are available. One groundbreaking method for classification analysis was proposed by Dobbin and Simon (2007) (hereafter DS2007), which is based on optimizing the probability of correct classification (PCC, Mukherjee and others, 2003). The classifier's PCC (or sensitivity or specificity) is a more appropriate target for sample size determination in classification studies than the classical concepts of Type I and Type II errors for testing differences across groups. One limitation of DS2007's method is that the threshold for feature selection is optimized for the given design parameters (e.g. number of important features and their effect size), and this threshold is treated as known in the sample size calculation. As a result, this design approach has no counterpart in the data analysis stage, because during analyses the true differences between groups, and thus the threshold, are unknown. Liu and others (2012) develop sample size determination methods for classifiers based on single nucleotide polymorphisms, which Liu and others (2014) extend to multi-class classifiers. de Valpine and others (2009) develop a simulation-approximation approach to determine sample size.

The benefits of feature augmentation in terms of the receiver operating characteristic curve have been investigated (Pepe and others, 2006; Cai and Cheng, 2008; Pfeiffer and Bura, 2008; Lin and others, 2011). However, no theoretical work specifically quantifies the amount of PCC gain due to feature augmentation, nor identifies the scenarios under which PCC gain is maximized.

Section 2 describes the model formulation for the features, the PCC definition, and two thresholding techniques used to select important features with which the classification rule is constructed. One is the higher criticism threshold (HCT) proposed by Donoho and Jin (2009), which is particularly relevant when important features are rare and weak, and the other is a method based on cross validation (CV). Section 3 presents our proposed methods for sample size determination which incorporate thresholding techniques. We introduce a new simulation method to efficiently evaluate the PCC of HCT-based classifiers. In Section 4, we establish a novel inequality with both the upper and lower bounds for PCC gain due to feature augmentation. Section 5 illustrates the performance of three sample size determination strategies and their use in the second motivating example of predicting kidney graft status, followed by a discussion.

2. Model, PCC, and feature selection

In this section, we review existing work and the modeling setup that serve as the context for our proposed sample size determination methods. Suppose the study population can be divided into two groups: Group Inline graphic and Group Inline graphic. The design question is how many subjects, Inline graphic, should be enrolled so that a set of training data Inline graphic can be collected to construct a classifier, where Inline graphic is the group label for subject Inline graphic; population group prevalences are Inline graphic and Inline graphic, respectively; and Inline graphic is a high-dimensional vector of features for subject Inline graphic (e.g. proteomics biomarkers). For brevity of exposition, in the rest of the paper we assume the sample size collected from each group is equal by design (e.g. stratified sampling is used), irrespective of the group prevalences in the population. Supplementary material available at Biostatistics online describes modifications needed when sample sizes are unequal for the groups.

We assume that features follow the multivariate normal distribution within each group with equal variances: Inline graphic; and Inline graphic, where the vector Inline graphic, with elements Inline graphic, Inline graphic, represents the signal strengths of the features. Setting Inline graphic is purely for notational convenience and is not needed in practice; this notation allows us to write the mean differences of features between groups in terms of a single vector, namely Inline graphic. A higher value of Inline graphic suggests a better separation between two groups by feature Inline graphic, and consequently feature Inline graphic would be important for classification. The assumption of equal variances is needed to construct a linear classification rule (Johnson and Wichern, 2002), and we adopt it at the design stage. Without loss of generality, we assume the diagonal elements of Inline graphic equal 1, which enables us to refer to Inline graphic as the vector of effect sizes. In practice, this is achieved by dividing each feature by its pooled standard deviation calculated with the training data.

The dimension Inline graphic of Inline graphic may be very high; hence it is commonly assumed that only a small number of features, say, Inline graphic, have non-zero effect sizes. The Inline graphic features are considered essential to construct a classifier, while the other Inline graphic features are noise (Inline graphic). For ease of exposition, we reorder features such that the important Inline graphic features are listed first, i.e. Inline graphic, where Inline graphic is a zero vector of length Inline graphic. The values of effect sizes, Inline graphic, are unknown at the design stage and Inline graphic is supplied by subject-matter scientists. Assume that it is possible to specify a lower bound Inline graphic for Inline graphic based on some prior research results or a certain scientific hypothesis, and replace Inline graphic by Inline graphic, where Inline graphic is a vector of ones of length Inline graphic. Then, we will simply state the effect size as Inline graphic. Under the linear classification rule and given the weighting scheme defined in (2.1), using a lower bound in the design will lead to a conservative estimate of PCC and thus sample size, which is acceptable in practice when no reliable pilot data are available to estimate Inline graphic satisfactorily.

In this paper, we consider a linear classifier in the design stage. Constructing a linear classifier is equivalent to using training data Inline graphic to derive a certain weighting scheme Inline graphic that allocates weights Inline graphic. Let Inline graphic and Inline graphic denote the inner product of two vectors. The classification rule for a new subject is: if Inline graphic, subject Inline graphic is assigned to Group Inline graphic; otherwise to Group Inline graphic. In general, the weighting scheme Inline graphic can assign non-zero weights Inline graphic to all available features; however, this can harm PCC if most of them are not important. Instead, when Inline graphic, regularized feature selection allows us to include only important features in the classifier, thus enhancing the classifier's PCC. Feature selection is primarily driven by pairwise associations between features Inline graphic and group membership Inline graphic. Let Inline graphic be the vector of test statistics derived from training data Inline graphic. Then Inline graphic for unimportant features, and Inline graphic for Inline graphic where Inline graphic is the signal strength. A natural strategy for feature selection is to choose an appropriate threshold Inline graphic such that we only include features satisfying Inline graphic, Inline graphic. Given threshold Inline graphic, we incorporate this feature selection mechanism into the definition of the weighting scheme:

graphic file with name M71.gif (2.1)

The threshold Inline graphic is determined empirically given Inline graphic; Section 2.2 describes procedures to select it.

2.1. Objective function and connection to sample size

Following DS2007, we use PCC as the primary objective function for sample size determination. With two groups, the PCC is the weighted average of the classifier's sensitivity and specificity, with weights equal to the group prevalences. Under the assumed model and for fixed weights Inline graphic, it can be easily shown that Inline graphic, where Inline graphic is the standard normal CDF. The weights Inline graphic, however, are random and depend on Inline graphic (and thus Inline graphic) and the sample size Inline graphic of Inline graphic.
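Since the expression referenced here is rendered as an image in this version, it may help to record the standard form of this quantity. For fixed weights $w$, mean-difference vector $\delta$, common covariance $\Sigma$, and equal group prevalences, the PCC of a linear rule under the assumed Gaussian model takes the form below (our reconstruction, consistent with the denominator $\sqrt{w^{\top}\Sigma w}$ discussed in Section 3):

```latex
\mathrm{PCC}(w) \;=\; \Phi\!\left(\frac{w^{\top}\delta}{2\sqrt{w^{\top}\Sigma w}}\right)
```

so the PCC increases with the projected signal $w^{\top}\delta$ and decreases with the variance $w^{\top}\Sigma w$ of the linear predictor.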

To make the connection between PCC and sample size, it is useful to think of the PCC as dependent on sample size and the selected threshold Inline graphic, hence defining Inline graphic as the PCC of a classifier built using training data on Inline graphic subjects. Further, to define the optimal PCC, it is useful to think of the upper bound of the PCC among linear classifiers. A linear classifier can reach the upper bound if it is the oracle classifier, or if it is constructed from a study with infinite sample size (DS2007). This optimal classifier has Inline graphic, which simplifies to Inline graphic when Inline graphic, Inline graphic and Inline graphic is replaced by Inline graphic at the design stage.

Clearly, a practically achievable PCC will be lower than the upper bound, and its exact value depends on how much relevant information can be extracted from the training data. At the design stage, a PCC target is set lower than Inline graphic; for instance, DS2007 set Inline graphic as the smallest PCC that satisfies Inline graphic. The sample size requirement is then defined as the smallest Inline graphic such that Inline graphic. If the inverse function of Inline graphic could be analytically derived, then the sample size would be easily determined by Inline graphic. Since a closed form of Inline graphic rarely exists, we employ numerical algorithms (e.g. the binary search algorithm) to determine sample size.
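As a concrete sketch of the numerical inversion, the following example applies a binary search to a monotone PCC curve. The curve `pcc` here is a hypothetical smooth stand-in, not the paper's PCC function, and the search range is an assumption.

```python
import math

def smallest_n(pcc_of_n, target, n_lo=4, n_hi=4096):
    """Numerically invert a monotone PCC(n) curve: return the smallest
    sample size n with pcc_of_n(n) >= target, assuming the target is
    reachable within the search range."""
    if pcc_of_n(n_hi) < target:
        raise ValueError("target PCC not reached within search range")
    while n_lo < n_hi:
        mid = (n_lo + n_hi) // 2
        if pcc_of_n(mid) >= target:
            n_hi = mid          # mid is feasible; look for a smaller n
        else:
            n_lo = mid + 1      # mid is infeasible; search above it
    return n_lo

# Illustration with a hypothetical smooth PCC curve (not the paper's):
pcc = lambda n: 0.9 * (1 - math.exp(-n / 30))
n_star = smallest_n(pcc, 0.80)
```

Because PCC is nondecreasing in the sample size, the bisection is guaranteed to return the smallest feasible n whenever the target is attainable within the range; the same wrapper applies when `pcc_of_n` is an MC estimate, at the cost of MC noise near the target.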

Finally, one must consider how to calculate PCC at the design stage. DS2007 consider an approximation of PCC to warrant its fast computation. First they compute an optimal, fixed threshold Inline graphic using information on the assumed Inline graphic, and show with Monte Carlo (MC) simulations that for a fixed Inline graphic the approximation Inline graphic is accurate. However, at the analysis stage, optimizing Inline graphic depends only on the data Inline graphic, since Inline graphic is unknown. We describe two procedures that determine Inline graphic based only on simulated data; they are advantageous because they have an actual parallel in the analysis stage.

2.2. Feature selection procedures

2.2.1. CV threshold.

We consider the following straightforward K-fold CV thresholding method to determine Inline graphic given only training data Inline graphic, denoted by Inline graphic. Such Inline graphic is chosen to maximize the apparent PCC, Inline graphic, which is a function of both threshold Inline graphic and training data Inline graphic. The apparent PCC can be computed via the following steps:

  1. Follow the sampling strategy for Inline graphic to divide it into Inline graphic equal-sized subsets, Inline graphic (e.g. divide cases and controls separately if stratified sampling is used to collect Inline graphic).

  2. For each Inline graphic, treat Inline graphic, which has sample size Inline graphic, as a CV testing set and the rest of the data Inline graphic as a CV training set; given a threshold value Inline graphic, which is one of many threshold values on a dense grid, use (2.1) to obtain the weighting Inline graphic from the training set Inline graphic, where Inline graphic-scores are calculated from dataset Inline graphic only.

  3. For Inline graphic, calculate the apparent PCC based on testing set Inline graphic, as Inline graphic.

  4. Calculate the overall apparent PCC: Inline graphic.

  5. Calculate the apparent PCC on a dense grid of values for Inline graphic, and select the optimal threshold Inline graphic that maximizes the overall apparent PCC: Inline graphic.
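The five steps above can be sketched as follows. This is an illustrative implementation under simplifying assumptions (a random rather than stratified fold split, balanced groups, hard-thresholded two-sample z-score weights, and a midpoint score cutoff), not the authors' exact procedure.

```python
import numpy as np

def cv_threshold(X, y, grid, K=5, rng=None):
    """Pick the hard threshold t maximizing K-fold apparent accuracy.
    X: n x p feature matrix (standardized), y in {0,1}. Hypothetical
    sketch: weights are z-scores zeroed below t in absolute value, and
    a point is assigned to class 1 when its weighted score exceeds the
    midpoint of the two training-class mean scores."""
    rng = np.random.default_rng(rng)
    folds = rng.permutation(len(y)) % K          # random (not stratified) K-fold split
    best_t, best_acc = grid[0], -np.inf
    for t in grid:
        correct = 0
        for k in range(K):
            tr, te = folds != k, folds == k
            m1 = X[tr & (y == 1)].mean(0)
            m0 = X[tr & (y == 0)].mean(0)
            s = X[tr].std(0, ddof=1) + 1e-12
            z = (m1 - m0) / (s * np.sqrt(4 / tr.sum()))  # approx. two-sample z-scores
            w = np.where(np.abs(z) >= t, z, 0.0)         # feature selection at t
            score = X @ w
            cut = 0.5 * (score[tr & (y == 1)].mean() + score[tr & (y == 0)].mean())
            correct += np.sum((score[te] > cut) == (y[te] == 1))
        acc = correct / len(y)                   # overall apparent PCC at t
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Toy illustration: two strong features among 20, 30 subjects per group.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 20))
X[y == 1, :2] += 2.0
t_hat = cv_threshold(X, y, grid=[0.5, 1.5, 3.0], rng=1)
```

In the paper's setting the fold split would follow the stratified sampling design and the grid would be dense; both are coarsened here for brevity.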

For an analysis employing the CV threshold, the expected PCC is calculated over the distribution of Inline graphic, Inline graphic This procedure optimizes the threshold Inline graphic without relying on knowledge of Inline graphic; when embedded in sample size calculations, it accounts for uncertainty in Inline graphic. When features are independent and important features have the same effect size, CV thresholding with weights in (2.1) results in the optimal Bayes classification rule, up to uncertainty in Inline graphic, and thus the optimal sample size.

2.2.2. Higher criticism threshold.

Proposed by Donoho and Jin (2009), HCT provides a data-driven approach to determine Inline graphic in a high-dimensional classification analysis. HCT determines a suitable threshold Inline graphic based on the distribution of Inline graphic-values obtained from univariate tests for associations of individual features with the group assignment, and then the weighting scheme (2.1) can be applied. Let Inline graphic denote the HCT procedure when applied to training data Inline graphic. The association test for feature Inline graphic with group Inline graphic results in a two-sided Inline graphic-value Inline graphic. For an unimportant feature, Inline graphic; for an important feature the resulting Inline graphic does not follow Inline graphic and tends to be smaller than those of the unimportant features. HCT only focuses on the smallest Inline graphic Inline graphic-values sorted in increasing order: Inline graphic; a typical choice is Inline graphic. Donoho and Jin (2009) showed that the Inline graphicth ordered Inline graphic-value with Inline graphic provides an appropriate cutoff for feature selection: features whose Inline graphic-values are less than Inline graphic are considered important for classification. The Inline graphic-score threshold is thus Inline graphic. The resulting PCC is Inline graphic.
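A minimal sketch of the HCT computation, using the standard higher-criticism objective from Donoho and Jin applied to the smallest fraction of p-values; details such as the p-value clipping are our own numerical guard, not part of the published procedure.

```python
import numpy as np
from scipy.stats import norm

def hct_threshold(z, alpha=0.10):
    """Higher-criticism threshold from feature z-scores.
    Compute two-sided p-values, sort them, evaluate the HC objective
    HC(i) = sqrt(p) * (i/p - p_(i)) / sqrt(p_(i) * (1 - p_(i)))
    over the smallest alpha*p ordered p-values, and return the |z| of
    the maximizing order statistic as the selection threshold."""
    p = len(z)
    pvals = 2 * norm.sf(np.abs(z))              # two-sided p-values
    order = np.argsort(pvals)
    k = max(1, int(alpha * p))                  # focus on the k smallest
    i = np.arange(1, k + 1)
    ps = np.clip(pvals[order[:k]], 1e-12, 1 - 1e-12)  # guard the denominator
    hc = np.sqrt(p) * (i / p - ps) / np.sqrt(ps * (1 - ps))
    i_star = np.argmax(hc)
    return np.abs(z[order[i_star]])

# Toy illustration: 5 strong features among 1000.
rng = np.random.default_rng(0)
z = rng.normal(size=1000)
z[:5] += 6.0
t_hct = hct_threshold(z)
```

Features with |z-score| at or above the returned value would then receive non-zero weight under (2.1).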

As detailed by Donoho and Jin (2009), the theory of HCT brings new insights to the asymptotic properties of linear classifiers under the so-called rare-and-weak model, which is of interest in the context of high-dimensional classification because it gives a structure under which the number of important features Inline graphic and signal strength Inline graphic vary with the total number of features Inline graphic. This structure enables study of asymptotic classification feasibility. In this rare-and-weak model, Inline graphic increases with Inline graphic according to Inline graphic (Donoho and Jin, 2009), where Inline graphic controls the sparsity. Similarly, instead of Inline graphic, which becomes arbitrarily large with increasing sample size, in this model the signal strength follows Inline graphic; Inline graphic controls the signal strength, and Inline graphic for some Inline graphic. This implies that important features become rarer and their effect size becomes weaker when the total number of features Inline graphic increases, which is regarded as a more realistic mechanism (NCI-NHGRI, 2007; Jin, 2009) than a mechanism where PCC always increases as Inline graphic increases. It has been shown (Donoho and Jin, 2009; Jin, 2009) that as the number of features Inline graphic the PCC of any linear classifier is characterized only by Inline graphic through a certain function Inline graphic given in Section B of supplementary material available at Biostatistics online: (i) when Inline graphic, the classification analysis is asymptotically feasible, in the sense that the PCC of the HCT linear classifier approaches 1 as Inline graphic and (ii) when Inline graphic, the classification analysis is asymptotically infeasible. The asymptotic result of feasibility is critical to guide the design of classification analysis.
Verifying the inequality Inline graphic can help investigators make a timely decision on the feasibility of a study at the planning stage (see Section 5 for an illustration).

3. Implementation

Given an approach to evaluate PCC, sample size can be determined by inverting the PCC function numerically. We thus focus on PCC estimation approaches that incorporate thresholding procedures so the resulting PCC would more closely reflect what can be achieved in practice. Section C of supplementary material available at Biostatistics online describes the evaluation of PCC for CV-based classifiers. Our primary contributions here focus on approaches needed for HCT-based classifiers.

Since Inline graphic is a data-driven thresholding procedure, MC simulation can be applied to evaluate Inline graphic. However, because Inline graphic depends on the training data Inline graphic exclusively through the Inline graphic smallest Inline graphic-values, we propose a computationally fast MC algorithm that directly simulates the Inline graphic smallest Inline graphic-values from the distribution of order statistics instead of simulating Inline graphic. The algorithm takes the following steps:

  1. Simulate Inline graphic-scores for Inline graphic important features from Inline graphic, Inline graphic.

  2. Convert the above Inline graphic-scores to two-sided Inline graphic-values by Inline graphic, Inline graphic.

  3. Simulate a random variable Inline graphic.

  4. Simulate variables Inline graphic independently from Inline graphic.

  5. Sort vector Inline graphic in an ascending order.

The Inline graphic smallest values in Inline graphic have the same joint distribution as the Inline graphic smallest Inline graphic-values Inline graphic derived from Inline graphic (see supplementary material available at Biostatistics online, Section D for the proof). As a by-product, the above algorithm also supplies the Inline graphic-scores Inline graphic for the Inline graphic important features. This proposed algorithm has a clear computational benefit: instead of generating Inline graphic random variables needed for Inline graphic (and calculating all Inline graphic-values), only Inline graphic variables are generated. The computational efficiency ratio is Inline graphic (since Inline graphic is much smaller than Inline graphic); e.g. for Inline graphic and Inline graphic, the algorithm is Inline graphic 1000 times more efficient. Furthermore, the algorithm below used to calculate PCC does not require generating test data.
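Steps 3–5 rest on a standard order-statistic fact: among N iid Uniform(0,1) p-values, the k-th smallest follows a Beta(k, N-k+1) distribution, and conditional on its value b, the k-1 values below it are iid Uniform(0, b). A sketch of the null-feature part of the algorithm follows (merging with the important-feature p-values, as in Steps 1, 2, and 5, is then a sort); this illustrates the order-statistic device, and the paper's exact algorithm may differ in detail.

```python
import numpy as np

def smallest_null_pvalues(N, k, rng):
    """Simulate the k smallest of N iid Uniform(0,1) p-values without
    generating all N: draw the k-th order statistic B ~ Beta(k, N-k+1),
    then k-1 iid Uniform(0, B) draws for the values below it, and sort."""
    b = rng.beta(k, N - k + 1)
    small = np.sort(rng.uniform(0, b, size=k - 1))
    return np.append(small, b)

# Toy illustration: 25 smallest p-values out of 10,000 null features,
# generated from 25 random draws instead of 10,000.
rng = np.random.default_rng(0)
pv = smallest_null_pvalues(10_000, 25, rng)
```

This is where the quoted efficiency ratio comes from: only on the order of k draws are needed rather than N.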

To evaluate Inline graphic, we repeat for Inline graphic iterations:

  1. For given Inline graphic, Inline graphic and Inline graphic, use Steps 1–5 above to generate the Inline graphic smallest Inline graphic-values Inline graphic and the corresponding Inline graphic-scores for Inline graphic important features.

  2. Determine the optimal threshold Inline graphic.

  3. Use (2.1) to calculate weights Inline graphic of important features using their Inline graphic-scores and Inline graphic.

  4. Calculate Inline graphic, where Inline graphic is the number of elements in Inline graphic that are smaller than Inline graphic.

Then, the fast MC estimate of Inline graphic is given by Inline graphic.

Correlated features. When features are correlated, estimating PCC is more challenging. One difficulty pertains to computing the denominator, Inline graphic, within the PCC formula, Inline graphic. DS2007 proposed replacing this quantity with an upper bound Inline graphic, where Inline graphic is the largest eigenvalue of Inline graphic. This bound could potentially be applied at Step 4 of the HCT-based algorithm above when calculating Inline graphic. However, this does not work for the HCT method since Step 1 relies on independence to prove that the Inline graphic used to generate the Inline graphic smallest Inline graphic-values follows a Beta distribution. Hence, we instead evaluate the expected PCC using an alternative MC simulation strategy. The simulation strategy follows Steps 1–4 as above, with the following modifications. First, we specify an assumed working correlation structure Inline graphic for the features. In Step 1, we now use this assumed Inline graphic to generate Inline graphic correlated features on Inline graphic subjects, and compute Inline graphic statistics and accompanying Inline graphic-values. The choice of structure for Inline graphic (e.g. block diagonal) may be informed by substantive knowledge, if available, and the magnitude of the correlations should be varied to assess its impact on the sample size calculation. Steps 2 and 3 of the algorithm remain unchanged to reflect that at the analysis stage the features are screened using pairwise associations and treated as independent. In Step 4, we evaluate Inline graphic, where Inline graphic is an estimate of Inline graphic based on the working correlation and is defined as Inline graphic and Inline graphic denotes the Inline graphic entry of Inline graphic. As with the eigenvalue approach, this approximation relies on the working correlation structure Inline graphic; however, this approach is less conservative than using the eigenvalue-based bound (see Illustrations section). 
The MC approach for the CV method with correlated features similarly relies on generating correlated data (see supplementary material available at Biostatistics online).
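To make the two design-stage approximations of the denominator concrete, the toy example below compares the eigenvalue-based bound with the working-correlation plug-in for an assumed block-exchangeable structure; the structure, correlation value, and weights are hypothetical.

```python
import numpy as np

# Compare two design-stage approximations of the PCC denominator
# sqrt(w' R w) for selected features under a working correlation R:
# the eigenvalue bound sqrt(lambda_max) * ||w|| (DS2007-style) versus
# plugging the working correlation into the quadratic form directly.
rho = 0.5
R = np.kron(np.eye(3), np.array([[1, rho], [rho, 1]]))   # 3 exchangeable 2x2 blocks
w = np.array([1.0, 1.0, 0.8, 0.0, 0.5, 0.0])             # hypothetical weights

direct = np.sqrt(w @ R @ w)                  # working-correlation plug-in
lam_max = np.linalg.eigvalsh(R).max()
bound = np.sqrt(lam_max) * np.linalg.norm(w) # eigenvalue-based upper bound
```

Since Inline graphic (the standard normal CDF) decreases as the denominator grows, the larger eigenvalue-based value yields a lower PCC, consistent with the text's remark that the plug-in approach is less conservative.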

4. Feature augmentation

Because in practice multiple sources of features are typically collected, a key study design question is to investigate the potential PCC gain resulting from adding new sources of features to the classification analysis. For simplicity, let us focus on two sources of features (e.g. molecular biomarkers and clinical variables). Denote features already in the study by Type A and the new set by Type B, with respective dimensions Inline graphic and Inline graphic. For subject Inline graphic, we collect measurements Inline graphic and Inline graphic. As in Section 2, assume that the joint conditional distribution for features Inline graphic and Inline graphic is:

graphic file with name M276.gif

where Inline graphic and Inline graphic and Inline graphic, Inline graphic, and Inline graphic are the respective effect size vectors, variance, and covariance matrices. Here we do not place any restrictions on either the effect size vectors (e.g. sparsity is not assumed) or the variance matrices.

To study PCC gain, we need to consider the PCC of linear classifiers in three cases, including (i) Type A features only; (ii) Type B features only; and (iii) Type A features augmented with Type B features. Denote the respective weights in Cases (i) and (ii) by Inline graphic and Inline graphic; we do not place any assumptions on these weights, e.g. they may be derived from any thresholding procedure for feature selection. Conditioning on the weights Inline graphic and Inline graphic, and assuming group prevalence is Inline graphic, the PCC of the classifier in Case (i) is Inline graphic; in Case (ii) is Inline graphic; and finally, in Case (iii) is Inline graphic. When Inline graphic, the term Inline graphic drops from the denominator and we obtain Inline graphic. In Section E of supplementary material available at Biostatistics online, we prove that:

graphic file with name M293.gif (4.1)

The first equality holds when the relative variance of the linear predictors goes to 0, i.e. Inline graphic where Inline graphic if Inline graphic and Inline graphic if Inline graphic. In the latter case, the equality is reached when Inline graphic and the linear predictors have equal variance Inline graphic. Inequality (4.1) provides the upper bound of the PCC of the classifier when the linear predictors are combined into a new classification rule. If either Inline graphic or Inline graphic approach 1, the upper bound will approach 1.

In practice, the features may be correlated, Inline graphic, thus the linear predictors Inline graphic and Inline graphic will too. Given the monotonicity of Inline graphic and that the covariance of the linear predictors, Inline graphic, appears in the denominator Inline graphic, the PCC gain will depend on the sign of the correlation: Inline graphic. In Section E of supplementary material available at Biostatistics online, we also prove that Inline graphic, and that Inline graphic where Inline graphic is the relative standard deviation of the linear predictors. Finally, in Section E of supplementary material available at Biostatistics online we also prove inequality (4.1) for any proportion Inline graphic when optimal weights are used in classifier construction.
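For intuition about the upper bound's behavior, the sketch below combines two uncorrelated linear predictors under equal prevalence, where the standardized separations add in quadrature and PCC = Phi(d/2). This illustrates the form of the bound under these assumptions rather than reproducing inequality (4.1) exactly.

```python
import math
from scipy.stats import norm

def combined_pcc_upper(pcc_a, pcc_b):
    """Upper bound on the PCC after augmenting two uncorrelated linear
    predictors: recover each separation d = 2 * Phi^{-1}(PCC), combine
    in quadrature, and map back through PCC = Phi(d/2). A sketch of the
    bound's form, not the paper's exact statement."""
    d_a = 2 * norm.ppf(pcc_a)
    d_b = 2 * norm.ppf(pcc_b)
    return norm.cdf(math.sqrt(d_a**2 + d_b**2) / 2)

# E.g. augmenting a PCC-0.80 classifier with an uncorrelated PCC-0.70 one:
pab = combined_pcc_upper(0.80, 0.70)
```

Consistent with the inequality, the combined value always lies between max(PCC_A, PCC_B) and 1, and adding an uninformative source (PCC = 0.5) leaves the bound unchanged.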

5. Illustrations

PCC estimation and sample size determination given effect size and number of important features. For a given effect size, the PCC evaluated at the design stage will depend on a pre-specified thresholding procedure and in turn impact the sample size. Thus, we first illustrate the estimated PCC using the DS2007 method, and using the CV and HCT thresholding methods. Figure 2 shows the PCC as a function of sample size when the number of available features is Inline graphic or Inline graphic. The PCC estimated by the DS method is always the highest, primarily because it uses the true effect size to choose the optimal threshold Inline graphic. However, due to its reliance on Inline graphic to obtain the optimal threshold at the design stage, the DS method has no counterpart in actual data analysis. On the other hand, the PCC estimated by the CV and the HCT methods rely only on the simulated data to estimate the threshold, which introduces uncertainty in the feature selection threshold, and thus yields lower PCC estimates. When the classification problem is more difficult (e.g. Inline graphic), HCT yields higher PCC than CV as expected (Donoho and Jin, 2008). Since PCC estimates from CV and HCT reflect more closely the achievable performance of the corresponding classifiers in real applications, the sample size estimates from these methods would better approximate the sample size required in practice.

Fig. 2.

Fig. 2.

PCC estimates as a function of sample size estimated by DS, CV, and HCT methods, assuming the minimal effect size of important features is Inline graphic and group prevalences are Inline graphic. (a) Inline graphic, (b) Inline graphic, (c) Inline graphic, and (d) Inline graphic. In (a) and (b), important features are rarer (Inline graphic important feature) compared with (c) and (d) Inline graphic, which results in a marked difference in Inline graphic (gray horizontal line). The PCC estimated using the DS method is always higher for a given sample size, leading to lower sample size estimates. When features are rarer, (a) and (b), HCT gives a higher PCC, leading to lower sample size requirements. When features are less rare Inline graphic, selecting features using HCT (CV) leads to lower sample size requirements for lower (higher) PCC targets compared with CV (HCT), given the crossing of the PCC curves. PCC estimates for CV and HCT are obtained using MC simulations using the algorithms described in Section 3, with 500 replicates for CV and 1000 replicates for HCT.

Figure 3 shows the sample size requirements for a range of effect sizes, Inline graphic or 10, and Inline graphic with Inline graphic. In general, sample sizes obtained from the DS method are consistently lower than with the CV or HCT methods, as expected from Figure 2. Sample sizes become comparable (difference Inline graphic2) when the effect size is large. However, when the features are relatively weaker, the DS method will tend to underestimate the needed sample size. Figure 2 also explains the fact that, for a fixed Inline graphic, the sample sizes from HCT shown in Figure 3 are lower in scenarios where features are rarer (i.e. Inline graphic), and lower for CV when features are less rare (Inline graphic). Hence, only the proposed HCT approach gives a sufficient sample size when features are relatively weaker and rarer, without being overly conservative (the CV method is conservative in those cases, since the HCT classifier can achieve the target PCC with a smaller sample size). These assertions are verified with MC simulations shown in Table 1 (top rows, where Inline graphic).

Fig. 3.

Fig. 3.

Sample size requirements estimated using DS, CV, and HCT design methods for a range of effect sizes Inline graphic. (a) Inline graphic, (b) Inline graphic, (c) Inline graphic, and (d) Inline graphic. For each effect size and combination of Inline graphic and Inline graphic, the Inline graphic is shown as the inset value on the Inline graphic-axis, and the target PCC is set as Inline graphic. Sample size estimates for CV and HCT are obtained by numerically inverting the PCC function and selecting the smallest Inline graphic that satisfies Inline graphic; Inline graphic is estimated using the MC algorithms described in Section 3 with 500 replicates for CV and 1000 for HCT. The sample size required decreases as effect sizes of important features increase, even in high target PCC cases. Sample sizes obtained with the DS method are lower, but, as shown in Table 1, underestimate the required sample size particularly for rare-and-weak features.
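The numerical inversion described in the legend (select the smallest Inline graphic whose estimated PCC meets the target) is a simple grid search; the toy PCC curve below is an illustrative stand-in for the Monte Carlo estimator, not the paper's model.

```python
import math

def smallest_n(pcc_fn, target, n_grid):
    """Invert a (possibly Monte Carlo) PCC-vs-n curve: return the smallest
    sample size in n_grid whose estimated PCC reaches the target."""
    for n in sorted(n_grid):
        if pcc_fn(n) >= target:
            return n
    return None  # target unattainable on this grid

# Toy monotone PCC curve standing in for an MC estimate (illustrative only).
toy_pcc = lambda n: 1.0 - 0.5 * math.exp(-n / 50.0)
n_req = smallest_n(toy_pcc, 0.90, range(10, 301, 10))  # -> 90 for this toy curve
```

In practice `pcc_fn` would wrap the CV or HCT Monte Carlo routine, so each evaluation is itself an average over replicates.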

Table 1.

Sample size Inline graphic calculated by DS, CV, and HCT design methods using the specified Inline graphic and differences between the target and what can be achieved in practice, Inline graphic

Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic DS CV HCT CV CV HCT HCT CV HCT
Inline graphic 1 500 0.8 0.605 98 0.035 0.018 134 -0.001 -0.014 118 0.019 -0.007
1.2 0.676 56 0.045 0.023 76 -0.003 -0.020 68 0.020 -0.002
1.6 0.738 36 0.047 0.028 50 -0.011 -0.023 46 0.008 -0.008
Inline graphic 1 10 000 0.8 0.605 138 0.033 0.026 182 -0.001 -0.013 170 0.011 -0.001
1.2 0.676 76 0.034 0.032 96 -0.005 -0.005 94 0.005 0.000
1.6 0.738 50 0.043 0.042 62 0.000 -0.005 62 0.000 -0.005
Inline graphic 10 500 0.8 0.847 94 0.014 0.031 102 0.000 0.025 146 -0.029 -0.002
1.2 0.921 38 0.015 0.027 44 -0.003 0.011 52 -0.015 -0.005
1.6 0.944 20 0.025 0.005 24 -0.004 -0.010 24 -0.004 -0.010
Inline graphic 10 10 000 0.8 0.847 138 0.010 0.032 152 0.001 0.021 202 -0.034 -0.001
1.2 0.921 58 0.021 0.032 66 -0.006 0.014 76 -0.019 0.000
1.6 0.944 30 0.031 0.027 36 -0.008 -0.001 36 -0.008 -0.001
Inline graphic 1 500 1.6 0.738 NA 44 -0.000 0.041 62 -0.034 -0.006
Inline graphic 0.738 NA 46 -0.005 0.060 168 -0.048 0.004
Inline graphic 0.738 NA 46 -0.002 0.054 160 -0.044 0.004
Inline graphic 10 500 1.6 0.762 26 -0.002 -0.037 26 -0.002 -0.037 18 0.064 -0.013
Inline graphic 0.762 26 0.029 -0.015 30 0.000 -0.030 24 0.032 -0.010
Inline graphic 0.762 NA 28 0.001 0.028 44 -0.039 -0.010
Inline graphic 0.762 NA 26 0.002 0.026 42 0.044 -0.006

The scenarios and respective sample sizes shown here are a subset of those shown in Figure 3 (see the Figure 3 legend for details on how sample sizes are obtained). Given the computed sample size Inline graphic, Inline graphic was computed by generating 1000 training datasets of size Inline graphic and test datasets of size 100; training and test datasets were generated according to the model defined by Inline graphic, Inline graphic, Inline graphic, and Inline graphic. The average of the Inline graphic across the 1000 replicates is shown. The Inline graphic are block diagonal as follows: Inline graphic is a Inline graphic identity; Inline graphic has its first block being a Inline graphic compound symmetry structure with correlation 0.80, denoted by Inline graphic, and its second block Inline graphic; Inline graphic has 50 blocks of Inline graphic; Inline graphic has first block Inline graphic, second block Inline graphic, and third block Inline graphic; Inline graphic has first block Inline graphic and second block Inline graphic. NA indicates that sample size estimates from the DS2007 eigenvalue method were prohibitive, Inline graphic.

It is evident from Figures 2 and 3 and Table 1 that it is important not only to calculate the sample size based on the PCC estimates achievable by the statistical methods used at the analysis stage, but also to select an analysis approach that can more efficiently attain the target PCC under a given parameter space. In rare-and-weak cases, for example, the HCT-based classifier has been shown to perform better (Donoho and Jin, 2008), and thus we recommend determining sample sizes using our proposed HCT sample size calculator in these scenarios.

Sample size calculations when features are correlated. The bottom part of Table 1 gives the sample sizes computed from each method under different structures for Inline graphic (Inline graphic). The DS2007 method using the correction based on the largest eigenvalue sometimes yields prohibitive sample sizes (Inline graphic, denoted as Inline graphic), because of the excessively large maximum eigenvalue of Inline graphic. Both CV- and HCT-based methods give sample sizes at which the target PCC is achieved. It is worth noting that when important features are very rare (e.g. Inline graphic), the HCT-based method yields conservative sample sizes whereas the CV method can achieve the same target PCC with lower sample sizes.
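The working covariance structures used in the bottom of Table 1 can be assembled from compound-symmetry blocks. The sketch below is illustrative (the block dimensions and correlation are examples matching the Inline graphic notation in the table note, not a definitive reconstruction); it also shows why an eigenvalue-based correction can explode, since the largest eigenvalue of a Inline graphic-dimensional CS block grows linearly in its dimension.

```python
import numpy as np

def cs_block(dim, rho):
    """Compound-symmetry block: 1 on the diagonal, rho off-diagonal."""
    return np.full((dim, dim), rho) + (1.0 - rho) * np.eye(dim)

def block_diag_cov(blocks):
    """Assemble a block-diagonal covariance matrix from square blocks."""
    p = sum(b.shape[0] for b in blocks)
    sigma = np.zeros((p, p))
    start = 0
    for b in blocks:
        d = b.shape[0]
        sigma[start:start + d, start:start + d] = b
        start += d
    return sigma

# One 10x10 CS(0.8) block plus an independent remainder; the largest
# eigenvalue of a d-dimensional CS(rho) block is 1 + (d - 1) * rho = 8.2 here,
# which is what inflates eigenvalue-based sample size corrections.
sigma = block_diag_cov([cs_block(10, 0.8), np.eye(490)])
rng = np.random.default_rng(0)
x = rng.multivariate_normal(np.zeros(500), sigma, size=50)  # correlated features
```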

Feature augmentation. Figure 4 illustrates the upper bound of the PCC (left column) and the PCC gain (right column) due to feature augmentation, as discussed in Section 4. First, in the case when features are independent (Figure 4(a)), we note that if both Inline graphic and Inline graphic are small (or one is large), then Inline graphic will be small (or large). Hence, the upper bound of the PCC of classifiers with both Type A and Type B features is only slightly higher than Inline graphic: combining two sets of features where both are very good or both are poor does not greatly improve the PCC. If both types of features are of medium quality (e.g. both Inline graphic and Inline graphic are in the medium range around Inline graphic), then feature augmentation yields the highest gain (at most 10%) in PCC. When features are negatively correlated, the PCC gain can be substantial (Figure 4(b)).
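The shape of this bound can be seen numerically under simplifying assumptions. The sketch below is an illustrative reconstruction, not the paper's exact Section 4 expression: it assumes equal group prevalences and independent normal linear predictors with equal variance, so that each set's PCC implies a Mahalanobis distance and the combined distance adds in quadrature.

```python
import numpy as np
from scipy.stats import norm

def pcc_bound_combined(pcc_a, pcc_b):
    """Upper bound on the PCC after combining two independent feature sets,
    assuming equal group prevalences and normal linear predictors with equal
    variance (an illustrative reconstruction, not the paper's exact bound)."""
    d_a = 2.0 * norm.ppf(pcc_a)   # Mahalanobis distance implied by PCC_A
    d_b = 2.0 * norm.ppf(pcc_b)
    return float(norm.cdf(np.hypot(d_a, d_b) / 2.0))

# The gain is largest for medium-quality feature sets and small when both
# sets are already strong or both are weak:
gains = {pa: pcc_bound_combined(pa, pa) - pa for pa in (0.55, 0.8, 0.99)}
```

Under these assumptions the maximal gain from combining two equally informative sets is below 10%, consistent with the behavior described for Figure 4(a).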

Fig. 4.

Fig. 4.

The upper bounds of Inline graphic when two sets of features, A and B, are combined, compared with using a single type of features. In (a), features are independent, Inline graphic (all Inline graphic), or Inline graphic and have Inline graphic. The upper bound is attained when the linear predictors have equal variance. In (b), Inline graphic, group 1 prevalence is Inline graphic, and Inline graphic gives the relative standard deviation of the linear predictors. The upper bound shown in (b) is attained when the linear predictors constructed from features A and B are perfectly negatively correlated.

An application. We demonstrate the proposed methods using the kidney transplant study (Figure 1). In Stage I, the investigator hypothesizes that among Inline graphic proteins, at least Inline graphic of them are likely informative for predicting graft survival status. Pilot data, Inline graphic, showed an effect size of approximately Inline graphic (i.e. Inline graphic). Given these design parameters, the signal strength is Inline graphic; the sparsity parameter is Inline graphic; and the strength parameter Inline graphic lies above the feasibility boundary: Inline graphic. Hence, the classification problem is feasible (see Section B of the supplementary material available at Biostatistics online), and we can proceed to calculate sample size requirements for given PCC targets (Figure 5(a)).

Fig. 5.

Fig. 5.

Application: study design for predicting graft survival after kidney transplant. (a) As expected, a larger sample size is required when a higher PCC target is chosen and other design parameters are held constant (Inline graphic). The DS method yields the lowest sample size requirements, but these may underestimate the needed sample size (see Table 1); whether HCT or CV requires the larger sample size depends on the target PCC. (b) PCC with the proteomics markers (Inline graphic) is fixed at 0.7 (dashed line). If the new features are not as informative as the proteomics markers (Inline graphic), combining both sets of features leads to a limited improvement of the classifier, and in some cases data augmentation might actually degrade the classifier (shaded area below 0.7) due to the noise introduced by low-quality features in the new data source (e.g. some proteins can be measured with substantial error if urine samples are not stored under stringent conditions). If the new features are more informative (Inline graphic), incorporating them can substantially enhance the PCC.

For Stage II, the investigator considers improving the PCC of the classifier built with proteomics biomarkers only, say Inline graphic, by incorporating an additional set of features, including proteinuria, GFR, hematuria, albumin, and cholesterol. Figure 5(b) shows a region describing the achievable PCC with both types of features (Inline graphic) for various values of the PCC with the additional features alone (i.e. Inline graphic). Substantial enhancements to the PCC occur when the second set of features is at least as informative as the proteomics biomarkers (Inline graphic).

6. Discussion

We addressed two study design questions for studies using high-dimensional features for classification. First, we developed sample size determination strategies for CV- and HCT-based classifiers. Our strategies incorporate uncertainty of feature selection thresholds within the PCC calculation, which is particularly relevant when important features are hypothesized to be rare and weak. We proposed a computationally efficient algorithm based on order statistics to compute the PCC, and thus the sample size requirements, for the HCT-based classifier. Second, we established an inequality for the upper and lower bounds of the achievable PCC associated with feature augmentation. The approaches were illustrated with numerical examples and a practical study, and are implemented in our R package HDDesign (available at https://cran.r-project.org/).

Our proposed methods can be improved in the following directions. Classification into more than two groups commonly arises in clinical studies, so extensions in this direction are of great importance. Strong deviations from linearity (e.g. U-shaped associations) may undermine the applicability of the proposed approaches; in this case, it may be possible to categorize the predictors and apply and/or extend the study design methods of Liu and others (2012) to the case of rare-and-weak features. It is also of interest to further investigate how correlations among features may be effectively incorporated into the sample size determination. We proposed to directly plug an assumed working correlation matrix into the CV- and HCT-based approaches. As expected, positive correlations among features result in larger required sample sizes, although not as large as those produced by DS2007's preliminary eigenvalue-based approach. Nevertheless, our approach requires specifying sensible working correlation structures at the design stage, which may be difficult to do in practice; varying the structure and magnitude of the correlations based on available scientific knowledge is therefore needed with our proposed approach. Further improvements in this direction may be possible by using the innovated HCT suggested by Hall and Jin (2010), or by developing sample size determination methods based on regularized regression approaches that do not require pre-filtering and hence do not rely on the marginal effects of the features. However, developing sample size calculations using regression-based procedures (e.g. LASSO) would require specifying the adjusted effect sizes and, importantly, quantifying the uncertainty in feature selection, which remains an open problem in high-dimensional inference.

In summary, we advocate the use of sample size determination methods that match, as closely as possible, the analytic approaches that will actually be applied at the data analysis stage and that capitalize on prior knowledge of the underlying mechanism of interest. If the important features have strong signals, both HCT- and DS-based approaches provide adequate sample size calculations, with little difference between them; given that the HCT method is computationally fast and accounts for uncertainty in the feature selection threshold, it is recommended in practice. If the important features are relatively abundant but weak, we recommend the CV approach, as it gives the least conservative sample size, albeit at greater computational cost. If the important features are rare and weak, we recommend the HCT-based approach, since it provides the desired sample sizes with little conservatism and is computationally efficient. Overall, our work builds upon and further advances the pioneering work of DS2007 for sample size determination in high-dimensional classification problems.

Supplementary material

Supplementary Material is available at http://biostatistics.oxfordjournals.org.

Funding

M.W. and B.N.S. acknowledge NIH grant R21DA024273 for salary support during the initial conduct of this study. P.X.K.S.'s research is funded in part by NIH U54-DK-083912-05 and NSF DMS-1513595. This work was also partially funded by grants NIH/EPA P20 ES018171/RD83480001 and P01 ES022844/RD83543601.


Acknowledgments

The publication's contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH or US EPA. Conflict of Interest: None declared.

References

  1. Cai T., Cheng S. (2008). Robust combination of multiple diagnostic tests for classifying censored event times. Biostatistics 9(2), 216–233.
  2. Clarke R., Ressom H. W., Wang A., Xuan J., Liu M. C., Gehan E. A., Wang Y. (2008). The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nature Reviews Cancer 8(1), 37–49.
  3. de Valpine P., Bitter H. M., Brown M. P. S., Heller J. (2009). A simulation-approximation approach to sample size planning for high-dimensional classification studies. Biostatistics 10(3), 424–435.
  4. Dobbin K. K., Simon R. M. (2007). Sample size planning for developing classifiers using high-dimensional DNA microarray data. Biostatistics 8(1), 101–117.
  5. Donoho D., Jin J. (2008). Higher criticism thresholding: optimal feature selection when useful features are rare and weak. Proceedings of the National Academy of Sciences 105(39), 14790–14795.
  6. Donoho D., Jin J. (2009). Feature selection by higher criticism thresholding achieves the optimal phase diagram. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 367(1906), 4449–4470.
  7. Gadegbeku C. A., Gipson D. S., Holzman L., Ojo A. O., Song P. X., Barisoni L., Sampson M. G., Kopp J. B., Lemley K. V., Nelson P. J. and others (2013). Design of the Nephrotic Syndrome Study Network (NEPTUNE): a multi-disciplinary approach to understanding primary glomerular nephropathy. Kidney International 83(4), 749–756.
  8. Hamburg M. A., Collins F. S. (2010). The path to personalized medicine. New England Journal of Medicine 363(4), 301–304.
  9. Hall P., Jin J. (2010). Innovated higher criticism for detecting sparse signals in correlated noise. The Annals of Statistics 38(3), 1686–1732.
  10. Hwang D., Schmitt W. A., Stephanopoulos G., Stephanopoulos G. (2002). Determination of minimum sample size and discriminatory expression patterns in microarray data. Bioinformatics 18(9), 1184–1193.
  11. Jin J. (2009). Impossibility of successful classification when useful features are rare and weak. Proceedings of the National Academy of Sciences 106(22), 8859–8864.
  12. Johnson R. A., Wichern D. W. (2002). Applied Multivariate Statistical Analysis, 5th edition. Upper Saddle River, NJ: Prentice-Hall.
  13. Lin H., Zhou L., Peng H., Zhou X. H. (2011). Selection and combination of biomarkers using ROC method for disease classification and prediction. Canadian Journal of Statistics 39(2), 324–343.
  14. Liu X., Wang Y., Rekaya R., Sriram T. N. (2012). Sample size determination for classifiers based on single-nucleotide polymorphisms. Biostatistics 13(2), 217–227.
  15. Liu X., Wang Y., Sriram T. N. (2014). Determination of sample size for a multi-class classifier based on single-nucleotide polymorphisms: a volume under the surface approach. Journal of Biomedical Informatics 15, 190.
  16. Mardis E. R. (2008). The impact of next-generation sequencing technology on genetics. Trends in Genetics 24(3), 133–141.
  17. Mukherjee S., Tamayo P., Rogers S., Rifkin R., Engle A., Campbell C., Golub T. R., Mesirov J. P. (2003). Estimating dataset size requirements for classifying DNA microarray data. Journal of Computational Biology 10(2), 119–142.
  18. NCI-NHGRI Working Group on Replication in Association Studies (2007). Replicating genotype-phenotype associations. Nature 447(7145), 655–660.
  19. Pepe M. S., Cai T., Longton G. (2006). Combining predictors for classification using the area under the receiver operating characteristic curve. Biometrics 62(1), 221–229.
  20. Pfeiffer R. M., Bura E. (2008). A model free approach to combining biomarkers. Biometrical Journal 50(4), 558–570.
  21. Schuster S. C. (2008). Next-generation sequencing transforms today's biology. Nature Methods 5(1), 16–18.
  22. Simon R. (2008). Development and validation of biomarker classifiers for treatment selection. Journal of Statistical Planning and Inference 138(2), 308–320.
  23. Wang Y., Miller D. J., Clarke R. (2008). Approaches to working in high-dimensional data spaces: gene expression microarrays. British Journal of Cancer 98(6), 1023–1028.
