Abstract
Advances in high throughput technology have accelerated the use of hundreds to millions of biomarkers to construct classifiers that partition patients into different clinical conditions. Prior to classifier development in actual studies, a critical need is to determine the sample size required to reach a specified classification precision. We develop a systematic approach for sample size determination in high-dimensional (large $p$, small $n$) classification analysis. Our method utilizes the probability of correct classification (PCC) as the optimization objective function and incorporates the higher criticism thresholding procedure for classifier development. Further, we derive the theoretical bound of the maximal PCC gain from feature augmentation (e.g. when molecular and clinical predictors are combined in classifier development). Our methods are motivated and illustrated by a study using proteomics markers to classify post-kidney transplantation patients into stable and rejecting classes.
Keywords: Design, Higher criticism threshold, Large p small n, Linear discrimination, Sample size
1. Introduction
In recent years, high-dimensional classification analysis has received heightened attention due to its importance for personalized medicine: if validated classifiers (e.g. diagnostic tests) are available, clinicians can use them to design effective treatment plans for individual patients (Hamburg and Collins, 2010). Several approaches to deriving classifiers based on high-dimensional biomarkers have been developed in the literature, and when applied to real world experiments, some promising results have been reported, e.g. Clarke and others (2008), Simon (2008), and Wang and others (2008). However, rapid technological advances enabling the collection of hundreds to millions of biomarkers from a single patient give rise to study design challenges (Mardis, 2008; Schuster, 2008), including how to determine adequate sample size to train classifiers.
We address two key study design issues for classification studies with high-dimensional predictors (i.e. “large $p$, small $n$” scenarios); namely, how to: (i) determine a sample size that accounts in advance for the actual data analysis plan and (ii) assess the gain in classification precision associated with feature augmentation. Given space constraints, we focus only on study design related issues, an area that has received less attention. The current design literature in this area focuses on classifiers that are constructed by first screening biomarkers that may be differentially expressed across disease groups (i.e. assuming biomarkers important for classification are sparse), and subsequently combining the selected biomarkers into a classification rule. These types of classifiers rely on threshold cutoffs for selecting important features, the estimation of which needs to be accounted for at the design stage.
This work is motivated by two collaborative projects. The first is our work with the Nephrotic Syndrome Study Network (NEPTUNE), which studies molecular mechanisms for rare renal diseases. One of NEPTUNE's goals is to identify tissue-based mRNA biomarkers to classify patients into risk groups and predict disease remission. A comprehensive generalization of the NEPTUNE study design method (Gadegbeku and others, 2013) is frequently needed in practice.
A second collaboration is joint work with a clinician at The University of Michigan Kidney Transplantation Center, who aimed to predict a patient's graft survival status (stable vs. rejecting), a measure of treatment effectiveness, after kidney transplant. The proposed study will proceed in two stages (Figure 1). First, the investigator would like to know how many transplant patients are sufficient to derive and validate a powerful classifier based on protein biomarkers. Second, the investigator would like to know if the classification prediction can be improved by adding clinical predictors such as routine laboratory measures (e.g. albumin and hemoglobin) and demographic characteristics, i.e. the gain in prediction accuracy due to feature augmentation.
Fig. 1.
Study flowchart for constructing classifiers of graft survival after kidney transplant. In Stage I, investigators will collect proteomics biomarkers using microarrays for each patient in the stable and rejecting groups, and a classifier of graft survival status will be developed. In Stage II, investigators will consider adding other clinical characteristics and patient demographics to improve classification precision.
Aside from sample size determination methods that optimize hypothesis testing criteria in high-dimensional data settings (e.g. Hwang and others, 2002), few sample size methods for building classifiers are available. One groundbreaking method for classification analysis was proposed by Dobbin and Simon (2007) (hereafter DS2007), which is based on optimizing the probability of correct classification (PCC, Mukherjee and others, 2003). The classifier's PCC (or sensitivity or specificity) is a more appropriate target for sample size determination in classification studies than the classical concepts of Type I and Type II errors for testing differences across groups. One limitation of DS2007's method is that the threshold for feature selection is optimized for the given design parameters (e.g. number of important features and their effect size), and this threshold is treated as known in the sample size calculation. As a result, this design approach does not have a counterpart in the data analysis stage, because during analyses the true differences between groups, and thus the threshold, are unknown. Liu and others (2012) developed sample size determination methods for classifiers based on single nucleotide polymorphisms, which were extended by Liu and others (2014) to multi-class classifiers. de Valpine and others (2009) developed a simulation-approximation approach to determine sample size.
The benefits of feature augmentation in terms of the receiver operating characteristic curve have been investigated (Pepe and others, 2006; Cai and Cheng, 2008; Pfeiffer and Bura, 2008; Lin and others, 2011). However, there is no theoretical work specifically quantifying the amount of PCC gain due to feature augmentation, nor identifying the scenarios under which the PCC gain is maximized.
Section 2 describes the model formulation for the features, the PCC definition, and two thresholding techniques used to select important features with which the classification rule is constructed. One is the higher criticism threshold (HCT) proposed by Donoho and Jin (2009), which is particularly relevant when important features are rare and weak, and the other is a method based on cross validation (CV). Section 3 presents our proposed methods for sample size determination which incorporate thresholding techniques. We introduce a new simulation method to efficiently evaluate the PCC of HCT-based classifiers. In Section 4, we establish a novel inequality with both the upper and lower bounds for PCC gain due to feature augmentation. Section 5 illustrates the performance of three sample size determination strategies and their use in the second motivating example of predicting kidney graft status, followed by a discussion.
2. Model, PCC, and feature selection
In this section, we review existing work and the modeling setup that serve as the context for our proposed sample size determination methods. Suppose the study population can be divided into two groups: Group 1 and Group 2. The design question concerns the number of subjects, $n$, on whom a set of training data $\mathcal{D} = \{(Y_i, X_i),\ i = 1, \dots, n\}$ will be collected to construct a classifier, where $Y_i \in \{1, 2\}$ is the group label for subject $i$; population group prevalences are $\pi_1$ and $\pi_2 = 1 - \pi_1$, respectively; and $X_i = (X_{i1}, \dots, X_{ip})^T$ is a high-dimensional vector of features for subject $i$ (e.g. proteomics biomarkers). For brevity of exposition, in the rest of the paper we assume the sample size collected from each group is equal by design (e.g. stratified sampling is used), irrespective of the group prevalences in the population. Supplementary material available at Biostatistics online describes modifications needed when sample sizes are unequal for the groups.
We assume that the features follow a multivariate normal distribution within each group with equal variances: $X_i \mid Y_i = 1 \sim N_p(\mu, \Sigma)$ and $X_i \mid Y_i = 2 \sim N_p(0, \Sigma)$, where the vector $\mu = (\mu_1, \dots, \mu_p)^T$, with elements $\mu_j$, $j = 1, \dots, p$, represents the signal strengths of the features. Setting the Group 2 mean to zero is purely for notational convenience and is not needed in practice; this notation allows us to write the mean differences of features between groups in terms of a single vector, namely $\mu$. A higher value of $|\mu_j|$ suggests a better separation between the two groups by feature $j$, and consequently feature $j$ would be important for classification. Assuming the equality of variances is needed to construct a linear classification rule (Johnson and Wichern, 2002), which we assume at the design stage. Without loss of generality, we assume the diagonal elements of $\Sigma$ equal 1, which enables us to refer to $\mu$ as the vector of effect sizes. In practice, this is achieved by dividing each feature by its pooled standard deviation calculated with the training data.
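To make the standardization step concrete, the short R sketch below divides each feature by its pooled standard deviation computed from the training data; the function name and its arguments are illustrative and not taken from the methods above.

```r
## A minimal sketch of feature standardization by the pooled standard deviation;
## X is an n x p feature matrix and y the group labels (1 or 2).
standardize_features <- function(X, y) {
  n1 <- sum(y == 1); n2 <- sum(y == 2)
  v1 <- apply(X[y == 1, , drop = FALSE], 2, var)   # within-group variances
  v2 <- apply(X[y == 2, , drop = FALSE], 2, var)
  pooled_sd <- sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
  sweep(X, 2, pooled_sd, "/")                      # divide each feature by its pooled SD
}
```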
The dimension $p$ of $X_i$ may be very high; hence it is commonly assumed that only a small number of features, say $k$ ($k \ll p$), have non-zero effect sizes. These $k$ features are considered essential to construct a classifier, while the other $p - k$ features are noise (effect size zero). For ease of exposition, we reorder the features such that the $k$ important features are listed first, i.e. $\mu = (\mu_1, \dots, \mu_k, \mathbf{0}_{p-k}^T)^T$, where $\mathbf{0}_{p-k}$ is a zero vector of length $p - k$. The values of the effect sizes, $\mu_1, \dots, \mu_k$, are unknown at the design stage, and $k$ is supplied by subject-matter scientists. Assume that it is possible to specify a lower bound $\mu_0 > 0$ for the effect sizes of the important features based on some prior research results or a certain scientific hypothesis, and replace $(\mu_1, \dots, \mu_k)^T$ by $\mu_0 \mathbf{1}_k$, where $\mathbf{1}_k$ is a vector of ones of length $k$. Then, we will simply state the effect size as $\mu_0$. Under the linear classification rule and given the weighting scheme defined in (2.1), using a lower bound in the design will lead to a conservative estimate of the PCC and thus of the sample size, which is acceptable in practice when no reliable pilot data are available to estimate $\mu$ satisfactorily.
In this paper, we consider a linear classifier at the design stage. Constructing a linear classifier is equivalent to using the training data $\mathcal{D}$ to derive a certain weighting scheme $w = (w_1, \dots, w_p)^T$ that allocates weights to the $p$ features. Let $\langle a, b \rangle = a^T b$ denote the inner product of two vectors $a$ and $b$. The classification rule for a new subject with feature vector $X_{\text{new}}$ is: if $\langle w, X_{\text{new}} \rangle$ exceeds a cutpoint (taken as the midpoint of the projected group means), the subject is assigned to Group 1; otherwise to Group 2. In general, the weighting scheme $w$ can assign non-zero weights to all available features; however, this can harm the PCC if most of them are not important. Instead, when $k \ll p$, using regularized feature selection allows us to include only important features in the classifier, thus enhancing the classifier's PCC. Feature selection is primarily driven by pairwise associations between the features $X_{ij}$ and the group membership $Y_i$. Let $Z = (Z_1, \dots, Z_p)^T$ be the vector of test statistics derived from the training data $\mathcal{D}$. Then $Z_j \sim N(0, 1)$ for unimportant features, and $Z_j \sim N(\tau_j, 1)$ for $j = 1, \dots, k$, where $\tau_j = \sqrt{n}\,\mu_j/2$ is the signal strength. A natural strategy for feature selection is to choose an appropriate threshold $t > 0$ such that we only include features satisfying $|Z_j| \ge t$, $j = 1, \dots, p$. Given a threshold $t$, we incorporate this feature selection mechanism into the definition of the weighting scheme:
$$ w_j(t) = \mathrm{sgn}(Z_j)\,\mathbf{1}\{|Z_j| \ge t\}, \qquad j = 1, \dots, p. \tag{2.1} $$

The threshold $t$ is determined empirically given $\mathcal{D}$; Section 2.2 describes procedures to select it.
2.1. Objective function and connection to sample size
Following DS2007, we use the PCC as the primary objective function for sample size determination. With two groups, the PCC is the weighted average of the classifier's sensitivity and specificity, with weights equal to the group prevalences. Under the assumed model and for fixed weights $w$, it can be easily shown that $\mathrm{PCC}(w) = \Phi\{\langle w, \mu\rangle/(2\sqrt{w^T\Sigma w})\}$, where $\Phi$ is the standard normal CDF. The weights $w$, however, are random and depend on the test statistics $Z$ (and thus on $\mathcal{D}$) and on the sample size $n$ of $\mathcal{D}$.
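For fixed weights, the PCC expression above can be evaluated directly; the R sketch below does so under the notation of this section (the function name is illustrative).

```r
## A minimal sketch: PCC of a linear classifier with fixed weights w, effect
## sizes mu, and feature covariance Sigma, using the midpoint cutoff.
pcc_fixed_weights <- function(w, mu, Sigma) {
  pnorm(sum(w * mu) / (2 * sqrt(as.numeric(t(w) %*% Sigma %*% w))))
}
```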
To make the connection between PCC and sample size, it is useful to think of the PCC as dependent on the sample size and the selected threshold $t$, hence defining $\mathrm{PCC}_n(t)$ as the PCC of a classifier built using training data on $n$ subjects. Further, to define the optimal PCC, it is useful to consider the upper bound of the PCC among linear classifiers. A linear classifier can reach the upper bound if it is the oracle classifier, or if it is constructed from a study with infinite sample size (DS2007). This optimal classifier has PCC denoted $\mathrm{PCC}_\infty$, which simplifies to $\mathrm{PCC}_\infty = \Phi(\sqrt{k}\,\mu_0/2)$ when $\Sigma$ is the identity matrix, the important features share a common effect size, and $\mu_j$ is replaced by the lower bound $\mu_0$ at the design stage. Clearly, a practically achievable PCC will be lower than the upper bound, and its exact value depends on how much relevant information can be extracted from the training data. At the design stage, a PCC target is set lower than $\mathrm{PCC}_\infty$; for instance, DS2007 set $\mathrm{PCC}_{\text{target}}$ as the smallest PCC within a pre-specified tolerance of $\mathrm{PCC}_\infty$. The sample size requirement is then defined as the smallest $n$ such that $\mathrm{PCC}_n \ge \mathrm{PCC}_{\text{target}}$. If the inverse function of $\mathrm{PCC}_n$ with respect to $n$ could be analytically derived, then the sample size would be easily determined by inverting it at $\mathrm{PCC}_{\text{target}}$. Since a closed form of $\mathrm{PCC}_n$ rarely exists, we employ numerical algorithms (e.g. the binary search algorithm) to determine the sample size.
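To illustrate the numerical inversion, the R sketch below performs a binary search for the smallest $n$ meeting the PCC target. It assumes a user-supplied evaluator `pcc_fn(n)` (typically an MC estimate, as in Section 3) that is non-decreasing in $n$; the function name and default search bounds are illustrative.

```r
## A minimal sketch of sample size determination by binary search: find the
## smallest n with pcc_fn(n) >= target, assuming pcc_fn is monotone in n.
sample_size_search <- function(pcc_fn, target, n_min = 4, n_max = 2000) {
  lo <- n_min; hi <- n_max
  if (pcc_fn(hi) < target) stop("target PCC not reachable within n_max")
  while (lo < hi) {
    mid <- floor((lo + hi) / 2)
    if (pcc_fn(mid) >= target) hi <- mid else lo <- mid + 1
  }
  hi   # smallest n satisfying the target
}
```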
Finally, one must consider how to calculate the PCC at the design stage. DS2007 consider an approximation of the PCC to warrant its fast computation. First they compute an optimal, fixed threshold using information on the assumed $\mu$, and show with Monte Carlo (MC) simulations that, for a fixed threshold, the approximation to $\mathrm{PCC}_n(t)$ is accurate. However, at the analysis stage, the optimization of $t$ depends only on the data $\mathcal{D}$, since $\mu$ is unknown. We describe two procedures that determine $t$ based only on (simulated) data; they are advantageous because they have an actual parallel in the analysis stage.
2.2. Feature selection procedures
2.2.1. CV threshold.
We consider the following straightforward $K$-fold CV thresholding method to determine $t$ given only the training data $\mathcal{D}$, denoted by $t^{\mathrm{CV}}(\mathcal{D})$. Such a $t^{\mathrm{CV}}$ is chosen to maximize the apparent PCC, $\widehat{\mathrm{PCC}}(t, \mathcal{D})$, which is a function of both the threshold $t$ and the training data $\mathcal{D}$. The apparent PCC can be computed via the following steps:

1. Follow the sampling strategy for $\mathcal{D}$ to divide it into $K$ equal-sized subsets, $\mathcal{D}_1, \dots, \mathcal{D}_K$ (e.g. divide cases and controls separately if stratified sampling is used to collect $\mathcal{D}$).

2. For each $v = 1, \dots, K$, treat $\mathcal{D}_v$, which has sample size $n/K$, as a CV testing set and the rest of the data $\mathcal{D}_{-v}$ as a CV training set; given a threshold value $t$, which is one of many threshold values on a dense grid, use (2.1) to obtain the weighting $w(t, \mathcal{D}_{-v})$ from the training set $\mathcal{D}_{-v}$, where the $z$-scores are calculated from dataset $\mathcal{D}_{-v}$ only.

3. For $v = 1, \dots, K$, calculate the apparent PCC based on testing set $\mathcal{D}_v$, $\widehat{\mathrm{PCC}}_v(t)$, as the proportion of subjects in $\mathcal{D}_v$ correctly classified by the rule with weights $w(t, \mathcal{D}_{-v})$.

4. Calculate the overall apparent PCC: $\widehat{\mathrm{PCC}}(t, \mathcal{D}) = K^{-1}\sum_{v=1}^{K}\widehat{\mathrm{PCC}}_v(t)$.

5. Calculate the apparent PCC on a dense grid of values for $t$, and select the optimal threshold that maximizes the overall apparent PCC: $t^{\mathrm{CV}}(\mathcal{D}) = \arg\max_{t}\widehat{\mathrm{PCC}}(t, \mathcal{D})$.
For an analysis employing the CV threshold, the expected PCC is calculated over the distribution of the training data: $\mathrm{PCC}^{\mathrm{CV}}_n = E_{\mathcal{D}}[\mathrm{PCC}_n\{t^{\mathrm{CV}}(\mathcal{D})\}]$. This procedure optimizes the threshold $t$ without bearing on knowledge of $\mu$; when embedded in sample size calculations it accounts for uncertainty in $t^{\mathrm{CV}}$. When features are independent and important features have the same effect size, CV thresholding with the weights in (2.1) results in the optimal Bayes classification rule, except for uncertainty in $t^{\mathrm{CV}}$, and thus the optimal sample size.
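To make Steps 1–5 concrete, the R sketch below selects $t^{\mathrm{CV}}$ over a user-supplied grid of thresholds. It assumes the clipped form of the weighting scheme written above for (2.1), standardized features with unit variance, and a midpoint cutoff for the linear rule; the function and argument names are illustrative and not taken from the HDDesign package.

```r
## A minimal sketch of K-fold CV threshold selection (apparent-PCC maximization).
cv_threshold <- function(X, y, t_grid, K = 5) {
  folds <- integer(length(y))
  for (g in unique(y)) {                             # stratified fold assignment
    idx <- which(y == g)
    folds[idx] <- sample(rep(seq_len(K), length.out = length(idx)))
  }
  acc <- matrix(NA, K, length(t_grid))
  for (v in seq_len(K)) {
    tr <- folds != v
    x1 <- X[tr & y == 1, , drop = FALSE]
    x2 <- X[tr & y == 2, , drop = FALSE]
    z   <- (colMeans(x1) - colMeans(x2)) / sqrt(1 / nrow(x1) + 1 / nrow(x2))
    mid <- (colMeans(x1) + colMeans(x2)) / 2         # midpoint of the group means
    Xte <- sweep(X[!tr, , drop = FALSE], 2, mid)     # center the CV test fold
    for (j in seq_along(t_grid)) {
      w    <- sign(z) * (abs(z) >= t_grid[j])        # weighting scheme (2.1), clipped form
      pred <- ifelse(as.vector(Xte %*% w) > 0, 1, 2)
      acc[v, j] <- mean(pred == y[!tr])              # apparent PCC on the test fold
    }
  }
  t_grid[which.max(colMeans(acc))]                   # t^CV maximizing the apparent PCC
}
```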
2.2.2. Higher criticism threshold.
Proposed by Donoho and Jin (2009), HCT provides a data-driven approach to determine $t$ in a high-dimensional classification analysis. HCT determines a suitable threshold based on the distribution of $p$-values obtained from univariate tests for associations of individual features with the group assignment, after which the weighting scheme (2.1) can be applied. Let $t^{\mathrm{HCT}}(\mathcal{D})$ denote the HCT procedure applied to training data $\mathcal{D}$. The association test for feature $j$ with the group membership results in a two-sided $p$-value $\pi_j = 2\{1 - \Phi(|Z_j|)\}$. For an unimportant feature, $\pi_j \sim U(0, 1)$; for an important feature the resulting $\pi_j$ does not follow $U(0, 1)$ and tends to be smaller than those of the unimportant features. HCT focuses only on the smallest $p$-values sorted in increasing order, $\pi_{(1)} \le \pi_{(2)} \le \cdots \le \pi_{(\lfloor\alpha p\rfloor)}$; a typical choice is a small fraction such as $\alpha = 0.10$. Donoho and Jin (2009) showed that the $\hat{j}$th ordered $p$-value, with $\hat{j} = \arg\max_{1 \le j \le \alpha p} \mathrm{HC}(j)$ and $\mathrm{HC}(j) = \sqrt{p}\,(j/p - \pi_{(j)})/\sqrt{\pi_{(j)}(1 - \pi_{(j)})}$, provides an appropriate cutoff for feature selection: features whose $p$-values are less than $\pi_{(\hat{j})}$ are considered important for classification. The $z$-score threshold is thus $t^{\mathrm{HCT}} = \Phi^{-1}\{1 - \pi_{(\hat{j})}/2\}$. The resulting PCC is $\mathrm{PCC}_n(t^{\mathrm{HCT}})$.
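The following R sketch illustrates the HCT computation from a vector of $z$-scores, using the standard higher-criticism objective of Donoho and Jin; the default $\alpha = 0.10$ and the function name are illustrative.

```r
## A minimal sketch of the HCT rule: maximize the HC objective over the
## alpha*p smallest p-values and convert the selected p-value to a z threshold.
hct_threshold <- function(z, alpha = 0.10) {
  p    <- length(z)
  pval <- 2 * (1 - pnorm(abs(z)))                       # two-sided p-values
  ps   <- sort(pval)[seq_len(floor(alpha * p))]         # the alpha*p smallest p-values
  j    <- seq_along(ps)
  hc   <- sqrt(p) * (j / p - ps) / sqrt(ps * (1 - ps))  # HC objective
  qnorm(1 - ps[which.max(hc)] / 2)                      # z-score threshold t^HCT
}

## Example: z-scores with a few strong features among mostly null ones
set.seed(1)
z <- c(rnorm(10, mean = 3), rnorm(990))
t_hct <- hct_threshold(z)
w <- sign(z) * (abs(z) >= t_hct)                        # weights as in (2.1)
```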
As detailed by Donoho and Jin (2009), the theory of HCT brings new insights into the asymptotic properties of linear classifiers under the so-called rare-and-weak model, which is of interest in the context of high-dimensional classification because it gives a structure under which the number of important features $k$ and the signal strength vary with the total number of features $p$. This structure enables the study of asymptotic classification feasibility. In the rare-and-weak model, $k$ increases with $p$ according to $k = p^{1-\beta}$ (Donoho and Jin, 2009), where $\beta \in (0, 1)$ controls the sparsity. Similarly, instead of the signal strength $\tau_j = \sqrt{n}\,\mu_j/2$, which becomes arbitrarily large with increasing sample size, in this model the signal strength follows $\tau = \sqrt{2r\log p}$ for some $0 < r < 1$, where $r$ controls the signal strength. This implies that important features become rarer and their effect sizes become weaker as the total number of features $p$ increases, which is regarded as a more realistic mechanism (NCI-NHGRI, 2007; Jin, 2009) than one where the PCC always increases as $p$ increases. It has been shown (Donoho and Jin, 2009; Jin, 2009) that as the number of features $p \to \infty$, the PCC of any linear classifier is characterized only by $(\beta, r)$ through a certain function $\rho^\ast(\beta)$ given in Section B of supplementary material available at Biostatistics online: (i) when $r > \rho^\ast(\beta)$, the classification analysis is asymptotically feasible, in the sense that the PCC of the HCT linear classifier approaches 1 as $p \to \infty$, and (ii) when $r < \rho^\ast(\beta)$, the classification analysis is asymptotically infeasible. This asymptotic feasibility result is critical to guide the design of classification analyses. Verifying the inequality $r > \rho^\ast(\beta)$ can help investigators make a timely decision on the feasibility of a study at the planning stage (see Section 5 for an illustration).
3. Implementation
Given an approach to evaluate PCC, sample size can be determined by inverting the PCC function numerically. We thus focus on PCC estimation approaches that incorporate thresholding procedures so the resulting PCC would more closely reflect what can be achieved in practice. Section C of supplementary material available at Biostatistics online describes the evaluation of PCC for CV-based classifiers. Our primary contributions here focus on approaches needed for HCT-based classifiers.
Since $t^{\mathrm{HCT}}$ is a data-driven thresholding procedure, MC simulation can be applied to evaluate $\mathrm{PCC}^{\mathrm{HCT}}_n = E_{\mathcal{D}}[\mathrm{PCC}_n\{t^{\mathrm{HCT}}(\mathcal{D})\}]$. However, because $t^{\mathrm{HCT}}$ depends on the training data $\mathcal{D}$ exclusively through the $\lfloor\alpha p\rfloor$ smallest $p$-values, we propose a computationally fast MC algorithm that directly simulates the $\lfloor\alpha p\rfloor$ smallest $p$-values from the distribution of the order statistics instead of simulating $\mathcal{D}$. The algorithm takes the following steps:
(a) Simulate $z$-scores for the $k$ important features from $N(\sqrt{n}\,\mu_0/2, 1)$: $Z_j$, $j = 1, \dots, k$.

(b) Convert the above $z$-scores to two-sided $p$-values by $\pi_j = 2\{1 - \Phi(|Z_j|)\}$, $j = 1, \dots, k$.

(c) Simulate a random variable $B \sim \mathrm{Beta}(m, p - k - m + 1)$, where $m = \lfloor\alpha p\rfloor$; $B$ is distributed as the $m$th smallest of the $p - k$ uniform $p$-values of the unimportant features.

(d) Simulate $m - 1$ variables independently from $U(0, B)$.

(e) Sort the vector collecting the values from Steps (b)–(d) in ascending order.
The $m$ smallest values in this sorted vector have the same joint distribution as the $m$ smallest $p$-values derived from $\mathcal{D}$ (see supplementary material available at Biostatistics online, Section D, for the proof). As a by-product, the above algorithm also supplies the $z$-scores $Z_1, \dots, Z_k$ for the $k$ important features. The proposed algorithm has a clear computational benefit: instead of generating the $np$ random variables needed for $\mathcal{D}$ (and calculating all $p$ $p$-values), only $k + m$ variables are generated. The computational efficiency ratio is roughly $np/(k + m) \approx n/\alpha$ (since $k$ is much smaller than $\alpha p$); for typical values of $n$ and $\alpha$, the algorithm is approximately 1000 times more efficient. Furthermore, the algorithm below used to calculate the PCC does not require generating test data.
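A minimal R sketch of Steps (a)–(e) is given below; it follows the reconstruction above (lower-bound effect size $\mu_0$, fraction $\alpha$), and its function name and output format are illustrative.

```r
## Simulate the m = floor(alpha*p) smallest p-values directly from their
## order-statistic distribution instead of simulating the full training data.
simulate_smallest_pvalues <- function(n, p, k, mu0, alpha = 0.10) {
  m  <- floor(alpha * p)
  z  <- rnorm(k, mean = sqrt(n) * mu0 / 2)          # (a) z-scores of important features
  pv_imp  <- 2 * (1 - pnorm(abs(z)))                # (b) their two-sided p-values
  B  <- rbeta(1, m, p - k - m + 1)                  # (c) m-th smallest null p-value
  pv_null <- c(runif(m - 1, min = 0, max = B), B)   # (d) null p-values below B, plus B
  pv <- sort(c(pv_imp, pv_null))                    # (e) sort the combined vector
  list(pvals = pv[seq_len(m)],                      # the m smallest p-values
       z_important = z)                             # z-scores of the important features
}
```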
To evaluate $\mathrm{PCC}^{\mathrm{HCT}}_n$, we repeat the following steps $M$ times:

1. For given $n$, $p$, $k$, $\mu_0$, and $\alpha$, use Steps (a)–(e) to generate the $\lfloor\alpha p\rfloor$ smallest $p$-values and the corresponding $z$-scores of the $k$ important features.

2. Determine the optimal threshold $t^{\mathrm{HCT}}$ by applying the HC criterion to the simulated $p$-values.

3. Use (2.1) to calculate the weights $w_j$ of the important features using their $z$-scores and $t^{\mathrm{HCT}}$.

4. Calculate the PCC of this replicate as $\Phi\{\mu_0\sum_{j=1}^{k} w_j/(2\sqrt{N_t})\}$, where $N_t$ is the number of elements in the simulated $p$-value vector that are smaller than the HCT cutoff $\pi_{(\hat{j})}$, i.e. the total number of selected features.

Then, the fast MC estimate of $\mathrm{PCC}^{\mathrm{HCT}}_n$ is given by the average of the $M$ replicate-specific PCC values.
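Putting the pieces together, the sketch below gives one possible implementation of the fast MC estimate of $\mathrm{PCC}^{\mathrm{HCT}}_n$, reusing the two functions sketched earlier; it assumes independent features with unit variance and the clipped weights used above for (2.1), so it illustrates the logic of Steps 1–4 rather than being a definitive implementation.

```r
## A minimal sketch of the fast MC evaluation of PCC_n^HCT (Steps 1-4).
pcc_hct_mc <- function(n, p, k, mu0, alpha = 0.10, M = 1000) {
  one_rep <- function() {
    sim  <- simulate_smallest_pvalues(n, p, k, mu0, alpha)        # Step 1
    ps   <- sim$pvals
    j    <- seq_along(ps)
    hc   <- sqrt(p) * (j / p - ps) / sqrt(ps * (1 - ps))
    pcut <- ps[which.max(hc)]                                     # HCT p-value cutoff
    t    <- qnorm(1 - pcut / 2)                                   # Step 2: z-score threshold
    w    <- sign(sim$z_important) * (abs(sim$z_important) >= t)   # Step 3: weights (2.1)
    Nt   <- sum(ps <= pcut)                                       # number of selected features
    pnorm(mu0 * sum(w) / (2 * sqrt(Nt)))                          # Step 4: PCC of this replicate
  }
  mean(replicate(M, one_rep()))
}

## Example call with illustrative parameter values
## pcc_hct_mc(n = 100, p = 500, k = 10, mu0 = 0.8)
```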
Correlated features. When features are correlated, estimating the PCC is more challenging. One difficulty pertains to computing the denominator, $2\sqrt{w^T\Sigma w}$, within the PCC formula $\Phi\{\langle w, \mu\rangle/(2\sqrt{w^T\Sigma w})\}$. DS2007 proposed replacing this quantity with an upper bound, $2\sqrt{\lambda_{\max}\,w^T w}$, where $\lambda_{\max}$ is the largest eigenvalue of $\Sigma$. This bound could potentially be applied at Step 4 of the HCT-based algorithm above when calculating the PCC. However, this does not work for the HCT method, since Step 1 relies on independence to prove that the random variable used to generate the $\lfloor\alpha p\rfloor$ smallest $p$-values follows a Beta distribution. Hence, we instead evaluate the expected PCC using an alternative MC simulation strategy. The simulation strategy follows Steps 1–4 as above, with the following modifications. First, we specify an assumed working correlation structure $R$ for the features. In Step 1, we now use this assumed $R$ to generate the $p$ correlated features on $n$ subjects, and compute the $z$ statistics and accompanying $p$-values. The choice of structure for $R$ (e.g. block diagonal) may be informed by substantive knowledge, if available, and the magnitude of the correlations should be varied to assess its impact on the sample size calculation. Steps 2 and 3 of the algorithm remain unchanged, to reflect that at the analysis stage the features are screened using pairwise associations and treated as independent. In Step 4, we evaluate $\Phi\{\langle w, \mu\rangle/(2\sqrt{\hat{V}})\}$, where $\hat{V}$ is an estimate of $w^T\Sigma w$ based on the working correlation, defined as $\hat{V} = \sum_{j}\sum_{j'} w_j w_{j'} R_{jj'}$, with $R_{jj'}$ denoting the $(j, j')$ entry of $R$. As with the eigenvalue approach, this approximation relies on the working correlation structure $R$; however, this approach is less conservative than using the eigenvalue-based bound (see Illustrations section). The MC approach for the CV method with correlated features similarly relies on generating correlated data (see supplementary material available at Biostatistics online).
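The Step 4 modification can be sketched as follows in R, with an assumed working correlation matrix `R0`; the block-diagonal example with correlation 0.8 mirrors the kind of structures used in Table 1 but is otherwise illustrative.

```r
## A minimal sketch of the working-correlation PCC evaluation: the denominator
## uses the assumed correlation R0 restricted to the selected features.
pcc_working_corr <- function(w, mu, R0) {
  sel <- which(w != 0)                                          # features selected by the threshold
  V   <- t(w[sel]) %*% R0[sel, sel, drop = FALSE] %*% w[sel]    # w' R0 w over the selected set
  pnorm(sum(w * mu) / (2 * sqrt(as.numeric(V))))
}

## Example working correlation: 10-feature exchangeable blocks with correlation 0.8
p <- 50
block <- matrix(0.8, 10, 10); diag(block) <- 1
R0 <- kronecker(diag(p / 10), block)
```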
4. Feature augmentation
Because in practice multiple sources of features are typically collected, a key study design question is to investigate the potential PCC gain resulting from adding new sources of features to the classification analysis. For simplicity, we focus on two sources of features (e.g. molecular biomarkers and clinical variables). Denote the features already in the study by Type A and the new set by Type B, with respective dimensions $p_A$ and $p_B$. For subject $i$, we collect measurements $X_i^A$ and $X_i^B$. As in Section 2, assume that the joint conditional distribution of the features $X_i^A$ and $X_i^B$ given group membership is:
$$\begin{pmatrix} X_i^A \\ X_i^B \end{pmatrix} \Big|\, Y_i = g \;\sim\; N_{p_A + p_B}\!\left( \begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix}\mathbf{1}\{g = 1\},\; \begin{pmatrix} \Sigma_A & \Sigma_{AB} \\ \Sigma_{AB}^T & \Sigma_B \end{pmatrix} \right),$$

where $\mu_A$ and $\mu_B$, $\Sigma_A$ and $\Sigma_B$, and $\Sigma_{AB}$ are the respective effect size vectors, variance matrices, and covariance matrix. Here we do not place any restrictions on either the effect size vectors (e.g. sparsity is not assumed) or the variance matrices.
To study the PCC gain, we need to consider the PCC of linear classifiers in three cases: (i) Type A features only; (ii) Type B features only; and (iii) Type A features augmented with Type B features. Denote the respective weights in Cases (i) and (ii) by $w_A$ and $w_B$; we do not place any assumptions on these weights, e.g. they may be derived from any thresholding procedure for feature selection. Conditioning on the weights $w_A$ and $w_B$, and assuming the group prevalence is $\pi_1 = 1/2$, the PCC of the classifier in Case (i) is $\mathrm{PCC}_A = \Phi\{\langle w_A, \mu_A\rangle/(2\sqrt{w_A^T\Sigma_A w_A})\}$; in Case (ii) it is $\mathrm{PCC}_B = \Phi\{\langle w_B, \mu_B\rangle/(2\sqrt{w_B^T\Sigma_B w_B})\}$; and finally, in Case (iii) it is $\mathrm{PCC}_{AB} = \Phi[\{\langle w_A, \mu_A\rangle + \langle w_B, \mu_B\rangle\}/\{2\sqrt{w_A^T\Sigma_A w_A + w_B^T\Sigma_B w_B + 2w_A^T\Sigma_{AB} w_B}\}]$. When $\Sigma_{AB} = 0$, the term $2w_A^T\Sigma_{AB} w_B$ drops from the denominator and we obtain $\mathrm{PCC}_{AB} = \Phi[\{\langle w_A, \mu_A\rangle + \langle w_B, \mu_B\rangle\}/\{2\sqrt{w_A^T\Sigma_A w_A + w_B^T\Sigma_B w_B}\}]$. In Section E of supplementary material available at Biostatistics online, we prove that:
$$\max(\mathrm{PCC}_A, \mathrm{PCC}_B) \;\le\; \mathrm{PCC}_{AB} \;\le\; \Phi\!\left[\sqrt{\{\Phi^{-1}(\mathrm{PCC}_A)\}^2 + \{\Phi^{-1}(\mathrm{PCC}_B)\}^2}\,\right]. \tag{4.1}$$

The first equality holds when the relative variance of the linear predictors goes to 0, i.e. when the variance of the weaker linear predictor becomes negligible relative to that of the stronger one. The second equality is reached when $\mathrm{PCC}_A = \mathrm{PCC}_B$ and the linear predictors have equal variance. Inequality (4.1) provides the upper bound of the PCC of the classifier when the linear predictors are combined into a new classification rule. If either $\mathrm{PCC}_A$ or $\mathrm{PCC}_B$ approaches 1, the upper bound approaches 1.
In practice, the features may be correlated ($\Sigma_{AB} \ne 0$), and thus the linear predictors $\langle w_A, X^A\rangle$ and $\langle w_B, X^B\rangle$ will be correlated as well. Given the monotonicity of $\Phi$ and the fact that the covariance of the linear predictors, $w_A^T\Sigma_{AB} w_B$, appears in the denominator of $\mathrm{PCC}_{AB}$, the PCC gain will depend on the sign of the correlation: negative correlation increases the gain, whereas positive correlation reduces it. In Section E of supplementary material available at Biostatistics online, we also prove that the PCC gain is maximized when the linear predictors are perfectly negatively correlated, and we characterize the corresponding maximal gain in terms of $\gamma$, the relative standard deviation of the linear predictors. Finally, in Section E of supplementary material available at Biostatistics online, we also prove inequality (4.1) for any group prevalence $\pi_1$ when optimal weights are used in classifier construction.
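The quantities in this section are easy to explore numerically; the R sketch below computes the combined PCC under independent feature sets from the Case (iii) formula and the upper bound of (4.1) as reconstructed above, with illustrative inputs.

```r
## A minimal sketch for Section 4 with Sigma_AB = 0: combined PCC and the
## (reconstructed) upper bound of inequality (4.1).
pcc_combined <- function(a, sA, b, sB) {
  # a, b: <w_A, mu_A> and <w_B, mu_B>; sA, sB: sd of the two linear predictors
  pnorm((a + b) / (2 * sqrt(sA^2 + sB^2)))
}
pcc_upper_bound <- function(pccA, pccB) {
  pnorm(sqrt(qnorm(pccA)^2 + qnorm(pccB)^2))
}

## Example: two medium-quality feature sets (PCC_A = PCC_B = 0.70) with equal
## predictor variance attain the upper bound (about 0.77).
a <- 2 * qnorm(0.70); b <- a
pcc_combined(a, 1, b, 1)
pcc_upper_bound(0.70, 0.70)
```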
5. Illustrations
PCC estimation and sample size determination given effect size and number of important features. For a given effect size, the PCC evaluated at the design stage will depend on the pre-specified thresholding procedure, and this in turn impacts the sample size. Thus, we first illustrate the estimated PCC using the DS2007 method and using the CV and HCT thresholding methods. Figure 2 shows the PCC as a function of sample size for representative values of the total number of features $p$ and the number of important features $k$. The PCC estimated by the DS method is always the highest, primarily because it uses the true effect size to choose the optimal threshold. However, due to its reliance on $\mu$ to obtain the optimal threshold at the design stage, the DS method has no counterpart in actual data analysis. In contrast, the PCC estimates from the CV and HCT methods rely only on the simulated data to estimate the threshold, which introduces uncertainty in the feature selection threshold and thus yields lower PCC estimates. When the classification problem is more difficult (e.g. when important features are rare and weak), HCT yields a higher PCC than CV, as expected (Donoho and Jin, 2008). Since the PCC estimates from CV and HCT reflect more closely the achievable performance of the corresponding classifiers in real applications, the sample size estimates from these methods better approximate the sample size required in practice.
Fig. 2.
PCC estimates as a function of sample size for the DS, CV, and HCT methods, assuming a given minimal effect size for the important features and given group prevalences. Panels (a)–(d) correspond to different combinations of the total number of features and the number of important features. In (a) and (b), important features are rarer ($k = 1$ important feature) compared with (c) and (d), where more features are important, which results in a marked difference in $\mathrm{PCC}_\infty$ (gray horizontal line). The PCC estimated using the DS method is always higher for a given sample size, leading to lower sample size estimates. When features are rarer, (a) and (b), HCT gives a higher PCC, leading to lower sample size requirements. When features are less rare, (c) and (d), selecting features using HCT (CV) leads to lower sample size requirements for lower (higher) PCC targets compared with CV (HCT), given the crossing of the PCC curves. PCC estimates for CV and HCT are obtained from MC simulations using the algorithms described in Section 3, with 500 replicates for CV and 1000 replicates for HCT.
Figure 3 shows the sample size requirements for a range of effect sizes, with $k = 1$ or 10 important features and $p = 500$ or 10 000 total features. In general, sample sizes obtained from the DS method are consistently lower than those from the CV or HCT methods, as can be expected given Figure 2. Sample sizes become comparable (difference $\le 2$) when the effect size is large. However, when the features are relatively weaker, the DS method will tend to underestimate the needed sample size. Figure 2 also explains the fact that, for a fixed target PCC, the sample sizes from HCT shown in Figure 3 are lower than those from CV when features are rarer (i.e. $k = 1$), whereas the CV sample sizes are lower when features are less rare ($k = 10$). Hence, only the proposed HCT approach gives sufficient sample sizes for cases when features are relatively weaker and rarer without being overly conservative (the CV method is conservative in those cases, since the HCT classifier can achieve the target PCC with a lower sample size). These assertions are verified with the MC simulations shown in Table 1 (top rows, where features are independent).
Fig. 3.
Sample size requirements estimated using the DS, CV, and HCT design methods for a range of effect sizes $\mu_0$. Panels (a)–(d) correspond to different combinations of the number of important features $k$ and the total number of features $p$. For each effect size and combination of $k$ and $p$, $\mathrm{PCC}_\infty$ is shown as the inset value on the $x$-axis, and the target PCC is set 0.05 below $\mathrm{PCC}_\infty$. Sample size estimates for CV and HCT are obtained by numerically inverting the PCC function and selecting the smallest $n$ that satisfies $\mathrm{PCC}_n \ge \mathrm{PCC}_{\text{target}}$; $\mathrm{PCC}_n$ is estimated using the MC algorithms described in Section 3, with 500 replicates for CV and 1000 for HCT. The sample size required decreases as the effect sizes of important features increase, even for high target PCC. Sample sizes obtained with the DS method are lower but, as shown in Table 1, underestimate the required sample size, particularly for rare-and-weak features.
Table 1.
Sample sizes $n_{\mathrm{DS}}$, $n_{\mathrm{CV}}$, and $n_{\mathrm{HCT}}$ calculated by the DS, CV, and HCT design methods for the specified design parameters ($\Sigma$, $k$, $p$, $\mu_0$), and differences between the target PCC and what can be achieved in practice, $\Delta_{\mathrm{M}}(n) = \mathrm{PCC}_{\text{target}} - \widehat{\mathrm{PCC}}_{\mathrm{M}}(n)$, where $\widehat{\mathrm{PCC}}_{\mathrm{M}}(n)$ is the simulated PCC of analysis method M (CV or HCT) at sample size $n$.

| Σ | k | p | μ0 | Target PCC | n_DS | Δ_CV(n_DS) | Δ_HCT(n_DS) | n_CV | Δ_CV(n_CV) | Δ_HCT(n_CV) | n_HCT | Δ_CV(n_HCT) | Δ_HCT(n_HCT) |
|---|---|---|----|-----------|------|------------|-------------|------|------------|-------------|-------|-------------|--------------|
| I | 1 | 500 | 0.8 | 0.605 | 98 | 0.035 | 0.018 | 134 | -0.001 | -0.014 | 118 | 0.019 | -0.007 |
| | | | 1.2 | 0.676 | 56 | 0.045 | 0.023 | 76 | -0.003 | -0.02 | 68 | 0.020 | -0.002 |
| | | | 1.6 | 0.738 | 36 | 0.047 | 0.028 | 50 | -0.011 | -0.023 | 46 | 0.008 | -0.008 |
| I | 1 | 10 000 | 0.8 | 0.605 | 138 | 0.033 | 0.026 | 182 | -0.001 | -0.013 | 170 | 0.011 | -0.001 |
| | | | 1.2 | 0.676 | 76 | 0.034 | 0.032 | 96 | -0.005 | -0.005 | 94 | 0.005 | 0.000 |
| | | | 1.6 | 0.738 | 50 | 0.043 | 0.042 | 62 | 0 | -0.005 | 62 | 0.000 | -0.005 |
| I | 10 | 500 | 0.8 | 0.847 | 94 | 0.014 | 0.031 | 102 | 0 | 0.025 | 146 | -0.029 | -0.002 |
| | | | 1.2 | 0.921 | 38 | 0.015 | 0.027 | 44 | -0.003 | 0.011 | 52 | -0.015 | -0.005 |
| | | | 1.6 | 0.944 | 20 | 0.025 | 0.005 | 24 | -0.004 | -0.01 | 24 | -0.004 | -0.010 |
| I | 10 | 10 000 | 0.8 | 0.847 | 138 | 0.01 | 0.032 | 152 | 0.001 | 0.021 | 202 | -0.034 | -0.001 |
| | | | 1.2 | 0.921 | 58 | 0.021 | 0.032 | 66 | -0.006 | 0.014 | 76 | -0.019 | 0.000 |
| | | | 1.6 | 0.944 | 30 | 0.031 | 0.027 | 36 | -0.008 | -0.001 | 36 | -0.008 | -0.001 |
| block diag. | 1 | 500 | 1.6 | 0.738 | NA | – | – | 44 | 0.000 | 0.041 | 62 | -0.034 | -0.006 |
| block diag. | | | | 0.738 | NA | – | – | 46 | -0.005 | 0.060 | 168 | -0.048 | 0.004 |
| block diag. | | | | 0.738 | NA | – | – | 46 | -0.002 | 0.054 | 160 | -0.044 | 0.004 |
| block diag. | 10 | 500 | 1.6 | 0.762 | 26 | -0.002 | -0.037 | 26 | -0.002 | -0.037 | 18 | 0.064 | -0.013 |
| block diag. | | | | 0.762 | 26 | 0.029 | -0.015 | 30 | 0.000 | -0.030 | 24 | 0.032 | -0.010 |
| block diag. | | | | 0.762 | NA | – | – | 28 | 0.001 | 0.028 | 44 | -0.039 | -0.010 |
| block diag. | | | | 0.762 | NA | – | – | 26 | 0.002 | 0.026 | 42 | -0.044 | -0.006 |

The scenarios and respective sample sizes shown here are a subset of those shown in Figure 3 (see the Figure 3 legend for details on how the sample sizes are obtained). Given a computed sample size $n$, $\widehat{\mathrm{PCC}}_{\mathrm{M}}(n)$ was computed by generating 1000 training datasets of size $n$ and test datasets of size 100; training and test datasets were generated according to the model defined by $p$, $k$, $\mu_0$, and $\Sigma$. The averages of the differences across the 1000 replicates are shown. In the upper part of the table, $\Sigma$ is the identity matrix (independent features); in the lower part, each row uses a block-diagonal $\Sigma$ that combines identity blocks with compound symmetry blocks of correlation 0.80 (one structure consists of 50 compound symmetry blocks). NA indicates that sample size estimates from the DS2007 eigenvalue method were prohibitively large.
It is evident from Figures 2 and 3, and Table 1 that it is important to not only calculate the sample size based on the PCC estimates that are achievable by statistical methods at the analysis stage, but also to select an analysis approach that can more efficiently attain the target PCC under a given parameter space. In rare-and-weak cases, for example, the HCT-based classifier has been shown to perform better (Donoho and Jin, 2008), and thus we recommend determining sample sizes using our proposed HCT sample size calculator in these scenarios.
Sample size calculations when features are correlated. The bottom part of Table 1 gives the sample sizes computed from each method under different block-diagonal structures for $\Sigma$. The DS2007 method using the correction based on the largest eigenvalue sometimes yields prohibitive sample sizes (denoted NA in Table 1), because of the excessively large maximum eigenvalue of $\Sigma$. Both the CV- and HCT-based methods give sample sizes at which the target PCC is achieved. It is worth noting that when important features are very rare (e.g. $k = 1$), the HCT-based method yields conservative sample sizes, whereas the CV method can achieve the same target PCC with smaller sample sizes.
Feature augmentation. Figure 4 illustrates the upper bound (left panel) of the PCC and the PCC gain (right panel) due to feature augmentation discussed in Section 4. First, in the case when features are independent (Figure 4(a)), we note that if both $\mathrm{PCC}_A$ and $\mathrm{PCC}_B$ are small (or one is large), then the upper bound will be small (or large). Hence, the upper bound of the PCC of classifiers with both Type A and Type B features is only slightly higher than $\max(\mathrm{PCC}_A, \mathrm{PCC}_B)$. Combining two sets of features where both are very good or both are poor does not greatly improve the PCC. If both types of features are of medium quality (e.g. both $\mathrm{PCC}_A$ and $\mathrm{PCC}_B$ are in the medium range), then we could obtain the highest gain (at most 10%) in PCC by feature augmentation. When features are negatively correlated, the PCC gain can be substantial (Figure 4(b)).
Fig. 4.
The upper bounds of $\mathrm{PCC}_{AB}$ when two sets of features, A and B, are combined, compared with using a single type of features. In (a), the features are independent ($\Sigma_{AB} = 0$) or dependent with uncorrelated linear predictors; the upper bound is attained when the linear predictors have equal variance. In (b), the linear predictors are correlated, the group 1 prevalence is 1/2, and $\gamma$ gives the relative standard deviation of the linear predictors. The upper bound shown in (b) is attained when the linear predictors constructed from features A and B are perfectly negatively correlated.
An application. We demonstrate the proposed methods using the kidney transplant study (Figure 1). In Stage I, the investigator hypothesizes that, among the candidate proteins measured by the arrays, only a small number are likely informative for predicting graft survival status, and pilot data provide an approximate lower bound $\mu_0$ for their effect sizes. Given these design parameters, the signal strength parameter $r$ and the sparsity parameter $\beta$ can be computed, and the strength parameter lies above the feasibility boundary, $r > \rho^\ast(\beta)$. Hence, the classification problem is feasible (see Section B of supplementary material available at Biostatistics online), and we can proceed to calculate the sample size requirements for given PCC targets (Figure 5(a)).
Fig. 5.
Application: study design for predicting graft survival after kidney transplant. (a) As expected, a larger sample size is required when a higher PCC target is chosen and the other design parameters are held constant. The DS method yields the lowest sample size requirements, but these may underestimate the needed sample size (see Table 1); whether HCT or CV requires a larger sample size depends on the target PCC. (b) The PCC with the proteomics markers only, $\mathrm{PCC}_A$, is fixed at 0.7 (dashed line). If the new features are not as informative as the proteomics markers ($\mathrm{PCC}_B < 0.7$), combining both sets of features leads to a limited improvement of the classifier, and in some cases data augmentation might actually degrade the classifier (shaded area below 0.7) due to the noise introduced by low-quality features in the new data source (e.g. some proteins can be measured with substantial error if urine samples are not stored under stringent conditions). If the new features are more informative ($\mathrm{PCC}_B > 0.7$), incorporating them can substantially enhance the PCC.
For Stage II, the investigator considers improving the PCC of the classifier based on proteomics biomarkers only, say $\mathrm{PCC}_A = 0.7$, by incorporating an additional set of features, including proteinuria, GFR, hematuria, albumin, and cholesterol. Figure 5(b) shows the region of achievable PCC with both types of features, $\mathrm{PCC}_{AB}$, for various values of the PCC attainable with the additional features alone, $\mathrm{PCC}_B$. Substantial enhancements to the PCC occur when the second set of features is at least as informative as the proteomics biomarkers ($\mathrm{PCC}_B \ge 0.7$).
6. Discussion
We addressed two study design questions for studies using high-dimensional features for classification. First, we developed sample size determination strategies for CV- and HCT-based classifiers. Our strategies incorporate uncertainty of feature selection thresholds within the PCC calculation, which is particularly relevant when important features are hypothesized to be rare and weak. We proposed a computationally efficient algorithm based on order statistics to compute the PCC, and thus the sample size requirements, for the HCT-based classifier. Second, we established an inequality for the upper and lower bounds of the achievable PCC associated with feature augmentation. The approaches were illustrated with numerical examples and a practical study, and are implemented in our R package HDDesign (available at https://cran.r-project.org/).
Our proposed methods can be improved in the following directions. Classification of more than two groups commonly appears in clinical studies; thus extensions in this direction are of great importance. Strong deviations from linearity (e.g. U-shaped associations) may undermine the applicability of the proposed approaches. In this case, it may be possible to categorize the predictors and apply and/or extend the study design methods of Liu and others (2012) to the case of rare-and-weak features. It is also of interest to further investigate how correlations among features may be effectively incorporated into the sample size determination. We proposed to directly plug in an assumed working correlation matrix within the CV- and HCT-based approaches. As expected, positive correlations among features result in larger required sample sizes, although not as large as those produced by DS2007's preliminary eigenvalue-based approach. Nevertheless, our approach requires specifying sensible working correlation structures at the design stage, which may be difficult to obtain in practice. Varying the structure and magnitude of the correlations based on available scientific knowledge is needed with our proposed approach. Further improvements in this direction may be possible by using the innovated HCT suggested by Hall and Jin (2010), or by developing sample size determination methods based on regularized regression-based approaches that do not require pre-filtering and hence do not rely on the marginal effects of the features. However, developing sample size calculations using regression-based procedures (e.g. LASSO) would require specifying the adjusted effect sizes and, importantly, quantifying the uncertainty in feature selection, which remains an open problem in high-dimensional inference.
In summary, we advocate the use of sample size determination methods that match, as closely as possible, the analytic approaches that will actually be applied at the data analysis stage and that capitalize on prior knowledge of the underlying mechanism of interest. If the important features have strong signals, both HCT- and DS-based approaches provide adequate sample size calculations, and there is little difference between them. Given that the HCT method is computationally fast and accounts for uncertainty in the feature selection threshold, it is recommended in practice. If the important features are relatively abundant but weak, we recommend the CV approach as it gives the least conservative sample size, albeit at greater computational cost. If the important features are rare and weak, we recommend the HCT-based approach since it provides the desired sample sizes with little conservatism and is computationally efficient. Overall, our work builds upon and further advances the pioneering work of DS2007 for sample size determination in high-dimensional classification problems.
Supplementary material
Supplementary Material is available at http://biostatistics.oxfordjournals.org.
Funding
M.W. and B.N.S. acknowledge NIH grant R21DA024273 for salary support during the initial conduct of this study. P.X.K.S.'s research is funded in part by NIH U54-DK-083912-05 and NSF DMS-1513595. This work was also partially funded by grants NIH/EPA P20 ES018171/RD83480001 and P01 ES022844/RD83543601.
Acknowledgments
The publication's contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH or US EPA. Conflict of Interest: None declared.
References
- Cai T., Cheng S. (2008). Robust combination of multiple diagnostic tests for classifying censored event times. Biostatistics 9(2), 216–233.
- Clarke R., Ressom H. W., Wang A., Xuan J., Liu M. C., Gehan E. A., Wang Y. (2008). The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nature Reviews Cancer 8(1), 37–49.
- de Valpine P., Bitter H. M., Brown M. P. S., Heller J. (2009). A simulation-approximation approach to sample size planning for high-dimensional classification studies. Biostatistics 10(3), 424–435.
- Dobbin K. K., Simon R. M. (2007). Sample size planning for developing classifiers using high-dimensional DNA microarray data. Biostatistics 8(1), 101–117.
- Donoho D., Jin J. (2008). Higher criticism thresholding: optimal feature selection when useful features are rare and weak. Proceedings of the National Academy of Sciences 105(39), 14790–14795.
- Donoho D., Jin J. (2009). Feature selection by higher criticism thresholding achieves the optimal phase diagram. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 367(1906), 4449–4470.
- Gadegbeku C. A., Gipson D. S., Holzman L., Ojo A. O., Song P. X., Barisoni L., Sampson M. G., Kopp J. B., Lemley K. V., Nelson P. J. and others (2013). Design of the Nephrotic Syndrome Study Network (NEPTUNE): a multi-disciplinary approach to understanding primary glomerular nephropathy. Kidney International 83(4), 749–756.
- Hamburg M. A., Collins F. S. (2010). The path to personalized medicine. New England Journal of Medicine 363(4), 301–304.
- Hall P., Jin J. (2010). Innovated higher criticism for detecting sparse signals in correlated noise. The Annals of Statistics 38(3), 1686–1732.
- Hwang D., Schmitt W. A., Stephanopoulos G., Stephanopoulos G. (2002). Determination of minimum sample size and discriminatory expression patterns in microarray data. Bioinformatics 18(9), 1184–1193.
- Jin J. (2009). Impossibility of successful classification when useful features are rare and weak. Proceedings of the National Academy of Sciences 106(22), 8859–8864.
- Johnson R. A., Wichern D. W. (2002). Applied Multivariate Statistical Analysis, 5th edition. Upper Saddle River, NJ: Prentice-Hall.
- Lin H., Zhou L., Peng H., Zhou X. H. (2011). Selection and combination of biomarkers using ROC method for disease classification and prediction. Canadian Journal of Statistics 39(2), 324–343.
- Liu X., Wang Y., Rekaya R., Sriram T. N. (2012). Sample size determination for classifiers based on single-nucleotide polymorphisms. Biostatistics 13(2), 217–227.
- Liu X., Wang Y., Sriram T. N. (2014). Determination of sample size for a multi-class classifier based on single-nucleotide polymorphisms: a volume under the surface approach. BMC Bioinformatics 15, 190.
- Mardis E. R. (2008). The impact of next-generation sequencing technology on genetics. Trends in Genetics 24(3), 133–141.
- Mukherjee S., Tamayo P., Rogers S., Rifkin R., Engle A., Campbell C., Golub T. R., Mesirov J. P. (2003). Estimating dataset size requirements for classifying DNA microarray data. Journal of Computational Biology 10(2), 119–142.
- NCI-NHGRI Working Group on Replication in Association Studies (2007). Replicating genotype–phenotype associations. Nature 447(7145), 655–660.
- Pepe M. S., Cai T., Longton G. (2006). Combining predictors for classification using the area under the receiver operating characteristic curve. Biometrics 62(1), 221–229.
- Pfeiffer R. M., Bura E. (2008). A model free approach to combining biomarkers. Biometrical Journal 50(4), 558–570.
- Schuster S. C. (2008). Next-generation sequencing transforms today's biology. Nature Methods 5(1), 16–18.
- Simon R. (2008). Development and validation of biomarker classifiers for treatment selection. Journal of Statistical Planning and Inference 138(2), 308–320.
- Wang Y., Miller D. J., Clarke R. (2008). Approaches to working in high-dimensional data spaces: gene expression microarrays. British Journal of Cancer 98(6), 1023–1028.