Abstract
Statistical tests for biomarker identification and classification methods for patient grouping are two important topics in adaptive designs of clinical trials related to genomic studies. In this article, we evaluate four test methods for biomarker identification in the first stage of an adaptive design: a model-based identification method, the popular two-sided t-test, the two-sided nonparametric Wilcoxon Rank-Sum test, and the Elastic-net Regularized Generalized Linear Models. For patient grouping in the second stage, we examine classification methods such as Random Forest, Elastic-net Regularized Generalized Linear Models, Support Vector Machine (SVM), Gradient Boosting Machine (GBM), and Extreme Gradient Boosting (XGBoost). Simulation studies are carried out to assess the performance of the different methods. The best identification methods are chosen based on the well-known F1 score, while the best classification techniques are selected based on the area under the receiver operating characteristic curve (AUC). The chosen methods are then applied to the Adaptive Signature Design (ASD) with a real data set from breast cancer patients for the purpose of evaluating the performance of ASD in different situations.
Keywords: Boosting and optimization, logistic regression, classification trees, genes, sensitive and non-sensitive patients, targeted agent
1. Introduction
Clinical trials play an important role in medical research, in which participants (usually human volunteers) receive specific interventions based on the protocol designed by the researchers. The interventions in a clinical trial could be different medical products, such as new drugs, new devices, or new procedures, that are compared with a placebo. Adaptive designs in clinical trials were proposed in the 1970s, when Efron [11] discussed how to balance a sequential experiment. Wei [34] introduced a class of designs for sequential clinical trials, the biased-coin design, for the purpose of reducing experimental bias and increasing the precision of inference about treatment effects. The main idea of an adaptive design in clinical trials is that the investigator may modify trial and/or statistical procedures based on the review of data from different stages during the experimental process. Such modifications may identify clinical benefits of the treatments more efficiently and increase the probability of success of the clinical development without undermining the validity and integrity of the trial.
Chow et al. [8] presented some statistical considerations of adaptive methods in clinical development, in which the authors mentioned that statistical procedures in a clinical trial include randomization, study design, study objectives/hypotheses, sample size, data monitoring and interim analysis, the statistical analysis plan, and/or methods for data analysis. Group sequential designs in clinical trials were discussed by many authors, including Gordon Lan and DeMets [19], Posch and Bauer [26], Jennison and Turnbull [21], and Liu et al. [25]. Chow and Chang [7] provided a review of adaptive design methods in clinical trials, and pointed out that the popularity of adaptive designs is mainly due to three reasons: (1) they reflect medical practice in the real world; (2) they are ethical with respect to both efficacy and safety (toxicity) of the test treatments under investigation; and (3) they are flexible and efficient in the early and late phases of clinical development.
Adaptive designs for clinical trials of a targeted agent have been proposed over the past decade or more. For instance, due to the heterogeneous nature of tumor types in an oncology study, a new generation of agents under development is molecularly targeted. When these agents enter the definitive stage of clinical evaluation, researchers ideally wish to use reliable assays to select sensitive patients, restrict eligibility to patients with sensitive tumors, and perform specific evaluations on that subset. Freidlin and Simon [14] proposed an Adaptive Signature Design (ASD) for generating and prospectively testing a gene expression signature for sensitive patients. The ASD consists of three steps: biomarker identification, classifier development, and performance assessment. In the first step, a set of candidate predictive biomarkers (genes) is identified using the training data set. In the second step, the predictive biomarkers (or sensitive genes) identified in stage one are used to classify the patients in the test data set as sensitive or non-sensitive. In the third and final step, after the classification of patients into groups is done, a test of the treatment effect is performed on the subgroup with the sensitive patients.
In this article, we propose to study the statistical tests and methods used in the first two steps of biomarker adaptive designs. The study involves (1) comparing four test methods for biomarker identification, i.e. a logistic regression model-based identification method, the popular two-sided Student t-test, the two-sided nonparametric Wilcoxon Rank-Sum test, and the Elastic-net Regularized Generalized Linear Models; and (2) extending the classification method comparison performed by Lee et al. [22] by including recently developed machine learning approaches such as Random Forest, Elastic-net Regularized Generalized Linear Models, Support Vector Machine (SVM), Gradient Boosting Machine (GBM), and Extreme Gradient Boosting (XGBoost). The best identification and classification methods will be selected based on the F1 scores and the AUC values.
The rest of this article will be organized as follows. In Section 2, we give a brief description of the classification methods and propose a comparison procedure for evaluating different test methods in biomarker identification and patient classification. In Section 3, statistical simulations are carried out to assess the performance of biomarker identification methods and the classification procedures, in which training data and testing data will be generated in different situations for the comparison of different methods. In Section 4, the selected best identification methods and the best classification techniques will be applied to the Adaptive Signature Design (ASD) for the purpose of evaluating the performance of ASD in different situations. A real data set for breast cancer patients will be used in the application. Discussion and future studies related to tests and classification methods in clinical trials will be presented in Section 5.
2. Methods and comparison
In this section, we first present a brief introduction of the main classification methods that will be used in this article. For evaluating different test methods in biomarker identification and for selecting the best classification methods in an adaptive design, we propose a comparison procedure in Section 2.2.
2.1. Classification models
2.1.1. Elastic-net regularized generalized linear models (elastic-net regularized GLMs)
Elastic-net Regularization and Variable Selection was first introduced by Zou and Hastie [36] to improve the performance of variable selection by combining the lasso and ridge penalties. Elastic-net regularization can be applied in generalized linear models. Computationally, fitting an elastic-net regularized generalized linear model amounts to solving a penalized maximum likelihood problem:

  min over (β0, β) of (1/N) Σ_{i=1}^{N} ℓ(y_i, β0 + x_i^T β) + λ [ (1 − α)/2 ||β||² + α ||β||₁ ],

over a grid of values of λ covering the entire range. Here, ℓ(y_i, β0 + x_i^T β) is the negative log-likelihood contribution for observation i. The elastic-net penalty is controlled by α, which bridges the gap between the lasso (α = 1, the default) and ridge regression (α = 0).
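To make the role of α concrete, here is a minimal sketch (written in Python for illustration; the analyses in this article use R) that evaluates the elastic-net penalty term for a made-up coefficient vector:

```python
def elastic_net_penalty(beta, lam, alpha):
    """Elastic-net penalty: lam * [ (1-alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1 ]."""
    l1 = sum(abs(b) for b in beta)   # lasso part
    l2 = sum(b * b for b in beta)    # ridge part (squared L2 norm)
    return lam * ((1 - alpha) / 2 * l2 + alpha * l1)

beta = [0.5, -2.0, 0.0, 1.5]
print(elastic_net_penalty(beta, lam=0.1, alpha=1.0))  # pure lasso: 0.1 * 4.0 = 0.4
print(elastic_net_penalty(beta, lam=0.1, alpha=0.0))  # pure ridge: 0.1 * 3.25 = 0.325
```

Setting α between 0 and 1 mixes the two penalties, which is what allows the method to both shrink correlated coefficients together and set some exactly to zero.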
In the simulation study and application of this article, we implemented the Elastic-net Regularized Generalized Linear Models with the glmnet package in R and use the name 'GLMNET' as an abbreviation for this method in the plots of the simulation study. The hyperparameters λ and α in the penalized maximum likelihood are randomly generated, and the best hyperparameters are selected by cross-validation.
2.1.2. Support vector machine (SVM)
Support Vector Machine (SVM) is a supervised modern machine learning technique for classification, which is widely applied to real-world problems such as face detection and text or image classification. The original SVM algorithm was invented by Vapnik and Lerner [31]. Cortes and Vapnik [9] pointed out that the support-vector network is a learning machine for two-group classification problems and was originally implemented for the restricted case where the training data can be separated without errors. The authors introduced the soft margin classifier that extended the support-vector network to non-separable training data. The books [32,33] presented the fundamental ideas behind the statistical theory of learning and generalization, discussed a method for determining the necessary and sufficient conditions for consistency of the learning process, and displayed applications of function estimation to real-life problems. The papers by Bartlett [2] and Shawe-Taylor et al. [28] gave the first rigorous statistical bounds on the generalization of hard margin SVMs. In a later paper, Shawe-Taylor and Cristianini [29] gave statistical bounds on the generalization of soft margin algorithms and for the regression case.
2.1.3. Random forest
Breiman [4] proposed Random Forest based on bagging: a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. Liaw and Wiener [23] gave a nice introduction to classification and regression using Random Forest, in which they pointed out that Random Forest constructs classification and regression trees using a modified approach that adds an additional layer of randomness to bagging. The main difference between standard trees and a random forest is that the former constructs trees in which each node is split using the best split among all predictors, while in the latter approach each node of a tree is split using the best among a subset of predictors randomly chosen at that node. Liaw and Wiener [23] pointed out that the random tree approach is a somewhat counterintuitive strategy but turns out to perform very well compared to many other classifiers, including discriminant analysis, support vector machines, and neural networks, and is robust against overfitting [4]. Moreover, the Random Forest approach is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest).
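The bagging idea underlying Random Forest can be sketched in a few lines: draw bootstrap samples, fit a weak learner on each, and aggregate the predictions by majority vote. The following is an illustrative Python sketch, not the randomForest algorithm itself (it omits the random feature subsets and uses a hypothetical one-dimensional decision stump in place of a full tree):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw n points with replacement from a data set of size n."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Fit a one-dimensional threshold classifier minimizing training errors."""
    best_err, best_thr = None, None
    for thr in sorted({x for x, _ in sample}):
        err = sum((x > thr) != y for x, y in sample)  # predict 1 when x > thr
        if best_err is None or err < best_err:
            best_err, best_thr = err, thr
    return lambda x, thr=best_thr: int(x > thr)

def majority_vote(votes):
    """Aggregate the ensemble's predictions by majority vote."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
data = [(x, int(x > 0)) for x in (-5, -4, -3, -2, -1, 1, 2, 3, 4, 5)]
ensemble = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]
prediction = majority_vote([stump(2.5) for stump in ensemble])  # vote of 25 stumps
```

Each individual stump may overfit its bootstrap sample in a different way; the vote averages those differences out, which is the intuition behind the robustness of bagged ensembles.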
2.1.4. Gradient boosting machines (GBM)
Breiman [3] first observed that boosting in machine learning can actually be interpreted as an optimization algorithm on a suitable cost function. Based on this idea, Friedman (in work written in 1999 and published in 2001 [16]) proposed gradient boosting machines (GBM), developed explicit regression gradient boosting algorithms, and discussed the techniques further in a follow-up paper [17]. Generally speaking, GBM is a family of machine learning techniques for regression and classification problems that exploit the connection between boosting and optimization discussed in Freund and Schapire [15] and Friedman et al. [18]. The core of GBM is a gradient-descent-based formulation of boosting, which produces a prediction model in the form of an ensemble of weak prediction models by optimizing an arbitrary differentiable loss function.
The main difference between Random Forest and GBM is that the former is based on bagging while GBM is based on boosting and optimization. Random Forest uses bootstrapping to generate many samples from the original data, and a decision tree is constructed for each sample. A single decision tree may be prone to overfitting, thus generating poor predictions. To mitigate the overfitting problem, Random Forest generates a large number of trees, with each tree possibly overfitting the data in a different way, and high-accuracy predictions are expected to be obtained by majority voting, which averages out the differences. In Random Forest classification, the trees (or classifiers) are generated independently. On the other hand, GBM is based on boosting and optimization: it starts with a weak classifier, and each subsequent classifier is generated to improve the already trained ensemble. The final predictions in GBM are generated by minimizing an arbitrary differentiable loss function. In practice, Random Forest is usually much easier to tune than GBM, since Random Forest has essentially only one hyper-parameter to set: the number of features to randomly select at each node of a tree. GBMs, however, have several hyper-parameters that need to be tuned, including the number of trees, the tree depth (or number of leaves), and the shrinkage (or learning rate).
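The boosting loop described above can be made concrete with a minimal squared-error gradient boosting sketch (Python for illustration, with regression stumps as weak learners and toy one-dimensional data; this is not the gbm package's algorithm, which supports general loss functions and full trees):

```python
def fit_stump(x, r):
    """Fit a one-split regression stump to residuals r by minimizing squared error."""
    best = None
    for thr in sorted(set(x))[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= thr]
        right = [ri for xi, ri in zip(x, r) if xi > thr]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - lmean) ** 2 for ri in left)
               + sum((ri - rmean) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda xi: lmean if xi <= thr else rmean

def gradient_boost(x, y, n_rounds=50, lr=0.1):
    """Each round fits a stump to the negative gradient of the squared loss
    (the current residuals) and adds it to the ensemble with shrinkage lr."""
    f0 = sum(y) / len(y)
    stumps = []
    predict = lambda xi: f0 + sum(lr * s(xi) for s in stumps)
    for _ in range(n_rounds):
        residuals = [yi - predict(xi) for xi, yi in zip(x, y)]
        stumps.append(fit_stump(x, residuals))
    return predict

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 1, 1, 1, 5, 5, 5, 5]
model = gradient_boost(x, y)  # model(2) is close to 1, model(7) close to 5
```

Note how the stumps are fit sequentially to the residuals of the current ensemble, unlike the independently grown trees of a Random Forest; the learning rate `lr` is the shrinkage hyper-parameter mentioned above.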
2.1.5. Extreme gradient boosting (XGBoost)
Extreme Gradient Boosting (XGBoost) is an implementation of gradient boosting machines developed by Tianqi Chen and his colleagues in recent years and summarized in Chen and Guestrin [6]. Both GBM and XGBoost are built on traditional boosting techniques, in which new models are added sequentially to correct the errors made by the existing models until no further improvement can be made. When the minimization of the loss is carried out by the gradient descent algorithm and the ensemble of models is used to make the final prediction, the approach is called gradient boosting. The main difference between GBM and XGBoost is that XGBoost uses a more regularized model formalization in the boosting process to control over-fitting, which often yields better predictive performance in practice than other boosting implementations, including GBM.
2.2. A comparison procedure of different methods
In the study of animal and human genetics, a large number of genes are usually involved. For instance, recent studies estimated the number of genes in humans to be between 19,000 and 20,000 (e.g. [12]). For a given biological state or disease such as cancer, researchers often need to decide which genes, and how many, are related to the disease. As mentioned in Section 1, the following four tests will be used in this study to screen and identify sensitive genes: the logistic regression model-based test, the well-known two-sided Student t-test, the Wilcoxon Rank-Sum test, and the elastic-net regularized generalized linear models, which fit generalized linear models via penalized maximum likelihood in R. For simplicity, we consider only one treatment (such as a new drug) in a clinical trial versus a control group. The response variable is assumed to be binary with two possibilities: the patient responds to the treatment (e.g. a reduction in tumor size), or the patient does not respond.
It should be pointed out that many other statistical test methods could be used to detect sensitive genes among a large group of genes, such as the popular stepwise method. However, when several thousand genes (which act as predictors) are involved in a study, using the stepwise method for variable selection is time-consuming and therefore not recommended.
A simulation study will be carried out in Section 3 to compare the performance of the four gene-screening methods. Similar to the criteria used in Troyanskaya et al. [30], the true positive rate (also called sensitivity or recall, denoted by Sn in this article) and the true negative rate (also called specificity, denoted by Sp) will be calculated for each of the four testing methods:

  Sn = TP / (TP + FN),  Sp = TN / (TN + FP),  (1)

where TP, FN, TN, and FP are the numbers of true positives, false negatives, true negatives, and false positives, respectively.
Another measure of the effectiveness of a test is the well-known F1 score, which combines the sensitivity and the precision of a test and will be used to evaluate the performance of the four gene-screening methods. The F1 score is defined as

  F1 = 2 · precision · Sn / (precision + Sn),  (2)

where precision = TP / (TP + FP).
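For instance, treating the K sensitive genes as the positives, the F1 score can be computed directly from the counts of true positives, false positives, and false negatives; the following Python illustration uses made-up counts:

```python
def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and sensitivity (recall)."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return 2 * precision * sensitivity / (precision + sensitivity)

# Say a screening method flags 12 genes, 8 of the 10 truly sensitive ones among them:
score = f1_from_counts(tp=8, fp=4, fn=2)  # precision 2/3, recall 4/5, F1 = 8/11
```

Because it is a harmonic mean, the F1 score is low whenever either precision or sensitivity is low, so a method cannot score well by simply flagging many genes (or very few).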
In the simulation study presented in the next section, patients with sensitive genes (called sensitive patients) and without sensitive genes (called non-sensitive patients) will be generated. Six classification methods will be first trained on a training data set and then applied to a test data set for patient identification. The six classification methods are Logistic Regression, Support Vector Machine (SVM), Random Forest, the Elastic-net Regularized Generalized Linear Models, the Gradient Boosting Machine (GBM), and the Extreme Gradient Boosting (XGBoost).
Similar to the gene screening procedure, the true positive rate (sensitivity, Sn) and the true negative rate (specificity, Sp) defined below will be calculated for each of the six methods and used to evaluate the performance of the classification methods:

  Sn = TP / (TP + FN),  Sp = TN / (TN + FP),  (3)

where the counts TP, FN, TN, and FP are now taken over the patients in the test data set.
In clinical data analysis, the Receiver Operating Characteristic (ROC) curve, a plot of the true positive rate (sensitivity, Sn) against the false positive rate (1 − specificity, i.e. 1 − Sp) in a two-dimensional space, is frequently used to assess the effectiveness of a classification method and to compare different methods. The area under the receiver operating characteristic curve (AUC) of a classifier is a scalar between 0 and 1. It combines the sensitivity and specificity of a classifier and provides a very useful measure of the benefit of using the classifier in the analysis. When comparing different classification methods, the higher the AUC value, the better the performance of a binary classifier. In this article, we will use the AUC to assess the performance of the different classification methods.
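Equivalently, the AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as 1/2 (the Mann-Whitney interpretation). A small Python illustration with hypothetical classifier scores:

```python
def auc(scores_pos, scores_neg):
    """AUC = P(score of a random positive > score of a random negative),
    ties counted as 1/2 -- the normalized Mann-Whitney U statistic."""
    total = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                total += 1.0
            elif sp == sn:
                total += 0.5
    return total / (len(scores_pos) * len(scores_neg))

pos = [0.9, 0.8, 0.6]   # scores of truly sensitive cases
neg = [0.7, 0.4, 0.3]   # scores of truly non-sensitive cases
print(auc(pos, neg))    # 8 of 9 pairs correctly ordered: 8/9
```

An AUC of 0.5 corresponds to random ordering and 1.0 to perfect separation, which is why higher AUC indicates a better binary classifier.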
3. Simulation study
In Section 2, we have mentioned four test methods applicable to biomarker identification (i.e. sensitive genes in this study) and six methods that are available for classification in the second stage of an adaptive design. In this section, a simulation study will be conducted to evaluate the performance of these methods. The best biomarker identification method and the best subgroup classification method will be determined based on the results of our simulation study. All the procedures, from data generation to model fitting, are done using the statistical computing software R. The main packages used in the study include 'gbm', 'glmnet', 'randomForest', 'xgboost', 'e1071' for SVM, and the base R function glm for Logistic Regression.
3.1. Simulation setup
In this simulation design, we assume that patients with tumors are randomly assigned to a treatment group or a control group in a clinical trial. Data are simulated to describe expression levels of different microarray genes: the higher the mean expression value of a gene, the more sensitive the gene. The number of sensitive genes in a patient determines the sensitivity of the patient. Therefore, patients' response rates to treatment in the simulated data are determined in advance.
Specifically, assume that in a clinical trial there are N patients, each patient has L evaluated genes, and patients with K sensitive genes are called 'sensitive' patients. For the ith patient, let p_i denote the response rate and let t_i be the treatment indicator (t_i = 0 for a patient receiving the standard treatment or placebo, t_i = 1 for a patient receiving the novel treatment).
Similar to the simulation study conducted in Freidlin and Simon [14], gene expression levels were generated from (a) a multivariate normal distribution with mean μ, variance σ², and correlation ρ for sensitive genes in sensitive patients; and (b) a multivariate normal distribution with mean 0, variance σ², and correlation ρ for non-sensitive genes in both sensitive and non-sensitive patients (the latter having zero sensitive genes). In addition, uniform random noise is added to the gene expressions to approximate the distributions of gene expressions observed in practice (see, e.g. [30]). In the simulation, the response probabilities of patients are generated by the following logistic regression model:
  log( p_i / (1 − p_i) ) = μ0 + λ t_i + Σ_{j=1}^{K} γ_j t_i x_ij,  (4)

where p_i denotes the probability of response for the ith patient, x_ij is the expression level of the jth sensitive gene for the ith patient, μ0 is the intercept, λ is the base level of the treatment main effect regardless of the gene expressions of different patients, and the γ_j's are the coefficients of the interaction terms between treatment and sensitive gene expressions. To simplify the simulation, all gene main effects and the treatment-expression interactions for non-sensitive genes are assumed to be 0.
Simplifying further, we assume γ_1 = γ_2 = ⋯ = γ_K = γ in (4). Then the response rates for patients generated from (4) will be

  p_i = exp( μ0 + λ t_i + γ t_i Σ_{j=1}^{K} x_ij ) / [ 1 + exp( μ0 + λ t_i + γ t_i Σ_{j=1}^{K} x_ij ) ].  (5)
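Under Model (5), a patient's response probability can be computed as follows (a Python sketch for illustration; the parameter values μ0, λ, and γ below are placeholders, not the values used in the article's simulation):

```python
import math

def response_prob(t, x_sensitive, mu0=-1.0, lam=0.5, gamma=0.8):
    """Logistic response probability from Model (5); t is the treatment
    indicator (0/1) and x_sensitive the K sensitive gene expressions.
    mu0, lam, gamma are illustrative placeholder values."""
    eta = mu0 + lam * t + gamma * t * sum(x_sensitive)
    return 1.0 / (1.0 + math.exp(-eta))

x = [1.2] * 10                     # a sensitive patient's K = 10 gene expressions
p_control = response_prob(0, x)    # genes do not enter the model when t = 0
p_treated = response_prob(1, x)    # treatment-expression interaction raises p
```

Note that for control patients (t = 0) the linear predictor reduces to μ0, so their response rate does not depend on gene expression, exactly as in the non-sensitive rows of Table 1.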
We would like to point out that Model 4 only focuses on linear relationships between the generated gene expression data and the response variable. Nonlinear relationships such as different functions of the genes are not considered in this article due to time restrictions. We will investigate more topics in gene data analysis in our future study.
In this simulation study, gene expressions for 400 patients will be generated first, with 200 patients in the treatment group and the rest in the control group. Because of computing power limitations and without any loss of generality, we assume that each patient has L = 1000 genes.
Among the L = 1000 genes, we assume each sensitive patient has K = 10 sensitive genes, which are generated from normal distributions with positive mean μ: the larger the mean μ, the higher the response rate of the patient. All generated gene data are blurred by randomly generated uniform noise. Non-sensitive patients are defined as those patients whose (non-sensitive) genes are generated from normal distributions with mean 0. Uniform random noise is also added to the non-sensitive gene expressions.
Among the 400 patients, we assume that 40 patients (10% of the total number of patients) in the treatment group have the K = 10 sensitive genes and thus are sensitive to the treatment, while all patients in the control group are assumed to be non-sensitive, i.e. their gene expressions have mean 0.
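The gene expression generation just described can be sketched as follows (a Python/NumPy illustration; the correlation ρ, variance σ², and noise range used here are placeholder values, not those of the article's simulation):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def simulate_genes(n_patients, n_genes, mean, rho=0.2, sigma2=1.0, noise=0.5):
    """Equicorrelated multivariate normal gene expressions plus uniform noise.
    rho, sigma2 and noise are illustrative placeholders."""
    cov = sigma2 * ((1 - rho) * np.eye(n_genes) + rho * np.ones((n_genes, n_genes)))
    expr = rng.multivariate_normal(np.full(n_genes, mean), cov, size=n_patients)
    return expr + rng.uniform(-noise, noise, size=expr.shape)

sensitive_block = simulate_genes(40, 10, mean=1.0)   # 40 sensitive patients, K = 10 genes
noise_block = simulate_genes(400, 990, mean=0.0)     # non-sensitive genes, all 400 patients
```

A full data set would be assembled by placing the sensitive block in the corresponding rows/columns for the 40 sensitive treatment-group patients and mean-0 expressions everywhere else.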
The simulation setup in this study can be summarized as follows:
- Gene expression levels representing the different scenarios are generated as follows:
  - sensitive genes in sensitive patients are generated from a multivariate normal distribution with the same correlation ρ as in the simulation study conducted in [14], in four different scenarios with mean μ = 1.3, 1.0, 0.8, and 0.6 and a common variance σ².
  - non-sensitive genes in both sensitive and non-sensitive patients are generated from a multivariate normal distribution with the same correlation ρ, mean 0, and variance σ².
- Random noise for each gene is generated from a uniform distribution. Three different noise levels, denoted U1, U2, and U3, are added to the generated gene data.
- The binary response for patient i is generated from the Bernoulli distribution with probability p_i calculated from Model (5).
- For each of the 16 combinations (four different means; the noiseless case U = 0 and the three uniform random noise levels), the simulation is repeated 500 times (500 loops).
For each combination and each of the 500 loops, fixed values of the parameters μ0, λ, and γ are used. The response rates for the four mean scenarios based on (5) are listed in Table 1.
Table 1.
Response rates under Model (5) for sensitive patients and non-sensitive patients at the different mean levels of gene expression.

| Patient group | Mean μ of gene expression | Response rate |
|---|---|---|
| Sensitive | 1.3 | 0.9661 |
| Sensitive | 1.0 | 0.9206 |
| Sensitive | 0.8 | 0.8641 |
| Sensitive | 0.6 | 0.7773 |
| Non-sensitive | 0 | 0.2497 |
3.2. Simulation procedure
Step 1: For each generated data set, we first compare the four biomarker identification methods: the logistic regression model-based selection, the Wilcoxon Rank-Sum test, the Student t-test, and the elastic-net regularized generalized linear model. The 400 patients in this simulation study are split into two groups based on their responses: the 'Success Group' with Y = 1 and the 'Failure Group' with Y = 0. Comparison of the four identification methods is based on the F1 scores calculated from the sensitivity and precision of selecting the correct sensitive genes (a gene with p-value < 0.05 is declared sensitive).
- Logistic regression model-based method: for each gene j, fit the single-gene logistic model with the treatment-expression interaction term when data are generated from Model (4). Claim gene j to be sensitive if the p-value for testing whether the interaction coefficient is zero is below a specified level (0.05 in this study). Let A_k be the number of sensitive genes correctly selected and B_k the number of non-sensitive genes correctly excluded in a given loop k. The sensitivity and specificity of the logistic model-based method in each of the 500 loops are calculated as

  Sn_k = A_k / K,  Sp_k = B_k / (L − K),

where K = 10 is the number of sensitive genes among the L = 1000 genes. The F1 value for the kth loop is calculated from the sensitivity and precision values. The F1 score of the logistic model-based method is defined as the average of the F1 values over the 500 loops.
- Elastic-net Regularized Generalized Linear Models: this method selects sensitive genes from the L = 1000 genes directly. Similar to the logistic model-based method, the F1 score of this method is calculated based on the results of the 500 loops.
- Two-sided Wilcoxon Rank-Sum test: for each gene j, test the null hypothesis that the distribution of its expression values in the 'Success Group' is the same as in the 'Failure Group'. Claim gene j to be sensitive if the p-value for the Wilcoxon Rank-Sum test statistic W is less than a specified level (0.05 in this study). The F1 score of the Wilcoxon Rank-Sum test is calculated using the same procedure specified for the logistic model-based method.
- Two-sided Student t-test: for each gene j, compare its expression between the 'Success Group' and the 'Failure Group'. Claim gene j to be sensitive if the p-value for the Student t-test statistic t is less than a specified level (0.05 in this study). Again, the F1 score of the Student t-test is calculated using the same procedure specified for the logistic model-based method.
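For illustration, the rank-sum screening step can be sketched as follows in plain Python, using the normal approximation without tie or continuity corrections (a simplification of what R's wilcox.test computes, applied to made-up expression values):

```python
import math

def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation.
    Assumes no tied values; no continuity or tie correction is applied."""
    n1, n2 = len(x), len(y)
    rank = {v: i + 1 for i, v in enumerate(sorted(list(x) + list(y)))}
    w = sum(rank[v] for v in x)                 # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2               # E[W] under the null
    var = n1 * n2 * (n1 + n2 + 1) / 12          # Var[W] under the null
    z = (w - mean) / math.sqrt(var)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 'Success Group' vs 'Failure Group' expressions for one gene (made-up data):
p = rank_sum_p([5.1, 6.2, 7.3, 8.4, 9.5], [1.0, 1.5, 2.2, 2.8, 3.3])
is_sensitive = p < 0.05   # flag the gene as sensitive
```

In the simulation this test would be applied to each of the L = 1000 genes in turn, flagging those with p < 0.05.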
Step 2: After the biomarker identification step, we combine the response variable Y, the treatment variable TRT, and the selected significant genes into a new data set for each of the four methods. The new data set is then split into two halves: 200 Stage I patients (100 in the treatment group and 100 in the control group, including 20 sensitive patients) as the training data set, and 200 Stage II patients (100 in the treatment group and 100 in the control group) as the testing data set. Based on the four biomarker identification methods, four pairs of training and testing data sets are formed. For example, after performing the Student t-test, the selected 'sensitive' genes together with the response and treatment variables form one training data set named 'trainDF_t' and one testing data set named 'testDF_t'. The six classification methods are applied to each pair of training and testing data sets for patient classification.
Step 3: For each pair of training and testing data sets, each classification method is fit to the training data set 'trainDF', and the resulting model is used on the testing data set 'testDF' for prediction. The response rates of all patients in the 'testDF' data set are predicted. A patient with a predicted response rate of at least 0.5 is classified as a sensitive patient; otherwise, the patient is classified as non-sensitive. The sensitivity and specificity of each classification method are calculated using the same procedure specified for the logistic model-based method in Step 1. For example, in loop k let C_k be the number of sensitive patients correctly detected in the test data set 'testDF_t' by the SVM (support vector machine) method and D_k the number of non-sensitive patients correctly identified in 'testDF_t'. Then the sensitivity and specificity of the SVM method in this loop are defined as

  Sn_k = C_k / 20,  Sp_k = D_k / 180,

where 20 and 180 are the numbers of sensitive and non-sensitive patients in the test data set.
The area under the receiver operating characteristic curve AUC_k for the kth loop is calculated from the sensitivity Sn_k and specificity Sp_k. The final AUC value for the SVM classification method is defined as the average of the AUC_k over the 500 loops.
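The thresholding rule and the resulting sensitivity/specificity computation in Step 3 can be sketched as follows (Python for illustration, with hypothetical predicted response rates):

```python
def sens_spec(pred_rates, truth, threshold=0.5):
    """Classify patients as sensitive when the predicted response rate is at
    least `threshold`, then compute sensitivity and specificity."""
    labels = [int(p >= threshold) for p in pred_rates]
    tp = sum(1 for yhat, y in zip(labels, truth) if yhat == 1 and y == 1)
    tn = sum(1 for yhat, y in zip(labels, truth) if yhat == 0 and y == 0)
    n_pos = sum(truth)
    return tp / n_pos, tn / (len(truth) - n_pos)

rates = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1]   # predicted response rates in testDF
truth = [1, 1, 1, 0, 0, 0]               # true sensitivity status of the patients
sensitivity, specificity = sens_spec(rates, truth)  # (2/3, 2/3) for this toy example
```

Varying the threshold away from 0.5 traces out the ROC curve described in Section 2.2, from which the per-loop AUC is obtained.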
3.3. Simulation results
Using the procedure discussed in Section 3.2, the four biomarker identification methods in Stage I of an adaptive design and the six classification methods in Stage II are evaluated. Results of the comparison based on the simulated data are presented in this section.
3.3.1. Results on comparing the four biomarker identification methods
Table 2 lists the F1 scores for the four gene identification methods based on K = 10 sensitive genes among L = 1000 genes and 500 simulation replicates, in which the data were simulated from Model (4). The four columns of results correspond to the four methods: the logistic regression model, the Elastic-net Regularized Generalized Linear Models, the Wilcoxon Rank-Sum test, and the Student t-test. The rows of the table give the different scenarios for the mean gene expression at the four levels μ = 1.3, 1.0, 0.8, and 0.6. The first block of the table shows the F1 scores of the four identification methods based on expression data generated from multivariate normal distributions with the four different means and without uniform random noise added (U = 0); the best score in each row is marked in bold. The last three blocks present the F1 scores of the four methods based on the normally generated gene expression data with the three uniform random noise levels U1, U2, and U3 added.
Table 2.
F1 scores of the four identification methods with uniform noise levels U = 0, U1, U2, and U3 added to gene expressions generated from multivariate normal distributions with correlation ρ and four different means.

| Mean of gene expression | Logistic regression | Elastic-net Regularized GLM | Wilcoxon Rank-Sum test | Student t-test |
|---|---|---|---|---|
| **U = 0** | | | | |
| μ = 1.3 | 0.593 | **0.623** | 0.381 | 0.583 |
| μ = 1.0 | 0.607 | **0.621** | 0.386 | 0.587 |
| μ = 0.8 | **0.596** | 0.594 | 0.386 | 0.585 |
| μ = 0.6 | 0.583 | **0.588** | 0.393 | 0.581 |
| **Noise level U1** | | | | |
| μ = 1.3 | 0.570 | **0.609** | 0.368 | 0.562 |
| μ = 1.0 | 0.581 | **0.615** | 0.371 | 0.567 |
| μ = 0.8 | 0.602 | **0.610** | 0.389 | 0.588 |
| μ = 0.6 | **0.583** | **0.583** | 0.384 | 0.566 |
| **Noise level U2** | | | | |
| μ = 1.3 | 0.576 | **0.598** | 0.371 | 0.573 |
| μ = 1.0 | 0.586 | **0.610** | 0.369 | 0.579 |
| μ = 0.8 | **0.579** | 0.568 | 0.381 | 0.573 |
| μ = 0.6 | **0.589** | 0.575 | 0.390 | 0.579 |
| **Noise level U3** | | | | |
| μ = 1.3 | 0.554 | **0.616** | 0.356 | 0.551 |
| μ = 1.0 | 0.543 | **0.600** | 0.346 | 0.533 |
| μ = 0.8 | 0.530 | **0.597** | 0.342 | 0.518 |
| μ = 0.6 | 0.518 | **0.575** | 0.337 | 0.510 |
From the results in Table 2, we can see that the Elastic-net Regularized Generalized Linear Models provided the best F1 scores among the four methods for almost all four means. The logistic regression model ranked second in terms of the F1 score. Therefore, we recommend the Elastic-net Regularized Generalized Linear Models and the logistic regression model-based method for biomarker identification in practice. Other simulations, such as with independent gene expression data and with a weaker correlation ρ, were also conducted in this study. The results are very similar to those presented here and are therefore omitted from this article.
3.3.2. Results on comparing the six patient classification methods
For the purpose of comparing the six patient classification methods in Stage II of an adaptive design, we follow the procedure described in Step 2 and Step 3 of Section 3.2. Specifically, after the biomarker identification step by the four methods (Logistic Model-based, Elastic-net Regularized Generalized Linear Models, Wilcoxon Rank-Sum test, and Student t-test), we combine the response variable Y, the treatment variable TRT, and the selected significant genes into a new data set for each of the four methods. Each of the four new data sets was then split into two halves: 200 Stage I patients ('trainDF') with 100 patients in the treatment group and 100 in the control group, and 200 Stage II patients ('testDF') with 100 patients in the treatment group and 100 in the control group. Each of the six classification methods was fit on the training data set 'trainDF' to learn a model that was then applied to the testing data set 'testDF' to identify the sensitive patients (20 out of 200 in our simulation study).
In order to compare the performance of the six classification methods under different gene identification methods (Logistic Model-based, Elastic-net Regularized Generalized Linear Models, Wilcoxon Rank-Sum test, and the Student t-test), we present the AUC values of the six classification methods in the following figures and tables.
Figure 1 presents the AUC values of the six patient classification methods after the four gene identification methods, based on 500 replicates under Model 4. In this figure, gene expression data were generated from multivariate normal distributions with the four different means and a constant variance; no uniform random noise was added in this case. From the plots, we can see that for mean value 1.3, Random Forest and XGBoost performed better than the other four classification methods; for mean value 1.0, Random Forest, SVM, and XGBoost performed better than the other three classification methods; while for means 0.8 and 0.6, i.e. when the response rates are relatively lower, Random Forest and SVM performed better than the other four classification methods. For all four means, the performance of Logistic Regression and Elastic-net Regularized Generalized Linear Models was not as good as that of the other four classification methods in terms of the AUC values. When the biomarkers were identified by the Elastic-net Regularized Generalized Linear Models at the first stage, the performances of the six classification methods were actually quite close to each other, as shown in the first column of the four subplots.
Figure 1.
AUC values based on 500 runs by the six classification methods. The original data were generated under Model 4, with gene expression data generated from multivariate normal distributions with the four different means, without uniform random noise added.
Table 3 provides numerical results on the performance of the six patient classification methods: the AUC values of the six methods are listed in the last six columns, and the corresponding four gene identification methods are listed in the rows within the four blocks for the four different means. The highest AUC value in each row is marked in bold. From the table, we can see that for the four mean values (1.3, 1.0, 0.8, and 0.6), Random Forest almost always gave the highest AUC values among the six classification methods, with two exceptions: for mean 1.3, when the Elastic-net Regularized Generalized Linear Model was used for biomarker identification, XGBoost provided the highest AUC value of 0.973 (slightly better than the 0.971 of Random Forest); and for mean 0.8, when the t-test method was used for biomarker identification, the AUC values provided by SVM and Random Forest tied at the highest value of 0.932. Furthermore, the AUC values given by Random Forest and SVM are very close to each other for all four means. XGBoost ranked third in terms of the AUC values, while Logistic Regression performed the worst among the six classification methods.
Table 3.
AUC Values for the six patient classification methods based on 500 replicates under Model 4.
| Identification methods | Logistic regression | Elastic-net Regularized GLM | SVM | Random forest | GBM | XGBoost |
|---|---|---|---|---|---|---|
| mean = 1.3 | ||||||
| Elastic-net Regularized GLM | 0.961 | 0.966 | 0.963 | 0.971 | 0.966 | 0.973 |
| Logistic regression | 0.948 | 0.964 | 0.965 | 0.975 | 0.964 | 0.973 |
| t-test | 0.946 | 0.964 | 0.964 | 0.974 | 0.965 | 0.973 |
| Wilcoxon Rank-Sum test | 0.941 | 0.964 | 0.965 | 0.975 | 0.964 | 0.973 |
| mean = 1.0 | ||||||
| Elastic-net Regularized GLM | 0.934 | 0.938 | 0.953 | 0.954 | 0.943 | 0.951 |
| Logistic regression | 0.921 | 0.938 | 0.956 | 0.958 | 0.944 | 0.952 |
| t-test | 0.908 | 0.935 | 0.957 | 0.958 | 0.944 | 0.954 |
| Wilcoxon Rank-Sum test | 0.913 | 0.940 | 0.957 | 0.958 | 0.943 | 0.954 |
| mean = 0.8 | ||||||
| Elastic-net Regularized GLM | 0.894 | 0.902 | 0.931 | 0.933 | 0.909 | 0.922 |
| Logistic regression | 0.869 | 0.905 | 0.932 | 0.935 | 0.914 | 0.928 |
| t-test | 0.859 | 0.902 | 0.932 | 0.932 | 0.893 | 0.910 |
| Wilcoxon Rank-Sum test | 0.854 | 0.898 | 0.927 | 0.929 | 0.907 | 0.922 |
| mean = 0.6 | ||||||
| Elastic-net Regularized GLM | 0.848 | 0.864 | 0.895 | 0.897 | 0.870 | 0.885 |
| Logistic regression | 0.822 | 0.854 | 0.901 | 0.904 | 0.871 | 0.885 |
| t-test | 0.814 | 0.856 | 0.896 | 0.898 | 0.859 | 0.878 |
| Wilcoxon Rank-Sum test | 0.815 | 0.858 | 0.898 | 0.902 | 0.871 | 0.892 |
Gene expression data were generated from multivariate normal distributions with the four different means, without uniform random noise added.
Figure 2 presents the AUC values of the six patient classification methods after the four gene identification methods, based on 500 replicates under Model 4. In this figure, gene expression data were generated from the same multivariate normal distributions used in Figure 1, but with uniform random noise added to the generated gene data. The patterns in the four subplots are very similar to those shown in Figure 1: for mean value 1.3, Random Forest and XGBoost performed better than the other four classification methods; for mean values 1.0 and 0.8, Random Forest, SVM, and XGBoost performed better than the other three classification methods; while for mean 0.6, Random Forest and SVM performed better than the other four classification methods. Again, the performance of Logistic Regression was not as good as that of the other five classification methods for all four mean values in terms of the AUC values.
Figure 2.
AUC values based on 500 runs by the six classification methods. The original data were generated under Model 4, with gene expression data generated from multivariate normal distributions with the four different means, with uniform noise added.
Similar to Table 3, Table 4 provides numerical AUC values of the six classification methods, with the highest AUC value in each row marked in bold. Again we see that for the four mean values (1.3, 1.0, 0.8, and 0.6), Random Forest almost always gave the highest AUC values among the classification methods, with three exceptions: for mean 1.3, when the t-test method was used for biomarker identification, the AUC value provided by XGBoost tied with that of Random Forest at the highest value of 0.970; for mean 0.8, when the Elastic-net Regularized GLM was used for biomarker identification, the AUC value of 0.936 provided by SVM is the highest in the row; and for mean 0.6, when the logistic regression method was used for biomarker identification, the AUC values provided by SVM and Random Forest tied at the highest value of 0.899.
Table 4.
AUC Values for the six patient classification methods based on 500 replicates under Model 4.
| Identification methods | Logistic regression | Elastic-net Regularized GLM | SVM | Random forest | GBM | XGBoost |
|---|---|---|---|---|---|---|
| mean = 1.3 | ||||||
| Elastic-net Regularized GLM | 0.962 | 0.968 | 0.964 | 0.971 | 0.966 | 0.969 |
| Logistic regression | 0.942 | 0.963 | 0.964 | 0.974 | 0.966 | 0.972 |
| t-test | 0.943 | 0.961 | 0.958 | 0.970 | 0.966 | 0.970 |
| Wilcoxon Rank-Sum test | 0.937 | 0.964 | 0.962 | 0.975 | 0.964 | 0.972 |
| mean = 1.0 | ||||||
| Elastic-net Regularized GLM | 0.933 | 0.937 | 0.948 | 0.954 | 0.945 | 0.943 |
| Logistic regression | 0.904 | 0.934 | 0.951 | 0.952 | 0.937 | 0.948 |
| t-test | 0.900 | 0.933 | 0.951 | 0.952 | 0.935 | 0.949 |
| Wilcoxon Rank-Sum test | 0.908 | 0.934 | 0.952 | 0.954 | 0.935 | 0.952 |
| mean = 0.8 | ||||||
| Elastic-net Regularized GLM | 0.900 | 0.906 | 0.936 | 0.927 | 0.919 | 0.929 |
| Logistic regression | 0.878 | 0.909 | 0.936 | 0.942 | 0.919 | 0.931 |
| t-test | 0.882 | 0.914 | 0.935 | 0.937 | 0.916 | 0.933 |
| Wilcoxon Rank-Sum test | 0.887 | 0.916 | 0.934 | 0.939 | 0.915 | 0.933 |
| mean = 0.6 | ||||||
| Elastic-net Regularized GLM | 0.855 | 0.870 | 0.894 | 0.897 | 0.869 | 0.888 |
| Logistic regression | 0.809 | 0.852 | 0.899 | 0.899 | 0.861 | 0.887 |
| t-test | 0.827 | 0.867 | 0.901 | 0.904 | 0.876 | 0.888 |
| Wilcoxon Rank-Sum test | 0.806 | 0.845 | 0.896 | 0.899 | 0.855 | 0.887 |
Gene expression data were generated from multivariate normal distributions with the four different means, with uniform random noise added.
Simulations with gene expression data generated from multivariate normal distributions with other uniform random noises added were also conducted in this study. The results from these simulations were very similar to those from the cases reported above and thus are not presented in this article.
In summary, we have the following conclusions on the six classification methods based on our simulation study:
In terms of AUC, Random Forest outperforms the other five classification methods under all four gene identification methods when gene expression data were generated from normal distributions with the four different means, whether or not uniform random noise was added.
SVM and XGBoost gave slightly lower AUC values than Random Forest in our simulation results. These two classification methods are also highly recommended for real applications.
The classification performance of logistic regression, Elastic-net Regularized GLM, and GBM is not as good as that of Random Forest, SVM, and XGBoost in terms of AUC under all scenarios in our simulation study. However, logistic regression, as a simpler and popular classification method, is often used in practice too, and it serves as a benchmark method in our application section.
Based on the study by Troyanskaya et al. [30] and our own examinations, gene expression data generated from normal distributions with uniform random noise added are a very good approximation to the true distributions of gene expressions observed in practice. It should be pointed out, however, that the conclusions from our simulation study still have limitations. Further simulation studies should be carried out under different distribution families and different correlation structures among gene expression data, and the effects of different sample sizes should be examined as well. We will continue our investigation of different gene identification methods and classification methods in future work.
4. Application of adaptive signature design to breast cancer data
Based on the comparison results of the four biomarker identification methods in Section 3, we found that the Elastic-net Regularized GLM method and the logistic model-based method perform better than the other two methods. Similarly, in the comparison of the six classification methods conducted in Section 3, with different uniform noises added in the four response scenarios at different gene expression levels, we found that Random Forest, SVM, and XGBoost perform better than the other three classification methods in terms of higher AUC values. In this section, we apply the selected methods to the Adaptive Signature Design (ASD) proposed by Freidlin and Simon [14] for the analysis of the breast cancer gene data discussed in the following section.
4.1. Breast cancer data description
Yu et al. [35] studied how to choose an optimal chemotherapy regimen for breast cancer patients. The authors obtained their data set from the GEO (Gene Expression Omnibus [1]) database; it was previously presented in seven independent studies with a total of 1079 breast cancer patients who received neoadjuvant chemotherapy. All the gene expression data were normalized within each dataset using the Robust Multi-array Average (RMA) method [20]. There are three regimen groups in the data set, i.e. anthracycline only (A-group), anthracycline plus paclitaxel (TA-group), and anthracycline plus docetaxel (TxA-group), and each patient was assigned to one group. In their study, patients' responses to chemotherapy were coded as 1 for pCR (pathologic complete response) or 0 for RD (residual disease), where pCR is a potential surrogate marker for survival, a measure of chemosensitivity, and associated with a favorable outcome from chemotherapy. Yu et al. [35] developed a new strategy, called PRES (Personalized REgimen Selection), which includes random forest models with variable selection using both genetic and clinical variables to predict the response of a patient. Their study showed that PRES may significantly increase response rates for breast cancer patients, especially those with HER2 (human epidermal growth factor receptor 2) and ER (estrogen receptor) negative tumors.
In this section, the data set from Yu et al. [35] is used to show the usefulness of the ASD in detecting differences between chemotherapy treatments. Together with the response variable, the data set provides the treatment variable, the gene expression variables, and five feature covariates: AGE, ER, HER2, cancer t-stage, and n-stage. Each patient has 22,283 gene measurements, so the whole data set contains 22,283 gene columns for each of the 1079 patients. Rosner's test, introduced by Rosner [27], is employed to identify outliers in each of the 22,283 genes; it is implemented in the R package ‘EnvStats’, allows testing for several possible outliers, and avoids the problem of masking. Based on Rosner's test, 13,047 genes have no outliers detected, while the number of genes with one outlier is 4530, with two outliers 2371, and with three or more outliers 2335. In this study, we ran our analysis with and without the 2335 genes and found that the results were only slightly different. We decided to conduct our analysis without removing any of the 22,283 genes, since the gene data were already normalized and the identified outliers may be informative real observations. A brief summary of the data set is given in Table 5, including the numbers of patients in the three regimens and the percentages of patients who have pCR in the three groups.
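Rosner's test is the generalized ESD (extreme Studentized deviate) test. As a rough illustration of what the ‘EnvStats’ routine computes, here is a hedged Python sketch (our own implementation, not the package's code) that returns the estimated number of outliers, i.e. the largest i for which the test statistic exceeds its critical value:

```python
import numpy as np
from scipy import stats

def rosner_test(x, k, alpha=0.05):
    """Generalized ESD (Rosner) test for up to k outliers.
    Returns the estimated number of outliers in x."""
    x = np.asarray(x, dtype=float).copy()
    n = len(x)
    n_out = 0
    for i in range(1, k + 1):
        # most extreme remaining point, standardized
        dev = np.abs(x - x.mean())
        j = dev.argmax()
        R = dev[j] / x.std(ddof=1)
        # critical value lambda_i from the t distribution
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        lam = (n - i) * t / np.sqrt((n - i - 1 + t**2) * (n - i + 1))
        if R > lam:
            n_out = i      # largest i with R_i > lambda_i
        x = np.delete(x, j)
    return n_out
```

In the paper's analysis, this test was applied gene by gene across all 22,283 genes with the R implementation.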
Table 5.
Summary of breast cancer data from seven independent studies.
| Regimen | Anthracycline (A) | Paclitaxel and Anthracycline (TA) | Docetaxel and Anthracycline (TxA) | Total |
|---|---|---|---|---|
| Number of patients | 139 | 730 | 210 | 1079 |
| Percent of patients who have pCR | 8.6% | 19.7% | 30.5% | 20.4% |
In this study, we compare two treatments instead of the three in the data set. As an illustration, we apply an enriched ASD procedure to compare the anthracycline only (A-group) and the anthracycline plus paclitaxel (TA-group). After combining the A-group and TA-group, the data set contains 869 patients with 22,283 genes each, a relatively high dimension. For significant gene identification, we first perform variable selection by the logistic model-based method, followed by the Elastic-net Regularized GLM, to reduce the number of genes. Specifically, in the first step, for each gene k we fit the single-gene logistic model with treatment, gene expression, and their interaction, and declare gene k sensitive if the p-value for the treatment-expression interaction coefficient is significant at a specified level. After the first step, the number of genes is reduced from 22,283 to 5024. Second, the Elastic-net Regularized GLM is fit to the 5024 variables, and fewer than 50 genes are kept after variable selection to be used for the ASD procedure assessment.
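The first screening step can be sketched as follows. The paper's analysis was done in R; this illustrative Python version (the function name and the Wald normal approximation are our choices) fits the single-gene logistic model with a treatment-expression interaction by Newton-Raphson and returns the p-value of the interaction term. The elastic-net second step would be handled by a package such as glmnet and is not coded here.

```python
import numpy as np
from scipy import stats

def interaction_pvalue(y, gene, trt, iters=25):
    """Wald p-value for the treatment-expression interaction in the
    single-gene logistic model
        logit P(Y = 1) = b0 + b1*TRT + b2*gene + b3*TRT*gene,
    fitted by Newton-Raphson (Fisher scoring)."""
    X = np.column_stack([np.ones_like(gene), trt, gene, trt * gene])
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        W = p * (1.0 - p)
        H = X.T @ (X * W[:, None])           # Fisher information
        b = b + np.linalg.solve(H, X.T @ (y - p))
    se = np.sqrt(np.linalg.inv(H)[3, 3])     # s.e. of the interaction term
    z = b[3] / se
    return 2.0 * stats.norm.sf(abs(z))
```

Screening then keeps the genes whose interaction p-value falls below the pre-specified level.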
From Table 5, we know that the numbers of patients in the A-group and TA-group are not balanced: the A-group has 139 patients while the TA-group has 730 patients. The pCR rates in the two groups are 8.6% and 19.7%, respectively. In this section, we form four new groups of patients viewed as ‘targeted breast cancer populations’. Each group is made up of the 139 patients from the A-group and a fixed number, n, of patients randomly sampled from the 730 patients in the TA-group. The values of the sample size n considered for the four groups are 500, 400, 300, and 200. In each new group, the 139 patients from the A-group are treated as the control, while those from the TA-group form the treatment group; we denote the four resulting data sets accordingly. Each new data set is further split into training data and testing data, and the ASD procedure is applied to the training and testing data for the purpose of assessing the effect of the procedure.
For example, when n = 500, we first randomly sample 500 patients from the TA-group and, by including the 139 patients in the A-group, form a new data set of N = 639 patients. The ASD procedure is applied to this data set in the following way:
Test the overall TA (Paclitaxel and Anthracycline) treatment effect at the full significance level using the full data set; in other words, compare the TA-effect and the A-effect using the full data set.
Test the overall TA treatment effect at a reduced significance level using the full data set.
Randomly split the data set into training data and testing data: the training data set contains 70 patients from the A-group and 250 patients from the TA-group, while the testing data set contains 69 patients from the A-group and 250 patients from the TA-group. The training data set is used to develop classifiers. The classification methods used in this application are Logistic Regression, Random Forest, SVM, and XGBoost, with logistic regression serving as a benchmark reference.
The classification model built in step (3) is used to classify sensitive patients in the testing data. We define patients whose response rates are predicted to be greater than 0.5 as sensitive patients. The same number of patients as those classified as sensitive is randomly selected from the control group (A-group) and combined with the sensitive patients to form the new subgroup.
If the treatment effect is not significant in step (2) at the reduced significance level, then the treatment effect is tested for patients in the subgroup formed in step (4) at a second-stage significance level.
In this study, we pre-specify the overall type I error rate and its allocation between the two stages for simplicity.
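The two-stage testing logic of steps (1)-(5) can be sketched as follows, assuming a simple two-proportion z-test for the response rates (the paper does not spell out the test statistic here, and the function names and the example levels `alpha1`/`alpha2` are illustrative only):

```python
import numpy as np
from scipy import stats

def two_prop_pvalue(y, trt):
    """Two-sided two-proportion z-test of response rates, treatment vs control."""
    n1, n0 = (trt == 1).sum(), (trt == 0).sum()
    p1, p0 = y[trt == 1].mean(), y[trt == 0].mean()
    p = y.mean()                                # pooled response rate
    z = (p1 - p0) / np.sqrt(p * (1 - p) * (1 / n1 + 1 / n0))
    return 2 * stats.norm.sf(abs(z))

def asd_test(y, trt, sensitive, alpha1=0.04, alpha2=0.01):
    """Two-stage ASD test: overall comparison at alpha1; if it fails, test
    only the sensitive subgroup at alpha2 (alpha1 + alpha2 is the overall
    type I error budget)."""
    if two_prop_pvalue(y, trt) < alpha1:
        return "significant overall"
    if sensitive.sum() > 0 and two_prop_pvalue(y[sensitive], trt[sensitive]) < alpha2:
        return "significant in sensitive subgroup"
    return "not significant"
```

A trial with a weak overall effect but a strong effect in the sensitive subgroup is declared significant only through the second stage.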
It should be pointed out that our Alpha-Allocation between Stage I and Stage II was chosen for simplicity. In practice, the optimal Alpha-Allocation between Stage I and Stage II needs to be carefully studied. Chen et al. [5] investigated an adaptive informational design of confirmatory Phase III trials, in which they proposed an adaptive Alpha-Allocation between Stage I and Stage II and addressed the question of how to improve the chance of success of clinical trials. The motivating example in their study was a large Phase III study that failed only because the order of hypothesis testing for the two co-primary hypotheses (overall population and a biomarker subpopulation) was wrongly pre-specified. Chen et al. [5] considered a typical Phase III survival trial with two co-primary hypotheses, one in the overall population and the other in a subpopulation defined by a predictive biomarker. Generally, researchers would conclude that the trial is positive if the outcome of the primary endpoint is positive in either population, i.e. the treatment effect is significant in either the overall population or the subpopulation. Chen et al. [5] pointed out that a conservative approach to controlling the overall Type I error rate is to allocate half of the overall level to each hypothesis (i.e. the Bonferroni approach). However, the Bonferroni approach does not account for the correlation between the two test statistics of the hypotheses. In their paper, Chen et al. [5] proposed approaches adaptive to this correlation to improve the efficiency of clinical trials, which may be applied to a wide range of statistical issues encountered in the expedited development of personalized medicines.
4.2. An enriched adaptive signature design
After splitting the patients in a clinical trial into the training data set ‘trainDF’ and the testing data set ‘testDF’, the procedure for ASD proposed by Freidlin and Simon [14] consists of the following two steps:
Using patients from the training data set ‘trainDF’, fit a logistic model for each gene and identify genes that have a significant treatment-expression interaction effect.
Classify patients from the testing data set ‘testDF’ to be sensitive or non-sensitive to the treatment based on significant genes in step 1. A patient is designated sensitive if the predicted response rate exceeds a specified threshold, e.g. 0.5.
In our analysis, a modification of the ASD procedure will be used. The new procedure will be named as an enriched ASD, in which the biomarker identification and patient classification will be carried out as follows:
Using patients from the training data set ‘trainDF’, fit a logistic model for each gene, identify genes that have a significant treatment-expression interaction, and establish a model using each of four classification methods (Random Forest, SVM, XGBoost, and logistic regression) with all the significant genes identified. The model built with each of the methods is called a classifier.
Using the classifiers developed in Step 1, classify patients in the testing data set ‘testDF’ as sensitive or non-sensitive based on a given threshold (p = 0.5 in this study). For the balance of the design and the statistical test, if the number of sensitive patients identified in the treatment group is larger than that in the control group, patients from the control group are randomly selected so that the sample sizes of the two groups are the same.
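Step 2's thresholding and balancing rule can be sketched as follows (a simplified illustration; `form_subgroup` is our own name, and we draw the balancing controls directly from the control arm):

```python
import numpy as np

def form_subgroup(pred, trt, threshold=0.5, rng=None):
    """Take treated patients predicted sensitive (score > threshold), then
    randomly draw the same number of controls to balance the two arms.
    Returns the indices of the balanced sensitive subgroup."""
    if rng is None:
        rng = np.random.default_rng()
    sens_trt = np.flatnonzero((trt == 1) & (pred > threshold))
    controls = np.flatnonzero(trt == 0)
    k = min(len(sens_trt), len(controls))
    picked = rng.choice(controls, size=k, replace=False)
    return np.sort(np.concatenate([sens_trt, picked]))
```

The treatment effect test of the next step is then carried out on the returned indices only.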
To evaluate the performance of our modified ASD, we will conduct the following analysis:
In Stage I, perform the overall treatment effect test at the full significance level and at a reduced significance level.
In Stage II, test the treatment effect in the selected subset of sensitive patients at a second-stage significance level. To control the type I error rate of the full ASD procedure at the overall level, the two stage-wise significance levels are set to sum to the overall level.
4.3. Application results
The ASD assessment procedure for the A-group and TA-group described in Sections 4.1 and 4.2 was conducted for 500 runs for each of the four sample sizes. For a given sample size and a given data set formed from the sample, a patient with response Y = 1 is defined as a ‘Success’ patient and a patient with response Y = 0 as a ‘Failure’ patient. The sensitive and non-sensitive patients identified by the procedure are compared with the ‘Success’ and ‘Failure’ groups defined by the observed response variable. Sensitivity and specificity in the testing data set, based on 500 runs, are calculated for the four classification methods and the four sample sizes.
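The sensitivity and specificity computations against the observed Success/Failure groups are straightforward; a minimal sketch (the helper name is ours):

```python
import numpy as np

def sens_spec(pred_sensitive, success):
    """Compare classified sensitive/non-sensitive patients with the
    observed Success (Y = 1) / Failure (Y = 0) groups."""
    pred = np.asarray(pred_sensitive, dtype=bool)
    obs = np.asarray(success, dtype=bool)
    sensitivity = (pred & obs).sum() / obs.sum()        # TP / (TP + FN)
    specificity = (~pred & ~obs).sum() / (~obs).sum()   # TN / (TN + FP)
    return sensitivity, specificity
```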
Table 6 lists the AUC values provided by the four classification methods for the four sample sizes, with the highest value in each row marked in bold. From the results, we can see that the Random Forest method provided the highest AUC values under all four sampling schemes. The XGBoost method ranked second in terms of the AUC values.
Table 6.
AUC values of Patients classification by the four classification methods when the randomly selected numbers of patients from the TA-group are 200, 300, 400, and 500.
| Methods | Logistic regression | Random forest | XGBoost | SVM |
|---|---|---|---|---|
| N = 200 | ||||
| Area under the ROC curve | 0.807 | 0.850 | 0.835 | 0.837 |
| N = 300 | ||||
| Area under the ROC curve | 0.807 | 0.843 | 0.815 | 0.806 |
| N = 400 | ||||
| Area under the ROC curve | 0.806 | 0.845 | 0.812 | 0.809 |
| N = 500 | ||||
| Area under the ROC curve | 0.809 | 0.848 | 0.826 | 0.814 |
Figure 3 presents the empirical powers of ASD with the four classification methods based on 500 runs when the sample size from the TA-group is n = 500. The light blue bars represent the percentage (82.4%) of runs in which treatment TA was significant vs the control (A-group) at the 0.05 level, while the white bars show the percentage (63.6%) of runs significant at the reduced Stage I level. The dark blue bars show the improvement from the ASD procedure: if the treatment effect was not significant at the reduced level, a subgroup was identified and the treatment effect was tested within the subgroup at the second-stage significance level. The ASD procedure clearly improves the empirical power of the test. Specifically, the empirical powers of XGBoost, Random Forest, and Logistic Regression using the ASD procedure are all over 95%, and the empirical power of SVM is slightly lower at 94.8%. XGBoost gave the highest improvement percentage at 34.4%, and Random Forest had the second-highest improvement percentage at 32.8%. It should be pointed out that the Logistic Regression method also provided a very good improvement percentage (32.2%) in this example, which shows why this traditional classification method is so popular in practice. The green stars in the figure show the areas under the ROC curves given by the four classification methods.
Figure 3.
Empirical powers of ASD by the 4 classification methods when n = 500.
Similar to Figure 3, Figure 4 presents the empirical powers of ASD with the four classification methods based on 500 runs when the sample size from the TA-group is n = 400. With fewer patients selected from the TA-group (n = 400 instead of n = 500), the percentage of runs in which the TA treatment effect was significant using the full data set drops from about 82.4% to 60.6%. Again, the ASD procedure clearly improves the empirical power of the test. Specifically, the improvement percentages of the empirical powers are 46.8% for XGBoost, 43.6% for Random Forest, 43% for Logistic Regression, and 33.4% for SVM.
Figure 4.
Empirical powers of ASD by the 4 classification methods when n = 400.
Figure 5 presents the empirical powers of ASD with the four classification methods based on 500 runs when the sample size from the TA-group is n = 300. In this case, the percentage of runs in which the TA treatment effect was significant using the full data set at the 0.05 significance level is only about 24%, much lower than the 82.4% for sample size n = 500 and the 60.6% for sample size n = 400. Again, the ASD procedure improves the empirical powers of the tests substantially, with a 62% improvement for XGBoost, 54.2% for Random Forest, 53.6% for Logistic Regression, and 44.4% for SVM. This example shows that when the difference between a treatment and the control group is hard to detect by the ‘unmodified ASD’, the enriched ASD procedure performs extremely well.
Figure 5.
Empirical powers of ASD by the 4 classification methods when n = 300.
Figure 6 presents the empirical powers of ASD with the four classification methods based on 500 runs when the sample size from the TA-group is n = 200. In this case, the percentage of runs in which the TA treatment effect was significant using the full data set at the 0.05 significance level was down to zero. However, with the ASD procedure, the empirical powers of the four classification methods are 33.6% for XGBoost, 27.4% for Random Forest, 22.6% for Logistic Regression, and 16.2% for SVM, still a very substantial improvement over the traditional procedure.
Figure 6.
Empirical powers of ASD by the 4 classification methods when n = 200.
Based on the results of comparing the anthracycline only (A-group) and anthracycline plus paclitaxel (TA-group) in the breast cancer study, we can see that the enriched ASD procedure is much more effective in detecting treatment effects than regular statistical testing methods. The enriched ASD procedure was also applied to another two-treatment combined data set in this study, the Paclitaxel and Anthracycline (TA-group) vs the Docetaxel and Anthracycline (TxA-group); the results are similar to those for the A-group vs the TA-group and thus are not presented in this paper.
5. Summary and discussion
In this article, we compared the performance of four methods for biomarker identification and six methods for patient classification in adaptive designs, including the ASD, based on simulations. The selected methods and the enriched ASD procedure were then applied to a real data set for detecting treatment effects in a breast cancer study. Similar to the results presented in Freidlin and Simon [14], we have shown that when the proportion of patients sensitive to the new drug is relatively low, the adaptive design substantially reduces the chance of falsely rejecting effective new treatments.
The methods and procedures studied in this article can be extended and applied to different types of data in many fields. In our simulation study, we assumed a binary response and a logistic model. In the real world, however, the types of covariates, responses, and models can vary. We may also consider time-to-event responses, such as the survival time of patients after treatment or the time to recurrence of an event. For such responses, the Cox Proportional Hazards Model, or a general hazard rate model that accommodates time-varying covariates and time-dependent effects, could be used in ASD and subgroup identification.
In our enriched ASD design discussed in Sections 2 and 4, we use biomarker identification methods to search the covariate space of genomic data and use some of the most recent machine learning classification methods to identify subgroups on which treatment effect analysis is performed. Several exploratory approaches for simultaneously performing subgroup identification and evaluating treatment effects have been reported in recent years; they utilize tree-based methods such as the basic CART algorithm or random forest. These methods include the Virtual Twins method proposed by Foster et al. [13], the subgroup identification based on differential effect search (SIDES) method introduced by Lipkovich et al. [24], and the method of mining data to find subsets of high activity (ARF) by Amaratunga and Cabrera [10]. They take a more localized partition of the covariate space and focus on identifying interesting areas of that space, such as areas defined by thresholds on a few covariates, where the treatment effect is likely to be large.
For example, the subgroup identification based on differential effect search (SIDES) method of Lipkovich et al. [24] aims at finding a specific covariate subspace associated with patients who have a higher treatment effect than those in the complementary subset. The elements of the SIDES method include a splitting criterion, a continuation criterion, and a selection criterion. The splitting criterion decides which split is optimal among all possible splits of one covariate; the continuation criterion evaluates whether a further split of that covariate is necessary; and the selection criterion defines the rules for selecting promising subgroups among all the resulting child subgroups.
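A toy version of such a splitting criterion might look like the following sketch, where the criterion is the absolute standardized difference between the treatment effects in the two child subgroups (this is a simplification of the actual SIDES criteria of Lipkovich et al. [24]; all names here are ours):

```python
import numpy as np

def diff_effect_split(y, trt, x, cutpoints):
    """Differential-effect splitting (simplified): pick the cutoff on
    covariate x maximizing the standardized difference between the
    treatment effects in the two child subgroups."""
    def effect_z(mask):
        # z statistic for the treatment-vs-control response difference
        y1, y0 = y[mask & (trt == 1)], y[mask & (trt == 0)]
        if len(y1) < 2 or len(y0) < 2:
            return 0.0
        d = y1.mean() - y0.mean()
        se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
        return d / se if se > 0 else 0.0

    best_c, best_crit = None, -np.inf
    for c in cutpoints:
        left = x <= c
        crit = abs(effect_z(left) - effect_z(~left))
        if crit > best_crit:
            best_c, best_crit = c, crit
    return best_c, best_crit
```

Applied recursively with continuation and selection rules, this kind of criterion carves out subgroups where the treatment effect concentrates.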
In future research, we plan to study these covariate-localized partitioning methods, such as SIDES, ARF (Activity Region Finder), and Virtual Twins, more carefully. Their performance will be compared with that of the traditional subgroup identification approaches evaluated in this article, including the machine learning methods SVM, Random Forest, and XGBoost. Both the traditional subgroup identification methods and the covariate-localized methods will be applied to real data analysis.
Acknowledgments
The authors would like to thank the Editor, the associate editor, and the two referees for their insightful comments and detailed suggestions that led to a much improved version of this paper.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- 1.Barrett T., Wilhite S.E., Ledoux P., Evangelista C., Kim I.F., Tomashevsky M., Marshall K.A., Phillippy K.H., Sherman P.M., Holko M., Yefanov A., Lee H., Zhang N., Robertson C.L., Serova N., Davis S., and Soboleva A., NCBI GEO: Archive for functional genomics data sets update, Nucl. Acids Res. 41 (2013), pp. 991–995. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Bartlett P.L., The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network, IEEE Trans. Inf. Theory 44 (1998), pp. 525–536. [Google Scholar]
- 3.Breiman L., Pasting bites together for prediction in large data sets and on-line, Tech. Rep., Dept. Statistics, Univ. California, Berkeley, 1997.
- 4.Breiman L., Random forests, Mach. Learn. 45 (2001), pp. 5–32. [Google Scholar]
- 5.Chen C., Li N., and Shentua Y., Adaptive informational design of confirmatory phase III trials with an uncertain biomarker effect to improve the probability of success, Stat. Biopharm. Res. 8 (2016), pp. 237–247. [Google Scholar]
- 6.Chen T. and Guestrin C., Xgboost: A Scalable Tree Boosting System, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Association for Computing Machinery, New York, NY, 2016, pp. 785–794.
- 7.Chow S.C. and Chang M., Adaptive design methods in clinical trials: A review. Orphanet J. Rare Dis. 3 (2008), p. 11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Chow S.C., Chang M., and Pong A., Statistical consideration of adaptive methods in clinical development, J. Biopharm. Stat. 15 (2005), pp. 575–591. [DOI] [PubMed] [Google Scholar]
- 9.Cortes C. and Vapnik V., Support-vector networks, Mach. Learn. 20 (1995), pp. 273–297. [Google Scholar]
- 10.Amaratunga D. and Cabrera J., Mining data to find subsets of high activity, J. Stat. Plan. Inference 122 (2004), pp. 23–41. [Google Scholar]
- 11.Efron B., Forcing a sequential experiment to be balanced, Biometrika 58 (1971), pp. 403–417. [Google Scholar]
- 12.Ezkurdia I., Juan D., Rodriguez J.M., Frankish A., Diekhans M., Harrow J., Vazquez J., Valencia A., and Tress M.L., Multiple evidence strands suggest that there may be as few as 19,000 human protein-coding genes, Hum. Mol. Genet. 23 (2014), pp. 5866–5878. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Foster J.C., Taylor J.M.G., and Ruberg S.J., Subgroup identification from randomized clinical trial data, Stat. Med. 30 (2011), pp. 2867–2880. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Freidlin B. and Simon R., Adaptive signature design: An adaptive clinical trial design for generating and prospectively testing a gene expression signature for sensitive patients, Clin. Cancer Res. 11 (2005), pp. 7872–7878. [DOI] [PubMed] [Google Scholar]
- 15.Freund Y. and Schapire R.E., A decision-theoretic generalization of online learning and an application to boosting, J. Comput. Syst. Sci. 55 (1997), pp. 119–139. [Google Scholar]
- 16.Friedman J.H., Greedy function approximation: A gradient boosting machine, Ann. Stat. 29 (2001), pp. 1189–1232. [Google Scholar]
- 17.Friedman J.H., Stochastic gradient boosting, Comput. Stat. Data Anal. 38 (2002), pp. 367–378. [Google Scholar]
- 18.Friedman J.H., Hastie T., and Tibshirani R., Additive logistic regression: A statistical view of boosting, Ann. Stat. 28 (2000), pp. 337–374. [Google Scholar]
- 19.Gordon Lan K.K. and DeMets D.L., Group sequential procedures: Calendar versus information time, Stat. Med. 8 (1989), pp. 1191–1198. [DOI] [PubMed] [Google Scholar]
- 20.Irizarry R.A., Hobbs B., and Collin F., Exploration, normalization, and summaries of high density oligonucleotide array probe level data, Biostatistics 4 (2003), pp. 249–264. [DOI] [PubMed] [Google Scholar]
- 21.Jennison C. and Turnbull B.W., Group Sequential Methods with Applications to Clinical Trials, Chapman and Hall/CRC, New York, NY, 1999. [Google Scholar]
- 22.Lee J.W., Lee J.B., Park M., and Song S.H., An extensive comparison of recent classification tools applied to microarray data, Comput. Stat. Data Anal. 48 (2005), pp. 869–885. [Google Scholar]
- 23.Liaw A. and Wiener M., Classification and regression by randomForest, R News 2 (2002), pp. 18–22. [Google Scholar]
- 24.Lipkovich I., Dmitrienko A., and Denne J., Subgroup identification based on differential effect search-a recursive partitioning method for establishing response to treatment in patient subpopulations, Stat. Med. 30 (2011), pp. 2601–2621. [DOI] [PubMed] [Google Scholar]
- 25.Liu Q., Proschan M., and Pledger G.W., A unified theory of two-stage adaptive designs, J. Am. Stat. Assoc. 97 (2002), pp. 1034–1041. [Google Scholar]
- 26.Posch M. and Bauer P., Adaptive two-stage designs and the conditional error function, Biom. J. 41 (1999), pp. 689–696. [Google Scholar]
- 27.Rosner B., On the detection of many outliers, Technometrics 17 (1975), pp. 221–227. [Google Scholar]
- 28.Shawe-Taylor J., Bartlett P.L., Williamson R.C., and Anthony M., Structural risk minimization over data-dependent hierarchies, IEEE Trans. Inf. Theory 44 (1998), pp. 1926–1940. [Google Scholar]
- 29.Shawe-Taylor J. and Cristianini N., Margin distribution and soft margin, in Advances in Large Margin Classifiers, A.J. Smola et al., eds., The MIT Press, Cambridge, MA, 2000, pp. 349–358.
- 30.Troyanskaya O.G., Garber M.E., Brown P.O., Botstein D., and Altman R.B., Nonparametric methods for identifying differentially expressed genes in microarray data, Bioinformatics 18 (2002), pp. 1454–1461. [DOI] [PubMed] [Google Scholar]
- 31.Vapnik V. and Lerner A., Pattern recognition using generalized portrait method, Autom. Remote Control 24 (1963), pp. 774–780. [Google Scholar]
- 32.Vapnik V.N., The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995. [Google Scholar]
- 33.Vapnik V.N., Statistical Learning Theory, Wiley-Interscience, New York, 1998. [Google Scholar]
- 34.Wei L.J., The adaptive biased-coin design for sequential experiments, Ann. Stat. 6 (1978), pp. 92–100. [Google Scholar]
- 35.Yu K., Sang Q., Lung P., Tan W., Lively T., Sheffield C., Bou-Dargham M., Liu J., and Zhang J., Personalized chemotherapy selection for breast cancer using gene expression profiles, Sci. Rep. 7 (2017), 43294. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Zou H. and Hastie T., Regularization and variable selection via the elastic net, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67 (2005), pp. 301–320. [Google Scholar]