Abstract
For many diseases and disorders, such as Alzheimer’s disease, patients demonstrate considerable heterogeneity in their responses to treatment interventions. One treatment may be most effective for some patients, while another may be most effective for others, and neither may be effective for another subset of patients. This potentially renders the conventional parallel group design highly inefficient. An attractive alternative is the N-of-1 design, also called the multi-crossover randomized controlled trial. In this design, each participant serves as their own control in a series of randomized blocks of treatment assignments. We propose novel designs for both the single-person and multi-person N-of-1 trials that employ sequential monitoring. In particular, we allow for early stopping for a single participant as soon as there is sufficient evidence of a preferred treatment for them, and early stopping for the group of participants as soon as there is sufficient evidence of a preferred treatment for the population of patients. We provide sample size calculations and decision rules for terminating the trial early, and illustrate their properties in simulation studies. We apply our proposed methods to N-of-1 studies of brain tumor excisions and of methylphenidate in mild cognitive impairment.
Keywords: Interim analysis, Cross-over trial, Personalized medicine
1 |. INTRODUCTION
Current medical treatment guidelines depend heavily on randomized controlled trials, which yield the highest level of evidence about population-average effects of therapeutic interventions [1]. However, the best treatment, on average, for a population of individuals may, or may not, coincide with the best treatment for a particular individual [2]. In a study of olanzapine for patients with schizophrenia, it was found that some patients may experience marked improvement while others may not, or may even have worsening symptoms [3]. Furthermore, heterogeneous disease pathways among patients are a major obstacle in clinical trials as they introduce considerable variability and often require very large sample sizes. This feature often leads to strict eligibility requirements, which limit the generalizability of the findings and may exclude subgroups for which there is efficacy [4]. Better understanding of the efficacy of new treatments and variables that affect individual patient responses to these treatments is required to make progress toward the goal of personalized medicine, with which health outcomes for particular individuals are optimized [5, 6, 7].
An alternative design that overcomes some of these challenges of patient heterogeneity is the “N-of-1” trial, also called the multi-crossover trial. N-of-1 trials can play an important role in advancing patient-centered outcome research and have drawn recent attention in the literature [4, 8, 5], though they date back to the 1950s [9]. The N-of-1 trial is a rigorous blinded trial that randomizes treatments within multiple cycles. Typically two treatments are applied within each cycle, and a washout period is included between cycles. These trials are most useful for drugs that provide symptomatic relief for chronic conditions, or for slow-progressing degenerative conditions. N-of-1 trials have been conducted in many different clinical settings, including Alzheimer’s [10], cancer [11, 12], fibromyalgia [13] and chronic pain [14, 15].
The need for rigorous statistical methods for the analysis of N-of-1 trials is well-established in the literature. These include several papers on the analysis of N-of-1 trials in the context of mixed effects linear models. [16] discussed and compared four models and recommended the paired t-test when the outcome is normally distributed. [17] emphasized the difference between fixed and mixed effects linear models and argued that the choice between them depends on the purpose of the analysis. For the case of more than one patient in the trial, [18] presented a systematic review of analyzing the separate N-of-1 trials using meta-analysis and [19] presented a hierarchical Bayesian random effects model for inference. A more general overview of the essential design elements such as washout period and analysis procedures can be found in a published user’s guide [20].
Less attention has been paid to the design of the N-of-1 trial in the literature. The typical sample size calculation is based on a target treatment effect for the population and a pre-specified number of cycles [21]. Adaptive designs and interim monitoring have not been developed in this setting.
In this paper we propose two substantial extensions to the standard N-of-1 design that incorporate interim monitoring and possible early stopping. In particular, we consider the O’Brien-Fleming stopping rule [22] and repeated confidence interval based monitoring [23]. This offers efficiency to the trial design in that it enables early stopping of individuals’ trials as soon as they present sufficiently strong evidence that one treatment is more effective than the other, or that both are equivalent. We also allow for early curtailment [24] of the multi-N-of-1 trial, i.e., for all individuals, as soon as there is sufficient evidence of a population-level treatment benefit or equivalence. We consider simultaneous use of both individual stopping rules and the group stopping rule, though in some settings one or the other might be preferred. The multi-person trial is useful to gain knowledge at the population level and it leverages efficiencies at the single-person level. Once this knowledge is established, it may remain important to optimize treatment for individuals, via the N-of-1 trial. We present stopping rules that acknowledge both superiority and equivalence as possible outcomes. We derive accompanying calculations for the requisite number of cycles and individuals for sufficient power, subject to control of the set of possible errors. While sequential monitoring is now a common feature for large parallel group clinical trials, it has not been applied to N-of-1 trials or to multi-N-of-1 trials.
In Section 2 we define the notation and underlying model. In Section 3 we introduce stopping rules to the N-of-1 trial and discuss issues of design. In Section 4 we introduce a stopping rule for the multi-N-of-1 trial and detail the associated considerations and sample size calculations. In Section 5 we evaluate our proposed N-of-1 and multi-N-of-1 designs using estimates from a brain tumor trial and an Alzheimer’s disease trial, and demonstrate the potential savings that they offer for individuals and for populations. We conclude in Section 6.
2 |. NOTATION AND MODEL
We present notation for the typical N-of-1 trial that compares two treatments (A and B). Such a trial consists of a series of cycles for each individual, with each cycle containing a number of occasions, each of which involves treatment with A or B. We consider the most common case of two occasions per cycle: one with treatment A and one with treatment B (Figure 1).
FIGURE 1.

Schematic representation of a single-person N-of-1 trial. The allocation of treatment within each cycle is random, .
We acknowledge that there may be individual-specific treatment effects, and that there is individual-specific random variation around the treatment effects. In particular, we assume that each individual has an average random treatment effect denoted by , where is multinomial taking values () with probabilities (). In particular, when treatment is better than by units for individual , when the treatments are equivalent for individual , and when is better than by units for individual . Further, we assume that each individual’s treatment effect incorporates random variation around these mean treatment effects and let denote the realized treatment effect for individual , where . It would be natural to extend this framework by modeling the multinomial probabilities () as functions of individual-level covariates [25].
For continuous and approximately normally distributed outcomes, we consider the linear model of the form similar to those proposed by [17] and [21],
| (1) |
where is the outcome for individual , cycle and occasion , , , . This is comprised of an individual-specific outcome, , a cycle -specific outcome for individual , , an individual-specific treatment effect of for treatment and for treatment (i.e., if treatment is allocated to individual , in cycle on occasion , and if is allocated), and an independent and identically distributed error term, . We assume mutual independence of , and . For analysis in this setting we construct paired differences of treatment minus treatment that reflect the within-cycle, within-individual treatment effect of versus , i.e., when is allocated in occasion 1,
| (2) |
where .
3 |. SINGLE-PERSON TRIAL
In the case of the single-person trial, we take so that is fixed at and equation (2) reduces to
| (3) |
Suppose there is a maximum of interim analyses conducted for the individual. We let denote the accumulated number of cycles up to the th interim analysis, , . The mean difference between the two treatments through the kth interim is . Note that . We let () denote the lower and upper confidence limits for at the kth interim analysis. Specifically, and , where denotes the critical value that is adjusted for the interim analyses. We will use the O’Brien-Fleming stopping rule to calculate the critical values, .
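The interim confidence limits described above can be sketched as follows. This is a minimal illustration that assumes the limits take the usual form of the running mean of the paired differences plus or minus an adjusted critical value times its estimated standard error; the function name `interim_confidence_limits` and its arguments are hypothetical and are not taken from the authors' software.

```python
from math import sqrt
from statistics import mean, stdev


def interim_confidence_limits(diffs, critical_value):
    """Return (lower, upper) limits for the mean within-cycle paired difference.

    diffs: paired differences observed so far, one per completed cycle.
    critical_value: critical value already adjusted for interim monitoring
                    (assumed to be supplied by the design stage).
    """
    n = len(diffs)
    if n < 2:
        # With one cycle the standard deviation cannot be estimated.
        raise ValueError("at least two cycles are needed")
    d_bar = mean(diffs)
    std_err = stdev(diffs) / sqrt(n)  # sample SD with n - 1 denominator
    return d_bar - critical_value * std_err, d_bar + critical_value * std_err
```

For example, `interim_confidence_limits([4.0, 6.0, 8.0], 2.0)` returns limits centered at 6.0. The requirement of at least two cycles mirrors the remark in the text that an analysis after a single cycle is not possible when the variance is unknown.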
Let denote the clinically relevant difference between the two treatments, i.e., the minimum absolute difference between and that is meaningful. Let denote the equivalence margin, i.e., the maximum difference between and for which they are considered clinically equivalent. We would like to stop the trial for an individual as soon as we have compelling evidence that either , or . To this end, we propose a set of decision rules that are applied via sequential monitoring for the single-person trial, which allows for early stopping for efficacy of , for efficacy of or for equivalence of and . Additionally, we allow for early stopping for futility, i.e., when there is a sufficiently small probability of a finding of efficacy or equivalence at the th analysis conditional on the current data.
We propose the following stopping rule. At interim analysis ,
1. STOP if and declare equivalence of and
2. Else STOP if and declare superiority of over
3. Else STOP if and declare superiority of over
4. Else STOP for futility if for some pre-specified
5. Else RETURN to step 1 if
6. Else declare the trial INCONCLUSIVE
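The ordering of these decision rules can be sketched in code. The specific inequalities below are assumptions chosen for illustration, since the exact conditions are not reproduced here: equivalence is declared when the whole interval lies inside the equivalence margin, superiority when the interval excludes zero, and futility when a separately computed conditional probability of a conclusive finding falls below a threshold. The paired difference is assumed to be B minus A, and all names (`interim_decision`, `margin`, `conditional_power`, `futility_threshold`) are hypothetical.

```python
def interim_decision(lower, upper, margin, k, max_analyses,
                     conditional_power=None, futility_threshold=0.0):
    """Apply the six-step rule at interim analysis k (illustrative conditions).

    lower, upper: confidence limits for the mean paired difference (B minus A).
    margin: equivalence margin (assumed symmetric about zero).
    conditional_power: optional probability of a conclusive finding at the
                       final analysis given current data, computed elsewhere.
    """
    if -margin < lower and upper < margin:
        return "equivalence"      # step 1
    if lower > 0:
        return "B superior"       # step 2 (assumed condition)
    if upper < 0:
        return "A superior"       # step 3 (assumed condition)
    if conditional_power is not None and conditional_power < futility_threshold:
        return "futility"         # step 4
    if k < max_analyses:
        return "continue"         # step 5: return to step 1 at the next analysis
    return "inconclusive"         # step 6
```

Because the checks are evaluated in order, an interval that is both inside the margin and above zero is classified as equivalence, which is what the "Else" chaining in the rule implies.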
These decision rules are evaluated at each interim analysis. For the sequel we assume that interim analyses are conducted after each cycle, so that . In fact, we might conduct the first analysis after cycles, so that , to preserve some of the type I error and due to instability of estimates with sparse data, i.e., early in the trial. The number of cycles () and the critical values are derived to attain certain operating characteristics for the trial. Without interim analyses, the type I error, , is the probability that superiority is declared after cycles when equivalence holds, i.e., . This is given by
| (4) |
| (5) |
where is the critical value at the final analysis that yields an overall type I error of . This elucidates that for , the critical values at the interim analyses, , are well-approximated by those from an O’Brien-Fleming stopping rule for a one-sided stopping boundary with type I error of and a null hypothesis value of zero. We can obtain these from standard software, such as the ldBounds function in the ldbounds package in R, along with the input of the maximum number of cycles. Specifically, for a maximum of cycles (determined via a power calculation, as given below), and acknowledging that we cannot conduct an analysis after one cycle with unknown , the , are given by
For ease of analysis, we may extract from this vector of bounds the final bound, and approximate each as [22].
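As a sketch of this approximation, the snippet below assumes the classical O'Brien-Fleming shape, in which the critical value at analysis k equals the final critical value multiplied by the square root of (maximum number of cycles / k). The final critical value is assumed to come from group-sequential software; the function name and the default of starting at the second cycle are illustrative choices.

```python
from math import sqrt


def obf_interim_critical_values(final_critical_value, max_cycles, first_analysis=2):
    """Approximate O'Brien-Fleming-type critical values for each analysis.

    Returns a dict mapping the cycle number k to its critical value, assuming
    c_k = c_K * sqrt(K / k), with analyses after every cycle from
    first_analysis through max_cycles.
    """
    if not 1 <= first_analysis <= max_cycles:
        raise ValueError("first_analysis must lie between 1 and max_cycles")
    return {
        k: final_critical_value * sqrt(max_cycles / k)
        for k in range(first_analysis, max_cycles + 1)
    }
```

The boundaries are large early in the trial and shrink toward the final critical value, so very strong evidence is needed to stop after only a few cycles.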
The power of the N-of-1 trial, , is the probability of correctly identifying superiority when it exists, i.e., for some such that . This is given by
Setting this to the desired power of , we calculate the number of cycles, , from a standard O’Brien-Fleming design for a two-group, one-sided, level test with alternative value of . For the purpose of the derivation of the design parameters the test is one-sided, as clarified by the type I error calculation (4). As an example, for the N-of-1 MPH trial [26] with , , , , we calculate that we require cycles to obtain power.
We are additionally interested in the probability of correctly declaring equivalence when it holds with . This is given by
Thus, to detect equivalence with probability , we calculate the number of cycles, , required for power for a one-sided level sequential test with alternative . We refer to as the type III error. For example, with , , and and one-sided type I error of 0.05, we obtain 90% power with cycles using the O’Brien-Fleming design (i.e., ). In general, when the detection of superiority and of equivalence are both goals of the trial, we would take the maximum of the numbers of cycles required for the two goals.
At each interim analysis we will additionally test for futility, as outlined in step 4 above. Specifically, we will stop for futility if the following three conditions hold:
In summary, we have derived the requisite design steps for a sequential N-of-1 trial for a trial with cycles and 2 occasions, with a clinically meaningful mean treatment difference of and equivalence margin of , and with normally distributed outcomes as in (1), with overall type I error (declaring superiority when equivalence holds) of , with type II error (not declaring superiority when it holds at a selected ) of , and with type III error (not declaring equivalence when it holds at 0) of . These are:
1. Calculate , the number of interim analyses (i.e., cycles) required to obtain power of to detect an alternative of using a one-sided test of level .
2. Calculate the corresponding O’Brien-Fleming critical value at the final analysis, .
3. Calculate the critical values at each interim analysis as .
4. Calculate , the number of interim analyses required to obtain power of to detect an alternative of using a one-sided test of level .
5. Calculate the corresponding O’Brien-Fleming critical values.
6. If it is important to detect equivalence between and if it exists, plan for a maximum of cycles. If it is not important to detect equivalence if it exists, plan for a maximum of cycles.
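A rough cross-check on the cycle calculations in the steps above can be obtained from the textbook fixed-sample normal approximation. This is a simplification and not the authors' procedure: it ignores the inflation that O'Brien-Fleming monitoring requires and the use of an estimated variance, so group-sequential software should be used for an actual design. The names `approx_cycles` and `sigma_d` (the standard deviation of the within-cycle paired difference) are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def approx_cycles(sigma_d, alternative, alpha, power):
    """Fixed-sample approximation to the number of cycles for a one-sided test.

    sigma_d: SD of the within-cycle paired difference (assumed known).
    alternative: mean difference at which the stated power is required.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    return ceil(((z_alpha + z_beta) * sigma_d / alternative) ** 2)


def planned_max_cycles(cycles_superiority, cycles_equivalence, need_equivalence):
    """Final step: take the larger requirement only if equivalence matters."""
    if need_equivalence:
        return max(cycles_superiority, cycles_equivalence)
    return cycles_superiority
```

With `sigma_d` equal to the alternative, a one-sided level of 0.05 and 80% power, the approximation gives 7 cycles; smaller alternatives, such as a narrow equivalence margin, drive the requirement up quadratically.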
4 |. MULTI-N-OF-1 TRIAL
A multi-N-of-1 trial is the conglomeration of multiple N-of-1 trials. It involves testing of each individual via the N-of-1 trial, in conjunction with ongoing testing of accumulating N-of-1 trials for the population-level treatment effect. The multi-N-of-1 trial enables investigators to learn about the population response to a new treatment, while optimizing treatment selection for each individual. We thus propose tests of proportions (i.e., of individuals for whom , and ), with type I error and power of for specified alternative and null values of the population-level proportions.
The population-level parameters of interest are (), the probabilities of superiority of over , equivalence of and , and superiority of over , respectively. We let denote the probabilities specified by the null hypothesis and denote those specified by the alternative hypothesis. The context will dictate the particular test of interest for a given multi-N-of-1 trial; there are several possibilities. It may be of primary interest to determine if the two treatments are equivalent for a minimum percentage of the population. This translates into the test versus . Alternatively, it may be of primary interest to determine if either treatment is superior to the other for a minimum percentage of the population, i.e., vs . We focus on the scenario in which it is of primary interest to detect whether one treatment is superior to the other, i.e., versus , powered at the alternative value of .
In some instances there may be a desire to stop the trial as soon as a population-level finding is established. For this purpose, we propose an information-based population-level stopping rule. The data for this are the individual-level conclusions based on the constituent N-of-1 trials. Each individual contributes a binary outcome corresponding to the test of interest (i.e., three possible binary outcomes are declared superior to (yes/no), declared equivalent to (yes/no), or declared superior to (yes/no)). Only a certain proportion of N-of-1 trials embedded in the multi-N-of-1 trial will arrive at a definitive conclusion (i.e., equivalence or superiority). For example, if we design the constituent N-of-1 trials to have power of to detect equivalence, of the N-of-1 trials in which there is equivalence will come to a conclusive finding.
Letting denote the translation of the alternative probabilities to the subpopulation of individuals for whom there is a conclusive finding, the proportion that exhibit equivalence of B and A under the alternative hypothesis, , is:
| (6) |
Similarly, among conclusive trials, the proportion that exhibit superiority of over under the alternative hypothesis, , is:
| (7) |
Likewise, define to be the translation of the null probability of interest to the subpopulation of individuals for whom there is a conclusive finding, e.g.,
| (8) |
The overall probability of a conclusive finding under the alternative, , is given by
and analogously under the null with () replacing ().
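The translation to the subpopulation with conclusive findings can be sketched as a simple reweighting. The sketch assumes that each population probability is multiplied by the probability that an individual in that state reaches a conclusive finding and that the products are then normalized by their sum, which is the overall probability of a conclusive finding. The function name and the numeric inputs are hypothetical illustrations, not values from the trials.

```python
def conclusive_translation(p, conclusive_prob):
    """Re-express population probabilities among conclusive N-of-1 trials.

    p: (p1, p2, p3), population probabilities of the three true states.
    conclusive_prob: for each state, the assumed probability that an
                     individual's N-of-1 trial ends with a conclusive finding.
    Returns (translated probabilities, overall probability of a conclusive finding).
    """
    weights = [p_i * c_i for p_i, c_i in zip(p, conclusive_prob)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("no conclusive findings are possible with these inputs")
    return [w / total for w in weights], total


# Hypothetical inputs: 80% power for superiority, 10% conclusive under equivalence.
translated, prob_conclusive = conclusive_translation((0.2, 0.2, 0.6), (0.8, 0.1, 0.8))
```

When the individual trials have little power to detect equivalence, individuals with equivalent treatments are under-represented among conclusive trials, so the translated probabilities of superiority are larger than the population values.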
The most frequent interim analyses would occur at each time at which an individual completes a cycle. An alternative schedule would be at each time that an individual’s trial is complete. If the multi-N-of-1 trial stops early with a conclusive finding at the population level, accrual of additional patients would stop and future patients would be started on the treatment found most favorable for the population. If superiority is found, then the superior treatment would be selected. If equivalence is found, then treatment would be selected according to side effect profile and cost. Patients who have not yet completed their N-of-1 trials at the time that the multi-N-of-1 trial is stopped may or may not be continued to their individual completions. Given the investment made by these patients in the trial, this approach may be taken to provide them with the best individualized treatment recommendation. This would make the most sense if equivalence is concluded for the population.
Consider the population level test of versus . We seek the sample size for the multi-N-of-1 trial, , and the decision threshold, , that yield the desired type I and type II errors, and , relative to the null value in the sample of and the alternative value in the sample of . Jointly, these parameters are the solution to
| (9) |
is the number of individuals for whom there is a conclusive finding (i.e., of superiority or equivalence) from their N-of-1 trial. For the design of the multi-N-of-1 trial we require (conclusive) individuals total, to acknowledge that not all individuals attain conclusive findings from their individual trials and thus would not be included in the multi-N-of-1 trial. As soon as there are individuals for whom the conclusive finding is , we stop the multi-N-of-1 trial and declare superior to at the population level. Analogous designs can be derived for other population-level hypothesis tests that may be of interest.
The probabilities of a conclusive individual finding under both the null and the alternative are important inputs to the design of the multi-N-of-1 trial. An estimate of (conclusive) can be obtained from historical N-of-1 trial results or from a simulation of the N-of-1 trial under the possible outcomes (i.e., , and ) at their specified mean values (e.g., ).
We summarize the requisite steps for the multi-N-of-1 test of superiority of treatment over treatment , i.e., versus at a specific value of the alternative, :
1. Estimate the number of cycles , given the number of interim analyses , as described in Section 3 with pre-specified type I error and power for the hypothesis of interest.
2. Obtain estimates of and from pilot data or from a simulation study of N-of-1 trials run under the specified values of , , and and and .
3. Calculate the number of N-of-1 trials, , and the threshold for rejecting the null, , for those conclusive N-of-1 trials to satisfy overall type I error of and power .
4. Calculate the corresponding final sample size (conclusive).
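The last two steps can be illustrated with an exact binomial search. This is one plausible reading of the population-level design, stated as an assumption: it finds the smallest number of conclusive trials and the smallest threshold such that the binomial tail probability is at most the type I error under the null and at least the desired power under the alternative, and then inflates for inconclusive trials. It is not guaranteed to reproduce the values reported in the tables, and the function names are hypothetical.

```python
from math import ceil, comb


def binom_upper_tail(n, c, p):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(c, n + 1))


def design_population_test(p_null, p_alt, alpha, power, max_n=500):
    """Smallest (n_conclusive, threshold) meeting both error constraints."""
    for n in range(1, max_n + 1):
        for c in range(1, n + 2):
            if binom_upper_tail(n, c, p_null) <= alpha:
                # c is the smallest threshold controlling the type I error.
                if binom_upper_tail(n, c, p_alt) >= power:
                    return n, c
                break  # larger thresholds only lose power; try a larger n
    return None


def total_individuals(n_conclusive, prob_conclusive):
    """Inflate the conclusive sample size for trials that end inconclusively."""
    return ceil(n_conclusive / prob_conclusive)
```

For instance, with a null proportion of 0.5, an alternative of 0.9, a type I error of 0.05 and 80% power, the search returns 8 conclusive trials with a threshold of 7. If the probability of a conclusive finding were 0.6, then 7 conclusive trials would require 12 enrolled individuals.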
5 |. EXAMPLES AND SIMULATIONS
In this section we design and evaluate in simulation two N-of-1 studies: one in brain tumor excision and one in early-stage Alzheimer’s disease. We base the N-of-1 and multi-N-of-1 designs on pilot data from these small studies. We conduct simulation studies to verify our proposed design procedures from Sections 3 and 4. We measure the efficiency of these trials as the early stopping afforded by the interim monitoring, both within individuals and among individuals. We provide software in the form of R code on our GitHub repository (https://github.com/jj113/nof1).
5.1 |. Motor Recovery following Brain Tumor Excision
A single-patient N-of-1 trial of levodopa/carbidopa (LD/CD) versus placebo for motor recovery in a patient with residual hemiparesis secondary to removal of a benign oligoastrocytoma was conducted [27]. The patient was treated in an outpatient physiatry practice at an academic center. Motor recovery was assessed using the Fugl-Meyer Assessment (FMA) score. The rationale for conducting an N-of-1 trial for this patient was to obtain an objective assessment of the efficacy of LD/CD to potentially justify its long-term use for motor improvement. The duration of the double-blinded study was six weeks, which were divided into three two-week cycles. During each cycle, the patient received placebo or LD/CD (100 mg LD and 25 mg CD combined into a single capsule) for one week each, assigned in random order. A total of seven FMA scores (including one at baseline) were recorded for the single patient in the study. The mean difference in total FMA score between LD/CD and placebo was 6.9, with 95% CI (1.24, 12.57), indicating a significant benefit for LD/CD versus placebo. This was calculated using a generalized least squares model with an AR(1) autoregressive correlation structure.
To demonstrate our proposed methods, we designed an N-of-1 trial with interim monitoring using the parameter estimates from this pilot study and evaluated the results based on the observed data. Based on the results in Table 1 of [27], we took , and . We further assumed a minimum clinically meaningful difference between LD/CD and placebo to be , with , so that . Based on these parameters, we identified an O’Brien-Fleming design with seven cycles, with interim analyses following cycles two through seven. This yields and as the first two critical values. For the [27] trial, the summary data are provided for three cycles (Figure 3 in the paper), with and . Thus, , , and . We can conclude that had the [27] trial been designed to have a maximum of seven cycles with the possibility for early stopping, it would not have stopped early for equivalence or for superiority after three cycles based on examination of () relative to .
We additionally conducted a simulation study (1000 repetitions) of the N-of-1 trial using the parameters of the actual [27] trial: , . We again set , which requires cycles maximum for 80% power to detect superiority, but has only 4.1% power to detect equivalence. To detect equivalence with 80% power, we would require 52 cycles. As a practical consideration, we may elect to keep the maximum number of cycles at seven, and forgo the ability to detect equivalence. Table 1 contains the results of the simulation study in which we examine the findings for and for and 0.2, potentially allowing for early stopping for futility. The findings confirm our design calculations as we achieve 79.7% power to detect when and cycles. Likewise, we achieve 79.7% power to detect when and cycles. When and , the expected number of cycles required is 4.6, which offers considerable savings relative to the maximum of 7 cycles. When early stopping for futility is allowed (i.e., ), the power under is lowered slightly to 71.7%. As expected, the expected number of cycles is further reduced to 3.8. Also, there are considerably fewer inconclusive findings, as many of those are ascribed to futility.
TABLE 1.
Estimates under the single-person trial (Ennis et al, 2013) with , , , and 1000 simulations.
| Maximum # of cycles | Futility threshold | True treatment effect | A=B | A>B | B>A | Inconclusive | Futility | Expected # of cycles |
|---|---|---|---|---|---|---|---|---|
| 52 | 0 | 0 | 79.7% | 1.3% | 1.6% | 17.4% | - | 40.6 |
| 7 | 0 | 0 | 4.1% | 6.2% | 4.4% | 85.3% | - | 6.4 |
| 7 | 0 | 4.2 | 1.3% | 0.2% | 39.1% | 59.4% | - | 5.8 |
| 7 | 0 | 4.6 | 1.1% | 0.2% | 43.6% | 55.1% | - | 5.7 |
| 7 | 0 | 5.1 | 1.2% | 0.0% | 55.2% | 43.6% | - | 5.3 |
| 7 | 0 | 5.6 | 0.8% | 0.0% | 63.8% | 35.4% | - | 5.2 |
| 7 | 0 | 6.1 | 0.6% | 0.0% | 71.0% | 28.4% | - | 4.8 |
| 7 | 0 | 6.9 | 0.1% | 0.2% | 79.7% | 20.0% | - | 4.6 |
| 7 | 0 | 7.1 | 0.4% | 0.0% | 82.6% | 17.0% | - | 4.5 |
| 7 | 0 | 7.6 | 0.1% | 0.0% | 88.1% | 11.8% | - | 4.3 |
| 52 | 0.2 | 0 | 28.7% | 0.9% | 1.1% | 2.7% | 66.6% | 18.2 |
| 7 | 0.2 | 0 | 4.0% | 4.5% | 4.6% | 37.4% | 49.5% | 4.7 |
| 7 | 0.2 | 4.2 | 1.6% | 0.1% | 34.1% | 3.7% | 53.0% | 3.8 |
| 7 | 0.2 | 4.6 | 1.9% | 0.4% | 40.3% | 4.4% | 36.9% | 3.7 |
| 7 | 0.2 | 5.1 | 1.2% | 0% | 45.6% | 4.7% | 48.5% | 3.7 |
| 7 | 0.2 | 5.6 | 1.0% | 0.1% | 56.3% | 3.0% | 39.6% | 3.8 |
| 7 | 0.2 | 6.1 | 1.1% | 0% | 59.4% | 3.2% | 36.3% | 3.8 |
| 7 | 0.2 | 6.9 | 0.4% | 0.0% | 71.7% | 3.7% | 24.2% | 3.8 |
| 7 | 0.2 | 7.1 | 0.2% | 0% | 73.3% | 2.4% | 24.1% | 3.8 |
| 7 | 0.2 | 7.6 | 0.1% | 0.0% | 81.2% | 1.8% | 16.9% | 3.7 |
We then designed multi-N-of-1 trials under four alternative values for and for the test of versus , at four alternatives , and the null . We conducted an interim analysis after each patient’s N-of-1 trial was completed, starting with the second patient. The N-of-1 trials were designed as described above, and consisted of up to seven two-week cycles (). Assuming , , , , we calculated the requisite number of individuals, , and decision threshold, . We conducted 1000 simulations and report the expected number of individuals accrued at the time of the stopping of the study, the expected number of cycles per individual, the individual-level probabilities of the possible decisions and the population-level probability of arriving at the correct conclusion. The results are displayed in Table 2. For example, when , we calculate and , from which we calculate and thus require a total of individuals. In this scenario, the expected number of individuals that are tested is only 5.8 and the expected number of cycles per individual is 5.2. This differs from the 4.6 of Table 1 because the population is a mixture of those with and those with . We correctly conclude that with probability 79.6% and we correctly identify for individuals with probability 80.3%.
TABLE 2.
Estimates under the multi-person trial (Ennis et al. 2013) with , , , , , and , over 1000 simulations.
| Individual Probability | Population Probability | |||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
||||||||||||||||
| Expected # of individuals | Expected # of cycles | Conditional | Conclusion | Futility | Inconclusive | Claiming False Superiority | Claiming False Equivalence | Correct Conclusion: p3 > p03 | ||||||||
|
| ||||||||||||||||
| 0 | (0, 0,1) | 1 | 0.46 | 6 | 4 | 3 | 6 | 3.3 | 4.7 | 79.8% | - | 19.6% | 82.0% | |||
| τ = T | 79.8% | - | 19.6% | 0.0% | 0.6% | |||||||||||
| τ = 0 | - | - | - | - | - | |||||||||||
| τ = −T | - | - | - | - | - | |||||||||||
| (0,2/10,8/10) | 0.96 | 0.46 | 7 | 4 | 3 | 6 | 3.2 | 4.9 | 66.4% | - | 31.5% | 80.2% | ||||
| τ = T | 81.1% | - | 18.4% | 0.0% | 0.5% | |||||||||||
| τ = 0 | 4.8% | - | 86.2% | 9.0% | - | |||||||||||
| τ = −T | - | - | - | - | - | |||||||||||
| (0, 3/10, 7/10) | 0.93 | 0.46 | 12 | 7 | 5 | 6 | 5.8 | 5.2 | 57.8% | - | 39.4% | 79.6% | ||||
| τ = T | 80.3% | - | 19.4% | 0.0% | 0.3% | |||||||||||
| τ = 0 | 4.7% | - | 86.6% | 8.7% | - | |||||||||||
| τ = −T | - | - | - | - | - | |||||||||||
| (1/5,1/5,3/5) | 0.72 | 0.46 | 35 | 23 | 14 | 6 | 19.6 | 5.0 | 64.9% | - | 32.8% | 81.0% | ||||
| τ = T | 80.7% | - | 18.7% | 0.0% | 0.6% | |||||||||||
| τ = 0 | 4.2% | - | 86.9% | 8.9% | - | |||||||||||
| τ = −T | 80.3% | - | 19.2% | 0.0% | 0.5% | |||||||||||
| 0.2 | (0, 0,1) | 1 | 0.46 | 6 | 4 | 3 | 6 | 3.2 | 4.6 | 72.7% | 2.5% | 24.4% | 79.1% | |||
| τ = T | 72.7% | 2.5% | 24.4% | 0.0% | 0.4% | |||||||||||
| τ = 0 | - | - | - | - | - | |||||||||||
| τ = −T | - | - | - | - | - | |||||||||||
| (0,2/10,8/10) | 0.96 | 0.46 | 7 | 4 | 3 | 6 | 3.5 | 4.0 | 58.6% | 8.8% | 30.4% | 79.1% | ||||
| τ = T | 71.6% | 2.7% | 25.3% | 0.0% | 0.4% | |||||||||||
| τ = 0 | 3.9% | 34.2% | 53.4% | 8.5% | - | |||||||||||
| τ = −T | - | - | - | - | - | |||||||||||
| (0, 3/10, 7/10) | 0.93 | 0.46 | 13 | 7 | 5 | 6 | 5.8 | 4.0 | 52.1% | 12.3% | 32.7% | 80.1% | ||||
| τ = T | 71.8% | 2.8% | 24.8% | 0.0% | 0.6% | |||||||||||
| τ = 0 | 4.2% | 35.5% | 51.9% | 8.3% | - | |||||||||||
| τ = −T | - | - | - | - | - | |||||||||||
| (1/5,1/5,3/5) | 0.72 | 0.46 | 44 | 26 | 16 | 6 | 22.1 | 4.0 | 58.6% | 9.7% | 29.6% | 79.1% | ||||
| τ = T | 72.5% | 3.1% | 24.0% | 0.0% | 0.4% | |||||||||||
| τ = 0 | 4.0% | 36.3% | 51.3% | 8.4% | - | |||||||||||
| τ = −T | 72.5% | 2.7% | 24.4% | 0.0% | 0.4% | |||||||||||
5.2 |. Methylphenidate in early Alzheimer’s disease
A pilot N-of-1 study of a potential symptomatic treatment, methylphenidate, for early-stage AD was conducted [26]. Seven participants underwent three, month-long treatment blocks (two weeks of treatment with methylphenidate and two weeks of placebo in random order). The primary endpoint was cognition, measured using a standard cognitive assessment tool, the Repeatable Battery for the Assessment of Neuropsychological Status (RBANS) [28], at the end of each treatment period.
We consider two study designs for a future N-of-1 trial of MPH: one with cycles, as in [26] and one with cycles. We calculated from our pilot data for three RBANS indices: immediate memory, delayed memory and total scale. We selected the corresponding equivalence margins, , to be the standard deviations reported for these indices in patients with mild cognitive impairment (MCI) in [29]. Table 3 lists the detectable (with 80% power) mean differences for superiority () for these RBANS indices in the context of the sequential design with O’Brien-Fleming stopping boundaries, as well as the associated power to detect equivalence.
TABLE 3.
MPH study design: 80% power for superiority
| Number of cycles | RBANS Index | Standard deviation (pilot data) | Equivalence margin | Mean difference for 80% power | Expected mean difference MCI vs. healthy control | Power to detect equivalence |
|---|---|---|---|---|---|---|
| 3 | Immediate Memory | 8.8 | 14.1 | 32.1 | 16.2 | 42% |
| 3 | Delayed Memory | 7.8 | 16.5 | 32.5 | 28.0 | 56% |
| 3 | Total Scale | 4.2 | 11.1 | 19.7 | 21.6 | 88% |
| 10 | Immediate Memory | 8.8 | 14.1 | 24.16 | 16.2 | 93% |
| 10 | Delayed Memory | 7.8 | 16.5 | 25.42 | 28.0 | 100% |
| 10 | Total Scale | 4.2 | 11.1 | 15.89 | 21.6 | 100% |
Selecting the RBANS Total Scale on the basis of its having the smallest standard deviation and the smallest margin for equivalence, we further explore its properties in simulation. Table 4 lists the results of a simulation study (1000 repetitions) of the RBANS Total Scale with 10 cycles, , and of 14.1, 14.6, 15.1, 15.89, 16.1, 16.6, 17.1. These results confirm our design approach as there is 80.5% power to detect with 10 cycles and 78.9% power to detect equivalence with 3 cycles. The savings due to the sequential design are apparent, with an expected number of cycles of 6.3 when . Inclusion of a stopping boundary for futility reduces the power to 64.6% and the expected number of cycles to 4.3.
TABLE 4.
Estimates under the single-person trial (DesRuisseaux et al, 2020) with , , , and 1000 simulations.
| Cycles | Futility parameter | Mean difference | A=B | A>B | B>A | Inconclusive | Futility | Expected # of cycles |
|---|---|---|---|---|---|---|---|---|
| 3 | 0 | 0 | 78.9% | 0.1% | 0.0% | 21.0% | - | 2.5 |
| 10 | 0 | 0 | 100% | 0.0% | 0.0% | 0.0% | - | 4.0 |
| 10 | 0 | 14.1 | 1.8% | 0.0% | 57.3% | 40.9% | - | 7.4 |
| 10 | 0 | 14.6 | 2.5% | 0.0% | 58.8% | 38.7% | - | 7.3 |
| 10 | 0 | 15.1 | 1.6% | 0.0% | 68.0% | 30.4% | - | 6.9 |
| 10 | 0 | 15.89 | 1.5% | 0.0% | 80.5% | 18.0% | - | 6.3 |
| 10 | 0 | 16.1 | 0.9% | 0.0% | 83.9% | 15.2% | - | 6.3 |
| 10 | 0 | 16.6 | 0.9% | 0.0% | 86.9% | 12.2% | - | 5.8 |
| 10 | 0 | 17.1 | 0.6% | 0.0% | 92.6% | 6.8% | - | 5.6 |
| 3 | 0.2 | 0 | 74.8% | 0.0% | 0.0% | 12.9% | 12.3% | 2.5 |
| 10 | 0.2 | 0 | 87.9% | 0.2% | 0.0% | 0.0% | 11.9% | 3.7 |
| 10 | 0.2 | 14.1 | 2.3% | 0.0% | 40.1% | 3.8% | 51.2% | 3.9 |
| 10 | 0.2 | 14.6 | 1.4% | 0.0% | 47.4% | 5.4% | 45.8% | 4.1 |
| 10 | 0.2 | 15.1 | 1.4% | 0.0% | 55.0% | 3.5% | 40.1% | 4.2 |
| 10 | 0.2 | 15.89 | 1.0% | 0.0% | 64.6% | 2.2% | 32.2% | 4.3 |
| 10 | 0.2 | 16.1 | 0.5% | 0.0% | 66.2% | 2.2% | 31.1% | 4.2 |
| 10 | 0.2 | 16.6 | 0.7% | 0.0% | 71.0% | 1.8% | 26.5% | 4.3 |
| 10 | 0.2 | 17.1 | 0.7% | 0.0% | 77.5% | 0.8% | 21.0% | 4.1 |
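The early stopping seen in these simulations follows from the shape of O’Brien-Fleming boundaries, which are very strict at early looks and relax toward the final look. The following Python sketch illustrates that shape only. It assumes a standardized test statistic computed after each cycle and equally spaced looks; the function names are hypothetical, and the final critical value of 2.087 is an assumed illustrative constant (roughly the tabulated two-sided 5% value for 10 looks in standard group sequential references), not a setting taken from this study.

```python
import math


def obf_boundaries(n_looks, final_critical_value):
    """O'Brien-Fleming-shaped critical values c_k = C * sqrt(K / k), k = 1..K.

    final_critical_value (C) is supplied by the caller; it is NOT derived here.
    """
    if n_looks < 1:
        raise ValueError("n_looks must be at least 1")
    return [final_critical_value * math.sqrt(n_looks / k)
            for k in range(1, n_looks + 1)]


def first_crossing(z_stats, boundaries):
    """Return the 1-based look at which |z| first reaches its boundary, else None."""
    for k, (z, c) in enumerate(zip(z_stats, boundaries), start=1):
        if abs(z) >= c:
            return k
    return None


# Hypothetical example: 10 cycles, assumed final critical value 2.087.
bounds = obf_boundaries(10, 2.087)
# Hypothetical standardized statistics after cycles 1 to 4.
stop_look = first_crossing([1.0, 2.0, 3.0, 4.0], bounds)
```

In this hypothetical run the statistic first reaches the boundary at the fourth look, because the boundary has by then relaxed to about 3.3 while the first three looks require values above 3.8.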
We then designed multi-N-of-1 trials for the test of the null hypothesis against the alternative p3 > p03, at the four alternative sets of population proportions listed in Table 5, with and without a futility boundary in the individual trials. We conducted an interim analysis after each patient’s N-of-1 trial was completed, starting with the second patient. The N-of-1 trials were designed as described above, and consisted of up to 10 two-week cycles. Under the assumed design parameters, we calculated the requisite number of individuals and the decision threshold. We conducted 1000 simulations and report the expected number of individuals accrued at the time of the stopping of the study, the expected number of cycles per individual, the individual-level probabilities of the possible decisions, and the population-level probability of arriving at the correct conclusion. The results are displayed in Table 5. For example, for the alternative (0, 3/10, 7/10) without a futility boundary, we calculate the design quantities shown in Table 5 and, from them, the total number of individuals required. In this scenario, the expected number of individuals that are tested is only 9.9 and the expected number of cycles per individual is 5.7. We correctly conclude that p3 > p03 with probability 79.8% and we correctly identify τ = T for individuals with probability 80.3%.
TABLE 5.
Estimates under the multi-person trial (DesRuisseaux et al., 2020), over 1000 simulations.
| Futility parameter | Population proportions | | | | | | | Expected # of individuals | Expected # of cycles | Conditional on τ | Individual: Correct Conclusion | Individual: Futility | Individual: Inconclusive | Individual: Claiming False Superiority | Individual: Claiming False Equivalence | Population: Correct Conclusion (p3 > p03) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (0, 0, 1) | 1 | 0.31 | 4 | 3 | 2 | 9 | 2.6 | 6.3 | | 79.8% | - | 19.0% | | | 82.7% |
| | | | | | | | | | | τ = T | 79.8% | - | 19.0% | 0.0% | 1.2% | |
| | | | | | | | | | | τ = 0 | - | - | - | - | - | |
| | | | | | | | | | | τ = −T | - | - | - | - | - | |
| 0 | (0, 2/10, 8/10) | 0.77 | 0.31 | 11 | 9 | 5 | 9 | 7.2 | 5.9 | | 84.4% | - | 14.7% | | | 81.9% |
| | | | | | | | | | | τ = T | 80.8% | - | 18.2% | 0.0% | 1.0% | |
| | | | | | | | | | | τ = 0 | 99.8% | - | 0.1% | 0.1% | - | |
| | | | | | | | | | | τ = −T | - | - | - | - | - | |
| 0 | (0, 3/10, 7/10) | 0.66 | 0.31 | 14 | 12 | 6 | 9 | 9.9 | 5.7 | | 86.0% | - | 13.3% | | | 79.8% |
| | | | | | | | | | | τ = T | 80.3% | - | 18.9% | 0.0% | 0.8% | |
| | | | | | | | | | | τ = 0 | 99.9% | - | 0.0% | 0.01% | - | |
| | | | | | | | | | | τ = −T | - | - | - | - | - | |
| 0 | (1/5, 1/5, 3/5) | 0.58 | 0.31 | 26 | 22 | 10 | 9 | 18.5 | 5.8 | | 84.6% | - | 14.5% | | | 79.4% |
| | | | | | | | | | | τ = T | 80.8% | - | 18.2% | 0.0% | 1.0% | |
| | | | | | | | | | | τ = 0 | 99.7% | - | 0.2% | 0.1% | - | |
| | | | | | | | | | | τ = −T | 80.8% | - | 18.1% | 0.0% | 1.1% | |
| 0.2 | (0, 0, 1) | 1 | 0.30 | 5 | 3 | 2 | 9 | 2.5 | 4.2 | | 64.3% | 32.1% | 2.5% | | | 78.3% |
| | | | | | | | | | | τ = T | 64.3% | 32.1% | 2.5% | 0.0% | 1.1% | |
| | | | | | | | | | | τ = 0 | - | - | - | - | - | |
| | | | | | | | | | | τ = −T | - | - | - | - | - | |
| 0.2 | (0, 2/10, 8/10) | 0.75 | 0.30 | 13 | 9 | 5 | 9 | 7.1 | 4.1 | | 68.5% | 28.1% | 2.6% | | | 79.8% |
| | | | | | | | | | | τ = T | 63.9% | 32.0% | 3.1% | 0.0% | 1.0% | |
| | | | | | | | | | | τ = 0 | 86.9% | 12.4% | 0.6% | 0% | - | |
| | | | | | | | | | | τ = −T | - | - | - | - | - | |
| 0.2 | (0, 3/10, 7/10) | 0.63 | 0.30 | 21 | 15 | 7 | 9 | 11.8 | 4.1 | | 71.6% | 26.0% | 1.8% | | | 79.5% |
| | | | | | | | | | | τ = T | 64.5% | 32.1% | 2.6% | 0.0% | 0.8% | |
| | | | | | | | | | | τ = 0 | 88.1% | 11.8% | 0% | 0.1% | - | |
| | | | | | | | | | | τ = −T | - | - | - | - | - | |
| 0.2 | (1/5, 1/5, 3/5) | 0.56 | 0.30 | 36 | 25 | 11 | 9 | 20.9 | 4.1 | | 69.0% | 27.6% | 2.6% | | | 80.4% |
| | | | | | | | | | | τ = T | 64.0% | 32.0% | 3.1% | 0% | 0.9% | |
| | | | | | | | | | | τ = 0 | 89.2% | 10.4% | 0.4% | 0% | - | |
| | | | | | | | | | | τ = −T | 63.7% | 32.0% | 3.2% | 0% | 1.1% | |
6 |. DISCUSSION
We have introduced a novel approach to N-of-1 trials, applicable to both single-person and multi-person settings, which incorporates sequential monitoring for improved efficiency. While meta-analysis has been used previously to combine N-of-1 trials [18], sequential monitoring has not. This approach aligns well with the principles of personalized medicine, as it focuses on unique patient responses in clinical decision-making. It is well-suited for some Alzheimer’s disease trials, which typically require large sample sizes and years of follow-up, and do not acknowledge patient heterogeneity. In general, the N-of-1 design is best suited for diseases whose symptoms are stable over the time frame of the study (in the absence of treatment), and for treatments that aim to alleviate symptoms rather than the underlying disease process. Treatments should have rapid onsets and cessations of action in order to reasonably minimize the duration of each treatment and placebo period and minimize crossover effects [4]. As AD cognitive and functional symptoms are generally stable over periods of about a year [30], it is an appropriate condition for evaluating certain approved treatments for alleviation of symptoms in N-of-1 trials [26, 10]. In this setting, there may or may not be one drug that is superior for all patients and there is a need to discover the best drug for each patient as quickly as possible.
For single-person N-of-1 trials, our approach permits early stopping as soon as there is sufficient evidence of a treatment preference for the individual. The multi-person N-of-1 trial, layered upon the single-person N-of-1 trials, enables us to ascertain whether there is an optimal treatment preference at the population level, and it is applied while treatments are being investigated in a patient population. The multi-person analysis is based on a biased sample of the population, as it includes those for whom a definitive finding is obtained and it preferentially includes those with the strongest treatment effects. We have addressed this bias in our design of the multi-person trial by adjusting the frequencies of treatment outcomes. The design leverages the speed of detection at the individual level, as those for whom an optimal treatment is detected earlier enter the multi-person trial sooner than those for whom a treatment is detected later. Ultimately, this supports a fast and conclusive finding about the rankings of the treatments. We have adopted the simplest, “curtailed” sequential design for the multi-person trial based on binary outcomes: we stop early based on a deterministic assessment of whether a significant result at the planned end of the trial is already assured or no longer possible. Probability-based designs, such as those using bona fide stopping rules, would lead to even shorter trials. Once a population-level ranking or partial ranking of treatments is ascertained, single-person N-of-1 trials may still be utilized for optimal treatment of individuals.
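The logic of a curtailed design for binary outcomes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in the authors' repository: the function names are hypothetical, the final test is assumed to reject the null when the number of successes strictly exceeds a threshold, and the values of 12 planned individuals and a threshold of 6 are used purely as example inputs.

```python
def curtailed_decision(successes, observed, planned_n, threshold):
    """Classify the current state of a curtailed binomial test.

    Assumption: the final test rejects the null when the total number of
    successes among planned_n binary outcomes is strictly greater than threshold.
    """
    if not 0 <= successes <= observed <= planned_n:
        raise ValueError("need 0 <= successes <= observed <= planned_n")
    remaining = planned_n - observed
    if successes > threshold:
        return "reject"    # rejection is already guaranteed
    if successes + remaining <= threshold:
        return "accept"    # rejection can no longer occur
    return "continue"      # the final outcome is still undetermined


def run_curtailed(outcomes, planned_n, threshold):
    """Process binary outcomes in order; return (decision, number observed)."""
    successes = 0
    observed = 0
    for outcome in outcomes[:planned_n]:
        observed += 1
        successes += int(bool(outcome))
        decision = curtailed_decision(successes, observed, planned_n, threshold)
        if decision != "continue":
            return decision, observed
    return "continue", observed


# Hypothetical inputs: 12 planned individuals, reject if successes > 6.
early_reject = run_curtailed([1, 1, 1, 0, 1, 1, 1, 1], planned_n=12, threshold=6)
early_accept = run_curtailed([0] * 12, planned_n=12, threshold=6)
```

With these example inputs the first sequence stops after 8 individuals, once a seventh success makes rejection certain, and the all-failure sequence stops after 6 individuals, once the remaining 6 could no longer push the count above the threshold. Because stopping is deterministic, the final decision is identical to that of the fixed-sample test.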
Our simulations that emulated a future brain tumor excision trial and an early AD trial demonstrated the efficiencies afforded by our proposed design. With the increased opportunities for remote clinical trials with wireless clinical monitoring devices, the sequential N-of-1 design will be easily implemented and has tremendous potential to shorten trials and reduce sample sizes. This will enable effective testing of established and innovative treatments in appropriate disease settings.
Supporting Information
Software in the form of R code is available on our GitHub repository (https://github.com/jj113/nof1). This repository includes both the sample size calculator as well as code for replicating the simulation results.
Acknowledgements
This research was supported in part by NIH grants R01NS094610, P30AG066512, P01AG036694, and UL1 TR002345.
References
- [1]. Piantadosi S. Clinical Trials: A Methodologic Perspective. Wiley Series in Probability and Statistics; 2017.
- [2]. Varadhan R, Seeger JD. Estimation and Reporting of Heterogeneity of Treatment Effects. Developing a Protocol for Observational Comparative Effectiveness Research: A User’s Guide; 2013.
- [3]. Ishigooka J, Murasaki M, Miura S. Olanzapine optimal dose: results of an open-label multicenter study in schizophrenic patients. Olanzapine Late-Phase II Study Group. Psychiatry Clin Neurosci 2000;54:467–478.
- [4]. Arnold SE, Betensky RA. Multicrossover Randomized Controlled Trial Designs in Alzheimer Disease. Annals of Neurology 2018;84:168–175.
- [5]. Schork NJ. Personalized medicine: Time for one-person trials. Nature 2015;520:609–611.
- [6]. Gabler N, Duan N, Vohra S, Kravitz R. N-Of-1 Trials in the Medical Literature: A Systematic Review. Medical Care 2011;49:761–768.
- [7]. Guyatt GH, Kelly JL, Jaeschke R, Rosenbloom D, Adachi JD, Newhouse MT. The n-of-1 Randomized Controlled Trial: Clinical Usefulness: Our Three-Year Experience. Annals of Internal Medicine 1991;112:293–299.
- [8]. Duan N, Kravitz RL, Schmid CH. Single-patient (n-of-1) trials: a pragmatic clinical decision methodology for patient-centered comparative effectiveness research. Journal of Clinical Epidemiology 2013;66:s21–s28.
- [9]. Guyatt G, Sackett D, Taylor DW, Chong J, Roberts R, Pugsley S. Determining Optimal Therapy — Randomized Trials in Individual Patients. New England Journal of Medicine 1986;314(14):889–892. 10.1056/NEJM198604033141406.
- [10]. Molloy DW, Guyatt GH, Wilson DB, Duke R, Rees L, Singer J. Effect of tetrahydroaminoacridine on cognition, function and behaviour in Alzheimer’s disease. Can Med Assoc J 1991;144:29–34.
- [11]. Silvestris N, Ciliberto G, Paoli PD, Apolone G, Lavitrano ML, Pierotti MA, et al. Liquid dynamic medicine and N-of-1 clinical trials: a change of perspective in oncology research. Journal of Experimental & Clinical Cancer Research 2017;36:128.
- [12]. Sung L, Feldman BM. Liquid dynamic medicine and N-of-1 clinical trials: a change of perspective in oncology research. J Pediatr Hematol Oncol 2006;28:263–266.
- [13]. Zucker DR, Ruthazer R, Schmid CH. Individual (N-of-1) trials can be combined to give population comparative treatment effect estimates: Methodologic considerations. J Clin Epidemiol 2010;63(12):1312–1323.
- [14]. Haines DR, Gaines SP. N of 1 randomised controlled trials of oral ketamine in patients with chronic pain. Pain 1999;82:283–287.
- [15]. Marcucci M, Germini F, Coerezza A, Andreineitti L, Bellintani L, Nobili A, et al. Efficacy of ultra-micronized palmitoylethanolamide (um-PEA) in geriatric patients with chronic pain: study protocol for a series of N-of-1 randomized trials. Trials 2016;17:369.
- [16]. Chen X, Chen P. A comparison of four methods for the analysis of n-of-1 trials. PLoS ONE 2014;9(2):e87752.
- [17]. Araujo A, Julious S, Senn S. Understanding variation in sets of n-of-1 trials. PLoS ONE 2016;11(12):e0167167.
- [18]. Punja S, Bukutu C, Shamseer L, Sampson M, Hartling L, Urichuk L, et al. N-of-1 trials are a tapestry of heterogeneity. Journal of Clinical Epidemiology 2016;76:47–56.
- [19]. Zucker DR, Schmid CH, McIntosh MW, D’Agostino RB, Selker HP, Lau J. Combining Single Patient (N-of-1) Trials to Estimate Population Treatment Effects and to Evaluate Individual Patient Responses to Treatment. J Clin Epidemiol 1997;50(4):401–410.
- [20]. Brigham and Women’s Hospital DEcIDE Methods Center. Design and Implementation of N-of-1 Trials: A User’s Guide. Agency for Healthcare Research and Quality; 2014.
- [21]. Senn S. Sample size considerations for n-of-1 trials. Statistical Methods in Medical Research 2017;0(0):1–12.
- [22]. O’Brien PC, Fleming TR. A Multiple Testing Procedure for Clinical Trials. Biometrics 1979;35:549–556.
- [23]. Jennison C, Turnbull BW. Group sequential methods with applications to clinical trials. Chapman & Hall; 1999.
- [24]. Siegmund D. Sequential analysis: tests and confidence intervals. Springer Science & Business Media; 1985.
- [25]. McCullagh P, Nelder JA. Generalized Linear Models. London New York: Chapman and Hall; 1989.
- [26]. DesRuisseaux LA, Williams VJ, McManus AJ, Gupta AS, Carlyle BC, Azami H, et al. A pilot protocol to assess the feasibility of a virtual multiple crossover, randomized controlled trial design using methylphenidate in mild cognitive impairment. Trials 2020 Dec;21(1):1016.
- [27]. Ennis JD, Harvey D, Ho E, Chari V, Garham A, Nesathurai S. Levodopa/carbidopa to improve motor function subsequent to brain tumor excision. Am J Phys Med Rehabil 2013;92:307–311.
- [28]. Randolph C, Tierney MC, Mohr E, Chase TN. The Repeatable Battery for the Assessment of Neuropsychological Status (RBANS): preliminary clinical validity. J Clin Exp Neuropsychol 1998 Jun;20(3):310–319.
- [29]. Karantzoulis S, Novitski J, Gold M, Randolph C. The Repeatable Battery for the Assessment of Neuropsychological Status (RBANS): Utility in detection and characterization of mild cognitive impairment due to Alzheimer’s disease. Arch Clin Neuropsychol 2013 Dec;28(8):837–844.
- [30]. Charpignon ML, Vakulenko-Lagun B, Zheng B, Magdamo C, Su B, Evans K, et al. Drug repurposing of metformin for Alzheimer’s disease: Combining causal inference in medical records data and systems pharmacology for biomarker identification. medRxiv 2021; https://www.medrxiv.org/content/early/2021/08/12/2021.08.10.21261747.
