2022 Dec 16:1–25. Online ahead of print. doi: 10.3758/s13428-022-02012-1

Power analysis for idiographic (within-subject) clinical trials: Implications for treatments of rare conditions and precision medicine

Stephen Tueller 1, Derek Ramirez 1, Jessica D Cance 1, Ai Ye 2, Anne C Wheeler 1,2, Zheng Fan 2, Christoph Hornik 3, Ty A Ridenour 1,2,4
PMCID: PMC9757638  PMID: 36526885

Abstract

Power analysis informs a priori planning of behavioral and medical research, including for randomized clinical trials that are nomothetic (i.e., studies designed to infer results to the general population based on interindividual variabilities). Far fewer investigations and resources are available for power analysis of clinical trials that follow an idiographic approach, which emphasizes intraindividual variability between a baseline (control) phase and one or more treatment phases. Assuming a multiple baseline design, we tested the impact on statistical power to detect treatment outcomes of four idiographic trial design factors that are under researchers' control: sample size, number of observations per participant, proportion of observations in the baseline phase, and competing statistical models (i.e., hierarchical modeling versus piecewise regression). We also tested the impact of four factors that are largely outside of researchers' control: population size, proportion of intraindividual variability due to residual error, treatment effect size, and form of outcomes during the treatment phase (phase jump versus gradual change). Monte Carlo simulations covered all combinations of these factors, with samples drawn with replacement from finite populations of 200, 1750, and 3500 participants. Analyses characterized the unique relative impact of each factor individually and of all two-factor combinations, holding all others constant. Each factor impacted power. The greatest impact came from larger treatment effect sizes, followed respectively by more observations per participant, larger samples, less residual variance, and the unexpected improvement in power associated with assigning closer to 50% of observations to the baseline phase. This study's techniques and R package better enable rigorous a priori design of idiographic clinical trials for rare diseases, precision medicine, and other small-sample studies.

Supplementary Information

The online version contains supplementary material available at 10.3758/s13428-022-02012-1.

Keywords: Within-subject clinical trials, Idiographic, Statistical power, Effect size, Hierarchical modeling, Piecewise regression, Monte Carlo simulation, Rare diseases, Precision medicine


Clinical trials of treatments for rare diseases can be impeded simply by their small populations, as the large samples traditionally used in randomized trials may not exist. Other research scenarios that often employ small samples include pilot studies, organizational research such as implementation science, pragmatic trials, precision medicine, microtrials and micro-randomized trials, and applied behavior analysis. Additional barriers frequently preclude using traditional randomized controlled trial (RCT) methods for these scenarios: limited funding, few treatment centers to support recruitment, and even fewer treatment centers that can support standardized protocols (National Institutes of Health, 2019; Teare et al., 2014).

Small-sample, within-subject clinical trials offer powerful methods for rare disease clinical trials (McDonald & Nikles, 2021; Nikles et al., 2021; Percha et al., 2019; Ridenour et al., 2016; Shaffer et al., 2018). Nearly 7000 rare diseases are known, which collectively affect about 25 million US citizens (Groft, 2013; National Institutes of Health, 2019). One exemplar is Angelman syndrome, a neurogenetic disorder that affects 1 in 15,000 US newborns. Angelman syndrome is severe and debilitating, with symptoms including severe intellectual disability, epilepsy, and aggression toward self. Improved understanding of the genetic mechanisms of Angelman syndrome has led to clear hypotheses for treating it. However, clinical trials for Angelman syndrome are limited by its low prevalence, the few treatment centers available for recruitment, the unknown generalizability from these patients, and low statistical power to test treatment outcomes (Baquet et al., 2006; Mills et al., 2006).

Within-subject experimental studies, also termed N-of-1 studies, single-case experimental designs, single-case studies, or certain quasi-experimental designs, have traditionally been used with small samples to test treatment effects that are large (i.e., can be "seen" visually in time series data) and that span a relatively short time before and after treatment exposure (e.g., within days or weeks). They emphasize reporting time series data for a single outcome and for each individual participant, and do not involve comparisons between subsamples (Kazdin, 2011; Shaffer et al., 2018; Smith, 2012). Historically, researchers using these designs have relied on visual inspection rather than rigorous statistical analyses. However, there are exceptions to all of these characterizations (Marcus et al., 2022; Ridenour et al., 2016).

Recent developments in within-subject clinical trials

Recent advances in methods for within-subject clinical trials capitalize on their strengths and address limitations of traditional RCTs for empirically-based personalized medicine (Blackston et al., 2019; Daza, 2018; Duan et al., 2013; Kronish et al., 2019; Percha et al., 2019; Ridenour et al., 2016). These include simulation studies to support a priori study design based on power to detect treatment outcomes in the form of phase jumps (i.e., changes in an outcome that occur immediately upon treatment exposure) and gradual changes (i.e., gradually increasing or decreasing effects) that account for autocorrelation, carry-over effects, and other factors of within-subject studies (Blackston et al., 2019; Duan et al., 2013; Percha et al., 2019). They also demonstrate that RCT counterfactual underpinnings are reflected in single-subject phased time series data (Daza, 2018; Holland, 1986; Rubin, 1974; Splawa-Neyman et al., 1990) and include statistical programs that have been developed to design and analyze within-subject studies (e.g., https://statsof1.org/resources/#sample-size%2D%2Dstatistical-power).

Herein, we test the relative impacts of eight design factors on power to detect treatment-related phase jumps or slopes in outcome variables of within-subject clinical trials. We use an umbrella term, idiographic clinical trials (ICTs), rather than case studies or “N-of-1” for three reasons. ICTs are designed a priori for samples of multiple participants; N-of-1 studies historically have not utilized rigorous stochastic analyses to account for random variation and statistically test hypotheses; and ICTs frequently test differences among subgroups (e.g., treatment vs. control groups) as well as phases within individuals (Howe & Ridenour, 2019; Myin-Germeys et al., 2009; Ridenour et al., 2016; Wright et al., 2015).

Statistical models used to analyze ICT data include interrupted time series analysis, hierarchical modeling for small samples, and unified structural equations (Baek & Ferron, 2013; Ridenour et al., 2013; Ridenour et al., 2016; Trompetter et al., 2019). An ICT usually emphasizes intensive within-person study of clinical processes in small, homogeneous samples, or even at the individual level, and then aggregates those findings. Although ICTs are frequently designed to detect short-term responses to a treatment using daily or more frequent observations, they can also be used to study outcomes that unfold over much longer timelines, with weekly or longer intervals between observations (Ridenour et al., 2013; Wittenborn et al., 2019).

ICTs are uniquely suited to rare diseases: they use small samples, provide statistical rigor, typically cost far less than RCTs, yield detailed analysis of outcome heterogeneity, and offer the option of being overlaid onto usual-care practices (Ridenour et al., 2016; Wittenborn et al., 2019). ICTs can quantify mechanisms of change during the administration of treatments, regardless of how long those treatments take to administer (e.g., weeks to months of psychotherapy), and have used data from health records to minimize burden on participants (Howe & Ridenour, 2019; Ridenour et al., 2016; Ridenour et al., 2017; Wittenborn et al., 2019). ICTs can also incorporate patient preferences (Ridenour et al., 2013). For example, patients with different medical conditions tend to favor different design features, such as long study duration (chronic pain), low costs of study participation (hypertension), or greater tolerance for frequent data collection (asthma) (Cheung et al., 2020). Perhaps the greatest strengths of ICTs are their incentives for patients to enroll and complete the study: each participant can receive the novel treatment and learn their own response to it (termed "impact" rather than "efficacy").

A mixed effects modeling framework provides a flexible, yet statistically powerful analytic approach for ICT data (Ferron et al., 2009; Ferron et al., 2010; Ridenour et al., 2013; Ridenour et al., 2017). Similar models are hierarchical linear models (HLM), multilevel models (MLM), mixed model trajectory analysis (MMTA), and latent growth curve models (LGM), which yield identical or equivalent special cases (i.e., estimates under one framework are identical or are simple transformations of estimates under another framework). Within ICTs, we refer to all of them as intensive hierarchical models because of three fundamental distinctions between ICT models versus others: (1) time series data are often intensively observed for relatively short durations of time but could span months or longer (Wittenborn et al., 2019), (2) data are usually collected from a sample of relatively few individuals (typically N < 100), and (3) results focus primarily on (albeit are not limited to) effects of intraindividual processes and variations (Walls & Schafer, 2006). The primary adjustments from traditional multilevel models that are needed for ICT intensive hierarchical models are to account for smaller samples and temporal dependencies due to intensively repeated measures (i.e., autocorrelation).

Power analysis for ICTs

When planning a clinical trial, it is critical to estimate a study design's adequacy for testing a treatment's outcomes, including through power analysis. Table 1 presents existing software with potential for conducting ICT statistical power analyses, along with their limitations. One common limitation of these packages is the assumption that samples are drawn from infinite populations, an assumption that rare disease populations violate. Our study uses the PersonAlyticsPower software (Tueller et al., 2020) to explicitly parameterize finite populations for ICT simulations. A second limitation is that many software packages allow fewer observations than are recommended for ICTs, or do not readily offer the statistical adjustments needed for small numbers of participants (e.g., GLIMMPSE or PASS). Third, many packages do not accommodate the N = 1 situation (e.g., Optimal Design or powerlmm). Although separate time series power analysis software can be used, having both options in a single package offers greater efficiency and comparability across the N = 1 and small-sample conditions. Fourth, several packages provide power analyses only for normally distributed outcomes. Fifth, few packages offer optional covariance structures (e.g., although Mplus can be manually programmed to parse out covariance structures, it does not include ready-made commands to do so; Muthén & Muthén, 1998–2019); selecting an incorrect covariance structure can bias many ICT estimates (Baek & Ferron, 2013; Petit-Bois et al., 2016).

Table 1.

Power analysis software options for idiographic clinical trials

Software package (last updated¹); citation: approach to estimating power | model fitting function | parallelization | within-subject design | population assumption | max # time points | single-subject options | distributional options | residual correlation structure options | small-sample adjustments

GLIMMPSE (2017); Kreidler et al. (2013): simulation | proprietary | no | yes² | infinite | 10 | none | normal | none | none

Mplus (2019)³; Muthén and Muthén (2019): simulation | proprietary | yes | yes³ | infinite⁴ | no limit | yes | normal, binomial, multinomial, Poisson, negative binomial, survival | yes⁵ | none

n1-simulator (2019); Percha et al. (2019): simulation | R script run as an R Shiny app | no | yes | infinite | no limit | single subject only | normal | yes (via nlme⁷) | none

Optimal Design (2011); Spybrook et al. (2011): mathematical solution | proprietary | no | no | infinite | no limit | none | normal | none | none

PASS (2020); PASS 14 Power Analysis and Sample Size Software (2020): simulation | proprietary | no | yes² | infinite | 50⁶ | none | normal | none | none

PersonAlyticsPower (2020); Tueller et al. (2020): simulation | nlme or gamlss R packages⁷ | yes | yes | infinite or finite⁷ | no limit | yes | any of the 86 distributions currently implemented in gamlss | yes⁷ | ML for model selection, REML for estimates, finite population correction (FPC)

powerlmm (2019); Magnusson (2018): simulation | lme4 R package | yes | yes | infinite | no limit | none⁸ | normal, binomial, Poisson, gamma, lognormal, hurdle, two-part | none⁸ | Satterthwaite's DF

simr (2019); Green and MacLeod (2016): simulation | lme4 R package | no | yes | infinite | no limit | none⁸ | normal, binomial, Poisson | none⁸ | Satterthwaite's or Kenward-Roger DF

1. Last updated as of March 2021.

2. Within-subject designs are not an explicit option; they can be achieved via manual inputs of time-specific means. Users are required to check their calculations to ensure that the means lead to the desired effect size.

3. Mplus is just one of many structural equation modeling software packages that offer simulation-based power analyses. Although complex to implement, any of these could be co-opted to obtain power estimates for ICT designs.

4. Mplus supports finite sample corrections for complex survey data, and some ICT designs could be analyzed under that framework. However, Mplus does not simulate data from finite samples unless the user first simulates the finite population and then resamples from this population external to Mplus (e.g., using R) and then analyzes the resampled data in Mplus.

5. In structural equation modeling software, many autocorrelation structures can be modelled by explicitly specifying the structure in the residual correlation matrix.

6. PASS offers mixed effects and repeated measures ANOVA options with 2 to 10 observations and then 10 to 50 observations in increments of 10.

7. Both the nlme (Pinheiro & Bates, 2011) and gamlss (Stasinopoulos et al., 2017) approaches to fitting mixed effects models are supported. Data can be simulated by resampling from an infinite population or by simulating a finite population and resampling from it. All autocorrelation structures implemented in the nlme R package are available; see documentation for corClasses.

8. Simulation software based on fitting models with the lme4 (Bates, 2010) R package does not have autocorrelation options except through outdated third-party software; see https://bbolker.github.io/mixedmodels-misc/notes/corr_braindump.html. The lme4 package also does not support single-subject models.

In sum, evidence and tools are needed specifically for conducting a priori power analyses that support ICT study planning. Shaffer et al.’s (2018) and Smith’s (2012) reviews showed that N-of-1 psychology clinical trials largely lack the methodological rigor and accuracy needed to interpret their generalizability. Underpowered ICT studies by definition have inflated type II error rates (failing to detect a true effect), which could have devastating results such as rejecting a treatment for a rare disease that does in fact have efficacy for treating the disease (Singh & Loke, 2012). Meta-analysis can aggregate multiple underpowered studies, but also requires more funding (to conduct multiple studies), has to interpolate across study designs that are not coordinated, and may burden a greater number of participants than when a single study is adequately powered (Button et al., 2013).

Recent simulation studies improve on an oft-used guideline for univariate time series analysis, namely to use a minimum of 50 observations (or time points) to obtain stable parameter estimates (Box et al., 1994; Imhoff et al., 1997; Meidinger, 1980). This guideline is oversimplified, as confidence intervals and statistical power vary with many factors, including autocorrelation, number of participants, number of observations, and treatment effect size (Blackston et al., 2019; Daza, 2018; Percha et al., 2019). Ferron et al. (2009) demonstrated that for multiple baseline designs, confidence intervals narrow and bias is reduced when the number of participants increases from four to eight. Blackston et al. (2019) reported that ICT trials offer superior power to parallel and crossover RCTs, but also more frequently yield incorrect conclusions when a sample is not representative of its population. Importantly, the primary difference between their ICT, parallel, and crossover designs was the number of observations per participant (i.e., a crossover design is essentially a type of ICT with few observations).

Percha et al.’s (2019) simulation study for N = 1 studies with many phases, based on hypertension and pain management case studies, demonstrated that number of study phases, ordering of treatments, treatment duration, and observation frequency impacted statistical power to detect treatment effects. Whereas their complex designs are ideal for experimental studies including multiple treatments and withdrawal of treatment, they can rarely be implemented in clinical practice, and do not reflect the way treatment is delivered to individual clients or patients. For example, individuals typically either receive treatment after a period of untreated illness, or those who do not respond to one treatment may receive a second treatment (rather than random assignment of phases of alternative treatments).

Indeed, one strength of two-phase ICT designs (e.g., multiple baseline designs) is their resemblance to clinical delivery of treatment, which is one reason they are so widely used (Kazdin, 2011). The unique ecological and external validities of within-subject designs underscore the need for power analysis specifically to support ICT studies. Moreover, none of the aforementioned simulation studies were explicitly designed for rare disease clinical trials, although segments of their results could be applied to rare disease treatment studies. Finally, power analyses to identify the unique and combined contributions of the range of study design factors, holding other design factors constant, are not available at this time.

The present study

We conducted power analyses to detect average treatment effects in two-phase ICTs, accounting for the complex interplay among ICT study design factors, using Monte Carlo simulations. A multiple baseline design was assumed, in which all participants first complete a baseline (or control) phase consisting of repeated measures, followed by a treatment phase that also consists of repeated measures (Kazdin, 2011). We also assumed no missing data and the same number of baseline observations for all participants to facilitate power analyses (e.g., akin to the average number of baseline observations within a study that randomly varies participants' number of baseline observations). Data were analyzed twice to compare the relative power of intensive hierarchical modeling to piecewise regression. Two forms of treatment outcomes were investigated: an immediate treatment impact (a phase jump), in which the intercepts of the phases differ (Fig. 1A), and a gradual linear change (e.g., baseline slope = 0 with treatment slope > 0; Fig. 1B).

Fig. 1 Hypothetical illustrations of two trajectory forms for treatment impact on one participant: phase jump (A) and gradual linear change (B)

Design factors that were tested included number of observations, proportion of observations that occur during the baseline phase, number of participants, and the proportion of intrapersonal variance in the outcome that is due to residual error. Three hypothetical populations were simulated to reflect the range of finite population sizes of rare disease populations: 200, 1750, and 3500. Greater statistical power was hypothesized to be associated with larger treatment effects, more study participants, more observations per participant, and less residual error (i.e., greater measurement precision). Intensive hierarchical models were hypothesized to provide greater power than piecewise regression.
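The core simulation logic of such a power analysis, estimating power as the proportion of simulated datasets whose treatment test rejects at α = .05, can be sketched in miniature. The sketch below (Python rather than the authors' R tooling) is a deliberately simplified stand-in: it tests per-participant phase-mean differences with a one-sample t test instead of fitting mixed models, and all parameter values are illustrative, not those of the study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def estimate_power(n=20, n_base=10, n_trt=20, d=0.2, tau=0.4,
                   sd_res=0.7, reps=500, alpha=0.05):
    """Toy Monte Carlo power estimate for a phase jump of size d.

    Each participant's true phase jump is d plus between-person
    heterogeneity (sd tau); the hypothesis test is a one-sample t test
    of per-participant phase-mean differences, a simplified stand-in
    for the study's mixed models."""
    hits = 0
    for _ in range(reps):
        effects = rng.normal(d, tau, n)                       # subject-level phase jumps
        base = rng.normal(0.0, sd_res, (n, n_base)).mean(axis=1)
        trt = effects + rng.normal(0.0, sd_res, (n, n_trt)).mean(axis=1)
        p = stats.ttest_1samp(trt - base, 0.0).pvalue
        hits += p <= alpha
    return hits / reps

# Power grows with the number of participants, holding all else constant
print(estimate_power(n=20))
print(estimate_power(n=50))
```

As in the full study, each design factor (here only n is varied) changes the rejection rate and hence the estimated power.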

Methods

We first introduce the two analytic models used: intensive hierarchical model and piecewise regression. Next, the design factors that were simulated and tested are described, followed by the methods used to simulate data. Lastly, strategies for the analysis of the simulated data are summarized, including using the PersonAlyticsPower R package, applying techniques to analyze simulated datasets, analyzing p-values, and comparing among research design factors.

Intensive hierarchical model

The intensive hierarchical model (Ferron et al., 2010; Ridenour et al., 2016; Ridenour et al., 2017) is a longitudinal mixed effects model specified as:

y_ij = π_0i + π_1i·Time_j + π_2(Time_j × Phase_j) + π_3·Phase_j + ε_ij    (1)

where y_ij is the outcome for patient i at time j. Here, Phase_j = 0 during the baseline phase and Phase_j = 1 during the treatment phase. For simplicity, we assume that during the baseline phase the mean intercept = 0 and the mean time slope = 0, which is why these terms do not appear in Eq. (1). This model can be used to analyze the data from Fig. 1, which are displayed in Table 2 along with the design matrix associated with the Time and Phase variables (i.e., phase, time, and centered time). The hierarchical structure has observations at each time (level 1) clustered within individuals (level 2). Predictor variables include time, phase (i.e., baseline versus treatment), and the time-by-phase interaction. The model includes random intercepts π_0i and random slopes π_1i, which model interpersonal variance, and a residual term ε_ij, which is assumed to be normal, ε_ij ~ N(0, σ²), where the residual variance σ² is time-invariant. Intercepts and slopes are also assumed to be multivariate normal and may be correlated, while the random effects and the error term are assumed to be uncorrelated. Although not used in the simulation study, this model can be extended to include additional covariates. Unlike the simpler case of Eq. (1), note that the PersonAlytics software allows for nonzero baseline-phase fixed effects for the intercept and time slope.

Table 2.

Design matrix for data in Fig. 1

Phase label | Phase | Time (Time_j) | Time_j × Phase | Centered time | Piecewise baseline time (Time1_j) | Piecewise treatment time (Time2_j) | Fig. 1 Panel A outcome data | Fig. 1 Panel B outcome data
Baseline 0 0 0 −9 0 0 −0.79 −0.89
Baseline 0 1 0 −8 1 0 0.02 −0.23
Baseline 0 2 0 −7 2 0 −1.00 0.33
Baseline 0 3 0 −6 3 0 1.43 −1.08
Baseline 0 4 0 −5 4 0 0.17 0.26
Baseline 0 5 0 −4 5 0 −0.98 0.10
Baseline 0 6 0 −3 6 0 0.32 0.15
Baseline 0 7 0 −2 7 0 0.57 1.18
Baseline 0 8 0 −1 8 0 0.41 −1.15
Baseline 0 9 0 0 9 0 −0.47 1.33
Treatment 1 10 10 1 10 0 1.10 0.47
Treatment 1 11 11 2 10 1 2.18 −0.04
Treatment 1 12 12 3 10 2 3.59 1.64
Treatment 1 13 13 4 10 3 0.87 1.60
Treatment 1 14 14 5 10 4 1.92 2.89
Treatment 1 15 15 6 10 5 2.13 2.19
Treatment 1 16 16 7 10 6 2.71 0.47
Treatment 1 17 17 8 10 7 1.76 1.79
Treatment 1 18 18 9 10 8 3.98 4.15
Treatment 1 19 19 10 10 9 1.86 4.28

In this study, the standard longitudinal mixed effects model used the centered time variable. The piecewise model instead uses two time variables. The baseline time variable increments at each time point during the baseline phase and then holds a constant value once treatment starts, so its coefficient estimates the rate of change during baseline. Analogously, the treatment time variable is constant at zero throughout the baseline phase and then increments at each time point within the treatment phase, so its coefficient estimates the rate of change during treatment. Together, these two time variables allow analysts to estimate piecewise-linear effects like those in the "linear slope change" conditions illustrated in Fig. 1B by breaking the trajectory into two parts. In general, piecewise models can break trajectories into as many segments as needed to approximate the observed trajectory shape, as long as there are at least three time points per segment.
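The coding of Time, Phase, centered time, and the two piecewise time variables shown in Table 2 can be generated programmatically. A minimal sketch (in Python rather than the authors' R tooling) for 10 baseline and 10 treatment observations:

```python
n_base, n_obs = 10, 20  # 10 baseline + 10 treatment observations, as in Table 2

rows = []
for t in range(n_obs):
    phase = int(t >= n_base)           # 0 = baseline, 1 = treatment
    centered = t - (n_base - 1)        # centered so the last baseline point is 0
    time1 = min(t, n_base)             # piecewise baseline time: grows, then constant
    time2 = max(0, t - n_base)         # piecewise treatment time: 0, then grows
    rows.append((phase, t, t * phase, centered, time1, time2))

# First treatment row matches Table 2: Phase=1, Time=10, Time×Phase=10,
# centered time=1, Time1=10, Time2=0
print(rows[n_base])
```

Each tuple reproduces one row of the Table 2 design matrix (excluding the outcome columns).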

Equation (1) assumes that the overall influence of time on the outcome is linear. The time-by-phase interaction in Eq. (1) allows the outcome to change over time differently across phases (as fixed effects), resembling the trend illustrated in Fig. 1B. ICT outcomes may change as a phase jump (i.e., a difference between phase intercepts, as in Fig. 1A; π_3), as a change in linear slope over time (i.e., a difference between the slopes of the baseline and treatment phases, modeled as the time-by-phase interaction; π_2), or both (in which case π_3 and π_2 are both nonzero and clinically meaningful). Because we investigated many combinations of design factors, which yields highly complex results, our simulations considered only a phase jump or only a time-by-phase interaction (not both) to simplify interpretation of the results.
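A hypothetical sketch of simulating phase-jump-only data under Eq. (1), with π_2 = 0 and random slopes omitted for simplicity; the parameter values are illustrative, not those used in the study's simulations:

```python
import numpy as np

rng = np.random.default_rng(1)

n, obs, n_base = 200, 20, 10                 # participants, observations, baseline length
pi3 = 0.5                                    # phase jump (illustrative value)
sd_int, sd_res = 0.3, 0.5                    # random-intercept and residual SDs (illustrative)

time = np.arange(obs) - (n_base - 1)         # centered time: last baseline point is 0
phase = (np.arange(obs) >= n_base).astype(float)

b0 = rng.normal(0.0, sd_int, size=(n, 1))    # random intercepts (baseline mean intercept = 0)
eps = rng.normal(0.0, sd_res, size=(n, obs)) # normal, time-invariant residuals

# Eq. (1) with pi_1i = pi_2 = 0 (phase jump only): y_ij = b0_i + pi3*Phase_j + e_ij
y = b0 + pi3 * phase + eps

# The difference between phase means recovers the phase jump
jump_hat = y[:, phase == 1].mean() - y[:, phase == 0].mean()
print(round(jump_hat, 2))  # close to pi3 = 0.5
```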

Piecewise regression model

An alternative approach for modeling ICT data is the piecewise regression model (Neter et al., 1996), in which separate linear trends are fit within each phase. The piecewise model employs a separate time variable for each phase. However, this incurs the computational cost of estimating separate random slopes per phase (i.e., π_1i and π_2i in Eq. 2) and consumes additional degrees of freedom. The piecewise regression model is specified as:

y_ij = π_0i + π_1i·Time1_j + π_2i·Time2_j + π_3·Phase_j + ε_ij    (2)

The corresponding terms of Eq. (2) are generally the same as in Eq. (1). The primary differences are that Time1 can vary in the baseline phase but stays constant during the treatment phase (see “Piecewise baseline time” in Table 2), and that π2i is a random effect for Time2 (see “Piecewise treatment time” in Table 2), which can vary in the treatment phase but stays constant in the baseline phase.1

For ICT designs with two phases, the time-by-phase interaction term used in intensive hierarchical models accomplishes the same purpose as the phase-specific random slopes in the piecewise model (i.e., the two models are essentially two parameterizations of the same model; see Table 2 and Appendix A). The important distinction between the two is that Eq. (1) uses the fixed-effect interaction term π_2(Time_j × Phase_j), whereas Eq. (2) (the piecewise model) uses the mean of the random effect π_2i on Time2_j. The models in Eq. (1) and Eq. (2) were separately fit to all the simulation data. Estimating the extra random effect in Eq. (2) was hypothesized to incur a greater cost to statistical power than the additional fixed effect in Eq. (1).

Comparing the intensive hierarchical model and the piecewise regression model

The hierarchical regression model is often presented first in books on longitudinal modeling, with extensions like the piecewise model appearing in later chapters. The piecewise model was included in this study because it may be the more appropriate model when longitudinal trajectories are discontinuous, such as the one shown in Fig. 1B. Although the phase-by-time interaction in Eq. (1) is a sensible way to estimate the inflection in the slope illustrated in Fig. 1B, it can be more intuitive to reparameterize the interaction as phase-specific slope estimates. This reparameterization can be understood as follows. In the intensive hierarchical model, the time-by-phase interaction π_2 estimates the additional rate of change in the intervention phase not accounted for by the overall rate of change π_1. In contrast, the piecewise model estimates the absolute rate of change in both the baseline phase (π_1i) and the intervention phase (π_2i). For a more detailed technical comparison of the two models, see Appendix A as well as chapter 6.1 of Singer et al. (2003).
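The fixed-effects equivalence of the two parameterizations can be verified numerically: fitting both design matrices to the same series by ordinary least squares yields identical fitted values, and the piecewise treatment slope equals the overall slope plus the interaction from Eq. (1). The series and coefficient values below are simulated for illustration, not taken from the Fig. 1 data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_base, n_obs = 10, 20
t = np.arange(n_obs, dtype=float)
phase = (t >= n_base).astype(float)

# Simulated single-subject series: baseline slope 0.1, slope change 0.3, jump 0.5
y = 0.1 * t + 0.3 * (t - n_base) * phase + 0.5 * phase + rng.normal(0, 0.2, n_obs)

# Eq. (1) parameterization: intercept, Time, Time*Phase, Phase
X1 = np.column_stack([np.ones(n_obs), t, t * phase, phase])
# Eq. (2) parameterization: intercept, Time1 (baseline), Time2 (treatment), Phase
time1 = np.minimum(t, n_base)
time2 = np.maximum(0.0, t - n_base)
X2 = np.column_stack([np.ones(n_obs), time1, time2, phase])

b1_, *_ = np.linalg.lstsq(X1, y, rcond=None)
b2_, *_ = np.linalg.lstsq(X2, y, rcond=None)

# Same fitted values; treatment-phase slope agrees: pi_1 + pi_2 (Eq. 1) == pi_2 (Eq. 2)
print(np.allclose(X1 @ b1_, X2 @ b2_))      # True
print(np.isclose(b1_[1] + b1_[2], b2_[2]))  # True
```

The two design matrices span the same column space, so only the interpretation of the coefficients differs, as described above.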

As will be shown in the results, some study conditions exhibited greater statistical power with the piecewise model, and some exhibited more power with the intensive hierarchical model, which can be used to partially inform users which model to use when their data are similar to either Fig. 1A or B. For more general trajectory shapes, PersonAlyticsPower can be used to explore statistical power for multiple types of models such as those briefly summarized in the next section.

ICT study design factors

ICT design factors that we tested include those that researchers typically do and do not have control over, all of which must be accounted for to accurately estimate ICT power (Table 3). We also characterized combinations of the simulation study design factors (not to be confused with the time-by-phase interaction of the mixed effects models). Herein, the term "condition" refers to a unique combination of the design factors. To illustrate, one condition could consist of the first level of every study design factor listed in Table 3; changing the level of only one study factor would then correspond to a new, different condition. In total, 6480 unique conditions of the design factors were simulated, each with 1000 simulated datasets.

Table 3.

Simulation study design factors

Study design factor | Factor levels (# of levels) | Under researcher control? | Notes

Population size | N = 3500; 1750; 200 (3 levels) | No | In each condition, one population was simulated, and samples of size n were drawn with replacement from this population.

Sample size | n = 20, 30, 50, 75, 100 (5 levels) | Mostly, within budgetary and practical constraints | In practice, feasible sample size may be restricted by population size (e.g., n = 100 requires recruiting half of a population with N = 200).

Number of observations per participant | OBS = 30, 50, 75, 100 (4 levels) | Yes, within budgetary, practical, and participant burden constraints |

Proportion of observations in the baseline phase | 5%, 10%, 20% (3 levels) | Yes | A minimum of five baseline observations was required; otherwise the number was the proportion multiplied by the number of observations, rounded up. For example, .10 × 50 = 5 observations in the baseline phase.

Proportion of total variance that is residual variance | 10%, 25%, 50% (3 levels) | Partially; a more precise measure will reduce residual variance | For simplicity, data were standardized so that the total variance was 1, and the proportion of total variance due to the error term was set to one of these proportions.

Effect size (Cohen's d) | .2, .5, .8 (3 levels) | Partially; interventions with larger effects might be selected, though effect size will not be known a priori for new interventions |

Outcomes form | Phase jump only (Fig. 1A); linear slope change in the intervention phase (Fig. 1B) (2 levels) | No | The standardized mean difference between the average value of the outcome in the control and treatment phases is the effect size.

Analysis model | Hierarchical longitudinal model; piecewise regression model (2 levels) | Yes | See Eq. 1 (hierarchical longitudinal model) and Eq. 2 (piecewise regression model). Monte Carlo data were simulated using the piecewise regression model.

Total # of conditions | 3 × 5 × 4 × 3 × 3 × 3 × 2 × 2 = 6480 combinations of factors | -- | A finite population of size N was simulated for each study condition.

Total # of analyses | 6480 × 1000 = 6,480,000 | -- | Samples of size n were drawn with replacement 1000 times per unique study condition. Estimated power was the proportion of the 1000 datasets for which p ≤ .05.

A simulation study factor is an attribute of ICT data, study design, or analysis that is systematically varied to examine its effect on statistical power. A simulation study condition is a unique combination of each of the simulation study factors.

Forms of treatment impact

The first design factor we tested, one over which researchers rarely have control, was the form of treatment impact; we compared two alternatives (Fig. 1). Figure 1A illustrates a change in an outcome that occurs immediately after treatment begins (a phase jump), with no further change over time. Figure 1B illustrates a change that emerges gradually after treatment begins, in our case a linear increase with no immediate change (a time-by-phase interaction). Figure 1A and B illustrate increasing outcomes; the results of our simulation study apply equally to decreasing outcomes. Treatment impacts may take many other forms; these two were selected to show a clear contrast between them and to simplify interpretation in light of the other design factors.

Target population size

Another design factor over which researchers have no control is population size. Three small population sizes (200, 1750, 3500) were investigated to reflect ranges commonly found for rare disease populations that could be studied using ICTs; populations larger than 3500 can likely be studied using traditional nomothetic RCT methods. Because a unique focus of this paper was drawing from small finite populations, we applied a finite population correction (FPC) to the standard errors and p-values and estimated power with and without the FPC (Lai et al., 2018). Results differed little between standard and FPC p-values, so results are presented only for models without the FPC. This was not surprising because the FPC will change hypothesis test results only in conditions where most p-values are in the rejection region (i.e., p ≤ α). Although outside the scope of the current study, an important future contribution would be to examine how standard error estimates change under the FPC and to fully characterize the consequent effects on inferences.
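For reference, a common textbook form of the FPC rescales a standard error by sqrt((N − n)/(N − 1)); whether this is the exact form used by Lai et al. (2018) is an assumption here, and the function below is a sketch for illustration only:

```python
import math

def fpc_standard_error(se: float, n: int, N: int) -> float:
    """Apply a finite population correction to a standard error.

    Assumed common form of the FPC: SE_fpc = SE * sqrt((N - n) / (N - 1)).
    """
    return se * math.sqrt((N - n) / (N - 1))

# With n = 100 drawn from N = 200, the correction shrinks the SE noticeably;
# with n = 100 drawn from N = 3500, it is nearly negligible.
fpc_standard_error(0.10, 100, 200)    # ≈ 0.071
fpc_standard_error(0.10, 100, 3500)   # ≈ 0.099
```

Smaller corrected standard errors yield smaller p-values, which is why the FPC matters most when p-values sit near the rejection threshold.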

Residual intrapersonal variance

Simulated datasets also varied systematically in the proportion of intrapersonal variance due to residual (error) variance. The amount of residual variance in an outcome may reflect the type of outcome being studied (e.g., biomedical vs. behavioral), the reliability of a measure (and corresponding measurement error), and the quality of measurement instruments (e.g., academic grades vs. a standardized IQ test). To simplify interpretation of the influence of residual variance on power, it was assumed to be constant throughout an ICT.

Treatment outcomes effect size

Depending on how similar a planned study is to prior interventions, samples, and the condition targeted by the intervention, existing evidence may be used to anticipate a novel intervention's likely impact as measured by its effect size. Effect size was systematically varied herein to investigate the degree to which a researcher can maximize power to detect an expected treatment impact (e.g., strategies to detect a small effect size). We expected that multiple design factors within a researcher's control could be adjusted to detect even small effect sizes. Moreover, as mentioned earlier, the form of treatment impact on the outcome (phase jump or time-by-phase interaction) may influence power and is important to investigate in light of effect size. Effect sizes used to simulate data were based on Cohen's small, medium, and large effects (Cohen, 1988).

Design factors that are under researchers’ control

Next, we investigated changes in statistical power associated with levels of ICT design factors that researchers can use to maximize the statistical power of their design. One example is the choice of analytic model (described earlier). Other design factors that were systematically varied were sample size, number of observations per participant, and proportion of study observations that are assigned to the baseline phase. Because researchers of rare diseases are often forced to use limited sample sizes, the number of observations per participant and proportion of observations assigned to the baseline phase may be critical design factors to consider for achieving the needed level of statistical power.

Data simulation

All data simulation design factors are summarized in Table 3, and data were simulated for the fully factorial combination of these factors. Outcomes were simulated using random intercepts, slopes, and measurement error that produced individual participant trajectories varying around a straight line (i.e., no curvature) within each group and each phase (Efron & Tibshirani, 1994; see also Fig. 1). The PersonAlyticsPower package implements a data simulation strategy that generalizes to an arbitrary number of phases (and, for future applications, an arbitrary number of groups, by adding a subscript g). The data simulation model

yijp = β0ip + β1ip Timejp + eijp      (3)

was repeated within each phase, where p = 0 is the baseline phase and p = 1 is the intervention phase. Equation (3) is a phase-specific simplification of Eqs. (1) and (2); only when there are multiple phases do we need effects involving phase (e.g., π2, π3, π1i, and π2i from Eqs. 1 and 2).

In phase jump conditions (Fig. 1A and row 7 of Table 3), the effect size was controlled by fixing the mean of β0i at 0 in the baseline phase (p = 0) and varying the mean of β0i in the intervention phase (p = 1) to obtain the desired effect size (see row 6 in Table 3); the mean of β1ip was fixed at 0 in both phases. In conditions with linearly increasing change in the intervention phase (Fig. 1B and row 7 of Table 3), the effect size was controlled by fixing the mean of β1i at 0 in the baseline phase and varying the mean of β1i in the intervention phase to obtain the desired effect size; the mean of β0ip was fixed at 0 in both phases.

The random effects β0ip and β1ip were distributed as bivariate normal with moderate correlations of 0.3, which is about the average value observed in the authors’ past analyses of real data (e.g., Ridenour et al., 2013, 2016). The errors eijp followed a normal distribution and were independent of the random effects (which reflects no autocorrelation among residuals after controlling for random effects).
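The generative scheme of Eq. (3) can be sketched as follows. This is an illustrative simplification, not the PersonAlyticsPower implementation: how that package calibrates the variance components against the target effect size is not reproduced here, and the even split of random-effect variance between intercepts and slopes is an assumption for this sketch.

```python
import numpy as np

def simulate_participant(n_obs=50, baseline_prop=0.20, d=0.5,
                         resid_prop=0.25, phase_jump=True, rng=None):
    """Simulate one participant's two-phase series per Eq. (3).

    Random intercepts and slopes are bivariate normal with correlation .3;
    the effect d is induced either as a phase jump (shift in the mean
    intercept) or as a nonzero mean slope in the treatment phase.
    """
    rng = rng if rng is not None else np.random.default_rng()
    n_base = max(5, int(np.ceil(baseline_prop * n_obs)))  # row 4 rule, Table 3
    re_var = (1 - resid_prop) / 2   # assumed split of random-effect variance
    cov = 0.3 * re_var              # intercept-slope correlation of .3
    phases = []
    for p, n_p in ((0, n_base), (1, n_obs - n_base)):
        mean_b0 = d if (phase_jump and p == 1) else 0.0
        mean_b1 = d if (not phase_jump and p == 1) else 0.0
        b0, b1 = rng.multivariate_normal(
            [mean_b0, mean_b1], [[re_var, cov], [cov, re_var]])
        time = np.arange(n_p)
        e = rng.normal(0.0, np.sqrt(resid_prop), n_p)
        phases.append(b0 + b1 * time + e)  # Eq. (3), repeated per phase
    return np.concatenate(phases)

y = simulate_participant(rng=np.random.default_rng(42))
```

Repeating this for n participants, with a fixed random seed, reproduces the structure of one simulated dataset per condition.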

Data were simulated following both trajectory shapes shown in Fig. 1 (phase jump from Fig. 1A, linear increase from Fig. 1B; see row 7 in Table 3), and the models from both Eqs. 1 and 2 were fit to both trajectory shapes (see row 8 in Table 3). For both forms of treatment impact (i.e., phase jumps and gradual changes), the mean of the random intercepts π was set to 0 in both phases.

Other design factors shown in Table 3 include the finite population size from which samples of individuals were resampled (row 1), the size of the sample drawn from the population (row 2), the number of repeated observations per participant (row 3), the proportion of repeated observations in the baseline phase (row 4), and the proportion of the total variance that was residual variance (row 5). Outcomes were simulated such that the total variance (i.e., of random effects and residual errors) was 1, so the residual variances equaled the proportions shown in row 5 of Table 3. Data were simulated using an alpha version of the PersonAlyticsPower R package (Tueller et al., 2020), which is designed to let users conduct a power analysis for any combination of design factors. As an extension of the PersonAlytics R package (described next), PersonAlyticsPower was built on the nlme R package for normally distributed outcomes (as used herein) and the gamlss package for non-normal distributions. Because the computer memory required to store the simulated data was prohibitively large, each dataset was deleted after being analyzed using the approach described in the next section; a random seed was specified to ensure the simulated data could be replicated. For each condition, 1000 datasets were simulated.

Analysis of simulated data

To estimate and characterize power, the primary outcomes for the current study were the p-values associated with the estimates of π2 (for the intensive hierarchical model) or the mean of the π2i (for the piecewise model), and with π3 in Eqs. (1) and (2). Specifically, true statistical power was defined as the proportion of all possible p-values less than α for the estimates of π2 (Eq. 1), the mean of π2i (Eq. 2), and π3 (Eqs. 1 and 2). To calculate p-values for these parameter estimates, we used the PersonAlytics R package (Tueller, Ramirez, & Ridenour, 2019; R Core Team, 2019). PersonAlytics automates the model selection process for longitudinal mixed effects models, including, but not limited to, the intensive hierarchical models studied here. Underlying this automation is the nlme R package (Pinheiro et al., 2019) for fitting mixed effects models and obtaining parameter estimates, their standard errors, test statistics, and p-values. The PersonAlytics high-throughput feature was used to analyze the simulation data, with each simulated dataset handled as a distinct output within each simulation study condition, for a total of 1000 distinct outputs per condition.

Importantly, time was centered in all analyses at the point of transition between the ICT phases (i.e., from baseline to treatment phase; see Table 2 for an example). Centering time between phases simplified interpretation of statistical output by better separating the treatment and baseline effects. To illustrate this, the x-axis of Fig. 1 corresponds to centered time.

Analysis of p-values

In the present study, for each of the 6480 conditions, estimated statistical power was defined as the (empirical) proportion of the 1000 simulated datasets with p ≤ α, where α = .05 (one-sided). For example, suppose two conditions are identical except for the number of observations (OBS) per participant (akin to "waves" of data collection in a study), with OBS = 30 in one condition and OBS = 50 in the other. If power was .4 in the first condition (i.e., 400 of the 1000 simulated datasets reached p ≤ .05) and .8 in the second (i.e., 800 of 1000 reached p ≤ .05), we would conclude that, all else being equal, increasing study observations from 30 to 50 increases statistical power from .4 to .8. Although this illustration is intuitive, making such comparisons among 6480 conditions is challenging.
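The empirical power definition above, together with its Monte Carlo uncertainty, amounts to a binomial proportion. A minimal sketch; the function names are illustrative, not from the study's code:

```python
import numpy as np

def estimated_power(p_values, alpha=0.05):
    """Empirical power: proportion of simulated datasets with p <= alpha."""
    return float(np.mean(np.asarray(p_values) <= alpha))

def power_se(power, n_reps=1000):
    """Binomial (Monte Carlo) standard error of an estimated power."""
    return (power * (1.0 - power) / n_reps) ** 0.5

# The hypothetical example from the text: 400 of 1000 replications significant.
p_vals = np.concatenate([np.full(400, 0.01), np.full(600, 0.20)])
estimated_power(p_vals)   # 0.4
power_se(0.4)             # ~0.015, so 1000 replications give a precise estimate
```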

To interpret the results, the simulated p-values were analyzed in two ways: the first to characterize the results visually and the second quantitatively. First, we estimated statistical power in each of the 6480 conditions and displayed the results for each summarized design factor, aggregated across all levels of the design factors not presented (Figs. 2, 3, 4 and 5). Graphical representations of the trends in power clearly summarize results over multiple design factors at once.

Fig. 2.


Power by form of treatment impact, population size, number of time points, and sample size. Note: Each dot represents an average across 43 design factor combinations. Power estimates are aggregated across effect size, analytic model (intensive hierarchical regression or piecewise regression), proportion of time points assigned to the control phase, and proportion of intraindividual variability due to residual error. Power curves are unsmooth due to aggregations across other design features. Number of time points = number of observations per participant. What appears to be greater power to detect Time × Phase interactions with fewer time points is due to the exclusion of certain conditions for these analyses (illustrated in Fig. 3). As shown in Fig. 3, if number of time points in the baseline phase was < 4, the conditions were excluded from analyses to avoid anomalies in power curves due to few data points for baseline. The range of power (y-axis) was scaled differently for the two outcome forms to clearly present the distinctions among the power curves

Fig. 3.


Power by proportion of observations in the baseline phase, form of treatment impact, number of time points, and sample size. Note: Each dot represents an average across 54 design factor combinations. Power estimates are aggregated across effect size, analytic model (intensive hierarchical regression or piecewise regression), population size, and proportion of intraindividual variability due to residual error. Power curves are unsmooth due to aggregations across other design features. Number of time points = number of observations per participant. When number of time points in the baseline phase was < 4, the condition was excluded from analyses to avoid anomalies in power curves due to few data points for baseline. The range of power (y-axis) was scaled differently for the two outcome forms to clearly present the distinctions among the power curves

Fig. 4.


Power to detect a phase jump by analytic model, effect size, number of time points, and sample size. Note: Each dot represents an average across 43 design factor combinations. Power estimates are aggregated across proportion of observations in the baseline phase, population size, and proportion of intraindividual variability due to residual error. Power curves are unsmooth due to aggregations across other design features. As shown in Fig. 3, having at least 20% of time points in the baseline phase is optimal; thus, Fig. 4 presents results only for conditions in which 20% of time points occur in the baseline phase. Number of time points = number of observations per participant. The range of power (y-axis) was scaled differently for the two outcome forms to clearly present the distinctions among the power curves.

Fig. 5.


Power to detect gradual change by analytic model, effect size, number of time points, and sample size. Note: Each dot represents an average across 43 design factor combinations. Power estimates are aggregated across proportion of observations in the baseline phase, population size, and proportion of intraindividual variability due to residual error. Power curves are unsmooth due to aggregations across other design features. As shown in Fig. 3, having at least 20% of time points in the baseline phase is optimal; thus, Fig. 5 presents results only for conditions in which 20% of time points occur in the baseline phase. Number of time points = number of observations per participant. The range of power (y-axis) was scaled differently for the two outcome forms to clearly present the distinctions among the power curves

The second approach involved fitting logistic regression models to all simulated data, with binary indicators of whether p ≤ .05 for each analysis result. These analyses quantified the unique contribution to power attributable to either a one-step change in the level of a design factor or a combination of levels of two design factors, statistically controlling for the change in power attributable to the other factors. To this end, the logistic regression models included all design factors and all two-factor combinations (specific to each level of each factor) as predictors to quantify their unique and combined influences on power. The estimated odds ratio for each factor level in Table 4 and Appendix C represents the expected change in power due to altering a single aspect of a study design (i.e., for each level of each factor, relative to the reference level) after controlling for all other factors.

Table 4.

Logistic regression estimates of the unique impact of ICT design factors on power to detect phase jumps and gradual change, controlling for all other design factors

Effect Hierarchical model Piecewise model
Phase jump effect (π3) Time × phase interaction (π2) Phase jump effect (π3) Time × phase interaction (π2)
Sample size (referent N = 20)
N = 30 1.16 1.01 1.29 3.20
N = 50 1.42 0.64 1.39 27.26
N = 75 5.31 0.74 4.61 *
N =100 1.47 0.55 3.23 *
Number of observations (referent = 20)
40 Observations 1.86 1.62 2.12 2.98
60 Observations 2.07 2.63 2.84 3.33
80 Observations 2.86 4.06 4.14 5.60
100 Observations 3.48 3.42 3.35 5.54
Proportion of observations in baseline phase (referent = 5%)
10% of Observations 1.19 1.17 1.51 0.89
20% of Observations 1.60 1.88 2.19 0.66
Effect size (referent d = 0.20)
d = 0.50 5.40 1.09 4.52 *
d = 0.80 26.62 0.87 29.34 *
Proportion of intraindividual variance due to error (referent = 50%)
25% of variance is error 3.24 2.21 2.95 1.39
10% of variance is error 6.34 3.62 5.05 1.60
Population size (referent = 3500)
Population = 1750 0.63 0.83 0.93 1.13
Population = 200 0.61 0.66 0.98 0.97
Interaction terms
1. N = 30 × 40 Observations 0.98 0.93 0.93 0.83
2. N = 50 × 40 Observations 1.57 0.95 1.26 1.34
3. N = 75 × 40 Observations 0.70 0.59 0.90 3.50
4. N = 100 × 40 Observations 2.21 0.77 2.04 59.32
5. N = 30 × 60 Observations 0.93 0.98 1.05 1.30
6. N = 50 × 60 Observations 1.25 0.92 1.28 2.62
7. N = 75 × 60 Observations 0.73 0.49 0.83 40.89
8. N = 100 × 60 Observations 1.99 0.81 3.51 1.84
9. N = 30 × 80 Observations 1.19 0.95 0.95 0.83
10. N = 50 × 80 Observations 1.65 0.99 1.05 2.21
11. N = 75 × 80 Observations 0.97 0.69 0.64 0.52
12. N = 100 × 80 Observations 2.23 0.88 1.57 *
13. N = 30 × 100 Observations 1.15 1.09 1.25 1.21
14. N = 50 × 100 Observations 1.73 1.26 1.74 3.22
15. N = 75 × 100 Observations 1.11 0.89 1.03 10.25
16. N = 100 × 100 Observations 4.21 1.00 1.81 *
17. N = 30 × 10% Baseline 1.24 0.95 0.97 0.84
18. N = 50 × 10% Baseline 1.18 1.38 1.26 0.58
19. N = 75 × 10% Baseline 1.20 1.73 1.01 0.01
20. N = 100 × 10% Baseline 1.88 2.18 0.72 0.95
21. N = 30 × 20% Baseline 1.19 1.38 1.03 0.92
22. N = 50 × 20% Baseline 1.38 3.23 1.28 0.65
23. N = 75 × 20% Baseline 1.11 4.65 1.01 0.08
24. N = 100 × 20% Baseline 2.80 8.11 1.47 63.26
25. N = 30 × d = 0.50 2.86 1.14 1.74 *
26. N = 50 × d = 0.50 10.39 1.60 3.08 *
27. N = 75 × d = 0.50 53.59 2.11 5.83 *
28. N = 100 × d = 0.50 * 2.71 26.01 0.64
29. N = 30 × d = 0.80 5.76 1.52 5.23 0.36
30. N = 50 × d = 0.80 * 2.63 * 0.04
31. N = 75 × d = 0.80 * 4.53 * *
32. N = 100 × d = 0.80 * 4.54 * *
33. N = 30 × 25% Error 0.85 1.03 1.00 0.89
34. N = 50 × 25% Error 0.68 1.06 1.07 0.68
35. N = 75 × 25% Error 0.75 1.09 0.87 2.36
36. N = 100 × 25% Error 1.11 0.97 0.52 0.01
37. N = 30 × 10% Error 0.76 0.92 0.83 0.88
38. N = 50 × 10% Error 0.65 1.00 0.90 0.43
39. N = 75 × 10% Error 0.59 1.19 0.56 0.44
40. N = 100 × 10% Error 1.10 1.08 0.57 0.39
41. N = 30 × Pop. = 1750 1.10 1.01 1.07 0.98
42. N = 50 × Pop. = 1750 1.23 0.99 0.87 0.75
43. N = 75 × Pop. = 1750 0.93 1.12 0.89 1.21
44. N = 100 × Pop. = 1750 0.99 1.02 0.89 4.07
45. N = 30 × Pop. = 200 0.89 0.95 0.81 0.93
46. N = 50 × Pop. = 200 1.37 1.06 0.62 1.31
47. N = 75 × Pop. = 200 0.52 1.16 0.56 0.24
48. N = 100 × Pop. = 200 0.76 1.19 0.56 0.04
49. 40 Observations × 10% Baseline 0.98 2.00 1.10 0.73
50. 60 Observations × 10% Baseline 1.10 1.04 0.87 1.13
51. 80 Observations × 10% Baseline 1.40 0.73 0.85 1.50
52. 40 Observations × 20% Baseline * * * *
53. 60 Observations × 20% Baseline * * * *
54. 80 Observations × 20% Baseline 1.29 0.69 0.84 1.38
55. 40 Observations × d = 0.50 1.42 0.73 1.16 0.43
56. 60 Observations × d = 0.50 1.47 0.70 1.55 *
57. 80 Observations × d = 0.50 2.17 0.71 1.26 *
58. 100 Observations × d = 0.50 1.98 0.83 1.97 1.20
59. 40 Observations × d = 0.80 2.62 0.61 1.95 0.21
60. 60 Observations × d = 0.80 6.99 0.54 2.85 0.40
61. 80 Observations × d = 0.80 15.18 0.84 2.42 0.22
62. 100 Observations × d = 0.80 11.99 0.79 3.47 0.10
63. 40 Observations × 25% Error 0.80 0.77 0.52 0.74
64. 60 Observations × 25% Error 0.82 0.73 0.55 0.50
65. 80 Observations × 25% Error 0.53 0.88 0.57 0.30
66. 100 Observations × 25% Error 0.60 0.73 0.52 0.47
67. 40 Observations × 10% Error 0.76 0.79 0.61 0.79
68. 60 Observations × 10% Error 0.94 0.75 0.53 0.54
69. 80 Observations × 10% Error 0.52 0.75 0.65 0.33
70. 100 Observations × 10% Error 0.56 0.73 0.53 0.28
71. 40 Observations × Pop. = 1750 1.41 0.91 1.20 0.97
72. 60 Observations × Pop. = 1750 1.35 1.02 1.07 1.23
73. 80 Observations × Pop. = 1750 1.27 1.07 1.08 0.84
74. 100 Observations × Pop. = 1750 1.17 1.08 1.18 1.17
75. 40 Observations × Pop. = 200 1.46 1.00 1.27 1.20
76. 60 Observations × Pop. = 200 1.54 1.21 1.62 0.70
77. 80 Observations × Pop. = 200 1.46 1.33 1.30 0.71
78. 100 Observations × Pop. = 200 1.92 1.33 1.53 0.61
79. 10% Baseline × d = 0.50 1.50 1.50 1.19 1.77
80. 20% Baseline × d = 0.50 1.21 5.64 1.49 *
81. 10% Baseline × d = 0.80 3.18 2.74 1.59 4.57
82. 20% Baseline × d = 0.80 4.67 29.83 1.95 2.15
83. 10% Baseline × 25% Error 0.72 0.99 1.11 1.12
84. 20% Baseline × 25% Error 0.68 1.11 0.89 0.83
85. 10% Baseline × 10% Error 0.57 1.10 1.03 0.80
86. 20% Baseline × 10% Error 0.60 1.12 0.74 0.62
87. 10% Baseline × Pop. = 1750 1.05 1.14 0.85 0.83
88. 20% Baseline × Pop. = 1750 1.16 1.11 0.92 0.90
89. 10% Baseline × Pop. = 200 1.17 1.13 0.73 1.10

Odds ratios were calculated from logistic regression models of the binary outcome p ≤ .05 for π1 across all bootstrap replications. A separate logistic regression model was fit for each model (Eqs. 1 and 2) and population size combination in the columns. Each of these models included all main effects and two-factor combinations (as two-way interaction terms), listed in the rows. Pop. = population. Error = error variance. Results in columns 1–4 of Table 4 are also presented as forest plots in Appendix C, panels A–D, respectively.

† The treatment phase linear conditions (Tx Phase Linear) had null effects for π3 and are excluded for clarity.

* Empty cells were due to extreme collinearity (singularity) or low/no variance in the outcome for a given cell (quasi-complete separation).

We limited our exploration to two-factor combinations for the following reasons. As evidenced by our testing of eight design factors (and confirmed by our results), the statistical power of a study design is influenced by combinations of design factors. Whereas many past power analysis studies consider how power is affected by a single factor, we aimed to clarify each factor's unique impact on power while accounting for the other design factors (i.e., by fitting a logistic regression model). At the same time, the number of unique results to report and interpret from simulating all possible combinations of our eight design factors would exceed the breadth of a single manuscript. Thus, at this early point in delineating the relative influences of the eight factors, we limited this exploration to two-factor combinations. As shown in Table 4 and Appendix C, the results from evaluating all two-factor combinations tested herein are voluminous and yield meaningful and insightful conclusions without considering more complex combinations.

Direct comparisons among design factors

A challenge with quantifying statistical power, whether in isolation, for two conditions, or graphically over hundreds of conditions, is that it does not reveal the relative and unique importance of a change in each study design factor while controlling for the other design factors. We addressed this limitation in exploratory analyses by fitting logistic regression models (as mentioned earlier) with binary indicators of p ≤ .05 as the outcome (Efron & Tibshirani, 1994). Study design factors were dummy coded as predictors and entered as one-factor main effects and two-factor combinations (interaction terms). This permitted simultaneous analysis of the results for all 6,480,000 simulated datasets (1000 datasets for each of the 6480 conditions).

Our focus was to evaluate which design factors have the greatest impact on statistical power. However, with so many simulated observations, almost all (if not all) coefficient estimates would attain p < .05. Hence, we did not rely on p-values to assess each factor's influence on power; instead, we relied on the logistic regression coefficient estimates, exponentiated to obtain adjusted odds ratios.

To simplify presentation of the results, a separate logistic regression model was fit for each model type (intensive hierarchical versus piecewise) and each of the two forms of treatment impact (phase jump versus gradual change). This resulted in 16 main effects and 101 two-factor combination effects, for a total of 117 odds ratios (ORs) per logistic regression analysis.
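The logic of these analyses, fitting a logistic regression to the binary p ≤ .05 indicators and exponentiating the coefficients into adjusted ORs, can be sketched with simulated indicators. The two factors and their effects below are invented for illustration, and the Newton-Raphson fit merely stands in for whatever fitting routine was actually used:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
# Two dummy-coded design factors (hypothetical levels for illustration)
x1 = rng.integers(0, 2, n)           # e.g., 100 vs. 20 observations
x2 = rng.integers(0, 2, n)           # e.g., d = .8 vs. d = .2
# Simulate binary "p <= .05" indicators with known log-odds
logits = -1.0 + 1.2 * x1 + 2.0 * x2
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(float)

X = np.column_stack([np.ones(n), x1, x2]).astype(float)
beta = np.zeros(3)
for _ in range(25):                  # Newton-Raphson (IRLS) for the logistic MLE
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    grad = X.T @ (y - mu)
    hess = (X * (mu * (1.0 - mu))[:, None]).T @ X
    beta += np.linalg.solve(hess, grad)

odds_ratios = np.exp(beta[1:])       # adjusted ORs, one per factor level
# With these simulated effects, the ORs recover roughly exp(1.2) and exp(2.0).
```

Applied to the study's 6,480,000 indicators, each dummy-coded factor level (and each two-factor combination) contributes one coefficient, and its exponential is the adjusted OR reported in Table 4.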

Results

We primarily focus on two types of results: those comparing power of design factors across form of treatment impact (e.g., Figs. 2, 3) and across type of model for each form of impact (e.g., Figs. 4, 5). Data simulation and analyses using PersonAlyticsPower took about 3 weeks using two PCs with 64 GB RAM, 3.6 GHz, and 64-bit Windows 10 operating systems.

Power curves for combinations of design factors

Each panel of Figs. 2, 3, 4, and 5 presents power curves for one design factor, separately by sample size (see legend) and number of observations per participant (x-axis), with results averaged across all other design factors. Error bars display 95% confidence intervals of the power estimates (i.e., the average of the power estimates ± the standard error of the power estimates multiplied by the 97.5th percentile of the standard normal distribution, approximately 1.96).

The simulation data presented in Figs. 2 and 3 differ slightly from those in Figs. 4 and 5. As the results in Fig. 3 illustrate, assigning 20% (or more) of observations to the baseline phase improves power considerably over assigning only 5% or 10%, although this improvement is more pronounced for detecting gradual change than phase jumps. Additionally, in conditions with 40 or fewer observations and 5% to 10% baseline observations, only one to four observations were simulated for baseline, which led to some non-monotonic trends caused by extreme simulated values (e.g., compared to an average across more baseline observations). Thus, for comparing power across models, effect sizes, and forms of treatment impact, Figs. 4 and 5 are limited to conditions in which 20% of observations were assigned to the baseline phase.

Population size

Figure 2 shows the impact of population size on power for varying sample sizes and varying numbers of observations per participant. Power curves for detecting a phase jump demonstrate little impact of population size on power, regardless of sample size and observations per participant. Likewise, population size had little impact on statistical power to detect gradual change modeled as a time-by-phase interaction after accounting for sample size and observations per participant. Note that the range of power (y-axis) was scaled differently for the two outcome forms to clearly present the distinctions among the power curves.

Also, in the results for gradual change, what appears to be greater power associated with fewer observations per participant is due to the exclusion of certain conditions from these analyses (illustrated in Fig. 3). Specifically, when the number of observations in the baseline phase was fewer than five, the condition was excluded from analyses to avoid anomalies in power curves due to few baseline data points (e.g., extreme values simulated by chance that pulled the averages and their confidence intervals in unexpected directions). To illustrate, if only 20 total observations were sampled, 5% would equate to only one baseline observation per participant, which is far more likely to be extreme than a baseline estimate averaged over multiple observations per participant.

Overall, results demonstrate that ICTs for very small populations (i.e., for very rare diseases) can yield similar power to larger populations up to N=3500 and probably larger. Recruiting the same number of participants would generally be more difficult for a population of 200 compared to a population of 3500. On the other hand, assuming a well-designed random sampling scheme (e.g., multistage stratification) is utilized, a sample of 100 is likely to be far more representative of, and yield results that are more generalizable to, a population of 200 than a population of 3500.

Sample size

As expected, increasing sample size generally improved statistical power to detect either form of treatment impact. The impact on power that results from increasing sample size is clearly seen in the power curves in Fig. 2. Improved power from larger sample sizes was largely consistent, as this effect can be seen across all power curves in Figs. 2, 3, 4, and 5. Not surprisingly, larger samples generally had the greatest impact on power for detecting small to medium effect sizes (Figs. 4 and 5). Moreover, the combined impact on power of sample size and number of observations per participant was largely consistent across simulated conditions.

Number of observations

Power curves illustrated that power generally increases with more observations per participant (time points), although the size of this association varied across Figs. 2, 3, 4, and 5. More observations per participant improved power most for smaller samples (all figures) and for detecting small to medium effect sizes (Figs. 4 and 5).

Percent of observations in the baseline phase

Figure 3 illustrates that increasing the proportion of observations assigned to the baseline phase from 5% to 20% increased power for detecting time-by-phase interactions, but less so for detecting phase jumps. Results were not presented for the leftmost conditions of each panel in Fig. 3 because they were anomalous for reasons mentioned earlier under the Results section on population size.

Statistical model type

Figures 4 and 5 present power curves for detecting a phase jump and gradual change, respectively; each compares the intensive hierarchical model with the piecewise model. For detecting small, medium, and large phase jumps, the two statistical models provided similar statistical power (Fig. 4). For detecting time-by-phase interactions in outcomes (Fig. 5), intensive hierarchical models demonstrated generally superior power, especially when effect sizes were medium to large.

Lower power was hypothesized a priori for the piecewise model because it has an additional random effect: it estimates the interaction effect as the mean of the random effects π2i, whereas the intensive hierarchical model simply estimates a fixed effect, π2. An unexpected result, however, was that increasing the number of participants had less impact on the power of piecewise models to detect time-by-phase interactions than on that of intensive hierarchical models; we had anticipated that larger sample sizes would increase power for all estimates.

Effect size

As illustrated in Figs. 4 and 5, larger effect sizes were associated with greater power in all conditions except when a piecewise regression model was used to detect time-by-phase interactions. In that case, treatment effect size had virtually no relationship with power: the power curves were similar across effect sizes, and only increasing the number of observations per participant slightly increased statistical power to detect gradual change using a piecewise model (increasing sample size also had no consistent impact on power).

Study design factors: Unique and combined associations with power

Table 4 and Appendix C report the estimated adjusted ORs from the logistic regression models, with 95% confidence intervals visualized in forest plots. Listed first in each column are the unique contributions of each individual factor (regression main effects), followed by combinations of factor levels (regression interaction terms). Some cells are empty due to extreme collinearity or low-to-no variance in the outcome for a given cell. For example, if p was always ≤ α for a combination of predictors, there would be no variance in that cell of the two-way predictor table and the effect would not be estimable (e.g., the d = 0.80 main effect in the first panel of Appendix C, Panel A). The ORs for main-effects-only models are presented in Table 5.

Table 5.

Odds ratios for main effects only: Multiple logistic regression models of binary indicators of whether ICT design features attained p ≤ .05

Effect                                                  Hierarchical model                 Piecewise model
                                                        Phase jump    Time × phase        Phase jump    Time × phase
                                                        effect (π3)   interaction (π2)    effect (π3)   interaction (π2)
Sample size (referent N = 20)
  N = 30                                                1.51          1.32                1.43          2.60
  N = 50                                                2.96          2.01                2.20          18.32
  N = 75                                                4.02          2.80                3.33          76.43
  N = 100                                               7.32          3.49                4.51          159.88
Number of observations (referent = 20)
  40 observations                                       2.35          1.06                2.00          2.00
  60 observations                                       3.14          1.46                2.99          2.99
  80 observations                                       3.91          1.70                3.02          3.02
  100 observations                                      4.85          2.01                3.73          3.73
Proportion of observations in baseline phase (referent = 5%)
  10% of observations                                   1.53          1.50                1.85          0.69
  20% of observations                                   2.16          2.27                3.38          0.53
Effect size (referent d = 0.20)
  d = 0.50                                              20.53         3.60                3.60          *
  d = 0.80                                              *             8.39                8.39          *
Proportion of intraindividual variance due to error (referent = 50%)
  25% of variance is error                              1.31          2.43                1.30          0.84
  10% of variance is error                              1.71          16.93               1.96          0.59
Population size (referent = 3500)
  Population = 1750                                     0.97          1.03                0.95          0.96
  Population = 200                                      0.86          1.09                0.79          0.85

Odds ratios were calculated from logistic regression models of the binary outcome p ≤ .05 for each effect across all bootstrap replications. A separate logistic regression model was fit for each combination of statistical model (Eqs. 1 and 2) and population size. Pop = population. Error = error variance. † The treatment phase linear conditions (Tx Phase Linear) had null effects for π3 and are excluded for clarity.

* Empty cells were due to low/no variance in the outcome for a given cell (quasi-complete separation), indicating (near) perfect prediction when effect sizes were large.
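The odds-ratio logic underlying Tables 4 and 5, and the quasi-complete separation flagged by *, can be sketched in miniature. This illustrative Python computes an unadjusted OR from binary p ≤ .05 indicators; the study itself obtained adjusted ORs from multiple logistic regressions:

```python
import numpy as np

def odds_ratio(reject_ref, reject_alt):
    """Unadjusted odds ratio comparing rejection rates (p <= .05) between
    a referent design condition and an alternative condition.
    Illustrative of the logic only, not the study's adjusted estimates."""
    a = np.sum(reject_alt)              # alternative condition: rejections
    b = len(reject_alt) - a             # alternative condition: failures
    c = np.sum(reject_ref)              # referent condition: rejections
    d = len(reject_ref) - c             # referent condition: failures
    if a == 0 or b == 0 or c == 0 or d == 0:
        # Quasi-complete separation: the OR is not estimable, as in the
        # empty (*) cells of Table 5.
        return float("inf")
    return (a * d) / (b * c)

# Hypothetical example: power 0.50 at the referent vs 0.75 at a larger N,
# with 1000 bootstrap replications per condition
ref = np.array([1] * 500 + [0] * 500)
alt = np.array([1] * 750 + [0] * 250)
print(odds_ratio(ref, alt))  # (750*500)/(250*500) = 3.0
```

When a condition rejects on every replication (e.g., d = 0.80 under some designs), one cell of the 2 × 2 table is empty and the OR cannot be estimated, which is why those table cells are starred.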

Many trends in power for a particular factor are non-monotonic, or appear exceedingly large or small, when analyzed across varying levels of the other factors (e.g., see the first four values in the first column of Table 4 and Appendix C, Panel A regarding increasing sample sizes to detect a phase jump using hierarchical modeling). To test whether these results are artifacts of fitting a single logistic model that includes many two-factor interactions, the odds ratios were estimated a second time including only the main effect terms (i.e., dropping all two-factor combinations). These results are presented in Table 5. Compared to the corresponding results in Table 4 and Appendix C, the trends in Table 5 demonstrate that power varies almost entirely monotonically (and largely as expected) when the models include far fewer terms.

Sample size

Each increase in sample size from N = 20 through N = 75 increased power to detect phase jumps under both analytic models (Tables 4 and 5); at N = 100, however, power was more strongly influenced by other design factors. Compared to detecting phase jumps, the power to detect time-by-phase interactions appears far less robust to the impacts of other design factors. To illustrate, whereas Table 5 shows that power for detecting such interactions under a hierarchical model increases monotonically with sample size, no such relationship appears in Table 4 or Appendix C once two-factor interactions are added to the model.

Moreover, Figs. 2, 3, and 5 (intensive hierarchical model panel only) show that power increases with greater sample size. In contrast, the Fig. 5 results for the piecewise model show little if any impact of sample size on power.

Sample size additionally impacted power in combination with other design factors: more observations per participant (rows 1–16 in Table 4 and Appendix C, especially when using piecewise models to detect time-by-phase interactions, shown in Appendix C, Panel A), proportion of observations in the baseline phase (rows 17–24; second panels of Appendix C), and larger effect sizes (rows 25–32; second panels of Appendix C). However, power did not always increase monotonically because estimates were averaged across other design factors.

Number of observations

Table 4 and Appendix C show that increasing the number of observations increased power, with notably greater gains occurring at higher numbers of observations. This trend was observed for both analytic models and was monotonic for the intensive hierarchical model. For most conditions, the increase in power obtainable by adding observations was equivalent to, or larger than, the gain from increasing sample size (compare “main effects” in Tables 4 and 5 and Appendix C). Consistent with the power curves, number of observations interacted with sample size to further increase power (rows 1–16), although these gains in power were not monotonic. Number of observations provided additional increases in power in the presence of medium to large phase jumps (rows 55–62). Moreover, the trends in power seen across increasing numbers of observations when not including interaction terms in the model (Table 5) were largely replicated when modeling power as a function of all main terms and two-factor interactions (Table 4 and Appendix C).

Proportion of observations in baseline phase

Assigning more observations to a baseline phase increased power to detect a phase jump or time-by-phase interaction (Table 4 and Appendix C) but the latter increase in power was only observed when using an intensive hierarchical model (Appendix C, Panels A and B). As mentioned earlier, when interacting with increasing sample size (row numbers 17–24 in the second panels of Appendix C), more observations in the baseline phase further boosted statistical power for most conditions, but did so more consistently for the intensive hierarchical model. Finally, with larger effect sizes, having larger proportions of observations in the baseline phase further increased statistical power, but much more so for intensive hierarchical models in Appendix C, Panels A and B compared to piecewise models in Appendix C, Panels C and D (see row numbers 78–89 in the second panels of Appendix C).

Effect size

Effect size was the dominating factor for the power to detect phase jumps (Table 4 and Appendix C, Panels A and C), yielding exponentially greater statistical power regardless of the analytic model used and while controlling for all other factors. In contrast, the impact that effect size had on power to detect time-by-phase interactions seems to have been explained by interactions with other factors. As noted earlier, effect sizes interacted with sample sizes (rows 25–32), number of observations (rows 55–62), and proportion of observations assigned to baseline (rows 78–89) to increase power over and above the main effect contributions of each design factor.

Intraindividual variance due to residual error

Compared to 50% of the intraindividual variability in observations being due to residual error, lower residual error (i.e., more precise measurement) increased power to detect phase jumps or time-by-phase interactions, with the latter effect being limited to the intensive hierarchical model (Table 4 and Appendix C, Panels A and B) and not occurring for the piecewise model (Table 4 and Appendix C, Panels C and D). Moreover, lower residual variance provided slight increases in power when interacting with more baseline observations for detecting time-by-phase interactions, but only when using intensive hierarchical models (cf. rows 83–86 to rows 33–40 and 63–70).

Discussion

This Monte Carlo simulation study characterized the individual and combined impacts of multiple study design factors on statistical power for ICTs, focusing on their use for rare diseases. Statistical power was quantified for the most common forms of treatment impacts, phase jumps and gradual changes modeled as time-by-phase interactions. Impacts of study design factors on statistical power were specified in terms of their individual impacts (holding all other factors constant) and their interactions specified as two-factor combinations.

Results reported herein for improving power are expected to also largely generalize to reducing error in statistical estimates, as shown by the 95% confidence intervals for the power estimate in Figs. 2 and 3. However, our results are not fully generalizable to strategies for reducing error; as seen in the 95% confidence intervals for the power estimate of Figs. 4 and 5, variability in confidence intervals did not occur in 1:1 correspondence with statistical power.

This study represents a significant advance for a priori planning of ICTs and other within-subject clinical trials. Power analysis for clinical trials to date has largely focused on traditional RCTs or other nomothetic designs, which cannot be used to estimate power or guide a priori study design for the full range of within-subject clinical trials. To illustrate, the goal of traditional RCTs is to estimate the difference between an individual’s outcome in the arm to which (s)he is assigned (also termed a factual potential outcome) and that individual’s unobserved outcome if (s)he had instead been assigned to an alternate arm (i.e., counterfactual potential outcome). This is often done by assuming that this difference is the same for all study individuals, which allows the individual difference to be estimated using the average difference in observed outcomes between individuals in each of the two randomized arms (Frangakis & Rubin, 2002; Hernan & Hernandez-Diaz, 2012; Rubin, 2005; Ten Have et al., 2008). In contrast, ICTs estimate each participant’s average potential outcome (or potential outcome trend) under each arm by observing each participant in both study arms multiple times. This allows estimation of a participant-specific, between-phase average treatment effect akin to the “average period treatment effect” of Daza (2018), where a “period” is referred to as a phase in ICTs. Thus, the assumption of independent samples that is frequently required for analyses of RCTs cannot be made for ICTs.
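The participant-specific, between-phase average treatment effect described above reduces, in its simplest form, to contrasting a participant's own phase means. A minimal sketch in Python (illustrative only; the study estimated these effects with mixed models, and the function name is hypothetical):

```python
import numpy as np

def between_phase_effect(y, phase):
    """Participant-specific between-phase average treatment effect:
    the difference between one participant's mean outcome in the
    treatment phase and in the baseline phase, akin to Daza's (2018)
    average period treatment effect. Illustrative sketch, not the
    study's mixed-model estimator."""
    y = np.asarray(y, dtype=float)
    phase = np.asarray(phase)
    return y[phase == 1].mean() - y[phase == 0].mean()

# One participant observed repeatedly in both phases (hypothetical data)
y = [5, 6, 5, 9, 10, 11, 10]
phase = [0, 0, 0, 1, 1, 1, 1]
print(between_phase_effect(y, phase))  # 10.0 - 16/3, approximately 4.67
```

Because the same participant supplies both phase means, the repeated observations are correlated, which is precisely why the independent-samples assumption of RCT analyses cannot be made for ICTs.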

Limitations

Results of this Monte Carlo study should be interpreted in light of several limitations. These results are from analyses in which we assumed normal distributions in outcomes, utilized two-phase ICTs, and focused on specific forms of time series outcomes, and may not apply well to individual studies that do not meet these conditions. For example, alternative outcome variables may be skewed, binary (e.g., disease status), or counts (e.g., frequency of a behavior). Moreover, as with most simulation studies, we assumed that no data were missing.

Another specific factor that was not explicitly simulated (to maintain focus on the factors that were analyzed) is known as a carryover effect. A carryover effect occurs when the effect of one treatment continues to impact participants after a second treatment has begun (Duan et al., 2013; Percha et al., 2019; Zucker et al., 2010). The present simulation analyses were conducted to inform two-phase baseline and treatment studies (also known as single-subject two-phase or two-period crossover studies). These include multiple baseline designs, in which carryover effects are generally not considered because the baseline phase does not include an active treatment. Moreover, in many within-subject clinical trials, a “washout” period is implemented between treatments during which no treatment occurs to avoid carryover effects. Nevertheless, if carryover effects are possible in the context of a within-subjects clinical trial, it is important to account for this potential complication in designing the study.

Finally, there are other factors that can impact the statistical power of ICTs. One example is associations among random effects, which for our simulation analyses were distributed as bivariate normal with moderate correlations of 0.3 based on our prior studies. Another factor is the level of intraclass correlations (or intra-cluster correlation where a person is a cluster), which was not systematically tested in our simulation analyses (Wang & Schork, 2019).

The R packages used to conduct the simulation analyses, PersonAlytics and PersonAlyticsPower (Tueller et al., 2019, 2020), can be used to address these limitations. They are flexible enough to characterize statistical power analyses for a broader range of outcomes, other types of designs of within-subject clinical trials, and more complex forms of time series data.

Implications for small-population clinical trials

Results demonstrated that multiple design factors can be adjusted to optimize the statistical power of a within-subjects clinical trial. Similar to traditional RCTs, larger sample sizes yielded notably greater power. Increasing the number of observations per participant often had a similar impact on power as having larger samples. The proportion of observations assigned to each study phase also sizably impacted power, suggesting that in a two-phase ICT, observations ought to be distributed between the baseline and treatment phases more evenly than is typical. Past within-subject clinical trials have largely consisted of far fewer observations during baseline than the treatment phase (usually 10% or fewer). Assigning few observations to baseline phases is due in part to ethical constraints of withholding treatment (e.g., when the baseline phase is a control condition) or to researchers’ primary interest being in the treatment data. Even so, quality comparator data are critical for understanding treatment effects (Freedland et al., 2019), and our results indicate this is true even for statistical testing of ICT effects.

This investigation highlights the need for at least five follow-up studies with important implications for conducting rigorous ICTs. First, one important follow-up study would be to clarify how power is affected by different correlations among random effects as well as the intraclass correlation. To illustrate, some of the unexpected results reported herein may have been due in part to larger or smaller intraclass correlation inadvertently induced by the way we simulated data.

One extension of this study would be to investigate how the variety of alternative within-subjects designs impact statistical power and precision (Kazdin, 2011). We focused on perhaps the most common within-subject design, in which all participants first experience the control condition and then treatment. Results herein generalize to AB (a simple baseline-then-treatment, two-phase study with the same length of phases for all participants) and multiple baseline designs. It is unknown how well our results apply to other within-subject designs such as reversal designs, which may add a third or fourth phase involving withdrawal and reinstatement of treatment, respectively (i.e., ABA or ABAB designs). Of any experimental design for human treatment research, these designs perhaps provide the strongest support for causal inference (Blackston et al., 2019; Ferron & Onghena, 1996). Results from reversal designs may involve multiple phase jumps and gradual changes per participant, and likely offer greater power and precision than multiple baseline designs (Blackston et al., 2019). Another variant of within-subject designs involves comparing more than two treatment levels, each with its own phase of observation.

A third fertile area for follow-up research is to test how ICT study design factors impact confidence intervals to optimize the precision of statistical estimates (Coulson et al., 2010; McShane et al., 2019; Schmidt, 1996). Improving statistical power also typically narrows confidence intervals (Ferron et al., 2009), as demonstrated by the 95% confidence intervals in Figs. 2 and 3. However, confidence intervals do not vary in lockstep with statistical power. Moreover, smaller samples are more vulnerable to bias from outliers.

A fourth important area for future research is to identify the most important factors that determine the generalizability of results from ICTs. To illustrate, results from a probability sample of 20 drawn from a rare disease population of 200 (10% of the population) are more likely to generalize to that population than results from a probability sample of 20 drawn from the US general population. Moreover, sampling strategies to ensure that a sample is representative of its respective population can further protect against biased results (Blackston et al., 2019). Existing rules of thumb to achieve generalizability from time series data are largely based on sample size, and typically use statistics that assume infinite populations. Stratified random sampling to ensure participants from key strata of a population are adequately represented (e.g., levels of illness with a rare disease) could maximize the generalizability of results from intensive study of small samples.

Generalizability of results and increased precision of statistical estimates from small samples are critical for many important areas of behavioral and medical research. These include rare diseases, treatments for outbreaks of communicable diseases (COVID-19, Zika virus), natural experiments, policy studies, pragmatic studies, implementation science (with small numbers of organizations), and precision medicine (Gruber et al., 2021; King et al., 2019). Pilot studies are often conducted in part to obtain estimates to guide a priori planning of much larger clinical trials. This practice, however, can yield misleading expectations: effect sizes from pilot data are often used to guide RCT power calculations, yet pilot study effect sizes frequently fail to replicate in the subsequent RCT (Kraemer et al., 2006). Thus, improving strategies to generalize results from intensive study of small pilot samples also could improve downstream RCTs (Ridenour & Stull, 2018).

Lastly, at least one critical role that piecewise regression models could play in analysis of ICT data was not examined in this study. Unlike hierarchical models, piecewise models explicitly parameterize different random effects per study phase. For example, a treatment may have important impacts on intrapersonal variance in an outcome (increasing or decreasing it). Optimizing strategies to model, test, and ensure a priori power to detect impacts of a treatment on statistical characteristics other than a change in mean outcome has been largely neglected in ICTs, other within-subject research, and even nomothetic methodologies.

Model extensions

ICT modeling options that were not tested herein include specifying a mixed effects model with different trajectory forms, such as polynomial orders of time (e.g., quadratic, cubic) (Grimm et al., 2016; Singer & Willett, 2003). The same options hold for the piecewise model within each phase. It is also important to note that once data are collected, residual intrapersonal correlation should be tested, and modeling an error covariance structure will usually be needed to account for autocorrelation that remains over and above the modeled variables. Commonly used approaches include fitting autoregressive moving average (ARMA) models, or specifying Toeplitz or compound symmetry (exchangeable) correlation matrices, but many others exist (Pinheiro & Bates, 2011). Model comparisons using fit statistics can be used to select a correlation structure that best fits the data. This selection process is automated by the PersonAlytics R package for ARMA models, and the PersonAlyticsPower package allows the user to specify the ARMA structure when conducting power analyses. Finally, non-normal outcomes also can be analyzed and are implemented in PersonAlytics building on the gamlss package (Stasinopoulos & Rigby, 2007).
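The residual autocorrelation that motivates these error-covariance structures is easy to visualize with the simplest case, an AR(1) process. A short illustrative Python sketch (not the PersonAlytics implementation; parameter names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_errors(n, phi=0.4, sd=1.0):
    """Generate AR(1) residuals e[t] = phi * e[t-1] + innovation, a
    common error-covariance structure for intensive longitudinal data.
    Illustrative sketch; in practice the structure (e.g., ARMA orders)
    is selected by comparing fitted models with fit statistics."""
    e = np.zeros(n)
    for t in range(1, n):
        e[t] = phi * e[t - 1] + rng.normal(0.0, sd)
    return e

def lag1_autocorr(x):
    """Empirical lag-1 autocorrelation of a series."""
    x = x - x.mean()
    return float(np.sum(x[1:] * x[:-1]) / np.sum(x * x))

e = ar1_errors(5000, phi=0.4)
print(round(lag1_autocorr(e), 2))  # close to phi = 0.4
```

If residuals like these are treated as independent, standard errors of the phase-effect estimates are misstated, which is why selecting an adequate correlation structure matters for both analysis and power.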

Recommendations for applied researchers

Our results imply many specific ways that ICT researchers could refine a study design to attain a desired power to detect treatment impacts, including refining the number of observations assigned among the study phases, collecting more observations, using measures that are sensitive to change over short time periods and have little measurement error (i.e., that more effectively detect treatment impact effect size and reduce residual error), using larger sample sizes, and analyzing data using intensive hierarchical regression. However, as demonstrated in our figures and tables, these design factors interact in complex ways to impact power, and many of these factors are largely outside a researcher’s control.

It is critical to consider that the monotonic trends in power frequently reported for nomothetic and single-factor power curves may not apply to power trends for ICTs, in which many more design factors can influence a study’s capacity to detect treatment impact. Because the trends in statistical power of ICTs can vary more than those of nomothetic studies, a priori ICT power analyses are needed to effectively plan within-subject clinical trials and should account for as many of the study’s design factors as possible. Lastly, given the greater likelihood of anomalous results when the number of baseline observations was four or fewer, ICTs should plan for five or more baseline observations.

The encouraging results of this Monte Carlo simulation study for researchers are that multiple options exist to improve statistical power of ICTs. To this end, our results suggest that when feasible, one should adequately power a study to detect different forms of the treatment’s impact. A motivating example from the introduction is Angelman syndrome. Preliminary evidence suggests that brivaracetam may have efficacy for treating Angelman syndrome’s epileptic seizures and non-epileptic myoclonus (which mimics seizures), perhaps via a phase jump in the outcomes (Snoeren et al., 2022). This medication also may reduce aggression and self-injurious behaviors, presumably as a gradual change (e.g., time × phase interaction). If a study is only powered to detect a phase jump, it may not be adequately powered to detect the time-by-phase interaction (or vice versa).
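The kind of Monte Carlo power estimation used throughout this study can be sketched in miniature. This is illustrative Python: a one-sample z-test on participants' between-phase mean differences stands in for the study's mixed-model analyses, and all parameter choices are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

def mc_power_phase_jump(n_subj=20, n_obs=40, d=0.5, reps=2000):
    """Monte Carlo power sketch: each participant contributes a
    between-phase mean difference centered on effect size d; power is
    the rejection rate of a one-sample z-test (|z| > 1.96) across
    replications. A deliberately simplified stand-in for the study's
    mixed-model analyses."""
    half = n_obs // 2
    # SD of the difference between two phase means of `half` unit-variance
    # observations each
    se_within = np.sqrt(2.0 / half)
    hits = 0
    for _ in range(reps):
        diffs = d + rng.normal(0.0, se_within, n_subj)
        z = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(n_subj))
        hits += abs(z) > 1.96
    return hits / reps

print(mc_power_phase_jump(d=0.5))  # substantial power at a medium effect
print(mc_power_phase_jump(d=0.2))  # lower power at a small effect
```

Running the same loop with a test for the time-by-phase slope instead of the phase-mean difference would show why a study powered for a phase jump is not automatically powered for a gradual change.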

In conclusion

The Monte Carlo simulation analyses presented herein and the PersonAlyticsPower software strengthen the rigorous design of ICTs. For many decades, statistical modeling of within-person treatment impacts has largely been excluded from analysis of ICT-like studies (Shaffer et al., 2018; Smith, 2012). One compelling feature of the ongoing resurgence in development of ICTs is its emphasis on methodological rigor in study design and statistical analysis of time series data. In that vein, this study specifically offers power analysis results and tools to strengthen the design of multiple baseline studies as well as other two-phase designs. More broadly, these tools offer opportunities to optimize statistical power for a much broader range of within-subject clinical trial designs than are currently available by considering factors that are and are not within a researcher’s control.

Supplementary Information

ESM 1 (626.3KB, pdf)


Funding

This research was supported by funding from the National Institutes of Health, National Center for Advancing Translational Sciences (R21 TR002402), Ridenour PI.

Data Availability

The data that were simulated can be replicated using https://github.com/ICTatRTI/PersonAlyticsPower/blob/master/inst/simulation_studies/study1%20Within-subject%20Power%20Analysis.R.

Declarations

The authors wish to acknowledge the stellar blind peer reviews which improved both the manuscript and study.

Ethics Approval

Not applicable as this research consisted of a series of simulation analyses.

Consent to Publish

Not applicable as this research consisted of a series of simulation analyses.

Consent to Participate

Not applicable as this research consisted of a series of simulation analyses.

Conflicts of Interest

All authors declare that they have no conflicts of interest, financial or otherwise.

Footnotes

1

Setting up the time variable in a piecewise model must be done carefully to ensure the desired hypotheses are being tested. Readers are referred to the spline growth model (another term for the piecewise growth model) discussed in chapter 7 of Grimm et al. (2016) and references therein; the discontinuity model (another term for the piecewise growth model) discussed in chapter 6.1 of Singer and Willett (2003); Magnusson (2015); and, for a coding perspective, the tutorials at https://rpsychologist.com/r-guide-longitudinal-lme-lmer for R and https://www.lexjansen.com/pharmasug-cn/2015/ST/PharmaSUG-China-2015-ST08.pdf for SAS.
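The variable coding this footnote warns about can be sketched as follows. This is an illustrative Python translation of one common coding scheme; the variable names are hypothetical, and different centering choices would test different hypotheses:

```python
import numpy as np

def piecewise_time_vars(n_obs, baseline_len):
    """One common coding of the time variables for a two-phase piecewise
    (spline/discontinuity) growth model. Illustrative only: `phase`
    carries the phase jump, and `time_in_tx` (time since treatment
    onset) carries the treatment-phase slope; recentering either
    variable changes which hypotheses the coefficients test."""
    time = np.arange(n_obs)
    phase = (time >= baseline_len).astype(int)
    time_in_tx = np.where(phase == 1, time - baseline_len, 0)
    return time, phase, time_in_tx

time, phase, time_in_tx = piecewise_time_vars(10, 4)
print(phase.tolist())       # [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
print(time_in_tx.tolist())  # [0, 0, 0, 0, 0, 1, 2, 3, 4, 5]
```

With this coding, the coefficient on `phase` is the jump at treatment onset and the coefficient on `time_in_tx` is the change in slope after onset, matching the π3 and π2 interpretations used in the text.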

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Baek EK, Ferron JM. Multilevel models for multiple baseline data: modeling across-participant variation in autocorrelation and residual variance. Behav Res Methods. 2013;45(1):65–74. doi: 10.3758/s13428-012-0231-z. [DOI] [PubMed] [Google Scholar]
  2. Baquet CR, Commiskey P, Daniel Mullins C, Mishra SI. Recruitment and participation in clinical trials: socio-demographic, rural/urban, and health care access predictors. Cancer Detect Prev. 2006;30(1):24–33. doi: 10.1016/j.cdp.2005.12.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Blackston, J. W., Chapple, A. G., McGree, J. M., McDonald, S., & Nikles, J. (2019) Comparison of aggregated N-of-1 trials with parallel and crossover randomized controlled trials using simulation studies. In Healthcare (Vol. 7, No. 4, p. 137). MDPI. [DOI] [PMC free article] [PubMed]
  4. Box, G.E., Jenkins, G.M., Reinsel, G.C., Ljung, G.M. (2015). Time series analysis: forecasting and control, 5th edn. Hoboken: John Wiley & Sons.
  5. Button KS, Ioannidis JP, Mokrysz C, Nosek BA, Flint J, Robinson ES, Munafò MR. Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience. 2013;14(5):365–376. doi: 10.1038/nrn3475. [DOI] [PubMed] [Google Scholar]
  6. Cheung, Y. K., Wood, D., Zhang, K., Ridenour, T. A., Derby, L., St Onge, T., Duan, N., Duer-Hefele, J., Davidson, K.W., Kronish, I. and Moise, N., (2020). Personal preferences for personalised trials among patients with chronic diseases: an empirical Bayesian analysis of a conjoint survey. BMJ Open, 10(6), e036056. [DOI] [PMC free article] [PubMed]
  7. Cohen, J. (1988). Statistical power analysis for the behavioral sciences, 2nd edn. Hillsdale: Lawrence Erlbaum.
  8. Coulson M, Healey M, Fidler F, Cumming G. Confidence intervals permit, but do not guarantee, better inference than statistical significance testing. Front Psychol. 2010;1:26. doi: 10.3389/fpsyg.2010.00026. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Daza, E. J. (2018). Causal analysis of self-tracked time series data using a counterfactual framework for N-of-1 trials. Methods of information in medicine, 57(S 01), e10–e21. [DOI] [PMC free article] [PubMed]
  10. Duan N, Kravitz RL, Schmid CH. Single-patient (n-of-1) trials: A pragmatic clinical decision methodology for patient-centered comparative effectiveness research. Journal of clinical epidemiology. 2013;66(8):S21–S28. doi: 10.1016/j.jclinepi.2013.04.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. London: CRC press.
  12. Ferron JM, Bell BA, Hess MR, Rendina-Gobioff G, Hibbard ST. Making treatment effect inferences from multiple baseline data: The utility of multilevel modeling approaches. Behav Res Methods. 2009;41(2):372–384. doi: 10.3758/BRM.41.2.372. [DOI] [PubMed] [Google Scholar]
  13. Ferron JM, Farmer JL, Owens CM. Estimating individual treatment effects from multiple baseline data: A Monte Carlo study of multilevel-modeling approaches. Behav Res Methods. 2010;42(4):930–943. doi: 10.3758/BRM.42.4.930. [DOI] [PubMed] [Google Scholar]
  14. Ferron J, Onghena P. The power of randomization tests for single-case phase designs. The Journal of Experimental Education. 1996;64(3):231–239. doi: 10.1080/00220973.1996.9943805. [DOI] [Google Scholar]
  15. Frangakis CE, Rubin DB. Principal stratification in causal inference. Biometrics. 2002;58(1):21–29. doi: 10.1111/j.0006-341x.2002.00021.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Freedland KE, King AC, Ambrosius WT, Mayo-Wilson E, Mohr DC, Czajkowski SM, Treweek SP. The selection of comparators for randomized controlled trials of health-related behavioral interventions: Recommendations of an NIH expert panel. Journal of Clinical Epidemiology. 2019;110:74–81. doi: 10.1016/j.jclinepi.2019.02.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Green P, MacLeod CJ. SIMR: An R package for power analysis of generalized linear mixed models by simulation. Methods in Ecology and Evolution. 2016;7(4):493–498. doi: 10.1111/2041-210X.12504. [DOI] [Google Scholar]
  18. Grimm, K. J., Ram, N., & Estabrook, R. (2016). Growth modeling: Structural equation and multilevel modeling approaches. New York: Guilford Publications.
  19. Groft SC. Rare diseases research: expanding collaborative translational research opportunities. Chest. 2013;144(1):16–23. doi: 10.1378/chest.13-0606. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Gruber, J., Prinstein, M. J., Clark, L. A., Rottenberg, J., Abramowitz, J. S., Albano, A. M., ... & Weinstock, L. M. (2021). Mental health and clinical psychological science in the time of COVID-19: Challenges, opportunities, and a call to action. American Psychologist, 76, 409. 10.1037/amp0000707 [DOI] [PMC free article] [PubMed]
  21. Hernan MA, Hernandez-Diaz S. Beyond the intention-to-treat in comparative effectiveness research. Clin Trials. 2012;9(1):48–55. doi: 10.1177/1740774511420743. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Holland PW. Statistics and causal inference. Journal of the American Statistical Association. 1986;81(396):945–960. doi: 10.1080/01621459.1986.10478354. [DOI] [PubMed] [Google Scholar]
  23. Howe, G. W., & Ridenour, T. A. (2019). Bridging the gap: Microtrials and idiographic designs for translating basic science into effective prevention of substance use prevention of substance use. In Z. Sloboda, H. Petras, E. Robertson, & R. Hingson, (Eds.), Prevention of substance use (Advances in prevention science) (pp. 349–366). New York: Springer.
  24. Imhoff, M., Bauer, M., Gather, U., & Löhlein, D. (1997). Time series analysis in intensive care medicine. Applied Cardiopulmonary Pathophysiology, 6, 263–281.
  25. Kazdin, A. E. (2011). Single-case research designs: Methods for clinical and applied settings. New York: Oxford University Press.
  26. Kraemer, H. C., Mintz, J., Noda, A., Tinklenberg, J., & Yesavage, J. A. (2006). Caution regarding the use of pilot studies to guide power calculations for study proposals. Archives of General Psychiatry, 63(5), 484–489. doi: 10.1001/archpsyc.63.5.484
  27. Kreidler, S. M., Muller, K. E., Grunwald, G. K., Ringham, B. M., Coker-Dukowitz, Z. T., Sakhadeo, U. R., Baron, A. E., & Glueck, D. H. (2013). GLIMMPSE: Online power computation for linear models with and without a baseline covariate. Journal of Statistical Software, 54(10).
  28. King, K. M., Pullmann, M. D., Lyon, A. R., Dorsey, S., & Lewis, C. C. (2019). Using implementation science to close the gap between the optimal and typical practice of quantitative methods in clinical science. Journal of Abnormal Psychology, 128(6), 547. doi: 10.1037/abn0000417
  29. Kronish, I. M., Cheung, Y. K., Shimbo, D., Julian, J., Gallagher, B., Parsons, F., & Davidson, K. W. (2019). Increasing the precision of hypertension treatment through personalized trials: A pilot study. Journal of General Internal Medicine, 34(6), 839–845. doi: 10.1007/s11606-019-04831-z
  30. Lai, M. H., Kwok, O. M., Hsiao, Y. Y., & Cao, Q. (2018). Finite population correction for two-level hierarchical linear models. Psychological Methods, 23, 94–112.
  31. McDonald, S., & Nikles, J. (2021). N-of-1 trials in healthcare. Healthcare, 9(3), 330.
  32. Magnusson, D. (2015). Individual development from an interactional perspective (Psychology Revivals): A longitudinal study. Psychology Press.
  33. Magnusson, K. (2018). Do you really need a multilevel model? A preview of powerlmm 0.4.0 - R Psychologist. Retrieved from https://rpsychologist.com/do-you-need-multilevel-powerlmm-0-4-0
  34. Marcus, G. M., Modrow, M. F., Schmid, C. H., Sigona, K., Nah, G., Yang, J., Chu, T. C., Joyce, S., Gettabecha, S., Ogomori, K., & Olgin, J. E. (2022). Individualized studies of triggers of paroxysmal atrial fibrillation: The I-STOP-AFib randomized clinical trial. JAMA Cardiology, 7(2), 167–174. doi: 10.1001/jamacardio.2021.5010
  35. McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2019). Abandon statistical significance. The American Statistician, 73(sup1), 235–245. doi: 10.1080/00031305.2018.1527253
  36. Meidinger, E. E. (1980). Applied time series analysis for the social sciences. Sage Publications.
  37. Mills, E. J., Seely, D., Rachlis, B., Griffith, L., Wu, P., Wilson, K., et al. (2006). Barriers to participation in clinical trials of cancer: A meta-analysis and systematic review of patient-reported factors. The Lancet Oncology, 7(2), 141–148. doi: 10.1016/S1470-2045(06)70576-9
  38. Muthén, L. K., & Muthén, B. O. (1998–2019). Mplus user’s guide (8th ed.). Los Angeles, CA: Muthén & Muthén.
  39. Myin-Germeys, I., Oorschot, M., Collip, D., Lataster, J., Delespaul, P., & Van Os, J. (2009). Experience sampling research in psychopathology: Opening the black box of daily life. Psychological Medicine, 39(9), 1533–1547. doi: 10.1017/S0033291708004947
  40. National Institutes of Health. (2019). Fact Sheet: Rare Diseases Clinical Research Network. Retrieved from https://report.nih.gov/nihfactsheets/Pdfs/RareDiseasesClinicalResearchNetwork(ORDR).pdf
  41. Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W. (1996). Applied linear statistical models. Chicago: Irwin.
  42. Nikles, J., Daza, E. J., McDonald, S., Hekler, E., & Schork, N. J. (2021). Creating evidence from real world patient digital data. Frontiers in Computer Science. https://www.frontiersin.org/research-topics/10089/creating-evidence-from-real-world-patient-digital-data
  43. Percha, B., Baskerville, E. B., Johnson, M., Dudley, J. T., & Zimmerman, N. (2019). Designing robust N-of-1 studies for precision medicine: Simulation study and design recommendations. Journal of Medical Internet Research, 21(4), e12641.
  44. Petit-Bois, M., Baek, E. K., Van den Noortgate, W., Beretvas, S. N., & Ferron, J. M. (2016). The consequences of modeling autocorrelation when synthesizing single-case studies using a three-level model. Behavior Research Methods, 48(2), 803–812. doi: 10.3758/s13428-015-0612-1
  45. Pinheiro, J., & Bates, D. (2011). Mixed-effects models in S and S-PLUS (corrected 3rd printing). Springer Science & Business Media.
  46. Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D., & R Core Team. (2019). nlme: Linear and nonlinear mixed effects models (R package version 3.1-142). Retrieved December 1, 2022, from https://CRAN.R-project.org/
  47. R Core Team. (2019). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
  48. Ridenour, T. A., Chen, S.-H. K., Liu, H.-Y., Bobashev, G. V., Hill, K., & Cooper, R. (2017). The clinical trials mosaic: Toward a range of clinical trials designs to optimize evidence-based treatment. Journal for Person-Oriented Research, 3(1), 28–48. doi: 10.17505/jpor.2017.03
  49. Ridenour, T. A., Pineo, T. Z., Maldonado Molina, M. M., & Hassmiller Lich, K. (2013). Toward rigorous idiographic research in prevention science: Comparison between three analytic strategies for testing preventive intervention in very small samples. Prevention Science, 14(3), 267–278. doi: 10.1007/s11121-012-0311-4
  50. Ridenour, T. A., & Stull, D. (2018). Potential utility of idiographic clinical trials in drug development. Value and Outcomes Spotlight, 4, 23–27.
  51. Ridenour, T. A., Wittenborn, A. K., Raiff, B. R., Benedict, N., & Kane-Gill, S. (2016). Illustrating idiographic methods for translation research: Moderation effects, natural clinical experiments, and complex treatment-by-subgroup interactions. Translational Behavioral Medicine, 6(1), 125–134. doi: 10.1007/s13142-015-0357-5
  52. Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688. doi: 10.1037/h0037350
  53. Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469), 322–331. doi: 10.1198/016214504000001880
  54. Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for the training of researchers. Psychological Methods, 1, 115–129.
  55. Shaffer, J. A., Kronish, I. M., Falzon, L., Cheung, Y. K., & Davidson, K. W. (2018). N-of-1 randomized intervention trials in health psychology: A systematic review and methodology critique. Annals of Behavioral Medicine, 52(9), 731–742. doi: 10.1093/abm/kax026
  56. Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. Oxford University Press.
  57. Singh, S., & Loke, Y. K. (2012). Drug safety assessment in clinical trials: Methodological challenges and opportunities. Trials, 13(1), 1–8. doi: 10.1186/1745-6215-13-138
  58. Smith, J. D. (2012). Single-case experimental designs: A systematic review of published research and current standards. Psychological Methods, 17(4), 510–550. doi: 10.1037/a0029312
  59. Snoeren, A., Majoie, M. H., Fasen, K. C., & Ijff, D. M. (2022). Brivaracetam for the treatment of refractory epilepsy in patients with prior exposure to levetiracetam: A retrospective outcome analysis. Seizure, 96, 102–107. doi: 10.1016/j.seizure.2022.02.007
  60. Splawa-Neyman, J., Dabrowska, D. M., & Speed, T. (1990). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, 465–472.
  61. Spybrook, J., Bloom, H., Congdon, R., Hill, C., Martinez, A., Raudenbush, S., & To, A. (2011). Optimal design plus empirical evidence: Documentation for the “Optimal Design” software. William T. Grant Foundation. Retrieved November 5, 2012.
  62. Stasinopoulos, M. D., & Rigby, R. A. (2007). Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, 23(7), 1–46. doi: 10.18637/jss.v023.i07
  63. Stasinopoulos, M. D., Rigby, R. A., Heller, G. Z., Voudouris, V., & De Bastiani, F. (2017). Flexible regression and smoothing: Using GAMLSS in R. CRC Press.
  64. Teare, M. D., Dimairo, M., Shephard, N., Hayman, A., Whitehead, A., & Walters, S. J. (2014). Sample size requirements to estimate key design parameters from external pilot randomised controlled trials: A simulation study. Trials, 15, 264. doi: 10.1186/1745-6215-15-264
  65. Ten Have, T. R., Normand, S. L., Marcus, S. M., Brown, C. H., Lavori, P., & Duan, N. (2008). Intent-to-treat vs. non-intent-to-treat analyses under treatment non-adherence in mental health randomized trials. Psychiatric Annals, 38(12), 772–783. doi: 10.3928/00485713-20081201-10
  66. Trompetter, H. R., Johnston, D. W., Johnston, M., Vollenbroek-Hutten, M. M., & Schreurs, K. M. (2019). Are processes in acceptance & commitment therapy (ACT) related to chronic pain outcomes within individuals over time? An exploratory study using n-of-1 designs. Journal for Person-Oriented Research.
  67. Tueller, S. J., Ramirez, D., & Ridenour, T. A. (2019). PersonAlytics: Analytics for single-case, small N, and idiographic clinical trials (R package version 0.2.6.8). Retrieved December 1, 2022, from https://www.personalytics.rti.org/personalytics-software/
  68. Tueller, S. J., Ramirez, D., & Ridenour, T. A. (2020). PersonAlyticsPower: Power analysis and simulation for PersonAlytics (R package version 0.1.7.1). Retrieved December 1, 2022, from https://www.personalytics.rti.org/personalytics-software/
  69. Walls, T. A., & Schafer, J. L. (Eds.). (2006). Models for intensive longitudinal data. Oxford University Press.
  70. Wang, Y., & Schork, N. J. (2019). Power and design issues in crossover-based N-of-1 clinical trials with fixed data collection periods. Healthcare, 7(3), 84.
  71. Wittenborn, A. K., Liu, T., Ridenour, T. A., Lachmar, E. M., Mitchell, E. A., & Seedall, R. B. (2019). Randomized controlled trial of emotionally focused couple therapy compared to treatment as usual for depression: Outcomes and mechanisms of change. Journal of Marital and Family Therapy, 45(3), 395–409. doi: 10.1111/jmft.12350
  72. Wright, A. G., Beltz, A. M., Gates, K. M., Molenaar, P. C., & Simms, L. J. (2015). Examining the dynamic structure of daily internalizing and externalizing behavior at multiple levels of analysis. Frontiers in Psychology, 6, 1914. doi: 10.3389/fpsyg.2015.01914
  73. Zucker, D. R., Ruthazer, R., & Schmid, C. H. (2010). Individual (N-of-1) trials can be combined to give population comparative treatment effect estimates: Methodologic considerations. Journal of Clinical Epidemiology, 63(12), 1312–1323. doi: 10.1016/j.jclinepi.2010.04.020

Associated Data
Supplementary Materials

ESM 1 (PDF, 626.3 KB)

Data Availability Statement

The data that were simulated can be replicated using https://github.com/ICTatRTI/PersonAlyticsPower/blob/master/inst/simulation_studies/study1%20Within-subject%20Power%20Analysis.R.


Articles from Behavior Research Methods are provided here courtesy of Nature Publishing Group