THE STRATIFIED MICRO-RANDOMIZED TRIAL DESIGN: SAMPLE SIZE CONSIDERATIONS FOR TESTING NESTED CAUSAL EFFECTS OF TIME-VARYING TREATMENTS

WALTER DEMPSEY; PENG LIAO; SANTOSH KUMAR; SUSAN A MURPHY

doi:10.1214/19-aoas1293

Author manuscript; available in PMC: 2021 Apr 15.

Published in final edited form as: Ann Appl Stat. 2020 Jun 29;14(2):661–684. doi: 10.1214/19-aoas1293

PMCID: PMC8049613 NIHMSID: NIHMS1060597 PMID: 33868539

Abstract

Technological advancements in the field of mobile devices and wearable sensors have helped overcome obstacles in the delivery of care, making it possible to deliver behavioral treatments anytime and anywhere. Here, we discuss our work on the design of a mobile health smoking cessation intervention study with the goal of assessing whether reminders, delivered at times of stress, result in a reduction/prevention of stress in the near-term, and whether this effect changes with time in study. Multiple statistical challenges arose in this effort, leading to the development of the stratified micro-randomized trial design. In these designs, each individual is randomized to treatment repeatedly at times determined by predictions of risk. These risk times may be impacted by prior treatment. We describe the statistical challenges and detail how they can be met.

Keywords: sequential randomization, nested causal effects, stratified microrandomized trials, mobile health, weighted-centered least-squares method

1. Introduction.

The rise of wearable technologies has generated increased scientific interest in the use and development of mobile interventions. Such mobile technology holds promise in providing accessible support to individuals in need. Mobile interventions to maintain adherence to HIV medication and smoking cessation, for example, have shown sufficient effectiveness to be recommended for inclusion in health services [Free et al., 2013]. Scientists are increasingly interested in understanding whether it is useful to trigger delivery of treatments at risk times, such as when the individual is stressed [Hovsepian et al., 2015], anxious, or disengaging. Because treatments delivered by phone or wearable can be perceived as intrusive and burdensome, a further goal is to assess if treatment effects change through time.

This paper focuses on applied experimental trial design in the new area of mobile health. In particular, we discuss and illustrate the stratified micro-randomized trial (sMRT) design. This is motivated by our work on the design of multiple sMRTs. This paper’s main focus is Sense2Stop, a mobile health smoking cessation study that is currently underway. In this study, participants are trained in stress-reduction exercises prior to their smoking quit date. Apps that can be used to guide the participant through the exercises are installed on a study-provided phone. These apps can be accessed at any time by a participant. However, a common problem is that participants do not practice these exercises at the very times at which doing so might be most useful. The scientific team is most interested in understanding whether reminders to practice stress-reduction exercises will be useful in reducing/preventing future stress if the reminders are delivered at times the participant is classified as stressed. Thus, some reminders are to occur at these stress times and the remainder at times the participant is not classified as stressed. A primary goal of this study is to assess whether the reminders, delivered at stress times, result in a reduction/prevention of stress over the subsequent hour and whether this effect changes with time.

The design of this sMRT, as well as others, presents a number of challenges.

1. Expressing the primary scientific hypothesis in terms of a causal effect is nontrivial.
2. The primary hypothesis test procedure (e.g., test statistic and rejection region) should balance small sample bias and power when the alternative hypothesis is true.
3. We aim to construct a primary hypothesis test procedure (e.g., test statistic and rejection region) that avoids introducing causal bias.
4. Sometimes the primary hypothesis concerns the distribution of a response that should accrue over a time period in which there is no subsequent treatment, but in the study, subsequent treatment can occur during this time period.
5. A generative model is needed to calculate the required number of participants.
   - Only small observational data sets from participants wearing the same sensor suite are usually available.
   - The sample size calculator should be robust to plausible deviations from the baseline generative model.

In the following, we first discuss the smoking cessation study in greater detail. Next, we introduce the stratified micro-randomized trial (sMRT). We then define the causal treatment effect, addressing challenge 1. Next, we construct a test statistic and associated theory that accommodate challenges 2–4. Subsequently, we develop a simulation-based method for determining the sample size that accommodates challenge 5.

2. Sense2Stop smoking cessation study.

To focus on the experimental design and associated statistical challenges, we consider a simplified version of the smoking cessation study, Sense2Stop, in which we are involved through the Mobile Data to Knowledge Center (https://md2k.org/).^a Sense2Stop is a 10-day mobile health intervention study beginning on each participant’s smoking quit day. Participants wear both an AutoSense chest band [Ertin et al., 2011] and bands on each wrist for 10 hours per day. An online pattern-mining algorithm uses the resulting sensor data to construct a binary time-varying stress classification (see Section 7 for an overview of how this algorithm uses episodes of time to construct the stress classifications) at each minute of sensor wearing throughout the entire day.

Each participant’s smart phone contains a number of guided stress-reduction exercises that can be accessed 24/7. Participants are trained in the use of these exercises prior to their quit date. The treatment is a smartphone notification to remind the participant to access the app and practice the stress-reduction exercises. Theoretically, a treatment can be delivered at any minute during the 10 hour day. Practically, treatment delivery is constrained by considerations of attendant burden and to times at which online stress classification is possible.

The trial design should enable us to address the scientific questions:

Is there an effect of the reminder treatment on near-term, proximal stress if the individual is currently experiencing stress? Does the effect of the reminder treatments vary with time in study?

3. Stratified Micro-Randomized Trial.

In general, the stratified micro-randomized trial (sMRT) consists of a sequence of within-person decision times t = 1,...,T (i.e., occasions) at which treatment may be randomized. In Sense2Stop, there is a decision time at each minute of the 10-hour day on each of the 10 days; that is, T = 600×10 = 6000 decision times. sMRTs are a generalization of the micro-randomized trial [Liao et al., 2016, Dempsey et al., 2015, Klasnja et al., 2015, Bidargaddi et al., 2018] to accommodate stratification. The decision times are divided into strata and the randomization occurs separately within each stratum. This ensures sufficient treatment and no-treatment occurrences within each stratum.

In Sense2Stop, the stratification is motivated by our goal of collecting data to address the questions posed in the prior section. There are two strata, minutes at which a participant is classified as stressed and minutes at which the participant is not classified as stressed. Prior data indicated that participants are likely to experience many fewer minutes of stress than non-stress minutes per day, thus motivating the stratification.

In contrast to micro-randomized trials, in an sMRT, the stratification requires online monitoring of a time-varying stratification variable (e.g., minute-by-minute stress classification in Sense2Stop) as well as the development of randomization probabilities that, for each participant, depend on that participant’s prior data. As a result, sample size calculations are more complex than in the micro-randomized trial, further complicating the 5th challenge listed in Section 1.

To describe the sMRT, and in particular the Sense2Stop sMRT, we use the following definitions.

Availability.

At decision time t, the mobile app assesses if the participant is unavailable for randomization. That is, at some time points it is inappropriate to provide treatment due to ethical, feasibility, or burden considerations. In Sense2Stop, if a participant receives a treatment reminder, then for the next 60 minutes the participant is unavailable for further treatment. This was done to limit burden and intrusiveness of smartphone notifications.

Feasibility constraints often are due to current sensing technology along with restrictions imposed by the goal of real time detection of the stratification variable. In Sense2Stop, for example, the classification algorithm only makes a real time classification of stress at minutes at which sufficient evidence of recent stress has accumulated. In particular, the Sense2Stop classification algorithm produces a smoothed probability of physiological stress across the minutes with an episodic pattern – the minute-by-minute probability increases then decreases then increases and so on. An episode is defined by the beginning of a positive trend interval and peaks at the end of a positive-trend interval followed by the start of a negative-trend interval. To ensure the required sensitivity and specificity, the algorithm only makes a classification in the minute after the peak of an episode (see Figure 1). Only at these peak minutes is a participant considered available (provided no treatment has been delivered in the past 60 minutes). At all other times the participant is considered unavailable. For greater detail see the discussion in Section 7. The indicator I_t = 1 means that the participant is available at decision time t and I_t = 0 otherwise.

Fig 1: Illustrative example of the episodic pattern of smoothed Sense2Stop stress probabilities and its associated online classification algorithm. In the minute following t^⋆, a stress classification is made. Subsequently, all minutes from the episode beginning to episode end are given the same classification. A participant can only be available in the minute following t^⋆.

Stratification variable.

The stratification variable is denoted X_t. In Sense2Stop, there are two strata: X_t = 1 indicates that t is within an episode for which, at its peak, the participant was classified as stressed, and X_t = 0 otherwise. As depicted in Figure 1, X_t is only observed in real time if t is the minute following the peak. This is also the one minute during the episode at which the participant is available, as discussed above. In general, X_t may be categorical.

Treatment.

At available decision times, treatment, A_t, is randomized. In Sense2Stop, A_t is binary with A_t = 1 if at minute t, the participant is randomized to receive a reminder to practice stress-reduction exercises and A_t = 0 otherwise.

Proximal response.

Usually treatments are designed to have a proximal, near-term effect on a response variable. This proximal response, denoted here by Y_{t,Δ}, is assumed to be a known function of the participant’s data within a subsequent window of length Δ. In Sense2Stop, the proximal outcome is the fraction of time classified as stressed over the subsequent Δ = 60 minutes. Note that, as discussed above, the real-time stress classification is made at the peak of an episode (see Figure 1). Once a classification is made, prior minutes in the same episode receive the same classification. This means that X_t is defined at all minutes t and thus can be used to form a proximal outcome. In particular,

Y_{t, Δ} = Δ^{- 1} \sum_{s = 1}^{Δ} 1_{X_{t + s} = 1} .

This choice of proximal response led to challenge 4.
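To make the definition of the proximal response concrete, the following minimal Python sketch computes Y_{t,Δ} from a sequence of minute-level stress classifications. The function name `proximal_response` and the convention that list position s holds X_s are illustrative assumptions, not part of the Sense2Stop software.

```python
def proximal_response(x, t, delta=60):
    """Fraction of the next `delta` minutes classified as stressed.

    x     : sequence of 0/1 classifications; x[s] plays the role of X_s
            (hypothetical storage convention, chosen for illustration)
    t     : current decision time (an index into x)
    delta : window length, Delta = 60 minutes in Sense2Stop
    """
    # Minutes t+1, ..., t+delta, matching the sum over s = 1, ..., Delta.
    window = x[t + 1 : t + delta + 1]
    if len(window) < delta:
        raise ValueError("not enough follow-up minutes after time t")
    return sum(1 for value in window if value == 1) / delta
```

For example, with classifications `[0, 1, 1, 0, 1]`, t = 0 and Δ = 4, the window contains three stressed minutes out of four, so the proximal response is 0.75.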

Longitudinal data.

The ordering of a participant’s longitudinal data for use in the primary analysis is

({X_{1}, U_{1}, I_{1}}, A_{1}, {Y_{2}, X_{2}, U_{2}, I_{2}}, A_{2}, \dots, A_{T - 1}, {X_{T}, U_{T}, I_{T}})

where U_t ∈ {0, 1, 2} indicates the episode phase at decision time t (i.e., “pre-peak”, “peak”, and “post-peak”). Let $H_{t} = ({{X_{s}, U_{s}, I_{s}}, A_{s}}_{s = 1}^{t - 1}, {X_{t}, U_{t}, I_{t}})$ denote the observed data at time t prior to randomization. In general, as in Sense2Stop, X_t, U_t and I_t may be impacted by prior treatment.

Randomization formula.

At an available decision time t, the randomization probability, pr(A_t = 1 | H_t), is a known function of H_t, denoted by p_t(1 | H_t); otherwise, if I_t = 0, then A_t = 0 and p_t(1 | H_t) is set to 0. Note that p_t(· | H_t) need only be defined, and is only used in the experiment, if t is an available decision time. Section F of the Supplementary materials provides a simplified version of the formula, used in Sense2Stop, for p_t(a | h_t), t = 1,...,T, for any value of the observed history.

The randomization probability is set to ensure an average number of treatments within a given time duration (e.g., within a day or week). It is our experience that these constraints are almost always due to concerns about intrusiveness and attendant burden. In designing Sense2Stop, the team felt that an average of 1.5 treatment reminders per day within each stratum would be well-tolerated. The need for stratification and the average constraint of 1.5 treatment reminders per stratum per day resulted in a randomization probability that depended on the entire observed history. This fact contributes to challenges 3 and 5.
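The mechanics of a single stratified randomization can be sketched in a few lines of Python. This is only an illustration: the Sense2Stop probabilities depend on the entire observed history (Section F of the Supplementary materials), whereas the sketch below substitutes a fixed probability per stratum. The function `randomize` and the dictionary `prob_by_stratum` are hypothetical names introduced here.

```python
import random


def randomize(available, stratum, prob_by_stratum, rng):
    """Draw A_t at one decision time and return (A_t, p_t(1 | H_t)).

    available       : availability indicator I_t
    stratum         : current value of the stratification variable X_t
    prob_by_stratum : placeholder map from stratum to treatment probability
                      (an assumption; the real formula uses the full history)
    rng             : a random.Random instance, passed in for reproducibility
    """
    if not available:
        # When I_t = 0, no treatment is delivered and p_t(1 | H_t) is set to 0.
        return 0, 0.0
    p1 = prob_by_stratum[stratum]
    action = 1 if rng.random() < p1 else 0
    return action, p1
```

Returning the probability alongside the action is a deliberate choice: the analysis weights require p_t(A_t | H_t) at every decision time, so it is convenient to log it when the randomization occurs.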

REMARK 3.1 (Designing an sMRT).

Appendices A, B, and C are included to aid scientists interested in designing an sMRT.

4. Proximal effect of treatment.

The primary question of interest is whether the treatment has a proximal effect; that is, whether there is an effect of treatment at decision time t on the mean proximal response Y_t,Δ. Below we use potential outcomes [Robins, 1986, Rubin, 1978] to make this question precise and in particular, operationalize the questions relevant to Sense2Stop posed at the end of Section 2. Note, we are only interested in treatment effects conditional on availability (I_t = 1). We consider two types of effects: an effect that is defined conditionally on the value of the stratification variable X_t and I_t = 1 or an effect that is conditional only on I_t = 1, so marginal with respect to the distribution of X_t. For expositional simplicity, we focus on the test for the conditional treatment effect in the remainder of this paper. Appendix I provides a parallel discussion in the case of the marginal treatment effect.

4.1. Proximal treatment effect, potential outcomes, and reference distribution.

As stated above, we use potential outcomes [Robins, 1986, Rubin, 1978] to define the conditional proximal effect. The overbar is used to denote a sequence through a specified treatment occasion; ${\bar{a}}_{t} = (a_{1}, \dots, a_{t})$ , for instance, denotes the sequence of realized actions up to and including decision time t. The potential observations at decision time t are ${X_{t} ({\bar{a}}_{t - 1}), U_{t} ({\bar{a}}_{t - 1}), I_{t} ({\bar{a}}_{t - 1})}_{{\bar{a}}_{t - 1} \in {0, 1}^{t - 1}} .$ For example at time 2, the potential observations are ${X_{2} (a_{1}), U_{2} (a_{1}), I_{2} (a_{1})}_{a_{1} \in {0, 1}} .$ In the case of Sense2Stop, availability is defined as

I_{t} ({\bar{a}}_{t - 1}) = \begin{cases} 1 & \text{if } \sum_{s = 1}^{Δ} a_{t - s} = 0 \text{ and } U_{t} ({\bar{a}}_{t - 1}) = 1 \\ 0 & \text{otherwise} \end{cases}
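The availability rule can be sketched in Python as follows. We assume, for illustration only, that treatments and episode phases are stored in lists indexed by decision time, that U_t = 1 codes the phase at which a classification is made, and that no treatment was delivered before the first recorded time; the function name `availability` is hypothetical.

```python
def availability(a, u, t, delta=60):
    """Availability indicator I_t under the Sense2Stop rule.

    a     : sequence of past treatment indicators; a[s] plays the role of a_s
    u     : sequence of episode phases in {0, 1, 2}; u[s] plays the role of U_s
    t     : current decision time (an index into a and u)
    delta : lockout length after a treatment, Delta = 60 minutes
    """
    # Look back over times t-delta, ..., t-1, truncated at the first time.
    start = max(0, t - delta)
    recently_treated = any(a[s] == 1 for s in range(start, t))
    return int(u[t] == 1 and not recently_treated)
```

With `a = [0, 1, 0, 0, 0]`, `u = [1, 0, 1, 1, 1]` and Δ = 2, the participant is available at times 0 and 4 but not at times 2 or 3, which fall inside the lockout that follows the treatment at time 1.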

The potential outcomes for the proximal response at time t are ${Y_{t, Δ} ({\bar{a}}_{t + Δ - 1})}_{{\bar{a}}_{t + Δ - 1} \in {0, 1}^{t + Δ - 1}} .$ Each individual has 2^{t+Δ−1} potential outcomes at time t.

At the individual level, the effect of providing treatment versus not providing treatment at time t is a difference in potential outcomes for the proximal response and is given by

Y_{t, Δ} ({\bar{a}}_{t - 1}, 1, a_{t + 1}, \dots, a_{t + Δ - 1}) - Y_{t, Δ} ({\bar{a}}_{t - 1}, 0, a_{t + 1}, \dots, a_{t + Δ - 1}) .

(1)

In general there are 2^{t+Δ−2} treatment differences for each individual, each corresponding to a treatment pattern for $({\bar{a}}_{t - 1}, a_{t + 1}, \dots, a_{t + Δ - 1}) .$ However, participants’ availability constrains the number of possible treatment patterns. In particular, our hypotheses only concern differences of potential outcomes corresponding to treatment at available times. In the Sense2Stop study, for instance, we are interested in treatment differences between potential outcomes for which if a_t = 1 then (a_{t+1},...,a_{t+Δ}) is equal to $\bar{0}$, since following treatment, the participant is unavailable for further treatment for the next Δ = 60 minutes.

Recall that the “fundamental problem of causal inference” [Imbens and Rubin, 2015, Pearl, 2009] is that we cannot observe any one of these individual differences. Thus we consider averages of potential outcomes in defining treatment effects. In addition, to define the treatment effect, we must specify a reference distribution^b; that is, the distribution of treatments prior to time t, ${\bar{a}}_{t - 1}$. Moreover, if Δ > 1, then we must also define a second reference distribution over treatments after time t, (a_{t+1},...,a_{t+Δ−1}). Overall, the treatment effect at time t will be an average of the differences in (1) both over the distribution across individuals’ potential outcomes and over the reference distributions for the treatments, respecting the constraints imposed by availability. To define the proximal treatment effect we must select these reference distributions.

The question is, “Which reference distributions should be used?” The choice of which distribution to use for (a_{t+1},...,a_{t+Δ−1}) might differ by the type of inference desired. For example, in Sense2Stop, we further operationalize the questions posed at the end of Section 2 by setting the treatments a_{t+1},...,a_{t+Δ−1} to 0. In this case, the treatment effect is:

The effect on the fraction of time stressed in the next hour of (a) providing a notification at time t to practice stress-reduction exercises and no notifications within the next hour versus (b) no notification at time t and no notifications within the next hour.

In this paper, we set treatment at the subsequent Δ − 1 times equal to 0 as described above. In order to select the reference distribution for ${\bar{a}}_{t - 1},$ we follow common practice in observational mobile health studies; here longitudinal methods such as GEEs and random effects models [Liang and Zeger, 1986] might be used to model how a time-varying variable, such as physical activity, varies with current mood. In this case, the mean model in these analyses is marginal over the past distribution of mood. A similar strategy in the randomized setting is to use the past treatment randomization probabilities as the reference distribution.

With the reference distribution set to the randomization probabilities for past treatment and set to no treatment for the subsequent Δ − 1 times, the average causal effect at time t can be viewed as an excursion. That is, participants get to time t under treatment according to the randomization probabilities, then at time t (if available) the effect is the contrast between two opposing excursions into the future. In one excursion, we treat at time t and then do not treat for Δ − 1 further times; in the opposing excursion, we do not treat at time t nor do we treat for Δ − 1 subsequent times.

Using the above reference distribution, the conditional, proximal treatment effect at time t, β(t; x), is:

\frac{E [\sum_{{\bar{a}}_{t - 1}} (\prod_{j = 1}^{t - 1} p_{j} (a_{j} | H_{j} ({\bar{a}}_{j - 1}))) (Y_{t, Δ} ({\bar{a}}_{t - 1}, 1, \bar{0}) - Y_{t, Δ} ({\bar{a}}_{t - 1}, 0, \bar{0})) I_{t} ({\bar{a}}_{t - 1}) 1_{X_{t} ({\bar{a}}_{t - 1}) = x}]}{E [\sum_{{\bar{a}}_{t - 1}} (\prod_{j = 1}^{t - 1} p_{j} (a_{j} | H_{j} ({\bar{a}}_{j - 1}))) I_{t} ({\bar{a}}_{t - 1}) 1_{X_{t} ({\bar{a}}_{t - 1}) = x}]} .

where the expectation, $E$, is over the distribution of the potential outcomes and $\bar{0}$ is a row vector of zeros of length Δ − 1.

Beyond scientific considerations, a further statistical consideration in selecting a reference distribution is that if the reference distribution is far from the randomization distribution then treatment effects may be very difficult to estimate. See Section B in the Supplementary materials for a discussion. For the remainder of this paper, the proximal effects are defined using the randomization distribution for past treatments $({\bar{a}}_{t - 1})$ and (a_{t+1},...,a_{t+Δ−1}) are set to 0 (no treatment).

4.2. Proximal effect of treatment & observable data.

The following three assumptions are used to express the causal treatment effect, β(t; x), in terms of the observable data.

ASSUMPTION 4.1.

We assume consistency, positivity, and sequential ignorability [Robins, 1986]:

Consistency: For each $t \leq T + Δ, {X_{t} ({\bar{A}}_{t - 1}), I_{t} ({\bar{A}}_{t - 1})} = {X_{t}, I_{t}} .$ That is, the observed values equal the corresponding potential outcomes.
Positivity: if the joint density of {H_t = h, A_t = a} is greater than zero, then pr(A_t = a | H_t = h) > 0.
Sequential ignorability: for each t ≤ T, the potential outcomes, ${X_{2} (a_{1}), I_{2} (a_{1}), \dots, X_{T + Δ} ({\bar{a}}_{T + Δ - 1})}_{{\bar{a}}_{T + Δ - 1} \in {0, 1}^{T + Δ - 1}},$ are independent of A_t conditional on the history H_t.

Sequential ignorability and, assuming all of the randomization probabilities are bounded away from 0 and 1, positivity, are guaranteed for an sMRT by design. Consistency is a necessary assumption for linking the potential outcomes as defined here to the data. When an individual’s outcomes may be influenced by the treatments provided to other individuals, consistency may not hold. In such instances, a group-based conceptualization of potential outcomes is used [Hong and Raudenbush, 2006, Vanderweele et al., 2013]. In particular, if the mobile intervention includes treatments that aim to produce social ties between participants, then consistency as stated above will not hold. For simplicity, we do not consider such mobile interventions here.

LEMMA 4.2.

Under Assumption 4.1, the conditional treatment effect satisfies

β (t; x) = E_{p} [E_{p} [\prod_{j = t + 1}^{t + Δ - 1} \frac{1_{A_{j} = 0}}{p_{j} (A_{j} | H_{j})} Y_{t, Δ} | A_{t} = 1, H_{t}] | X_{t} = x, I_{t} = 1] - E_{p} [E_{p} [\prod_{j = t + 1}^{t + Δ - 1} \frac{1_{A_{j} = 0}}{p_{j} (A_{j} | H_{j})} Y_{t, Δ} | A_{t} = 0, H_{t}] | X_{t} = x, I_{t} = 1]

(2)

for all x ∈ {0,...,k} where each expectation is with respect to the distribution of the data collected using the randomization probabilities specified in the design of the sMRT (indicated by the subscript p on the expectations).

Note that the above products, e.g. $\prod_{j = t + 1}^{t + Δ - 1} \frac{1_{A_{j} = 0}}{p_{j} (A_{j} | H_{j})}$, are set to 1 if Δ = 1. The proof of Lemma 4.2 can be found in Section G of the Supplementary materials. We now focus on designing an sMRT where the primary purpose is testing whether the treatment effect at any time point differs from 0.

5. Test statistic.

Our main objective is the development of a sample size formula that will ensure sufficient power to detect alternatives to the null hypothesis of no proximal treatment effect. For the conditional proximal effect, the null hypothesis is H₀ : β(t; x) = 0, t = 1,...,T and x ∈ {0,...,k}.

The proposed sample size formulas are simulation-based and will follow from consideration of the distribution of test statistics under alternatives to the above null hypothesis. The sample size will be denoted by N. Our test statistic will generalize the test statistics developed by Boruvka et al. [2017] to accommodate stratification as well as the fact that the response Y_t,Δ covers a time interval during which subsequent treatment may be delivered (in Boruvka et al. [2017], Δ = 1 throughout). Moreover, sample size calculations are informed by the novel conceptual insight that these estimators can be interpreted as L₂ projections (see remark 4 in Section 5.1).

In the following, we describe L₂ projections and provide the test statistics. First, in the conditional setting, the test statistic is based on an empirical projection of {β(t; x)}_{t=1,...,T; x∈{0,...,k}} on the space spanned by a q_c by 1 vector of features involving t and x, denoted by f_t(x). We denote the projection by $f_{t} {(x)}^{'} β_{c}^{⋆}$. The weights $β_{c}^{⋆}$ in this projection are given by

β_{c}^{⋆} = \arg \min_{β_{c}} E_{p} [\sum_{t = 1}^{T} I_{t} {\tilde{p}}_{t} (1 | X_{t}) (1 - {\tilde{p}}_{t} (1 | X_{t})) {(β (t; X_{t}) - f_{t} {(X_{t})}^{'} β_{c})}^{2}]

where ${{\tilde{p}}_{t} (1 | x)}_{t = 1, \dots, T; x \in {0, \dots, k}}$ are pre-specified probabilities used to define the weighting across time and stratification distribution in the projection. The expectation $E_{p}$ is taken with respect to the joint distribution of ${(X_{t}, I_{t})}_{t = 1}^{T}$ generated using the randomization probabilities in the sMRT design. If desired, one can set ${\tilde{p}}_{t} (1 | x) = 1 / 2$ for all t, x. See Section 6.1 for further comments on the choice of the ${\tilde{p}}_{t} (1 | x)$'s and $f_{t} (x)$.

In some settings, there will be sufficient a priori information (e.g. data on individuals from a similar population) that will permit the test statistic to use control variables. These variables are used to help reduce the variance of the estimators with the goal that the resulting test statistic is more powerful in detecting particular alternatives to the null hypothesis. See Section 6.1 for further discussion on the choice of control variables. For example, in Sense2Stop, a natural control variable would be the fraction of time stressed in the hour prior to time t, as this pre-time t variable is likely highly correlated with the fraction of time stressed in the hour subsequent to time t, Y_{t,60}. Given a q^′ by 1 vector of “control variables” $g_{t} (H_{t})$, define $g_{t} {(H_{t})}^{'} α_{c}^{⋆}$ as an L₂ projection of Y_{t,Δ} on the space spanned by g_t(H_t); that is,

α_{c}^{⋆} = \arg \min_{α_{c}} E_{p} [\sum_{t = 1}^{T} I_{t} w_{c t} (H_{t + Δ - 1}) {(Y_{t, Δ} - g_{t} {(H_{t})}^{'} α_{c})}^{2}]

where $w_{c t} (H_{t + Δ - 1}) = \frac{{\tilde{p}}_{t} (A_{t} | X_{t})}{p_{t} (A_{t} | H_{t})} \prod_{s = 1}^{Δ - 1} \frac{1 [A_{t + s} = 0]}{p_{t + s} (A_{t + s} | H_{t + s})}$. Note that one can choose g_t(H_t) equal to the scalar 1. This use of control variables to reduce variance in the response addresses challenge 2 listed in Section 1.
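The weight can be computed directly from the realized treatments and their design probabilities. The Python sketch below takes the weight to be the ratio of p̃_t(A_t | X_t) to p_t(A_t | H_t), multiplied by 1[A_{t+s} = 0]/p_{t+s}(A_{t+s} | H_{t+s}) over the subsequent times s = 1,...,Δ − 1, matching the products in Lemma 4.2. The function name `wcls_weight` and the list-based storage are assumptions made for illustration.

```python
def wcls_weight(a, p, p_tilde_1, t, delta):
    """Weight w_ct for decision time t.

    a         : realized treatments; a[s] plays the role of A_s
    p         : design probabilities of the realized treatments, i.e.
                p[s] = p_s(A_s | H_s); this equals 1 at unavailable times,
                where A_s = 0 with certainty
    p_tilde_1 : user-chosen probability p~_t(1 | X_t)
    t, delta  : decision time and window length Delta
    """
    # Numerator p~_t(A_t | X_t): use p~ if treated, 1 - p~ if not.
    numerator = p_tilde_1 if a[t] == 1 else 1.0 - p_tilde_1
    weight = numerator / p[t]
    # The subsequent Delta - 1 times must all be untreated.
    for s in range(1, delta):
        if a[t + s] != 0:
            return 0.0
        weight /= p[t + s]
    return weight
```

When Δ = 1 the loop is empty, so the weight reduces to the single ratio at time t, in line with the convention that the product is set to 1 in that case.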

Recall, the proposed test statistic is based on an estimator of $β_{c}^{⋆} .$ Here we consider an estimator of $β_{c}^{⋆}$ which is the minimizer of the following weighted, centered least-squares criterion, minimized over (α_c, β_c):

ℙ_{n} [\sum_{t = 1}^{T} I_{t} w_{c t} (H_{t + Δ - 1}) {(Y_{t, Δ} - g_{t} {(H_{t})}^{'} α_{c} - (A_{t} - {\tilde{p}}_{t} (1 | X_{t})) f_{t} {(X_{t})}^{'} β_{c})}^{2}]

(3)

where $ℙ_{n} [ϕ (H_{t + Δ - 1})]$ is defined as the average of a function, ϕ(H_{t+Δ−1}), over the sample. The centering refers to the centering of the treatment indicator A_t in the above weighted least squares criterion. The centering idea is from Liao et al. [2016] and Boruvka et al. [2017] (unlike here, Boruvka et al. [2017] aimed to consistently model the treatment effect). Here we aim to estimate the projection for use in the test statistic; the centering allows us to simultaneously and consistently estimate the coefficients in each of the two projections. The consistent estimation of $β_{c}^{⋆}$ addresses challenge 2 listed in Section 1. Centering in the construction of the test statistic preserves the null and avoids introducing causal bias, which addresses challenge 3 listed in Section 1. Indeed, preserving the null is difficult because both the stratification variable x and the randomization probabilities may be influenced by prior treatment.
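As an illustration of the weighted, centered least-squares criterion, the Python sketch below solves it in the simplest case, where both the control variable g_t(H_t) and the feature f_t(x) are the scalar 1. The criterion then has two unknowns and the normal equations can be solved in closed form. The record layout `(w, y, a, p_tilde)` and the function name `wcls_scalar` are assumptions for this sketch; general vector-valued features would require a matrix solve instead.

```python
def wcls_scalar(records):
    """Minimize sum of w * (y - alpha - (a - p_tilde) * beta)**2.

    records : iterable of (w, y, a, p_tilde) tuples, one per available
              decision time (I_t = 1), pooled over participants, where
              w is the weight w_ct, y is Y_{t,Delta}, a is A_t and
              p_tilde is p~_t(1 | X_t)
    Returns (alpha_hat, beta_hat).
    """
    s_w = s_wc = s_wcc = s_wy = s_wcy = 0.0
    for w, y, a, p_tilde in records:
        centered = a - p_tilde  # centered treatment indicator
        s_w += w
        s_wc += w * centered
        s_wcc += w * centered * centered
        s_wy += w * y
        s_wcy += w * centered * y
    # Normal equations: [[s_w, s_wc], [s_wc, s_wcc]] @ [alpha, beta]
    #                   = [s_wy, s_wcy], solved by Cramer's rule.
    det = s_w * s_wcc - s_wc * s_wc
    if det == 0.0:
        raise ValueError("normal equations are singular")
    alpha_hat = (s_wcc * s_wy - s_wc * s_wcy) / det
    beta_hat = (s_w * s_wcy - s_wc * s_wy) / det
    return alpha_hat, beta_hat
```

Because the sample average only rescales the criterion by a constant, pooling all available decision times across participants yields the same minimizer as averaging over participants.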

Under finite moment and invertibility assumptions, the minimizers $({\hat{α}}_{c}, {\hat{β}}_{c})$ are consistent, asymptotically normal estimators of $(α_{c}^{⋆}, β_{c}^{⋆})$. The limiting variance of $\sqrt{N} ({\hat{β}}_{c} - β_{c}^{⋆})$ is given by $Q_{c}^{- 1} W_{c} Q_{c}^{- 1}$ where

W_{c} = E_{p} [\sum_{t = 1}^{T} I_{t} w_{c t} (H_{t + Δ - 1}) ϵ_{c t} (A_{t} - {\tilde{p}}_{t} (1 | X_{t})) f_{t} (X_{t}) \times \sum_{t = 1}^{T} I_{t} w_{c t} (H_{t + Δ - 1}) ϵ_{c t} (A_{t} - {\tilde{p}}_{t} (1 | X_{t})) f_{t} {(X_{t})}^{'}],

ϵ_{c t} = Y_{t, Δ} - g_{t} {(H_{t})}^{'} α_{c}^{⋆} - (A_{t} - {\tilde{p}}_{t} (1 | X_{t})) f_{t} {(X_{t})}^{'} β_{c}^{⋆},

and

Q_{c} = \sum_{t = 1}^{T} E_{p} [I_{t} {\tilde{p}}_{t} (1 | X_{t}) (1 - {\tilde{p}}_{t} (1 | X_{t})) f_{t} (X_{t}) f_{t} {(X_{t})}^{'}] .

See Section G.2 in the Supplementary materials for technical details.

The proposed sample size formula is based on the test statistic

T_{c N} = N {\hat{β}}_{c}^{'} {\hat{Q}}_{c} {\hat{W}}_{c}^{- 1} {\hat{Q}}_{c} {\hat{β}}_{c}

(4)

where N is the sample size, ${\hat{W}}_{c}$ and ${\hat{Q}}_{c}$ are empirical estimators of W_c and Q_c (i.e., replace $E_{p}$ with $ℙ_{n}$), and ${\hat{ϵ}}_{c t} = Y_{t, Δ} - g_{t} {(H_{t})}^{'} {\hat{α}}_{c} - (A_{t} - {\tilde{p}}_{t} (1 | X_{t})) f_{t} {(X_{t})}^{'} {\hat{β}}_{c}$. Here, we have implicitly assumed that ${\hat{W}}_{c}$ is invertible. The following lemma provides the distribution of T_cN:

LEMMA 5.1

(Asymptotic Distribution of T_cN). Under finite moment and invertibility assumptions,

N {({\hat{β}}_{c} - β_{c}^{⋆})}^{'} {\hat{Q}}_{c} {\hat{W}}_{c}^{- 1} {\hat{Q}}_{c} ({\hat{β}}_{c} - β_{c}^{⋆}) \to_{d} χ_{q_{c}}^{2} .

The above lemma implies that the distribution of the test statistic T_cN is approximately a non-central Chi-Squared distribution with non-centrality parameter λ = Nγ_c where

γ_{c} = {(β_{c}^{⋆})}^{'} Q_{c} W_{c}^{- 1} Q_{c} β_{c}^{⋆} .

(5)

However, from a technical perspective, T_cN is very similar to the quadratic form test statistics based on weighted regression used in the Generalized Estimating Equations (GEE) method [Liang and Zeger, 1986, Diggle et al., 2002]. In this field, much work has been done on how to best adjust these test statistics and their distribution when the sample size N might be small [Liao et al., 2016, Mancl and DeRouen, 2001]. The adjustments are based on the intuition that the quadratic form is akin to the multivariate T-test statistic used to test whether a vector of means is equal to 0, and thus Hotelling’s T-squared distribution is used to approximate the distribution when N may be small.

To develop the sample size formula, we follow the lead of the well-developed GEE literature and use a non-central Hotelling’s T-squared distribution with degrees of freedom (d₁ = q_c, d₂ = N − (q^′ + q_c)) to approximate the distribution of T_cN. Recall q^′ is the dimension of α_c and q_c is the dimension of β_c. See Section G in the Supplementary materials for a discussion of how, for large N, we recover the Chi-Squared distribution given in Lemma 5.1. Recall that if a random variable X has a non-central Hotelling’s T-squared distribution with degrees of freedom (d₁, d₂) and non-centrality parameter λ, then $\frac{d_{2}}{d_{1} (d_{1} + d_{2} - 1)} X$ has a non-central F-distribution with the same degrees of freedom and non-centrality parameter [Hotelling, 1931]. Thus the rejection region for the test H₀ : β(t; x) = 0, t = 1,...,T and x ∈ {0,...,k} can be written as:

{T_{c N} > \frac{q_{c} (N - (q^{'} + 1))}{N - (q^{'} + q_{c})} F_{q_{c}, N - (q^{'} + q_{c}); 0}^{- 1} (1 - α_{0})}

(6)

with α₀ a specified significance level. For details regarding further small sample size adjustments, used when analyzing the data, see Section J in the Supplementary materials.

5.1. Remarks.

Next, we discuss components of the proposed test statistic.

Specification of the weights. The weight w_ct(H_t+Δ−1) plays multiple roles:
- First, the term ${\tilde{p}}_{t} (A_{t} | X_{t}) / p_{t} (A_{t} | H_{t})$ is similar to the inverse probability of treatment weighting in causal inference [Robins, 1986] in that it facilitates estimation of a marginal effect, marginal over the history H_t given strata X_t = x and availability I_t = 1.
- Second, choice of the numerator of the user-defined weight ${\tilde{p}}_{t} (A_{t} | X_{t})$ determines the L₂ projection of the treatment effect if β(t; x) is not equal to a linear combination of f_t(x). In these settings, the numerator of the weight determines the $β_{c}^{⋆}$ coefficients in the projection. See below for further comments regarding the L₂ projection. These user-defined weights are distinct from randomization probabilities in the sMRT. Note, however, that if the randomization probabilities in the sMRT p_t(1 | H_t) only depend on t and X_t then we can set ${\tilde{p}}_{t} (A_{t} | X_{t}) = p_{t} (1 | H_{t}) .$
- Third, the remaining terms $\prod_{s = 1}^{Δ - 1} [1 [A_{t + s} = 0] / p_{t + s} (A_{t + s} | H_{t + s})]$ adjust for the fact that the reference distribution at the subsequent Δ − 1 times is different from the sMRT randomization protocol.
No use of a non-independence working correlation matrix. As discussed, the estimating equation underlying the test statistic is similar to a generalized estimating equation (GEE) [Liang and Zeger, 1986, Diggle et al., 2002]. While this might motivate inclusion of a non-independence working correlation matrix to further reduce the variance of the estimator and thus increase the power of the test [Mancl and Leroux, 1996], the inclusion of a non-independence working correlation matrix generally introduces causal bias [Boruvka et al., 2017, Liao et al., 2016]. Similar biases occur with the use of non-independence working correlation matrices in the inverse probability of treatment weighting literature [Vansteelandt, 2007, Tchetgen Tchetgen et al., 2012] or in GEEs where a time-varying response is modeled by time-varying covariates [Pepe and Anderson, 1994].
Use of sandwich estimator of the variance. The test statistic accounts for the within-person correlation in the longitudinal response via use of a sandwich estimator (i.e., ${\hat{Q}}_{c} {\hat{W}}_{c}^{- 1} {\hat{Q}}_{c}$ ) for the covariance matrix. Unfortunately, the power of the test will depend on the within-person correlation in responses; thus the simulation-based sample size formula developed below requires modeling the correlation. Under the null hypothesis that β(t; x) = 0, t = 1,...,T and x ∈ {0,...,k}, the test statistic and associated rejection region have the desired asymptotic type I error rate regardless of the underlying true within-person correlation (assuming W_c is invertible).
Use of a L₂-projection to form the test statistic. Recall that if under the alternative hypothesis β(t; x) is not equal to a linear combination of f_t(x), then the L₂-projection of β(t; x) depends on the feature vector, the pattern of availability across time and the distribution of the stratification variable across time. Figure 2 provides a visualization. Here consider different uses of the feature d_t denoting the day-in-study. The red line is the complex, true treatment effect β(t; x). The black line is the projected effect onto feature vector f_t = (1, d_t) when there is a quadratic pattern across time in availability $(E [I_{t} | X_{t} = x])$ and the stratification distribution P(X_t = x) is constant through time. Similar interpretations hold for the blue, dotted blue and dotted black lines as indicated in Figure 2. The four projections are distinct in the top graph in Figure 2, illustrating how the joint distribution of availability and the stratification variable affects the L₂-projection.

None of the four projections fully reflect the true alternative β(t; x) but all four roughly pick up the departure of the true β(t; x) from the null hypothesis. While it is tempting to consider higher dimensional and more flexible feature spaces so as to more fully reflect the variety of possible alternatives to the null hypothesis, these come at a cost in the additional degrees of freedom in the F statistic. This may lead to an increase in sample size for a given desired power. This tradeoff is discussed at length in Section H in the Supplementary materials; we suggest sizing a study for primary hypothesis tests using the least complex alternative possible. In the case of Sense2Stop, we decided to use a projection onto $f_{t} = (1, d_{t}, d_{t}^{2}) .$ The dotted black line in Figure 2 captures most of the variation in β(t; x) under plausible time patterns in the distribution of availability and the stratification variable. The test statistic targets this low dimensional alternative so as to address challenge 2 listed in Section 1.

Fig 2: Illustration of the L₂-projection of β(t; x) onto feature vector f_t. The reference distribution ${\tilde{p}}_{t} (1 | x)$ is constant in (t, x). The feature vector is non-parametric in binary x and set within each stratum to $f_{t} = (1, d_{t})$ or $f_{t} = (1, d_{t}, d_{t}^{2})$, where d_t is equal to the number of days in study; expected availability given X_t = x is constant in t or is quadratic in t with the same average. The distribution of X_t is constant in t or is quadratic in t with the same average.

6. Sample size formulae.

To plan the sMRT, we need to determine the sample size, N, needed to detect a specific alternative with a given power (1 − β₀) at a given significance level (α₀). The sample size is the smallest value N such that

1 - F_{q_{c}, N - (q^{'} + q_{c}); N γ_{c}} (F_{q_{c}, N - (q^{'} + q_{c}); 0}^{- 1} (1 - α_{0})) \geq 1 - β_{0} .

(7)

$F_{d_{1}, d_{2}; λ}$ and $F_{d_{1}, d_{2}; λ}^{- 1}$ denote the cumulative and inverse distribution functions, respectively, for the non-central F-distribution with degrees of freedom (d₁, d₂) and non-centrality parameter λ. Calculation of the sample size N is non-trivial due to the unknown form of the noncentrality parameter, Nγ_c (where γ_c is defined in (5)). This is in contrast to micro-randomized trials where, under non-stochastic randomization probabilities and certain working assumptions, Liao et al. [2016] were able to find an analytic form for the noncentrality parameter Nγ_c.
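
As a rough sketch of how criterion (7) might be evaluated numerically once a value of γ_c is available, the snippet below scans N upward using SciPy's central and non-central F distributions. The function names `power_at_n` and `smallest_n` are our own illustrative choices, and the value of γ_c in the usage note is made up; neither comes from the Sense2Stop design.

```python
from scipy.stats import f, ncf


def power_at_n(n, gamma_c, q_c, q_prime, alpha0=0.05):
    """Left-hand side of (7): approximate power at sample size n."""
    d1, d2 = q_c, n - (q_prime + q_c)
    crit = f.ppf(1.0 - alpha0, d1, d2)         # F^{-1}_{d1,d2;0}(1 - alpha0)
    return ncf.sf(crit, d1, d2, n * gamma_c)   # 1 - F_{d1,d2;N*gamma_c}(crit)


def smallest_n(gamma_c, q_c, q_prime, alpha0=0.05, beta0=0.20, n_max=100_000):
    """Smallest N satisfying (7); N must exceed q' + q_c so that d2 > 0."""
    for n in range(q_prime + q_c + 1, n_max + 1):
        if power_at_n(n, gamma_c, q_c, q_prime, alpha0) >= 1.0 - beta0:
            return n
    raise ValueError("no N <= n_max reaches the requested power")
```

For instance, `smallest_n(gamma_c=0.3, q_c=6, q_prime=6)` returns the first N at which the approximate power reaches 80% for that hypothetical γ_c. A linear scan is used here for transparency; the simulation-based search of Section 6.1 is still needed because γ_c itself has no closed form.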

We outline a simulation-based sample size calculation, starting with a general overview and comments in Section 6.1, and then employ this calculator to design the smoking cessation study in Section 7.

6.1. Simulation-based sample size calculation.

As discussed above, explicit calculation of the sample size N is non-trivial due to the unknown form of the non-centrality parameter. Here, we propose a three-step simulation-based sample size calculator.

In the first step, equation (5) and information elicited from the scientist are used to calculate, via Monte-Carlo integration, γ_c in the non-centrality parameter. The resulting value, ${\hat{γ}}_{c},$ is plugged in to equation (7) to solve for an initial sample size ${\hat{N}}_{0} .$ In the second step, we use a binary search algorithm to search over a neighbourhood of ${\hat{N}}_{0}$; in our simulations, we found the binary search quickly resulted in a solution. For each sample size N required by the binary search algorithm, K simulated trials, each with N simulated participants, are run. Within each simulation, the rejection region for the test is given by equation (6) at the specified significance level. The proportion of the K simulations in which the null hypothesis is rejected is the estimated power for the sample size N. The sample size is the minimal N with estimated power above the pre-specified threshold 1 − β₀.
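
The second step can be sketched as follows. This is only an outline under stated assumptions: `simulate_rejection` is a hypothetical callable that simulates one trial of N participants and reports whether the null was rejected using region (6), and the search assumes estimated power is non-decreasing in N. The toy simulator at the end is purely illustrative and stands in for the generative model.

```python
import random


def estimate_power(n, simulate_rejection, k=1000, seed=0):
    """Fraction of k simulated trials of size n in which the null is rejected.
    The random stream is re-seeded identically for every n."""
    rng = random.Random(seed)
    return sum(bool(simulate_rejection(n, rng)) for _ in range(k)) / k


def search_sample_size(n_low, n_high, simulate_rejection, target=0.80, k=1000):
    """Binary search for the smallest N in [n_low, n_high] whose estimated
    power is at least `target`, assuming estimated power is non-decreasing in N."""
    if estimate_power(n_high, simulate_rejection, k) < target:
        raise ValueError("n_high does not reach the target power")
    while n_low < n_high:
        mid = (n_low + n_high) // 2
        if estimate_power(mid, simulate_rejection, k) >= target:
            n_high = mid          # mid is large enough; look for something smaller
        else:
            n_low = mid + 1       # mid is too small
    return n_low


def toy_simulator(n, rng):
    """Illustrative stand-in: rejects with probability min(1, n / 100)."""
    return rng.random() < min(1.0, n / 100.0)


n_hat = search_sample_size(10, 200, toy_simulator)
```

In practice the search interval would be a neighbourhood of the initial value from the first step, and K would be chosen large enough that Monte Carlo error in the estimated power is small relative to the gap from the target.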

In the third and final step, we conduct a variety of simulations to assess the robustness of the sample size calculator to any assumptions and to make adjustments to ensure robustness. See our use of these simulations to test robustness for Sense2Stop in Section 7.

The sample size calculator uses the following information for t = 1,...,T; x ∈ {0,...,k}:

desired type 1 and type 2 error rates,
targeted alternative β(t; x),
selected probabilities ${{\tilde{p}}_{t} (1 | x)},$
selected “control variables” g_t(H_t),
the randomization formula used to determine p_t(1|h) given a history h, and
a generative model for {H_t}_t=1,...,T.

We provide general comments concerning the choice of the above items and then build the sample size calculator for the Sense2Stop study of Section 7.

First, we elicit information from the scientist to construct a specific alternative form for β(t; x). A simple approach is to consider linear alternatives, $\{β (t; x) = f_{t} {(x)}^{'} β_{c}^{⋆}\}_{t = 1, \dots, T; x \in \{0, \dots, k\}}$, so that the L₂ projection and the alternative coincide. Stratification variables are often categorical (X is categorical); as a result, we model the alternative separately for each value of X = x; x ∈ {0,...,k}. Furthermore, if we suspect that the effect will be generally decreasing (with study time) due to habituation, then we might consider a feature vector f_t that represents a linear trend in time t. Or we might believe that the effect of the treatments might be low at the beginning of the study and then increase as participants learn how to use the treatment and then decrease due to habituation; here, we might consider a feature vector f_t that results in a quadratic trend. Both quadratic and linear trends are presented in Figure 2.

The less complex the projection (smaller q_c) of the alternative β(t; x), the smaller the required sample size N becomes. On the other hand, the use of a simple projection for the alternative may not reflect the true alternative β(t; x) very well (see Section H in the Supplementary materials for a discussion of this tradeoff). This led to the suggestion in Section 5.1 for sizing a study for primary hypothesis tests using the least complex alternative possible.

To select the probabilities $\{{\tilde{p}}_{t} (1 | x)\}_{t = 1, \dots, T; x \in \{0, \dots, k\}},$ recall that these probabilities define the weighting across time and across the stratification distribution of the alternative when operationalized as an L₂ projection. To see this, suppose we decide to target a constant-across-time alternative and select $f_{t} (X_{t}) = {(1_{X_{t} = 1}, 1_{X_{t} = 2}, \dots, 1_{X_{t} = k})}^{'} .$ If we set the reference probabilities to be constant in t and x then $β_{c}^{⋆} = (β_{c, 1}^{⋆}, β_{c, 2}^{⋆}, \dots, β_{c, k}^{⋆})$ where

β_{c, x}^{⋆} = {[\sum_{t = 1}^{T} E [I_{t} 1_{X_{t} = x}]]}^{- 1} [\sum_{t = 1}^{T} E [I_{t} 1_{X_{t} = x}] β (t; x)] .

In this case, $β_{c, x}^{⋆}$ is an average treatment effect across time weighted by the fraction of time the participant is available and in stratification level x. In our work, we usually set ${\tilde{p}}_{t} (1 | x)$ to be constant in (t, x) so as to more easily discuss the targeted alternative with collaborators.
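
To make the weighting concrete, here is a small sketch of the availability-weighted average above. The function name `projected_constant_effect` and the array layout (rows indexed by t, columns by x) are illustrative assumptions, as are the numbers in the usage lines.

```python
import numpy as np


def projected_constant_effect(beta_tx, avail_weight_tx):
    """beta_tx[t, x] holds beta(t; x); avail_weight_tx[t, x] holds
    E[I_t 1{X_t = x}]. Returns one projected coefficient per stratum x."""
    beta_tx = np.asarray(beta_tx, dtype=float)
    w = np.asarray(avail_weight_tx, dtype=float)
    # Weighted average over time, computed separately within each column x.
    return (w * beta_tx).sum(axis=0) / w.sum(axis=0)


# Hypothetical inputs: T = 3 decision times, two strata.
beta = [[0.0, 0.1], [0.2, 0.1], [0.4, 0.1]]
weights = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0]]
beta_star = projected_constant_effect(beta, weights)
```

In the first stratum the last decision time carries twice the weight of the others, so the projected coefficient is pulled toward the effect at that time; in the second stratum the effect is constant in t and the weights do not matter.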

Next, a decision should be made about which control variables g_t(H_t) should be included in the construction of the test statistic. One might want to include in the q^′ by 1 vector g_t(H_t) many variables so as to maximally reduce variance and thus increase the size of the noncentrality parameter in (5); indeed, for fixed q^′, the larger the noncentrality parameter, the smaller the sample size N. However, from equation (7), we see that, fixing all other quantities, the sample size N increases with increasing q^′. So intuitively, there is a tradeoff between increasing the size of the noncentrality parameter by including more variables in g_t(H_t) and the resulting reduction in degrees of freedom in the denominator of the F test caused by increasing q^′, the number of variables in g_t(H_t). See Section H in the Supplementary materials for further discussion. This tradeoff is directly related to balancing between small sample bias and power, challenge 2 from Section 1. Below, for Sense2Stop, we calculate the sample size with the vector of control variables g_t(H_t) set equal to f_t(X_t); this maintains a hierarchical regression, yet keeps q^′ as small as possible. Incidentally, this simplifies the development of the generative model as additional time-varying variables are not included.

Generally, the randomization formula has been determined by considerations of treatment burden, availability, and whether it is critical for the scientific question that the randomization depend on a time-varying stratification variable such as a prediction of risk. Treatment burden considerations might impose a constraint, such as that on average around n treatments per stratum should occur over a specified time period (e.g., an average of n treatments per day); also, the randomization formula might be developed so as to limit the variance in the number of treatments in the specified time period. In the Sense2Stop study, the randomization probability p_t(1|H_t) is set to limit treatment to an average of 1.5 treatments per day at times classified as stressed and an average of 1.5 treatments per day at times not classified as stressed.

The sample size formula requires the specification of a generative model for the history H_t which achieves the specified alternative treatment effect. However, existing datasets that include the use of the required sensor suites and thus can be used to guide the form of the generative model are often small and do not include treatment. In Sense2Stop, for example, we require a generative model for the multivariate distribution of $\{X_{t}, U_{t}, I_{t}, A_{t}\}_{t = 1}^{T}$, of which only the distribution of A_t given (H_t, I_t = 1) is known (i.e., p_t(1|H_t)). We have access to a small, observational, no-treatment dataset that included the required sensor suites and thus can be used to guide the form of the generative model. Because the dataset is small, in Section 7 we construct a low dimensional Markovian generative model. Here, and in general, the prior data does not include treatments. Thus, we use the prior data to develop a generative model under no treatment.

The relatively simple generative model allows us to use only a few summary statistics from this small noisy dataset. This, of course, may lead to bias (i.e., the simple generative model may not adequately reflect the true data generating mechanism). This bias would be problematic if it results in sample sizes for which the power to detect the desired effect is below the specified power. Thus, we also use the small dataset to guide our assessment of robustness of the sample size calculator. In Section 7.4.3, a more complex generative model, suggested by exploratory data analysis, is proposed.

We follow the three steps outlined at the beginning of this subsection to provide a sample size N. Our calculator also provides standardized effect sizes. Table 6 in Section K of the Supplementary materials provides standardized treatment effect sizes, defined as, $d (t; x) = β (t; x) / {\bar{σ}}_{x} .$ The average conditional variance, ${\bar{σ}}_{x}^{2} = (1 / T) \sum_{t = 1}^{T} E [Var (Y_{t, Δ} | I_{t} = 1, A_{t}, H_{t}) | I_{t} = 1, X_{t} = x],$ is calculated using the alternative effect β(t; x) and the generative model.

TABLE 6.

Parameter estimates for each Weibull survival regression.

	Pre-peak			Post-peak
Parameter	Estimate	Std. Error	p-value	Estimate	Std. Error	p-value
Intercept	1.78	0.016	0.000	1.59	0.02	0.000
0L Stress Ep.	−0.20	0.037	0.000	0.45	0.07	0.000
1L Stress Ep.	-	-	-	−0.21	0.058	0.004
2L Stress Ep.	-	-	-	−0.16	0.07	0.020
Log(scale)	−0.24	0.015	0.000	−0.31	0.05	0.000


7. Sense2Stop.

In the following, we take the general three-step procedure and walk through how to adapt it to the specifics of the Sense2Stop study and form the sample size calculator. Recall the last step involves a variety of simulations to assess robustness to the assumptions underlying the generative model; this step is provided in Section 7.4.

As noted previously, Sense2Stop is a 10 day study; the first day is the “quit day”, the day the participant quits smoking. Recall that participants wear the AutoSense sensor suite [Ertin et al., 2011], which provides a variety of physiological data streams. During the conduct of the study, the stratification variable X_t is constructed online. X_t = 1 if at minute t there is “sufficient evidence” that the participant is in a stress episode; otherwise X_t = 0. That is, until there is sufficient evidence on whether the participant is in a stress episode, X_t remains unknown. Further information on these episodes follows. First, every minute, a support vector machine (SVM) algorithm is applied to a number of ECG and respiration features constructed from the prior one minute stream of sensor data. The output of the SVM is then transformed to obtain a stress “likelihood” in (0, 1); see Hovsepian et al. [2015] for details. This output (in (0, 1)) across the minute intervals is further smoothed to obtain a smoother stress likelihood time series. Next, a Moving Average Convergence Divergence approach is used to identify minutes at which the trend in the stress likelihood is going up and when it is going down; see Sarker et al. [2016] for details. The beginning of an episode is marked by the start of a positive-trend interval in the stress “likelihood.” Recall the peak of an episode is the end of a positive-trend interval followed by the start of a negative-trend interval. If the area under the curve from the beginning of the episode to the minute that the peak of the episode is detected exceeds a threshold, then at this time the individual is classified as stressed for all minutes t in the episode (i.e., X_t = 1). At all other times, the participant belongs to the not classified as stressed stratum (i.e., X_t = 0). Figure 1 visualizes this episodic pattern. The threshold is based on prior data from lab experiments and was evaluated on independent test datasets (from both lab and field) in terms of the F1 score (the harmonic mean of precision and recall [Wikipedia, 2017]) for use in detecting physiological stress.

Next, we build the simulation-based calculator assuming the primary hypothesis is H₀ : β(t; x) = 0; t = 1,...,T; x ∈ {0, 1} and the test statistic is as given in (4). Small sample corrections are used in constructing the test statistic as discussed in Section 5; see Section J in the Supplementary materials for additional details.

7.1. Simulation-based calculator.

We start by choosing inputs for the sample size formula as outlined in Section 6.1. We set the desired type 1 and type 2 error rates to be 0.05 and 0.20 respectively. We next specify the targeted alternative $β (t; x) = f_{t} {(x)}^{'} β_{c}^{⋆}$ for $β_{c}^{⋆} \in ℝ^{q_{c}} .$ The scientific team suspected that if there is an effect of the mindfulness reminders, then this effect might increase as participants begin to practice the mindfulness exercises and then decrease due to habituation. Thus, we select $f_{t} {(X_{t})}^{'} = (f_{t}^{'} \cdot 1_{X_{t} = 0}, f_{t}^{'} \cdot 1_{X_{t} = 1})$ where $f_{t}^{'} = (1, ⌊ \frac{t - 1}{600} ⌋, {⌊ \frac{t - 1}{600} ⌋}^{2}) .$ This leads to a non-parametric treatment effect model in the stratification variable X_t, and a piecewise constant treatment effect model in time given X_t = x that is quadratic as a function of “day in study.” In this case, the dimension of the L₂ projection is $q_{c} = 3 \cdot 2 = 6$, $β_{c}^{⋆} = (β_{c, 0}^{⋆}, β_{c, 1}^{⋆}) \in ℝ^{6}$, and the targeted alternative is $β (t; x) = f_{t}^{'} β_{c, x}^{⋆}$ for x = 0, 1. Next, to elicit enough information to specify $β_{c}^{⋆},$ we ask scientists to specify for each level of X, (1) an initial conditional effect, (2) the day of maximal effect $(t_{x}^{⋆})$ and (3) the average conditional treatment effect ${\bar{β}}_{c, x} = T^{- 1} \sum_{t = 1}^{T} β (t; x) .$ This set of conditions uniquely identifies the subvector $β_{c, x}^{⋆};$ therefore, the conditions over each level of X combine to uniquely identify the vector $β_{c}^{⋆} = (β_{c, 0}^{⋆}, β_{c, 1}^{⋆})$ as desired. For Sense2Stop, we will target the same alternative for both levels of the stratification variable X_t; thus, $β_{c, 0}^{⋆} = β_{c, 1}^{⋆} .$ To set this common alternative, we use the following values: the day of maximal effect is day 5 and the initial conditional effect is 0. We consider three possible common values of ${\bar{β}}_{c, 0} = {\bar{β}}_{c, 1}$, denoted $\bar{β}$ in Table 2.
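
One way to see that the three elicited conditions pin down the quadratic is to write them as a 3 by 3 linear system. The sketch below does this under assumptions of our own: days are indexed d = 0,...,9 to match ⌊(t − 1)/600⌋, the "day of maximal effect" is treated as the stationary point of the quadratic in d, and the function name and the peak index in the usage line are illustrative.

```python
import numpy as np


def quadratic_alternative(initial_effect, peak_day_index, average_effect, n_days=10):
    """Coefficients (b0, b1, b2) of beta(d) = b0 + b1*d + b2*d**2 over day
    indices d = 0, ..., n_days - 1, from the three elicited conditions."""
    d = np.arange(n_days)
    A = np.array([
        [1.0, 0.0, 0.0],                     # beta(0) equals the initial effect
        [0.0, 1.0, 2.0 * peak_day_index],    # derivative of beta is zero at the peak
        [1.0, d.mean(), (d ** 2).mean()],    # average of beta(d) over the days
    ])
    rhs = np.array([initial_effect, 0.0, average_effect])
    return np.linalg.solve(A, rhs)


# Hypothetical usage: zero initial effect, peak at index 4, average 0.03.
b0, b1, b2 = quadratic_alternative(0.0, 4, 0.03)
days = np.arange(10)
beta_by_day = b0 + b1 * days + b2 * days ** 2
```

Because β(t; x) is constant within a day and every day contributes the same number of minutes, averaging over day indices is the same as averaging over t. Note that the stationary point is only a maximum when the solved b2 is negative, which is worth checking for any elicited inputs.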

TABLE 2.

Estimated sample size, N, achieved power, and achieved type I error.

	Sample size	Power	Type I error
$\bar{β}$ = 0.030	50	80.6%	5.1%
$\bar{β}$ = 0.025	67	80.7%	4.4%
$\bar{β}$ = 0.020	127	80.6%	5.6%


Here we set the control variables to g_t(H_t) = f_t(X_t). Furthermore, suppose the formula for randomization probability depends only on past values of the time-varying variable X_t, availability I_t, and treatments A_t. We use the formula for p_t(a | h_t) provided in Section F of the Supplementary materials. One of the inputs to the randomization formula at an available decision point t is the expected number of episodes during the remaining part of the day that will be classified as stressed (X = 1) and the expected number of episodes during the remaining day that will not be classified as stressed (X = 0). The generative model developed below is used to provide this input. See Section F in the Supplementary materials for further details and the specification of other inputs to this randomization formula.

7.1.1. Generative Model.

We briefly overview the procedure for constructing the generative model. We then move on to the specifics, highlighting the rationale behind each decision. First, the stratification variable process X_t is a state-space stochastic process. Markov chains are a natural candidate for such processes with small, finite state-spaces. They are computationally tractable and easy to discuss with the scientific team. Second, availability is tied to the episodic nature of the stress classifier as described at the beginning of Section 7, with pre-peak, peak, and post-peak phases to each episode. To handle this, we specify a “structured Markov chain” (X_t, U_t) where t is at the minute level, X_t denotes the episode type (“Stress”, “Non-stress”), and U_t denotes the current episode phase (“pre-peak”, “peak”, and “post-peak”). The episodic nature of the data is due to the complexities of the underlying physiology of stress and the particulars of the stress classifier.

We now use a subset of the data collected in an observational, no-treatment smoking cessation study of 61 cigarette smokers (hereafter called the “Minnesota dataset”) [Saleheen et al., 2015] to inform the generative model of the longitudinal trajectory $\{X_{t}, I_{t}\}_{t = 1}^{T} .$ Of the 61 participants, 50 had sufficiently high-quality electrocardiogram data to construct the episodes and infer the stress classification. This subset is reported in Sarker et al. [2017]. From these data, we calculate the sample moments:

For each episode type (i.e., x ∈ {0, 1}), the probability that the next episode will be a stress episode – i.e., a 2 by 1 vector $\bar{W}$
For each episode type (i.e., x ∈ {0, 1}), the average episode length – i.e., a 2 by 1 vector $\bar{Z}$

The sample moments are: $\bar{W}$ = (0.067, 0.519) and $\bar{Z}$ = (10.9, 12.0).

Using these sample moments, we construct a no-treatment transition matrix for the joint process V_t = (X_t, U_t), t = 1,...,600. Each episode ends in state V_t = (x, 2) for x ∈ {0, 1} and transitions to the beginning of the next episode, $V_{t + 1} = (x^{'}, 0)$ for $x^{'} \in \{0, 1\} .$ We restrict the transition matrix such that for x ∈ {0, 1}:

(x, 0) can only transition to states (x, 0) or (x, 1) (i.e., stay in state “pre-peak” or transition to state “peak”) from one minute to the next minute.
(x, 1) transitions immediately to (x, 2) with probability one (i.e., pr(V_t+1 = (x, 2) | V_t = (x, 1)) = 1); in other words, the process inhabits the “peak” state for only one minute.
(x, 2) can only transition to states (x, 2), (0, 0), or (1, 0) (i.e., stay in state “post-peak” or end the episode and begin a new one).

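We label each episode depending on the value x. We use the approximation: U_t ≠ 1 implies I_t = 0; that is, a participant can be available only in the peak minute (U_t = 1). In this minute, the episode is classified as stressed or not classified as stressed. Define ${\tilde{Z}}_{(x, u)}$ to be the length of the phase u in an episode of type x after the chain enters state (x, u). Then ${\tilde{Z}}_{(x, 1)} = 0$ for each x because as soon as the chain enters the peak (u = 1) of an episode, the chain departs. Otherwise, set ${\tilde{Z}}_{(x, u)} = ({\bar{Z}}_{x} - 3) / 2$ for u = 0 and u = 2. With the transition probabilities below, the pre-peak and post-peak phases each last ${\tilde{Z}}_{(x, u)} + 1$ minutes on average and the peak lasts one minute, so the expected episode length is ${\bar{Z}}_{x} .$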

We set the no-treatment transition probability matrix to

P_{(x, u), (x, u)}^{(0)} = {\tilde{Z}}_{(x, u)} / ({\tilde{Z}}_{(x, u)} + 1) \quad \text{and} \quad P_{(x, 1), (x, 2)}^{(0)} = 1.0

for x ∈ {0, 1} and u ∈ {0, 2}, and then set

P_{(x, 2), (0, 0)}^{(0)} = (1 - {\bar{W}}_{x}) (1 - P_{(x, 2), (x, 2)}^{(0)}) \quad \text{and} \quad P_{(x, 2), (1, 0)}^{(0)} = {\bar{W}}_{x} (1 - P_{(x, 2), (x, 2)}^{(0)})

for x ∈ {0, 1} (recall that ${\bar{W}}_{x}$ is the estimated probability that the next episode will be a stress episode). The pre-peak to peak probability is $P_{(x, 0), (x, 1)}^{(0)} = 1 - P_{(x, 0), (x, 0)}^{(0)}$, so that each row sums to one; all other entries of P⁽⁰⁾ are set to zero. Thus P⁽⁰⁾ is a deterministic function of the moments $\bar{W}$ and $\bar{Z} .$ See Table 1 for the transition matrix P⁽⁰⁾.

TABLE 1.

P⁽⁰⁾: Transition Matrix for the Markov chain, V_t, under No Treatment

		Non-stress			Stress
		Pre-peak	Peak	Post-peak	Pre-peak	Peak	Post-peak
Non-stress	Pre-peak	0.80	0.20	0.00	0.00	0.00	0.00
	Peak	0.00	0.00	1.00	0.00	0.00	0.00
	Post-peak	0.19	0.00	0.80	0.01	0.00	0.00
Stress	Pre-peak	0.00	0.00	0.00	0.82	0.18	0.00
	Peak	0.00	0.00	0.00	0.00	0.00	1.00
	Post-peak	0.09	0.00	0.00	0.09	0.00	0.82


The transition matrix P⁽⁰⁾ specified in Table 1 has stationary distribution (π_(0,0) = 0.394, π_(0,1) = 0.080, π_(0,2) = 0.394, π_(1,0) = 0.061, π_(1,1) = 0.011, π_(1,2) = 0.061).
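
The construction of P⁽⁰⁾ from the two sample moments, and its stationary distribution, can be checked with a short sketch. The state ordering (index 3x + u) and the function names are our own conventions; the pre-peak to peak entry is filled in as one minus the stay probability, which is what the row sums in Table 1 imply.

```python
import numpy as np


def build_p0(w_bar, z_bar):
    """No-treatment transition matrix over states (x, u), indexed 3 * x + u,
    with x in {0: non-stress, 1: stress} and u in {0: pre, 1: peak, 2: post}."""
    P = np.zeros((6, 6))
    for x in (0, 1):
        z_tilde = (z_bar[x] - 3.0) / 2.0
        stay = z_tilde / (z_tilde + 1.0)
        pre, peak, post = 3 * x, 3 * x + 1, 3 * x + 2
        P[pre, pre], P[pre, peak] = stay, 1.0 - stay   # stay pre-peak or reach the peak
        P[peak, post] = 1.0                            # the peak lasts one minute
        P[post, post] = stay
        P[post, 0] = (1.0 - w_bar[x]) * (1.0 - stay)   # next episode is non-stress
        P[post, 3] = w_bar[x] * (1.0 - stay)           # next episode is stress
    return P


def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue closest to 1, normalized to sum to one."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()


P0 = build_p0(w_bar=(0.067, 0.519), z_bar=(10.9, 12.0))
pi = stationary_distribution(P0)
```

Rounding `P0` to two decimals should reproduce the entries of Table 1, and `pi` should agree with the stationary distribution reported above to about three decimals.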

7.2. Generative model under treatment.

Next, we form the generative model under treatment. We make the simplifying assumption that following treatment (i.e., A_t = 1), V_t+j evolves as a discrete-time Markov chain but with respect to a different transition matrix $P_{t}^{(1)}$ for each of the subsequent j = 1,...,60 minutes. After the hour, assuming a subsequent treatment notification is not provided, the time-varying stratification variable returns to evolution as a Markov chain with transition matrix P⁽⁰⁾. Thus,

pr (V_{t} = (x, u) | V_{t - 1} = (x^{'}, u^{'}), H_{t - 1}) = \begin{cases} {[P^{(0)}]}_{(x^{'}, u^{'}), (x, u)} & \text{if } A_{t - s} = 0, s = 1, \dots, 60 \\ {[P_{t}^{(1)}]}_{(x^{'}, u^{'}), (x, u)} & \text{otherwise} \end{cases}

Because the alternative β(t; x) is constant within each day, we will construct a transition matrix, $P_{t}^{(1)}$ , that will only depend on t through the day of decision t. Thus, we use the notation $P_{d (t)}^{(1)}$ instead of $P_{t}^{(1)}$ where d(t) is the day of decision time t.

Recall the treatment effect is the effect, on the percent of time stressed in the next hour, of providing a notification at time t to practice stress-reduction exercises and no more notifications within the next hour versus no notification at time t and no notifications over the next hour. Thus, the reference policy sets the treatments a_t+1,...,a_t+Δ−1 to 0 and the expected proximal response under the reference policy can be computed analytically for any combination of x and a (Δ = 60). See Section K.1 of the Supplementary materials for derivations of the analytic forms below. When a = 1, under the proposed generative model the above expectation is equal to $Δ^{- 1} \sum_{s = 1}^{Δ} \sum_{u \in \{0, 1, 2\}} {[{(P_{d (t)}^{(1)})}^{s}]}_{(x, 1), (1, u)} .$ When a = 0, the expectation is equal to the fraction of time stressed within the next hour under the reference policy of no actions for that hour, $Δ^{- 1} \sum_{s = 1}^{Δ} \sum_{u \in \{0, 1, 2\}} {[{(P^{(0)})}^{s}]}_{(x, 1), (1, u)} .$
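
The two analytic expressions above share the same form, so a single helper suffices to evaluate either. The sketch below assumes a 6 by 6 transition matrix with states (x, u) ordered as index 3x + u; the function name is illustrative.

```python
import numpy as np


def expected_fraction_stressed(P, x, delta=60):
    """Delta^{-1} * sum_{s=1}^{Delta} sum_u [P^s]_{(x,1),(1,u)} for a 6x6
    transition matrix P whose states (x, u) are ordered as index 3 * x + u."""
    row = np.zeros(6)
    row[3 * x + 1] = 1.0            # start at the peak of an episode of type x
    total = 0.0
    for _ in range(delta):
        row = row @ P               # distribution of the chain one minute later
        total += row[3:].sum()      # probability of being in any stress state
    return total / delta
```

For a candidate treatment matrix `Q` and a no-treatment matrix `P0`, the implied proximal effect in stratum x is `expected_fraction_stressed(Q, x) - expected_fraction_stressed(P0, x)`. Propagating a row vector avoids forming the matrix powers explicitly.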

Given the alternative β(t; x) for a particular day, we set $P_{d (t)}^{(1)}$ equal to

\arg \min_{Q \in \mathcal{P}} \sum_{x \in \{0, 1\}} \left( Δ^{- 1} \sum_{s = 1}^{Δ} \sum_{u \in \{0, 1, 2\}} \left( {[Q^{s}]}_{(x, 1), (1, u)} - {[{(P^{(0)})}^{s}]}_{(x, 1), (1, u)} \right) - β (t; x) \right)^{2}

where $\mathcal{P}$ denotes the set of transition matrices which satisfy the constraints discussed above. The set $\mathcal{P}$ can be parameterized in order to use general-purpose, box-constrained optimization methods to calculate $P_{d (t)}^{(1)}$ efficiently. For all calculations, we initialize with inputs equivalent to the transition matrix P⁽⁰⁾. Using this procedure, the maximum squared distance across all alternatives β(t; x) considered in this paper is 2.71 × 10⁻¹¹ (i.e., low approximation error).

7.3. Generating the simulated data.

The prior section yields the no-treatment and treatment transition matrices (i.e., P⁽⁰⁾ and $\{P_{d}^{(1)}\}_{d = 1}^{10}$) given the specified alternative {β(t; x)}. We briefly show how to use this information along with the randomization probability formula to generate synthetic data arising from a stratified micro-randomized trial. First, we generate data for each day independently. On a given day at time t, we first generate V_t using the transition equation in Section 7.2. We then assess availability, I_t, which is a deterministic function of the current value of V_t and the past sixty minute history of actions $\{A_{t - s}\}_{s = 1, \dots, 60}$. That is, $I_{t} = 1 [\sum_{s = 1}^{60} A_{t - s} = 0] \times 1 [U_{t} = 1] .$ We adjust availability in the first hour of each day to be only a function of whether an intervention was already provided that day. Given I_t = 1, we take the history H_t and generate the action at time t, A_t, using the given randomization probability formula p_t(1|H_t) found in Section F of the Supplementary materials. In order to compute the proximal response Y_t,Δ for every minute over the ten hour day (i.e., t = 1,...,600), we simulate an additional eleventh hour during which participants cannot receive treatment (i.e., participants are unavailable). The above procedure generates synthetic data for one participant in a stratified micro-randomized trial.
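
The data-generating loop for a single participant-day might look roughly like the sketch below. Several pieces are placeholders of our own: `rand_prob` stands in for the randomization formula of Supplementary Section F (which is not reproduced here), the initial state is drawn uniformly, and a single treatment matrix `P1` is used for the day. States (x, u) are again indexed as 3x + u.

```python
import numpy as np


def simulate_day(P0, P1, rand_prob, n_minutes=600, delta=60, seed=0):
    """Simulate one participant-day and return (V, avail, A, Y).
    rand_prob(t, x, past_actions) is a hypothetical stand-in for p_t(1 | H_t)."""
    rng = np.random.default_rng(seed)
    horizon = n_minutes + delta                   # extra hour with no treatment allowed
    V = np.empty(horizon, dtype=int)
    A = np.zeros(horizon, dtype=int)
    avail = np.zeros(horizon, dtype=int)
    state = rng.integers(6)                       # arbitrary initial state (assumption)
    for t in range(horizon):
        recent = A[max(0, t - delta):t].sum() > 0     # treated within the past hour?
        state = rng.choice(6, p=(P1 if recent else P0)[state])
        V[t] = state
        x, u = divmod(state, 3)
        # Available only at a peak minute, with no treatment in the past hour,
        # and never during the appended final hour.
        avail[t] = int(t < n_minutes and not recent and u == 1)
        if avail[t] and rng.random() < rand_prob(t, x, A[:t]):
            A[t] = 1
    X = V // 3
    # Proximal response: fraction of the next `delta` minutes spent in a stress state.
    Y = np.array([X[t + 1:t + 1 + delta].mean() for t in range(n_minutes)])
    return V[:n_minutes], avail[:n_minutes], A[:n_minutes], Y
```

A call such as `simulate_day(P0, P1, lambda t, x, past: 0.05)` would use a constant randomization probability, which is simpler than the actual formula but enough to exercise the availability logic. Repeating the call with different seeds across days and participants yields one simulated trial.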

7.3.1. The test statistic.

The above provides the generative model for use in the simulation-based sample size calculator. Next, consider the choice of the test statistic for use in calculating the sample size. In the test statistic (4), we set the time t reference probability ${\tilde{p}}_{t} (1 | x)$ equal to $\sum_{x = 0, 1} π_{(x, 1)} (1.5 / [(600 - 1.5 \cdot 60) π_{(x, 1)}]) = 5.88 \times 10^{- 3} .$ Recall that the numerator of the weight, w_ct, in (3) is ${\tilde{p}}_{t} (A_{t} | x) \prod_{s = 1}^{Δ - 1} 1 [A_{t + s} = 0] .$ The probability, ${\tilde{p}}_{t} (1 | x)$ is equal to the daily average number of treatments while in state x divided by the daily average number of times the participant is available and in state x, marginalized over the state x. In the denominator, the term 1.5·60 is subtracted off the total number of decision points due to the availability constraints following treatment.

The test statistic (4) with the above choice of reference probabilities, and the above generative model are used to generate the sample sizes in Table 2. The column labeled, Sample Size, in this table provides the estimated sample size to detect a specified alternative for the conditional proximal effect given power of 80% and significance level 5% for the Sense2Stop study. Recall that our input for the day of maximal effect is day 5 and the input for the initial conditional effect is 0 for both levels of the time-varying variable X_t. The average treatment effects ${{\bar{β}}_{x} = T^{- 1} \sum_{t = 1}^{T} β (t; x)}_{x = 0, 1}$ are assumed equal across levels X and set to $\bar{β};$ in the tables below, three values of $\bar{β}$ are considered. Achieved significance levels under the null are included.

7.4. Evaluation of Simulation Calculator for the Smoking Cessation Study.

Recall that the relatively simple generative model allowed us to use only a few statistics from the Minnesota dataset described in Section 7.1.1. There are two concerns: (1) whether or not the participants in the Minnesota study are representative of the future participants in Sense2Stop, and (2) whether or not the generative model built upon a few sample moments adequately captures the variation in the unknown longitudinal distribution $\{(X_{t}, U_{t}, I_{t})\}_{t = 1}^{T}$ under no treatment. If (1) does not hold, then the sample moments may be biased (scenario A). If (1) holds and (2) does not, it may be that prior scientific knowledge can suggest potential deviations that are difficult to estimate given the small size of prior data (scenario B). Alternatively, if (1) holds and (2) does not, it may be that we can account for additional variation by fitting more complex models to the Minnesota dataset (scenario C). Any of these scenarios may be problematic if it results in sample sizes for which the power to detect the desired effect is below the specified power. Therefore, we construct a feasible set of alternative generative models to which the sample size calculator should be robust. Note that this is not an exhaustive list, but it highlights three important scenarios we expect to occur in practice.

7.4.1. Misspecification of transition matrix P⁽⁰⁾.

For scenario A, we consider situations where the generative model for the future Sense2Stop participants can be constructed in the same manner; however, the correct moment inputs for Sense2Stop are deviations from the sampled moments of the Minnesota study. Let B_{(ϵ, ϵ^′)} denote an (ϵ, ϵ^′)-ball around the inputs $(\bar{W}, \bar{Z})$; that is,

$B_{(ϵ, ϵ^{'})} = \{(W, Z) \mid ‖ W - \bar{W} ‖_{\infty} \leq ϵ \text{ and } ‖ Z - \bar{Z} ‖_{\infty} \leq ϵ^{'}\}.$

For each (W, Z) ∈ B_{(ϵ, ϵ^′)}, we wish to compute the achieved power under the alternative generative model where V_t under no treatment evolves as a Markov chain with transition matrix P constructed from inputs W and Z; however, this is computationally prohibitive. Simulation suggests power to be a smooth, non-increasing function of both ϵ and ϵ^′, so instead we focus on computing power for the following subset of B_{(ϵ, ϵ^′)}:

$Ω_{(ϵ, ϵ^{'})} = \{(W, Z) \mid W \in \bar{W} \pm \{(ϵ, - ϵ), (ϵ, ϵ)\} \text{ and } Z \in \bar{Z} \pm \{(ϵ^{'}, - ϵ^{'}), (ϵ^{'}, ϵ^{'})\}\}.$

For each pair (W, Z) ∈ Ω_{(ϵ, ϵ^′)} we compute the associated transition matrix P; then we compute the sequence of transition matrices $P_{d (t)}^{(1)}$ which maintain the correct alternative treatment effect. We define the power for B_{(ϵ, ϵ^′)} to be the minimum power across (W, Z) ∈ Ω_{(ϵ, ϵ^′)}.
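The corner set and the minimum-power rule can be made concrete with a short sketch. Everything named here is hypothetical: `corner_set`, `omega`, `minimum_power`, and the stand-in `toy_power` are illustration only, and the numeric values used for the centers are placeholders rather than the moments reported in the paper. In practice `toy_power` would be replaced by a full simulation of achieved power for the given inputs.

```python
from itertools import product


def corner_set(center, eps):
    """Four points: center +/- (eps, -eps) and center +/- (eps, eps)."""
    a, b = center
    return [(a + s1 * eps, b + s2 * eps) for s1, s2 in product((1, -1), repeat=2)]


def omega(w_bar, z_bar, eps, eps_prime):
    """All (W, Z) pairs with W a corner around w_bar and Z a corner around z_bar."""
    return list(product(corner_set(w_bar, eps), corner_set(z_bar, eps_prime)))


def minimum_power(pairs, power_fn):
    """Worst-case power over the finite set of input pairs."""
    return min(power_fn(w, z) for w, z in pairs)


def toy_power(w, z):
    # Hypothetical stand-in for a simulation-based power estimate.
    return 0.8 - 0.5 * abs(w[0] - 0.07) - 0.01 * abs(z[0] - 10.9)


# Placeholder centers (not the paper's reported moments).
pairs = omega((0.07, 0.50), (10.9, 12.0), eps=0.01, eps_prime=2.0)
worst_case = minimum_power(pairs, toy_power)
```

The set has 4 × 4 = 16 input pairs, so the worst case requires sixteen power simulations per tolerance level rather than a search over the whole ball.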

Selection of (ϵ, ϵ^′) is driven by observed variation in the Minnesota dataset. For selection of ϵ^′, we note that the standard deviations of non-stress and stress episode durations in the Minnesota dataset are 6.89 and 6.48, respectively. Moreover, the standard errors of the sample moments $\bar{Z}$ were only 0.12 and 0.28, respectively. Thus, we chose ϵ^′ ∈ {2, 4}. To select ϵ, we observe that the standard errors of the moment estimates $\bar{W}$ are 0.005 and 0.03 for non-stress and stress episodes, respectively. Thus, we set ϵ ∈ {0.01, 0.02}.

Table 3 presents the minimum achieved power, under the previously calculated sample sizes, over Ω_{(0.02, 4)} and Ω_{(0.01, 2)}, respectively. For both (ϵ, ϵ^′) = (0.01, 2) and (ϵ, ϵ^′) = (0.02, 4), the achieved power is well below the pre-specified 80% level for all three choices of the average treatment effect $\bar{β}$.

TABLE 3.

Misspecification of transition matrix P⁽⁰⁾: minimum achieved power (%) over the set of matrices in $Ω_{(ϵ, ϵ^{'})}$.

	$(ϵ, ϵ^{'}) = (0.02, 4)$	$(ϵ, ϵ^{'}) = (0.01, 2)$
$\bar{β}$ = 0.030	57.5	61.5
$\bar{β}$ = 0.025	43.9	52.2
$\bar{β}$ = 0.020	40.4	65.6


7.4.2. Deviations from a time-homogeneous transition matrix under no treatment.

For scenario B, we consider a deviation suggested by prior knowledge: namely, that stress dynamics differ between weekends and weekdays. Because the prior Minnesota study was small, this proposed deviation is not data-driven but scientifically motivated. It suggests a different type of misspecification of the transition matrix P⁽⁰⁾, that of time-inhomogeneity; as before, the treatment effect is still correctly specified. In particular, suppose that the assumed transition matrix, P⁽⁰⁾, is correct for weekdays but that, in reality, the transition matrix under no treatment on the weekend is $P_{weekend}^{(0)} \neq P^{(0)}$. The weekend is defined as d(t) = 6 and 7 (i.e., all participants enter the study on a Monday). We specify $P_{weekend}^{(0)}$ via inputs $({\bar{W}}_{weekend}, {\bar{Z}}_{weekend})$, which we set to one of two possible values:

$\underbrace{((0.04, 0.45), (10.9, 12.0))}_{\text{weekend inputs (1)}}$ or $\underbrace{((0.10, 0.60), (10.9, 12.0))}_{\text{weekend inputs (2)}}$.

Using these inputs, we construct two alternate versions of what the true transition matrix $P_{weekend}^{(0)}$ might be. For input (1), the individual is less likely to enter a stress episode over the weekend; for input (2), the individual is more likely to enter a stress episode over the weekend. In both cases, the average episode lengths are assumed equal to $\bar{Z}$.

Table 4 presents achieved power under these alternative generative models. We see that the achieved power is below the pre-specified 80% threshold in each case except for $\bar{β} = 0.020$ under weekend input 1. If the scientist thought such deviations feasible, then the above analysis suggests for Sense2Stop that the sample size be set to ensure at least 80% power over a set of feasible time-inhomogeneous choices for the no-treatment transition matrix.

TABLE 4.

Estimated power under generative model with time-inhomogeneous Markov chain.

	Estimated power
	Weekend Input 1	Weekend Input 2
$\bar{β}$ = 0.030	79.2	69.8
$\bar{β}$ = 0.025	72.5	66.0
$\bar{β}$ = 0.020	81.5	76.4


7.4.3. Deviations from a Markovian generative model.

For scenario C, we fit a semi-Markov generative model to the small, observational Minnesota study. This accounts for additional variation in the prior study, but the resulting generative model may not represent behavior for future Sense2Stop participants. That is, the model may overfit the Minnesota study data, and not generalize well to the future Sense2Stop participants. After presenting data analysis for the semi-Markovian deviation, we then assess robustness of the sample size calculator to this data-driven deviation.

We start by considering the episodic transition rule. The Markovian model assumes that the episode transitions depend only on the prior episode classification. We test this by fitting a logistic regression with episode classification as the response variable and lagged values of episode classification as well as additional summaries of past history, including prior episode durations and time of day, as covariates. Analysis suggests that neither time of day nor prior episode duration was statistically significant. We used forward selection to determine the number of lagged values of episode classification, leading to inclusion of two lags. Table 5 presents the estimates of the logistic regression along with robust standard errors and confidence intervals.

TABLE 5.

Parameter estimates for the logistic regression. Response is indicator of current episode being a stress episode.

Parameter	Estimate	Std. Error	95% LCL	95% UCL
Intercept	−2.83	0.10	−3.03	−2.63
1L Stress Ep.	2.75	0.20	2.37	3.14
2L Stress Ep.	0.71	0.22	0.27	1.14


This model leads to slightly different behavior of the transition rules. For example, given the prior episode was a stress episode, the probability of the next episode being a stress episode ranges from 0.480 (two-lagged prior episode was non-stress) to 0.652 (two-lagged prior episode was stress). Given the prior episode was a non-stress episode, the probability of the next episode being a stress episode ranges from 0.056 (two-lagged prior episode was non-stress) to 0.107 (two-lagged prior episode was stress). Table 5 leads to a different Markovian model in which the state is $(X_{t}, U_{t}, L_{t}^{(1)})$, where $L_{t}^{(1)}$ denotes the classification of the prior episode.
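The four transition probabilities quoted above follow from applying the inverse logit to the point estimates in Table 5. The short sketch below reproduces them; the dictionary keys and the function name `stress_probability` are labels chosen for this illustration, while the coefficients are those reported in the table.

```python
import math

# Point estimates from Table 5 (intercept, one-lag and two-lag stress indicators).
COEF = {"intercept": -2.83, "lag1_stress": 2.75, "lag2_stress": 0.71}


def stress_probability(lag1_stress, lag2_stress):
    """P(next episode is stress | indicators for the previous two episodes)."""
    eta = (COEF["intercept"]
           + COEF["lag1_stress"] * lag1_stress
           + COEF["lag2_stress"] * lag2_stress)
    # Inverse logit of the linear predictor.
    return 1.0 / (1.0 + math.exp(-eta))


table = {(l1, l2): round(stress_probability(l1, l2), 3)
         for l1 in (0, 1) for l2 in (0, 1)}
# table[(1, 0)] and table[(1, 1)] give the range after a stress episode;
# table[(0, 0)] and table[(0, 1)] give the range after a non-stress episode.
```

Rounded to three decimals, the entries agree with the ranges stated in the text: 0.480 to 0.652 after a stress episode and 0.056 to 0.107 after a non-stress episode.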

We next examine the pre- and post-peak durations. Figure 3 shows histograms of the pre- and post-peak durations in the analyzed subset of data along with empirical Bayes estimates of the probability density functions under both exponential and Weibull distribution specifications. We recognize that the durations are discrete whereas these distributions are continuous; the continuous distributions are fit for simplicity. When generating the episode duration, we generate a random variable from the continuous distribution and take the integer part of that random variable. It is evident from the figures that the Weibull distribution is more appropriate.
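A minimal sketch of this discretization step, using the Python standard library, is given below. The scale and shape values are hypothetical placeholders (they are not the Table 6 estimates), and adding one to the truncated draw reflects one reading of the footnote stating that the models are fit to duration minus one.

```python
import random


def draw_duration(scale, shape, rng):
    """Draw a discrete pre- or post-peak duration in minutes.

    A continuous Weibull variate models the time above the one-minute minimum;
    its integer part is taken and the guaranteed minute is added back.
    """
    continuous_excess = rng.weibullvariate(scale, shape)  # (scale, shape) order
    return 1 + int(continuous_excess)


rng = random.Random(2020)  # fixed seed so the sketch is reproducible
# Hypothetical parameters for illustration only.
durations = [draw_duration(scale=4.0, shape=0.8, rng=rng) for _ in range(1000)]
```

Every simulated duration is an integer of at least one minute, consistent with each episode containing at least one pre-peak and one post-peak minute.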

Fig 3: Histograms of pre- and post-peak durations for the Minnesota study. Empirical Bayes pdfs for exponential (red) and Weibull (black) densities are overlaid.

Table 6 presents the parameter estimates for this over-fit model to the duration data assuming a Weibull distribution.^d Like the episodic transition rules, the pre- and post-peak durations now depend on the current episode classification as well as the prior episode classifications. The exploratory data analysis suggests a semi-Markovian model in which the pre- and post-peak durations are Weibull distributed, and the state is given by $(X_{t}, U_{t}, L_{t}^{(1)}, L_{t}^{(2)})$, where $L_{t}^{(i)}$ denotes the classification of the ith prior episode. For the pre-peak model, the one- and two-lagged indicators of a stress episode (“1L” and “2L Stress Ep.” in Table 6) were not statistically significant and were thus excluded from the model.

Next, we test robustness of the sample size calculator to the semi-Markovian deviations described above. To test the calculator, we generate data using the no-treatment semi-Markov model specified in Section L in the Supplementary materials. The data are simulated so that the treatment effect used by the calculator is correct. See Section L in the Supplementary materials for a discussion of how this was achieved. Table 7 presents achieved power under these alternative generative models. We see that the achieved power is well above the pre-specified 80% threshold in each case. Therefore, the sample size calculator is robust to such complex deviations from the Markovian generative model. For the given alternative β(t; x) and semi-Markov generative model, we calculate the standardized effects. These are provided in Table 7 in Section K of the Supplementary materials.

TABLE 7.

Estimated power under semi-Markov generative model.

	Estimated power
$\bar{β}$ = 0.030	93.6
$\bar{β}$ = 0.025	88.0
$\bar{β}$ = 0.020	93.6


7.5. Adjustments to the simulation-based calculator.

In Section 7.4, we evaluated the simulation calculator built in Section 7.1. Here we make adjustments to the simulation calculator to ensure robustness. First, we note that the simulation calculator is robust to the potential semi-Markovian deviation discussed in Section 7.4.3. Next, we make the decision that we are not concerned with lack of robustness to deviations from a time-homogeneous transition matrix as discussed in Section 7.4.2. Therefore, we focus on making the simulation calculator robust to misspecification of the Markov transition matrix as discussed in Section 7.4.1.

Analysis in Section 7.4.1 suggests for Sense2Stop that the sample size should be set to ensure at least 80% power over a set of feasible choices for the transition matrix P⁽⁰⁾. We fix (ϵ, ϵ^′) = (0.01, 2) to be our tolerance to misspecification of the inputs. For each set of inputs (W, Z) ∈ Ω_{(0.01, 2)}, we compute a sample size using the simulation calculator built in Section 7.1. The maximum of this set of computed sample sizes is chosen to ensure tolerance to misspecification of the transition matrix. Table 8 presents the sample size under this procedure as well as the achieved minimum power over the set Ω_{(ϵ, ϵ^′)}.
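The selection rule described above amounts to a maximum over a finite set of calculator runs. The sketch below illustrates only that step; the function `robust_sample_size`, the two input pairs, and the lookup table standing in for the simulation calculator are all hypothetical, and the sample sizes in the lookup table are made-up numbers rather than results from the paper.

```python
def robust_sample_size(input_pairs, sample_size_fn):
    """Return (largest sample size, inputs attaining it) over the given (W, Z) pairs."""
    sizes = [(sample_size_fn(w, z), (w, z)) for w, z in input_pairs]
    return max(sizes, key=lambda item: item[0])


# Hypothetical inputs and a made-up lookup table in place of the real calculator.
pair_a = ((0.06, 0.49), (8.9, 10.0))
pair_b = ((0.08, 0.51), (12.9, 14.0))
toy_sizes = {pair_a: 90, pair_b: 120}


def toy_calculator(w, z):
    # Stand-in for running the simulation-based calculator at inputs (w, z).
    return toy_sizes[(w, z)]


n_robust, worst_inputs = robust_sample_size([pair_a, pair_b], toy_calculator)
```

Choosing the maximum guarantees that the sample size is adequate at every input pair in the set, at the cost of being conservative when the Minnesota moments are in fact accurate.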

TABLE 8.

Estimated sample size, N, and computed minimum power under ϵ = 0.01 and ϵ^′ = 2.

	Sample size	Minimum power (%)
$\bar{β}$ = 0.030	69	81.9
$\bar{β}$ = 0.025	107	80.4
$\bar{β}$ = 0.020	208	80.5


We have now used the three-step procedure to form a sample size calculator for the smoking cessation study example. For illustration, suppose we wish to detect an average conditional treatment effect $\bar{β}$ equal to 0.025. Based on the above discussion, a sample size N of 107 would be recommended to ensure power above the pre-specified 80% threshold across a set of feasible deviations from the assumed generative model.

8. Conclusion.

In this paper, we introduced the stratified micro-randomized trial (sMRT) and provided a definition and discussion of proximal treatment effects along with the dependence of this definition on a reference distribution. We proposed a simulation-based approach for determining sample size and used this approach to determine the sample size for a simplified version of the MD2K smoking cessation study. We expect that similar trial designs would be applicable in areas such as marketing and advertising in which each client is tracked and provided incentives, e.g. treatments, repeatedly over time, and it is of interest to determine in which contexts particular treatments are most effective.

An alternative test to our projection-based method is a randomization-based test. In Bojinov and Shephard [2018], exact randomization-based p-values are constructed for testing causal effects in single time series experiments. The approach relies solely on random assignment of treatment paths rather than the distribution of the test statistic for the validity of the test [Rosenberger et al., 2019]. Randomization inference, however, targets a sharp null that the treatment has no effect on the distribution of all the time-varying endogenous variables (i.e., in our setting across availability I_t, phase U_t, stratification X_t, and response Y_{t,Δ} variables). Our inferential target is more restrictive: our goal is to assess if the conditional mean for a specific outcome Y_{t,Δ} given availability (i.e., I_t = 1) and stratification variable (i.e., X_t = x) is equal to zero jointly across time t = 1,...,T and strata x = 0, 1. The authors would be very interested in future work that extends the randomization test framework to our inferential target.

While the focus here is sample size considerations, stratified micro-randomized studies yield data for a variety of interesting secondary data analyses. For example, understanding predictors of future availability is of general interest, as keeping participants engaged in the mobile health intervention is often of high concern. Moreover, there is interest in using the data in constructing “dynamic treatment regimes” (e.g., just-in-time adaptive interventions [Spruijt-Metz and Nilsen, 2014]). The stratified micro-randomized trial improves such analyses by reducing causal confounding.

Supplementary Material

Supplement

NIHMS1060597-supplement-Supplement.pdf^{(513.4KB, pdf)}

Footnotes

^a Simplified version refers to omission of study details that obscure the core health and statistical science considerations (e.g. self-report protocol, methods used to reduce data loss due to technical failures, and initial confusion in language).

^b Here reference distribution is unrelated to the notion of reference sets in randomization inference; see Rosenberger et al. [2019].

^c We subtract three as we are guaranteed one pre-peak, one peak and one post-peak minute in each episode. Dividing by two splits the remaining average time evenly between pre-peak and post-peak phases of an episode.

^d Models are fit to duration minus one as pre- and post-peak durations are guaranteed to be greater than one. Thus we are modeling the duration in the state above the minimum value of one.

SUPPLEMENTARY MATERIAL

Supplementary materials: Supplementary material for “The stratified micro-randomized trial design: sample size considerations for testing nested causal effects of time-varying treatments”

(http://www.e-publications.org/ims/support/download/imsart-ims.zip). This supplement provides important details on the Sense2Stop stratified micro-randomized trial design, additional comments on sMRT design choices, proofs and technical derivations, and sample size calculations for the marginal proximal effect.

References.

Bidargaddi N, Almirall D, Murphy S, Nahum-Shani I, Kovalcik M, Pituch T, Maaieh H, and Strecher V. To prompt or not to prompt? a microrandomized trial of time-varying push notifications to increase proximal engagement with a mobile health app. JMIR Mhealth Uhealth, 6(11):e10123, November 2018.
Bojinov I and Shephard N. Time series experiments and causal estimands: exact randomization tests and trading. Journal of the American Statistical Association (Forthcoming), 2018.
Boruvka A, Almirall D, Witkiewitz K, and Murphy SA. Assessing time-varying causal effect moderation in mobile health. To appear in the Journal of the American Statistical Association, 2017.
Dempsey W, Liao P, Klasnja P, Nahum-Shani I, and Murphy SA. Randomised trials for the fitbit generation. Significance, 12(6):20–23, 2015.
Diggle PJ, Heagerty P, Liang KY, and Zeger SL. Analysis of Longitudinal Data. Oxford Science Publications. Clarendon Press, 2002.
Ertin E, Stohs N, Kumar S, Raij A, al’Absi M, and Shah S. Autosense: Unobtrusively wearable sensor suite for inferring the onset, causality, and consequences of stress in the field. In Proceedings of the 9th ACM Conference on Embedded Networked Sensor Systems, pages 274–287, New York, NY, USA, 2011.
Free C, Phillips G, Galli L, Watson L, Felix L, Edwards P, Patel V, and Haines A. The effectiveness of mobile-health technology-based health behaviour change or disease management interventions for health care consumers: A systematic review. PLOS Medicine, 10(1):1–45, 2013.
Hong G and Raudenbush SW. Evaluating kindergarten retention policy. Journal of the American Statistical Association, 101(475):901–910, 2006.
Hotelling H. The generalization of Student’s ratio. Annals of Mathematical Statistics, 2(3):360–378, 1931.
Hovsepian K, al’Absi M, Ertin E, Kamarck T, Nakajima M, and Kumar S. cstress: Towards a gold standard for continuous stress assessment in the mobile environment. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp ’15, pages 493–504, New York, NY, USA, 2015. ACM.
Imbens GW and Rubin DB. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, New York, NY, USA, 2015.
Klasnja P, Hekler EB, Shiffman S, Boruvka A, Almirall D, Tewari A, and Murphy SA. Micro-randomized trials: An experimental design for developing just-in-time adaptive interventions. Health Psychology, 34:1220–1228, 2015.
Liang KY and Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika, 73(1):13–22, 1986.
Liao P, Klasnja P, Tewari A, and Murphy SA. Micro-randomized trials in mhealth. Statistics in Medicine, 35(12):1944–71, 2016.
Mancl LA and DeRouen TA. A covariance estimator for GEE with improved small-sample properties. Biometrics, 57(1):126–134, 2001.
Mancl LA and Leroux BG. Efficiency of regression estimates for clustered data. Biometrics, 52:500–511, 1996.
Pearl J. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009.
Pepe MS and Anderson GL. A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communications in Statistics - Simulation and Computation, 23(4):939–951, 1994.
Robins J. A new approach to causal inference in mortality studies with a sustained exposure period-application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9):1393–1512, 1986.
Rosenberger W, Uschner D, and Wang Y. Randomization: The forgotten component of the randomized clinical trial. Statistics in Medicine, 38:1–12, 2019.
Rubin DB. Bayesian inference for causal effects: The role of randomization. The Annals of Statistics, 6(1):34–58, 1978.
Saleheen N, Ali AA, Hossain SM, Sarker H, Chatterjee S, Marlin B, Ertin E, al’Absi M, and Kumar S. puffmarker: A multi-sensor approach for pinpointing the timing of first lapse in smoking cessation. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp ’15, pages 999–1010, New York, NY, USA, 2015. ACM. URL http://doi.acm.org/10.1145/2750858.2806897.
Sarker H, Tyburski M, Rahman MM, Hovsepian K, Sharmin M, Epstein DH, Preston KL, Furr-Holden CD, Milam A, Nahum-Shani I, al’Absi M, and Kumar S. Finding significant stress episodes in a discontinuous time series of rapidly varying mobile sensor data. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI ’16, pages 4489–4501, Santa Clara, California, USA, 2016. ACM.
Sarker H, Hovsepian K, Chatterjee S, Nahum-Shani I, Murphy SA, Spring B, Ertin E, al’Absi M, Nakajima M, and Kumar S. From markers to interventions: The case of just-in-time stress intervention. In Rehg JM, Murphy SA, and Kumar S, editors, Mobile Health Sensors, Analytic Methods, and Applications. Springer International Publishing, 2017.
Spruijt-Metz D and Nilsen W. Dynamic models of behavior for just-in-time adaptive interventions. Pervasive Computing, IEEE, 13(3):13–17, 2014.
Tchetgen Tchetgen EJ, Glymour MM, Weuve J, and Robins J. Specifying the correlation structure in inverse-probability-weighting estimation for repeated measures. Epidemiology, 23(4):644–646, 2012.
Vanderweele TJ, Hong G, Jones SM, and Brown JL. Mediation and spillover effects in group-randomized trials: A case study of the 4rs educational intervention. Journal of the American Statistical Association, 108(502):469–482, 2013.
Vansteelandt S. On confounding, prediction and efficiency in the analysis of longitudinal and cross-sectional clustered data. Scandinavian Journal of Statistics, 34(3):478–498, 2007.
Wikipedia. F1 score — Wikipedia, the free encyclopedia, 2017. URL https://en.wikipedia.org/wiki/F1_score. [Online; accessed 23-May-2017].
