NIHPA Author Manuscript; available in PMC: 2019 Jan 1.
Published in final edited form as: J Biopharm Stat. 2017 Nov 20;28(5):857–869. doi: 10.1080/10543406.2017.1399898

Sample Size Calculations for Blinding Assessment

Victoria Landsman 1,2, Mark Fillery 3, Howard Vernon 3, Heejung Bang 4
PMCID: PMC5960641  NIHMSID: NIHMS960904  PMID: 29157126

Abstract

Blinding is a critical component in randomized clinical trials, along with treatment effect estimation and comparisons between the treatments. Various methods have been proposed for the statistical analysis of blinding-related data, but there is little guidance for determining the sample size for this type of data, especially if blinding assessment is done in pilot studies. In this paper, we fill this gap by providing simple methods to address sample size calculations for a ‘new’ study under different research questions and scenarios. The proposed methods are framed in terms of estimation/precision or statistical testing to allow investigators to choose the method best suited to their goals. We illustrate the methods using worked examples with real data.

Keywords: blinding index, clinical trial, contingency table, masking, patient blinding

1. Introduction

Blinding has been widely perceived to be important in randomized controlled trials (RCT) and other comparative evaluations. Traditionally, blinding-related issues have been discussed more qualitatively or conceptually, for example, “blinding is important” and “double (or triple) blind is the best” (Hopton and Macpherson, 2011; Jadad, et al., 1996; Kolahi, et al., 2009; Wilsey, et al., 2016). In the last two decades, a growing number of studies on blinding have pursued a quantitative approach to study design, data collection, analysis and interpretation (Arandjelović, 2012; Bang, et al., 2010; Chow and Shao, 2004; James, et al., 1996; Jeong, et al., 2013; Wilsey, et al., 2016; Wright, et al., 2012). Nowadays, blinding is also emphasized in non-pharmacological trials, like trials involving devices or physical therapy, in order to demonstrate internal validity. Meta-analyses of blinding have offered some optimistic news about the feasibility of blinding for some interventions traditionally believed to be hard to blind, such as non-drug or non-injection interventions (Boutron, et al., 2007; Brinjikji, et al., 2010; Freed, et al., 2014; Hopton and Macpherson, 2011; Houweling, et al., 2014; Moroz, et al., 2013; Wilsey, et al., 2016).

Despite various statistical and methodological proposals, there is still little guidance concerning the determination of a minimal or adequate sample size (N) for blinding assessment in a statistically justifiable manner. In the past, it was not unusual for a statistician to advise clients seeking sample size calculations for a blinding assessment to take a sample of at least N=30, or perhaps N=100 patients, in the absence of a good reference. In some cases, statisticians may have even said, “Use any N available”. Recently, Shin et al. (Shin, et al., 2016) proposed a pilot RCT on acupuncture with 2 centers and selected a blinding index (BI) as one of the study outcomes (Bang, et al., 2004). This team recruited 40 participants, which may be reasonable for a pilot study testing a clinical outcome, but it is unclear whether it is sufficient for testing a blinding outcome. In another example, Vernon and his colleagues proposed a new study for blinding assessment of real vs. control cervical manipulation procedures with multiple chiropractors following an initial study (Vernon, et al., 2013). The investigators decided to use a BI for evaluation of blinding as the primary outcome, but found a lack of statistically sound recommendations in the blinding literature to inform the choice of the sample size (Vernon, 2017; https://clinicaltrials.gov/ct2/show/NCT01772966).

The growing interest in statistical analyses of blinding outcomes in recent years (Arandjelović, 2012; Baethge, et al., 2013; Crisp, 2015; Hertzberg, et al., 2008; Houweling, et al., 2014; Wright, et al., 2012) creates the need to address sample size calculations in order to improve the quality of inference on analysis of blinding data. The key question is how to determine the sample size for a new stand-alone study, such as a pilot study focusing on masking, evaluation of short-term blinding (Walter, et al., 2005), or part of a large phase III RCT, in which blinding assessment may be defined as a secondary aim.

In this article, we propose simple and intuitive ways to address this question. Since blinding studies, unlike evaluations of clinical effectiveness, do not have clear-cut aims, we present three different research scenarios and describe methodologies to obtain sample sizes in each case. The proposed methods can be integrated into traditional frameworks (e.g., estimation or statistical testing) with clear underlying mechanisms and operational characteristics, and the calculations are straightforward. Of note, we recommend the proposed methods for power calculations while designing a ‘new’ study, and that post-hoc power calculations be avoided (CONSORT, 2010; Hoenig and Heisey, 2001).

2. Methods: background, notation, and proposals

Studies focusing on treatment comparison or survey research generally have clear and well-defined goals, such as detecting a treatment effect with high power or ensuring that the margin of error does not exceed a desired bound. Conversely, blinding assessment studies tend to have somewhat subjective or varied goals. Regardless of the divergent goals, blinding studies are driven by two questions at the design and analysis stages: 1) was the blinding broken? and 2) if so, are the outcome data affected? These two questions can be converted into statistically relevant scenarios depending on the format of the blinding assessment data. Typically, blinding data are presented as a 2×3 table with two allocation arms (Treatment (T=1) vs. Control (C=2)) and three choices for guess (1(=T), 2(=C), 3(=Don’t know)) as in Table 1. Some researchers use a 2×2 format without the ‘Don’t know’ option, while others use a 2×5 format accounting for degree of belief (e.g., ‘strongly believe’, ‘somewhat believe’), which can be reduced to a 2×3 format (Bang, et al., 2004; James, et al., 1996; Mathieu, et al., 2014; Wright, et al., 2012).

Table 1.

Notations used to summarize typical blinding data

                 Guess, count
Allocation       1 (=T)   2 (=C)   3 (=Don’t know)   Total sample size
Treatment (T)    n11      n12      n13               n1. (= nT)
Control (C)      n21      n22      n23               n2. (= nC)

In this view, the three statistically relevant scenarios would be: 1) testing the independence of allocation and guess; 2) estimating arm-specific trinomial proportions of guess and their contrast; and 3) testing the effect of allocation-guess interaction on the clinical outcome. In the rest of this section, we derive the formulas for the sample size required for each of the three scenarios with adoption or adaptation of existing methods or ideas, followed by the Worked examples in the next section.

Let n_ij represent a cell count in a 2×3 table of blinding data (see Table 1; i=1,2 and j=1,2,3). The row sums, n1. and n2., define the sample sizes for each arm, and the total sample size equals N = n1. + n2. Unless specified otherwise, we assume 1:1 allocation (n1. = n2. = n and N = 2*n) and α=0.05. Also, we denote joint and conditional probabilities as p_ij and p_j|i, respectively. Parameter and estimator notation may be used interchangeably when the context is clear, as is common practice in research on N/power.

2.1 Scenario 1: Testing the independence of allocation and guess

The first and most natural consideration for an investigator seeing data in a 2×3 table (as in Table 1) would be a classical test of the independence of allocation (row) and guess (column), or of the homogeneity of the response across groups (Agresti, 2013; Mathieu, et al., 2014). The Pearson Chi-square and Likelihood Ratio (LR) tests are standard tests for association in an unordered r×c table, with r arms and c guesses.

Under the alternative hypothesis, the Chi-square and LR statistics, commonly denoted as X2 and G2 in the literature, have large-sample noncentral chi-squared distributions. Let p_ij denote the joint probability in cell (i,j), where i represents the allocation (i=1,2) and j represents the guess (j=1,2,3), and let p̄_ij denote the joint probability in cell (i,j) under the null (independence) hypothesis; Σ_{i,j} p_ij = Σ_{i,j} p̄_ij = 1. Using this notation, the noncentrality parameter (λ) for the Pearson Chi-square statistic equals

λ = N Σ_{i,j} (p_ij − p̄_ij)² / p̄_ij,

and the noncentrality parameter for LR statistic equals

λ = 2N Σ_{i,j} p_ij log(p_ij / p̄_ij).

The desired sample size (N) for a chi-squared test can be obtained from the power equation P(X²_{ν,λ} > χ²_ν(α)), where ν = (r−1)×(c−1) is the degrees of freedom (df) and χ²_ν(α) is the upper-α critical value. If data from ‘previous’ studies are available and we want to detect similar observed differences, p_ij and p̄_ij can be estimated as n_ij/N and (n_i.*n_.j)/N², respectively. Approximations can be done using published or built-in tables, e.g., (Agresti, 2013); see the Worked examples below.
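The power equation is easy to evaluate numerically. Below is a minimal Python sketch (an assumption of ours; the paper itself provides SAS code) using scipy’s noncentral chi-squared distribution; the function name is illustrative:

```python
from scipy.stats import chi2, ncx2

def chi2_power(lam, df=2, alpha=0.05):
    """Power of the chi-squared test with noncentrality lam and df degrees of freedom."""
    crit = chi2.ppf(1 - alpha, df)   # critical value under H0
    return ncx2.sf(crit, df, lam)    # P(X^2_{df,lam} > crit)
```

For a 2×3 table, df=2; since λ grows linearly in N (λ = N Σ (p_ij − p̄_ij)²/p̄_ij), the required N can be found by a simple search over N.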

2.2 Scenario 2: Estimation of (a) arm-specific trinomial proportions and (b) their contrast

Blinding data for a given arm i can be viewed as a sample from a trinomial distribution with probabilities (p_1|i, p_2|i, p_3|i), where Σ_{j=1}^{3} p_j|i = 1. In this subsection, we consider determination of the sample size required in each arm (n) to ensure estimation of these probabilities and/or their difference (namely, the BI) with a specified level of precision.

First, let us consider the estimation of the probabilities. In this case, the objective is to find the smallest sample size n for arm i such that the estimated proportions are simultaneously within specified distances of the true population proportions with probability at least 1−α, that is,

P( ∩_{j=1}^{3} { |p̂_j|i − p_j|i| ≤ d_j } ) ≥ 1 − α for arm i.

The sample size determination procedures proposed by Tortora (Tortora, 1978) and Thompson (Thompson, 1987) are based on the simultaneous confidence intervals (CIs) for the multinomial model. Tortora constrained the width of the jth category interval, j=1,2,3, to be ≤2 dj and obtained the following formula for n,

n = max_{j=1,2,3} [ z²_{α/(2×3)} p_j|i (1 − p_j|i) / d_j² ]

where z_{α/(2×3)} is the upper (α/6)*100th percentile of the standard normal distribution. The total sample size N can be obtained as N = 2*n, where n is the larger of the sample sizes obtained for the two arms. In the absence of prior knowledge about p_j|i, the ‘worst case’ (in view of the maximal n required) would be to use the probability vector (1/2, 1/2, 0) for each arm. Assuming d_j = d for each category j, the (maximal) n for each arm can be further simplified as

n = z²_{α/6} / (4d²) ≈ 1.4327/d² for α=0.05.

In contrast, Thompson described the general form of the ‘worst’ parameter vector under the constraint of the equal width, dj=d. His theoretical result depends on the specified level of α: for example, (1/3,1/3,1/3) is the ‘worst-case’ parameter vector if α=0.05, whereas (1/2,1/2,0) is the ‘worst-case’ parameter vector if α=0.025; see Table 1 in (Thompson, 1987). His procedure results in smaller sample sizes and hence is attractive under the condition of the equal width intervals. The conservative sample size using the Thompson’s method is given by

n = 2z²_{α/6} / (9d²) ≈ 1.2736/d² for α=0.05,   Eq (1)

which is lower than Tortora’s counterpart. Both methods result in sample size estimates that ensure a specified confidence range for the estimated probabilities.
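Both conservative formulas are straightforward to script. A Python sketch (our addition; the paper uses SAS, and the function names are illustrative) using the worst-case vectors discussed above:

```python
import math
from scipy.stats import norm

def tortora_n(d, alpha=0.05):
    """Conservative per-arm n (Tortora): worst-case vector (1/2, 1/2, 0), 3 categories."""
    z = norm.ppf(1 - alpha / 6)          # upper alpha/(2*3) quantile
    return math.ceil(z**2 / (4 * d**2))

def thompson_n(d, alpha=0.05):
    """Conservative per-arm n (Thompson): worst-case vector (1/3, 1/3, 1/3) at alpha=0.05."""
    z = norm.ppf(1 - alpha / 6)
    return math.ceil(2 * z**2 / (9 * d**2))
```

For example, tortora_n(0.05) = 574 and thompson_n(0.05) = 510, matching the figures in Section 3.2.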

Next, we consider the estimation of the BI. The BI is an arm-specific index for blinding assessment defined as a contrast between the probability of a correct guess and the probability of an incorrect guess:

BI_T = p_1|1 − p_2|1 and BI_C = p_2|2 − p_1|2.

In a 2×2 format without the ‘Don’t know’ option, the BI reduces to BI_i = 2p_i − 1 for i=T,C, where p_T = p_1|1 and p_C = p_2|2. Estimators for the BIs can be obtained by replacing the arm-specific probabilities by their estimators:

BÎ_T = (n_1T − n_2T)/n_T and BÎ_C = (n_2C − n_1C)/n_C, i.e., (n11 − n12)/n1. and (n22 − n21)/n2. in the notation of Table 1.

BI quantifies correct guessing beyond chance (50%), or the imbalance between correct vs. incorrect guesses, in blinding data as in Table 1. For example, BI=0 represents ‘random guess’; BI=0.3 represents a 30% imbalance toward the correct guess, say, 40% of participants guessed T vs. 10% guessed C in arm T; BI=−0.3 represents a 30% imbalance toward the incorrect guess. BI is usually interpreted in terms of possible blinding scenarios, along with qualitative data (e.g., reasons for guess), whenever available. After all, there are contexts in which correct guesses are not undesirable and may provide insight, such as when they reflect ‘wishful thinking’ (Bang, 2016; Brinjikji, et al., 2010).

A standard 2-sided CI for the BI with (1−α) confidence level for arm i is:

(p̂_1|i − p̂_2|i) ± z_{α/2} √{ [ p̂_1|i(1 − p̂_1|i) + p̂_2|i(1 − p̂_2|i) + 2 p̂_1|i p̂_2|i ] / n_i }.

Assuming 1:1 allocation as before, the objective is to find a sample size n such that the inequality z_{α/2} √{ [ p̂_1|i(1 − p̂_1|i) + p̂_2|i(1 − p̂_2|i) + 2 p̂_1|i p̂_2|i ] / n } ≤ d holds for a specified threshold d. Solving this inequality for n yields

n = z²_{α/2} [ p̂_1|i(1 − p̂_1|i) + p̂_2|i(1 − p̂_2|i) + 2 p̂_1|i p̂_2|i ] / d².

Note that the bracketed variance term simplifies to (p_1|i + p_2|i) − (p_1|i − p_2|i)². Using the method of Lagrange multipliers (or directly from this simplification), the variance is maximized when p_1|i = p_2|i for any fixed p_3|i, and it increases as p_3|i decreases; hence the maximal sample size is attained at the trinomial vector (p_1|i, p_2|i, p_3|i) = (1/2, 1/2, 0), simplifying the above formula to n = z²_{α/2}/d². This result has two important implications: a) allowing ‘Don’t know’ as a guess category decreases the required sample size; and b) in a 2×2 format, the maximal value of Var(BÎ_i) = 4*Var(p̂_i) is reached at the binomial vector (1/2, 1/2), resulting in the same maximal sample size n = z²_{α/2}/d². Interestingly, this sample size is 4 times the well-known conservative sample size for estimation of the probability of an event in binomial data.

To summarize, with good estimates for trinomial or binomial probabilities, the sample size for each arm is given as n = z²_{α/2} [ p_1|i(1−p_1|i) + p_2|i(1−p_2|i) + 2 p_1|i p_2|i ] / d² for a 2×3 format, and n = z²_{α/2} [ 4p_i(1−p_i) ] / d² for a 2×2 format. In the absence of good estimates, the conservative sample size

n = z²_{α/2} / d²   Eq (2)

with p=1/2 can be used for both formats. It can be seen from these formulae that a fundamental operational characteristic, common in virtually all sample size estimations, applies here as well: the more stringent the threshold (or the narrower the CI), the larger the sample size required.
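A short Python sketch of the per-arm sample size for estimating the BI with half-width d (our addition, with an illustrative function name; scipy assumed available):

```python
import math
from scipy.stats import norm

def bi_sample_size(p1, p2, d, alpha=0.05):
    """Per-arm n so that the (1 - alpha) CI for BI = p1 - p2 has half-width <= d.
    p1, p2: anticipated proportions of correct and incorrect guesses in the arm."""
    z = norm.ppf(1 - alpha / 2)
    var = p1 * (1 - p1) + p2 * (1 - p2) + 2 * p1 * p2   # trinomial variance term
    return math.ceil(z**2 * var / d**2)
```

At (p1, p2) = (0.5, 0.5) with d=0.1 this returns the conservative n = 385, and it reproduces Table 3 entries such as n = 20 for (0.1, 0.1) with d=0.2.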

2.3 Scenario 3: Testing the effect of allocation-guess interaction on the clinical outcome

Breached blinding has the potential to distort clinical findings, leading to biased estimates of treatment effects with unknown direction (Bang, 2016; Mathieu, et al., 2014). In the presence of allocation-guess interaction, the estimated average treatment effect (ATE) may depend on the guess status (1(=T), 2(=C), 3(=Don’t know)), resulting in meaningfully different ATEs in subgroups defined by guess status. For instance, the ATE estimate obtained from those who guessed T can be positive, in favor of treatment, while the estimate from those who guessed C can be negative or null. In these situations, detecting the interaction between allocation and guess with reasonable power could be of scientific interest.

Testing the effect of allocation-guess interaction on a (continuous) clinical outcome may be understood in the framework of a two-way fixed-effects unbalanced ANOVA. Let y_ijk be the outcome from the kth patient in arm i with guess j, where i=1,2; j=1,2,3; and k=1,…,n_ij, and the (non-zero) cell sizes n_ij are defined in Table 1. In the 2×3 case, a univariate general linear model can be parametrized as

y_ijk = μ_ij + ε_ijk,  i = 1, 2;  j = 1, 2, 3,

where ε_ijk are independent and identically distributed normal errors with mean zero and variance σ². This model can be written in matrix notation as y = XB + ε, where X is an N×m design matrix of zeros and ones (m=6 in our case), and B = (μ11, μ12, μ13, μ21, μ22, μ23)′ (Elston and Bush, 1964).

Testing the null hypothesis of no interaction between allocation and guess is equivalent to testing the equality of the ATE across the categories of guess. This hypothesis can be viewed as a special case of a linear hypothesis H0: LB=0 vs. HA: LB≠0, with L being a q×m contrast matrix of full rank, where q is the number of contrasts. For example, for an overall test of no interaction effect (q=2), one can take

L = [ 1  0  −1  −1   0  1
      0  1  −1   0  −1  1 ].

The F-statistic is given by

F = [ (LB̂)′ (L(X′X)⁻¹L′)⁻¹ (LB̂) / q ] / [ e′e / (N − m) ]

with B̂ = (X′X)⁻¹X′y and e′e = (y − XB̂)′(y − XB̂). Under HA, F is distributed as F(q, N−m, λ) with noncentrality parameter λ = (LB)′ (L(X′X)⁻¹L′)⁻¹ (LB) / σ². The sample size is computed by inverting the power equation P(F(q, N−m, λ) ≥ F_α(q, N−m)); see (Castelloe and O’Brien, 2001; Elston and Bush, 1964; Muller and Peterson, 1984; O’Brien, 1986; O’Brien and Shieh, 1992) for details and general theory.
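The power equation can be evaluated directly with the noncentral F distribution. A minimal Python sketch (our addition; the paper uses SAS, and the function name is illustrative):

```python
from scipy.stats import f, ncf

def interaction_power(lam, q, n_total, m=6, alpha=0.05):
    """Power of the F-test of H0: LB = 0 with q contrasts, N - m error df,
    and noncentrality parameter lam."""
    df2 = n_total - m
    crit = f.ppf(1 - alpha, q, df2)    # critical value under H0
    return ncf.sf(crit, q, df2, lam)   # P(F(q, N - m, lam) >= crit)
```

The required N is then found by increasing n_total (which also scales λ through the cell sizes) until the power reaches the target.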

In summary, use of this method requires outcome data for allocation by guess status from historical studies, preliminary data, or pre-specified values (Chow and Shao, 2004; Wright, et al., 2012). Thus, the utility of this method may be limited in practice when it is difficult to come up with plausible inputs, although it may be of ultimate interest related to blinding assessment. See Section 3.3 for an illustrative example from Chow and Shao, where suspected breached blinding might have been problematic and warranted further investigation.

3. Worked examples

Even with technically sound methods, a crucial issue is how to implement the methods and make sensible decisions in practice. In this section, we illustrate sample size calculations using inputs from published data to assist in designing a ‘new’ study.

If a new study is designed as a pilot study where blinding is the primary outcome, trialists may use our N methods directly (e.g., a wide margin can be predefined for a pilot and a narrow margin for a full trial). If blinding is defined as a secondary or tertiary outcome in a ‘new’ study, trialists may choose N for the primary clinical outcome, as is typically done in RCTs (Briggs, 2000). If N was obtained for a primary clinical outcome, we can calculate what power or precision of the blinding parameter estimate can be attained with this N. Toward making an overall conclusion, we adopt Cohen’s logic (Cohen, 1990): determine the sample size necessary to detect a negligible signal/indication of breached blinding with high probability. If the research is then carried out with that sample size and the result is not significant, the conclusion that no nontrivial signal exists is justified, at a given level. This does, in fact, probabilistically support the intended null hypothesis of no more than a trivially small signal (i.e., blinding is acceptable).

3.1. Scenario 1: Testing the independence of allocation and guess

The sample size under this scenario can be obtained manually or using the SAS macros powerRxC or Unifypow (SAS Institute, Cary, NC), among others (Castelloe and O’Brien, 2001; O’Brien and Shieh, 1992). As an illustrative example of a manual calculation, the real (rather than hypothetical) data from a study of blinding assessment of a sham cervical manipulation procedure (Vernon, et al., 2013) are used as input for the sample size calculation for a new study. Rigorous blinding evaluation was chosen as a primary aim of the new study (Vernon, 2017). The self-reported guess status collected from a secondary analysis in an earlier study is presented in Table 2 and may be used to obtain the observed joint probabilities in a 2×3 table: 0.25, 0.14, 0.11 for arm T and 0.14, 0.23, 0.125 for arm C, for guess 1, 2, 3, respectively. The expected joint probabilities under independence are: 0.195, 0.1875, 0.115 for guess 1, 2, 3 in both arms. Using this information, the noncentrality parameter for the Pearson Chi-square equals λ=0.054*N. The approximate power for a number of different N values can be obtained from the table ‘Power of Chi-squared Test for α=0.05’ in (Agresti, 2013) or from various statistical software: if N=20, λ=1.08, and the power is approximately 0.13 (with df=2). Similar calculations show that N=176 will be required to test the same hypothesis with 80% power. In addition, the noncentrality parameter for the LR test equals λ=0.055*N, which is very similar to Pearson’s, as expected. Sample SAS code is provided in the Appendix. Of note, resulting estimates can be unstable or unreliable with low cell counts.
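The manual calculation above can be reproduced in Python (our sketch; numpy/scipy assumed) directly from the observed counts in Table 2. Small rounding differences from the paper’s λ=0.054*N are expected, since the counts are used without rounding the probabilities:

```python
import numpy as np
from scipy.stats import chi2, ncx2

counts = np.array([[16.0, 9.0, 7.0],    # Real arm: guess Real / Sham / Don't know
                   [9.0, 15.0, 8.0]])   # Sham arm (Vernon et al., 2013)
p = counts / counts.sum()                       # observed joint probabilities
p0 = np.outer(p.sum(axis=1), p.sum(axis=0))     # expected under independence
lam_per_subject = ((p - p0) ** 2 / p0).sum()    # Pearson noncentrality per subject

def power(n_total, alpha=0.05, df=2):
    crit = chi2.ppf(1 - alpha, df)
    return ncx2.sf(crit, df, n_total * lam_per_subject)

n_req = 2
while power(n_req) < 0.80:     # smallest even N (1:1 allocation) reaching 80% power
    n_req += 2
```

This yields λ ≈ 0.055 per subject, power ≈ 0.14 at N=20, and N ≈ 176 for 80% power, in line with Table 2b.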

Table 2.

Blinding data from Vernon et al. (2013) and Sample size/Power

a. Blinding data
                 Guess
Allocation       Real   Sham   Don’t know   Total sample size
Real             16     9      7            32
Sham             9      15     8            32

b. Sample size and power (α=0.05)
N      Power of Pearson Chi-square   Power of LR
20     0.14                          0.14
50     0.30                          0.30
100    0.56                          0.55
176    0.80                          0.81
200    0.85                          0.86
300    0.96                          0.96

3.2. Scenario 2: Estimation of (a) arm-specific trinomial proportions and (b) their contrast

Proportions obtained from previous studies can be used to inform sample size calculations. For example, Vernon et al. (Vernon, et al., 2013) observed the trinomial proportions (0.5, 0.28, 0.22) in arm T and (0.28, 0.47, 0.25) in arm C in an immediate post-treatment evaluation of blinding. Park et al.’s acupuncture trial (Park, et al., 2005) also observed proportions of guess near 0.5 (e.g., 26/49). In the absence of prior data on trinomial proportions, or with observed proportions close to 0.5 as in the trials above, the conservative sample size for one arm can be obtained using n = z²_{α/6}/(4d²). For example, with α=0.05, the sample size equals n=574, 144, 36 for d=0.05, 0.1, 0.2, respectively. The sample sizes corresponding to Thompson’s method equal n=510, 128, 32. Since, in the blinding context, p_1|i and p_2|i near 0.5 are quite plausible (whereas extreme scenarios such as (0, 0, 1) are rare), these conservative sample sizes are reasonably justified.

Next, we discuss the estimation of BI, the contrast of the arm-specific proportions. Recent systematic reviews and meta-analyses on blinding provide new insight into the range of feasible values of the BI in different types of studies (Baethge, et al., 2013; Freed, et al., 2014; Moroz, et al., 2013). For example, Freed et al. (Freed, et al., 2014) focused on a meta-analysis of the BI in trials of psychiatric disorders. It is remarkable that a large number of studies included in Freed et al. produced a (weighted-average) BI close to 0 in the control arm, where BI=0 corresponds to ‘random guess’, supposedly the most ideal blinding scenario (Bang, 2016). Therefore, in the absence of good input about a plausible value of BI in a future study, BI=0 could be a reasonable starting point for sample size calculation.

For blinding data collected in a 2×3 format, BI=0 is implied by all parameter vectors of the form (p_1|i, p_1|i, 1 − 2p_1|i), where 0 ≤ p_1|i ≤ 1/2. Table 3 presents sample calculations for this case for two thresholds (d=0.1 and d=0.2) and p_1|i = 0.1, 0.2, 0.3, 0.4, 0.5. The results in the table clearly demonstrate that as p_1|i values get closer to zero (implying that more people are expected to answer ‘Don’t know’), the required sample size decreases. As p_1|i gets closer to 0.5, a larger sample size is required. The ‘worst case’ corresponds to p_1|i=0.5, in which case the 2×2 and 2×3 formats are equivalent and the required sample size reaches its maximum value. All of these confirm the theoretical results in Section 2.2. Notice the 5-fold difference in sample size required for the case (0.1, 0.1, 0.8) as opposed to the case (0.5, 0.5, 0) for the same α and d. The tighter the desired width of the CI (=2*d), the larger the sample size required, e.g., 4 times larger for d=0.1 vs. d=0.2.

Table 3.

Sample size (nT.) for the estimation of BI in treatment arm ( α=0.05)

BIT = P1|T − P2|T                         d=0.2         d=0.1
BIT = 0
  0.1−0.1                                 20            77
  0.2−0.2                                 39            154
  0.3−0.3                                 58            231
  0.4−0.4                                 77            308
  0.5−0.5*                                97            385
BIT = 0.1
  0.1−0.0                                 9             35
  0.2−0.1                                 28            115
BIT = 0.1, 0.2, 0.3, 0.4
  0.3−0.2 / 0.3−0.1                       48/35         189/139
  0.4−0.3 / 0.4−0.2 / 0.4−0.1             67/54/40      266/216/158
  0.5−0.4 / 0.5−0.3 / 0.5−0.2 / 0.5−0.1   86/73/59/43   342/292/235/170

P1|T = expected proportion of persons who guessed 1 (=T).
P2|T = expected proportion of persons who guessed 2 (=C).
BIT denotes the blinding index and nT. denotes the sample size in the Treatment arm.
* The required sample size in this case is equivalent to the conservative sample size in a 2×2 format (without the ‘Don’t know’ category), n = z²_{α/2}/d².

As an illustration, if a future study on blinding assessment of cervical manipulation is designed after a pilot study (Vernon, et al., 2013) and the team anticipates BI values in the range [0, 0.1], the required sample size per arm would be 86 (assuming a parameter vector of (0.5, 0.4, 0.1)) or 97 (assuming (0.5, 0.5, 0)) with d=0.2. These sample sizes correspond to N=2*86=172 or N=2*97=194, which are comparable to the N=176 obtained under Scenario 1.

As noted above, our methods can also be used to assess the power or estimation precision for blinding (defined as a secondary or tertiary outcome) in studies whose sample size was obtained for a clinical outcome. To exemplify the use of our methods for this common scenario, assume that a research team is planning a new trial to estimate the effect of real vs. sham acupuncture on muscle spasticity as the primary outcome. The Ashworth scale for muscle spasticity is defined as a dichotomized outcome (yes/no increase in muscle tone). The team is informed by a previous trial (Park, et al., 2005) and wants to detect a clinically meaningful difference of 33% vs. 22% for real vs. sham acupuncture, respectively. A sample size of n=276 per arm is required to detect this difference with α=5% and 80% power under 1:1 allocation via the Fisher exact test. If we use Thompson’s formula for a trinomial vector and the conservative formula for the BI, Equations (1) and (2), we get d=0.07 and 0.12, respectively, with the given n. Thus, we may expect that the total sample size N=552 is sufficient to make reliable blinding assessment achievable at the planning stage.
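The attainable margins can be sketched in Python (our addition; scipy assumed) by inverting Equations (1) and (2) for the given per-arm n:

```python
import math
from scipy.stats import norm

n = 276                        # per-arm n driven by the primary clinical outcome
alpha = 0.05
z6 = norm.ppf(1 - alpha / 6)   # quantile in Eq (1)
z2 = norm.ppf(1 - alpha / 2)   # quantile in Eq (2)

d_trinomial = math.sqrt(2 * z6**2 / (9 * n))   # invert Eq (1): Thompson margin
d_bi = z2 / math.sqrt(n)                       # invert Eq (2): conservative BI margin
```

This gives d ≈ 0.07 for the trinomial proportions and d ≈ 0.12 for the BI, matching the values in the text.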

3.3. Scenario 3: Testing the effect of allocation-guess interaction on the clinical outcome

Chow and Shao (Chow and Shao, 2004) analyzed Brownell and Stunkard’s data (Brownell and Stunkard, 1982), focusing on breached blinding, the role of the consent form, and its potential impact on the clinical outcome in the weight loss trial. The observed blinding data (Table 4) yielded substantial agreement between allocation and guess and very high BI values (BIT=0.67 and BIC=0.52; markedly larger than the previously suggested threshold of 0.2 (Freed, et al., 2014; Kolahi, et al., 2009; Park, et al., 2008)), which may be indicative of a possible breach in blinding. Moreover, a dramatic interaction between allocation and guess on the clinical outcome of weight loss is apparent in Figure 1, created using the summary statistics provided in the original papers (Brownell and Stunkard, 1982; Chow and Shao, 2004). These summary statistics may be used to closely reproduce the raw outcome data.

Table 4.

Blinding and outcome data from Brownell and Stunkard (1982) and Sample size/Power

a. Blinding data
                 Guess, count
Allocation       Active drug   Placebo   Don’t know   Total sample size
Active drug      19            3         2            24
Placebo          3             16        6            25

b. Mean weight loss (kg) in subgroups defined by allocation and guess
                 Guess
Assignment       Active drug   Placebo   Don’t know   Overall
Active drug      9.6           3.9       12.2         9.1
Placebo          2.6           6.1       5.8          5.6

c. Sample size and power (α=0.05)
Source                                 σ    N     Power
Main effect of guess (df=2)            4    148   0.860
                                       4    98    0.682
                                       5    148   0.668
                                       5    98    0.481
Overall interaction (df=2)             4    148   0.994
                                       4    98    0.948
                                       5    148   0.942
                                       5    98    0.806
Tailored interaction (df=1):           4    148   0.994
ATE among guess=T vs.                  4    98    0.954
ATE among guess=C                      5    148   0.947
                                       5    98    0.829

df: degrees of freedom; ATE: average treatment effect.

Figure 1. Clinical outcome by allocation and guess status. Created using data from Brownell and Stunkard (1982) and Chow and Shao (2004). drug: Guess T; placebo: Guess C; dnk: Guess ‘Don’t know’.

In this and similar situations, we would be interested in the sample size required to test the overall interaction (i.e., the equality of ATEs for all guess categories) as well as a custom interaction (e.g., the equality of ATEs between those who guessed T and those who guessed C) with sufficient power in a new study. The method for sample size determination described in Section 2.3 can be implemented using the GLMPOWER procedure in SAS; see the Appendix for sample code. Cell means, cell counts, and the error variance σ² are the required inputs. An estimate of the error variance may be obtained by fitting a two-way ANOVA model to the re-constructed raw data, or crudely approximated as σ ≈ (y_max − y_min)/6, where y_max and y_min stand for the maximal and minimal values of a reasonably symmetric outcome. Using the re-constructed raw data, we obtained σ ≈ 5, confirming the previous estimate (Chow and Shao, 2004). For illustration, we ran an additional power analysis with σ ≈ (4 − (−20))/6 = 4, using information provided in the raw data plot (Brownell and Stunkard, 1982).
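A GLMPOWER-style calculation can be sketched in Python (our addition; numpy/scipy assumed) from the Table 4 inputs. The overall-interaction contrast matrix below is a standard parameterization (an assumption on our part), and it closely reproduces the corresponding Table 4c power:

```python
import numpy as np
from scipy.stats import f, ncf

# Cell means (Table 4b) and counts (Table 4a), ordered as
# (T, guess T), (T, guess C), (T, DK), (C, guess T), (C, guess C), (C, DK)
means  = np.array([9.6, 3.9, 12.2, 2.6, 6.1, 5.8])
counts = np.array([19.0, 3.0, 2.0, 3.0, 16.0, 6.0])
sigma, N, alpha = 5.0, 148, 0.05

# Overall allocation-guess interaction (q=2 contrasts); a standard choice
L = np.array([[1.0, 0.0, -1.0, -1.0, 0.0, 1.0],
              [0.0, 1.0, -1.0, 0.0, -1.0, 1.0]])

n_cells = counts / counts.sum() * N            # cell sizes scaled to total N
cov = L @ np.diag(1.0 / n_cells) @ L.T         # L (X'X)^{-1} L'
c = L @ means                                  # LB
lam = c @ np.linalg.solve(cov, c) / sigma**2   # noncentrality parameter

q, m = L.shape
df2 = N - m
power = ncf.sf(f.ppf(1 - alpha, q, df2), q, df2, lam)
```

With σ=5 and N=148 this gives power ≈ 0.94, in line with the 0.942 reported for the overall interaction in Table 4c.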

Again, let us assume that we are planning to design a new trial, with the primary outcome defined as weight loss and blinding being selected as one of the secondary outcomes. We hypothesize in this case that the clinically meaningful difference in weight loss between treatment and placebo is about 3.5 kg with a conservative value σ=5. Assuming a 1:1 allocation, α=1% and power=95% (these strict conditions are assumed to minimize false positive results and ensure very high power), n=74 for each arm (total N=148) will be required to test the difference in means.
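The two-sample calculation above can be checked with a normal-approximation sketch (Python; scipy assumed). The normal approximation gives n=73 per arm; this is consistent with the n=74 quoted above, since exact methods (e.g., the t-based formula) give a slightly larger n:

```python
import math
from scipy.stats import norm

delta, sigma = 3.5, 5.0           # clinically meaningful difference and SD
alpha, target_power = 0.01, 0.95
z_a = norm.ppf(1 - alpha / 2)
z_b = norm.ppf(target_power)

# per-arm n for a two-sample comparison of means, 1:1 allocation
n_per_arm = math.ceil(2 * (sigma * (z_a + z_b) / delta) ** 2)
```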

Using the results in Table 4, we can conclude that this sample size will be sufficient to detect the overall and custom interactions defined above for blinding with high power (>90%) at α=5%. Let us now consider another, more conservative situation with sample size N=98. This sample size is ~30% lower than N=148 but is double the original sample size (N=49) used in Brownell and Stunkard (1982). Using this sample size with the standard assumptions of α=5% and power=80%, we will be able to detect a true difference of <3 kg between the treatment and control groups. Using the results in Table 4, we can now conclude that a sample size in the range of 100-150 would be sufficient to detect the interactions of interest defined above with power >80%, assuming σ≤5 and α=5%.

On the other hand, the power to detect the difference in ATE between those who guessed T and those who answered ‘Don’t know’ is very low (~5%), which is expected since the associated ATE lines have nearly the same slopes in Figure 1. At the same time, we might assume that those who chose ‘Don’t know’ tend to be neutral or less biased. A similar analysis can be repeated with a different classification of guess status: guessed correctly vs. guessed incorrectly vs. did not guess, as attempted by Chow and Shao.

This exercise demonstrates the feasibility of sample size and power calculations in the setting where we design a new study and reasonable blinding data along with the outcome data are available for inputs as educated guess. Although this type of data is rare in the present literature, we believe that growing research on blinding will yield more collection and reporting of similar data.

4. Discussion

In this paper, we consider three qualitatively different scenarios relevant to quantitative analysis of blinding and present methods of sample size determination for planning a future study with a blinding assessment component. The scenarios are framed in terms of estimation and precision (Scenarios 2a and 2b) as well as statistical testing and power (Scenarios 1 and 3). We illustrated the three scenarios, using real and hypothetical inputs, with Worked examples based on published data, carrying out the calculations manually or with statistical software.

The methods and examples in this paper offer users a suite of formulae that can be used to determine sample sizes for a spectrum of studies, from a pilot or feasibility study (Walter, et al., 2005) to a full-scale RCT. Since power and sample size calculations should be tailored to a specific research question, the choice of the particular method and formula depends on the goals and input availability. The proposed methods could be particularly valuable for studies testing and establishing a newly developed control, sham, or placebo intervention. Blinding assessment is also desirable for studies that test only whether two treatments are easily distinguishable, with no clinical outcomes (i.e., pure masking studies).

The proposed methods can be implemented flexibly for different purposes. For example, when designing a pilot study, researchers may decide to use the formulae in this paper with a wider margin of error than would be suitable for a real trial: for example, 0.2 can be set as the targeted threshold for estimation of the BI in a pilot study, and 0.1 in a larger, actual trial. If a research team decides to evaluate blinding in a very large trial, the sample size formulae may be used to select a subsample from the entire sample, e.g., (COMMIT, 2005), to which a blinding questionnaire could be administered. As with power/sample size analysis for a clinical outcome, if the sample size used in blinding assessment is substantially smaller than those justified by the methods in our paper, the designation of “pilot or feasibility” study may be reasonable.
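The margin-of-error idea can be turned into a quick planning calculation. As a conservative sketch (not the exact variance formula from Bang et al., 2004): within one arm, the BI is a difference of two multinomial proportions, so its variance is at most 1/n, and requiring the 95% CI half-width to stay below the margin yields a per-arm sample size. The function name below is ours, for illustration.

```python
import math

def n_per_arm_for_bi(margin, z=1.96):
    # Within one arm, the blinding index is p(correct) - p(incorrect),
    # a difference of two multinomial proportions with variance
    # (p1 + p2 - (p1 - p2)^2) / n <= 1/n.  Using this conservative
    # bound, require z * sqrt(1/n) <= margin and solve for n.
    return math.ceil((z / margin) ** 2)

print(n_per_arm_for_bi(0.2))  # pilot-study margin  -> 97 per arm
print(n_per_arm_for_bi(0.1))  # full-trial margin   -> 385 per arm
```

Smaller sample sizes follow if one is willing to plug anticipated response proportions into the exact variance instead of the worst-case bound.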

We now discuss some limitations. First, we considered studies with a 1:1 allocation and 2 arms, since blinding data are optimally interpreted in this setting. With a 2:1 allocation, we are not fully certain whether P(correct guess)=0.5 is still the ideal value; some patients may use 50% in their guess (between T vs. C), while others may use 66.7% if they were informed about the allocation ratio and remember it (Bang, 2016; Brownell and Stunkard, 1982). N/power calculations for advanced designs (e.g., 2:1 allocation, clustering or crossover designs), as well as accounting for other complex issues (e.g., informative drop-out) in RCTs, are possible topics for future research (Bang, et al., 2010; Park, et al., 2005; Roy, 2012; Zhang, et al., 2013).

Second, we did not handle multiple testing rigorously, partly because we do not pursue rigorous hypothesis testing in blinding. Of note, we framed the problem in terms of classical superiority testing with a 2-sided CI/test, not an equivalence test or a 1-sided hypothesis/CI, which may be more relevant to blinding. The main reason for our decision was to avoid unnecessary complexity (e.g., an equivalence margin), because blinding is a tool, not a goal, and numerical analysis alone should not be used for binary designation (e.g., success or failure) (Bang and Park, 2013; Zhang, et al., 2013). Along the same lines, we emphasize the importance of the estimation-based method described in Scenario 2 in the blinding context. Finally, post-hoc or retrospective power calculation should be avoided (CONSORT, 2010; Hoenig and Heisey, 2001).

In closing, we have proposed methods for sample size and power calculations that can be used for exploratory or planning purposes and can address the different research questions, inputs and settings commonly encountered in blinding assessment.

Acknowledgments

Funding: HB was partly supported by the National Institutes of Health through grants UL1 TR001860 and P50 AR063043. HV was supported by the National Institutes of Health through grant 1R21AT004396.

Appendix: Sample SAS code

1. Scenario 1

data Vernon;  * 2 (Treat) x 3 (Guess) table of guess counts;
 do Treat=1 to 2;
  do Guess=1 to 3;
   input freq @@; output;
  end;
 end;
datalines;
16 9 7
9 15 8
;
* Power of the Treat-by-Guess association test over a range of total
  sample sizes (%powerRxC power macro for RxC tables);
%powerRxC(data=Vernon,row=Treat,col=Guess,count=freq,nrange=%str(20,50,100,176,200 to 500 by 100))
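For readers without the %powerRxC macro at hand, the same Scenario 1 calculation can be approximated in plain Python: compute the Pearson chi-square statistic of the pilot 2x3 table, carry chi-square/n forward as the effect size, and evaluate the noncentral chi-square tail probability at each candidate total sample size. This is a standard-library sketch, so results may differ slightly from the macro's output.

```python
import math

def chi2_cdf_even_df(x, df):
    # CDF of a central chi-square with even df (df = 2k):
    # P(X <= x) = 1 - exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    k = df // 2
    half = x / 2.0
    term, s = 1.0, 0.0
    for i in range(k):
        s += term
        term *= half / (i + 1)
    return 1.0 - math.exp(-half) * s

def ncx2_cdf(x, df, lam):
    # Noncentral chi-square CDF as a Poisson mixture of central
    # chi-squares; df + 2j stays even when df is even
    total, w = 0.0, math.exp(-lam / 2.0)
    for j in range(200):  # truncate the Poisson series
        total += w * chi2_cdf_even_df(x, df + 2 * j)
        w *= (lam / 2.0) / (j + 1)
    return total

# Observed 2x3 table (Treat x Guess) from the Vernon data above
table = [[16, 9, 7], [9, 15, 8]]
n_pilot = sum(sum(row) for row in table)
row_tot = [sum(r) for r in table]
col_tot = [sum(t[j] for t in table) for j in range(3)]

# Pearson chi-square of the pilot table; effect size w^2 = chi2 / n
chi2 = sum((table[i][j] - row_tot[i] * col_tot[j] / n_pilot) ** 2
           / (row_tot[i] * col_tot[j] / n_pilot)
           for i in range(2) for j in range(3))
w2 = chi2 / n_pilot

crit = 5.991  # 95th percentile of chi-square with df = 2
for n in (20, 50, 100, 176, 200, 300, 400, 500):
    power = 1.0 - ncx2_cdf(crit, 2, n * w2)
    print(n, round(power, 3))  # power at n=176 is about 0.80
```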

2. Scenario 3

data Chow;
input Treat Guess Outcome Weight;
datalines;
1 1 9.6 19
1 2 3.9 3
1 3 12.2 2
2 1 2.6 3
2 2 6.1 16
2 3 5.8 6
;
proc glmpower data=Chow;
 class Treat Guess;
 model Outcome = Treat | Guess;
 weight Weight;
 * Overall 2-df Treat*Guess interaction;
 contrast 'Inter-overall' Treat*Guess 1 0 -1 -1 0 1, Treat*Guess 0 1 -1 0 -1 1;
 * Tailored 1-df contrast: ATE among Guess=1(=T) vs. Guess=2(=C);
 contrast 'Inter-tailored (ATE in 1(=T) vs. 2(=C))' Treat*Guess 1 -1 0 -1 1 0;
 power
  stddev = 4 5
  ntotal = 148 98
  power = .;
run;

Footnotes

Conflict of Interest: None

References

  1. Agresti A. Categorical Data Analysis. Wiley; 2013.
  2. Arandjelović O. A new framework for interpreting the outcomes of imperfectly blinded controlled clinical trials. PLoS One. 2012;7:e48984. doi: 10.1371/journal.pone.0048984.
  3. Baethge C, Assall O, Baldessarini R. Systematic review of blinding assessment in randomized controlled trials in Schizophrenia and Affective Disorders 2000–2010. Psychotherapy and Psychosomatics. 2013;82:152–160. doi: 10.1159/000346144.
  4. Bang H. Random guess and wishful thinking are the best blinding scenarios. Contemporary Clinical Trials Communications. 2016;3:117–121. doi: 10.1016/j.conctc.2016.05.003.
  5. Bang H, et al. Blinding assessment in clinical trials: A review of statistical methods and a proposal of blinding assessment protocol. Clinical Research and Regulatory Affairs. 2010;27:42–51.
  6. Bang H, Ni L, Davis CE. Assessment of blinding in clinical trials. Controlled Clinical Trials. 2004;25:143–156. doi: 10.1016/j.cct.2003.10.016.
  7. Bang H, Park J. Blinding in clinical trials: a practical approach. Journal of Alternative and Complementary Medicine. 2013;19:367–369. doi: 10.1089/acm.2012.0210.
  8. Boutron I, et al. Reporting methods of blinding in randomized trials assessing nonpharmacological treatments. PLoS Med. 2007;4:e61. doi: 10.1371/journal.pmed.0040061.
  9. Briggs A. Economic evaluation and clinical trials: size matters. The need for greater power in cost analyses poses an ethical dilemma. BMJ. 2000;321:1362. doi: 10.1136/bmj.321.7273.1362.
  10. Brinjikji W, et al. Investigational vertebroplasty efficacy and safety trial: Detailed analysis of blinding efficacy. Radiology. 2010;257:219–225. doi: 10.1148/radiol.10100094.
  11. Brownell KD, Stunkard AJ. The double-blind in danger: untoward consequences of informed consent. Am J Psychiatry. 1982;139:1487–1489. doi: 10.1176/ajp.139.11.1487.
  12. Castelloe JM, O'Brien RG. Power and sample size determination for linear models. Proceedings of the Twenty-Sixth Annual SAS Users Group International Conference. 2001;240.
  13. Chow SC, Shao J. Analysis of clinical data with breached blindness. Statistics in Medicine. 2004;23:1185–1193. doi: 10.1002/sim.1694.
  14. Cohen J. Things I have learned (so far). American Psychologist. 1990;45:1304–1312.
  15. COMMIT (ClOpidogrel and Metoprolol in Myocardial Infarction Trial) collaborative group. Addition of clopidogrel to aspirin in 45 852 patients with acute myocardial infarction: randomised placebo-controlled trial. Lancet. 2005;366:1607–1621. doi: 10.1016/S0140-6736(05)67660-X.
  16. CONSORT. 2010. http://www.consort-statement.org/consort-2010, last accessed on July 30, 2017.
  17. Crisp A. Blinding in pharmaceutical clinical trials: An overview of points to consider. Contemporary Clinical Trials. 2015;43:155–163. doi: 10.1016/j.cct.2015.06.002.
  18. Elston RC, Bush N. The hypotheses that can be tested when there are interactions in an analysis of variance model. Biometrics. 1964;20:681–698.
  19. Freed B, et al. Assessing blinding in trials of psychiatric disorders: A meta-analysis based on blinding index. Psychiatry Research. 2014;219:241–247. doi: 10.1016/j.psychres.2014.05.023.
  20. Hertzberg V, et al. Use of dose modification schedules is effective for blinding trials of warfarin: evidence from the WASID study. Clinical Trials. 2008;5:23–30. doi: 10.1177/1740774507087781.
  21. Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. The American Statistician. 2001;55:1–6.
  22. Hopton AK, Macpherson H. Assessing blinding in randomised controlled trials of acupuncture: challenges and recommendations. Chin J Integr Med. 2011;17:173–176. doi: 10.1007/s11655-011-0663-9.
  23. Houweling A, et al. Blinding strategies in the conduct and reporting of a randomized placebo-controlled device trial. Clinical Trials. 2014;11:547–552. doi: 10.1177/1740774514535999.
  24. Jadad A, et al. Assessing the quality of reports of randomized clinical trials: Is blinding necessary? Controlled Clinical Trials. 1996;17:1–12. doi: 10.1016/0197-2456(95)00134-4.
  25. James KE, et al. An index for assessing blindness in a multi-centre clinical trial: disulfiram for alcohol cessation - a VA cooperative study. Statistics in Medicine. 1996;15:1421–1434. doi: 10.1002/(SICI)1097-0258(19960715)15:13<1421::AID-SIM266>3.0.CO;2-H.
  26. Jeong H, et al. The effect of rigorous study design in the research of autologous bone marrow-derived mononuclear cell transfer in patients with acute myocardial infarction. Stem Cell Research & Therapy. 2013;4. doi: 10.1186/scrt233.
  27. Kolahi J, Bang H, Park J. Towards a proposal for assessment of blinding success in clinical trials: up-to-date review. Community Dentistry and Oral Epidemiology. 2009;37:477–484. doi: 10.1111/j.1600-0528.2009.00494.x.
  28. Mathieu E, et al. A theoretical analysis showed that blinding cannot eliminate potential for bias associated with beliefs about allocation in randomized clinical trials. Journal of Clinical Epidemiology. 2014;67:667–671. doi: 10.1016/j.jclinepi.2014.02.001.
  29. Moroz A, et al. Blinding measured: a systematic review of randomized controlled trials of acupuncture. Evidence-Based Complementary and Alternative Medicine. 2013:708251. doi: 10.1155/2013/708251.
  30. Muller KE, Peterson BL. Practical methods for computing power in testing the multivariate general linear hypothesis. Computational Statistics and Data Analysis. 1984;2:143–158.
  31. O'Brien RG. Using the SAS System to perform power analysis for log-linear models. Proceedings of the Eleventh Annual SAS Users Group International Conference. 1986.
  32. O'Brien RG, Shieh G. Pragmatic, unifying algorithm gives power probabilities for common F tests of the multivariate general linear hypothesis. The American Statistical Association Meetings. 1992.
  33. Park J, Bang H, Canette I. Blinding in clinical trials, time to do it better. Complementary Therapies in Medicine. 2008;16:121–123. doi: 10.1016/j.ctim.2008.05.001.
  34. Park J, et al. Acupuncture for subacute stroke rehabilitation: a sham-controlled, subject- and assessor-blind, randomized trial. Archives of Internal Medicine. 2005;165:2026–2031. doi: 10.1001/archinte.165.17.2026.
  35. Roy J. Randomized treatment-belief trials. Contemporary Clinical Trials. 2012;33:172–177. doi: 10.1016/j.cct.2011.09.011.
  36. Shin S, et al. Effectiveness and safety of electroacupuncture on poststroke urinary incontinence: study protocol of a pilot multicentered, randomized, parallel, sham-controlled trial. Evidence-Based Complementary and Alternative Medicine. 2016:5709295. doi: 10.1155/2016/5709295.
  37. Thompson SK. Sample size for estimating multinomial proportions. The American Statistician. 1987;41:42–46.
  38. Tortora RD. A note on sample size estimation for multinomial populations. The American Statistician. 1978;32:100–102.
  39. Vernon H. Chiropractic Manual Therapy and Neck Pain. 2017. https://clinicaltrials.gov/ct2/show/NCT01772966, last accessed on July 30, 2017.
  40. Vernon H, et al. Retention of blinding at follow-up in a randomized clinical study using a sham-control cervical manipulation procedure for neck pain: secondary analyses from a randomized clinical study. Journal of Manipulative and Physiological Therapeutics. 2013;36:522–526. doi: 10.1016/j.jmpt.2013.06.005.
  41. Walter S, Awasthi S, Jeyaseelan L. Pre-trial evaluation of the potential for unblinding in drug trials: a prototype example. Contemporary Clinical Trials. 2005;26:459–468. doi: 10.1016/j.cct.2005.02.006.
  42. Wilsey B, Deutsch R, Marcotte TD. Maintenance of blinding in clinical trials and the implications for studying analgesia using cannabinoids. Cannabis and Cannabinoid Research. 2016;1:139–148. doi: 10.1089/can.2016.0016.
  43. Wright S, Duncombe P, Altman DG. Assessment of blinding to treatment allocation in studies of a cannabis-based medicine (Sativex®) in people with multiple sclerosis: a new approach. Trials. 2012;13:1–11. doi: 10.1186/1745-6215-13-189.
  44. Zhang Z, et al. A causal model for joint evaluation of placebo and treatment-specific effects in clinical trials. Biometrics. 2013;69:318–327. doi: 10.1111/biom.12005.
