Abstract
To ensure that a study can properly address its research aims, the sample size and power must be determined appropriately. Covariate adjustment via regression modeling permits more precise estimation of the effect of a primary variable of interest at the expense of increased complexity in sample size / power calculation. The presence of correlation between the main variable and other covariates, commonly seen in observational studies and non-randomized clinical trials, further complicates this process. Though sample size and power specification methods have been developed to accommodate specific covariate distributions and models, most existing approaches rely on either simple approximations lacking theoretical support or complex procedures that are difficult to apply at the design stage. The current literature lacks a general, coherent theory applicable to a broader class of regression models and covariate distributions. We introduce succinct formulas for sample size and power determination with the generalized linear, Cox, and Fine-Gray models that account for correlation between a main effect and other covariates. Extensive simulations demonstrate that this method produces studies that are appropriately sized to meet their type I error rate and power specifications, particularly offering accurate sample size/power estimation in the presence of correlated covariates.
Keywords: Biomarkers, Generalized Linear Models, Time to Event Regression, Study Design, Sample Size/Power Determination, Variance Inflation Factor
1 |. INTRODUCTION
The primary goal of many biomedical studies is the evaluation of a specific variable’s impact on an outcome of interest. For example, researchers may wish to evaluate efficacy of a promising treatment or a biomarker’s effectiveness in predicting future disease occurrence. An example of the latter is given by Zaid et al.1, which explored possible correlations of 10 biomarkers with clinical outcomes following blood and marrow transplantation. Regression modeling is often employed to evaluate a primary variable of interest while accounting for the influence of other covariates. Moreover, rather than being normally distributed, the primary endpoint in many clinical studies is a binary, count, or time to event outcome, necessitating the use of nonlinear regression modeling to account for covariate influence. Correlation is often known to exist between the main variable and other covariates, such as demographics or health-related factors; this correlation can arise not only in observational studies but also in non-randomized clinical trials.
The main determinant of the power to detect an appreciable effect in common regression models is the sample size of the study. Calculating the required sample size for nonlinear regression models to satisfy type I error rate and power requirements of a study is made increasingly complicated by two factors: the nonlinear relationship between the response and explanatory variables, and the presence of correlation between covariates. The majority of sample size estimation procedures for regression models assume the covariates are uncorrelated. For the logistic, Poisson, Cox2, and Fine-Gray regression3 models, formulas based on the asymptotic, normal distribution of the parameter estimate were presented by Hsieh et al.4, Signorini5, Schoenfeld6, and Latouche et al.7, respectively. Extending this estimation to accommodate correlated covariates has proven to be much more challenging, though. For generalized linear models, Self and Mauritsen8 proposed a method for estimating the required sample size through an iterative procedure; this approach was extended by Liu and Liang9 for marginal modeling of the mean response from clustered observations. However, these techniques fail to give closed forms for the sample size estimate or demonstrate superior performance over simpler, less computationally demanding methods. Alternatively, Hanley and Moodie10 advocated for a simpler, yet general, approach to sample size and power calculation for generalized linear models. Though the authors demonstrate a variety of uses of these models, including hypothesis testing of mean differences, correlations, risk differences, and risk and odds ratios, usage in the setting of correlated covariates was covered only for ordinary linear regression, likely due to the sparsity of existing literature rigorously supporting this method in more complex settings.
When correlation exists between covariates, Hsieh et al.11 argued for using a technique that inflates the sample size obtained for uncorrelated covariates by a factor determined by the strength of the correlation. They presented a simulation study supporting this method’s usage with their logistic regression formula and Schoenfeld’s Cox model formula12, but did not provide theoretical justification. Signorini5 gave theoretical support to this inflation method for Poisson regression with normally distributed covariates, while Schmoor et al.13 and Latouche et al.7 offered derivations supporting the method’s use with the Cox and Fine-Gray models when two binary covariates are at play. Justification for this approach in a more general context, however, is lacking. Our proposed methodology both extends these simpler formulas for logistic, Poisson, Cox, and Fine-Gray regression to admit a more general class of predictors and regression models and provides theoretical support for the sample size inflation technique that adjusts for correlated covariates.
Section 2 introduces a unified approach to sample size determination for generalized linear, Cox, and Fine-Gray regression models that utilizes the inflation factor approach of Hsieh et al.4 to account for the presence of correlation. In Section 3, we present results of a comprehensive simulation study that evaluated the performance of this sample size estimation in studies employing realistic sample sizes. Examples of applying the proposed sample size estimation in the design of prospective cohort studies are given in Section 4. Section 5 concludes the paper with a discussion of these findings and potential extensions of the technique.
2 |. METHODS
2.1 |. General Approach
Suppose that the primary objective of a study is to evaluate the impact of a main covariate Z on an outcome Y while controlling for the influence of other covariates X = (X1, X2, …, Xp)T. For instance, Z could denote a binary treatment indicator in a two-arm clinical trial or a continuously measured biomarker’s level in a correlative study. If the parameter β represents the effect of Z on the outcome in a regression model, then a two sided test of H0 : β = 0 determines whether the main covariate has any impact on the outcome. For many common regression models, there exists an estimator β̂ of β such that √n(β̂ − β) converges in distribution to N(0, v(β)), where the asymptotic variance v may depend on the true value of β. From this asymptotic distribution, we can derive a test of H0. The sample size n required for a two sided Wald test of H0 to detect an effect β = δ with level α and power π is

n = [Φ−1(1 − α/2) + Φ−1(π)]² v(δ) / δ²
where Φ−1 is the standard normal quantile function. Usually, v (β) is a continuous function of β such that v (δ) ≈ v (0) when δ is close to 0, giving the simpler formula
n = [Φ−1(1 − α/2) + Φ−1(π)]² v(0) / δ²        (1)
Formula (1) will be used as the basis for sample size calculation for the generalized linear regression, Cox, and Fine-Gray models in the methodology that follows. The main difficulty involved is accurately specifying the variance term v (0), which can be especially challenging when Z and X are correlated. We will show that simple formulas can be applied that account for this correlation.
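As an illustration, formula (1) can be evaluated directly once v(0), δ, α, and π are specified. The following sketch (in Python, with hypothetical argument names) computes the required sample size, rounding up to the next whole subject:

```python
from math import ceil
from statistics import NormalDist

def sample_size(v0: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Formula (1): n = [Phi^-1(1 - alpha/2) + Phi^-1(power)]^2 * v(0) / delta^2."""
    z = NormalDist().inv_cdf
    return ceil((z(1 - alpha / 2) + z(power)) ** 2 * v0 / delta ** 2)
```

For example, v0 = 1 and δ = 0.2 at the default 5% level and 80% power give n = 197. Each model below supplies its own specification of v(0).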
2.2 |. Generalized Linear Regression Models
First, consider the class of generalized linear models (GLM)14. Let be a random sample of outcomes and covariates such that all Zi and Xi have finite variance. The covariates and outcome are assumed to be related through the generalized linear model
h(E[Yi | Zi, Xi]) = β0 + βZi + γTXi = ηTWi        (2)
where h is a link function that is differentiable and invertible, η = (β0, β, γT)T is the full parameter vector, and W = (1, Z, XT)T. The assumptions are satisfied for common choices of link function, including the identity, log, logit, probit, complementary log-log, and Box-Cox transformations. Furthermore, each response variable Yi is assumed to arise from the exponential dispersion family15 with parameters θ and ϕ, having a probability mass/density function of the form

f(y; θ, ϕ) = exp{[yθ − A(θ)]/ϕ + c(y, ϕ)}.
Due to the GLM assumption, the parameter θ depends on Zi and Xi. Using well-known results from Nelder and Wedderburn14, it can be shown that the maximum likelihood estimator η̂ of η satisfies √n(η̂ − η) converges in distribution to N(0, Σ−1), where Σ = E[W⊗2 q(η, W)] with q(η, W) = [(h−1)′(ηTW)]² / [ϕ A″(g(ηTW))] and g = (A′)−1 ∘ h−1 (see Supporting Information for a derivation of this result).
We use the multiple correlation coefficient of Z with X, denoted by R, to describe their correlation; this satisfies R2 = Cov(Z, X)Var(Z)−1Var(X)−1Cov(X, Z), where R2 can be interpreted as the proportion of the variance of Z that is explained by X. It is known that in an ordinary linear regression model, the asymptotic variance of the estimator of β is inversely proportional to Var(Z)(1 − R2). Because of this, the term VIF = (1 − R2)−1 is called the variance inflation factor, as it indicates the proportional increase in this variance that occurs in linear regression compared to the situation where Z and X are uncorrelated. This also indicates the inflation in sample size required to detect a given effect, compared to the correlation-free setting. For linear regression, this result and formula (1) provide the sample size
n = [Φ−1(1 − α/2) + Φ−1(π)]² σ² VIF / (δ² Var(Z))        (3)
where σ2 is the residual variance. GLMs exhibit a similar variance inflation effect, shown by the following theorem.
Theorem 1 Assume that model (2) holds and E (Z |X) = a + b′X for some scalar a and vector b. Then a sufficient sample size for two-sided testing of H0 : β = 0 at significance level α with power π to detect β = δ is given by
n = [Φ−1(1 − α/2) + Φ−1(π)]² VIF / (δ² Var(Z) f00)        (4)
where f00 = E [q (η0, W)] and η0 = (β0, 0, γT)T is the value of η under H0.
The sample size given for testing a main effect in GLMs is determined by six components: the type I error rate α; the power π; the targeted effect size δ; the variability of the main effect, Var (Z); the correlation between the main effect and other covariates, described by the VIF; and f00, which relates to the response distribution and choice of link function. This sample size is both proportional to the VIF and inversely proportional to the variance of Z, the same as for ordinary linear regression.
The general form of f00 is quite complicated, but it simplifies greatly for common choices of GLMs. When the canonical link function is chosen such that A′ ≡ h−1, g is the identity function and f00 = E[Var(Y |W)]/ϕ² depends only on the dispersion parameter ϕ and the residual variance under H0. For linear regression, ϕ = σ² and Var(Y |W) = σ², so f00 = 1/σ² and formulas (3) and (4) coincide. For logistic and Poisson regression, ϕ = 1 is assumed. Thus, for the logistic model, f00 = E[p0(X){1 − p0(X)}], the expected variance of Y |X under H0, where p0(X) = h−1(β0 + γTX). For Poisson regression, f00 = E[exp(β0 + γTX)], which is also the expected number of events for a subject under H0, and so the sample size relies upon the expected number of events for this model. Specification of the residual variance term Var (Y|W) in f00 is likely to be the most challenging part of applying formula (4). This challenge is not unique to our method, however; it also arises in simulation-based approaches to sample size estimation for GLMs, which carry greater probability-modeling and computational demands. Historical data and/or expert opinion can be utilized to obtain estimates for the regression parameters β0 and γ as well as a plausible distribution for W in order to specify f00.
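When no closed form is convenient, f00 can be approximated by Monte Carlo integration over a plausible covariate distribution. The sketch below (our illustration, not part of the formal method) estimates f00 = E[p0(X){1 − p0(X)}] for a logistic model; `draw_x` is a hypothetical user-supplied sampler for X.

```python
import random
from math import exp

def f00_logistic(beta0, gamma, draw_x, n_mc=200_000, seed=1):
    """Monte Carlo estimate of f00 = E[p0(X)(1 - p0(X))] for the logistic model,
    where p0(X) = expit(beta0 + gamma^T X) is the event probability under H0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_mc):
        x = draw_x(rng)  # one draw of the covariate vector X
        p = 1.0 / (1.0 + exp(-(beta0 + sum(g * xi for g, xi in zip(gamma, x)))))
        total += p * (1.0 - p)
    return total / n_mc
```

With γ = 0 and β0 = 0, every subject has p0 = 0.5 and the estimate reduces to the exact value 0.25, a useful sanity check.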
Although this article focuses upon specification of a sufficient sample size for the purpose of designing a study to test a main effect, formula (4) can also be applied toward power calculation or to determine the targeted effect size. Situations arise where the sample size may be predetermined and interest lies in computing the power provided to detect a clinically meaningful effect size or, conversely, in finding the minimum effect size that can be detected at a given power level. Our formula may be applied readily in these situations as well: given any two of the design parameters n, π, and δ, the third is readily determined.
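For instance, the power implied by a fixed n follows from solving (4) for π (and δ can be isolated in the same way). A sketch, with hypothetical function and argument names, assuming the quantities in (4) are supplied by the user:

```python
from statistics import NormalDist

def power_glm(n, delta, var_z, f00, r2=0.0, alpha=0.05):
    """Solve formula (4) for the power pi given the sample size n:
    pi = Phi( sqrt(n * delta^2 * Var(Z) * f00 / VIF) - Phi^-1(1 - alpha/2) )."""
    nd = NormalDist()
    vif = 1.0 / (1.0 - r2)
    return nd.cdf((n * delta ** 2 * var_z * f00 / vif) ** 0.5 - nd.inv_cdf(1 - alpha / 2))
```

Rounding the sample size up to a whole subject means the computed power slightly exceeds the nominal target, as expected.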
Theorem 1 assumes a linear model for Z given X; however, our simulations support the use of formula (4) in situations where the relationship is not linear but instead follows a logistic or log-linear model. This suggests that the assumption could be relaxed and that perhaps the linear relationship only needs to hold approximately near the mean of X, where the majority of its probability density/mass is concentrated. Formula (4) matches the formula suggested by Hsieh et al.4 for logistic regression. For Poisson regression, Signorini5 offered a similar method for sample size determination but assumed that influential covariates had a multivariate normal distribution. Our method unifies these approaches under a general framework that permits a broad class of link functions and covariate distributions.
2.3 |. Fine-Gray Regression Model
Similar formulas can be applied for the Cox and Fine-Gray regression models for time to event data. The Fine-Gray model is considered first, since the competing risks setting to which it applies can be viewed as a special case of the single failure type setting in which the Cox model is applicable. Without loss of generality, we model the occurrence of cause 1 events from competing risks data over the study period [0, t*], where t* is the maximum follow-up time considered. Subjects may be right-censored within this period, and it is assumed that censoring happens independently of event occurrence and covariates. For the ith patient on study, let Ti denote the event time, Ci be the censoring time, Zi be the main covariate of interest, Xi be the vector of other covariates, and εi be the event type. The vectors (Ti, Ci, Zi, Xi, εi), i = 1, …, n, are assumed to be a random sample from the target population. The Fine-Gray regression model specifies the subdistribution hazard function through the form
λ1(t; Z, X) = λ10(t) exp(βZ + γTX)        (5)
where t is the time on study and λ10 is the baseline subdistribution hazard. Let ν = (β, γT)T and V = (Z, XT)T. Martens and Logan16 showed that the estimator ν̂ of ν presented in Fine and Gray3 satisfies √n(ν̂ − ν) converges in distribution to N(0, Σ*−1), where Σ* denotes the limiting information matrix, evaluated under H0 at ν0 = (0, γT)T (its explicit form is given in Martens and Logan16). Σ* can be expressed in block matrix form as

Σ* = [ σ*11  Σ*12 ]
     [ Σ*21  Σ*22 ]

where σ*11 is the scalar (1,1) element of Σ*. Then for large n, Var(β̂) ≈ n−1 v(0), where v(0) = (σ*11 − Σ*12 Σ*22−1 Σ*21)−1. As with generalized linear models, we can obtain a simpler expression if E(Z|X) is linear with respect to X.
Theorem 2 Assume that model (5) holds and that E(Z|X) = a + b′X for some scalar a and vector b. Then v(0) ≤ [Var (Z)ψ*]−1VIF, where ψ* = P(Ti ≤ Ci ∧ t*, εi = 1) is the probability that a subject has a cause 1 event in [0, t*] that is not censored. Subsequently, a sufficient sample size for two-sided testing of H0 : β = 0 at significance level α with power π to detect β = δ is given by
n = [Φ−1(1 − α/2) + Φ−1(π)]² VIF / (δ² Var(Z) ψ*)        (6)
This formula is nearly identical to (4), the sole difference being that f00 is replaced with ψ*, the chance that an event of interest is observed during the study period [0, t*]. This event frequency is generally easier to specify than the residual variance involved in f00 and this specification can be guided by previous studies or clinicians’ input. As with the GLM formula, the simulation study in Section 3 supports using formula (6) even if E(Z|X) is not linear with respect to X. Latouche et al.7 offered this formula solely for the special case where Z is binary and uncorrelated with X in the censoring-free and censoring-complete settings (potential censoring times are known for all patients). Because (6) is derived using the variance calculation in Martens and Logan16, which is valid for censoring-free, censoring-complete, and random right-censoring situations, our formula (6) is valid in all of these settings as well.
2.4 |. Cox Regression Model
In the single failure type setting, Cox regression is often used to evaluate a main variable’s impact on failure risk while adjusting for covariates. We assume that patients are at risk of a single event type and that their observations comprise a random sample. Moreover, the censoring assumption may be relaxed from independent censoring to “noninformative” censoring, i.e. T and C are independent given Z and X. Letting λ be the hazard function for the failure distribution, the Cox model assumes that this function has the form
λ(t; Z, X) = λ0(t) exp(βZ + γTX)        (7)
where t is the time on study, [0, t*] is the study period being considered, and λ0 is the baseline hazard function. As before, Cox2 showed that the partial likelihood estimator ν̂ of ν satisfies √n(ν̂ − ν) converges in distribution to N(0, Σ̃−1), where Σ̃ is the limiting partial likelihood information matrix evaluated under H0, with a block structure paralleling that of Σ* in Section 2.3. From this, we obtain a formula nearly identical to those for the GLM and Fine-Gray models.
Theorem 3 Assume that model (7) holds and that E(Z|X) = a + b′X for some scalar a and vector b. Then v (0) ≤ [Var (Z)ψ]−1VIF, where ψ = P (Ti ≤ Ci ∧ t*) is the probability that a subject has an event in [0, t*] that is not censored. Subsequently, a sufficient sample size for two-sided testing of H0 : β = 0 at significance level α with power π to detect β = δ is given by
n = [Φ−1(1 − α/2) + Φ−1(π)]² VIF / (δ² Var(Z) ψ)        (8)
This formula simply uses the overall event probability ψ in place of the f00 in (4) and the ψ* in (6). Schoenfeld6 first introduced this formula for the Cox model in the special case of a binary main variable that is uncorrelated with other covariates. Hsieh and Lavori12 proposed applying this formula for a Cox model with a nonbinary main variable and/or correlation between the main and other covariates, but did not provide theoretical support for this usage. Our methodology reconciles these methods of sample size estimation for the Cox and Fine-Gray models by extending their usage to general covariate distributions and possible right censoring, while providing a firm theoretical basis for their application.
As noted in Section 2.3 for GLM models, formulas (6) and (8) may also be applied in situations where the sample size is predetermined and investigators instead wish to assess the power provided to detect a targeted effect size or to find the minimum effect size detectable at a given power level. The power π or effect size δ can be expressed directly in terms of the other elements of these formulas using basic algebra.
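To make the inversion concrete for the time-to-event formulas, the sketch below solves (8) (or (6), with ψ* supplied in place of ψ) for the smallest detectable log hazard ratio at a given n; the function name and argument layout are our own.

```python
from math import exp, sqrt
from statistics import NormalDist

def min_detectable_hr(n, var_z, psi, r2=0.0, alpha=0.05, power=0.80):
    """Solve formula (8) for delta: the minimum detectable effect is
    delta = sqrt([Phi^-1(1-alpha/2) + Phi^-1(power)]^2 * VIF / (n * Var(Z) * psi)),
    returned on the hazard ratio scale as exp(delta)."""
    z = NormalDist().inv_cdf
    vif = 1.0 / (1.0 - r2)
    delta = sqrt((z(1 - alpha / 2) + z(power)) ** 2 * vif / (n * var_z * psi))
    return exp(delta)
```

As expected, a larger multiple correlation (larger VIF) raises the minimum hazard ratio detectable at a fixed n.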
2.5 |. Event-driven Sample Size Calculation for the Poisson, Cox, and Fine-Gray Models
The generalized linear, Cox, and Fine-Gray regression sample size calculations presented here are all derived from the very general formula in (1), though derivation of the asymptotic variance term v(0) proceeds differently for each. These formulas provide the required number of subjects for a study to test the influence of a main effect of interest; however, sample size calculations for the Cox and Fine-Gray models are sometimes performed in terms of the required number of events. The proposed formulas for these models can also be applied to compute the required number of relevant events because, by rearranging terms in (8) and (6), we obtain

nψ = [Φ−1(1 − α/2) + Φ−1(π)]² VIF / (δ² Var(Z))    and    nψ* = [Φ−1(1 − α/2) + Φ−1(π)]² VIF / (δ² Var(Z)),
where the right-hand side gives the expected total number of events occurring in n patients under H0. For large n, this expectation will lie close to the observed number of events, and so the right-hand side can be used as the required number of events to perform the specified two-sided test. The required number of events can be similarly specified for Poisson regression. For this model, the term f00 in (4) equals E[exp(β0 + γTX)], the expected number of events per subject under H0. Then rearranging terms in (4) gives

n f00 = [Φ−1(1 − α/2) + Φ−1(π)]² VIF / (δ² Var(Z)),
where the right-hand side is the expected total number of events in n subjects under H0 and can be used as the targeted number of events for the study.
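Under our reading of these rearrangements, the required event count d = nψ (Cox), nψ* (Fine-Gray), or n·f00 (Poisson) shares a single expression; a brief sketch:

```python
from math import ceil
from statistics import NormalDist

def required_events(delta, var_z, r2=0.0, alpha=0.05, power=0.80):
    """Event-driven target common to the rearranged formulas:
    d = [Phi^-1(1-alpha/2) + Phi^-1(power)]^2 * VIF / (delta^2 * Var(Z))."""
    z = NormalDist().inv_cdf
    vif = 1.0 / (1.0 - r2)
    return ceil((z(1 - alpha / 2) + z(power)) ** 2 * vif / (delta ** 2 * var_z))
```

With a balanced binary main covariate (Var(Z) = 0.25), no correlation, and a targeted hazard ratio of 2 (δ = log 2), this reproduces the familiar target of 66 events at the 5% level with 80% power.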
2.6 |. Precision / Confidence Interval-focused Sample Size Calculation
The proposed formulas can also be applied toward a precision-focused sample size estimation in the generalized linear, Cox, and Fine-Gray models, i.e. choosing an appropriate sample size to yield a confidence interval (CI) for β whose width does not exceed a specified value. To do this, we simply modify the formulas by replacing δ by the desired half-width for the CI and replacing the numerator term [Φ−1(1 − α/2) + Φ−1(π)]² by [Φ−1(1 − α/2)]²; all other terms remain as previously defined. Conversely, the modifications also may be used to determine the width of the CI that will be produced by a given sample size. These results pertain to both the subject-based sample size formulas (4), (6), and (8) as well as to the event-based formulas in Section 2.5.
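A sketch of this precision-focused modification, with δ replaced by the desired half-width and the power term dropped (names are ours; `event_term` stands in for f00, ψ, or ψ* as appropriate to the model):

```python
from math import ceil
from statistics import NormalDist

def n_for_ci_halfwidth(halfwidth, var_z, event_term, r2=0.0, alpha=0.05):
    """Sample size so the (1 - alpha) CI for beta has at most the given half-width:
    n = [Phi^-1(1 - alpha/2)]^2 * VIF / (halfwidth^2 * Var(Z) * event_term)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    vif = 1.0 / (1.0 - r2)
    return ceil(z ** 2 * vif / (halfwidth ** 2 * var_z * event_term))
```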
3 |. SIMULATION STUDY
The theory presented in Section 2 supports application of the proposed sample size determination techniques for studies with large samples. To assess the methods’ performance for more modest, perhaps more realistic sample sizes, a simulation study was conducted. We evaluated whether the proposed formulas produce sample sizes that provide actual power levels that lie close to the targeted levels for detecting a prespecified effect size of interest. For each simulation, we choose a targeted effect size δ for the model under consideration, compute the estimated sample size from our formula, generate data for the regression model under the alternative hypothesis Ha : β = δ, and record whether the fitted regression model rejects H0.
Simulations were performed for the logistic, Cox, and Fine-Gray regression models. For each model, two values of δ were considered that give minimum estimated sample sizes in the ranges 100–300 and 400–600. Actual power levels targeted by the sample size calculations included π = 80% and 90%. For these models, we considered the following choices of covariates: Z has either a Bernoulli(0.5), standard normal, or standard lognormal distribution; and X is either Bernoulli(0.5) or is a random vector with independent components, whose first two components are Bernoulli(0.5) and third component is standard normal. Values considered for the VIF were 1.0, 1.2, 1.5, or 2.0. If Z was standard normal, the correlation was induced by generating the main covariate as Z = c(X1 + ⋯ + Xp) + W, where W ~ N(0, 1) is independent of X, p = dim(X), and the constant c is chosen to produce the specified VIF value. If Z was Bernoulli, it was generated by the logistic model logit P(Z = 1 | X) = c(X1 + ⋯ + Xp), with c chosen to induce the targeted level of correlation. If Z was lognormal, it relates to X through the log-linear model log Z = c(X1 + ⋯ + Xp) + W, where W ~ N (0, 1) is independent of X and c is chosen to induce the targeted level of correlation.
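To illustrate the correlation-induction step for a normal main covariate, the sketch below uses Z = c(X1 + X2 + X3) + W with X containing two Bernoulli(0.5) components and one standard normal component; since Var(X1 + X2 + X3) = 1.5, setting c = sqrt((VIF − 1)/1.5) gives R² = 1.5c²/(1.5c² + 1) and hence the targeted VIF. The exact generation scheme here is our reconstruction, not a verbatim transcription of the simulation code.

```python
import random

def simulate_z(vif, n=100_000, seed=7):
    """Generate Z = c*(X1 + X2 + X3) + W with W ~ N(0, 1), choosing c so the
    multiple correlation of Z with X yields VIF = 1/(1 - R^2).
    Under this construction, Var(Z) = 1.5*c^2 + 1 = vif."""
    rng = random.Random(seed)
    c = ((vif - 1.0) / 1.5) ** 0.5
    z = []
    for _ in range(n):
        # X1, X2 ~ Bernoulli(0.5); X3 ~ N(0, 1); Z loads equally on their sum
        s = (rng.random() < 0.5) + (rng.random() < 0.5) + rng.gauss(0.0, 1.0)
        z.append(c * s + rng.gauss(0.0, 1.0))
    return z
```

The empirical variance of the generated Z provides a quick check that the targeted VIF was achieved.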
For the logistic model, h(u) = logit(u) and covariate effects were generated according to model (2) with the following possible specifications: β0 = −log 3, 0, or log 3; β = log 1.4 or log 1.6; and γ determined by a parameter φ taking the value −log 3, −log 2, log 2, or log 3. Choices of β correspond to odds ratios of 1.4 or 1.6 for the main variable. γ was specified so that, for a given value of φ, the linear predictor γT X would have the same variance under either choice of distribution for X. This produced a total of 1152 possible configurations of simulation parameters, with 10000 replicates simulated for each.
For the Cox and Fine-Gray models, we assume all subjects are followed for a maximum of one year, with those surviving longer being administratively censored at the one year mark. Simulations for the Cox model considered β = log 1.2 or log 1.3 and for the Fine-Gray model considered β = log 1.3 or log 1.4, corresponding to hazard ratios of 1.2 or 1.3 for the primary variable of interest in the Cox model and subdistribution hazard ratios of 1.3 or 1.4 in the Fine-Gray model. Choices of censoring rates for the two models included either no censoring, or exponential censoring at a rate of 50% per year. For the Fine-Gray model, the overall proportion of cause 1 events was specified as either 30% or 60%. In total, the Cox and Fine-Gray models had 768 and 1536 possible configurations, respectively; for each configuration, 10000 studies were simulated.
From each configuration, an empirical estimate of the power provided by the calculated sample size is obtained. The large number of estimates for each model was then summarized using box plots to evaluate the performance of our sample size specification across configurations. Moreover, we noted from our simulation results that the only other simulation parameters that noticeably impacted the empirical power estimates were the VIF and the targeted power level π. Therefore, these box plots aggregate the estimates obtained from common values of VIF and π.
Figure 1 displays these estimates from the logistic model simulations determined by sample sizes computed using (4). For VIF = 1, where the covariates are uncorrelated, the empirical power estimates tend to fall about 1% short of the targeted power level used in the sample size calculation. We attribute this slight shortfall to incomplete convergence of the logistic model’s parameter estimate to its asymptotic distribution. Interestingly, the empirical power increases slightly for VIF > 1 and meets the targeted power, on average. We believe two factors explain this slight boost in observed power levels: first, the larger sample sizes required by the variance inflation technique ensure tighter convergence of model estimates to the asymptotic distribution from which the sample size formula is derived; second, the sample size from formula (4) was shown to be an upper bound, so it may be conservative. The proposed sample size formula appears to perform well, producing sample sizes that provide power close to or exceeding the targeted level in a wide variety of settings.
FIGURE 1.

Estimated power provided for testing the main effect in the logistic regression model by the proposed sample size method: Empirical estimates of the actual power levels provided by logistic regression using the sample size formula in (4) were generated using sample sizes intended to target 80% and 90% power. Box plots summarize the estimates for each targeted power level and variance inflation factor (VIF) considered. Dotted black lines indicate the targeted power levels for the tests.
Power estimates for sample sizes computed for the Cox model via (8) and for the Fine-Gray model via (6) are displayed in Figures 2 and 3, respectively. Similar trends to those for the logistic model are seen: with uncorrelated covariates (VIF = 1), empirical power levels are roughly 1% less, on average, than the targeted levels; however, in the presence of correlation, estimated power levels tend to meet the target for these time-to-event regression models.
FIGURE 2.

Estimated power provided for testing the main effect in the Cox proportional hazards model by the proposed sample size method: Empirical estimates of the actual power levels provided by Cox regression using the sample size formula in (8) were generated using sample sizes intended to target 80% and 90% power. Box plots summarize the estimates for each targeted power level and variance inflation factor (VIF) considered. Dotted black lines indicate the targeted power levels for the tests.
FIGURE 3.

Estimated power provided for testing the main effect in the Fine-Gray regression model by the proposed sample size method: Empirical estimates of the actual power levels provided by Fine-Gray regression using the sample size formula in (6) were generated using sample sizes intended to target 80% and 90% power. Box plots summarize the estimates for each targeted power level and variance inflation factor (VIF) considered. Dotted black lines indicate the targeted power levels for the tests.
One curiosity seen in Figures 1 – 3 is that, though the locations of the box plots are similar across figures and lie close to the targeted power levels, their widths are noticeably larger for the logistic and Fine-Gray models than for the Cox model. Moreover, these widths increase with the VIF. Our working explanation is that, though the empirical power levels generally match the target, there are some configurations for which the actual power levels deviate from the target due to incomplete convergence of the test statistics to their asymptotic distributions. Because the logistic and Fine-Gray simulations considered larger numbers of configurations, the higher variability of their empirical power estimates reflects the greater diversity of simulation settings relative to the Cox model. Additionally, because the VIF multiplies the sample size obtained under no correlation (VIF = 1.0), it also amplifies any deviation of the true power from the target, producing the ‘fanning out’ of box plot widths in these figures as the VIF grows larger.
For the simulations displayed in Figures 1 – 3, we randomly generated a distinct set of covariates for each replicate from the specified covariate distribution in its configuration. This was done to assess the performance across a wide variety of realized studies. Alternatively, we also ran simulations by generating a single set of covariates for each configuration and simulating outcomes for its 10000 replicated datasets using this common set of predictors. These simulations gave nearly identical empirical power estimates to those displayed in the Figures.
4 |. EXAMPLES
Applications of the proposed methods are demonstrated through the calculation of sample sizes for hypothetical prospective cohort studies investigating possible correlations between biomarkers and development of adverse outcomes following blood and marrow transplantation. One such outcome is veno-occlusive disease (VOD), a rare but severe complication that occurs in approximately 5% of transplant patients and can result in liver failure and a mortality rate estimated to be as high as 80%. A retrospective biomarker study by Zaid et al.1 found that lower levels of one biomarker, L-Ficolin, were associated with a greater risk of VOD occurrence within 1 year following transplant (p = 0.005), using data from the Blood and Marrow Transplant Clinical Trials Network 0402 trial17. This association was evaluated using a multivariable logistic regression model from 197 patients with log L-Ficolin levels as the main effect, dichotomized as Low or High using the median value in the cohort. We will use formula (4) to determine an appropriate sample size for a validation study that similarly evaluates L-Ficolin’s impact on VOD onset within the first year in a logistic regression that adjusts for age, Karnofsky score, race, and donor-recipient sex matching, factors that may correlate with VOD risk and/or L-Ficolin concentrations. Because these factors may also be correlated with the biomarker being considered, accounting for this correlation in the sample size determination for this validation study is imperative.
The proposed sample size formula (4) requires specification of the variance of log L-Ficolin, its multiple correlation coefficient R with the other covariates, the expected variance of VOD occurrence f00, the level and power of the test, and the targeted covariate effect. Data from Zaid et al.1 provide estimates of R = 0.370 and f00 = 0.0654. Because the main effect of log L-Ficolin level is binary with the observed median level from the previous study as the cutpoint, we expect that its variance Var(Z) will be 0.25. A univariate logistic model regressing VOD occurrence on dichotomized log L-Ficolin gave an estimated odds ratio of 4.04 with a 95% CI of (1.09, 14.98). The large width of the CI is attributed mainly to the low VOD frequency of 7.1% in this study. Moreover, the estimate may be biased due to publication bias. Therefore, we compute the estimated sample sizes providing 80% and 90% power to detect a range of plausible effect sizes lying within the lower half of the CI. Plugging these effect sizes and the other parameter estimates into formula (4) gives the sample size estimates listed in Table 1. To illustrate the importance of accounting for correlation of covariates, suppose that the validation study aims to detect an odds ratio of 3.00 for log L-Ficolin with two-sided testing at a 5% level. The respective sample sizes needed to provide 80% and 90% power are 461 and 617 patients. If log L-Ficolin is instead assumed to be uncorrelated with the other covariates, the estimated sample sizes are 398 and 533 to provide 80% and 90% power, respectively. However, if the true correlation matches the estimate of 0.370 from the previous biomarker study, the actual power levels provided by these sample sizes are 74.0% and 85.4%, respectively, falling 6.0% and 4.6% short of the targeted levels. Though these are not large deviations from intended power levels, they are appreciable and likely to be considered unacceptable by study investigators.
TABLE 1.
Estimated Sample Sizes from Formula (4) for Detecting Odds Ratios in VOD Example
| Power | OR 1.25 | OR 1.50 | OR 2.00 | OR 2.50 | OR 3.00 | OR 3.50 | OR 4.00 |
|---|---|---|---|---|---|---|---|
| 80% | 11171 | 3384 | 1158 | 663 | 461 | 355 | 290 |
| 90% | 14954 | 4530 | 1550 | 887 | 617 | 475 | 388 |
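The calculations above can be sketched in a few lines. The exact form of formula (4) appears earlier in the paper; the sketch below assumes the standard VIF-adjusted Wald-test form n = (z_{1−α/2} + z_π)² VIF / (β² Var(Z) f00) with VIF = 1/(1 − R²), which reproduces the reported sample sizes, and inverts the same relation to obtain the power provided by a fixed n.

```python
from math import ceil, log, sqrt
from statistics import NormalDist

def logistic_sample_size(beta, var_z, f00, R, alpha=0.05, power=0.80):
    """Sample size to detect log odds ratio beta for a covariate with
    variance var_z and multiple correlation R with other covariates;
    an assumed VIF-adjusted Wald form, not the paper's formula verbatim."""
    z = NormalDist()
    za, zp = z.inv_cdf(1 - alpha / 2), z.inv_cdf(power)
    vif = 1.0 / (1.0 - R ** 2)
    return ceil((za + zp) ** 2 * vif / (beta ** 2 * var_z * f00))

def logistic_power(n, beta, var_z, f00, R, alpha=0.05):
    """Power that a fixed sample size n provides under the same relation."""
    z = NormalDist()
    za = z.inv_cdf(1 - alpha / 2)
    vif = 1.0 / (1.0 - R ** 2)
    return z.cdf(sqrt(n * beta ** 2 * var_z * f00 / vif) - za)

beta = log(3.00)  # targeted odds ratio of 3.00
print(logistic_sample_size(beta, 0.25, 0.0654, 0.370, power=0.80))  # 461
print(logistic_sample_size(beta, 0.25, 0.0654, 0.370, power=0.90))  # 617
# Sizes computed ignoring the correlation (R = 0) fall short when R = 0.370:
print(round(logistic_power(398, beta, 0.25, 0.0654, 0.370), 3))  # 0.740
print(round(logistic_power(533, beta, 0.25, 0.0654, 0.370), 3))  # 0.854
```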
Another risk that may follow blood and marrow transplantation is the development of chronic graft-versus-host disease (cGVHD). This disease affects roughly 40% of transplant patients, tends to develop 6 months or later post-transplant, and indicates an elevated risk of mortality; therefore, identifying markers that predict its development is important so that therapy can be initiated early to mitigate this risk. The study by Zaid et al.1 found an association between elevated levels of chemokine (C-X-C motif) ligand 9 (CXCL9) and increased risk of cGVHD onset in a Cox proportional hazards model (p = 0.007). This evaluation dichotomized CXCL9 according to the median value observed in the sample and treated CXCL9 as a time-dependent variable.
However, because death before the development of cGVHD precludes observation of this disease, onset of cGVHD is a competing risks outcome. Moreover, dichotomization of a continuous predictor can lead to a loss of power for evaluating its effect in a regression model18. Therefore, for a hypothetical validation study evaluating the direct impact of CXCL9 on the chance of developing cGVHD within 2 years of transplant, we would use the Fine-Gray model and treat CXCL9 as continuous, using the log biomarker level at day 100 after transplant. The model would also adjust for age, conditioning regimen, time from diagnosis to transplant, donor-recipient sex matching, recipient CMV status, and Karnofsky performance score, factors that previous research suggests influence the risk of cGVHD.
The proposed sample size formula (6) requires specification of the variance of log CXCL9, its multiple correlation coefficient R with the other covariates, the probability of observing cGVHD onset by two years, the level and power of the test, and the covariate effect that we wish to detect. We assume that the censoring distribution for the validation study will be the same as that in the original one. Using data from Zaid et al.1, we estimated Var(log CXCL9) = 4.813, R = 0.380, and the probability of observing cGVHD onset by two years as 135/211 = 64.0%. Because the main effect of interest is continuous, a subdistribution hazard ratio for log CXCL9 of 1.10 will be considered, indicating an increasing likelihood of cGVHD as the CXCL9 level increases. Using formula (6), the sample size to detect a log CXCL9 subdistribution hazard ratio of 1.10 at a 5% level with 80% power using two-sided testing is 328; detecting this effect with 90% power requires 439 patients. If the correlation of log CXCL9 with the other covariates is ignored by simply assuming VIF = 1, the formula gives sample sizes of 281 and 376 to provide 80% and 90% power, respectively. However, if the true correlation equals the estimate of 0.380 from the prior study, the respective power levels provided by these sample sizes are only 73.6% and 85.0%, falling below the targeted levels by 6.5% and 5.0%. Although the multiple correlations of 0.370 and 0.380 considered for the VOD and cGVHD examples are moderate in size, our calculations demonstrate that ignoring them may diminish the actual power provided and prevent the study under design from effectively testing the main effect of interest.
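This calculation can likewise be sketched compactly. The exact form of formula (6) appears earlier in the paper; the sketch below assumes an event-driven form in which the required number of observed events, D = (z_{1−α/2} + z_π)² VIF / (β² Var(Z)), is divided by the probability ψ of observing the event of interest to yield the number of subjects. This assumed form reproduces the reported sample sizes.

```python
from math import ceil, log
from statistics import NormalDist

def fine_gray_sample_size(beta, var_z, psi, R, alpha=0.05, power=0.80):
    """Subjects needed to detect log subdistribution hazard ratio beta
    in a Fine-Gray model; an assumed VIF-adjusted, event-driven form
    reconstructed from the reported quantities, not the paper's formula
    verbatim. psi is the probability of observing the event of interest."""
    z = NormalDist()
    za, zp = z.inv_cdf(1 - alpha / 2), z.inv_cdf(power)
    vif = 1.0 / (1.0 - R ** 2)
    events = (za + zp) ** 2 * vif / (beta ** 2 * var_z)  # required events D
    return ceil(events / psi)                            # subjects n = D / psi

beta = log(1.10)  # targeted subdistribution hazard ratio of 1.10
print(fine_gray_sample_size(beta, 4.813, 0.640, 0.380, power=0.80))  # 328
print(fine_gray_sample_size(beta, 4.813, 0.640, 0.380, power=0.90))  # 439
# Ignoring the correlation (VIF = 1) gives the smaller, underpowered sizes:
print(fine_gray_sample_size(beta, 4.813, 0.640, 0.0, power=0.80))    # 281
print(fine_gray_sample_size(beta, 4.813, 0.640, 0.0, power=0.90))    # 376
```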
5 |. DISCUSSION
This paper proposes succinct formulas for determining a sufficient sample size to detect the effect of a primary covariate of interest in generalized linear, Cox, and Fine-Gray regression models. This methodology supports a broad class of generalized linear models that includes the frequently used logistic and Poisson models as well as less common choices such as probit, exponential, and inverse Gaussian regression. These techniques support nearly arbitrary choices of primary and other covariates, the only requirement being that they have finite variance. Moreover, they can accommodate correlation between the main effect and other covariates. By including these features, the proposed techniques both synthesize and extend a collection of existing sample size formulas in the statistical literature for restricted sets of regression models and covariate distributions. For the Fine-Gray competing risks regression model in particular, the proposed formula represents a substantial generalization of the work of Latouche et al.7 to application with a general class of covariates in both the censoring-complete and random right censoring settings.
A simulation study showed that the proposed formulas give sample sizes that come close to attaining the nominal power levels under a wide variety of choices of data distributions and covariate effects. Notably, in the presence of correlation between the primary variable and other covariates, the inflation factor adjustment to the sample size provided a suitable increase, generally yielding empirical power levels that met or exceeded the targeted levels. Application of the formula was demonstrated through sample size determination for biomarker validation studies, using an existing biomarker study’s data as a reference. Because of the observational nature of studies such as this and previous knowledge of other risk factors for the outcome of interest, it is crucial to adjust for these factors in the analysis and account for their possible correlation with the main effect of interest. These considerations also apply to nonrandomized clinical trials, where randomization may not be possible due to ethical or feasibility concerns. This makes our proposed methodology especially valuable in nonrandomized trials as well as observational studies.
The formulas and examples presented here focus on determining an appropriate sample size for designing a study. Because these formulas demonstrate a clear relationship between the sample size, power, and effect size, they are also useful for determining the power provided by a given sample to detect a specified effect and for specifying the smallest effect that may be targeted at a given power level and sample size. These applications are particularly useful when the sample size is predetermined or constrained, such as in designing studies of retrospective data or prospective studies in rare disease populations, where feasibility considerations restrict the sample size to be modest.
Our sample size/power estimation approach for the Cox and Fine-Gray models may be used to determine either the number of events or the number of subjects (n) required. The factor relating these quantities in the formulas is the probability of an event of interest being observed in the study, ψ or ψ*. This probability depends not only on the event time distribution itself but also on design features including the follow-up period, the random censoring distribution, and, for prospective studies, the accrual period and pattern. Care must be taken in specifying these elements so that ψ (or ψ*), and consequently n, may be estimated accurately. Historical data on related research studies and the proposed strategy for recruitment of patients and centers can guide accurate specification of the censoring and accrual distributions.
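To illustrate how design features determine this probability, consider a deliberately simplified setting not taken from the paper: exponential event times, uniform accrual over an interval, and administrative censoring at the end of the study, with no other loss to follow-up. Under these assumptions ψ has a closed form, ψ = 1 − E[exp(−λ(τ − entry))], which the sketch below verifies by simulation; the rate, accrual period, and study length are hypothetical.

```python
import random
from math import exp, log

def psi_analytic(lam, accrual, tau):
    """P(event observed) for Exp(lam) event times, entry uniform on
    [0, accrual], and administrative censoring at calendar time tau
    (tau >= accrual): 1 - exp(-lam*tau) * (exp(lam*accrual) - 1) / (lam*accrual)."""
    return 1.0 - exp(-lam * tau) * (exp(lam * accrual) - 1.0) / (lam * accrual)

def psi_monte_carlo(lam, accrual, tau, n=200_000, seed=1):
    """Simulate the same design to check the closed form."""
    rng = random.Random(seed)
    observed = 0
    for _ in range(n):
        entry = rng.uniform(0.0, accrual)
        event_time = rng.expovariate(lam)
        if event_time <= tau - entry:  # event seen before administrative censoring
            observed += 1
    return observed / n

# Hypothetical design: 2-year accrual, 4 years total study time, and a
# hazard chosen so that roughly half of subjects event within 2 years.
lam, accrual, tau = log(2) / 2, 2.0, 4.0
print(psi_analytic(lam, accrual, tau))  # ≈ 0.639
```

In practice ψ would instead be estimated from historical data or richer design assumptions, but even this toy version shows how shortening follow-up or slowing accrual shrinks ψ and inflates the required n.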
This paper focuses on accurate sample size/power specification in designing individual studies because a study is the fundamental unit for performing effect size estimation and hypothesis testing. An important problem not addressed here is power and precision calculation for meta-analyses, which aim to aggregate multiple studies' findings into a more precise assessment of the primary effect of interest. Random or mixed effects models are often employed in this setting to perform this evaluation while accounting for potential heterogeneity of the primary effect between studies. We will explore extending the methodology presented here to accommodate regression models commonly used in meta-analysis, thereby aiding the design of meta-analysis studies as well.
For the examples, we assumed that the biomarkers were measured precisely, without error. However, measurement error in predictor variables can arise in observational studies, and its presence can cause a loss of power in testing the effects of these variables if it is improperly handled. Tosteson et al.19 and Gilbert et al.20 present schemes for sample size determination when measurement error exists on the main covariate of interest in a logistic regression that can account for correlation between covariates. Future work will investigate how to incorporate this type of error into our methods for accurate sample size calculation in nonlinear regression models.
In addition to providing accurate sample size determination at the design stage, our technique may also offer accurate computation for the purpose of unblinded sample size reestimation partway through a study. Because limited information may exist at the design phase to guide specification of nuisance parameters such as covariate effects and event and censoring rates, this can be a valuable tool to salvage a study that is underpowered due to misspecification. In particular, Tarima et al.21 proposed a method for interim sample size reestimation that uses resampling of interim data to estimate the distribution of nuisance parameters and, correspondingly, to adjust the total sample size to attain the targeted power level. Our proposed formulas are compatible with their technique: the resampling method computes interim estimates for nuisance parameters and adjusted α and π values that can be plugged into our formulas to obtain an updated sample size that attains the intended type I error and power levels. The validity of their method, like ours, is proven through derivations that rely on asymptotic properties. Therefore, the combined use of our and their methods for sample size recalculation in studies of more modest size requires evaluation before this joint approach can be recommended for wide-scale use. We plan to assess the small sample performance of this application of our formulas through a comprehensive simulation study.
Acknowledgments
Support for this study was provided to Michael J. Martens by the Blood and Marrow Transplant Clinical Trials Network and the National Heart, Lung, and Blood Institute by grant #F31HL134317. Support for the BMT CTN 0402 trial was provided by grant #U10HL069294 to the Blood and Marrow Transplant Clinical Trials Network from the National Heart, Lung, and Blood Institute and the National Cancer Institute, along with contributions by Wyeth Pharmaceuticals Inc. The authors thank the Blood and Marrow Transplant Clinical Trials Network for permitting use of the 0402 trial data. The content is solely the responsibility of the authors and does not necessarily represent the official views of the above-mentioned parties.
Funding information
Blood and Marrow Transplant Clinical Trials Network; National Heart, Lung, and Blood Institute, Grant/Award Number: F31HL134317 and U10HL069294; National Cancer Institute, Grant/Award Number: U10HL069294
6 |. SUPPORTING INFORMATION
Additional information for this article is available online, including the derivations of Theorems 1–3.
Data Availability Statement
The data used for the examples in Section 4 are available from the Blood and Marrow Transplant Clinical Trials Network. Restrictions apply to the availability of these data, which were permitted to be analyzed by the authors for this study. Data are available from the authors with the permission of the Blood and Marrow Transplant Clinical Trials Network.
References
1. Abu Zaid M, Wu J, Wu C, et al. Plasma biomarkers of risk for death in a multicenter phase 3 trial with uniform transplant characteristics post–allogeneic HCT. Blood. 2017;129(2):162–170.
2. Cox DR. Regression models and life-tables (with discussion). J R Stat Soc Series B Stat Methodol. 1972;34(2):187–220.
3. Fine JP and Gray RJ. A proportional hazards model for the subdistribution of a competing risk. J Am Stat Assoc. 1999;94(446):496–509.
4. Hsieh FY, Bloch DA, and Larsen MD. A simple method of sample size calculation for linear and logistic regression. Stat Med. 1998;17(14):1623–1634.
5. Signorini DF. Sample size for Poisson regression. Biometrika. 1991;78(2):446–450.
6. Schoenfeld DA. Sample-size formula for the proportional-hazards regression model. Biometrics. 1983;39(2):499–503.
7. Latouche A, Porcher R, and Chevret S. Sample size formula for proportional hazards modelling of competing risks. Stat Med. 2004;23(21):3263–3274.
8. Self SG and Mauritsen RH. Power/sample size calculations for generalized linear models. Biometrics. 1988;44(1):79–86.
9. Liu G and Liang K. Sample size calculations for studies with correlated observations. Biometrics. 1997;53(3):937–947.
10. Hanley JA and Moodie EEM. Sample size, precision and power calculations: a unified approach. J Biom Biostat. 2011;2(124). doi:10.4172/2155-6180.1000124.
11. Hsieh FY, Lavori PW, Cohen HJ, et al. An overview of variance inflation factors for sample-size calculation. Eval Health Prof. 2003;26(3):239–257.
12. Hsieh FY and Lavori PW. Sample-size calculations for the Cox proportional hazards regression model with nonbinary covariates. Control Clin Trials. 2000;21(6):552–560.
13. Schmoor C, Sauerbrei W, and Schumacher M. Sample size considerations for the evaluation of prognostic factors in survival analysis. Stat Med. 2000;19(4):441–452.
14. Nelder JA and Wedderburn RWM. Generalized linear models. J R Stat Soc Ser A. 1972;135(3):370–384.
15. Jørgensen B. Exponential dispersion models. J R Stat Soc Series B Stat Methodol. 1987;49(2):127–145.
16. Martens MJ and Logan BR. A group sequential test for treatment effect based on the Fine–Gray model. Biometrics. 2018;74(3):1006–1013.
17. Cutler C, Logan BR, Nakamura R, et al. Tacrolimus/sirolimus vs. tacrolimus/methotrexate for graft-vs.-host disease prophylaxis after HLA-matched, related donor hematopoietic stem cell transplantation: results of Blood and Marrow Transplant Clinical Trials Network trial 0402. Blood. 2012;120(21):739–739.
18. Royston P, Altman DG, and Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med. 2006;25(1):127–141.
19. Tosteson TD, Buzas JS, Demidenko E, et al. Power and sample size calculations for generalized regression models with covariate measurement error. Stat Med. 2003;22(7):1069–1082.
20. Gilbert PB, Janes HE, and Huang Y. Power/sample size calculations for assessing correlates of risk in clinical efficacy trials. Stat Med. 2016;35(21):3745–3759.
21. Tarima S, He P, Wang T, et al. An interim sample size recalculation for observational studies. Obs Stud. 2016;2(2):65–85.
