artcat: Sample-size calculation for an ordered categorical outcome

Ian R White; Ella Marley-Zagar; Tim P Morris; Mahesh K B Parmar; Patrick Royston; Abdel G Babiker

doi:10.1177/1536867X231161934

Author manuscript; available in PMC: 2023 Apr 24.

Published in final edited form as: Stata J. 2023 Apr 5;23(1):3–23. doi: 10.1177/1536867X231161934

PMCID: PMC7614472 EMSID: EMS174193 PMID: 37155554

Abstract

We describe a new command, artcat, that calculates sample size or power for a randomized controlled trial or similar experiment with an ordered categorical outcome, where analysis is by the proportional-odds model. artcat implements the method of Whitehead (1993, Statistics in Medicine 12: 2257–2271). We also propose and implement a new method that 1) allows the user to specify a treatment effect that does not obey the proportional-odds assumption, 2) offers greater accuracy for large treatment effects, and 3) allows for noninferiority trials. We illustrate the command and explore the value of an ordered categorical outcome over a binary outcome in various settings. We show by simulation that the methods perform well and that the new method is more accurate than Whitehead’s method.

Keywords: st0700, artcat, sample size, power, clinical trial, randomized controlled trial, noninferiority trial, substantial-superiority trial, categorical variable, proportional-odds model, evaluation

1. Introduction

The power of a randomized controlled trial or similar experiment is the probability that the primary analysis will show a statistically significant result in favor of the studied treatment (or other intervention). Designers of randomized controlled trials (which we henceforth simply call “trials”) typically aim to have 80% or 90% power for a given true treatment effect. Sample-size calculations are used to determine either the sample size required to give a specified power or the power implied by a specified sample size. Various formulas are in wide use (Julious 2009).

The most common sample-size calculation is for comparing two groups, treatment and control, also called “arms”. Multiarm trials improve efficiency by evaluating several new treatments in one trial (Parmar et al. 2017) and are usually designed using a two-group sample-size calculation, assuming that each treatment group will be compared with the control group. Sample-size calculations for general tests of heterogeneity between treatment groups are rarely used and are not discussed in this article.

In Stata, several standard sample-size calculations are available in the built-in power family. More advanced sample-size calculations are provided in the Analysis of Resources for Trials package (Barthel, Royston, and Babiker 2005; Barthel et al. 2006; Royston and Barthel 2010; Marley-Zagar et al. 2023).

However, none of these packages allows for an ordered categorical outcome, sometimes called an ordinal outcome. Such outcomes have been used, for example, in a trial evaluating treatments for influenza, where a six-category outcome was defined as 1) death, 2) in intensive care, 3) hospitalized but requiring supplemental oxygen, 4) hospitalized and not requiring supplemental oxygen, 5) discharged but unable to resume normal activities, or 6) discharged with full resumption of normal activities (Davey et al. 2019).

The present work was motivated by the need to consider the use of ordered categorical outcomes in a proposed trial of treatments for COVID-19, for example, a three-level outcome of death, in hospital, or alive and not in hospital. Other trials of treatments for COVID-19 have used various outcome scales, typically with six to eight ordered categories.

In this article, we introduce a new command, artcat, that addresses this need. The command performs sample-size calculations using the method of Whitehead (1993). We also introduce a new method that is both more flexible and more accurate than Whitehead’s method. The methods are described in section 2. The syntax is described in section 3, followed by examples in section 4, simulation evaluations in section 5, and a description of our procedures for testing the software in section 6. We end with section 7 suggesting future directions.

2. Methods

2.1. General sample-size formulas

Suppose the benefit of treatment is captured by an estimand θ (for example, a risk difference or log odds-ratio) so that the analysis of a superiority trial involves a significance test of the null hypothesis θ = 0. The designers of the trial want to ensure a high power, defined as the probability of a significant result, under the assumption that θ = d. Sample-size formulas relate the type II error (β = 1 – power) to the sample size n when the type I error is set to α. A general sample-size formula relates the required variance of an estimator $\hat{θ}$ to d, α, and β [Julious 2004, (2)],

var (\hat{θ}) = \frac{d^{2}}{{(z_{1 - α / 2} + z_{1 - β})}^{2}}

where z_p is the standard normal deviate with cumulative probability p. Because var($\hat{θ}$) is, to a very good approximation, inversely proportional to the total sample size n, we can write var($\hat{θ}$) = V/n for some V; methods for calculating V in particular settings are described below. Hence, the total sample-size requirement is

n = \frac{V {(z_{1 - α / 2} + z_{1 - β})}^{2}}{d^{2}}

(1)

The formula above implicitly assumes that the variance is the same under the null and alternative hypotheses, and this is not true for categorical outcomes. For example, for binary data, binomial variation follows distributions with different probabilities in the two groups, but under the null hypothesis, the average probability is assumed for both groups. We refine (1) by letting var($\hat{θ}$) = V/n describe the variance of the estimator when θ = d, while var($\hat{θ}$) = V_test/n describes its variance when the null hypothesis is assumed for the data. This gives an improved sample-size formula

n = \frac{{(\sqrt{V_{test}} z_{1 - α / 2} + \sqrt{V} z_{1 - β})}^{2}}{d^{2}}

(2)

Let var($\hat{θ}$) = V_N/n under the null and var($\hat{θ}$) = V_A/n under the alternative hypothesis. A “local” test, assuming small treatment effects, sets V = V_test = V_N; we call this method NN. A “distant” test, valid for small or large treatment effects, sets V = V_A. We may then have V_test = V_A (method AA), appropriate if a Wald test is used, or V_test = V_N (method NA), appropriate for the score test or approximations to it such as the likelihood-ratio test. All of these values are substituted into (2); methods NN and AA are given by the simpler formula (1) with V = V_N and V = V_A, respectively. This gives the formulas

Method NN : n = \frac{V_{N} {(z_{1 - α / 2} + z_{1 - β})}^{2}}{d^{2}}

(3)

Method NA : n = \frac{{(\sqrt{V_{N}} z_{1 - α / 2} + \sqrt{V_{A}} z_{1 - β})}^{2}}{d^{2}}

(4)

Method AA: n = \frac{V_{A} {(z_{1 - α / 2} + z_{1 - β})}^{2}}{d^{2}}

(5)

For binary data, these formulas are commonly used with the estimand θ defined as the risk difference. artbin offers a “local” method (NN), a “distant” method (NA), and a “Wald” method (AA). For ordinal data, θ will be defined as the log odds-ratio.
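As an illustration of formulas (3) to (5), the following Python sketch evaluates the three methods for a binary outcome with the risk difference as estimand. It is not part of artbin or artcat: the function names are hypothetical, equal allocation is assumed, and the variances are the usual binomial ones (V_N uses the average probability in both groups, V_A uses the group-specific probabilities). With anticipated probabilities 0.4 and 0.2 and 90% power, method NA should reproduce the total of 218 reported in the artbin output shown in this article.

```python
import math
from statistics import NormalDist


def sample_size(v_test, v, d, alpha=0.05, power=0.9):
    """Total sample size from formula (2); reduces to (1) when v_test == v."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (math.sqrt(v_test) * z_alpha + math.sqrt(v) * z_beta) ** 2 / d ** 2


def binary_risk_difference_sizes(p_c, p_e, alpha=0.05, power=0.9):
    """Methods NN, NA and AA for a risk difference, assuming 1:1 allocation."""
    p_bar = (p_c + p_e) / 2
    # n * var(risk difference) when each group has n/2 participants
    v_null = 4 * p_bar * (1 - p_bar)
    v_alt = 2 * (p_c * (1 - p_c) + p_e * (1 - p_e))
    d = p_e - p_c
    return {
        "NN": sample_size(v_null, v_null, d, alpha, power),
        "NA": sample_size(v_null, v_alt, d, alpha, power),
        "AA": sample_size(v_alt, v_alt, d, alpha, power),
    }


def round_up_per_group(n_total):
    """Round each of two equal groups up to the next integer."""
    return 2 * math.ceil(n_total / 2)


sizes = binary_risk_difference_sizes(0.4, 0.2)
for method, n in sizes.items():
    print(method, round(n, 1), round_up_per_group(n))
```

For the risk difference in this example the null variance exceeds the alternative variance, so method AA gives the smallest sample size and method NN the largest.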

2.2. Whitehead’s method

We use the equations above in the specific case of an ordered categorical outcome Y and randomized treatment Z. Let the distribution of Y in the control group be $p (Y = i ∣ Z = c) = p_{c i}$ , and let the distribution of Y in the experimental (research) group be $p (Y = i ∣ Z = e) = p_{e i}$ , for i = 1, . . . , I. We initially assume for definiteness that outcome level 1 is the least favorable and level I is the most favorable and that the aim of the study is to demonstrate lower probabilities of the worse outcomes in the experimental group.

Whitehead (1993) considered the case where the n participants are randomized to control and experimental groups in the ratio a : 1 and the distributions of the outcome in the two groups obey a proportional-odds model,

logit \sum_{i = 1}^{i = k} p_{e i} = logit \sum_{i = 1}^{i = k} p_{c i} + θ

(6)

for any k = 1, . . . , I − 1 (McCullagh 1980). Here θ is the log odds-ratio, which is assumed common across levels k. θ < 0 indicates lower probabilities of the less favorable outcomes in the experimental group and hence a beneficial treatment. This led to the formula

n = \frac{3 {(a + 1)}^{2} {(z_{1 - α / 2} + z_{1 - β})}^{2}}{a d^{2} (1 - \sum_{i} {\bar{p}}_{i}^{3})}

(7)

where ${\bar{p}}_{i} = (a p_{c i} + p_{e i}) / (a + 1)$ and d is the expected value of θ (Whitehead 1993).
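Formula (7) is simple enough to evaluate directly. The sketch below is a minimal Python illustration, not the artcat implementation; the function names are hypothetical. It derives the experimental-group probabilities from the control-group probabilities and a common odds ratio using model (6), then applies (7) without rounding. The control probabilities are those of the six-level FLU-IVIG example used later in the article.

```python
import math
from statistics import NormalDist


def proportional_odds_shift(p_c, odds_ratio):
    """Experimental-group probabilities implied by model (6)."""
    theta = math.log(odds_ratio)
    cum_c, cum_e = 0.0, [0.0]
    for p in p_c[:-1]:
        cum_c += p
        logit_e = math.log(cum_c / (1 - cum_c)) + theta
        cum_e.append(1 / (1 + math.exp(-logit_e)))
    cum_e.append(1.0)
    return [hi - lo for lo, hi in zip(cum_e, cum_e[1:])]


def whitehead_n(p_c, odds_ratio, a=1.0, alpha=0.05, power=0.9):
    """Total sample size from formula (7), before any rounding."""
    p_e = proportional_odds_shift(p_c, odds_ratio)
    # allocation-weighted average probability at each level
    p_bar = [(a * c + e) / (a + 1) for c, e in zip(p_c, p_e)]
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    d = math.log(odds_ratio)
    return 3 * (a + 1) ** 2 * z ** 2 / (a * d ** 2 * (1 - sum(p ** 3 for p in p_bar)))


control = [0.018, 0.036, 0.156, 0.141, 0.39, 0.259]
print(whitehead_n(control, 0.5))
```

For an odds ratio of 0.5 and 90% power, the unrounded result is a little under 291, in line with the Whitehead value reported for that odds ratio in the evaluation of the FLU-IVIG setting.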

This is a good and widely known formula. However, it has three limitations. First, it requires a common odds ratio to be specified at the design stage. In our experience, clinicians sometimes propose that treatments will reduce the risk of adverse outcomes by a fixed risk ratio so that p_ei/p_ci is the same for all i < I. This does not provide a value θ. Second, the expression used for the variance V is valid only under the null, so (7) represents method NN, and other methods may be more accurate; Whitehead (1993) discussed alternatives. Third, the formula does not cover noninferiority trials (see section 2.4 below). Our new proposal addresses these limitations.
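The first limitation can be seen numerically. The following Python sketch (illustrative only; the function names and the risk ratio of 0.8 are assumptions, not values from the article) scales every level except the most favorable by a fixed risk ratio and then computes the log odds-ratio at each cutpoint k of model (6). The values differ across cutpoints, so no single θ describes this treatment effect.

```python
import math


def fixed_risk_ratio(p_c, rr):
    """Experimental probabilities when every level except the last is scaled by rr."""
    p_e = [rr * p for p in p_c[:-1]]
    p_e.append(1 - sum(p_e))
    return p_e


def cumulative_log_odds_ratios(p_c, p_e):
    """Log odds-ratio at each cutpoint k = 1, ..., I-1 of model (6)."""
    out, cum_c, cum_e = [], 0.0, 0.0
    for c, e in zip(p_c[:-1], p_e[:-1]):
        cum_c += c
        cum_e += e
        out.append(math.log(cum_e / (1 - cum_e)) - math.log(cum_c / (1 - cum_c)))
    return out


control = [0.018, 0.036, 0.156, 0.141, 0.39, 0.259]
experimental = fixed_risk_ratio(control, 0.8)
log_ors = cumulative_log_odds_ratios(control, experimental)
print([round(x, 3) for x in log_ors])
```

The log odds-ratio moves further from zero as the cumulative control probability grows, which is why a fixed risk ratio cannot be translated into a common odds ratio.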

2.3. New proposal

We propose a new method of sample-size determination that is valid for arbitrary sets of (p_ci, p_ei) that may not obey the proportional-odds model. The idea is to evaluate V_N and V_A by constructing a dataset of expected outcomes and fitting the proportional-odds model with the ologit command.

1. We construct a dataset containing the expected outcomes per participant recruited. This contains two records for each outcome level: one for the experimental group and one for the control group. For each record, we compute the probability that a participant is randomized to that group and has his or her outcome at that level. For the control group, for outcome level i, this probability is $p (Z = c) p (Y = i ∣ Z = c) = a p_{c i} / (a + 1)$. For the experimental group, this probability is $p (Z = e) p (Y = i ∣ Z = e) = p_{e i} / (a + 1)$. These probabilities sum to 1.

2. We perform an ologit analysis of this dataset, using these probabilities as importance weights. This analysis yields the expected treatment effect d as the coefficient of Z. If the proportional-odds assumption does not hold, then d is interpreted as an average log odds-ratio. This analysis also yields the variance V_A as the estimated variance of the estimated coefficient of Z, so that the variance for a dataset of size n will be V_A/n. This is enough to implement method AA using (5).

3. For methods NN and NA, we change the weights to their values under the null, $a {\bar{p}}_{i} / (a + 1)$ and ${\bar{p}}_{i} / (a + 1)$, and refit the ologit analysis. Then V_N is the estimated variance of the estimated coefficient of Z. We can then use (3) for method NN and (4) for method NA.
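The steps above can be sketched in Python for the special case where the specified probabilities obey the proportional-odds model exactly. In that case a weighted fit reproduces the specified probabilities, so V_A and V_N can be read from the per-participant information matrix without iterating. This is a simplifying assumption made for the sketch, not how artcat works: artcat fits ologit to the expected dataset, which also covers probabilities that do not obey proportional odds. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def po_shift(p_c, odds_ratio):
    """Experimental-group probabilities under the proportional-odds model (6)."""
    cum_c, cum_e = 0.0, [0.0]
    for p in p_c[:-1]:
        cum_c += p
        odds_e = odds_ratio * cum_c / (1 - cum_c)
        cum_e.append(odds_e / (1 + odds_e))
    cum_e.append(1.0)
    return [hi - lo for lo, hi in zip(cum_e, cum_e[1:])]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def theta_variance(p_c, p_e, a=1.0):
    """V such that var(theta-hat) = V / n, from the per-participant information."""
    levels = len(p_c)
    k = levels - 1                        # number of cutpoints; theta has index k
    info = [[0.0] * (k + 1) for _ in range(k + 1)]
    groups = ((0, a / (a + 1), p_c), (1, 1 / (a + 1), p_e))
    for z, weight, probs in groups:
        gamma = [0.0]                     # cumulative probabilities, 0 to 1
        for p in probs[:-1]:
            gamma.append(gamma[-1] + p)
        gamma.append(1.0)
        dens = [g * (1 - g) for g in gamma]   # logistic density at each cutpoint
        for i in range(1, levels + 1):
            grad = [0.0] * (k + 1)            # gradient of P(Y = i | Z = z)
            if i <= k:
                grad[i - 1] = dens[i]
            if i >= 2:
                grad[i - 2] = -dens[i - 1]
            grad[k] = z * (dens[i] - dens[i - 1])
            for r in range(k + 1):
                for c in range(k + 1):
                    info[r][c] += weight * grad[r] * grad[c] / probs[i - 1]
    return invert(info)[k][k]


def po_sample_size(p_c, odds_ratio, a=1.0, alpha=0.05, power=0.9, method="NA"):
    """Total sample size by formulas (3) to (5), before rounding."""
    p_e = po_shift(p_c, odds_ratio)
    p_bar = [(a * c + e) / (a + 1) for c, e in zip(p_c, p_e)]
    v_null = theta_variance(p_bar, p_bar, a)   # weights at their null values
    v_alt = theta_variance(p_c, p_e, a)        # weights under the alternative
    v_test, v = {"NN": (v_null, v_null), "NA": (v_null, v_alt), "AA": (v_alt, v_alt)}[method]
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    d = math.log(odds_ratio)
    return (math.sqrt(v_test) * z_alpha + math.sqrt(v) * z_beta) ** 2 / d ** 2


control = [0.018, 0.036, 0.156, 0.141, 0.39, 0.259]
for method in ("NN", "NA", "AA"):
    print(method, round(po_sample_size(control, 0.5, method=method), 1))
```

As a check on the sketch, a binary outcome with control probability 0.2 and odds ratio 0.5 should give values that round up to 666, 686, and 717 for methods NN, NA, and AA, matching the corresponding artcat entries in the article's binary-outcome comparison table.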

2.4. Noninferiority trials

In a noninferiority trial, the null hypothesis is that the experimental treatment is worse than the control treatment by a prespecified amount m, termed the margin. The margin typically represents a small degree of worsening of the primary outcome that is judged to be acceptable because of other advantages of the experimental treatment that are not captured by the primary outcome. In the setting of a categorical outcome ordered from least (1) to most (I) favorable outcome, the margin is expressed as an odds ratio greater than 1, and m > 0 is the log odds-ratio. The null hypothesis is then θ = m, and the alternative hypothesis is θ < m. Typically, the investigators expect the true treatment effect to be 0, so that p_ei = p_ci for all i and d = 0, but some noninferiority trials are designed under the expectation that the experimental treatment is somewhat beneficial and so d < 0 (for example, Nunn et al. [2019]).

The expected (alternative) variance of the estimate is given in the same way as for a superiority trial, but the test (null) variance must be calculated differently to reflect the noninferiority null. This is easily done in the ologit framework described above, with (2) modified to

n = \frac{{(\sqrt{V_{test}} z_{1 - α / 2} + \sqrt{V} z_{1 - β})}^{2}}{{(d - m)}^{2}}

Steps 1 and 2 are unchanged. At step 3, we fit model (6) to the dataset of expected results per participant under the null θ = m by using the offset() option of ologit. We then estimate the fitted probabilities, with which we revise the dataset of expected results per participant and fit model (6) again, yielding the test (null) variance V_N. If this procedure is applied with m = 0, then the results are the same as with a superiority trial.

These methods also apply without modification to a substantial-superiority trial, in which the aim is to show that the experimental treatment is substantially superior to the control; this is implemented by setting the margin m < 0. A substantial-superiority trial requires a larger sample size than a superiority trial with the same expected treatment effect.
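A rough worked example shows why the substantial-superiority design needs more participants. The numbers are illustrative assumptions, not taken from the article: an unfavorable outcome, an expected odds ratio of 0.5, and a margin odds ratio of 0.8. Holding V_test and V fixed (only approximately true, because the test variance is re-evaluated under the shifted null), the modified formula gives

```latex
% Illustrative values only: d = \log 0.5, m = \log 0.8
\frac{n_{\text{substantial superiority}}}{n_{\text{superiority}}}
  \approx \frac{d^{2}}{(d - m)^{2}}
  = \frac{(\log 0.5)^{2}}{(\log 0.5 - \log 0.8)^{2}}
  = \frac{(-0.693)^{2}}{(-0.470)^{2}}
  \approx 2.2
```

so the sample size roughly doubles, because the distance between the expected effect and the null shrinks from |d| to |d − m|.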

2.5. Risk difference, risk ratio, or odds ratio

The odds ratio is often a sensible estimand for an ordered categorical outcome because it is plausibly constant across different levels [k in (6)], unlike the risk difference and risk ratio. For a binary outcome, this issue does not arise, and the risk difference or risk ratio is usually preferred because of its simpler interpretation (Altman, Deeks, and Sackett 1998).

In the binary outcome case, we may ask how sample-size calculations with the different estimands compare. In a superiority trial, all estimands imply the same null hypothesis—that the two treatments are equal. Sample-size calculations with different estimands then address the same question but use different approximations: artbin assumes a normal distribution for the estimated risk difference, while artcat assumes this for the estimated log odds-ratio. We will explore the impact of these different approximations in section 4.2. In a noninferiority trial, by contrast, the null hypothesis depends on the estimand used (Quartagno et al. 2020), so sample-size calculations with different estimands are not comparable and may differ markedly.

3. Syntax

artcat, pc(numlist) {pe(numlist) | or(exp) | rr(exp)} [[power(#) | n(#)] cumulative [unfavourable | unfavorable | favourable | favorable] margin(#) aratio(# #) alpha(#) onesided ologit[(type)] whitehead noprobtable probformat(string) format(string) noround noheader]

3.1. Options

pc(numlist) specifies the probabilities in each outcome level in the control group, or cumulative probabilities if the cumulative option is used; the rightmost level may be omitted. pc() is required.

pe(numlist) specifies the probabilities in each outcome level in the experimental group, specified as for pc(), or cumulative probabilities if the cumulative option is used. One of pe(), or(), or rr() must be specified.

or(exp) specifies the odds ratio at each outcome level. An odds ratio less than 1 means that the distribution in the experimental group is shifted toward the rightmost level compared with the control group. One of pe(), or(), or rr() must be specified.

rr(exp) specifies the risk ratio at each outcome level except the rightmost. A risk ratio less than 1 means that the experimental group has lower probability at every level except the rightmost level compared with the control group. One of pe(), or(), or rr() must be specified.

power(#) specifies the power required; sample size will be computed. The default is power(0.8) if neither power() nor n() is specified. You cannot specify both power() and n().

n(#) specifies the total sample size; power will be computed. You cannot specify both power() and n().

cumulative specifies that the probabilities in pc() and pe() are cumulative probabilities.

unfavourable or unfavorable specifies that the leftmost outcome level represents the least favorable outcome. Both American and English spellings are allowed.

favourable or favorable specifies that the leftmost outcome level represents the most favorable outcome. Both American and English spellings are allowed.

margin(#) specifies the margin, as an odds ratio, for a noninferiority or a substantial-superiority trial. If the unfavorable option is specified, then # > 1 specifies a noninferiority trial, and # < 1 specifies a substantial-superiority trial. If the favorable option is specified, then it is the other way around. If margin() is not specified or if margin(1) is specified, then a superiority trial is assumed.

aratio(# #) specifies the allocation ratio; for example, aratio(1 2) means 2 participants in the experimental group for every 1 participant in the control group.

alpha(#) specifies the significance level. The default is alpha(0.05).

onesided specifies that the level specified by alpha() is the one-sided significance level. The default is a two-sided significance level.

ologit[(type)] uses the ologit (new) method. type may be NN, NA, or AA. The default is ologit(NA); specifying ologit without a type is the same as ologit(NA).

whitehead uses the Whitehead method. This option requires or() to be specified and is not available with margin().

noprobtable specifies not to display the table of anticipated probabilities (probabilities at each level in the control and experimental groups).

probformat(string) specifies the format for displaying the table of anticipated probabilities. The default is probformat(%-5.1f).

format(string) specifies the format for displaying calculated sample sizes (default is format(%6.1f)) or powers (default is format(%6.3f)).

noround specifies not to round the sample size per group to the next largest integer.

noheader specifies not to print the header describing the program.

3.2. Favorable and unfavorable outcomes

We recommend that the user specify whether the leftmost level of the outcome is favorable or unfavorable. However, the program also works this out for itself. In a superiority trial, an expected odds ratio smaller (larger) than 1 implies an unfavorable (favorable) outcome. If the margin is specified, then the criterion is whether the expected odds ratio is smaller or larger than the margin. If the user has specified the favorable or unfavorable option, then this is checked; if not, the program prints a note stating which it has inferred.
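The inference rule just described can be written as a small Python sketch. This is an illustration of the rule, not artcat's code: the function name is hypothetical, and raising an error when the expected odds ratio equals the margin is an assumption made here because the rule does not cover that case.

```python
def infer_outcome_direction(expected_or, margin_or=1.0):
    """Return what the leftmost outcome level is taken to represent.

    margin_or=1.0 corresponds to a superiority trial (no margin specified).
    """
    if expected_or == margin_or:
        # Assumption of this sketch: the direction cannot be inferred here.
        raise ValueError("expected odds ratio equals the margin")
    return "unfavourable" if expected_or < margin_or else "favourable"


# Superiority trial with expected odds ratio 1/1.77
print(infer_outcome_direction(1 / 1.77))
# Noninferiority trial with expected odds ratio 1 and margin 1.33
print(infer_outcome_direction(1.0, margin_or=1.33))
```

Both calls return "unfavourable", which agrees with the outcome direction shown in the corresponding artcat outputs in section 4.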

4. Examples

4.1. Six-level outcome

We reproduce the sample-size calculation for the FLU-IVIG trial (Davey et al. 2019). The control arm is expected to have a 1.8% probability of the least favorable outcome (death), a 3.6% probability of the next worst outcome (admission to an intensive care unit), and so on. The trial is designed to have 80% power if the treatment achieves an odds ratio of 1.77 for a favorable outcome. We invert this to match the assumption above of an unfavorable outcome.

. artcat, pc(.018 .036 .156 .141 .39) or(1/1.77) unfavourable

ART - ANALYSIS OF RESOURCES FOR TRIALS (categorical version 1.2 24jun2022)

A sample size program by Ian White with input and support from
Ella Marley-Zagar, Tim Morris, Max Parmar, Patrick Royston and Ab Babiker.
MRC Clinical Trials Unit at UCL, London WC1V 6LJ, UK.

Type of trial                        superiority
Favourable/unfavourable outcome      unfavourable
Null hypothesis                      odds ratio = 1
Superiority region                   odds ratio < 1
Allocation ratio C:E                 1:1
Anticipated probabilities, control   .018 .036 .156 .141 .39
                      experimental   given by odds ratio = 0.565
Table of anticipated probabilities   C       E
  1 least favourable                 0.018   0.010
  2                                  0.036   0.021
  3                                  0.156   0.099
  4                                  0.141   0.103
  5                                  0.390   0.384
  6 most favourable                  0.259   0.382
Alpha                                0.050 (two-sided)
Power (designed)                     0.800
Method                               ologit (variance NA)
Total sample size (calculated)       322
Sample size per group (calculated)   161 161
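The experimental-group column of the table of anticipated probabilities can be checked by hand. The Python sketch below (illustrative; the function name is hypothetical) completes the control probabilities given in pc(), applies the common odds ratio 1/1.77 to each cumulative control probability as in model (6), and differences the results.

```python
def experimental_probabilities(pc_without_last, odds_ratio):
    """Apply a common odds ratio to the cumulative control probabilities."""
    # the rightmost level is omitted in pc(), so recover it by subtraction
    p_c = list(pc_without_last) + [1 - sum(pc_without_last)]
    cum_c, cum_e = 0.0, [0.0]
    for p in p_c[:-1]:
        cum_c += p
        odds_e = odds_ratio * cum_c / (1 - cum_c)
        cum_e.append(odds_e / (1 + odds_e))
    cum_e.append(1.0)
    p_e = [hi - lo for lo, hi in zip(cum_e, cum_e[1:])]
    return p_c, p_e


p_c, p_e = experimental_probabilities([0.018, 0.036, 0.156, 0.141, 0.39], 1 / 1.77)
for level, (c, e) in enumerate(zip(p_c, p_e), start=1):
    print(f"{level}  {c:.3f}  {e:.3f}")
```

The printed values should agree with the C and E columns of the artcat output to three decimal places.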

Type of trial                        non-inferiority
Favourable/unfavourable outcome      unfavourable
Null hypothesis                      odds ratio = 1.330
Non-inferiority region               odds ratio < 1.330
Allocation ratio C:E                 1:1
Anticipated probabilities, control   .01 .021 .099 .103 .384
                      experimental   given by odds ratio = 1.000
Alpha                                0.050 (two-sided)
Power (designed)                     0.800
Method                               ologit (variance NA)
Total sample size (calculated)       1314
Sample size per group (calculated)   657 657

A sample size program by Abdel Babiker, Patrick Royston, Friederike Barthel,
Ella Marley-Zagar and Ian White
MRC Clinical Trials Unit at UCL, London WC1V 6LJ, UK.

Type of trial                        superiority
Number of groups                     2
Favourable/unfavourable outcome      unfavourable
                                     Inferred by the program
Allocation ratio                     equal group sizes
Statistical test assumed             unconditional comparison of 2
                                     binomial proportions
                                     using the score test
Local or distant                     distant
Continuity correction                no
Anticipated event probabilities      0.400 0.200
Alpha                                0.050 (two-sided)
                                     (taken as .025 one-sided)
Power (designed)                     0.900
Total sample size (calculated)       218
Sample size per group (calculated)   109 109
Expected total number of events      65.40

Sample size for 90% power:
Odds ratio	Whitehead	New NN	New NA	New AA
0.2	56	56	60	67
0.3	98	98	102	109
0.4	168	168	172	178
0.5	291	291	295	302
0.6	534	534	538	544
0.7	1090	1090	1094	1101
0.8	2777	2777	2781	2787

Power (%) from sample-size formula or simulation:
Odds ratio	Sample size	Whitehead	New NN	New NA	New AA	Simulation
0.2	56	90.1	90.1	88.1	84.5	88.4
0.3	98	90.1	90.1	88.9	86.9	89.2
0.4	168	90.1	90.1	89.4	88.3	89.5
0.5	291	90.0	90.0	89.6	89.0	89.6
0.6	534	90.0	90.0	89.8	89.5	89.7
0.7	1090	90.0	90.0	89.9	89.7	90.1
0.8	2777	90.0	90.0	90.0	89.9	90.1
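The Whitehead column of the table above can be approximated by solving formula (7) for the power at a given total sample size. The Python sketch below is an illustration under that assumption; the function name is hypothetical, the rearrangement of (7) is ours, and the small probability of a significant result in the wrong direction is ignored.

```python
import math
from statistics import NormalDist


def whitehead_power(p_c, odds_ratio, n, a=1.0, alpha=0.05):
    """Approximate power at total sample size n, obtained by inverting formula (7)."""
    theta = math.log(odds_ratio)
    # experimental-group probabilities under the proportional-odds model (6)
    cum_c, cum_e = 0.0, [0.0]
    for p in p_c[:-1]:
        cum_c += p
        logit_e = math.log(cum_c / (1 - cum_c)) + theta
        cum_e.append(1 / (1 + math.exp(-logit_e)))
    cum_e.append(1.0)
    p_e = [hi - lo for lo, hi in zip(cum_e, cum_e[1:])]
    p_bar = [(a * c + e) / (a + 1) for c, e in zip(p_c, p_e)]
    # V such that var(theta-hat) = V / n under the null (method NN)
    v_null = 3 * (a + 1) ** 2 / (a * (1 - sum(p ** 3 for p in p_bar)))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(theta) * math.sqrt(n / v_null) - z_alpha)


control = [0.018, 0.036, 0.156, 0.141, 0.39, 0.259]
print(round(100 * whitehead_power(control, 0.5, 291), 1))
```

Because the tabulated sample sizes are rounded up, the resulting power sits at or slightly above the 90% design value, as in the Whitehead column.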

Sample size, by command (power, artbin, artcat) and method:
Control fraction p_c1	Odds ratio	power	artbin local	artbin distant	artcat Whitehead	artcat New NN	artcat New NA	artcat New AA
0.20	0.2	194	197	192	150	150	180	230
0.20	0.3	286	290	285	249	249	274	314
0.20	0.4	436	439	436	403	403	425	460
0.20	0.5	696	699	694	666	666	686	717
0.20	0.6	1194	1198	1198	1168	1168	1186	1214
0.20	0.7	2318	2322	2322	2294	2294	2311	2336
0.20	0.8	5660	5664	5664	5638	5638	5654	5677
0.02	0.2	1964	1968	1968	1365	1365	1746	2418
0.02	0.3	2792	2795	2795	2253	2253	2585	3137
0.02	0.4	4106	4110	4110	3615	3615	3914	4394
0.02	0.5	6356	6359	6359	5902	5902	6176	6607
0.02	0.6	10622	10626	10626	10201	10201	10454	10848
0.02	0.7	20118	20121	20121	19722	19722	19959	20324
0.02	0.8	48042	48045	48045	47670	47670	47893	48235


Table 1. Sample sizes required to give 90% power for the FLU-IVIG setting, estimated by the Whitehead and new sample-size formulas.
