Abstract
Cancer clinical trials typically generate detailed patient toxicity data. The most common measure used to summarize patient toxicity is the maximum grade among all toxicities and it does not fully represent the toxicity burden experienced by patients. In this article, we study the mathematical and statistical properties of the toxicity index (TI), in an effort to address this deficiency. We introduce a total ordering, (T‐rank), that allows us to fully rank the patients according to how frequently they exhibit toxicities, and show that TI is the only measure that preserves the T‐rank among its competitors. Moreover, we propose a Poisson‐Limit model for sparse toxicity data. Under this model, we develop a general two‐sample test, which can be applied to any summary measure for detecting differences among two population of toxicity data. We derive the asymptotic power function of this class as well as the asymptotic relative efficiency (ARE) of the members of the class. We evaluate the ARE formula empirically and show that if the data are drawn from a random Poisson‐Limit model, the TI is more efficient, with high probability, than the maximum and the average summary measures. Finally, we evaluate our method on clinical trial toxicity data and show that TI has a higher power in detecting the differences in toxicity profile among treatments. The results of this article can be applied beyond toxicity modeling, to any problem where one observes a sparse array of scores on subjects and a ranking based on extreme scores is desirable.
Keywords: adverse events, Poisson‐Limit model, toxicity index, T‐rank preservation, two‐sample test
1. INTRODUCTION
Adverse event reporting has become a crucial step in the assessment of therapies in clinical trials since the first recommendation for grading toxicities was presented, following the efforts of the World Health Organization to standardize cancer treatment reports in the late 1970s. 1 , 2 An adverse event (AE), or a toxicity, is any unfavorable and unintended sign, symptom, or disease temporally associated with the use of a medical treatment or procedure that may or may not be considered related to the medical treatment or procedure. 3 The National Cancer Institute (NCI) published the common toxicity criteria (CTC) in 1983, providing a standardized list of AE terms commonly encountered in oncology to guide investigators in identifying and documenting toxicities. 4 , 5
The CTC has evolved over the last decades to the current and wide‐spread common terminology criteria for adverse events, 3 which is organized in 26 system organ classes, such that each AE term is defined and classified into grades of severity denoted as 0, 1, … , 5 with 0 corresponding to no symptoms, 1 corresponding to a mild symptom, up to 5 indicating death. While CTCAE allows investigators to assess and document toxicities in a systematized manner, it also creates a challenge in summarizing a large amount of data at the patient level. Each patient can experience more than one AE term from different organ systems and different grades of the same AE term during several cycles of treatment resulting in a vector of integers, the toxicity profile, of varying length containing all toxicities grades per patient. In recognition of deficiencies in toxicity reporting, NCI launched a Cancer Moonshot funding opportunity 6 to accelerate research on improved approaches to evaluating the tolerability of cancer treatments.
Currently, the most used approach in phase III trials to summarize multiple toxicities per patient is the maximum‐grade (max‐grade). 7 , 8 , 9 Other approaches such as those of References 10, 11, 12 have been proposed in the literature to address the lack of representation of the toxicity burden experienced by patients when the max‐grade method is applied. In particular, a longitudinal approach (Tox‐T) 12 incorporating low‐grade toxicities and their duration using the average as a summary measure was introduced. Even though Tox‐T better represents the toxicity burden, the average grade is hard to interpret clinically by investigators. In the setting of dose finding in early phase cancer trials, many authors proposed statistical models and dose escalation designs that take into account all grades and types of toxicities with the goal of improving the safety and efficiency of the trial. 13 , 14 , 15 , 16 , 17 , 18 Some of these methods use multivariate models for characterizing the relationship between different grades of toxicities and dose while others, such as the Q‐TWIST, 19 summarize the information from all toxicity grades into a continuous score. In general, under some scenarios for the location of the true maximum tolerated dose (MTD), a modest gain in the precision of the estimate of the MTD is achieved when including information from all toxicities relative to models that use a binary indicator of toxicity (a.k.a. dose limiting toxicity), without compromising the safety of the trial.
The toxicity index (TI), proposed in Reference 20, is a summary measure that preserves the highest grade while incorporating lower grade toxicities. The TI can avoid the loss of information and improve clinical interpretability. It was recently shown that TI is more powerful than other common toxicity summaries in detecting differences among treatments 21 in the National Surgical Adjuvant Breast and Bowel Project clinical trial (NSABP R0‐4). 22 , 23 To quantify the information loss, power curves were empirically estimated by resampling the available data, showing the relative performance of different toxicity summaries in comparing treatments.
While the TI was introduced more than a decade ago, its mathematical and statistical properties have never been studied. In this article, we report on the novel characteristics of TI as a summary measure and its effectiveness in comparing treatments. We also develop a framework for modeling and ranking toxicity data, which as discussed below, is applicable beyond drug toxicity problems. Our contributions in this article are as follows:
We propose a framework for modeling the data in experiments where scores on multiple events are recorded for a collection of subjects. In particular, the model is applicable to n × d arrays of scores on n subject and d adverse events, where each entry is a score in {0, 1, … , K}. We refer to the elements of the latter set as grades. We consider sparse arrays where most of the scores are zero, and argue in favor of a Poisson‐Limit model that simplifies downstream analysis. The model can be equivalently expressed in terms of the reduced frequency vectors (RFV) of the subjects where, for each subject, one records the frequency of each observed grade except zero.
We introduce a total ordering, referred to as the T‐order, on the space of RFVs, that allows for comparing them based on how frequently they exhibit extreme grades. Since the T‐order is a total order (ie, every pair of vectors are comparable), it induces a full ranking among RFVs which we call the T‐rank. This, for example, allows one to fully rank subjects (or treatments), strictly and without ties, except when the RFV vectors are exactly equal. The T‐rank, in particular, is relevant to drug toxicity trials where one wants to emphasize differences in extreme toxicity. The ranking is also useful in any application where variations in the extreme scores among subjects or treatments are of concern.
We consider two‐sample testing based on score data coming from two different Poisson‐Limit models. We propose a general test that compares the two treatments by looking at the difference in mean between values of a summary measure g(·) applied to each subject in each treatment. We study three candidates for g(·): the TI, the average, and the maximum score. We derive exact expressions for the mean and variance of the test statistic under the Poisson‐Limit model, allowing us to analytically set the critical region of the test with guarantees on the asymptotic significance level.
Our result also provides an approximate analytical formula for the power function for each test, as well as an exact expression for the slope of the test as defined by Vaart 24 (chapter 14). The slopes allow us to analytically calculate the asymptotic relative efficiency (ARE) of the three summary measures with respect to each other. We evaluate these ARE expressions empirically, showing that in the majority of the cases, TI is more efficient than the maximum and much more efficient than the average.
We demonstrate our theoretical results on the two‐sample test by fitting the Poisson‐Limit model to real data from a clinical trial and evaluating the power function for tests based on the three summary measures. The resulting plots complement those of Reference 21 which are obtained based on resampling the data, and confirm the superiority of TI in detecting the differences in toxicity profile among drugs.
The organization of the rest of the article is as follows: Section 2 provides background on the toxicity data and the summary measures. We then introduce the Poisson‐Limit model for such data. In Section 3, we derive mathematical properties of TI, including monotonicity with respect to the T‐rank. Deficiencies of competing summary measures in preserving the ranks are also discussed here. Section 4 develops our two‐sample test and presents analytical results on the power function and test slopes. The methodology is illustrated with simulations in Section 5 and conclude with a discussion in Section 7.
Notation. We use for the set of integers, for the set of nonnegative real numbers, and for the set of nonnegative integers. We write [K] = {1, … , K} and [K]∗ = {0, 1, … , K}. The indicator of a set A is denoted as 1{x ∈ A}, evaluating to 1 if x ∈ A and 0 otherwise.
2. DATA, MODELS, AND SUMMARY MEASURES
Assume that we have a collection of d AEs that we represent as {1, … , d}. For each subject in a given period of time (one cycle or multiple cycles combined), we observe a toxicity profile that can be viewed as a vector , where Yi represents the toxicity grade of AE i. Generally, we assume that there is a maximum possible grade K, so that we can assume , where we recall that [K]∗ = {0, 1 … , K}. We assume that the corresponding grades for different AEs are equivalent, that is, grade 1 in Y1 is equivalent to grade 1 in Y2, and so on. The toxicity index is an example of a summary measure that maps Y to a scalar grade. More precisely, let Y(1) ≥ Y(2) ≥ ⋯ ≥ Y(d) be the order statistic of Y. The toxicity index was originally defined as a function given by: 20
(1) |
Besides the toxicity index, there are other summary measures, which are often defined in terms of the frequency vector of the grades, that is, X∗ = (X0, X1, X2, … , XK) where . Since we are mainly interested in sparse toxicity profiles, level 0 has a special status. We will work with the reduced frequency vector X = (X1, … , XK) that only retains the frequency of nonzero levels. Let us write and note that X0 = d − X+, that is, there is no loss of information working with X instead of X∗.
Common summary measures can be stated in terms of the reduced frequency vector X. The following two examples are of particular interest:
- The mean index which can be represented as
(2) - The maximum index which is given by
where we interpret the maximum of the empty set as 0.(3)
Throughout, maximum index, mx, and max‐grade will be used interchangeably.
2.1. The Poisson‐Limit model
To study the statistical properties of the summary measures, we propose a model for the toxicity data. We start with a model, where each entry of Y is an i.i.d. draw from a categorical variable with levels in [K]∗. That is, {Yj} are i.i.d., with for r ∈ [K]∗ and j ∈ [d]. In the absence of prior knowledge about toxicities, the i.i.d. assumption is a reasonable first approximation. It follows that the frequency vector X∗ has a multinomial distribution with parameter d and p∗ = (p0, p1, … , pK). Formally, X∗ ∼ Mult(d, p∗).
Under the above model, we denote the distribution of the reduced frequency vector X, as Mult|0⟨(d, p) where p = (p1, … , pK). In other words, X ∼ Mult|0⟨(d, p) if
where and . We are interested in the cases where the toxicity profile is sparse, that is, many of {Yj} are often zero. In those cases, it is reasonable to assume that for r ∈ [K]. Letting , our model is equivalent to
(4) |
As shown in the Appendix (Section A0.1), X converges in distribution to a product of Poisson distributions, that is,
(5) |
We will use this limiting distribution in our statistical analysis. It is a good approximation for large sparse toxicity profiles and allows us to derive explicit analytical expressions for various statistical quantities of interest. We refer to (5) as the Poisson‐Limit model.
Remark 1
Even when the distributions of the AEs are allowed to be different, that is, with qjr potentially varying with j, we still have a Poisson‐Binomial distribution for each marginal (ie, the distribution of for a given r), which can be approximated by a Poisson distribution in the sparse case.
3. TOXICITY INDEX PRESERVES T‐rank
In this section, We derive some analytical properties of the toxicity index. We first show how the toxicity index can be computed based only on the reduced frequency vector (RFV) of the observations. We define a total ordering, referred to as the T‐rank, on the space of RFVs and show that the toxicity index preserves this ordering in a strict sense. A consequence of this result is that the toxicity index is a one‐to‐one mapping on the RFV space, hence there is no loss of information when summarizing the toxicity profile this way.
3.1. Closed‐form representation
Recall that represents the toxicity profile of a patient. Consider the truncated toxicity index:
which only considers toxicities of grades k or lower. We have where K is the maximum toxicity grade observed in Y.
Let be the reduced frequency vector of Y: for all k = 1, … , K. The following proposition provides a recursive formula for in terms of x:
Proposition 1
For any , let for k = 2, … , K and gK + 1(x) = 1 . We have, for k = 1, … , K ,
where x = x(Y) is the reduced frequency vector of Y and .
It follows from Proposition 1 that the toxicity index , defined in (1), has the following closed form in terms of the frequency vector: where
(6) |
Letting , we can write TI(·) alternatively as follows:
For K = 5, with x = (x1, x2, x3, x4, x5), one can also rewrite (6) as
(7) |
which could be faster for computations. In the sequel, we also refer to TI(·) as the toxicity index, based on the equivalence just established.
3.2. Monotonicity and T‐rank preservation
We now introduce a total order on the space of reduced frequency vectors (RFVs). In fact, the total order can be defined over all real‐valued K‐dimensional vectors:
Definition 1
(T‐order) For , we write x ≻ y or y ≺ x if x ≠ y and
In words, for any two vectors x and y that are not equal, we let i(x, y) be the largest grade (last position) on which they differ and say that x ≻ y if the coordinate of x at i(x, y) is larger than the corresponding coordinate of y. We write x ≽ y if either x = y or x ≻ y, and similarly for x ≼ y.
The T‐order, defined by ≽, is a total order on , that is, we can compare any pair of vectors in . To see this, it is enough to verify the transitivity property:
Lemma 1
(T‐order is transitive) If x ≻ y and y ≻ z , then x ≻ z.
Recall that the toxicity index can be written as a function TI(·) of the reduced frequency vector; see (6). The toxicity index TI(·) is (strictly) increasing on , with respect to the T‐order:
Theorem 1
Let K ≥ 2 . For any , if x ≻ y , then TI(x) > TI(y).
Since the T‐order is a total order, it provides a ranking of all elements of . We refer to this ranking as the T‐rank. Theorem 1 shows that the toxicity index preserves the T‐rank. Theorem 1 also shows that TI(·) is injective (ie, one‐to‐one) on . The significance of the T‐order, and the associated T‐rank, is that it is sensitive to the highest grade on which the two RFVs x and y differ. This is the desired way to compare RFVs derived from toxicity profiles, since the highest grade toxicities are the most important. The two other common summary measures, namely, mx(·) and avg(·) do not preserve the T‐rank:
The max‐grade mx(x) is a nondecreasing function w.r.t. the T‐rank but is not one‐to‐one, that is, we can have x ≻ y but g(x) = g(y). The average function avg(x) is neither monotone w.r.t. to the T‐rank nor one‐to‐one. For example, consider K = 2, and take x = (6, 2) and y = (2, 1). Then, x ≻ y but avg(x) = 1.25 < avg(y) = 1.33. It is also easy to find cases where x ≻ y and avg(x) > avg(y) or avg(x) = avg(y), for example, take x = (2, 2) or x = (4, 2), respectively. Figure 1 illustrates the TI vs mx and avg as functions of some sorted vectors based on T‐rank showing that TI is both monotone and one‐to‐one unlike the mx and avg and thus preserves ranking.
Intuitively, the T‐rank is a natural way to order patients according to their toxicity grades. We look at the largest grade first, and if the frequency of that grade is higher in one patient we rank their toxicity higher. Otherwise (ie, in case of equality), we look at the second largest grade and compare the frequencies there and so on. As an example, in the hypothetical dataset, discussed in Section 6.1, the RFVs of patients 7 and 8 are x = (0, 1, 3, 1, 0) and y = (0, 4, 1, 1, 0), respectively. These two match on grades 5 and 4 but not 3. Then, the largest grade that they differ on is grade 3, that is, i(x, y) = 3. Since x3 > y3, we conclude that patient 7 has a higher T‐rank than patient 8. This way we can rank all patients. The mx function when plotted against this order looks like a step function while TI always monotonically increases, as shown in Figure 1. The avg mixes the high and low grades which generally causes the overall grade to go down, hence misrepresenting the toxicity as determined by the T‐rank; this is shown in Figure 1 as average occasionally going down even when the toxicity rank goes up.
4. STATISTICAL THEORY
In this section, we consider the statistical problem of testing whether two populations of toxicity profiles are different. This is the central problem of pharmacology where one is interested in determining whether two or more drugs have the same toxicity effects in patients, or whether a certain drug is more toxic than the other. We present the problem under the Poisson‐Limit model of Section 2.1, and introduce a general two‐sample test that can be applied based on any scalar summary measure. We present an asymptotic analysis of these tests (Theorem 2) and compute the relevant parameters for the three summary measures: the average, the maximum, and the TI.
4.1. Two‐sample tests
Recall the Poisson‐Limit model of Section 2.1 for the reduced frequency vectors (RFVs). Assume that we observes RFV samples from two such populations:
(8) |
where and are vectors in . We recall that is the distribution with independent coordinates. Our goal is to test the null hypothesis . We consider the general statistic
for some mapping g from to . Let us define
(9) |
where and denote the expectation and variance, assuming that . We assume that and are finite for all . Consider the two‐sample test that rejects the null hypothesis if
(10) |
where is the following estimate
(11) |
The next theorem describes the asymptotic behavior of this test:
Theorem 2
Let and be generated from model ( 8 ), with some . Assume that n, m → ∞ such that and that is a continuous function. Then, the two‐sample test that rejects the null according to ( 10 ), has asymptotic level . Moreover:
- (a)
If , the test is asymptotically consistent (ie, the power converges to 1).
- (b)
If , the test is inconsistent and its power converges towhere . In particular, if in addition , the test is asymptotically powerless (ie, the power converges to ).
- (c)
Assume that Mg(·) is continuously differentiable at. For the shrinking alternative,
with
, the power of the test converges to
(12)
The quantity B2, with B given in (12), is the slope of the test in the direction . The slope plays a role in determining the Pitman asymptotic relative efficiency (ARE) of two test with respect to each other. In particular, the ratio of the slopes for two tests determines the asymptotic ratio of the samples sizes necessary for achieving the same power by two tests of a given size; see Reference 24, chapter 14 for details. The proof of Theorem 2 also gives the following asymptotic approximation to the power of the test:
(13) |
where , and . Simulations in Section 5 show that this approximate formula is quite accurate even for small n.
Remark 2
One can replace in (10) with any consistent estimate of the variance of the pooled sample (under null). Theorem 2 still holds for such a test. In particular, we can replace with the empirical variance of the pooled sample: . The resulting test will be the usual two‐sample t‐test and it enjoys the same asymptotic properties as those of (10). There could be some advantage in using vs for small sample sizes, but it is generally hard to quantify the difference since both estimates quickly approach the true variance under the null. The main utility of computing the functions and Mg(·) is that they enable us to obtain explicit expressions for the asymptotic power of the tests and the AREs.
To implement the test (10), one needs the variance function . To compute the slope, we additionally need the mean function Mg(·). Below we derive exact expressions for these quantities for the three summary statistics of Section 2, that is, g ∈ {TI, mx, avg}. To simplify the notation, let , , and . Note that is a probability distribution on [K]. Let us denote the first and second moments of this probability distribution as
(14) |
We also define the function
where Ei is the exponential integral and is the Euler‐Mascheroni constant.
Lemma 2
Under a Poisson model,
, we have
Corollary 1
Assume that and . Then, the test ( 10 ) based on g = avg is inconsistent. If in addition, , then the test is powerless.
It is known that for , hence as . This implies that for sufficiently large , the variance is nearly completely determined by . Thus equality of the first moments of and together with is enough for the mean index test to be almost powerless. Corollary 1, in fact, suggests that rather that the null hypothesis of , the test based on avg is appropriate for testing the following null:
(15) |
Remark 3
A counterintuitive consequence of Lemma 2 is that there are examples of rate vectors and , such that coordinatewise, but . See Section 6.1 for such an example.
Next, we consider the maximum index: For z = (z1, … , zK, zK + 1), with zK + 1 = 1, define
(16) |
See Section A0.3 for a derivation of the second equality.
Lemma 3
Under a Poisson model, , we have
where with .
Finally, we consider the TI index. Let ai = i/(i + 1) and bi = i(i + 2)/(i + 1)2, and define
A sum with lower limit higher than the upper limit evaluates to zero, e.g., . Let . We also define
(17) |
Lemma 4
Under a Poisson model,
, we have
(18)
(19)
4.2. Computing the slope
The slope of the test (10) is B2, where B is defined in (12). To compute the slope, we need the gradient of the mean function. Let us consider the case of mx. The mean function is of the form where and W is a K × K upper triangular matrix with ith row such that wij = 1 for j ≥ i. It follows that
where is applied coordinatewise to vector . The mean function for the TI statistic has a similar form: , where U = (uij) is an upper triangular matrix with uij = j/(j + 1) for j ≥ i. The rest of the calculations follow similarly. For the avg, the ith element of the gradient of the mean function is
We will use these expressions to empirically evaluate the slopes of the three summary measures, and compare their asymptotic relative efficiencies.
5. SIMULATIONS
We start by investigating the asymptotic relative efficiency (ARE) of TI w.r.t. the mx and avg summary measures. To do so, we consider the two‐sample Poisson‐Limit model (8) with K = 5 and generate the mean vectors, and , randomly as follows: is drawn from and is set equal to , with . We generate 104 such parameter pairs and for each model, evaluate the ratio of the test slopes for TI vs mx and avg as a measure of the ARE, that is, we calculate and .
Figure 2 illustrates the resulting histogram for the two slopes. Values above one indicate a higher efficiency for TI relative to the competing method. The plots show that, in the majority of cases, TI has a higher asymptotic efficiency compared to mx and avg. The histogram for has a heavier tail than which indicates that there are cases that TI achieves a much higher efficiency relative to avg (compared with mx). Under our sampling scheme, the probability that TI has a larger slope than the mx and the avg is 0.97 and 0.99, respectively.
Next, we consider specific values for and , namely and , and investigate the receiver operating characteristic (ROC) and the power curves for the three tests. The ROC is obtained by plotting the true positive rate (TPR) achievable for any given false positive rate (FPR). Note that TPR is an alternative name for the power; similarly, FPR is another name for the test size. The power curves are obtained by plotting the power of the test of size , against the sample size n. Figure 3 shows the ROC curves of the TI, mx and avg for n = 100 and also the asymptotic and simulated power plots for 1000 repetitions, illustrated with the dashed and solid lines, respectively. The values of and are chosen to be close to the setting of Corollary 1. In particular, we have , and for the avg test, , , . Corollary 1 explains the poor performance of the avg in this setting, matching what Figure 3 shows: The avg is nearly powerless for this testing problem.
We note the close match between the asymptotic power curves calculated based on (13) and the simulated curves. If we increase the number of repetitions further, the simulated power perfectly match the asymptotic one. In this example, TI outperforms the mx and the avg, in terms of power, for all false positive rates, and achieves a given power, at size , with a smaller sample size.
6. TRIAL APPLICATION
In this section, we first demonstrate how the two‐sample tests discussed in Section 4 can be implemented using a simple trial with hypothetical data. Then, we show the results of application to a real clinical trial.
6.1. Hypothetical data
Consider a simple hypothetical trial where n = m = 5 patients were assigned to each of two treatment groups, with toxicity data given in Table 1. The data were randomly generated from Poisson‐Limit models with and and the patients are ordered in decreasing T‐rank. Note that the first drug is more toxic on average, in the sense that coordinatewise, that is, the mean of vector of one distribution dominates the other in every coordinate. The pooled rate estimate is obtained by taking the average of each grade column in Table 1, giving . The table also shows the TI and mx for each patient. Averaging over the TI column, we obtain mean TI values of 4.996 and 4.210 for the two groups. Hence, STI = 4.996 − 4.210 = 0.786. The standard error (ie, the estimated standard deviation) of this statistic is
using the formula in Lemma 4 to calculate . At level , the test based on TI rejects the equality of treatments since . Similar calculations can be performed for the tests based on mx and avg, as shown in Table 2. Both tests fail to reject the equality of treatments since Smx/se(Smx) = 1.89 and Savg/se(Savg) = 1.07 are both less than .
TABLE 1.
Treatment | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | TI | mx | avg | |
---|---|---|---|---|---|---|---|---|---|
Patient 1 | 1 | 0 | 3 | 0 | 7 | 0 | 5 | 4 | 3.40 |
Patient 2 | 1 | 0 | 5 | 1 | 6 | 0 | 5 | 4 | 3.08 |
Patient 3 | 1 | 0 | 8 | 1 | 4 | 0 | 5 | 4 | 2.69 |
Patient 4 | 1 | 0 | 4 | 3 | 3 | 0 | 4.99 | 4 | 2.90 |
Patient 5 | 1 | 0 | 4 | 2 | 3 | 0 | 4.99 | 4 | 2.89 |
Patient 6 | 2 | 0 | 1 | 0 | 3 | 0 | 4.98 | 4 | 3.50 |
Patient 7 | 2 | 0 | 1 | 3 | 1 | 0 | 4.79 | 4 | 3.00 |
Patient 8 | 2 | 0 | 4 | 1 | 1 | 0 | 4.75 | 4 | 2.50 |
Patient 9 | 2 | 0 | 2 | 0 | 1 | 0 | 4.53 | 4 | 2.67 |
Patient 10 | 2 | 0 | 1 | 0 | 0 | 0 | 2 | 2 | 2.00 |
TABLE 2.
Treatment 1 | Treatment 2 | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Mean |
|
|
Mean |
|
|
Sg |
|
|
se(Sg) |
|
Power | ||||||||
TI | 4.996 | 4.972 | 0.02 | 4.210 | 4.337 | 1.088 | 0.786 | 0.635 | 1.108 | 0.251 | 3.131 | 0.877 | |||||||
mx | 4 | 3.991 | 0.0142 | 3.6 | 3.634 | 0.698 | 0.4 | 0.357 | 0.7122 | 0.212 | 1.887 | 0.800 | |||||||
avg | 2.992 | 2.9999 | 0.0923 | 2.734 | 3.0479 | 0.542 | 0.258 | ‐0.048 | 0.6343 | 0.233 | 1.107 | 0.295 |
Note, from Table 2, that the standard error of the two statistics STI and Smx are roughly the same in this case. However, the difference in group means is much larger for TI relative to mx (|STI|≫ |Smx|) due to patients in group 1 having many more grade 4 toxicities relative to group 2. The TI takes the frequency of grade 4 toxicities into account while the mx ignores it.
We can also use formula (13) to approximate the power of the tests. The true means are and using the formula from Lemma 4. Considering that the sample size is small, these values are very close to the empirical means computed from the data. The corresponding true variances are and . Note how small group 1 variance is relative to group 2. The pooled normalized variance (under the alternative) is (since ) and the mean difference is . Plugging‐in these values into (13) we obtain an approximate power of 0.877.
Similarly, for the mx, note how close the true means are to the empirical ones (cf Table 2). The fact that for treatment 1, is quite large—hence most patients show at least one grade 4 toxicity—is reflected in the small value of the variance . That is, for the majority of patients in group 1, mx will be 4, hence its mean will be concentrated near 4. In this case, we have and giving an approximate power of 0.800.
For the avg, group 1 variance is higher than both TI and mx, though still quite small relative to group 2. The mean difference is very small in this case, . The negative sign indicates that treatment 1 is less toxic on average than treatment 2 (in expectation), which is quite counter‐intuitive since treatment 1 is more toxic in terms of the extreme grades. Even more surprisingly, we have despite coordinatewise. That is, the frequency of all the grades are higher on average for treatment 1, but its expected avg is lower. This can be confirmed empirically, for large n, by observing that Savg turns out to be negative most of the time. It can be explained by noting that being much larger than biases the avg for group 1 toward lower grades. This shows a clear deficiency of the avg in ordering toxicities. Plugging‐in and into (13), we obtain an approximate power of 0.295.
6.2. Application to clinical trial data
We use the data from the NSABP R0‐4 rectal cancer clinical trial as a case study for comparing the power of the three summary statistics (TI, mx, and avg) in detecting treatment differences. NSABP R0‐4 was a phase 3 trial conducted between July 2004 and August 2013 (NCT00058474). Eligible patients were diagnosed with surgically resectable stage II or III rectal adenocarcinoma. Patients were assigned to four different treatments: (1) infusional 5‐fluorouracil (5FU); (2) oral capecitabine (CAPE); (3) combination of 5FU and oxaliplatin (OX); (4) combination of CAPE and Ox. The trial included 1608 participants. From these, 50 patients were ineligible or had missing data. The analytic sample consisted of 1558 individuals. CONSORT diagram is available in Reference 21. Main study description is available in Reference 23.
We use the power analysis to compare the performance of the three methods in detecting treatment differences. We first fit a Poisson‐Limit model (see (5)) to each treatment data which amounts to estimating its rate vector using the sample mean of the corresponding RFVs; see for example (11), but applied only to a single sample. Once the model for each treatment is specified, we can compare the methods based on their power in differentiating pairs of treatments. For each method, we plot the power vs the sample size. These plots can be obtained by simulating from the fitted models (a form of parametric bootstrap) or by using the asymptotic power formula (13) which, as was shown in the previous section, gives accurate results even for small sample sizes.
Figure 4 shows the power plots for testing four pairs of treatments that are known to be different. All tests are performed at 5% significance level. The plots show that TI has a greater power in detecting differences between treatments in all the four cases. That is, using TI, the required number of patients to detect treatment differences is smaller. We note that Reference 21 performs a similar power analysis on the same dataset. They provide no model for the data and instead use bootstrap to perform the power analysis. The tests they consider are based on Wilcoxson signed‐rank statistic. In contrast, we provide a model for the data and use parametric bootstrap (or exact asymptotic power curves) to perform the power analysis. Our test statistics are also different and based on the difference‐in‐mean statistics. Compared with the results of Reference 21, Figure 4 shows that the test statistics considered here achieve a given power with a smaller sample size. We observe a similar relative ranking of TI, average and max‐grade as that observed by Gresham et al. 21
7. DISCUSSION
In this article, we studied the mathematical and statistical properties of the toxicity index. We showed that the TI can be stated solely as a function of the reduced frequency vector, with a tractable closed‐form formula. We introduced T‐rank preservation as a desirable property that allows investigators to achieve clinically‐meaningful ranking of the toxicity profiles. We showed that compared with competing summary measures (ie, the max‐grade and the average‐grade), TI is the only measure that preserves the T‐rank and is an injective (ie, one‐to‐one) mapping on the space of toxicity frequencies. Neither max‐grade, nor the average‐grade are injective or T‐rank preserving. The max‐grade loses information in the toxicity profile by only looking at the highest grade. The average‐grade loses the information about the extreme toxicities by equally weighting low and high grades.
To statistically compare various toxicity measures, we proposed a Poisson‐Limit model for modeling sparse toxicity data via their reduced (ie, 0‐removed) frequency vectors. Under this model, we developed a general two‐sample test for detecting differences among two population of toxicity data. We derived formulas for the asymptotic power of the test, for the three toxicity measures (TI, max‐grade, and average‐grade) and calculated their asymptotic relative efficiencies (AREs). We empirically demonstrated that TI has a higher ARE that the other two summary measures, with high probability, under a random Poisson‐Limit model. We also illustrated that TI achieves a higher power in detecting treatment differences in a cancer trial, validating the empirical results of Reference 21. The framework we developed in this article allows investigators to compare toxicity‐summary methods in any clinical trial, and to analytically evaluate new proposals for toxicity summaries.
The TI can be generalized to accommodate noninteger toxicity grades if the clinician believes that a toxicity has a different impact than the observed grade. 11 , 25 Furthermore, the TI can be used in any risk assessment application with a grade system similar to CTCAE, provided that the grades for different inputs (eg, items) are equivalent. In Reference 26, the TI was used to score surgical complications based on the Clavien‐Dindo system. 27 More broadly, the analysis in this article applies to any application where integer scores on multiple events are recorded for a collection of subjects, assuming that the resulting score array is sparse. We are currently investigating the benefits of using TI for patient reported outcomes (PROs) data, 28 extending the TI for multiple cycles and searching for other T‐rank preserving summary measures.
ACKNOWLEDGEMENTS
The authors would like to thank Patricia A. Ganz and Ron D. Hays for helpful comments and discussions. They also thank Michael Luu for his help with plotting Figures 1 and 4. This work was supported in part by the National Cancer Institute of the National Institutes of Health (NIH) (1U01CA232859‐01) (Márcio A. Diniz, Zahra S. Razaee, Mourad Tighiouart, Greg Yothers, André Rogatko); National Cancer Institute (R01 CA188480‐01A1) (Mourad Tighiouart, André Rogatko); NIH National Center for Advancing Translational Science (NCATS) UCLA CTSI (UL1 TR001881‐01) (Márcio A. Diniz, Mourad Tighiouart, André Rogatko) and National Cancer Institute (U10CA180868 [NCTN], UG1CA189867 [NCORP], U10CA180822 [NRG Oncology SDMC]) that supported trial conduct and data collection. Open access funding enabled and organized by Projekt DEAL.
1.
A.1. Proof of Equation (5)
Recall that X∗ = (X0, X1 … , Xd) ∼ Mult(d, p∗) where p∗ = (p0, p1 … , pK). It is well known that the moment‐generating function (MGF) of X∗ is given by . Setting t0 = 0 and noting that and for i = 1, … , K, we obtain the MGF of X,
We have
where is the MGF of a Poisson random variable with mean . The proof is complete.
A.2. Proofs of Section 3
Proof of Proposition 1
Assume that Yi = k for rk ≤ i ≤ rk − 1 for k = 1, … , K with 1 = rK ≤ ⋯ ≤ r1 ≤ r0 = n. Note that xk = rk − 1 − rk + 1. For i ∈ ⟦rk, rk − 1⟧, we have
Then,
which is the desired result.
Proof of Lemma 1
Assume x ≻ y and y ≻ z. By assumption, x ≠ y and y ≠ z. We consider three cases:
- (i)
i(x, y) = i(y, z) = : ℓ. Then, it is not hard to argue that i(x, z) = ℓ and since xℓ > yℓ > zℓ, we have x ≻ z.
- (ii)
ℓ1: = i(x, y) > i(y, z). Then, i(x, z) = ℓ1 and . Hence, x ≻ z.
- (iii)
i(x, y) < i(y, z) = : ℓ0. Then, i(x, z) = ℓ0 and . Hence, x ≻ z.
The proof is complete.
Proof of Theorem 1
Assume K ≥ 2. Consider the following two vectors
If is enough to show that TI(x) ≥ TI(y). In fact, we show equality. To simplify, let us write u = (1 + K)−m. Then,
Note that all the lower terms are zero in the expansion. On the other hand, we have and gk(y) = ∞ for k < K. It follows that
Then,
The proof for the general case follows by considering the largest grade on which x and y differ, which without loss of generality (WLOG) we assume to be K. Assume WLOG that xK > yK. Then, setting xj = 0 for all j ≠ K only decreases TI(x). Similarly, increasing yK to xK − 1 and increasing all yj = ∞, only increases TI(y). We have shown that even in this worst case TI(x) ≥ TI(y). So all the other cases follow.
A.3. Proofs of Section 4
Proof of Theorem 2
For simplicity, and without loss of generality, assume that . Let and so that . Then, by the central limit theorem
Subtracting, we get
Under the null distribution, , and the test that rejects when
has asymptotic level . Since and is continuous, by the continuous mapping theorem, the same result holds with replaced with . Let , and . It follows that for large n, the power is close to
(A1) More precisely, if , then the power of the test converges to which is the desired result for part (a). If in addition , then , hence the power converges to .
For part (c), we note that by the Taylor expansion , and by the continuity of . Hence, , and . The result then follows by considering the asymptotic expression (A1).
Proof of Lemma 2
Under the Poisson model, where . Assuming m ≠ 0, we have
We obtain . Since , by smoothing,
which is the desired expression. Now, by the law of total variance
We have where . On the other hand, assuming m ≠ 0, conditioned on X+ = m, we can represent where drawn i.i.d. Note that is categorical variable taking values 1, … , K with probabilities . Hence,
where is the variance of . Hence,
Since , we have (cf Reference 29)
The proof is complete.
Proof of Lemma 3
Let Br = 1{Xr > 0} and . Then, and we have
if k ≠ 0 which gives the desired result.
Proof of Lemma 4
Assume that and let gk = gk(X). Then,
Recall that if , then . Thus,
By the independence of the coordinates, . Equation (18) follows. For the variance, we note that
where the final equality is by independence and the definition of . We have
Noting that , Equation (19) follows.
We have
Recalling that zK + 1 = 1 finishes the proof.
Razaee ZS, Amini AA, Diniz MA, Tighiouart M, Yothers G, Rogatko A. On the properties of the toxicity index and its statistical efficiency. Statistics in Medicine. 2021;40:1535–1552. 10.1002/sim.8858
Funding information National Cancer Institute, 1U01CA232859‐ 01; R01 CA188480‐01A1; National Center for Advancing Translational Sciences, UL1 TR001881‐01
References
- 1. Miller AB, Hoogstraten BFAU, Staquet MFAU, Winkler A. Reporting results of cancer treatment. Cancer. 1981;47(1):207‐214. [DOI] [PubMed] [Google Scholar]
- 2. Trotti A, Colevas AD, Setser A, Basch E. Patient‐reported outcomes and the evolution of adverse event reporting in oncology. J Clin Oncol. 2007;25(32):5121‐5127. [DOI] [PubMed] [Google Scholar]
- 3.CTCAE Common Terminology Criteria for Adverse Events (CTCAE), Version 5.
- 4. Trotti A, Colevas AD, Setser A, et al. CTCAE v3.0: development of a comprehensive grading system for the adverse effects of cancer treatment. Seminars in Radiation Oncology. 2003;13(3):176‐181. [DOI] [PubMed] [Google Scholar]
- 5. Thanarajasingam G, Hubbard JM, Sloan JA, Grothey A. The imperative for a new approach to toxicity analysis in oncology clinical trials. J National Cancer Inst. 2015;107(10):djv216. [DOI] [PubMed] [Google Scholar]
- 6. Singer DS, Jacks T, Jaffee E. A US “Cancer Moonshot” to accelerate cancer research. Science. 2016;353(6304):1105‐1106. [DOI] [PubMed] [Google Scholar]
- 7. Forastiere AA, Goepfert H, Maor M, et al. Concurrent chemotherapy and radiotherapy for organ preservation in advanced laryngeal cancer. N Engl J Med. 2003;349(22):2091‐2098. [DOI] [PubMed] [Google Scholar]
- 8. Adelstein DJ, Li Y, Adams GL, et al. An intergroup phase III comparison of standard radiation therapy and two schedules of concurrent chemoradiotherapy in patients with unresectable squamous cell head and neck cancer. J Clin Oncol. 2003;21(1):92‐98. [DOI] [PubMed] [Google Scholar]
- 9. Baselga J, Trigo JM, Bourhis J, et al. Phase II multicenter study of the antiepidermal growth factor receptor monoclonal antibody cetuximab in combination with platinum‐based chemotherapy in patients with platinum‐refractory metastatic and/or recurrent squamous cell carcinoma of the head and neck. J Clin Oncol. 2005;23(24):5568‐5577. [DOI] [PubMed] [Google Scholar]
- 10. Trotti A, Pajak TF, Gwede CK, et al. TAME: development of a new method for summarising adverse events of cancer treatment by the radiation therapy oncology group. Lancet Oncol. 2007;8(7):613‐624. [DOI] [PubMed] [Google Scholar]
- 11. Lee SM, Hershman DL, Martin P, Leonard JP, Cheung YK. Toxicity burden score: a novel approach to summarize multiple toxic effects. Ann Oncol. 2011;23(2):537‐541. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Thanarajasingam G, Atherton PJ, Novotny PJ, Loprinzi CL, Sloan JA, Grothey A. Longitudinal adverse event assessment in oncology clinical trials: the Toxicity over Time (ToxT) analysis of alliance trials NCCTG N9741 and 979254. Lancet Oncol. 2016;17(5):663‐670. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Gordon NH, Willson JK. Using toxicity grades in the design and analysis of cancer phase I clinical trials. Stat Med. 1992;11(16):2063‐2075. [DOI] [PubMed] [Google Scholar]
- 14. Wang C, Chen TT, Tyan I. Designs for phase I cancer clinical trials with differentiation of graded toxicity. Commun Stat. 2000;29(5‐6):975‐987. [Google Scholar]
- 15. Bekele BN, Thall PF. Dose‐finding based on multiple toxicities in a soft tissue sarcoma trial. J Am Stat Assoc. 2004;99(465):26‐35. [Google Scholar]
- 16. Van Meter EM, Garrett‐Mayer E, Bandyopadhyay D. Proportional odds model for dosefinding clinical trial designs with ordinal toxicity grading. Stat Med. 2011;30(17):2070‐2080. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Chen Z, Tighiouart M, Kowalski J. Dose escalation with overdose control using a quasi‐continuous toxicity score in cancer phase I clinical trials. Contemp Clin Trials. 2012;33(5):949‐958. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Tighiouart M, Cook‐Wiens G, Rogatko A. Escalation with overdose control using ordinal toxicity grades for cancer phase I clinical trials. J Probab Stat. 2012;2012:1‐18. [Google Scholar]
- 19. Gelber RD, Goldhirsch A, Cole BF, Wieand HS, Schroeder G, Krook JE. A quality‐adjusted time without symptoms or toxicity (Q‐TWiST) analysis of adjuvant radiation therapy and chemotherapy for resectable rectal cancer. J Natl Cancer Inst. 1996;88(15):1039‐1045. [DOI] [PubMed] [Google Scholar]
- 20. Rogatko A, Babb JS, Wang H, Slifker MJ, Hudes GR. Patient characteristics compete with dose as predictors of acute treatment toxicity in early phase clinical trials. Clin Cancer Res. 2004;10(14):4645‐4651. [DOI] [PubMed] [Google Scholar]
- 21. Gresham G, Diniz MA, Razaee ZS, et al. Evaluating treatment tolerability in cancer clinical trials using the toxicity index. JNCI J National Cancer Inst. 2020;112(12):1266‐1274. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Russell MM, Ganz PA, Lopa S, et al. Comparative effectiveness of sphincter‐sparing surgery versus abdominoperineal resection in rectal cancer: patient‐reported outcomes in national surgical adjuvant breast and bowel project randomized trial R‐04. Ann Surg. 2015;261(1):144. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Allegra CJ, Yothers G, O'Connell MJ, et al. Neoadjuvant 5‐FU or capecitabine plus radiation with or without oxaliplatin in rectal cancer patients: a phase III randomized clinical trial. J National Cancer Inst. 2015;107(11):djv248. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Vaart AW. Asymptotic Statistics. Cambridge, MA: Cambridge University Press; 2000. [Google Scholar]
- 25. Yuan Z, Chappell R, Bailey H. The continual reassessment method for multiple toxicity grades: a Bayesian quasi‐likelihood approach. Biometrics. 2007;63(1):173‐179. [DOI] [PubMed] [Google Scholar]
- 26. Anger JT, Mueller ER, Tarnay C, et al. Robotic compared with laparoscopic sacrocolpopexy: a randomized controlled trial. Obstet Gynecol. 2014;123(1):5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Dindo D, Demartines N, Clavien P‐A. Classification of surgical complications: a new proposal with evaluation in a cohort of 6336 patients and results of a survey. Ann Surg. 2004;240(2):205. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Nayfield SG, Ganz PA, Moinpour CM, Cella DF, Hailey BJ. Report from a National Cancer Institute (USA) workshop on quality of life assessment in cancer clinical trials. Qual Life Res. 1992;1(3):203‐210. [DOI] [PubMed] [Google Scholar]
- 29. Audenaert KMR. Inverse moments of univariate discrete distributions via the Poisson expansion; 2008. arXiv preprint arXiv:0809.4155.