Author manuscript; available in PMC: 2021 Jul 6.
Published in final edited form as: Stat Med. 2020 Feb 12;39(10):1541–1557. doi: 10.1002/sim.8495

A hierarchical testing approach for detecting safety signals in clinical trials

Xianming Tan 1, Bingshu E Chen 2, Jianping Sun 3, Tejendra Patel 4, Joseph G Ibrahim 1
PMCID: PMC8258607  NIHMSID: NIHMS1713893  PMID: 32050050

Abstract

Detecting safety signals in clinical trial safety data is known to be challenging due to high dimensionality, rare occurrence, weak signals, and complex dependence. We propose a new hierarchical testing approach for analyzing safety data from a typical randomized clinical trial. This approach accounts for the hierarchical structure of adverse events (AEs), that is, AEs are categorized by system organ class (SOC). Our approach contains two steps: the first step tests, for each SOC, whether any AEs within this SOC are differently distributed between treatment arms; the second step identifies signal AEs from SOCs passing the first-step tests. Through simulation studies, we show that the new approach is superior to currently available approaches in terms of power to detect safety signals while controlling the false discovery rate. We also demonstrate this approach with two real data examples.

Keywords: clinical trials, drug safety, hierarchical testing, MiST, multiplicity, signal detection

1 |. INTRODUCTION

Drug safety has been a public concern since the Elixir Sulfanilamide disaster in the late 1930s, which led to the Federal Food, Drug, and Cosmetic Act (FD&C Act) (1938), and then to the Kefauver-Harris Drug Amendments (1962) in response to the Thalidomide tragedy. The past decades have seen numerous regulatory guidance documents issued to enhance drug safety evaluation by national and international organizations such as the International Conference on Harmonisation,1 the Council for International Organizations of Medical Sciences,2 the Food and Drug Administration (FDA),3 and the European Commission.4

Drug safety, also known as pharmacovigilance, concerns the collection, detection, assessment, monitoring, and prevention of adverse effects of pharmaceutical products. According to 21 CFR 312.32(a), an adverse event (AE, also called an adverse experience) “means any untoward medical occurrence associated with the use of a drug in humans, whether or not considered drug related,” and can be “any unfavorable and unintended sign (eg, an abnormal laboratory finding), symptom, or disease temporally associated with the use of a drug, and does not imply any judgment about causality.”

In the current practice of drug development, the assessment of the safety of a new drug begins at the preapproval stage with preclinical animal studies and with early phase I studies that examine the absorption, excretion, dose ranging, tolerance, and other pharmacokinetic properties of the drug in humans. The occurrence, severity, and duration of patient AEs are also routinely recorded during phases II and III of clinical development. During postapproval marketing, much broader patient populations become involved, and safety information may be obtained from voluntary reports, monitoring systems, uncontrolled patient follow-up, and formal epidemiological studies.

As described by Xia et al,5 AEs could be classified into three tiers:

  1. Tier 1 AEs are those thought to be caused by the drug and are specifically tested in the trial.

  2. Tier 2 AEs are those routinely collected in clinical trials but about which no specific hypotheses have been formulated in advance. There are typically many types of Tier 2 AEs.

  3. Tier 3 AEs are like Tier 2 AEs but are “rare” events for which “medical judgements usually prevails.”

In this article, however, we consider analysis of the entire AE dataset without differentiating the three tiers because, in most trials, the set of Tier 1 AEs is empty (efficacy has been the dominant primary endpoint), and the distinction between Tier 2 and Tier 3 AEs is not always clear or necessary. Clinical trial AE data are typically coded in the Medical Dictionary for Regulatory Activities (MedDRA) and represented at two levels: system organ class (SOC) and preferred term (PT). There are about 22 000 AEs (at the PT level) classified into 27 SOCs.6

Also, we focus on analyzing AE data from randomized premarket trials because (a) more comprehensive safety data become available at this stage and (b) the timing of analyzing AEs is relevant. The former is self-evident. As for the latter, the changing environment has drawn increasing attention to safety assessment in preapproval trials. The past 20 years have witnessed some high-profile product withdrawals, and safety has been the predominant reason. For example, in 2004, the FDA pulled Rofecoxib (Vioxx), a nonsteroidal anti-inflammatory drug, from the market due to an increased risk of cardiovascular events.7,8 In this case, Merck set aside $4.85 billion to pay for legal costs, compared with the $2.5 billion in sales for Vioxx in 2003. The sponsors of pharmaceutical products are thus highly motivated to identify safety concerns as early as possible to avoid disastrous losses. It is therefore highly desirable to be able to detect safety signals in preapproval trials.

One important goal of analyzing safety data is signal detection, that is, identifying possible risks of a study drug for further investigation. Unfortunately, typical statistical analysis of AEs continues to be descriptive tables, which may run 100 pages or longer, rely on ocular examination, and fall far short of the goal of signal detection. The primary statistical challenges for detecting adverse drug reactions, as summarized by Xia and Jiang9 and CIOMS 2005,2 include but are not limited to:

  • Rare events: Most types of AE are rare for several possible reasons: (a) premarketing trials are designed based on efficacy endpoints, and the study period may not be long enough to observe safety endpoints; and (b) drugs proceeding to Phase III trials have already been scrutinized for safety in Phase I and II trials (eg, about 30% of new drugs fail to reach Phase III due to safety issues).

  • High dimensionality: The number of AE types in many late-phase clinical trials can be very large (eg, in the hundreds or even thousands), and it is difficult to prespecify hypotheses for specific safety events because most AE types are unexpected. This causes a multiplicity issue that could lead to, on the one hand, an excessive number of false-positive signals without adjustment for multiplicity and, on the other hand, an excessive rate of false-negative findings with rigorous multiplicity adjustment.

  • Medical classification: The grouping of AEs into categories might help reduce dimensionality and increase by-category AE rates. However, how to define groups of related AEs represents a statistical and clinical challenge, and there is no consensus so far.

  • Complexity: AEs are complexly interrelated and how to evaluate the multidimensional safety information as a whole poses a serious statistical challenge.

The analysis of safety data from clinical trials thus offers unique methodological opportunities. Specifically, the combination of high dimensionality and rare events is among the most challenging problems that statisticians encounter in safety data analysis. Many efforts have been invested in tackling this issue. Mehrotra and Heyse10 proposed a “double false discovery rate” (original DFDR) approach that first discards AEs whose observed incidence rates are too low to reach the usual significance level of 0.05, then groups the remaining AEs into SOCs, and applies FDR control procedures to adjust p-values across SOCs (taking the minimal p-value for each SOC) and within each SOC. Mehrotra and Heyse10 also presented reasons for focusing on control of the FDR instead of the family-wise error rate. The DFDR approach is a frequentist approach. A simplified but improved version of DFDR (new DFDR) proposed by Mehrotra and Adewale11 was shown, via simulation studies, to be superior to several multiplicity adjustment approaches, including (a) the Bonferroni correction, (b) the Benjamini-Hochberg (BH) procedure,12 and (c) group BH.13 Of note, most of these FDR control approaches originated in areas like statistical genetics to deal with large numbers of simultaneous tests, as did the graphical approach, the volcano plot,14 which has frequently been used to aid ocular examination of lengthy descriptive tables of safety data.

Several Bayesian methods have also been proposed for analyzing AE data. Berry and Berry15 applied the Bayesian method to AE data by constructing a hierarchical Bayesian mixture model for binary outcomes. This approach explicitly models the existing MedDRA coding structure of AEs so that strength can be borrowed within and across SOCs, under the assumption that AEs within an SOC are more similar than those in different SOCs. Xia et al16 extended this work by accounting for varying exposure or follow-up times among subjects under a Poisson likelihood. DuMouchel17 introduced a multivariate Bayesian logistic regression (MBLR), which can be viewed as a generalization of Berry and Berry15 that enables a search for vulnerable subgroups. In part due to the unique feature of borrowing strength within SOCs, Bayesian approaches are claimed to hold more promise than frequentist approaches, as Chi et al18 commented: “Safety assessment is one area where frequentist strategies have been less applicable. Perhaps Bayesian approaches in this area have more promise.”

However, borrowing strength from AEs within the same SOC is not necessarily an exclusive feature of Bayesian approaches, as we will show in this article. More specifically, we propose a frequentist approach, which we expect to complement the dominant Bayesian approaches, to test whether a new drug leads to a safety profile different from that of the control. We focus on analysis of the entire AE dataset collected in typical randomized trials, and our approach aims at more efficiently detecting safety signals by simultaneously dealing with the high dimensionality and rareness of safety issues arising in late-stage premarketing trials.

The rest of this article is organized as follows. Section 2 describes how we formulate the safety signal detection problem and how we connect it with a genome-wide association study (GWAS) problem. Section 3 reports a simulation study demonstrating that the FDR of the new approach is well controlled and that its power is satisfactory compared with current practice. Section 4 presents the results of two case studies. Finally, Section 5 summarizes our approach and suggests directions for future work.

2 |. METHODS

We denote the AEs for the ith (i = 1, 2, …, N) subject from a trial as yi = (yi1, …, yiD). We assume yid ∈ {0, 1}, d = 1, 2, …, D, with yid = 1 indicating the occurrence of the dth AE for subject i. We denote the treatment received by the ith subject as Ti, with Ti ∈ {0 = Control, 1 = Treatment}. For randomized trials, the treatment variable is expected to be orthogonal to baseline covariates. We thus ignore associated covariates, such as patient age and disease stage, to emphasize that our primary goal is to detect treatment-related safety signals, not a more ambitious goal such as identifying vulnerable subgroups.

In a typical randomized cancer clinical trial, the dimension of AEs, D, could be very high. These outcomes, however, could be grouped by SOCs, and within each SOC, the number of AEs might range from several to tens. Also, it is common that the frequency of any given outcome could be very low.9

We are interested in the following question: does receiving the study treatment lead to a different safety profile than receiving a control/placebo treatment? Denote pd1 = Pr{yd = 1 | T = 1} and pd0 = Pr{yd = 1 | T = 0}; this question can be formulated as a multiple hypothesis testing problem*:

H0d : pd1 = pd0 vs H1d : pd1 ≠ pd0, for d ∈ {1, 2, …, D}.

Most frequentist approaches in the literature examine each of the D tests separately and then report adjusted p-values using certain multiplicity adjustment procedures. This approach is generally unsuccessful (it has very low power) given the low frequency and high dimensionality of the safety outcomes. Novel frequentist approaches that account for these challenges are desired.
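For concreteness, the BH step-up adjustment that underlies several of the procedures discussed in this article can be sketched in a few lines of Python (a minimal illustration of the generic procedure; the function name and the p-values are ours, not from any package mentioned here):

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg step-up adjusted p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adj[i] = running_min
    return adj

# Flag AEs whose adjusted p-value is at or below the target FDR level.
pvals = [0.001, 0.02, 0.04, 0.30, 0.75]
flagged = [j for j, q in enumerate(bh_adjust(pvals)) if q <= 0.05]
```

Applied to all D per-AE p-values at once, this is the "one-step" strategy whose power degrades as D grows.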

Noticing that AEs are categorized by SOC, we propose a hierarchical testing approach.19 We first divide the above D hypotheses into their corresponding SOCs. We then select SOCs with evidence of true discoveries and test AE-level hypotheses within such SOCs. To implement this approach, two questions must be answered:

  • How to test whether an SOC contains true discoveries?

  • How to identify signal AEs (true discoveries) within such SOCs?

In the following subsections, we describe our solutions step by step.

2.1 |. SOC level test

Given an SOC, the question is whether this SOC contains true discoveries. If an SOC contains no true discovery, we would expect the AEs in this SOC to be evenly distributed between treatment arms. This implies that the safety profile (restricted to this SOC) cannot “discriminate” treatment. This motivates the following SOC-level testing procedure.

Notice that we are comparing safety profiles between treatment groups. This is equivalent to examining whether the two conditional distributions of AEs are equal:

f(y | T = 1) = f(y | T = 0),

where f(y | T) denotes the conditional distribution of y given T, the treatment assignment. This is equivalent to checking whether treatment allocation T and AEs y are independent, that is, whether

f(y,T)=f(y)×f(T).

For simplicity but without ambiguity, here and in the following, we slightly abuse the notation f(·) to let it denote different conditional or unconditional distributions. Noting that f(y, T) = f(y) × f(T | y), testing independence between T and y is equivalent to testing whether

H0 : f(T) = f(T | y). (1)

There could be different ways to test this independence hypothesis. Our idea is to consider this problem in a reversed way. More specifically, we assume f(T|y) can be approximated by the following logistic regression model:

logit Pr{T = 1 | y} = α0 + β1y1 + ⋯ + βdyd.

Here we use d instead of D to indicate that we consider a subset (ie, within an SOC) of all AEs. We will examine SOCs one by one and then adjust for multiplicity. Given an SOC, if there is no safety profile difference caused by treatment, we should expect

H0 : β1 = ⋯ = βd = 0.

When d is large and most AEs have low frequency or show minor differences in frequency between treatment arms, testing the above composite hypothesis still suffers from insufficient power. For example, a likelihood ratio test has a large number of degrees of freedom (d), which leads to decreased power. To deal with this challenge, we follow the idea of a rare variant association study approach20 and consider combining AEs from the same SOC to increase power. Specifically, we assume that

βj ~ i.i.d. N(π, τ2). (2)

The above null hypothesis is then equivalent to:

H0 : π = 0 and τ2 = 0.

Two special cases might also be of interest: (a) assume τ2 = 0 and only check π = 0; and (b) assume π = 0 and only examine τ2 = 0. One can show that the first special case is equivalent to comparing the sum of these safety issues between the two treatment groups, which is called the “burden test”21 in the statistical genetics literature. The second special case is related to a model that leads to the “SKAT” test proposed by Wu et al.22

2.2 |. Test statistics

For individual i, the full model assumes that

logit Pr(Ti = 1 | yi) = α0 + ∑j=1d βjyij.

Let βj = π + δj, where δj ~ i.i.d. N(0, τ2). The null hypothesis is H0 : π = 0 and τ2 = 0, which can be divided into two tests: (a) H0 : τ2 = 0, and (b) H0 : π = 0 | τ2 = 0.

Define T = (T1, …, TN)ᵀ as the N-dimensional vector of treatment assignments, Y as the N × d matrix with rows y1ᵀ, …, yNᵀ, and 1d as a d-dimensional vector with all elements equal to one. The score test statistic for testing τ2 = 0 is

Sτ2 = (T − μ̂)ᵀYYᵀ(T − μ̂),

where μ̂ = (μ̂1, …, μ̂N)ᵀ with μ̂i = logit−1{α̂0 + (yiᵀ1d)π̂}, and (α̂0, π̂) are obtained under the null model with τ2 = 0.

Define σ̂i2 = μ̂i(1 − μ̂i) for i = 1, …, N, and D̂ = diag{σ̂12, …, σ̂N2}. Let P̂ = D̂ − D̂M(MᵀD̂M)−1MᵀD̂, with M = (1N, Y1d) an N × 2 matrix. Under H0 : τ2 = 0, we have Sτ2 ~ ∑t=1s λtχ1,t2, where the χ1,t2, t = 1, 2, …, s, are independent χ2(1) random variables and λ1 ≥ ⋯ ≥ λs > 0 are the nonzero eigenvalues of the matrix P̂1/2YYᵀP̂1/2. We reject this subhypothesis when the observed Sτ2 is too large, that is, when the corresponding p-value, denoted p1, is no larger than the nominal level α = 0.05.
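A minimal pure-Python sketch of this score statistic follows, using the identity Sτ2 = ‖Yᵀ(T − μ̂)‖2 and a plain Newton-Raphson fit of the two-parameter null model (our illustration; the function names and the fitting routine are ours, not taken from the MiST package, and the eigenvalue-based p-value computation is omitted):

```python
import math

def fit_null_logistic(T, s, iters=50):
    # Newton-Raphson for the null model logit Pr(T=1) = a + b*s,
    # where s_i = y_i' 1_d is the per-subject AE count (tau^2 = 0).
    a, b = 0.0, 0.0
    for _ in range(iters):
        mu = [1 / (1 + math.exp(-(a + b * si))) for si in s]
        w = [m * (1 - m) for m in mu]
        # Gradient and 2x2 Hessian of the log-likelihood.
        g0 = sum(t - m for t, m in zip(T, mu))
        g1 = sum((t - m) * si for t, m, si in zip(T, mu, s))
        h00 = sum(w)
        h01 = sum(wi * si for wi, si in zip(w, s))
        h11 = sum(wi * si * si for wi, si in zip(w, s))
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:
            break
        da = (h11 * g0 - h01 * g1) / det
        db = (-h01 * g0 + h00 * g1) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b

def s_tau2(T, Y):
    # S_tau2 = (T - mu)' Y Y' (T - mu) = || Y'(T - mu) ||^2.
    s = [sum(row) for row in Y]
    a, b = fit_null_logistic(T, s)
    mu = [1 / (1 + math.exp(-(a + b * si))) for si in s]
    r = [t - m for t, m in zip(T, mu)]
    d = len(Y[0])
    return sum(sum(Y[i][j] * r[i] for i in range(len(T))) ** 2
               for j in range(d))
```

In practice, the p-value p1 is obtained from the weighted chi-square mixture above (eg, via Davies' method), which requires the eigenvalue computation not shown here.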

The score test statistic for testing H0 : π = 0|τ2 = 0 is given by

Uπ = (Y1d)ᵀ(T − μ̃),

where μ̃ = (μ̃1, …, μ̃N)ᵀ with μ̃i = logit−1(α̃0), and α̃0 is estimated under the null model with π = 0 and τ2 = 0.

Define σ̃i2 = μ̃i(1 − μ̃i) for i = 1, …, N, and D̃ = diag{σ̃12, …, σ̃N2}. Let Σ = (Y1d)ᵀ{D̃ − D̃1N(1NᵀD̃1N)−11NᵀD̃}(Y1d). Then, under π = 0 and τ2 = 0, we have Σ−1/2Uπ ~ N(0, 1), or UπᵀΣ−1Uπ ~ χ2(1). We reject this subhypothesis when the observed UπᵀΣ−1Uπ is too large, that is, when the corresponding p-value, denoted p2, is no larger than 0.05.
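Under the intercept-only null, μ̃i is simply the sample proportion of treated subjects, D̃ is a multiple of the identity, and Σ reduces to a scalar, so the test can be sketched in pure Python (our illustration; the simplified Σ is our algebraic reduction of the formula above, and the function name is ours):

```python
import math

def u_pi_test(T, Y):
    """Score test of H0: pi = 0 given tau^2 = 0.
    Under this null the model is intercept-only, so mu_tilde_i = mean(T)
    and Sigma reduces to sigma2 * sum((s_i - mean(s))^2)."""
    N = len(T)
    s = [sum(row) for row in Y]          # s_i = y_i' 1_d, per-subject AE count
    pbar = sum(T) / N                    # mu_tilde under the null
    U = sum(si * (t - pbar) for si, t in zip(s, T))
    sigma2 = pbar * (1 - pbar)
    sbar = sum(s) / N
    Sigma = sigma2 * sum((si - sbar) ** 2 for si in s)
    stat = U * U / Sigma                 # ~ chi2(1) under the null
    p = math.erfc(math.sqrt(stat / 2))   # chi2(1) survival function
    return stat, p
```

Note that Uπ compares the per-subject AE counts between arms, which makes the burden-test connection of Section 2.1 explicit.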

The two test statistics, Sτ2 and Uπ, are asymptotically independent,20 and we employ Fisher’s combined probability test to examine the null hypothesis H0 : π = 0 and τ2 = 0. The overall test statistic is T = −2(ln(p1) + ln(p2)), which converges to a χ2(4) distribution under H0. More detailed derivations of these results can be found in Reference 20, where the test procedure is called “MiST.” One may conduct SOC-level safety signal detection based only on Uπ or Sτ2. However, our simulations showed that the combined test statistic performed better, which is consistent with the observations in the MiST paper.20
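Because the χ2(4) distribution has the closed-form survival function exp(−x/2)(1 + x/2), Fisher's combination needs no statistical library; a minimal sketch (the function name is ours):

```python
import math

def fisher_combined(p1, p2):
    """Fisher's combination of two independent p-values:
    T = -2(ln p1 + ln p2) ~ chi2(4) under H0."""
    stat = -2.0 * (math.log(p1) + math.log(p2))
    # chi2 with 4 df: survival function exp(-x/2) * (1 + x/2).
    p = math.exp(-stat / 2) * (1 + stat / 2)
    return stat, p
```

For example, two borderline p-values of 0.05 combine to roughly 0.017, illustrating how the combined test pools evidence from the two score statistics.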

It is worth noting that, although we assume AEs (Y) are binary outcomes, the derivation of the test statistics does not require this assumption, and the new approach is still valid if AEs are measured with continuous or ordinal outcomes.

2.3 |. Inference at AE level

Given the testing results at SOC level, we could go further to identify individual AE types, which might be affected by treatment. To this aim, we propose the following procedure.

  1. Flag SOCs by applying a regular FDR approach (eg, the BH FDR control approach) on SOC level p-values obtained from the MiST tests.

  2. For each flagged SOC, conduct individual comparisons for each AE type within this SOC between the treatment and control arms using, for example, Fisher’s exact test. Then, flag the AE types in this SOC whose FDR-adjusted (eg, using BH or another FDR control method within this SOC) p-values are smaller than 0.05.

  3. All AE types within nonflagged SOCs are not flagged.
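The three steps above can be sketched as follows (a minimal Python illustration with hypothetical SOC and AE p-values; the function names are ours):

```python
def bh_adjust(pvals):
    # Benjamini-Hochberg step-up adjusted p-values.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj, running_min = [0.0] * m, 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adj[i] = running_min
    return adj

def hierarchical_flags(soc_pvals, ae_pvals, alpha=0.05):
    """Two-step procedure. soc_pvals: SOC -> SOC-level (MiST) p-value;
    ae_pvals: SOC -> list of within-SOC per-AE p-values."""
    socs = list(soc_pvals)
    # Step 1: BH across SOC-level p-values.
    soc_adj = bh_adjust([soc_pvals[k] for k in socs])
    flagged_socs = [k for k, q in zip(socs, soc_adj) if q <= alpha]
    # Step 2: BH within each flagged SOC only.
    flagged_aes = {}
    for k in flagged_socs:
        adj = bh_adjust(ae_pvals[k])
        flagged_aes[k] = [j for j, q in enumerate(adj) if q <= alpha]
    return flagged_socs, flagged_aes
```

AEs in SOCs that fail step 1 are never tested in step 2, which is exactly the screening behavior described above.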

Identifying signal AEs within a given SOC can essentially be viewed as a problem of multiple testing under dependence, which has become an active research topic in the multiple testing literature.23,24 In this article, however, we employ the classic BH procedure for its simplicity and its validity under dependence, though at a potential cost in power.

2.4 |. Relations with current approaches

Our approach is a hierarchical two-step test procedure. It first detects signals at the SOC level and then detects signals at the AE level. The first step can be viewed as a screening step that prevents null SOCs (ie, SOCs from which no AE should be flagged) from entering the second step. This is similar to the original DFDR approach10 and the new DFDR approach,11 although the three approaches propose different screening rules for the first step. Specifically, the MiST method tests all AEs in an SOC simultaneously, whereas DFDR (original or new) tests each AE separately within an SOC and then uses minP (of adjusted or raw p-values) as the overall test statistic for that SOC. At the second step, there are two versions of detecting AE-level signals:

  • Conduct signal detection within each SOC, and account for multiple comparisons using classic BH procedure or adaptive BH approach.25 This is employed in our hierarchical testing approach and the original DFDR approach.

  • Pool together all AEs from all SOCs entering the second step and then apply the BH procedure to these AEs to detect signals. This is proposed in the new DFDR approach.

The first approach leads to a controlled overall FDR if the screening step is tight enough to ensure that the number of null SOCs entering the second step is not too large. There is no theoretical guarantee of a controlled overall FDR for our approach or for general hierarchical testing approaches.19 When the FDR level is set at α2, the second approach should almost always lead to a well-controlled FDR (at level α2), though this is also not theoretically guaranteed, as mentioned in Reference 11. Because the second approach involves a heavier adjustment for multiplicity, we argue that the first approach should be more powerful.

It is important to point out that the separate subset BH approach (ssBH26) is not a two-step approach, although it also first divides hypotheses into groups. The ssBH procedure conducts a separate BH procedure (at subset-dependent FDR levels) on each of these subsets, and rejections of any hypotheses in any of these separate BH procedures are considered final. Similarly, the group BH approach13 is not a two-step approach either. The group BH approach first divides hypotheses into groups in order to reweight p-values within each group. All hypotheses are then pooled together, and a single BH procedure is applied to the reweighted p-values to identify hypotheses to be rejected.

On the other hand, at the SOC level, our approach is similar to the Bayesian approaches,15,16 in that they all try to combine safety information within an SOC and hence gather strength within an SOC. In contrast to the Bayesian approaches, however, our approach is in essence a frequentist approach that does not require the specification of several layers of prior distributions and related assumptions. The random effects assumption, Equation (2), can be viewed as a counterpart of the prior distribution on regression coefficients in Bayesian models, but it can also be related to certain shrinkage estimation approaches, for example, ridge estimation. That said, our proposed approach is a hybrid approach that derives test statistics based on a working model that resembles models used in some Bayesian approaches.

Finally, we would like to point out that the new approach works directly on person-level safety data, while all other frequentist approaches mentioned in this article work on summary statistics, that is, p-values. Working directly on person-level data (as opposed to summary statistics) allows an approach to account for dependence among AEs (and hence among individual p-values) and thereby improve its performance.

3 |. SIMULATION STUDIES

To examine the validity of the proposed testing procedure under finite sample sizes, we conduct numerical studies to check (a) the false discovery rate (FDR) and (b) the power, at both the SOC level and the individual AE level, under different scenarios. Here the FDR is calculated as the proportion of falsely flagged signals (SOCs at the SOC level or AEs at the AE level) among all flagged signals, and power is calculated as the proportion of correctly flagged signals among the true signals that should be flagged.
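These two operating characteristics can be computed from a set of flagged signals and the set of true signals as follows (a minimal sketch; the function name and the convention of returning FDR 0 when nothing is flagged are ours):

```python
def fdr_and_power(flagged, truth):
    """FDR: share of false flags among all flags;
    power: share of true signals correctly flagged."""
    flagged, truth = set(flagged), set(truth)
    fdr = len(flagged - truth) / len(flagged) if flagged else 0.0
    power = len(flagged & truth) / len(truth) if truth else float("nan")
    return fdr, power
```

In the simulations, these quantities are averaged over replications, with signal labels being either SOC names or (SOC, AE) pairs depending on the level of analysis.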

We focus on comparisons between frequentist approaches. We compared our approach with the following currently available frequentist testing procedures in terms of FDR and power.

  1. NOADJ: unadjusted significance testing with raw p-values based on Fisher’s exact two-sided test for each type of AE.

  2. BONF: the Bonferroni correction.

  3. BH: the Benjamini-Hochberg procedure for FDR control, also called one-step BH in Reference 11.

  4. New DFDR: the new double false discovery rate approach.11

  5. GBH: group Benjamini-Hochberg.13

  6. ssBH: subset Benjamini-Hochberg.26

The first five approaches were included in the simulation studies in Reference 11, while the last one was not. We still include the first five approaches in our simulation studies because our studies contain extra scenarios not examined in Reference 11, and it is worth further comparing these approaches under those extra scenarios.

Except for new DFDR, which first flags SOCs and then flags AEs within each flagged SOC, none of these approaches is designed for safety signal detection at the SOC level. To enable comparison at the same level, we use an intuitive SOC-level inference based on individual AE testing results: we flag an SOC if any AE within it is flagged.

We did not include comparisons between frequentist and Bayesian approaches, that is, Berry and Berry,15 MBLR,17 and those described in Reference 16, in this article. The reason is that it is not straightforward to ensure fair comparisons between them. To mention one issue, Bayesian approaches flag individual AEs based on certain posterior probabilities. For example, as described in Reference 16, AE j should be claimed as a signal if Pr(θbj > a | Data) > p, where a and p are prespecified constants. However, different specifications of the constants a and p are required under different simulation scenarios, and there is no consensus on how to determine scenario-specific optimal constants (a, p). For example, using constants like a = 0 and p = 0.95 for all scenarios would very likely lead to unfair comparisons between frequentist and Bayesian approaches.

3.1 |. Simulation setup

We largely followed the simulation designs in Reference 11, but with a few extra settings that emphasize the low frequency and weak signals of safety AEs, as well as possible negative correlations among different AEs, which may occur in real clinical trials.

In this simulation setting (cited from Reference 11, with minor revisions), the simulation started by generating correlated binary random variables following Lunn and Davies.27 For a given patient, Ykj ∈ {0, 1}, 1 ≤ k ≤ s, 1 ≤ j ≤ dk, denotes the occurrence of AE type j within SOC k and was generated using Ykj = AjXkj, where Aj ~ i.i.d. Bernoulli(ϕj), and Xkj = (1 − Ukj)Vkj + UkjZk with Ukj ~ i.i.d. Bernoulli(rk) and Vkj, Zk ~ i.i.d. Bernoulli(pk). The random variables Aj, Ukj, Vkj, and Zk are mutually independent. Thus, within each SOC, the Ykj were positively correlated Bernoulli variables with E{Ykj} = ϕjpk, var{Ykj} = ϕjpk(1 − ϕjpk), and correlation ρjj′(k) = corr(Ykj, Ykj′) = √(ϕjϕj′)(1 − pk)rk2 / √{(1 − ϕjpk)(1 − ϕj′pk)}. Under this setting, the maximum correlation between any two AEs within SOC k was rk2 (attained when ϕj = ϕj′ = 1), and AE types in different SOCs were independent.
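The generation scheme for a single SOC can be sketched as follows (our function and parameter names; the scalars ϕj, pk, and rk are passed as plain arguments, and the seed is arbitrary):

```python
import random

def simulate_soc(n, phi, p, r, seed=0):
    """Generate AE indicators for one SOC for n patients.
    phi: per-AE thinning probabilities (list of length d);
    p: base Bernoulli rate; r: mixing weight, so the maximum
    within-SOC correlation is r**2 (Lunn-Davies construction)."""
    rng = random.Random(seed)
    Y = []
    for _ in range(n):
        Z = rng.random() < p          # shared component for this patient/SOC
        row = []
        for phi_j in phi:
            U = rng.random() < r      # mix shared vs independent component
            V = rng.random() < p
            X = Z if U else V         # X = (1-U)V + U*Z
            A = rng.random() < phi_j  # thinning: E[Y] = phi_j * p
            row.append(1 if (A and X) else 0)
        Y.append(row)
    return Y
```

Concatenating such blocks across SOCs, with treatment-arm-specific rates p + δ for the designated signal AEs, reproduces the structure of the settings in Table 1.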

To complete the setting, we must specify sample size, number of SOCs, number of AE types in each SOC, and signal strength for AEs, which have different occurrence rates between treatment and control arms. The settings as specified in the new DFDR paper11 and extra settings we have examined are listed in Table 1.

TABLE 1.

Simulation settings

Setting SOC (number of AEs) r2 p δ Treatment
1 6 (10,3,6,5,10,8) 0.5,0.25,0,0.625,0.36,0 5%, 10%, 15%, 20%, 25%, 30% 15% (a) Null, (b) 60% in BS1, (c) 60% in BS1 and BS4, (d) 100% in BS1, (e) 100% in BS1 and BS4
2 20 (2 (1–5),5 (6–10),10 (11–15),20 (16–20)) 0.5 (1–5), 0.25 (6–10), 0.36 (11–15), 0.5 (16–20) 5% (1–5), 10% (6–10), 15% (11–15), 20% (16–20) 15% (a) Null, (b) 60% in (BS10 BS15 BS20), (c) 60% in (BS9 BS10 BS14 BS15 BS19 BS20), (d) 100% in (BS10 BS15 BS20), (e) 100% in (BS9 BS10 BS14 BS15 BS19 BS20)
3 6 (10,3,6,5,10,8) 0.5,0.25,0,0.625,0.36,0 3%, 5%, 7%, 9%, 11%, 13% 5% (a) Null, (b) 60% in BS1, (c) 60% in BS1 and BS4, (d) 100% in BS1, (e) 100% in BS1 and BS4
4 20 (2 (1–5),5 (6–10),10 (11–15),20 (16–20)) 0.5 (1–5), 0.25 (6–10), 0.36 (11–15), 0.5 (16–20) 5% (1–5), 7% (6–10), 9% (11–15), 11% (16–20) 5% (a) Null, (b) 60% in (BS10 BS15 BS20), (c) 60% in (BS9 BS10 BS14 BS15 BS19 BS20), (d) 100% in (BS10 BS15 BS20), (e) 100% in (BS9 BS10 BS14 BS15 BS19 BS20)

Abbreviations: AE, adverse event; SOC, system organ class.

Settings (or sets) 1 and 2 are the same as in Reference 11, while settings 3 and 4 are revised settings of settings 1 and 2, respectively, to emphasize rare occurrence and weak signals. Under each setting, five parameter setups were considered, as indicated in the “Treatment” column. These five setups include the null situation where the treatment arm and the control arm have the same safety profile, and four setups under which the two arms have different safety profiles. For the latter four setups, we specify the SOCs and specific AEs in these SOCs on which the two arms have different occurrence rates. The difference in occurrence rates (5% or 15%) is specified by δ as shown in the column “δ.”

To allow for negative correlations among AEs, and to allow variation in the direction of signals (eg, a drug may cause one AE but suppress another), we create, for each of the above settings, a new setting by reversing Ykj (ie, using 1 − Ykj) for one out of every four AEs generated under the old setting. We use this arbitrary (convenient but unrealistic) setup of negative correlations among AEs to examine the robustness of the safety data analysis approaches when the assumption of positive correlations among AEs does not hold. We name the new settings (corresponding to settings 1, 2, 3, and 4) scenarios 5, 6, 7, and 8, respectively. This leads to a total of 120 (= 2 × 4 × 5 × 3) combinations of setting (or set)/setup/sample size.
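The reversal can be sketched as follows (which column in each block of four is flipped is arbitrary; we flip the first, matching the arbitrariness noted above):

```python
def reverse_every_fourth(Y):
    """Flip one out of every four AE columns (here, columns 0, 4, 8, ...)
    so that Y' = 1 - Y for those columns, inducing negative correlations
    with the remaining AEs in the same SOC."""
    return [[1 - v if j % 4 == 0 else v for j, v in enumerate(row)]
            for row in Y]
```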

It is worth noting that the above data simulation process is independent of the underlying models of those model-based approaches including MiST.

Sample sizes were set to 150, 300, and 450 per arm for weak signal settings, that is, δ = 0.05, and 50, 100, and 150 per arm for strong signal settings, that is, δ = 0.15. For each setting/scenario/sample size, we ran the simulation with 2000 replications.

As a reviewer pointed out, under the original setups (a to e), a given SOC has either none or most of its AEs associated with treatment. Such a feature could be atypical in practice and likely favors the two-step approach over single-step approaches. We thus, for settings 1 and 2, included three more setups: (1) 10% signal AEs in each SOC, (2) 30% signal AEs in each SOC, and (3) 50% signal AEs in each SOC. These cover scenarios under which true signals are rare and scenarios under which signal AEs are uniformly distributed across SOCs. Under the latter scenarios, single-step approaches are expected to be more favorable: when signal AEs are evenly distributed among all SOCs, no SOC should be removed at step 1, and the number of tests at step 2 will not be reduced.

All approaches were implemented using the R package “c212,”28 except for the MiST approach, which was implemented using the R package “MiST.”29 To completely specify the new hierarchical approach, in our simulation studies, we set both SOC level (step 1) target FDR and within SOC level (step 2) target FDR at 0.05 to achieve the overall FDR target value 0.05.

3.2 |. Simulation results

To save space and to emphasize the main findings, we present only simulation results that convey new information not presented in Reference 11. We first note (data not shown to save space) that, with the target overall FDR level set at 0.05, we reach the same conclusions as in Reference 11 in terms of comparisons across new DFDR, the unadjusted approach, the Bonferroni approach, the BH procedure, and GBH:

  1. no-adjustment approach and GBH failed to control overall FDR;

  2. the FDR of Bonferroni approach, one-step BH, and new DFDR was consistently at or below target level, that is, 0.05 in our simulations;

  3. the new DFDR was consistently more powerful than Bonferroni approach, one-step BH, and original DFDR.

The above findings applied to both the settings tested in Reference 11 and the new settings we added.

We thus focus on comparing three approaches, new DFDR, hierarchical, and ssBH, in terms of power and FDR. We treat the new DFDR approach as the reference approach and present the comparisons in the following figures: (a) the hierarchical approach (which we call MiST for simplicity) vs new DFDR in Figure 1, and (b) ssBH vs new DFDR in Figure 2. From these plots, the following findings are worth emphasizing.

FIGURE 1.

False discovery rate (FDR) for global nulls (1st row), and FDR (2nd row) and power (3rd row) for alternatives at the SOC (left) and AE (right) levels, MiST vs DFDR, for all scenario settings. Line segments connect paired points (representing DFDR and MiST results) from the same scenario setting, as defined by scenario number, parameter set, and sample size. For global nulls (1st row), there are three dots for each scenario. For alternatives (2nd and 3rd rows), there are 12 dots for each scenario, with each block of three dot-pairs, from left to right, corresponding to sets (b), (c), (d), and (e) as specified in Table 1. The three dots in each block correspond to the three sample sizes, in increasing order from left to right

FIGURE 2.

False discovery rate (FDR) for global nulls (1st row), and FDR (2nd row) and power (3rd row) for alternatives at the SOC (left) and AE (right) levels, ssBH vs DFDR, for all scenario settings. Line segments connect paired points (representing DFDR and ssBH results) from the same scenario setting, as defined by scenario number, parameter set, and sample size. For global nulls (1st row), there are three dots for each scenario. For alternatives (2nd and 3rd rows), there are 12 dots for each scenario, with each block of three dot-pairs, from left to right, corresponding to sets (b), (c), (d), and (e) as specified in Table 1. The three dots in each block correspond to the three sample sizes, in increasing order from left to right

  1. The new approach led to a well-controlled SOC-level FDR (upper and middle left plots of Figure 1) and a well-controlled overall FDR for all settings we tried (upper and middle right plots of Figure 1).

  2. The MiST approach was consistently more powerful than new DFDR (bottom right plot of Figure 1), and the difference in overall power between MiST and new DFDR was substantial, with a median improvement of about 0.11 (range 0.02–0.21). The improvement in overall power comes largely from the improved power for detecting signal SOCs, as indicated in the bottom left plot of Figure 1: the MiST approach was consistently and substantially more powerful than new DFDR in terms of SOC-level power, with a median improvement of about 0.20 (range 0.05–0.43).

  3. When the overall occurrence rates decrease (eg, pi ∈ [5%, 30%] vs pi ∈ [3%, 13%]) and the difference in safety profiles between the two arms decreases (eg, δ = 0.15 vs δ = 0.05), we observed a sharp decrease in the power of both the new DFDR and MiST approaches, and an increased power difference between them. This highlights a very important feature of the MiST approach: it performed much better than new DFDR for detecting safety signals under rare-occurrence and weak-signal situations, which are commonly encountered in clinical studies.

  4. As one example, for setting 4 (185 AEs in 20 SOCs, weak signal), setup 2 (60% AE signals in 15% of SOCs), and sample size 450, MiST power = 0.81 vs new DFDR power = 0.38 at the SOC level, and MiST power = 0.39 vs new DFDR power = 0.26 at the AE level.

  5. However, it remains more challenging to correctly flag AE types than to flag SOCs when AEs are rare, the signal is weak, and the number of AE types is large (settings 4 and 8, sample size = 150).

  6. As shown in Figure 2, both the ssBH and new DFDR approaches led to well-controlled FDR, but both could become extremely conservative under nearly all settings at both the SOC and AE levels. In addition, new DFDR was consistently more powerful than ssBH in terms of AE-level power (bottom right plot of Figure 2), and also (for most scenarios) in terms of SOC-level power (bottom left plot of Figure 2).

  7. As shown in Figure 3, when true signals are not clustered in a few SOCs but are randomly (uniformly) distributed across SOCs, the advantage of two-step approaches over one-step approaches becomes negligible, especially when the sample size is large.

FIGURE 3.

False discovery rate (FDR) (1st row) and power (2nd row) for the three extra alternatives at the SOC (left) and AE (right) levels, for MiST, DFDR, and BH under scenario settings 1 and 2. Line segments connect points (representing DFDR, MiST, and BH results) from the same scenario setting, as defined by scenario number, parameter set, and sample size. In each plot, there are nine dots for each scenario, with each block of three dots, from left to right, corresponding to the three extra sets: (1) 10%, (2) 30%, and (3) 50% true signals in each SOC. The three dots in each block correspond to the three sample sizes, in increasing order from left to right

4 |. REAL EXAMPLES

We present in this section secondary analyses of AE data from two clinical trials.

4.1 |. Clinical Trial: MA.20

This randomized, open-label, placebo-controlled study was designed to compare whole-breast irradiation plus regional nodal irradiation (WBI+RNI) with whole-breast irradiation (WBI) alone in women with early-stage breast cancer who were treated with breast-conserving surgery and adjuvant systemic therapy.30 The primary objective was to assess the effect of WBI+RNI by comparing overall survival between patients treated with WBI+RNI and patients treated with WBI alone.

A total of 1832 patients were randomly assigned to WBI (n = 916) or WBI+RNI (n = 916). The safety analysis was based on the treated population, that is, patients were analyzed according to the treatment they actually received. Among the 1820 patients who received treatment, 927 received WBI and 893 received WBI+RNI. Of them, about 300 did not report any delayed AE (those occurring > 3 months after the completion of radiation). Among those who reported any delayed AE, a total of 111 AE types were reported, and these AEs were grouped into 22 SOCs.

The following AEs (Table 2) were reported at significantly different rates between the two arms, according to unadjusted p-values (Fisher’s exact test) with cut-off p ≤ 0.05.

TABLE 2.

Delayed AEs of Grade 1 or higher for MA.20

SOC AE WBI (927) WBI + RNI (893) Raw p-value MiST new DFDR
Blood and lymphatic system disorders Blood and lymphatic system disorders-Other 180 (19.4) 226 (25.3) 7.9e-4 0.0032 0.0032
Musculoskeletal and connective tissue disorders Arthralgia 85 (9.2) 108 (12.1) 0.048 NS NS
Musculoskeletal and connective tissue disorders Generalized muscle weakness 131 (14.1) 157 (17.6) 0.046 NS NS
Nervous system disorders Peripheral motor neuropathy 16 (1.7) 30 (3.4) 0.035 NS NS
Nervous system disorders Peripheral sensory neuropathy 187 (20.2) 218 (24.4) 0.032 NS NS
Respiratory thoracic and mediastinal disorders Dyspnea 118 (12.7) 150 (16.8) 0.017 NS NS
Respiratory thoracic and mediastinal disorders Pulmonary fibrosis 3 (0.3) 32 (3.6) 1.2e-7 1.1e-6 1.1e-6

Abbreviations: AE, adverse event; DFDR, double false discovery rate; SOC, system organ class; WBI+RNI, whole-breast irradiation plus regional nodal irradiation.
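The raw p-values in Table 2 are two-sided Fisher’s exact tests on the 2 × 2 table of AE counts by arm. As an illustration (not the authors’ original code), the two rows that survive multiplicity adjustment can be checked with `scipy.stats.fisher_exact`, deriving non-event counts from the arm sizes:

```python
from scipy.stats import fisher_exact

ARMS = {"WBI": 927, "WBI_RNI": 893}  # treated patients per arm (Table 2)

def raw_p(events_wbi, events_rni):
    """Two-sided Fisher's exact test on the 2x2 table (AE vs no AE, by arm)."""
    table = [[events_wbi, ARMS["WBI"] - events_wbi],
             [events_rni, ARMS["WBI_RNI"] - events_rni]]
    _, p = fisher_exact(table, alternative="two-sided")
    return p

p_blood = raw_p(180, 226)    # "Blood and lymphatic system disorders-Other"
p_fibrosis = raw_p(3, 32)    # "Pulmonary fibrosis"
```

Both p-values fall well below the 0.05 cut-off, consistent with the “Raw p-value” column of Table 2.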

SOC level:

At an FDR level of 0.05, with adjustment for multiplicity using the BH approach, the MiST approach flagged two SOCs: (a) “Blood and lymphatic system disorders” (FDR-adjusted p-value = 0.02) and (b) “Respiratory thoracic and mediastinal disorders” (FDR-adjusted p-value = 0.002). A further check of the two components of the MiST test statistic showed that the main contribution to the difference arose from the mean level of AE occurrence (FDR-adjusted burden test p-value < 0.006), not from the variation among individual AE occurrences (FDR-adjusted SKAT test p-value > 0.25). In this example, BH and new DFDR (frequentist approaches with appropriate control of FDR) also flagged these two, and only these two, SOCs; subset BH (ssBH) flagged only the second SOC.
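The FDR-adjusted p-values quoted above are of the BH step-up kind. A generic sketch of the adjustment follows (illustrative only; the actual SOC-level computation applies this to the MiST p-values of all 22 SOCs, which are not listed here):

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg FDR-adjusted p-values (step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity from the largest p-value downward, capped at 1
    adj = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    out = np.empty(m)
    out[order] = adj
    return out
```

An SOC (or AE) is flagged when its adjusted p-value falls below the target FDR level.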

Individual AE level:

After adjustment for multiplicity, two AEs, (a) “Blood and lymphatic system disorders - Other” in SOC “Blood and lymphatic system disorders” and (b) “Pulmonary fibrosis” in SOC “Respiratory thoracic and mediastinal disorders,” were flagged as statistically different between the two arms by either the MiST approach or the new DFDR approach. The “Blood and lymphatic system disorders - Other” signal is attributed to a significant increase in the rate of lymphedema in the WBI+RNI arm caused by the RNI.

4.2 |. Clinical Trial NIDA-CSP-999: Multicenter Clinical Trial of Buprenorphine - 3

This is a multicenter, randomized, double-blinded Phase 3 trial31 in which a total of 736 subjects were accrued from 12 sites and randomly assigned to four treatment arms with different daily dosages of buprenorphine, 1 mg (n = 185), 4 mg (n = 182), 8 mg (n = 188), and 16 mg (n = 181), over a 16-week treatment period for the maintenance treatment of heroin addiction. The primary objective was to evaluate the safety and efficacy of an 8-mg/day sublingual dose of buprenorphine by comparison with a 1-mg/day dose; as a secondary objective, outcomes were determined concurrently for patients treated at the two other dose levels (ie, 4 mg/day and 16 mg/day). Of these 736 patients, 118 did not report any AE. Among those who reported any AE, a total of 185 AE types were reported, and these AEs were grouped into 12 SOCs.

We analyzed the AE data for the two primary arms: 1 mg/day vs 8 mg/day. The following AEs (Table 3) were reported at significantly different rates between the two arms, according to unadjusted p-values (Fisher’s exact test) with cut-off p ≤ 0.05.

TABLE 3.

AEs of Grade 1 or higher for NIDA-CSP-999

SOC AE 1 mg (185) 8 mg (188) Raw p-value MiST new DFDR
BODY AS A WHOLE FLU SYND (13) 6 (3.2) 23 (12.2) 0.00086 NS NS
BODY AS A WHOLE PAIN BACK (26) 18 (9.7) 32 (17.0) 0.0330 NS NS
DIGESTIVE SYSTEM CONSTIP (43) 12 (6.5) 26 (13.8) 0.0164 NS NS
DIGESTIVE SYSTEM DIARRHEA (44) 19 (10.3) 8 (4.3) 0.0441 NS NS
DIGESTIVE SYSTEM NAUSEA (57) 13 (7.0) 27 (14.4) 0.0190 NS NS

Abbreviations: AE, adverse event; DFDR, double false discovery rate; SOC, system organ class.

SOC level:

With adjustment for multiplicity, no SOC was flagged at an FDR level of 0.05. However, at a level of 0.10, the MiST approach flagged the “DIGESTIVE SYSTEM” SOC (FDR-adjusted p-value = 0.072). A further check of the two components of the MiST test statistic showed that the main source of the difference arose from the variation among individual AEs (FDR-adjusted SKAT test p-value = 0.06) rather than from the mean level of AE occurrence (FDR-adjusted burden test p-value = 0.69). This is consistent with the findings that (a) constipation (p = 0.0164) and nausea (p = 0.0190) occurred more frequently in the 8-mg group than in the 1-mg group, and (b) diarrhea was complained of more frequently in the 1-mg group than in the 8-mg group (p = 0.0441). This implies that the AEs were negatively correlated, as in scenarios 5–8 of our simulation studies. At an FDR level of 0.10, no SOC was flagged by any other approach with adjustment for multiplicity.

Individual AE level:

With adjustment for multiplicity, no AE was flagged as statistically different between the two arms. Without adjustment for multiplicity at the AE level, however, we would flag “CONSTIP,” “DIARRHEA,” and “NAUSEA” in the SOC “DIGESTIVE SYSTEM” (the only SOC that was flagged [only by the MiST approach] at FDR level 0.10).

The observed negative correlation between diarrhea and constipation (and nausea) is worth discussing. It shows a real case in which negative dependence among AEs exists. Indeed, this observed negative correlation is unlikely to be due to random chance; it is more likely due to the pharmacological properties of buprenorphine, as the following pharmacological interpretation indicates. Buprenorphine is a unique medication in the opioid class because it is a partial agonist at the mu-opioid receptor and a kappa receptor antagonist.32 In addition, a greater number of mu-opioid receptors must be activated by buprenorphine for this medication to exhibit an effect similar to that of full opioid agonists such as morphine.33 This may explain the significantly higher incidence of constipation and nausea, a class effect of opioids, in the 8-mg dose group. At the lower 1-mg dose level, excessive kappa receptor inhibition may contribute to the increased incidence of diarrhea, because kappa receptors, when activated at the central level, normally inhibit colonic contractions.34

5 |. DISCUSSION

This article proposed a hierarchical testing approach for analyzing safety data from randomized clinical trials. Explicitly taking advantage of the natural hierarchical structure of AEs, we group AEs (PT level) into SOCs, first try to exclude SOCs that contain no true discoveries, and then focus on identifying signal AEs within the SOCs passing the screening step. We reformulate the problem of screening SOCs as one of testing independence between treatment and safety profile; after straightforward transformation and simplification, this independence testing problem leads to a problem very similar to a rare-variant genome-wide association study (GWAS) problem. We thus borrow the strength of a well-studied approach, MiST, to tackle safety signal detection in clinical trials, which, in general, is characterized by the combination of (a) high dimension, that is, a very large number of different AE types, (b) rare occurrence, and (c) weak signal, or minimal treatment effects on individual AE types. The reversed logistic regression model used in the independence testing also frees us from explicitly accounting for complex dependence among AEs. At the SOC level, the reversed logistic regression approach works directly on person-level data, treating AEs as covariates and treatment assignment as the outcome. In fitting the reverse logistic regression model, correlations among covariates (AEs) are allowed and implicitly accounted for, and will not cause much loss of efficiency in model fitting as long as the dependence is not too extreme. In contrast, approaches (eg, DFDR) that work on summary statistics (p-values) at the SOC level perform well only when the p-values are independent or weakly dependent, and lose efficiency (power) when the dependence among the summary statistics is not negligible.
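To illustrate the reverse regression idea described above, one can regress treatment assignment on the AE indicators of one SOC and test the AE coefficients jointly. The sketch below is a hypothetical stand-in using a simple likelihood-ratio test, not the MiST score test itself, on simulated toy data:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def neg_loglik(beta, X, t):
    """Negative log-likelihood of a logistic regression of t on X."""
    eta = X @ beta
    return -(t @ eta - np.logaddexp(0.0, eta).sum())

def reverse_lr_test(A, t):
    """Regress treatment t (0/1) on the AE indicator matrix A of one SOC and
    test all AE coefficients jointly with a likelihood-ratio test."""
    n, d = A.shape
    X = np.column_stack([np.ones(n), A])          # intercept + AE indicators
    full = minimize(neg_loglik, np.zeros(d + 1), args=(X, t))
    null = minimize(neg_loglik, np.zeros(1), args=(np.ones((n, 1)), t))
    lr = 2.0 * (null.fun - full.fun)              # ~ chi2 with d df under H0
    return chi2.sf(lr, df=d)

# toy data: 200 patients, 3 AE types; the first AE is enriched in the treated arm
rng = np.random.default_rng(0)
t = rng.integers(0, 2, size=200)
probs = 0.1 + 0.4 * t[:, None] * np.array([1, 0, 0])
A = rng.binomial(1, probs)
p_soc = reverse_lr_test(A, t)
```

Because the AEs enter jointly as covariates, their mutual correlation is absorbed by the model fit rather than handled by an explicit dependence correction.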

The new approach and the new DFDR approach share a common rationale: flagging individual AE types could be infeasible due to multiplicity and weak signals, so it is necessary to group AE types in some way, for example, by SOC, and combine signals within each group. This is also a key motivation of several Bayesian approaches, although we focus on frequentist approaches in this study. Compared with new DFDR, our approach provides a different way of flagging an SOC. Our simulation studies clearly showed the superiority of the new approach over new DFDR, and over other approaches with controlled FDR, in terms of power for detecting true safety signals at the SOC level. This superiority remains when we move to safety signal detection at the AE level based on FDR-adjusted p-values within screened SOCs. This is especially true under the most practical scenarios, with (1) high dimension, (2) rare occurrence, and (3) weak signal.

Although most frequentist approaches discussed here originated in the area of statistical genetics, our simulations showed that their performance could be dramatically different when they are employed to detect safety signals. For example, the group BH approach became invalid when applied to safety signal detection because of its uncontrolled FDR, as observed in Reference 11 and confirmed in our simulations (data not shown). The group BH approach contains two steps: p-values are weighted in step 1 (p-values could become smaller after applying the weights), and a BH FDR adjustment is applied to the weighted p-values in step 2. The validity of the group BH approach relies on a “sparsity of signal” assumption, which is reasonable and widely accepted in GWAS research. However, this assumption does not hold in our simulated scenarios and may not be readily justifiable in clinical trial practice.

Overall, the new hierarchical approach robustly outperforms the other currently available approaches for safety signal detection. Moreover, the computational cost of the new approach is minimal. For example, it took about 0.3 seconds on a PC (Xeon(R) CPU E3-1226 v3@3.30GHz, single process) to complete one replication for setting 2 (185 AEs in 20 SOCs), scenario 1, with a sample size of 50 per arm. Thus, our simulation results provide strong support, in terms of FDR, power, and ease of implementation, for the use of the new hierarchical approach, over other approaches with valid FDR control, for flagging AEs at both the SOC level and the individual AE level.

The hierarchical testing approach is expected to be much more powerful than single-step approaches under certain scenarios,19 and our study has shown very promising results when this approach is applied to safety data analysis. Much remains to be pursued to consolidate the hierarchical testing approach. First, although we propose using the MiST approach for SOC-level detection, there are other approaches35–37 for testing independence between treatment assignment and multivariate safety outcomes. It would be interesting to compare different approaches in future work. Second, we employed the classic BH procedure at the second step to identify signal AEs. The BH approach is well known to be very conservative, especially for dependent tests.38 A more powerful FDR control approach for correlated tests might further improve the power of hierarchical testing approaches for detecting safety signals. This also merits further exploration.

Last but most important, control of the overall FDR for hierarchical testing approaches is still an open question.19 In our simulation studies, the overall FDR was well controlled, at or below the target FDR. However, this does not mean that our approach theoretically guarantees a controlled overall FDR. More likely, the observed control arose from the fact that we used a target FDR of 5% at the SOC level, and the total number of SOCs in our simulation studies was no larger than 20. This limited the total number of null SOCs entering the second step and hence prevented an inflated overall FDR. A thorough discussion of overall FDR control is beyond the scope of this article, but we hypothesize that control of the overall FDR for hierarchical testing approaches is achievable by appropriately choosing selection criteria (not necessarily FDR levels) at both the SOC level (step 1) and the AE level (step 2), and that there might exist an optimal combination of selection criteria that maximizes power while controlling the overall FDR.

Dependence among tests could be one cause of power loss for one-step approaches such as BH and ssBH. However, carefully accounting for dependence among correlated tests may not fully recover the lost power. To support this argument, we tested a bootstrap FDR control approach,38 applied to the pooled AEs (ie, ignoring the group structure), and compared its performance with that of new DFDR. As shown in Figure 4, the single-step bootstrap FDR control approach tended to have a larger FDR (ie, an FDR closer to the target value) than the new DFDR approach when the sample size is not small (upper and middle plots of Figure 4). However, the bootstrap FDR control approach was consistently less powerful than the new DFDR approach, and thus much less powerful than our new approach. This implies that accounting for the hierarchical structure among AEs can be advantageous. Our simulation results suggest that the hierarchical testing approach is among the most promising ways to account for the hierarchical structure among AEs, although further consolidation is desired.

FIGURE 4.

False discovery rate (FDR) for global nulls (1st row), and FDR (2nd row) and power (3rd row) for alternatives at the AE level, bootstrap approach vs DFDR, for all scenario settings. Line segments connect paired points (representing DFDR and bootstrap approach results) from the same scenario setting, as defined by scenario number, parameter set, and sample size. For global nulls (1st row), there are three dots for each scenario. For alternatives (2nd and 3rd rows), there are 12 dots for each scenario, with each block of three dot-pairs, from left to right, corresponding to sets (b), (c), (d), and (e) as specified in Table 1. The three dots in each block correspond to the three sample sizes, in increasing order from left to right

Safety data analysis is not limited to safety signal detection. Similarly or even more important questions include, but are not limited to, (1) subgroup analysis to identify vulnerable subgroups, and (2) benefit-risk assessment that jointly analyzes efficacy and safety outcomes. However, safety signal detection is the gate-keeping step; other safety questions become clinically relevant only when we are sure that safety issues exist. Moreover, statistical methodologies useful for detecting safety signals might also be applicable to follow-up questions. For example, subgroup analysis can be viewed as a more detailed safety signal detection problem, that is, examining whether safety signals exist in prespecified subgroups of patients. We thus argue that safety signal detection still deserves researchers’ attention.

Supplementary Material

Editor comments

ACKNOWLEDGEMENTS

The authors wish to thank the editor, associate editor, and the referees for helpful comments and suggestions, which have led to an improvement of this article. Dr. Tan’s work was partially supported by NCI grant P30CA016086; Dr. Chen’s work was supported in part by grants from the NSERC; Dr. Ibrahim’s research was partially supported by NIH grants #GM 70335 and P01CA142538. The second real data example used data from “NIDA-CSP-999: A Multicenter Clinical Trial of Buprenorphine in Treatment of Opiate Dependence”. NIDA databases and information are available at http://datashare.nida.nih.gov.

Funding information

Natural Sciences and Engineering Research Council of Canada

APPENDIX

One-sided test vs two-sided test

In the above derivations, we assume two-sided alternative hypotheses at both the SOC and AE levels. However, in some situations it is also worth considering a one-sided test. For example, in a placebo-controlled trial comparing the safety profile of a new treatment with placebo, it is reasonable to expect that patients under placebo should not experience any drug-related safety issues. Thus, the null and alternative hypotheses are better expressed as:

H_0: p_{d1} = p_{d0} \quad \text{vs} \quad H_1: p_{d1} > p_{d0}, \quad \text{for all } d = 1, 2, \ldots, D.

The MiST approach can be slightly revised to accommodate such one-sided hypothesis testing. First, the corresponding hypothesis underlying the MiST framework is:

H_0: \beta_1 = \cdots = \beta_D = 0 \quad \text{vs} \quad H_1: \beta_d > 0 \text{ for some } d \in \{1, 2, \ldots, D\}.

Since we assume that

\beta_j \stackrel{\text{i.i.d.}}{\sim} N(\pi, \tau^2),

the above null hypothesis is equivalent to:

H_0: \pi = 0 \text{ and } \tau^2 = 0,

and the alternative hypothesis becomes

H_1: \pi > 0 \text{ or } \tau^2 > 0.

Recall that this hypothesis can be divided into two parts: (a) H_{01}: \tau^2 = 0, and (b) H_{02}: \pi = 0 \mid \tau^2 = 0. We test the first hypothesis H_{01} in the same way as described above. For the second hypothesis H_{02}: \pi = 0 \mid \tau^2 = 0, the following test statistic is used:

U_\pi = (Y \mathbf{1}_D)^T (T - \tilde{\mu}),

where \tilde{\mu} = (\tilde{\mu}_1, \ldots, \tilde{\mu}_N)^T with \tilde{\mu}_i = \text{logit}^{-1}(\tilde{\alpha}_0), and \tilde{\alpha}_0 is estimated under the null model with \pi = 0 and \tau^2 = 0. Under \pi = 0 and \tau^2 = 0, we have \Sigma^{-1/2} U_\pi \sim N(0, 1), with \Sigma^{-1/2} defined as before. For the one-sided test, we reject the null hypothesis H_{02}: \pi = 0 \mid \tau^2 = 0 if the observed \Sigma^{-1/2} U_\pi is too large.
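As a concrete illustration of the one-sided statistic above, the following sketch computes U_\pi and a standardized version for a covariate-free null model. The variance formula here is the usual intercept-only logistic score-test variance and is our assumption for this sketch, since \Sigma is defined earlier in the article (outside this excerpt):

```python
import numpy as np

def burden_score(Y, T):
    """Standardized one-sided burden score for H02: pi = 0 given tau^2 = 0.
    Y: N x D matrix of AE indicators within one SOC; T: 0/1 treatment vector.
    Assumes a covariate-free null, so mu_tilde is the pooled mean of T and the
    score variance reduces to the intercept-only logistic score-test variance."""
    b = Y.sum(axis=1)                      # per-subject AE burden, Y 1_D
    mu = T.mean()                          # logit^{-1}(alpha0_hat) under the null
    U = b @ (T - mu)                       # U_pi = (Y 1_D)^T (T - mu_tilde)
    V = mu * (1 - mu) * ((b - b.mean()) ** 2).sum()   # null variance (assumption)
    return U / np.sqrt(V)                  # approx N(0, 1) under H0; large => reject
```

A large positive standardized score indicates that treated patients carry a higher AE burden in the SOC, matching the one-sided rejection rule above.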

Footnotes

CONFLICT OF INTEREST

The authors declare no potential conflict of interests.

SUPPORTING INFORMATION

Additional supporting information may be found online in the Supporting Information section at the end of this article.

*

See the appendix for a discussion of the one-sided test.

REFERENCES

  • 1. ICH. ICH safety guidelines; 2013. http://www.ich.org/products/guidelines/safety/article/safetyguidelines.html. Published October 29, 2013.
  • 2. Council for International Organizations of Medical Sciences. Management of Safety Information from Clinical Trials: Report of CIOMS Working Group VI. Geneva: CIOMS; 2005.
  • 3. US Department of Health and Human Services, Food and Drug Administration. Guidance for industry and investigators: safety reporting requirements for INDs and BA/BE studies; 2010.
  • 4. European Commission. Communication from the Commission: detailed guidance on the collection, verification and presentation of adverse event/reaction reports arising from clinical trials on medicinal products for human use (‘CT-3’); 2011. http://www.kme-nmec.si/Docu/ct-3.pdf. Published January 10, 2017.
  • 5. Xia HA, Crowe BJ, Schriver RC, Oster M, Hall DB. Planning and core analyses for periodic aggregate safety data reviews. Clin Trials. 2011;8(2):175–182.
  • 6. MedDRA Maintenance and Support Services Organization. Introductory Guide MedDRA Version 19.0. Chantilly, VA: MedDRA MSSO; 2016.
  • 7. Sibbald B. Rofecoxib (Vioxx) voluntarily withdrawn from market. Can Med Assoc J. 2004;171(9):1027–1028.
  • 8. Krumholz HM, Ross JS, Presler AH, Egilman DS. What have we learnt from Vioxx? BMJ. 2007;334(7585):120–123.
  • 9. Xia HA, Jiang Q. Statistical evaluation of drug safety data. Ther Innov Regul Sci. 2014;48(1):109–120.
  • 10. Mehrotra DV, Heyse JF. Use of the false discovery rate for evaluating clinical safety data. Stat Methods Med Res. 2004;13(3):227–238.
  • 11. Mehrotra DV, Adewale AJ. Flagging clinical adverse experiences: reducing false discoveries without materially compromising power for detecting true signals. Stat Med. 2012;31(18):1918–1930.
  • 12. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995;57:289–300.
  • 13. Hu JX, Zhao H, Zhou HH. False discovery rate control with groups. J Am Stat Assoc. 2010;105(491):1215–1227.
  • 14. Cui X, Churchill GA. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol. 2003;4(4):210.
  • 15. Berry SM, Berry DA. Accounting for multiplicities in assessing drug safety: a three-level hierarchical mixture model. Biometrics. 2004;60(2):418–426.
  • 16. Xia HA, Ma H, Carlin BP. Bayesian hierarchical modeling for detecting safety signals in clinical trials. J Biopharm Stat. 2011;21(5):1006–1029.
  • 17. DuMouchel W. Multivariate Bayesian logistic regression for analysis of clinical study safety issues. Stat Sci. 2012;27(3):319–339.
  • 18. Chi G, Hung H, O’Neill R. Some comments on “Adaptive trials and Bayesian statistics in drug development”. Biopharmaceutical Rep. 2001;9:7–10.
  • 19. Benjamini Y, Bogomolov M. Selective inference on multiple families of hypotheses. J R Stat Soc Ser B (Stat Methodol). 2014;76(1):297–318.
  • 20. Sun J, Zheng Y, Hsu L. A unified mixed-effects model for rare-variant association in sequencing studies. Genet Epidemiol. 2013;37(4):334–344.
  • 21. Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83(3):311–321.
  • 22. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93.
  • 23. Wang J, Zhao Q, Hastie T, Owen AB. Confounder adjustment in multiple hypothesis testing; 2015. arXiv preprint arXiv:1508.04178.
  • 24. Leek JT, Storey JD. A general framework for multiple testing dependence. Proc Natl Acad Sci. 2008;105(48):18718–18723.
  • 25. Benjamini Y, Krieger AM, Yekutieli D. Adaptive linear step-up procedures that control the false discovery rate. Biometrika. 2006;93(3):491–507.
  • 26. Yekutieli D. False discovery rate control for non-positively regression dependent test statistics. J Stat Plan Infer. 2008;138(2):405–415.
  • 27. Lunn AD, Davies SJ. A note on generating correlated binary variables. Biometrika. 1998;85(2):487–490.
  • 28. Carragher R. c212: methods for detecting safety signals in clinical trials using body-systems (system organ classes). R package version 0.90; 2017. https://CRAN.R-project.org/package=c212.
  • 29. Sun J, Zheng Y, Hsu L. MiST: mixed effects score test for continuous outcomes. R package version 1.0; 2013. https://CRAN.R-project.org/package=MiST.
  • 30. Whelan TJ, Olivotto IA, Parulekar WR, et al. Regional nodal irradiation in early-stage breast cancer. N Engl J Med. 2015;373(4):307–316.
  • 31. Ling W, Charuvastra C, Collins JF, et al. Buprenorphine maintenance treatment of opiate dependence: a multicenter, randomized clinical trial. Addiction. 1998;93(4):475–486.
  • 32. Negus SS, Picker MJ, Dykstra LA. Kappa antagonist properties of buprenorphine in non-tolerant and morphine-tolerant rats. Psychopharmacology. 1989;98(1):141–143.
  • 33. Gilbert PE, Martin WR. The effects of morphine- and nalorphine-like drugs in the nondependent, morphine-dependent and cyclazocine-dependent chronic spinal dog. J Pharmacol Exp Ther. 1976;198(1):66–82.
  • 34. Bueno L, Fioramonti J. Action of opiates on gastrointestinal function. Baillière’s Clin Gastroenterol. 1988;2(1):123–139.
  • 35. Gagnon-Bartsch J, Shem-Tov Y. The classification permutation test: a nonparametric test for equality of multivariate distributions; 2016. arXiv preprint arXiv:1611.06408.
  • 36. Heller R, Heller Y, Gorfine M. A consistent multivariate test of association based on ranks of distances. Biometrika. 2012;100(2):503–510.
  • 37. Oja H, Randles RH. Multivariate nonparametric tests. Stat Sci. 2004;19(4):598–605.
  • 38. Romano JP, Shaikh AM, Wolf M. Control of the false discovery rate under dependence using the bootstrap and subsampling. Test. 2008;17(3):417–442.
