Abstract
When comparing multiple groups in clinical trials, we are interested not only in whether any groups differ but also in where the differences lie. Such research questions lead to testing multiple individual hypotheses. To control the familywise error rate (FWER), we must either apply a multiplicity correction or use tests that control the FWER by design. In the case of time-to-event data, a Bonferroni-corrected log-rank test is commonly used. This approach has two significant drawbacks: (i) it loses power when the proportional hazards assumption is violated and (ii) the correction generally leads to a lower power, especially when the test statistics are not independent. We propose two new tests based on combined weighted log-rank tests. One is a simple multiple contrast test of weighted log-rank tests, and the other is a multiple contrast test that extends the so-called CASANOVA test, which was introduced for factorial designs. The new tests show promise of being more powerful under crossing hazards and eliminate the need for additional p-value correction. We assess the performance of our tests through extensive Monte Carlo simulation studies covering both proportional and non-proportional hazard scenarios. Finally, we apply the new and reference methods to a real-world data example. The new approaches control the FWER and show reasonable power in all scenarios. They outperform the adjusted approaches in some non-proportional settings in terms of power.
Supplementary Information
The online version contains supplementary material available at 10.1007/s10985-025-09676-9.
Keywords: Multiple contrast tests, Non-proportional hazards, Survival analysis, Weighted log-rank test
Introduction
Time-to-event or survival analysis is ubiquitous across medical research, engineering, and social sciences. Trials often involve multiple groups (treatment arms) or factorial designs, creating unique statistical challenges. The primary research question is often not merely whether any arms differ but specifically which groups differ. Thus, traditional global test procedures like ANOVA-type methods, which test null hypotheses of equal hazard ratios or cumulative hazard rate functions, are often inadequate (Konietschke et al. 2012; Ditzhaus et al. 2023). Instead, flexible multiple comparison procedures are crucial in modern data analysis. Current approaches typically employ pairwise log-rank tests with adjustments for multiplicity (e.g., Bonferroni correction) (Logan et al. 2005), but these methods can lack efficiency due to restrictive assumptions about the correlation structure of test statistics (Gao et al. 2008; Gao and Alvo 2008). In recent years, many researchers have developed multiple contrast test procedures (MCTPs) along with simultaneous confidence intervals (SCIs), usually conducted as maximum tests. These procedures are valid for arbitrary correlations of the test statistics and use the correlation within the multiplicity adjustment for various endpoints (means, proportions, Mann–Whitney effects) (Bretz et al. 2001; Schaarschmidt et al. 2009; Hasler and Hothorn 2008; Konietschke et al. 2013; Blanche et al. 2022). Munko et al. (2024) introduced restricted mean survival time (RMST)-based multiple contrast tests for time-to-event data. Since the RMST should not be employed under crossing hazards (Dormuth et al. 2022, 2023), we aim to close this gap and introduce a powerful and flexible MCTP for analyzing survival data with crossing hazards.
The log-rank test is one of the most prominent test procedures in survival analysis. The method is well known to be optimal when the proportional hazards (PH) assumption is met, but it loses power substantially otherwise (Dormuth et al. 2023). Even though the problem is fairly well known, many investigators of (clinical) trials still ignore the issue and publish findings based on log-rank tests in leading peer-reviewed journals, even when the assumption is violated (Kristiansen 2012; Trinquart et al. 2016; Dormuth et al. 2023). For the analysis of two independent samples, weighted log-rank tests and their combinations are a useful alternative to the classical log-rank test and are beneficial in non-proportional hazards models (Andersen et al. 1993; Fleming and Harrington 1991; Brendel et al. 2014; Ditzhaus and Friedrich 2020). Ditzhaus and Friedrich (2020) propose a Wald-type test that combines multiple weight functions within a single multivariate test. Which weight function to choose depends on the alternative of interest and cannot be recommended in a general way. However, the test does not provide information on which weight function appears most powerful. For the analysis of more than two samples and factorial designs, Ditzhaus et al. (2023) extended these procedures to the Cumulative Aalen Survival Analysis-of-Variance (CASANOVA) method. In principle, these are global ANOVA-based tests (quadratic forms) and can be used to estimate and test main and interaction effects in general factorial designs. Estimating and testing user-specific contrasts is not possible with them, which limits their application in statistical practice. To overcome these shortcomings, we propose a novel flexible MCTP. Extensive simulation studies indicate that the test is more powerful under non-proportional hazards and eliminates the need for additional p-value correction. The remainder of the paper is organized as follows.
The second section introduces the main statistical methods employed in the analyses. The third section describes the simulation setup and the corresponding results. The following section applies the methods of interest to a real-world data example. The conclusions are drawn in section five, together with future research questions.
Set up
Multiple contrast problems arise in many research questions related to time-to-event endpoints. Applying separate tests without adjusting for multiple testing increases the likelihood of false discoveries and inflates error rates. In the following, we present different well-established statistical methods for an underlying multiple contrast problem with time-to-event endpoints, as well as our newly developed method based on a combination of multi-directional log-rank tests and the concept of maximum tests.
Statistical model
First, we define the underlying statistical model. We consider a study design involving $k$ groups (treatment arms) of $n_j$, $j = 1, \ldots, k$, independent subjects, each with time-to-event data $T_{ji}$ and right-censoring time $C_{ji}$. The statistical model considered here can be summarized by mutually independent positive random variables

$$T_{ji} \sim F_j, \quad C_{ji} \sim G_j, \quad j = 1, \ldots, k, \; i = 1, \ldots, n_j,$$

where $F_j$ and $G_j$ are continuous distribution functions. Furthermore, let $X_{ji} = \min(T_{ji}, C_{ji})$ denote the observed time and $\delta_{ji} = \mathbb{1}\{T_{ji} \le C_{ji}\}$ the censoring status with $\mathbb{1}\{\cdot\}$ being the indicator function. The statistical model considered here does not entail any parameters but rather the survival distributions that could be used to define reasonable treatment effects. The cumulative hazard rate function for group j is defined by

$$\Lambda_j(t) = \int_0^t \lambda_j(s) \, ds, \qquad (1)$$

with $\lambda_j$ the hazard rate of group j.
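We further assume non-vanishing groups, that is, $n_j / n \to \kappa_j \in (0, 1)$ with $n = \sum_{j=1}^{k} n_j$ as $n \to \infty$, and we exclude the case of only censored values within one group by assuming that $F_j(\tau) > 0$ and $G_j(\tau) < 1$ for all $j = 1, \ldots, k$ and some $\tau > 0$.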
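To make the data structure concrete, the following minimal sketch simulates one group of right-censored observations and derives the observed time (minimum of event and censoring time) and the censoring status (indicator that the event occurred first). The exponential event-time distribution, the uniform censoring range, and the function name `simulate_group` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def simulate_group(n, event_scale, censor_upper, rng):
    """Simulate n right-censored observations for one group.

    Assumptions (illustrative only): exponential event times and
    uniform censoring times on (0, censor_upper).
    """
    t = rng.exponential(scale=event_scale, size=n)   # latent event times T
    c = rng.uniform(0.0, censor_upper, size=n)       # censoring times C
    x = np.minimum(t, c)                             # observed time X = min(T, C)
    delta = (t <= c).astype(int)                     # status: 1 = event, 0 = censored
    return x, delta

rng = np.random.default_rng(1)
x, delta = simulate_group(n=100, event_scale=1.0, censor_upper=3.0, rng=rng)
censoring_rate = 1.0 - delta.mean()                  # share of censored subjects
```

Only the pairs of observed time and status enter the test statistics discussed below; the latent event and censoring times are never observed jointly.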
Multiple null hypotheses
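The cumulative hazard rate function of treatment arm j, denoted $\Lambda_j(t)$, summarizes the total accumulated risk of experiencing the event up to time t. No difference (i.e., no effect) between treatment arms $j_1$ and $j_2$ with $j_1 \neq j_2$ corresponds to $\Lambda_{j_1}(t) = \Lambda_{j_2}(t)$ for all t, or, equivalently, $\Lambda_{j_1} - \Lambda_{j_2} \equiv 0$. In the several-sample problem with $k$ groups, let $\mathbf{C} \in \mathbb{R}^{q \times k}$ be a contrast matrix satisfying $\mathbf{C} \mathbf{1}_k = \mathbf{0}_q$ with $\mathbf{1}_k$ and $\mathbf{0}_q$ denoting vectors of ones and zeros, respectively. We denote the entries of $\mathbf{C}$ as $c_{\ell j}$. For ease of presentation, we describe the pairwise comparisons only. Here, the most prominent matrices are the ones of Dunnett- and Tukey-type. The non-zero entries of each row are composed of a single $-1$ and 1, indicating the two-sample comparison of interest. We define the corresponding index sets $I_D = \{(1, j) : j = 2, \ldots, k\}$ and $I_T = \{(j_1, j_2) : 1 \le j_1 < j_2 \le k\}$. In the following, we will indicate the position in the matrix or vector by the corresponding indices $j_1$ and $j_2$, for example, $\mathbf{c}_{(j_1, j_2)}$ for the row of $\mathbf{C}$ that compares groups $j_1$ and $j_2$.

The hypotheses we seek to infer are expressed in relation to the cumulative hazard rate functions as follows:

$$H_0^{(j_1, j_2)} : \mathbf{c}_{(j_1, j_2)}^{\top} \boldsymbol{\Lambda} = 0, \quad (j_1, j_2) \in I,$$

with $\boldsymbol{\Lambda} = (\Lambda_1, \ldots, \Lambda_k)^{\top}$, $\mathbf{c}_{(j_1, j_2)}^{\top}$ denoting the transposed vector of $\mathbf{c}_{(j_1, j_2)}$, and I being either $I_D$ or $I_T$. In general, the contrast matrix selection depends on the specific question of interest underlying the analysis.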
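As an illustration of the two contrast types, the sketch below builds Dunnett-type (first group versus each other group) and Tukey-type (all pairwise) contrast matrices and checks that every row sums to zero. The function names and the sign convention (−1 for the first group of a pair, +1 for the second) are assumptions made for this example.

```python
import numpy as np
from itertools import combinations

def dunnett_contrasts(k):
    """Many-to-one contrasts: group 1 (column 0) versus groups 2..k."""
    C = np.zeros((k - 1, k))
    C[:, 0] = -1.0                                   # reference group
    C[np.arange(k - 1), np.arange(1, k)] = 1.0       # one comparison group per row
    return C

def tukey_contrasts(k):
    """All-pairs contrasts: one row per pair (j1, j2) with j1 < j2."""
    pairs = list(combinations(range(k), 2))
    C = np.zeros((len(pairs), k))
    for row, (j1, j2) in enumerate(pairs):
        C[row, j1] = -1.0
        C[row, j2] = 1.0
    return C

C_dunnett = dunnett_contrasts(4)   # 3 contrasts for 4 groups
C_tukey = tukey_contrasts(4)       # 6 contrasts for 4 groups

# Contrast property: each row sums to zero.
assert np.allclose(C_dunnett.sum(axis=1), 0.0)
assert np.allclose(C_tukey.sum(axis=1), 0.0)
```

With four groups this yields three Dunnett-type and six Tukey-type comparisons; with seven groups the Tukey-type matrix has 21 rows.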
Statistical tests
Adjusted log-rank
As a reference method, we consider the Bonferroni-adjusted log-rank test. To this end, we define the Bonferroni-adjusted significance level $\alpha_{\mathrm{Bonf}} = \alpha / q$, where $\alpha$ is the original significance level and q is the number of comparisons. The Bonferroni adjustment for multiple comparisons in a survival setting is a standard procedure in clinical settings, as discussed in Logan et al. (2005). The authors have provided a comprehensive description and suggested various methods for adjusting for the number of comparisons.
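We define the weighted log-rank test as a generalization of the classical log-rank test. To this end, we employ the conventional counting process notation. Let $N_j(t) = \sum_{i=1}^{n_j} \mathbb{1}\{X_{ji} \le t, \delta_{ji} = 1\}$ represent the cumulative number of observed events within group j up to time t with $j = 1, \ldots, k$, where $X_{ji}$ and $\delta_{ji}$ denote the observed time and the censoring status of subject i in group j. Furthermore, we introduce $Y_j(t) = \sum_{i=1}^{n_j} \mathbb{1}\{X_{ji} \ge t\}$, which denotes the number of individuals at risk just before time t in group j. These counting processes enable us to define the Nelson–Aalen estimator for $\Lambda_j$ as $\hat{\Lambda}_j(t) = \int_0^t dN_j(s) / Y_j(s)$ for $j = 1, \ldots, k$ and $t \ge 0$. Finally, consider $n_{(j_1, j_2)} = n_{j_1} + n_{j_2}$ to be the pooled sample size over the two groups of interest.

Then, the weighted log-rank statistic for testing the local null hypothesis $H_0^{(j_1, j_2)} : \Lambda_{j_1} = \Lambda_{j_2}$ can be defined as (Andersen et al. 1993):

$$T_{(j_1, j_2)}(w) = \sqrt{\frac{n_{(j_1, j_2)}}{n_{j_1} n_{j_2}}} \int_0^{\infty} w\{\hat{F}(t-)\} \, \frac{Y_{j_1}(t) \, Y_{j_2}(t)}{Y_{j_1}(t) + Y_{j_2}(t)} \left[ d\hat{\Lambda}_{j_1}(t) - d\hat{\Lambda}_{j_2}(t) \right].$$

Here, $\hat{F}(t-)$ represents the left-continuous version of the estimator $\hat{F} = 1 - \hat{S}$ with $\hat{S}$ the Kaplan–Meier estimator based on the pooled sample, and w is a continuous weight function on $[0, 1]$. Fleming and Harrington (1991) examined a specific subclass of weights w given by $w(x) = x^{r} (1 - x)^{g}$ with $r, g \ge 0$. For instance, when $r = g = 0$, the log-rank test is obtained. We derive the individual p-values for the tests from the $\chi^2_1$ distribution and compare them to the Bonferroni-adjusted significance level $\alpha / q$. To make a global statement, we compare the minimal p-value among all local tests to the adjusted significance level.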
For practical implementation, we utilize the R package survival and its function survdiff (Therneau 2023).
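The paper relies on `survdiff` in R; purely to illustrate the mechanics, the following Python sketch computes a two-sample weighted log-rank chi-square statistic with the usual hypergeometric variance and compares the p-value with a Bonferroni-adjusted level. The function name `weighted_logrank`, the toy data, and the choice of q = 6 comparisons are assumptions for this example, and the weight is evaluated at the left-continuous pooled Kaplan–Meier estimate of the distribution function.

```python
import math
import numpy as np

def weighted_logrank(x1, d1, x2, d2, weight=lambda f: 1.0):
    """Two-sample weighted log-rank test; returns (chi-square, p-value).

    x*: observed times, d*: status (1 = event). weight maps the pooled
    estimate F(t-) in [0, 1] to a real weight; the default is the log-rank weight.
    """
    x = np.concatenate([x1, x2]).astype(float)
    d = np.concatenate([d1, d2]).astype(int)
    g = np.concatenate([np.zeros(len(x1), dtype=int), np.ones(len(x2), dtype=int)])
    num, var, surv = 0.0, 0.0, 1.0                   # surv holds pooled KM S(t-)
    for t in np.unique(x[d == 1]):                   # distinct event times
        y1 = np.sum((x >= t) & (g == 0))             # at risk in group 1
        y2 = np.sum((x >= t) & (g == 1))             # at risk in group 2
        y = y1 + y2
        dn = np.sum((x == t) & (d == 1))             # pooled events at t
        dn1 = np.sum((x == t) & (d == 1) & (g == 0)) # events in group 1 at t
        w = weight(1.0 - surv)                       # weight at F(t-)
        num += w * (dn1 - y1 * dn / y)               # observed minus expected
        if y > 1:
            var += w ** 2 * (y1 * y2 / y ** 2) * dn * (y - dn) / (y - 1)
        surv *= 1.0 - dn / y                         # update pooled Kaplan-Meier
    chi2 = num ** 2 / var
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))    # upper tail of chi-square(1)

# Toy data (all events observed): group 2 clearly survives longer.
x1, d1 = np.arange(1, 11), np.ones(10, dtype=int)
x2, d2 = np.arange(11, 21), np.ones(10, dtype=int)

alpha, q = 0.05, 6                                   # q = 6 is an assumed number of comparisons
alpha_bonf = alpha / q
chi2_lr, p_lr = weighted_logrank(x1, d1, x2, d2)
chi2_cr, p_cr = weighted_logrank(x1, d1, x2, d2, weight=lambda f: 1.0 - 2.0 * f)
reject_lr = p_lr < alpha_bonf
```

The second call uses a weight that changes sign at the pooled median, which is the kind of crossing weight discussed in the next subsection.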
Adjusted mdir
In the context of our specific objectives, we are interested in more robust testing procedures towards multiple alternatives. For two-group comparisons, the multi-directional log-rank test has been proposed as a combination procedure of different weighted log-rank tests (Brendel et al. 2014; Ditzhaus and Friedrich 2020). The test assumes the equality of survival under the null hypothesis, with the choice of weights determining the alternative hypothesis. We are particularly interested in weights that intersect the x-axis, such as
$w(x) = 1 - 2x$, as they are specifically designed to address crossing hazard alternatives.
By default, the R package mdir.logrank (Ditzhaus and Friedrich 2018) implements a combination of the log-rank weight $w(x) = 1$ and this crossing weight. Dormuth et al. (2023) showed that this default set of weights seems to be robust against multiple alternatives. Nevertheless, if desired, additional weights can be combined to cover more alternative hypotheses. For the general case of m linearly independent weights $w_1, \ldots, w_m$, the local test statistic takes a studentized quadratic form:
![]() |
The entries of
are given by
![]() |
with
the pooled Nelson–Aalen estimator of groups
and
.
represents the Moore–Penrose inverse of the empirical covariance matrix of the weighted log-rank tests. For linearly independent weights
fulfilling the assumptions of Ditzhaus et al. (2023) (continuous and of bounded variation), the test statistic
can be assumed to be
distributed under the null hypothesis. Ditzhaus and Friedrich (2020) also proposed a permutation-based approach. Because the asymptotic test can show inflated type I errors, we only consider the permutation-based approach.
Again, we compare the obtained local p-values to the Bonferroni-adjusted significance level, that is, the original significance level divided by the number of comparisons q. Analogously to the adjusted log-rank test procedure, we obtain the global test decision by comparing the smallest p-value to the adjusted significance level.
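The combination step described above can be sketched as follows: given a vector of weighted log-rank statistics and an estimate of their covariance matrix, the Wald-type statistic is the quadratic form with the Moore–Penrose inverse. The function name and the numerical inputs are hypothetical, and the sketch does not reproduce the covariance estimator or the permutation procedure of mdir.logrank.

```python
import numpy as np

def wald_type_statistic(t_vec, cov):
    """Quadratic form t' cov^+ t with the Moore-Penrose inverse cov^+."""
    t_vec = np.asarray(t_vec, dtype=float)
    cov_pinv = np.linalg.pinv(np.asarray(cov, dtype=float))
    return float(t_vec @ cov_pinv @ t_vec)

# Hypothetical values for m = 2 weights (e.g., log-rank and crossing weight).
t_vec = np.array([1.0, 2.0])
s_identity = wald_type_statistic(t_vec, np.eye(2))          # uncorrelated, unit variance

# The pseudo-inverse also handles a singular covariance matrix,
# e.g., two perfectly correlated statistics.
s_singular = wald_type_statistic([1.0, 1.0], [[1.0, 1.0], [1.0, 1.0]])
```

Using the pseudo-inverse keeps the statistic well defined even when the estimated covariance matrix is not of full rank in a given sample.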
MultiWeightedLR
Knowing that maximum tests are a common approach for multiple testing problems (Konietschke et al. 2013), a straightforward extension of the weighted log-rank test is to use the maximum over them and exploit the covariance structure between the different tests. We use the same weights as for the adjusted mdir approach without combining them in a quadratic form. Instead, we consider each weighted test individually. After calculating the corresponding covariance matrix, we take the maximum of all weighted test statistics as our global maximum test statistic. Mathematically, we write:
![]() |
For the local testing problem we focus on
. Similar to the proof of Theorem 2 in Ditzhaus et al. (2023), it can be shown that the vector
is, under regularity conditions, asymptotic centered multivariate normally distributed with covariance matrix
. We thus take the equicoordinate
-quantile (Konietschke et al. 2012) of this distribution as a critical value to obtain the MultiWeightedLR test in the statistic
.
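An equicoordinate quantile of a centered multivariate normal distribution can be approximated by simulation, as in the following minimal sketch. It assumes a two-sided maximum (maximum of absolute values) and a known correlation matrix; whether the MultiWeightedLR statistic is one- or two-sided and how its covariance is estimated is not restated here, so both are assumptions of this example, as are the function name and the Monte Carlo approach.

```python
import numpy as np

def equicoordinate_quantile(corr, alpha=0.05, two_sided=True, n_draws=200_000, seed=0):
    """Monte Carlo estimate of the (1 - alpha) equicoordinate quantile.

    Returns c such that, for Z ~ N(0, corr), P(max_i |Z_i| <= c) is about 1 - alpha
    (or P(max_i Z_i <= c) if two_sided is False).
    """
    corr = np.asarray(corr, dtype=float)
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n_draws)
    stat = np.abs(z).max(axis=1) if two_sided else z.max(axis=1)
    return float(np.quantile(stat, 1.0 - alpha))

q_single = equicoordinate_quantile(np.array([[1.0]]))   # close to the usual 1.96
q_indep = equicoordinate_quantile(np.eye(3))            # three independent statistics
corr_pos = np.full((3, 3), 0.5) + 0.5 * np.eye(3)       # pairwise correlation 0.5
q_corr = equicoordinate_quantile(corr_pos)
```

Positive correlation between the statistics lowers the critical value relative to the independent case, which is the gain over a Bonferroni-type adjustment that ignores the correlation.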
multiCASANOVA
Ditzhaus et al. (2023) proposed the CASANOVA (Cumulative Aalen Survival Analysis-of-Variance) approach for general factorial designs with right-censored time-to-event data. The core idea of the method is an extension of weighted log-rank tests to the factorial design setup. To this end, they expanded the combination approach of weighted log-rank tests (mdir) for the two-sample scenario to general factorial survival designs, implemented in the R package GFDsurv (Ditzhaus et al. 2022). For further information, we refer to Ditzhaus et al. (2023).
We aim to extend the CASANOVA approach to allow the estimation and testing of user-specific contrasts in a multiple testing framework. Compared to the aforementioned approaches, the main difference is that we consider pooled quantities over all groups, not only the two groups of interest. To this end, we define a local test statistic for contrast
as
![]() |
where
represents the left-continuous version of the pooled estimator
and Y(t) is the total number of individuals at risk over all groups. As in the adjusted mdir test, we combine several weights (still for one single contrast) by considering the corresponding quadratic form given by
![]() |
where the inner matrix is defined by
![]() |
and
represents its Moore–Penrose inverse. Similar to the maximum approach within MultiWeightedLR, we now consider the maximum of these Wald-type statistics over all contrasts of interest as the global test statistic
![]() |
Note that we did not take the maximum over the different weights, as those are already incorporated within the quadratic forms.
We use the common wild bootstrap approach for counting processes in time-to-event analyses (Bluhmki et al. 2019, 2018) to approximate the limiting distribution. Therefore, we consider independent and identically distributed variables
with
and
. The wild bootstrap version of the normalized Nelson–Aalen estimator
as defined in Bluhmki et al. (2019) is then given by:
![]() |
The motivation behind
stems from the martingale representation of the normalized Nelson–Aalen estimator:
![]() |
where
indicates that the difference between both sides converges to 0 in probability, and
denotes the martingale obtained from the Doob–Meyer decomposition of
with
. The wild bootstrap Nelson–Aalen version
is thus obtained by replacing the unobservable martingales
with the observable
. As shown in Bluhmki et al. (2019), the distribution of
and the conditional distribution of
(given the data) coincide asymptotically. Assuming
, this implies that
can be used to approximate the
-null distribution of
. We thus define the wild bootstrap versions
and
of
and
, respectively, by replacing
with
.
Since counting processes are discrete, we opt for discrete distributions for the
. We focus on two common choices: (i) the Rademacher distribution (Liu 1988), and (ii) the centered Poisson distribution (Mammen 2012). This results in two different wild bootstrap quantiles depending on the distribution of choice:
the
-quantile of
given our data
. Then we obtain the global test decision by evaluating
and the local test decisions by
.
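The resampling ingredients above can be sketched in a few lines: drawing Rademacher or centered Poisson multipliers (both with mean 0 and variance 1), forming a multiplier version of the normalized Nelson–Aalen estimator for one group, and turning bootstrap replicates of the per-contrast statistics into global and local decisions via the quantile of their maximum. All function names are hypothetical, and the step that maps the bootstrapped Nelson–Aalen processes to the multiCASANOVA quadratic forms is not reproduced; the bootstrap matrix below is a placeholder.

```python
import numpy as np

def multipliers(n, kind, rng):
    """Draw n i.i.d. multipliers with mean 0 and variance 1."""
    if kind == "rademacher":
        return rng.choice(np.array([-1.0, 1.0]), size=n)
    if kind == "poisson":
        return rng.poisson(1.0, size=n) - 1.0        # centered Poisson(1)
    raise ValueError("kind must be 'rademacher' or 'poisson'")

def wild_bootstrap_nelson_aalen(x, delta, g, t_grid):
    """sqrt(n) * sum_i g_i * delta_i * 1{x_i <= t} / Y(x_i) on a time grid."""
    x = np.asarray(x, dtype=float)
    delta = np.asarray(delta, dtype=float)
    g = np.asarray(g, dtype=float)
    at_risk = np.array([(x >= xi).sum() for xi in x])   # Y(x_i)
    increments = g * delta / at_risk
    return np.sqrt(len(x)) * np.array([increments[x <= t].sum() for t in t_grid])

def max_test_decisions(observed, bootstrap, alpha=0.05):
    """Global and local decisions from the bootstrap distribution of the maximum.

    observed: shape (q,), one statistic per contrast.
    bootstrap: shape (B, q), bootstrap replicates of the same statistics.
    """
    crit = float(np.quantile(bootstrap.max(axis=1), 1.0 - alpha))
    return bool(observed.max() > crit), observed > crit, crit

rng = np.random.default_rng(0)
g_rad = multipliers(1000, "rademacher", rng)
g_poi = multipliers(100_000, "poisson", rng)

# With all multipliers equal to 1 the process reduces to sqrt(n) * Nelson-Aalen.
w_check = wild_bootstrap_nelson_aalen([1.0, 2.0, 3.0], [1, 1, 1], [1.0, 1.0, 1.0], [3.0])

# Placeholder bootstrap replicates for q = 2 contrasts.
observed = np.array([1.0, 5.0])
bootstrap = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0], [1.0, 0.0]])
global_reject, local_reject, crit = max_test_decisions(observed, bootstrap)
```

Because a single critical value from the maximum is used for every contrast, the local decisions need no further p-value adjustment.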
Simulation study
We conducted an extensive simulation study in R 4.4.0 (R Core Team 2021) to evaluate the rejection rate and the power performance of the candidate methods.
Simulation setup
We simulated data for $k = 4$
groups considering the Tukey- and Dunnett-type contrast matrices. We considered four scenarios, each with different distribution functions. Each represents a specific case of hazard relationships, such as (i) proportional hazards, (ii) non-proportional and non-crossing hazards, (iii) crossing hazards, and (iv) a mixed scenario. The specific survival functions are presented in Table 1. We set the group size for each scenario to 100; the censoring rates vary between
and
with uniform censoring. The work of Dormuth et al. (2023) indicated that the choice of censoring distribution does not have a major impact on the performance of statistical tests. Considering all possible combinations of censoring, survival distributions, and contrast matrices, we end up with a total of 4(scenarios)
$\times$ 1120 parameter combinations = 4480 different settings.
Table 1. Simulation scenarios

| Scenario | CDF | Visualization of the survival and hazard curves |
|---|---|---|
| Prop | | ![]() |
| NProp | | ![]() |
| Cross | | ![]() |
| Mix | | ![]() |
This is because we only considered the combination of different survival time distributions for the individual groups, but we did not consider the order in which they were combined. This means that
is the same combination as
. For the Tukey-type contrast matrices, we considered every possible comparison for
$k = 4$, which results in six tests. For the Dunnett-type contrast matrices, we compared the first group to all the other groups, resulting in three contrasts.
For each setting, we performed 10,000 simulation runs with 1000 resampling iterations. The global level of significance was set to 0.05 throughout.
Simulation results under the null hypothesis
Figure 1 illustrates the familywise error rate (FWER) for all survival scenarios for the different contrast matrix types. We set the
$\alpha$-level to $0.05$. The dashed lines represent the corresponding binomial precision interval, based on the 10,000 simulation runs.
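As a worked example of such a precision interval, the sketch below uses the normal approximation to the binomial distribution for a nominal level of 0.05 and 10,000 simulation runs. The 95% coverage (standard normal quantile of about 1.96) and the function name are assumptions, since the exact construction of the interval is not spelled out in the text.

```python
import math

def binomial_precision_interval(level=0.05, n_sim=10_000, z=1.959963984540054):
    """Normal-approximation interval for an empirical rejection rate.

    level: nominal rejection probability; n_sim: number of simulation runs;
    z: standard normal quantile (assumed 95% two-sided coverage).
    """
    half_width = z * math.sqrt(level * (1.0 - level) / n_sim)
    return level - half_width, level + half_width

lower, upper = binomial_precision_interval()
# lower is about 0.0457 and upper is about 0.0543
```

Empirical rejection rates inside this band are compatible with exact control at the nominal level, given the Monte Carlo error of the simulation.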
Fig. 1.
FWER under the global null hypothesis for all settings for the Dunnett-type (left) and Tukey-type (right) contrast matrices. The dashed lines represent the borders of the binomial precision interval
For both contrast matrices, almost all methods control the FWER well. The adjusted mdir is the only test that is slightly liberal when comparing the median to the global $\alpha$-level of $0.05$ for the Dunnett-type matrix. The new multiple-testing approaches are more conservative than the adjusted approaches, especially for the Tukey-type contrast matrices, with the multiWeightedLR being the most conservative.
Simulation results under the alternative hypothesis
We focused on the local decisions under the alternative hypothesis to assess the power. Figures 2 and 3 illustrate the rejection rates when different survival distributions are present. Each figure consists of four subfigures, one for each scenario. It should be noted that a higher number of tests decreases the power for each local hypothesis. This property is visible in the plots, which show generally higher power for the Dunnett-type than for the Tukey-type contrasts. Apart from that, the tests behave similarly for both contrast matrix types. The adjusted log-rank test is the most powerful in the setting with proportional hazards, while the other tests perform equally well among themselves. Under non-proportional but non-crossing hazards, all tests have high power, with the new approaches yielding slightly lower variability. In the crossing scenario, the log-rank test loses power drastically because the PH assumption is violated. The four approaches designed for non-proportional hazards have high power, with the multiWeightedLR being slightly less powerful than the other three tests. In the mixed setting, the adjusted mdir performs best in terms of power, followed by the three methods introduced in this paper. The log-rank test has the highest variability and the lowest median power.
Fig. 2.
Local power over all tests under the alternative for Dunnett-type contrasts for all four scenarios (each boxplot contains 1136 data points)
Fig. 3.
Local power over all tests under the alternative for Tukey-type contrasts for all four scenarios (each boxplot contains 2016 data points)
The rejection rates for the local tests with no difference in survival are depicted in Figures S1 and S2 in the Supplemental Material. Overall, the rejection rates among the approaches are similar, with lower rejection rates for the Tukey-type contrast matrices. Additionally, the power for each local test is provided in the tables in the Supplemental Material.
The adjusted mdir test performs best for two of the four settings considered. Considering that it showed slightly liberal behavior under the null hypothesis, these results should be interpreted carefully. The methods introduced in this publication yield robust results regarding power among the different scenarios. The adjusted log-rank test loses power dramatically in the scenario with crossing hazards.
In the Supplement (Figures S3–S5), we present an additional analysis of the behavior of the different tests regarding the FWER and power for smaller sample sizes. The results indicate that the multiWeightedLR approach exhibits an inflated FWER, likely due to the normal approximation. In contrast, both multiCASANOVA bootstrap approaches maintain control of the FWER and consistently deliver good results in terms of power. The adjusted LR and mdir tests still control the FWER but show increased variability in terms of power.
Illustrative data example
To illustrate the novel approaches on real-world data, we used publicly available data from the CoMMpass study (dbGaP accession: phs000748.v4.p3). This study is designed to associate clinical outcome with genetic profiles and contains longitudinal clinical and molecular data from multiple myeloma (MM) patients. Based on the transcriptional profile and the expression level of biologically relevant core machinery that plays a vital role in the stress response (Heynen et al. 2023), we clustered MM patients into seven groups. Figure 4 shows the Kaplan–Meier curves of the seven different groups. We assume that we are interested in comparing every group with one another and consider a Tukey design. The significance level was set to
$\alpha = 0.05$ with a total of 21 tests. The corrected significance level is thus $\alpha / 21 \approx 0.0024$.
Fig. 4.
Kaplan–Meier plot of the seven treatment groups of patients with multiple myeloma (MM)
By examining the survival curves, we anticipate that the methods will identify a significant overall difference among the groups. Specifically, we expect group 2 to differ from the other groups. However, we do not expect to see any differences among groups 3, 4, and 5. We applied all testing procedures described in this paper to investigate these premises. For all approaches combining multiple weighted log-rank tests, we included the
log-rank weight $w(x) = 1$ and the crossing weight $w(x) = 1 - 2x$. We set the number of resampling iterations to 1000 for all resampling-based approaches.
The detailed results are listed in the Supplemental Material Table S1. The adjusted log-rank and mdir tests detected six significant differences between groups, while the three new methods detected only five. The detected differences are consistent among the methods. All tests found the pairwise differences between group two and groups three, four, five, and seven, as well as between groups one and four, to be significant. A significant result for the comparison between groups two and six was only found by the adjusted LR test and the adjusted mdir test.
All tests also rejected the global null hypothesis of no difference between the groups. In summary, in this real-world application, the results of the new methods are consistent with the results of the adjusted log-rank test.
Discussion
We explored various statistical methods for addressing multiple contrast problems with time-to-event endpoints, including traditional and newly developed approaches. To assess the approaches' performance, we compared the familywise error rate (FWER) control and the power of these methods under different survival scenarios. The results of our simulation study and real-world data application provide insights into the strengths and limitations of each approach.
Most methods maintain adequate control of the FWER. The adjusted mdir test exhibited a slightly liberal behavior, particularly for Dunnett-type contrasts. This deviation suggests that while the adjusted mdir test might be powerful, it occasionally exceeds the acceptable error rate, which warrants caution in its interpretation under null conditions. On the other hand, the multiWeightedLR and multiCASANOVA methods were generally more conservative, particularly for Tukey-type contrast matrices. This conservativeness could imply a lower risk of Type I errors but may come at the cost of reduced statistical power.
Under alternative hypotheses, the power analysis revealed notable differences in the tests’ performance depending on the survival scenario. For proportional hazards, the adjusted log-rank test showed the highest power, outperforming the other methods. Under non-proportional and non-crossing hazards, we could observe high power among all tests, with the new approaches showing slightly lower variability. The robustness of these methods suggests that they are suitable choices when proportional hazards are not guaranteed. The log-rank test’s power decreased drastically in the specific case of crossing hazards. In contrast, the four approaches specifically designed for non-proportional hazards (adjusted mdir, multiWeightedLR, and the two multiCASANOVA variants) maintained high power, confirming their utility in these settings. Finally, the adjusted mdir test outperformed other methods in the mixed scenario, achieving the best power performance. Although slightly less powerful, the new methods provided more consistent results across different scenarios, highlighting their robustness.
The results suggest potential areas for further methodological improvements. While the Bonferroni correction is widely used for controlling type I error rates, its conservative nature may result in lower power, particularly in settings with many comparisons. More sophisticated adjustment techniques, like the Holm procedure, could better balance error rate control and power, as discussed in previous studies.
Additionally, evaluating the performance of these methods in unbalanced designs could provide a more comprehensive understanding of their behavior in practical applications. This could be particularly interesting since Munko et al. (2024) showed that such conditions could boost the power of specific local designs.
In the illustrative data example involving patients with multiple myeloma, the new methods produced results consistent with those obtained from the adjusted log-rank tests. Although the novel approaches identified fewer significant differences than the traditional methods, their findings were largely aligned, underscoring their reliability in practical scenarios. This consistency and the conservative behavior in terms of FWER control suggest that the new methods offer a robust alternative for analyzing time-to-event data in clinical studies. Future research could include more efficient exploitation of the FWER for the new approaches, e.g., by incorporating closed testing approaches. In general, it is essential to critically assess whether a higher number of statistically significant results truly reflects a superior testing approach, as statistical significance does not inherently equate to clinical relevance.
Acknowledgements
We sincerely thank the reviewers for their insightful feedback and thoughtful recommendations, which have significantly strengthened this work.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Footnotes
Deceased: Marc Ditzhaus.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- Andersen PK, Borgan O, Gill RD, Keiding N (1993) Statistical models based on counting processes. Springer, Berlin
- Blanche P, Dartigues JF, Riou J (2022) A closed max-t test for multiple comparisons of areas under the ROC curve. Biometrics 78(1):352–363
- Bluhmki T, Dobler D, Beyersmann J, Pauly M (2019) The wild bootstrap for multivariate Nelson–Aalen estimators. Lifetime Data Anal 25:97–127
- Bluhmki T, Schmoor C, Dobler D, Pauly M, Finke J, Schumacher M, Beyersmann J (2018) A wild bootstrap approach for the Aalen–Johansen estimator. Biometrics 74(3):977–985
- Brendel M, Janssen A, Mayer CD, Pauly M (2014) Weighted logrank permutation tests for randomly right censored life science data. Scand J Stat 41(3):742–761. 10.1111/sjos.12059
- Bretz F, Genz A, Hothorn LA (2001) On the numerical availability of multiple comparison procedures. Biometrical J 43(5):645–656
- Ditzhaus M, Dobler D, Pauly M, Steinhauer P, Munko M (2022) GFDsurv: tests for survival data in general factorial designs. R package version 0.1.1. https://CRAN.R-project.org/package=GFDsurv
- Ditzhaus M, Friedrich S (2018) mdir.logrank: multiple-direction logrank test. R package version 0.0.4. https://CRAN.R-project.org/package=mdir.logrank
- Ditzhaus M, Friedrich S (2020) More powerful logrank permutation tests for two-sample survival data. J Stat Comput Simul 90(12):2209–2227
- Ditzhaus M, Genuneit J, Janssen A, Pauly M (2023) CASANOVA: permutation inference in factorial survival designs. Biometrics 79(1):203–215
- Dormuth I, Liu T, Xu J, Pauly M, Ditzhaus M (2023) A comparative study to alternatives to the log-rank test. Contemp Clin Trials 128:107165
- Dormuth I, Liu T, Xu J, Yu M, Pauly M, Ditzhaus M (2022) Which test for crossing survival curves? A user's guideline. BMC Med Res Methodol 22(1):1–7
- Fleming TR, Harrington DP (1991) Counting processes and survival analysis, vol 625. Wiley, London
- Gao X, Alvo M (2008) Nonparametric multiple comparison procedures for unbalanced two-way layouts. J Stat Plan Inference 138(12):3674–3686
- Gao X, Alvo M, Chen J, Li G (2008) Nonparametric multiple comparison procedures for unbalanced one-way factorial designs. J Stat Plan Inference 138(6):2574–2591
- Hasler M, Hothorn LA (2008) Multiple contrast tests in the presence of heteroscedasticity. Biometrical J Math Methods Biosci 50(5):793–800
- Heynen GJ, Baumgartner F, Heider M, Patra U, Holz M, Braune J, Kaiser M, Schäffer I, Bamopoulos SA, Ramberger E et al (2023) SUMOylation inhibition overcomes proteasome inhibitor resistance in multiple myeloma. Blood Adv 7(4):469–481
- Konietschke F, Bösiger S, Brunner E, Hothorn LA (2013) Are multiple contrast tests superior to the ANOVA? Int J Biostat 9(1). 10.1515/ijb-2012-0020
- Konietschke F, Hothorn LA, Brunner E (2012) Rank-based multiple test procedures and simultaneous confidence intervals. Electron J Stat 6. 10.1214/12-EJS691
- Kristiansen IS (2012) PRM39 survival curve convergences and crossing: a threat to validity of meta-analysis? Value Health 15(7):A652
- Liu RY (1988) Bootstrap procedures under some non-iid models. Ann Stat 16(4):1696–1708
- Logan BR, Wang H, Zhang MJ (2005) Pairwise multiple comparison adjustment in survival analysis. Stat Med 24(16):2509–2523
- Mammen E (2012) When does bootstrap work?: asymptotic results and simulations, vol 77. Springer, Berlin
- Munko M, Ditzhaus M, Dobler D, Genuneit J (2024) RMST-based multiple contrast tests in general factorial designs. Stat Med 43(10):1849–1866. 10.1002/sim.10017
- R Core Team (2021) R: a language and environment for statistical computing. https://www.R-project.org/
- Schaarschmidt F, Biesheuvel E, Hothorn LA (2009) Asymptotic simultaneous confidence intervals for many-to-one comparisons of binary proportions in randomized clinical trials. J Biopharm Stat 19(2):292–310
- Therneau TM (2023) A package for survival analysis in R. R package version 3.5-5. https://CRAN.R-project.org/package=survival
- Trinquart L, Jacot J, Conner SC, Porcher R (2016) Comparison of treatment effects measured by the hazard ratio and by the ratio of restricted mean survival times in oncology randomized controlled trials. J Clin Oncol 34(15):1813–1819