2026 Jan 14;32(1):8. doi: 10.1007/s10985-025-09676-9

Beyond Bonferroni: new multiple contrast tests for time-to-event data under non-proportional hazards

Ina Dormuth 1, Carolin Herrmann 2, Frank Konietschke 3, Markus Pauly 1,4, Matthias Wirth 5,6, Marc Ditzhaus 7
PMCID: PMC12804333  PMID: 41533205

Abstract

When comparing multiple groups in clinical trials, we are interested not only in whether any groups differ but also in where the differences lie. Such research questions lead to testing multiple individual hypotheses. To control the familywise error rate (FWER), we must apply a multiplicity correction or use tests that control the FWER by design. In the case of time-to-event data, a Bonferroni-corrected log-rank test is commonly used. This approach has two significant drawbacks: (i) it loses power when the proportional hazards assumption is violated and (ii) the correction generally leads to a lower power, especially when the test statistics are not independent. We propose two new tests based on combined weighted log-rank tests. One is a simple multiple contrast test of weighted log-rank tests, and the other is a new multiple contrast test that extends the so-called CASANOVA test, which was introduced for factorial designs. Our test shows promise of being more powerful under crossing hazards and eliminates the need for additional p-value correction. We assess the performance of our tests through extensive Monte Carlo simulation studies covering both proportional and non-proportional hazard scenarios. Finally, we apply the new and reference methods to a real-world data example. The new approaches control the FWER and show reasonable power in all scenarios. They outperform the adjusted approaches in some non-proportional settings in terms of power.

Supplementary Information

The online version contains supplementary material available at 10.1007/s10985-025-09676-9.

Keywords: Multiple contrast tests, Non-proportional hazards, Survival analysis, Weighted log-rank test

Introduction

Time-to-event or survival analysis is ubiquitous across medical research, engineering, and social sciences. Trials often involve multiple groups (treatment arms) or factorial designs, creating unique statistical challenges. The primary research question is often not merely whether any arms differ but specifically which groups differ. Thus, traditional global test procedures like ANOVA-type methods, which test null hypotheses of equal hazard ratios or cumulative hazard rate functions, are often inadequate (Konietschke et al. 2012; Ditzhaus et al. 2023). Instead, flexible multiple comparison procedures are crucial in modern data analysis. Current approaches typically employ multiple pairwise log-rank tests with adjustments for multiplicity (e.g., Bonferroni correction) (Logan et al. 2005), but these methods can lack efficiency due to restrictive assumptions about the correlation structure of test statistics (Gao et al. 2008; Gao and Alvo 2008). In recent years, many researchers have developed multiple contrast test procedures (MCTPs) along with simultaneous confidence intervals (SCIs), usually conducted as maximum tests. These procedures are valid for arbitrary correlations of the test statistics and use the correlation within the multiplicity adjustment for various endpoints (means, proportions, Mann–Whitney effects) (Bretz et al. 2001; Schaarschmidt et al. 2009; Hasler and Hothorn 2008; Konietschke et al. 2013; Blanche et al. 2022). Munko et al. (2024) introduced restricted mean survival time (RMST)-based multiple contrast tests for time-to-event data. Since the RMST should not be employed under crossing hazards (Dormuth et al. 2022, 2023), we aim to close this gap and introduce a powerful and flexible MCTP for analyzing survival data with crossing hazards.

The log-rank test is one of the most prominent test procedures in survival analysis. The method is well known to be optimal when the proportional hazards (PH) assumption is met, but it loses considerable power otherwise (Dormuth et al. 2023). Even though the problem is fairly well known, many investigators of (clinical) trials still ignore the issue and publish findings based on log-rank tests in leading peer-reviewed journals, even when the assumption is violated (Kristiansen 2012; Trinquart et al. 2016; Dormuth et al. 2023). For the analysis of two independent samples, weighted log-rank tests and their combinations are a good alternative to the classical log-rank test and are beneficial in non-proportional hazards models (Andersen et al. 1993; Fleming and Harrington 1991; Brendel et al. 2014; Ditzhaus and Friedrich 2020). Ditzhaus and Friedrich (2020) propose a Wald-type test that combines multiple weight functions within a single multivariate test. Which weight functions to choose depends on the alternative of interest, so no general recommendation can be given. Moreover, the test does not provide information on which weight function appears most powerful. For the analysis of more than two samples and factorial designs, Ditzhaus et al. (2023) extended these procedures to the Cumulative Aalen Survival Analysis-of-Variance (CASANOVA) method. In principle, these are global ANOVA-based tests (quadratic forms) that can be used to estimate and test main and interaction effects in general factorial designs. However, they do not allow estimating and testing user-specific contrasts, which limits their application in statistical practice. To overcome these shortcomings, we propose a novel flexible MCTP. Extensive simulation studies indicate that the test is more powerful under non-proportional hazards and eliminates the need for additional p-value correction.

The remainder of the paper is organized as follows. The second section introduces the main statistical methods employed in the analyses. The third section describes the simulation setup and the corresponding results. The following section applies the methods of interest to a real-world data example. The conclusions are drawn in section five, together with future research questions.

Set up

Multiple contrast problems arise in many research questions related to time-to-event endpoints. Applying separate tests without adjusting for multiple testing increases the likelihood of false discoveries and inflates error rates. In the following, we present different well-established statistical methods for an underlying multiple contrast problem with time-to-event endpoints, as well as our newly developed method based on a combination of multi-directional log-rank tests and the concept of maximum tests.

Statistical model

First, we define the underlying statistical model. We consider a study design involving Inline graphic groups (treatment arms) of Inline graphic independent subjects, each with time-to-event data Inline graphic and right-censoring time Inline graphic. The statistical model considered here can be summarized by mutually independent positive random variables Inline graphic where Inline graphic and Inline graphic are both continuous distribution functions, respectively. Furthermore, let Inline graphic denote the observed time and Inline graphic the censoring status with Inline graphic being the indicator function. The statistical model considered here does not entail any parameters but rather the survival distributions that could be used to define reasonable treatment effects. The cumulative hazard rate function for group j is defined by

graphic file with name d33e386.gif 1

with Inline graphic the hazard rate of group j.

We further assume non-zero sized groups by Inline graphic with Inline graphic as Inline graphic and we exclude the case of only censored values within one group by assuming that Inline graphic and Inline graphic Inline graphic and some Inline graphic.
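In standard survival-analysis notation, the model described above can be sketched as follows. All symbols (k for the number of groups, n_j for the group sizes, T, C, F_j, G_j, X, δ, A_j, α_j) are conventional choices assumed here for illustration and may differ from the authors' notation. The last identity holds for continuous distributions for which the hazard rate exists.

```latex
% Assumed notation, not taken from the original formulas
T_{ji} \sim F_j, \qquad C_{ji} \sim G_j, \qquad j = 1, \dots, k, \quad i = 1, \dots, n_j,

X_{ji} = \min(T_{ji}, C_{ji}), \qquad \delta_{ji} = \mathbf{1}\{X_{ji} = T_{ji}\},

A_j(t) = \int_0^t \alpha_j(s) \, \mathrm{d}s = -\log\bigl(1 - F_j(t)\bigr).
```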

Multiple null hypotheses

The cumulative hazard rate function of treatment arm j, denoted Inline graphic, summarizes the total risk of experiencing the event accumulated up to time t. No difference (i.e., no effect) between treatment arms Inline graphic and Inline graphic with Inline graphic corresponds to Inline graphic for all t, or, equivalently, Inline graphic. In the several-sample problem, let Inline graphic be a contrast matrix satisfying Inline graphic with Inline graphic and Inline graphic denoting vectors of ones and zeros, respectively. We denote the entries of Inline graphic as Inline graphic. For ease of presentation, we describe pairwise comparisons only. Here, the most prominent matrices are those of Dunnett and Tukey type. Each row consists of a single Inline graphic and a single 1, indicating the two-sample comparison of interest, with zeros elsewhere. We define the corresponding index sets Inline graphic and Inline graphic. In the following, we will indicate the position in the matrix or vector by the corresponding indices Inline graphic and Inline graphic, for example, Inline graphic for Inline graphic and Inline graphic.

The hypotheses we seek to infer are expressed in relation to the cumulative hazard rate functions as follows:

graphic file with name d33e527.gif

with Inline graphic denoting the transposed vector of Inline graphic and I being either Inline graphic or Inline graphic. In general, the contrast matrix selection depends on the specific question of interest underlying the analysis.
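The two contrast matrix types for pairwise comparisons can be built mechanically. The following Python sketch is only an illustration: the function names are hypothetical, groups are indexed from 0, and the sign convention (−1 for the first group of a pair, +1 for the second) is an assumption.

```python
from itertools import combinations


def dunnett_contrasts(k):
    """Many-to-one comparisons: group 0 (the reference) versus every other group."""
    rows = []
    for j in range(1, k):
        row = [0] * k
        row[0], row[j] = -1, 1  # assumed sign convention
        rows.append(row)
    return rows


def tukey_contrasts(k):
    """All-pairs comparisons: one row per unordered pair of groups."""
    rows = []
    for j1, j2 in combinations(range(k), 2):
        row = [0] * k
        row[j1], row[j2] = -1, 1  # assumed sign convention
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in tukey_contrasts(4):
        print(row)
```

Every row sums to zero, so multiplying the matrix by a vector of ones gives the zero vector. For four groups this yields three Dunnett-type and six Tukey-type contrasts.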

Statistical tests

Adjusted log-rank

As a reference method, we consider the Bonferroni-adjusted log-rank test. To this end, we define the Bonferroni-adjusted significance level α/q, where α is the original significance level and q is the number of comparisons. The Bonferroni adjustment for multiple comparisons in a survival setting is a standard procedure in clinical settings, as discussed in Logan et al. (2005). The authors have provided a comprehensive description and suggested various methods for adjusting for the number of comparisons.
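As a minimal sketch of the Bonferroni rule, the following Python function compares each local p-value with the adjusted level and rejects globally if the smallest p-value falls at or below it. The function name and the example p-values are hypothetical, and using "less than or equal" as the rejection rule is an assumption.

```python
def bonferroni_decisions(p_values, alpha=0.05):
    """Return the adjusted level, the local decisions, and the global decision."""
    q = len(p_values)                      # number of comparisons
    alpha_adj = alpha / q                  # Bonferroni-adjusted significance level
    local = [p <= alpha_adj for p in p_values]
    global_reject = min(p_values) <= alpha_adj
    return alpha_adj, local, global_reject


if __name__ == "__main__":
    # Hypothetical p-values from three pairwise tests
    print(bonferroni_decisions([0.001, 0.02, 0.04]))
```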

We define the weighted log-rank test as a generalization of the classical log-rank test. Therefore, we employ the conventional counting process notation. Let Inline graphic represent the cumulative number of observed events within group j up to time t with Inline graphic. Furthermore, we introduce Inline graphic, which denotes the number of individuals at risk just before time t in group j. These counting processes enable us to define the Nelson–Aalen estimator for Inline graphic as Inline graphic for Inline graphic and Inline graphic. Finally, consider Inline graphic to be the pooled sample size over the two groups of interest.
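The Nelson–Aalen estimator adds, at each event time, the number of events divided by the number at risk. The sketch below illustrates this for one group; the function name and input layout (observed times and event indicators with 1 = event, 0 = censored) are assumptions for illustration only.

```python
def nelson_aalen(times, events):
    """Return [(t, A_hat(t))] at the distinct event times of a single group."""
    data = sorted(zip(times, events))
    n = len(data)
    estimate = []
    cumulative = 0.0
    i = 0
    while i < n:
        t = data[i][0]
        at_risk = n - i            # subjects with observed time >= t
        d = 0                      # events occurring exactly at time t
        j = i
        while j < n and data[j][0] == t:
            d += data[j][1]
            j += 1
        if d > 0:
            cumulative += d / at_risk
            estimate.append((t, cumulative))
        i = j
    return estimate


if __name__ == "__main__":
    # Hypothetical data: the second observation is censored
    print(nelson_aalen([1, 2, 3], [1, 0, 1]))
```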

Then, the weighted log-rank statistic for testing the local null hypothesis Inline graphic can be defined as follows (Andersen et al. 1993):

graphic file with name d33e628.gif

Here, Inline graphic represents the left-continuous version of the estimator Inline graphic with Inline graphic the Kaplan–Meier estimator based on the pooled sample, and w is a continuous weight function and Inline graphic. Fleming and Harrington (1991) examined a specific subclass of weights w given by Inline graphic Inline graphic. For instance, when Inline graphic, the log-rank test is obtained. We derive the individual p-values for the tests from the Inline graphic distribution and compare them to Inline graphic. To make a global statement, we compare the minimal p-value among all local tests to the adjusted significance level.

For practical implementation, we utilize the R package survival and its function survdiff (Therneau 2023).
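Independently of the R implementation, the mechanics of a two-sample weighted log-rank test can be sketched in plain Python. The code uses the common textbook form with the hypergeometric variance (which also handles ties) and parameterizes the Fleming–Harrington weights through the pooled Kaplan–Meier survival estimate just before each event time. These choices and all function names are assumptions, so the sketch need not match the statistic above term by term.

```python
import math


def fh_weight(rho, gamma):
    """Fleming-Harrington weight as a function of the pooled survival S(t-)."""
    return lambda s: (s ** rho) * ((1.0 - s) ** gamma)


def weighted_logrank(times1, events1, times2, events2, weight=fh_weight(0, 0)):
    """Return the chi-square statistic (1 df) and its p-value."""
    pooled = [(t, e, 0) for t, e in zip(times1, events1)]
    pooled += [(t, e, 1) for t, e in zip(times2, events2)]
    event_times = sorted({t for t, e, _ in pooled if e == 1})
    surv = 1.0          # pooled Kaplan-Meier estimate just before the current time
    u, v = 0.0, 0.0     # weighted observed-minus-expected sum and its variance
    for t in event_times:
        y1 = sum(1 for x, _, g in pooled if g == 0 and x >= t)
        y2 = sum(1 for x, _, g in pooled if g == 1 and x >= t)
        y = y1 + y2
        d1 = sum(1 for x, e, g in pooled if g == 0 and x == t and e == 1)
        d = sum(1 for x, e, _ in pooled if x == t and e == 1)
        w = weight(surv)
        u += w * (d1 - d * y1 / y)
        if y > 1:
            v += w * w * d * (y1 / y) * (y2 / y) * (y - d) / (y - 1)
        surv *= 1.0 - d / y
    if v == 0.0:
        return 0.0, 1.0
    stat = u * u / v
    # Survival function of the chi-square distribution with one degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    # Hypothetical, fully observed data with clearly separated groups
    print(weighted_logrank([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5))
```

With `fh_weight(0, 0)` the weight is constant and the classical log-rank test results. Other exponents shift the emphasis towards early or late differences.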

Adjusted mdir

In the context of our specific objectives, we are interested in testing procedures that are more robust against multiple alternatives. For two-group comparisons, the multi-directional log-rank test has been proposed as a combination procedure of different weighted log-rank tests (Brendel et al. 2014; Ditzhaus and Friedrich 2020). The test assumes the equality of survival under the null hypothesis, with the choice of weights determining the alternative hypothesis. We are particularly interested in weights that intersect the x-axis, such as Inline graphic, as they are specifically designed to address crossing hazard alternatives.

By default, the R package mdir.logrank (Ditzhaus and Friedrich 2018) implements a combination of the log-rank weight Inline graphic and this crossing weight. Dormuth et al. (2023) showed that this default set of weights seems to be robust against multiple alternatives. Nevertheless, if desired, additional weights can be combined to cover more alternative hypotheses. For the general case of m linearly independent weights Inline graphic, the local test statistic takes a studentized quadratic form:

graphic file with name d33e730.gif

The entries of Inline graphic are given by

graphic file with name d33e738.gif

with Inline graphic the pooled Nelson–Aalen estimator of groups Inline graphic and Inline graphic. Inline graphic represents the Moore–Penrose inverse of the empirical covariance matrix of the weighted log-rank tests. For linearly independent weights Inline graphic fulfilling the assumptions of Ditzhaus et al. (2023) (continuous and of bounded variation), the test statistic Inline graphic can be assumed to be Inline graphic distributed under the null hypothesis. Ditzhaus and Friedrich (2020) also proposed a permutation-based approach. Because the asymptotic approximation can lead to inflated type I error rates, we only consider the permutation-based approach.
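The combination of two weighted log-rank statistics in a quadratic form, together with a permutation p-value, can be sketched as follows. This is an illustration under several assumptions rather than the mdir.logrank implementation: the weights are taken as functions of the pooled distribution function estimate just before each event time, the crossing weight is assumed to be w(x) = 1 − 2x, the covariance uses the textbook hypergeometric form, only m = 2 weights are supported, and a numerically singular covariance matrix falls back to the first weight instead of using the Moore–Penrose inverse. All names are hypothetical.

```python
import random


def logrank_weight(x):
    return 1.0


def crossing_weight(x):
    return 1.0 - 2.0 * x   # assumed form of the crossing weight


def wald_statistic(times, events, groups, weights):
    """Quadratic form of two weighted log-rank numerators (groups coded 0/1)."""
    if len(weights) != 2:
        raise ValueError("this sketch supports exactly two weights")
    data = list(zip(times, events, groups))
    event_times = sorted({t for t, e, _ in data if e == 1})
    t_vec = [0.0, 0.0]
    sigma = [[0.0, 0.0], [0.0, 0.0]]
    surv = 1.0   # pooled Kaplan-Meier estimate just before the current time
    for t in event_times:
        y1 = sum(1 for x, _, g in data if g == 0 and x >= t)
        y = sum(1 for x, _, _ in data if x >= t)
        d1 = sum(1 for x, e, g in data if g == 0 and x == t and e == 1)
        d = sum(1 for x, e, _ in data if x == t and e == 1)
        w = [wf(1.0 - surv) for wf in weights]
        var = d * (y1 / y) * (1.0 - y1 / y) * (y - d) / (y - 1) if y > 1 else 0.0
        for r in range(2):
            t_vec[r] += w[r] * (d1 - d * y1 / y)
            for s in range(2):
                sigma[r][s] += w[r] * w[s] * var
        surv *= 1.0 - d / y
    s00, s01, s10, s11 = sigma[0][0], sigma[0][1], sigma[1][0], sigma[1][1]
    det = s00 * s11 - s01 * s10
    if abs(det) < 1e-12:   # simplification: no Moore-Penrose inverse here
        return t_vec[0] ** 2 / s00 if s00 > 0 else 0.0
    t0, t1 = t_vec
    return (s11 * t0 * t0 - (s01 + s10) * t0 * t1 + s00 * t1 * t1) / det


def permutation_p_value(times, events, groups, weights, n_perm=1000, seed=1):
    """Permute the group labels and compare the permuted statistics with the observed one."""
    rng = random.Random(seed)
    observed = wald_statistic(times, events, groups, weights)
    labels = list(groups)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if wald_statistic(times, events, labels, weights) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)   # one common finite-sample convention
```

A call such as `permutation_p_value(times, events, groups, [logrank_weight, crossing_weight])` then returns the local p-value that is compared with the Bonferroni-adjusted level.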

Again, we employ the Bonferroni adjusted significance level Inline graphic to compare to the obtained local p-values. Analogously to the adjusted log-rank test procedure, we obtain the global test decision by comparing the smallest p-value to the adjusted significance level.

MultiWeightedLR

Knowing that maximum tests are a common approach for multiple testing problems (Konietschke et al. 2013), a straightforward extension of the weighted log-rank test is to use the maximum over them and exploit the covariance structure between the different tests. We use the same weights as for the adjusted mdir approach without combining them in a quadratic form. Instead, we consider each weighted test individually. After calculating the corresponding covariance matrix, we take the maximum of all weighted test statistics as our global maximum test statistic. Mathematically, we write:

graphic file with name d33e791.gif

For the local testing problem we focus on Inline graphic. Similar to the proof of Theorem 2 in Ditzhaus et al. (2023), it can be shown that the vector Inline graphic is, under regularity conditions, asymptotically centered multivariate normal with covariance matrix Inline graphic. We thus take the equicoordinate Inline graphic-quantile (Konietschke et al. 2012) of this distribution as the critical value for the statistic Inline graphic, which yields the MultiWeightedLR test.
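An equicoordinate quantile is the value c such that all coordinates of the multivariate normal vector stay within c simultaneously with the desired probability. The sketch below approximates it by Monte Carlo simulation for a given correlation matrix, assuming a two-sided maximum over absolute values. In practice such quantiles are usually computed by numerical integration; the function names and the simulation approach are assumptions made for illustration.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive semi-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(max(matrix[i][i] - s, 0.0))
            elif lower[j][j] > 0:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def equicoordinate_quantile(corr, alpha=0.05, n_draws=20000, seed=1):
    """Monte Carlo (1 - alpha)-quantile of max_i |Z_i| for Z ~ N(0, corr)."""
    rng = random.Random(seed)
    lower = cholesky(corr)
    dim = len(corr)
    maxima = []
    for _ in range(n_draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x = [sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(dim)]
        maxima.append(max(abs(value) for value in x))
    maxima.sort()
    index = min(n_draws - 1, math.ceil((1.0 - alpha) * n_draws) - 1)
    return maxima[index]


if __name__ == "__main__":
    # Hypothetical correlation matrix of three test statistics
    corr = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
    print(equicoordinate_quantile(corr))
```

Positive correlation between the statistics lowers the quantile compared with the independent case, which is where a maximum test gains over a Bonferroni adjustment.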

multiCASANOVA

Ditzhaus et al. (2023) proposed the CASANOVA (Cumulative Aalen Survival Analysis-of-Variance) approach for general factorial designs with right-censored time-to-event data. The core idea of the method is an extension of weighted log-rank tests to the factorial design setup. To this end, they expanded the combination approach of weighted log-rank tests (mdir) for the two-sample scenario to general factorial survival designs; the method is implemented in the R package GFDsurv (Ditzhaus et al. 2022). For further information, we refer to Ditzhaus et al. (2023).

We aim to extend the CASANOVA approach to allow the estimation and testing of user-specific contrasts in a multiple testing framework. Compared to the aforementioned approaches, the main difference is that we consider pooled quantities over all groups, not only the two groups of interest. To this end, we define a local test statistic for contrast Inline graphic as

graphic file with name d33e844.gif

where Inline graphic represents the left-continuous version of the pooled estimator Inline graphic and Y(t) is the total number of individuals at risk over all groups. As in the adjusted mdir test, we combine several weights (still for one single contrast) by considering the corresponding quadratic form given by

graphic file with name d33e863.gif

where the inner matrix is defined by

graphic file with name d33e867.gif

and Inline graphic represents its Moore–Penrose inverse. Similar to the maximum approach within MultiWeightedLR, we now consider the maximum of these Wald-type statistics over all contrasts of interest as the global test statistic

graphic file with name d33e875.gif

Note that we did not take the maximum over the different weights, as those are already incorporated within the quadratic forms.

We use the common wild bootstrap approach for counting processes in time-to-event analyses (Bluhmki et al. 2019, 2018) to approximate the limiting distribution. Therefore, we consider independent and identically distributed variables Inline graphic with Inline graphic and Inline graphic. The wild bootstrap version of the normalized Nelson–Aalen estimator Inline graphic as defined in Bluhmki et al. (2019) is then given by:

graphic file with name d33e907.gif

The motivation behind Inline graphic stems from the martingale representation of the normalized Nelson–Aalen estimator:

graphic file with name d33e915.gif

where Inline graphic indicates that the difference between both sides converges to 0 in probability, and Inline graphic denotes the martingale obtained from the Doob–Meyer decomposition of Inline graphic with Inline graphic. The wild bootstrap Nelson–Aalen version Inline graphic is thus obtained by replacing the unobservable martingales Inline graphic with the observable Inline graphic. As shown in Bluhmki et al. (2019), the distribution of Inline graphic and the conditional distribution of Inline graphic (given the data) coincide asymptotically. Assuming Inline graphic, this implies that Inline graphic can be used to approximate the Inline graphic-null distribution of Inline graphic. We thus define the wild bootstrap versions Inline graphic and Inline graphic of Inline graphic and Inline graphic, respectively, by replacing Inline graphic with Inline graphic.

Since counting processes are discrete, we opt for discrete distributions for the Inline graphic. We focus on two common choices: (i) the Rademacher distribution (Liu 1988), and (ii) the centered Poisson distribution (Mammen 2012). This results in two different wild bootstrap quantiles depending on the distribution of choice: Inline graphic the Inline graphic-quantile of Inline graphic given our data Inline graphic. Then we obtain the global test decision by evaluating Inline graphic and the local test decisions by Inline graphic.
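The two multiplier distributions and the quantile-based decision rule can be sketched as follows. The sketch assumes multipliers with mean 0 and variance 1, takes the centered Poisson multiplier to be a Poisson(1) variable minus 1, and assumes that a hypothesis is rejected when its statistic exceeds the bootstrap quantile of the maximum statistic. All function names are hypothetical, and the computation of the bootstrapped statistics themselves is not shown.

```python
import math
import random


def rademacher(rng):
    """Multiplier taking the values -1 and +1 with probability 1/2 each."""
    return 1.0 if rng.random() < 0.5 else -1.0


def centered_poisson(rng):
    """Poisson(1) - 1, drawn with Knuth's multiplication method."""
    limit = math.exp(-1.0)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count - 1.0


def bootstrap_quantile(bootstrap_stats, alpha=0.05):
    """Empirical (1 - alpha)-quantile of the bootstrapped maximum statistics."""
    ordered = sorted(bootstrap_stats)
    index = min(len(ordered) - 1, math.ceil((1.0 - alpha) * len(ordered)) - 1)
    return ordered[index]


def max_test_decisions(local_stats, bootstrap_max_stats, alpha=0.05):
    """Compare every local statistic, and their maximum, with one common quantile."""
    critical = bootstrap_quantile(bootstrap_max_stats, alpha)
    local = [stat > critical for stat in local_stats]
    return critical, local, max(local_stats) > critical
```

Because all local statistics are compared with the same critical value, no further p-value adjustment is applied afterwards.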

Simulation study

We conducted an extensive simulation study in R 4.4.0 (R Core Team 2021) to evaluate the rejection rate and the power performance of the candidate methods.

Simulation setup

We simulated data for four groups considering the Tukey- and Dunnett-type contrast matrices. We considered four scenarios, each with different distribution functions. Each represents a specific case of hazard relationships: (i) proportional hazards, (ii) non-proportional and non-crossing hazards, (iii) crossing hazards, and (iv) a mixed scenario. The specific survival functions are presented in Table 1. We set the group size for each scenario to 100; the censoring rates vary between Inline graphic and Inline graphic with uniform censoring. The work of Dormuth et al. (2023) indicated that the choice of censoring distribution does not have a major impact on the performance of statistical tests. Considering all possible combinations of censoring, survival distributions, and contrast matrices, we end up with a total of 4 (scenarios) × 1120 parameter combinations = 4480 different settings.

Table 1.

Simulation scenarios

Scenario CDF Visualization of the survival and hazard curves
Prop

Inline graphic

Inline graphic

Inline graphic

Inline graphic

graphic file with name 10985_2025_9676_Figa_HTML.gif
NProp

Inline graphic

Inline graphic

Inline graphic

Inline graphic

graphic file with name 10985_2025_9676_Figb_HTML.gif
Cross

Inline graphic

Inline graphic

Inline graphic

Inline graphic

graphic file with name 10985_2025_9676_Figc_HTML.gif
Mix

Inline graphic

Inline graphic

Inline graphic

Inline graphic

graphic file with name 10985_2025_9676_Figd_HTML.gif

The number of parameter combinations arises because we considered only which survival time distributions were combined across the individual groups, not the order in which they were combined. This means that Inline graphic is the same combination as Inline graphic. For the Tukey-type contrast matrices, we considered every possible comparison for Inline graphic, which results in six tests. For the Dunnett-type contrast matrices, we compared the first group to all the other groups, resulting in three contrasts.

For each setting, we performed 10,000 simulation runs with 1000 resampling iterations. The global level of significance was set to 0.05 throughout.

Simulation results under the null hypothesis

Figure 1 illustrates the familywise error rate (FWER) for all survival scenarios for the different contrast matrix types. We set the α-level to 0.05. The dashed lines represent the corresponding binomial precision interval, based on the 10,000 simulation runs.
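A binomial precision interval of this kind can be computed as sketched below. The sketch assumes the usual normal approximation with a 95% two-sided range around the nominal level of 0.05; the exact definition used for the figure may differ, and the function name is hypothetical.

```python
import math


def binomial_precision_interval(level=0.05, n_runs=10_000, z=1.959964):
    """Range in which an estimated rejection rate is expected to fall if the true rate equals `level`."""
    half_width = z * math.sqrt(level * (1.0 - level) / n_runs)
    return level - half_width, level + half_width


if __name__ == "__main__":
    lower, upper = binomial_precision_interval()
    print(round(lower, 4), round(upper, 4))
```

Under these assumptions the interval is roughly 0.0457 to 0.0543, so estimated error rates inside this range are compatible with exact control at the 0.05 level.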

Fig. 1.

FWER under Inline graphic for all settings for the Dunnett-type (left) and Tukey-type (right) contrast matrices. The dashed lines represent the borders of the binomial precision interval Inline graphic

For both contrast matrices, almost all methods control the FWER well. The adjusted mdir is the only test that is a little liberal when comparing the median to the global Inline graphic-level of Inline graphic for the Dunnett-type matrix. The new multiple-testing approaches are more conservative than the adjusted approaches, especially for the Tukey-type contrast matrices, with the multiWeightedLR being the most conservative.

Simulation results under the alternative hypothesis

We focused on the local decisions under the alternative hypothesis to assess the power. Figures 2 and 3 illustrate the rejection rates when different survival distributions are present. Each figure consists of four subfigures, one for each scenario. Note that a higher number of tests decreases the power for each local hypothesis. This property is visible in the plots, which show generally higher power for the Dunnett-type than for the Tukey-type contrasts. Apart from that, the tests behave similarly for both contrast matrix types. The adjusted log-rank test is the most powerful in the setting with proportional hazards, while the other tests perform comparably to one another. Under non-proportional but non-crossing hazards, all tests have high power, with the new approaches yielding a slightly lower variability. In the crossing scenario, the log-rank test loses power drastically because the PH assumption is violated. The four approaches designed for non-proportional hazards (nPH) data have high power, with the multiWeightedLR being slightly less powerful than the other three tests. In the mixed setting, the adjusted mdir performs best in terms of power, followed by the three methods introduced in this paper. The log-rank test has the highest variability and the lowest median power.

Fig. 2.

Local power over all tests under the alternative for Dunnett-type contrasts for all four scenarios (each boxplot contains 1136 data points)

Fig. 3.

Local power over all tests under the alternative for Tukey-type contrasts for all four scenarios (each boxplot contains 2016 data points)

The rejection rates for the local tests with no difference in survival are depicted in Figures S1 and S2 in the Supplemental Material. Overall, the rejection rates among the approaches are similar, with lower rejection rates for the Tukey-type contrast matrices. Additionally, the power for each local test is provided in the tables in the Supplemental Material.

The adjusted mdir test performs best for two of the four settings considered. Considering that it showed slightly liberal behavior under the null hypothesis, these results should be interpreted carefully. The methods introduced in this publication yield robust results regarding power among the different scenarios. The adjusted log-rank test loses power dramatically in the scenario with crossing hazards.

In the Supplement (Figures S3–S5), we present an additional analysis of the behavior of the different tests regarding the FWER and power for smaller sample sizes (Inline graphic). The results indicate that the multiWeightedLR approach exhibits an inflated FWER, likely due to the normal approximation. In contrast, both multiCASANOVA bootstrap approaches maintain strong control over the family-wise error rate and consistently deliver good results in terms of power. The adjusted LR and mdir test still control the FWER but show increased variability in terms of power.

Illustrative data example

To illustrate the novel approaches on real-world data, we used publicly available data from the CoMMpass study (dbGaP accession: phs000748.v4.p3). This study is designed to associate clinical outcome with genetic profiles and contains longitudinal clinical and molecular data from multiple myeloma (MM) patients. Based on the transcriptional profile and the expression level of biologically relevant core machinery that plays a vital role in the stress response (Heynen et al. 2023), we clustered MM patients into seven groups. Figure 4 shows the Kaplan–Meier curves of the seven different groups. We assume that we are interested in comparing every group with one another and consider a Tukey design. The significance level was set to Inline graphic with a total of 21 tests. The corrected significance level is thus Inline graphic.
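The number of tests and the corrected level follow from simple arithmetic. The worked step below assumes an original significance level of 0.05, as in the simulation study.

```latex
q = \binom{7}{2} = \frac{7 \cdot 6}{2} = 21,
\qquad
\frac{\alpha}{q} = \frac{0.05}{21} \approx 0.00238
```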

Fig. 4.

Kaplan–Meier plot of the seven treatment groups of patients with multiple myeloma (MM)

By examining the survival curves, we anticipate that the methods will identify a significant overall difference among the groups. Specifically, we expect group 2 to differ from the other groups. However, we do not expect to see any differences among groups 3, 4, and 5. We applied all testing procedures described in this paper to investigate these premises. For all approaches combining multiple weighted log-rank tests, we included the Inline graphic and Inline graphic. We set the number of resampling iterations to 1000 for all resampling-based approaches.

The detailed results are listed in Table S1 of the Supplemental Material. The adjusted log-rank and mdir tests detected six significant differences between groups, while the three new methods detected only five. The detected differences are consistent among the methods. All tests found the pairwise differences between group two and each of groups three, four, five, and seven, as well as between groups one and four, to be significant. A significant result for the comparison between groups two and six was found only by the adjusted LR test and the adjusted mdir test.

All tests also rejected the global null hypothesis of no difference between the groups. In summary, in this real-world application, the results of the new methods are largely consistent with those of the adjusted log-rank test.

Discussion

We explored various statistical methods for addressing multiple contrast problems with time-to-event endpoints, including traditional and newly developed approaches. To assess their performance, we compared the control of the familywise error rate (FWER) and the power of these methods under different survival scenarios. The results of our simulation study and real-world data application provide valuable insights into the strengths and limitations of each approach.

Most methods maintain adequate control of the FWER. The adjusted mdir test exhibited a slightly liberal behavior, particularly for Dunnett-type contrasts. This deviation suggests that while the adjusted mdir test might be powerful, it occasionally exceeds the acceptable error rate, which warrants caution in its interpretation under null conditions. On the other hand, the multiWeightedLR and multiCASANOVA methods were generally more conservative, particularly for Tukey-type contrast matrices. This conservativeness could imply a lower risk of Type I errors but may come at the cost of reduced statistical power.

Under alternative hypotheses, the power analysis revealed notable differences in the tests’ performance depending on the survival scenario. For proportional hazards, the adjusted log-rank test showed the highest power, outperforming the other methods. Under non-proportional and non-crossing hazards, we could observe high power among all tests, with the new approaches showing slightly lower variability. The robustness of these methods suggests that they are suitable choices when proportional hazards are not guaranteed. The log-rank test’s power decreased drastically in the specific case of crossing hazards. In contrast, the four approaches specifically designed for non-proportional hazards (adjusted mdir, multiWeightedLR, and the two multiCASANOVA variants) maintained high power, confirming their utility in these settings. Finally, the adjusted mdir test outperformed other methods in the mixed scenario, achieving the best power performance. Although slightly less powerful, the new methods provided more consistent results across different scenarios, highlighting their robustness.

The results suggest potential areas for further methodological improvements. While the Bonferroni correction is widely used for controlling type I error rates, its conservative nature may result in lower power, particularly in settings with many comparisons. More sophisticated adjustment techniques, like the Holm procedure, could better balance error rate control and power, as discussed in previous studies.
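To illustrate how the Holm procedure differs from the Bonferroni correction, the following Python sketch implements the standard step-down rule: the ordered p-values are compared with increasingly lenient thresholds, and testing stops at the first non-rejection. The function name and the example p-values are hypothetical.

```python
def holm_decisions(p_values, alpha=0.05):
    """Step-down Holm procedure; returns one rejection decision per hypothesis."""
    q = len(p_values)
    order = sorted(range(q), key=lambda i: p_values[i])
    reject = [False] * q
    for rank, i in enumerate(order):
        # The smallest p-value is compared with alpha/q, the next with alpha/(q-1), ...
        if p_values[i] <= alpha / (q - rank):
            reject[i] = True
        else:
            break   # all larger p-values are retained as well
    return reject


if __name__ == "__main__":
    # Hypothetical p-values from four local tests
    print(holm_decisions([0.001, 0.012, 0.02, 0.04]))
```

For these four hypothetical p-values, the Bonferroni threshold of 0.0125 rejects only the two smallest, whereas the Holm procedure rejects all four while still controlling the FWER.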

Additionally, evaluating the performance of these methods in unbalanced designs could provide a more comprehensive understanding of their behavior in practical applications. This could be particularly interesting since Munko et al. (2024) showed that such conditions could boost the power of specific local designs.

In the illustrative data example involving patients with multiple myeloma, the new methods produced results consistent with those obtained from the adjusted log-rank tests. Although the novel approaches identified fewer significant differences than the traditional methods, their findings were largely aligned, underscoring their reliability in practical scenarios. This consistency and the conservative behavior in terms of FWER control suggest that the new methods offer a robust alternative for analyzing time-to-event data in clinical studies. Future research could include a more efficient exploitation of the FWER for the new approaches, e.g., by incorporating closed testing approaches. In general, it is essential to critically assess whether a higher number of statistically significant results truly reflects a superior testing approach, as statistical significance does not inherently equate to clinical relevance.

Acknowledgements

We sincerely thank the reviewers for their insightful feedback and thoughtful recommendations, which have significantly strengthened this work.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Footnotes

Deceased: Marc Ditzhaus.

References

  1. Andersen PK, Borgan O, Gill RD, Keiding N (1993) Statistical models based on counting processes. Springer, Berlin
  2. Blanche P, Dartigues JF, Riou J (2022) A closed max-t test for multiple comparisons of areas under the ROC curve. Biometrics 78(1):352–363
  3. Bluhmki T, Dobler D, Beyersmann J, Pauly M (2019) The wild bootstrap for multivariate Nelson–Aalen estimators. Lifetime Data Anal 25:97–127
  4. Bluhmki T, Schmoor C, Dobler D, Pauly M, Finke J, Schumacher M, Beyersmann J (2018) A wild bootstrap approach for the Aalen–Johansen estimator. Biometrics 74(3):977–985
  5. Brendel M, Janssen A, Mayer CD, Pauly M (2014) Weighted Logrank permutation tests for randomly right censored life science data. Scand J Stat 41(3):742–761. 10.1111/sjos.12059
  6. Bretz F, Genz A, Hothorn LA (2001) On the numerical availability of multiple comparison procedures. Biometrical J 43(5):645–656
  7. Ditzhaus M, Dobler D, Pauly M, Steinhauer P, Munko M (2022) GFDsurv: tests for survival data in general factorial designs. R package version 0.1.1. https://CRAN.R-project.org/package=GFDsurv
  8. Ditzhaus M, Friedrich S (2018) mdir.logrank: Multiple-direction logrank test. R package version 0.0.4. https://CRAN.R-project.org/package=mdir.logrank
  9. Ditzhaus M, Friedrich S (2020) More powerful Logrank permutation tests for two-sample survival data. J Stat Comput Simul 90(12):2209–2227
  10. Ditzhaus M, Genuneit J, Janssen A, Pauly M (2023) CASANOVA: permutation inference in factorial survival designs. Biometrics 79(1):203–215
  11. Dormuth I, Liu T, Xu J, Pauly M, Ditzhaus M (2023) A comparative study to alternatives to the log-rank test. Contemp Clin Trials 128:107165
  12. Dormuth I, Liu T, Xu J, Yu M, Pauly M, Ditzhaus M (2022) Which test for crossing survival curves? A user’s guideline. BMC Med Res Methodol 22(1):1–7
  13. Fleming TR, Harrington DP (1991) Counting processes and survival analysis, vol. 625. Wiley, London
  14. Gao X, Alvo M (2008) Nonparametric multiple comparison procedures for unbalanced two-way layouts. J Stat Plan Inference 138(12):3674–3686
  15. Gao X, Alvo M, Chen J, Li G (2008) Nonparametric multiple comparison procedures for unbalanced one-way factorial designs. J Stat Plan Inference 138(6):2574–2591
  16. Hasler M, Hothorn LA (2008) Multiple contrast tests in the presence of heteroscedasticity. Biometrical J Math Methods Biosci 50(5):793–800
  17. Heynen GJ, Baumgartner F, Heider M, Patra U, Holz M, Braune J, Kaiser M, Schäffer I, Bamopoulos SA, Ramberger E et al (2023) SUMOylation inhibition overcomes proteasome inhibitor resistance in multiple myeloma. Blood Adv 7(4):469–481
  18. Konietschke F, Bösiger S, Brunner E, Hothorn LA (2013) Are multiple contrast tests superior to the ANOVA? Int J Biostat 9(1). 10.1515/ijb-2012-0020
  19. Konietschke F, Hothorn LA, Brunner E (2012) Rank-based multiple test procedures and simultaneous confidence intervals. Electron J Stat 6. 10.1214/12-EJS691
  20. Kristiansen IS (2012) PRM39 survival curve convergences and crossing: a threat to validity of meta-analysis? Value Health 15(7):A652
  21. Liu RY (1988) Bootstrap Procedures under some non-iid Models. Ann Stat 16(4):1696–1708
  22. Logan BR, Wang H, Zhang MJ (2005) Pairwise multiple comparison adjustment in survival analysis. Stat Med 24(16):2509–2523
  23. Mammen E (2012) When does Bootstrap work?: asymptotic results and simulations, vol. 77. Springer, Berlin
  24. Munko M, Ditzhaus M, Dobler D, Genuneit J (2024) RMST-based multiple contrast tests in general factorial designs. Stat Med 43(10):1849–1866. 10.1002/sim.10017
  25. R Core Team (2021) R: a language and environment for statistical computing. https://www.R-project.org/
  26. Schaarschmidt F, Biesheuvel E, Hothorn LA (2009) Asymptotic simultaneous confidence intervals for many-to-one comparisons of binary proportions in randomized clinical trials. J Biopharm Stat 19(2):292–310
  27. Therneau TM (2023) A package for survival analysis in R. R package version 3.5-5. https://CRAN.R-project.org/package=survival
  28. Trinquart L, Jacot J, Conner SC, Porcher R (2016) Comparison of treatment effects measured by the hazard ratio and by the ratio of restricted mean survival times in oncology randomized controlled trials. J Clin Oncol 34(15):1813–1819

Articles from Lifetime Data Analysis are provided here courtesy of Springer
