Author manuscript; available in PMC: 2019 Jun 1.
Published in final edited form as: Biom J. 2019 Jan 7;61(3):698–713. doi: 10.1002/bimj.201800049

K-Sample Comparisons using Propensity Analysis

Sin-Ho Jung 1, Sang Ah Chi 2, Hyun Joo Ahn 3
PMCID: PMC6461520  NIHMSID: NIHMS1002903  PMID: 30614546

Abstract

In this paper, we investigate K-group comparisons on survival endpoints for observational studies. In clinical databases for observational studies, treatments for patients are chosen with probabilities that vary depending on their baseline characteristics. This often results in non-comparable treatment groups because of imbalance in the baseline characteristics of patients among treatment groups. In order to overcome this issue, we conduct a propensity analysis and either match the subjects with similar propensity scores across treatment groups or compare weighted group means (or weighted survival curves for censored outcome variables) using inverse probability weighting (IPW). To this end, multinomial logistic regression has been a popular propensity analysis method for estimating the weights. We propose to use the decision tree method as an alternative propensity analysis because of its simplicity and robustness. We also propose IPW rank statistics, called the Dunnett-type test and the ANOVA-type test, to compare 3 or more treatment groups on survival endpoints. Using simulations, we evaluate the finite sample performance of the weighted rank statistics combined with these propensity analysis methods. We demonstrate these methods with a real data example. The IPW method also allows unbiased estimation of the population parameters of each treatment group. In this paper, we limit our discussion to survival outcomes, but all the methods can be easily modified for any type of outcome, such as binary or continuous variables.

Keywords: ANOVA, Decision tree, Dunnett test, Inverse probability weighting, Multinomial logistic regression

1. Introduction

In a prospective study, such as a phase III clinical trial, patients are randomly assigned to treatment groups independently of their baseline characteristics, also called predictors, so that the distributions of the predictors are well balanced among treatment groups and the statistical test comparing the efficacy of the treatments controls the type I error rate accurately even without adjusting for the predictors. We often use a stratified randomization method for a perfect balance of the predictor distributions among treatment groups, selecting some important predictors as the stratification factors.

In the data set of an observational study, however, treatment group is usually confounded with the predictors. In this case, we may conduct a multivariable regression analysis including the group identifier and the predictors as covariates to compare the groups while adjusting for the potential bias due to confounding. In the presence of confounding, clinical investigators are often uncomfortable with a direct comparison among treatment groups, so they want to generate a subset of the data with balanced predictor distributions among treatment groups, as if it came from a randomized trial.

In non-randomized (observational) studies, the use of propensity analysis has received much attention in clinical research for comparing clinical endpoints between groups. A propensity score is a measure of how the treatment of each patient is selected depending on the values of the predictors. If we can find an equal number of patients from each group with the same propensity score, then we can make up a perfectly balanced data set, called matched data, by selecting only those patients for an efficacy analysis. With matched data, we can conduct a K-sample comparison using a standard univariate analysis method, but we often have to discard a large part of the data set during the matching procedure. An alternative approach is to use a weighted test statistic for comparing groups with inverse probability weighting (IPW), keeping all of the original data, e.g. Curtis et al. (2007) and Breslow et al. (2009).

There have been many publications on propensity methods since the original works by Rosenbaum and Rubin (1983, 1984). Most of these publications, however, focus on comparing two patient groups from non-randomized studies, and the extension to K(≥ 3) groups has not been fully investigated yet, especially for survival analysis. Logistic regression has been widely used in propensity analysis for two treatment groups. Use of the multinomial logistic regression method for estimating propensity scores in K-sample comparison problems was proposed by Rubin (1998) and has been studied by many investigators, including Imbens (2000) and McCaffrey et al. (2013). In a K-sample comparison, accurate estimation of the propensity scores is a key component. If the treatment selection probabilities have a non-monotone trend in a predictor, it is not easy to estimate the propensity scores accurately using a model-based regression method. To overcome this issue, some investigators (Stone et al. 1995; Pruzek & Cen 2002; Westreich et al. 2010) have applied classification tree methods for propensity analysis as an alternative to logistic regression.

McCaffrey et al. (2013) propose to repeatedly apply the 2-sample tree method between each experimental treatment group and a common control, and to adjust for the resulting (K − 1)-dimensional propensity scores in K-sample comparisons. They report, however, that the K-sample comparison result can differ depending on which group is chosen as the control. We propose a decision tree method for the propensity analysis of a K-sample comparison through a direct classification of the K treatment groups that does not require specification of a control group. We use a pruning method based on χ²-tests of whether two propensity strata have different treatment selection probabilities.

We also propose two types of weighted rank statistics to compare K groups on survival outcomes, using ANOVA-type and Dunnett-type (Dunnett, 1955) testing methods combined with IPW. We evaluate, through simulations, the performance of the weighted tests combined with the two propensity analysis methods used to estimate the weights, the multinomial logistic regression method and the decision tree method. We apply the proposed IPW method and a matched data method, using the propensity scores estimated by the two propensity analysis methods, to a real data example. We limit our discussion to survival outcomes, but the results can be easily modified for other types of variables, such as continuous or binary.

2. Propensity Analysis

Suppose that there are $n_k$ patients in treatment group $k (= 1,\ldots,K)$ and $n = \sum_{k=1}^K n_k$. From each patient, we have data on group identity, $m$ covariates $(z_{1i},\ldots,z_{mi})^T$, and a survival outcome. In the first step of a propensity analysis, we estimate propensity scores using the predictors and the group identity. Outcome data are not used in this step.

2.1. Multinomial Logistic Regression Method

First, we review the multinomial logistic regression method that was proposed by Rubin (1998). Let $(P_{1i},\ldots,P_{Ki})$ be the propensities of the $K$ groups for patient $i (= 1,\ldots,n)$ with predictors $z_i = (1, z_{1i},\ldots,z_{mi})^T$. We consider the logistic models

$$\log\frac{P_{ki}}{P_{Ki}} = \beta_k^T z_i = \beta_{0k} + \beta_{1k} z_{1i} + \cdots + \beta_{mk} z_{mi} \qquad (1)$$

for $k = 1,\ldots,K-1$, where $\beta_k = (\beta_{0k},\beta_{1k},\ldots,\beta_{mk})^T$ is a vector of regression coefficients. For each $k (= 1,\ldots,K-1)$, $\beta_k$ is estimated by fitting the logistic regression (1) using data from groups $k$ and $K$, as in a propensity analysis for a two-sample comparison. For $i = 1,\ldots,n_k+n_K$, the binary response variable is $y_i = 1$ ($= 0$) if patient $i$ belongs to group $k$ (group $K$).

Since $\sum_{k=1}^K P_{ki} = 1$ and $P_{ki} = P_{Ki}\exp(\beta_k^T z_i)$ for $k = 1,\ldots,K-1$ from (1), we have

$$P_{Ki} = \frac{1}{1 + \sum_{k=1}^{K-1}\exp(\beta_k^T z_i)}$$

and

$$P_{ki} = \frac{\exp(\beta_k^T z_i)}{1 + \sum_{k'=1}^{K-1}\exp(\beta_{k'}^T z_i)} \qquad \text{for } k = 1,\ldots,K-1.$$

For $k = 1,\ldots,K-1$, let $\hat\beta_k$ denote the estimator of $\beta_k$ from the regression analysis between treatment groups $k$ and $K$.

Now, we rearrange the data so that $\tilde z_{ki}$ denotes the covariate vector for patient $i (= 1,\ldots,n_k)$ belonging to treatment group $k (= 1,\ldots,K)$. Using this notation, the allocation probability for this patient is estimated by

$$\hat P_{ki} = \frac{\exp(\hat\beta_k^T \tilde z_{ki})}{1 + \sum_{k'=1}^{K-1}\exp(\hat\beta_{k'}^T \tilde z_{ki})} \qquad \text{if } 1 \le k \le K-1$$

and

$$\hat P_{Ki} = \frac{1}{1 + \sum_{k'=1}^{K-1}\exp(\hat\beta_{k'}^T \tilde z_{Ki})}.$$

We use $\hat P_{ki}$ as a propensity score. For propensity matching, we may define strata by grouping patients with similar values of $(\hat P_{1i},\ldots,\hat P_{K-1,i})$ using a clustering method. A simple clustering method, for example with $K = 3$, is to partition the ranges of $\hat P_{1i}$ and $\hat P_{2i}$ from the $n$ patients into $J_1$ and $J_2$ intervals, respectively, and construct $J_1 \times J_2$ strata, each consisting of patients with similar propensity scores $(\hat P_1,\hat P_2,\hat P_3)$. Within each stratum, we can randomly select a certain number of patients from each of the three groups so that the set of allocation proportions $(\gamma_1,\gamma_2,\gamma_3)$ with $\sum_{k=1}^3\gamma_k = 1$ is identical across the $J_1 \times J_2$ strata. This procedure is called vector matching by Abadie and Imbens (2006). Lopez and Gutman (2017) provide a review of various matching methods for multiple treatment comparisons.

If we want an inference based on IPW, then we use $w_{ki} = 1/\hat P_{ki}$ as the weight for patient $i$ in group $k$.
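
As an illustration (not the authors' Fortran program), the propensity scores $\hat P_{ki}$ and IPW weights $w_{ki} = 1/\hat P_{ki}$ could be computed from the $K-1$ pairwise logistic fits described above roughly as in the following sketch; the function and variable names are assumptions made for this sketch, and scikit-learn is used only for the binary logistic regressions.

import numpy as np
from sklearn.linear_model import LogisticRegression

def pairwise_propensity(Z, group, K):
    """Propensity scores from K-1 binary logistic fits of group k versus the reference group K.

    Z     : (n, m) numpy array of predictors
    group : (n,) integer numpy array of treatment labels 1, ..., K
    Returns the (n, K) matrix of estimated propensity scores and the IPW weights 1 / P_hat.
    """
    n = len(group)
    exp_eta = np.ones((n, K - 1))                    # exp(beta_k' z_i) for k = 1, ..., K-1
    for k in range(1, K):
        sub = np.isin(group, [k, K])                 # use only groups k and K, as in Section 2.1
        y = (group[sub] == k).astype(int)            # 1 for group k, 0 for the reference group K
        fit = LogisticRegression(max_iter=1000).fit(Z[sub], y)
        exp_eta[:, k - 1] = np.exp(fit.intercept_[0] + Z @ fit.coef_[0])
    denom = 1.0 + exp_eta.sum(axis=1)
    probs = np.column_stack([exp_eta / denom[:, None], 1.0 / denom])   # columns are groups 1..K
    p_own = probs[np.arange(n), group - 1]           # propensity of the treatment actually received
    return probs, 1.0 / p_own                        # IPW weights w_ki = 1 / P_hat_ki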

2.2. Decision Tree Method

The multinomial logistic regression is a parametric approach, so the final K-group comparison results, as well as the propensity analysis results, can be sensitive to the assumed regression model. The decision tree is a robust alternative that avoids such issues of regression methods.

We define a stratum based on propensity scores that are associated with the relative frequencies of the K groups. Ideally, each stratum should be homogeneous, so that the propensity scores of the patients within a stratum are similar. On the other hand, the relative frequencies of the K groups should be very different between any two different strata. The further apart two propensity strata are, the more different the relative frequencies between the two strata will be. We derive a propensity score estimation method based on this concept.

A decision tree is a tool to find the best decision for an object based on the values of its predictors. While regression trees usually have a continuous or binomial outcome variable, decision trees have a nominal categorical outcome variable. If there are two possible decisions (or groups to be compared), then a decision tree is identical to the binary regression tree. We consider decision trees with K(≥ 3) possible decisions that are called treatment groups in this paper. At a node, we go through all possible cutoff values for every predictor and identify the cutoff value of a predictor that gives the most significant partitioning in terms of the proportions among K groups. As the level of nodes goes down, the possible cutoffs of a predictor will be limited to a smaller range. If the cutoff value is too extreme, then one of two strata partitioned by the cutoff will be too small, especially with a continuous predictor. Since we do not want to consider too small strata, we may not consider extreme cutoff values.

In order to measure the significance of a classification, we propose to use the p-value from the chi-squared test for a 2×K frequency table (see Table 1) comprising two strata identified by a cutoff value of a predictor and K groups. We continue the classification procedure until there exist no more significant cutoff values for any predictors for all nodes. When this procedure is completed, each terminal node consists of a subset of patients with similar propensity scores. In this sense, we call each terminal node a propensity stratum, and the frequencies of the K groups within each stratum can be used for propensity matching.

Table 1.

2 × K frequency table defined by cutoff value c for predictor z

Propensity stratum    Treatment group
                      1      2      …      K
z ≤ c                 n11    n12    …      n1K
z > c                 n21    n22    …      n2K

For the purpose of statistical trimming of regression trees, Jung et al. (2014) propose to control the familywise error rate (FWER) in determining the significance of the 2 × K tables, accounting for the multiplicity of the predictors and of the cutoff values of each predictor. This rule, however, will produce a very short decision tree, and each of the resulting branches (i.e. strata) will consist of patients with a very wide range of propensity scores. So, in this paper, we propose to control the marginal type I error rate based on the p-values of χ²-tests with K − 1 degrees of freedom. If the total sample size is large and we want each stratum to have very homogeneous propensity scores, then we use a large type I error rate, called α1.

When there is no more significant classification, each of the leaves, also called final nodes, is counted as a stratum consisting of subjects with similar propensity scores, and the frequencies among the K groups are summarized by a K × 1 table. A decision tree may possibly result in over-classification. If there are terminal nodes with similar relative frequencies, then we may combine them into a single propensity stratum for efficient estimation of propensity scores. We make this decision if the p-value of the χ²-test comparing the relative frequencies between two strata is larger than a prespecified type I error rate, α2. We repeat this procedure until there are no more pairs of strata with a χ²-test p-value larger than the specified α2 level.

We can consider matching and inverse probability weighting (IPW) approaches for an unbiased comparison of the outcome among the K groups based on the final decision tree. For matching, we randomly select patients from each stratum using certain proportions among the K groups so that the set of proportions is identical across the strata. A simple example is to select an equal number of patients from each group within each stratum, corresponding to 1-to-1 matching in a two-group propensity matching. Oftentimes, control groups have many more subjects than case groups. If some groups have many more subjects than the others in all strata, then we may consider an unbalanced matching among the K groups, corresponding to 1-to-m matching in a two-group propensity matching. More specifically, suppose that we want allocation proportions of $(\gamma_1,\ldots,\gamma_K)$ among the K groups with $\gamma_k > 0$ and $\sum_{k=1}^K\gamma_k = 1$. Then, in a propensity matching, we randomly select subjects from the large groups to satisfy the allocation proportions $(\gamma_1,\ldots,\gamma_K)$ within each stratum.

If stratum $j$ has $n_{jk}$ patients in treatment group $k$, then the propensity score for these patients is estimated by $\hat P_{jk} = n_{jk}/\sum_{k'=1}^K n_{jk'}$ and the weights for IPW are given as $w_{ki} = 1/\hat P_{jk} = n_{jk}^{-1}\sum_{k'=1}^K n_{jk'}$. In summary, a decision tree for propensity analysis may proceed as follows; the first node is the whole data set.

  • [I] Classification procedure

    • [Ia] For each predictor, partition a node into two nodes using a value of the predictor as a cutoff value, and calculate the p-value of the $\chi^2_{K-1}$ test.

    • [Ib] Repeat [Ia] for all possible cutoff values with respect to all predictors. If the smallest p-value is smaller than α1, then split the node into two nodes using the corresponding cutoff value.

    • [Ic] If there exist no p-values smaller than α1, then the classification procedure ends.

  • [II] Pooling procedure: Suppose that a classification procedure resulted in D1 nodes (or strata).

    • [IIa] For each of the D1(D1 − 1)/2 pairs of nodes, calculate the p-value of the $\chi^2_{K-1}$ test comparing the allocation proportions between the two nodes.

    • [IIb] In [IIa], if the largest p-value is larger than α2, combine the corresponding pair of nodes into one.

    • [IIc] Repeat [IIa] and [IIb] until there exists no pair of nodes with p-value larger than α2, to result in D2 strata.

  • [III] Among the D2 strata, discard the strata with 0-frequency for any treatment group, to result in D3 strata.

Note that, while steps [I] and [II] do not change the sample size, step [III] can reduce the sample size for the final analysis. Step [I] corresponds to variable addition and step [II] to variable deletion in a stepwise regression.
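
A simplified sketch of steps [I]–[III] is given below. It is an illustration under assumed inputs (a numeric predictor matrix Z and integer group labels 1,…,K), not the authors' Fortran implementation; the last function computes the stratum-based IPW weights $w_{ki} = n_{jk}^{-1}\sum_{k'} n_{jk'}$.

import numpy as np
from scipy.stats import chi2_contingency

def _freq(group, idx, K):
    """Frequencies of treatment groups 1..K among the patients indexed by idx."""
    return np.bincount(group[idx], minlength=K + 1)[1:]

def _pval(tab):
    """p-value of the chi-squared test for a 2 x K table, ignoring all-zero columns."""
    tab = tab[:, tab.sum(axis=0) > 0]
    return chi2_contingency(tab)[1] if tab.shape[1] > 1 else 1.0

def grow(Z, group, idx, K, alpha1=0.05, min_size=5):
    """Step [I]: recursively split a node while some cutoff has p-value below alpha1."""
    best = (alpha1, None, None)
    for j in range(Z.shape[1]):
        for c in np.unique(Z[idx, j])[:-1]:                      # candidate cutoff values
            left, right = idx[Z[idx, j] <= c], idx[Z[idx, j] > c]
            if min(len(left), len(right)) < min_size:            # skip extreme cutoffs
                continue
            p = _pval(np.vstack([_freq(group, left, K), _freq(group, right, K)]))
            if p < best[0]:
                best = (p, left, right)
    if best[1] is None:                                          # step [Ic]: no significant split
        return [idx]
    return (grow(Z, group, best[1], K, alpha1, min_size) +
            grow(Z, group, best[2], K, alpha1, min_size))

def pool_and_trim(strata, group, K, alpha2=0.3):
    """Steps [II] and [III]: merge strata with similar allocation, drop strata missing a group."""
    merged = True
    while merged and len(strata) > 1:
        merged, best = False, (alpha2, None)
        for a in range(len(strata)):
            for b in range(a + 1, len(strata)):
                p = _pval(np.vstack([_freq(group, strata[a], K), _freq(group, strata[b], K)]))
                if p > best[0]:
                    best, merged = (p, (a, b)), True
        if merged:
            a, b = best[1]
            strata[a] = np.concatenate([strata[a], strata[b]])   # step [IIb]: pool the closest pair
            del strata[b]
    return [s for s in strata if np.all(_freq(group, s, K) > 0)] # step [III]

def ipw_from_strata(strata, group, K, n):
    """IPW weights w_ki = (sum over k' of n_jk') / n_jk within each final stratum j."""
    w = np.full(n, np.nan)                                       # NaN marks discarded patients
    for s in strata:
        w[s] = len(s) / _freq(group, s, K)[group[s] - 1]
    return w

# Usage sketch: strata = pool_and_trim(grow(Z, group, np.arange(len(group)), K), group, K)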

3. Weighted Test Statistics

The propensity analysis discussed in the previous section uses the group identity and predictors, but not the outcome data. Once a propensity analysis is completed, we want to conduct a statistical test to compare the groups on an outcome variable incorporating the propensity analysis result. Cole and Hernan (2004) and Xie and Liu (2005) propose a weighted 2-sample log-rank test for survival outcomes.

In a K-sample test, the null hypothesis is that all K groups have the same survival distribution, i.e. $H_0: S_1(t) = \cdots = S_K(t)$ for $t \ge 0$, where $S_k(t)$ denotes the survivor function of the population when all patients receive treatment $k$. The statistical testing method for this null hypothesis differs depending on the study design and the study objective. For example, if all K groups are cases, then we may compare their survival distributions using a one-way ANOVA-type test. On the other hand, if one group is a control and the remaining K − 1 groups are cases, then we may use a Dunnett-type test to compare each case group with the control. Jung and Hui (2002) and Jung et al. (2008) propose sample size formulas for ANOVA-type and Dunnett-type rank tests, respectively, for comparing the survival distributions among K groups when the distributions of the predictors are balanced among treatment groups. In this section, we propose weighted versions of these rank tests and derive their asymptotic distributions for large $n$, assuming that $n_k/n \to \gamma_k\,(> 0)$ for $k = 1,\ldots,K$.

We observe the censored survival time $X_{ki}$, the minimum of the survival and censoring times, and the event indicator $\delta_{ki}$, taking 1 if an event is observed and 0 otherwise, for patient $i (= 1,\ldots,n_k)$ in group $k (= 1,\ldots,K)$, who has a weight $w_{ki}$ from a propensity analysis. For each patient, the censoring time is independent of the survival time. Let $N_{ki}(t) = \delta_{ki} I(X_{ki} \le t)$ and $Y_{ki}(t) = I(X_{ki} \ge t)$ denote the event and at-risk processes for patient $i$ in group $k$, and let $N_k(t) = \sum_{i=1}^{n_k} w_{ki} N_{ki}(t)$, $Y_k(t) = \sum_{i=1}^{n_k} w_{ki} Y_{ki}(t)$, $\bar Y_k(t) = \sum_{i=1}^{n_k} w_{ki}^2 Y_{ki}(t)$, $N(t) = \sum_{k=1}^K N_k(t)$, and $Y(t) = \sum_{k=1}^K Y_k(t)$. Further, let $M_{ki}(t) = w_{ki}\int_0^t\{dN_{ki}(s) - Y_{ki}(s)\,d\Lambda_k(s)\}$, $M_k(t) = \sum_{i=1}^{n_k} M_{ki}(t)$, and $M(t) = \sum_{k=1}^K M_k(t)$. Let $w_k = \sum_{i=1}^{n_k} w_{ki}$, $w = \sum_{k=1}^K w_k = \sum_{k=1}^K\sum_{i=1}^{n_k} w_{ki}$, and $\bar w_k = \sum_{i=1}^{n_k} w_{ki}^2$.

Before discussing K-sample comparison problems, we briefly investigate a weighted Kaplan-Meier estimate as a 1-sample problem.

3.1. Weighted Kaplan-Meier estimate

A weighted estimator of the survival function $S_k(t)$ of group $k$, called the weighted Kaplan-Meier (Kaplan & Meier, 1958) estimator by Galimberti et al. (2002), is obtained as

$$\hat S_k(t) = \prod_{s\le t}\left\{1 - \frac{\Delta N_k(s)}{Y_k(s)}\right\} = \prod_{\{i\le n_k:\,X_{ki}\le t\}}\left\{1 - \frac{\sum_{i'=1}^{n_k} w_{ki'}\,\delta_{ki}\delta_{ki'}\,I(X_{ki'} = X_{ki})}{\sum_{i'=1}^{n_k} w_{ki'}\,I(X_{ki'} \ge X_{ki})}\right\},$$

where $\Delta N_k(t) = N_k(t) - N_k(t-)$. Note that $M_{ki}(t)$ is a 0-mean martingale with variance $w_{ki}^2\int_0^t Y_{ki}(s)\,d\Lambda_k(s)$. Using arguments similar to those of Fleming and Harrington (1991) for the Kaplan-Meier estimator, we can show that $\hat S_k(t)$ is approximately normal with mean $S_k(t)$ and variance $\sigma_k^2(t)$ that can be consistently estimated by

$$\hat\sigma_k^2(t) = \hat S_k^2(t)\int_0^t \frac{\bar Y_k(s)}{Y_k(s)^2}\,d\hat\Lambda_k(s) = \hat S_k^2(t)\sum_{i=1}^{n_k}\frac{\delta_{ki} w_{ki} I(X_{ki}\le t)\sum_{i'=1}^{n_k} w_{ki'}^2 I(X_{ki'}\ge X_{ki})}{\left\{\sum_{i'=1}^{n_k} w_{ki'} I(X_{ki'}\ge X_{ki})\right\}^3},$$

where $\hat\Lambda_k(t) = \int_0^t Y_k(s)^{-1}\,dN_k(s)$ is a weighted version of the Nelson-Aalen estimator (Nelson 1969; Aalen 1978) of the cumulative hazard function $\Lambda_k(t) = -\log S_k(t)$ for group $k$.
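
As a minimal illustration (under assumed array names: x for the censored times $X_{ki}$, delta for the event indicators $\delta_{ki}$, and w for the IPW weights $w_{ki}$ of one treatment group), the weighted Kaplan-Meier and weighted Nelson-Aalen estimates can be computed as follows.

import numpy as np

def weighted_km(x, delta, w):
    """Weighted Kaplan-Meier and Nelson-Aalen estimates at the distinct observed event times."""
    x, delta, w = map(np.asarray, (x, delta, w))
    times = np.unique(x[delta == 1])                 # distinct observed event times
    surv, cumhaz, s, H = [], [], 1.0, 0.0
    for t in times:
        dN = w[(x == t) & (delta == 1)].sum()        # weighted number of events at t
        Y = w[x >= t].sum()                          # weighted number at risk at t
        s *= 1.0 - dN / Y                            # product-limit term of S_hat_k(t)
        H += dN / Y                                  # increment of Lambda_hat_k(t)
        surv.append(s)
        cumhaz.append(H)
    return times, np.array(surv), np.array(cumhaz)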

3.2. One-Way ANOVA-Type Tests

In an ANOVA-type test, we want to test $H_0: \Lambda_1(t) = \cdots = \Lambda_K(t)\,(= \Lambda(t))$ for $t \ge 0$ against $H_1: \Lambda_k(t) \ne \Lambda_{k'}(t)$ for some $k \ne k'$.

A weighted version of the Nelson-Aalen estimator for the common cumulative hazard function $\Lambda(t)$ under $H_0$ is given as $\hat\Lambda(t) = \int_0^t Y(s)^{-1}\,dN(s)$. For testing $H_0$, we consider a weighted log-rank test $W = (W_1,\ldots,W_{K-1})^T$, where

$$W_k = \frac{\sqrt n}{w_k}\int_0^\infty Y_k(t)\{d\hat\Lambda_k(t) - d\hat\Lambda(t)\} = \frac{\sqrt n}{w_k}\left\{\int_0^\infty dN_k(t) - \int_0^\infty\frac{Y_k(t)}{Y(t)}\,dN(t)\right\} = \frac{\sqrt n}{w_k}\left\{\sum_{i=1}^{n_k}\delta_{ki}w_{ki} - \sum_{l=1}^K\sum_{i=1}^{n_l}\delta_{li}w_{li}\,\frac{\sum_{i'=1}^{n_k} w_{ki'} I(X_{ki'}\ge X_{li})}{\sum_{l'=1}^K\sum_{i'=1}^{n_{l'}} w_{l'i'} I(X_{l'i'}\ge X_{li})}\right\}.$$

Appendix A shows that, under $H_0$, $W$ is asymptotically normal with mean $0$ and variance $V$ that can be consistently estimated by $\hat V = (\hat v_{k,k'})_{(K-1)\times(K-1)}$ with

$$\hat v_{k,k'} = \frac{n}{w_k w_{k'}}\sum_{l=1}^K\int_0^\infty\left\{\xi_{kl} - \frac{Y_k(t)}{Y(t)}\right\}\left\{\xi_{k'l} - \frac{Y_{k'}(t)}{Y(t)}\right\}\frac{\bar Y_l(t)}{Y(t)}\,dN(t),$$

where $\xi_{k,k'} = I(k = k')$ and the integral over $dN(t)$ is evaluated as a sum over all observed events, the event of patient $i$ in group $l$ contributing mass $\delta_{li}w_{li}$ at time $X_{li}$, so that the empirical form of $\hat v_{k,k'}$ parallels the last expression for $W_k$ above. Hence, with a specified type I error rate $\alpha$, we can reject $H_0$ if $Q = W^T\hat V^{-1}W$ is larger than $\chi^2_{K-1,1-\alpha}$.
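
The statistic $Q$ can be assembled directly from these formulas by summing over the observed event times. The sketch below is an illustrative, unoptimized implementation under assumed inputs (arrays x, delta, w and integer group labels 1,…,K as in the notation above); it is not the authors' program.

import numpy as np
from scipy.stats import chi2

def anova_weighted_logrank(x, delta, w, group, K):
    """ANOVA-type weighted log-rank test: returns Q = W' V^{-1} W and its chi-squared p-value."""
    x, delta, w, group = map(np.asarray, (x, delta, w, group))
    n = len(x)
    wk = np.array([w[group == k].sum() for k in range(1, K + 1)])       # w_k
    W, V = np.zeros(K), np.zeros((K, K))
    for i in np.where(delta == 1)[0]:                # each observed event contributes mass w_i to dN
        atrisk = x >= x[i]
        Yk = np.array([w[(group == k) & atrisk].sum() for k in range(1, K + 1)])         # Y_k(t)
        Ybk = np.array([(w[(group == k) & atrisk] ** 2).sum() for k in range(1, K + 1)]) # Ybar_k(t)
        Y = Yk.sum()
        r = Yk / Y                                                      # Y_k(t) / Y(t)
        xi_l = (np.arange(1, K + 1) == group[i]).astype(float)          # xi_{k,l} for this event's group l
        W += w[i] * (xi_l - r)
        A = np.eye(K) - r[:, None]                                      # A[k, l] = xi_{kl} - Y_k/Y
        V += w[i] * (A * (Ybk / Y)) @ A.T                               # sum over l of A[k,l] A[k',l] Ybar_l / Y
    W = np.sqrt(n) * W[:K - 1] / wk[:K - 1]
    V = n * V[:K - 1, :K - 1] / np.outer(wk[:K - 1], wk[:K - 1])
    Q = float(W @ np.linalg.solve(V, W))
    return Q, chi2.sf(Q, K - 1)                                         # reject H0 if p-value < alpha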

3.3. Dunnett-Type Tests

In K-sample problems, oftentimes one of the groups is a control and the remaining groups are cases. We assume that group $K$ is the control. In this case, we usually want to test whether each of the $K-1$ case groups is more efficacious than the control group. For $k = 1,\ldots,K-1$, we want to test the null hypothesis $H_k: \Lambda_k(t) = \Lambda_K(t)$ against the alternative hypothesis $\bar H_k: \Lambda_k(t) \ne \Lambda_K(t)$. Denote $H_0 = \cap_{k=1}^{K-1}H_k$ and $H_a = \cup_{k=1}^{K-1}\bar H_k$. For $k = 1,\ldots,K-1$, the weighted log-rank statistic (Peto & Peto 1972; Jung & Hui 2002) for testing $H_k$ is given by

$$U_k = \frac{\sqrt n\,(w_k+w_K)}{w_k w_K}\int_0^\infty\frac{Y_k(t)Y_K(t)}{Y_k(t)+Y_K(t)}\{d\hat\Lambda_K(t) - d\hat\Lambda_k(t)\} = \frac{\sqrt n\,(w_k+w_K)}{w_k w_K}\left\{\sum_{i=1}^{n_K}\delta_{Ki}w_{Ki}\frac{Y_k(X_{Ki})}{Y_k(X_{Ki})+Y_K(X_{Ki})} - \sum_{i=1}^{n_k}\delta_{ki}w_{ki}\frac{Y_K(X_{ki})}{Y_k(X_{ki})+Y_K(X_{ki})}\right\}.$$

Note that a positive $U_k$ value implies that group $k$ has longer survival than group $K$. By Appendix B, under $H_k$, $U_k$ is approximately normal with mean $0$ and variance $\sigma_k^2$ that can be consistently estimated by

$$\hat\sigma_k^2 = \frac{n(w_k+w_K)^2}{w_k^2 w_K^2}\int_0^\infty\frac{Y_k^2(t)\bar Y_K(t) + Y_K^2(t)\bar Y_k(t)}{\{Y_k(t)+Y_K(t)\}^2}\,d\hat\Lambda_{(k)}(t) = \frac{n(w_k+w_K)^2}{w_k^2 w_K^2}\left[\sum_{i=1}^{n_k}\delta_{ki}w_{ki}\frac{Y_k^2(X_{ki})\bar Y_K(X_{ki}) + Y_K^2(X_{ki})\bar Y_k(X_{ki})}{\{Y_k(X_{ki})+Y_K(X_{ki})\}^3} + \sum_{i=1}^{n_K}\delta_{Ki}w_{Ki}\frac{Y_k^2(X_{Ki})\bar Y_K(X_{Ki}) + Y_K^2(X_{Ki})\bar Y_k(X_{Ki})}{\{Y_k(X_{Ki})+Y_K(X_{Ki})\}^3}\right],$$

where $\hat\Lambda_{(k)}(t) = \int_0^t\{Y_k(s)+Y_K(s)\}^{-1}\{dN_k(s)+dN_K(s)\}$ denotes the weighted Nelson-Aalen estimator under $H_k$.
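
Analogously, the standardized statistic $U_k/\hat\sigma_k$ for one case group against the control can be computed by summing over the events of groups $k$ and $K$; the following is an illustrative sketch under the same assumed inputs as above.

import numpy as np

def dunnett_weighted_logrank(x, delta, w, group, k, K):
    """Standardized weighted log-rank statistic U_k / sigma_hat_k for group k versus control group K."""
    x, delta, w, group = map(np.asarray, (x, delta, w, group))
    n = len(x)
    wk, wK = w[group == k].sum(), w[group == K].sum()
    U, Vsum = 0.0, 0.0
    for i in np.where(np.isin(group, [k, K]) & (delta == 1))[0]:   # events in group k or K only
        atrisk = x >= x[i]
        Yk = w[(group == k) & atrisk].sum()                        # Y_k(t)
        YK = w[(group == K) & atrisk].sum()                        # Y_K(t)
        Ybk = (w[(group == k) & atrisk] ** 2).sum()                # Ybar_k(t)
        YbK = (w[(group == K) & atrisk] ** 2).sum()                # Ybar_K(t)
        if group[i] == K:
            U += w[i] * Yk / (Yk + YK)                             # event in the control group
        else:
            U -= w[i] * YK / (Yk + YK)                             # event in case group k
        Vsum += w[i] * (Yk**2 * YbK + YK**2 * Ybk) / (Yk + YK) ** 3
    scale = np.sqrt(n) * (wk + wK) / (wk * wK)                     # common factor of U_k
    sigma = np.sqrt(n * (wk + wK) ** 2 / (wk**2 * wK**2) * Vsum)   # sigma_hat_k
    return scale * U / sigma                                       # compare |U_k / sigma_hat_k| with c_alpha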

If we reject each $H_k$ with a type I error probability of $\alpha$, the probability of falsely rejecting the global null hypothesis $H_0$ will be larger than the nominal $\alpha$ level because of multiple testing. To avoid this issue, statisticians control the FWER. For a chosen critical value $c$, we reject $H_k$ if $|U_k/\hat\sigma_k| > c$. Dunnett-type tests control the FWER at $\alpha$ by choosing $c = c_\alpha$ satisfying

$$\alpha = P\left\{\left.\bigcup_{k=1}^{K-1}\left(|U_k/\hat\sigma_k| > c\right)\,\right|\,H_0\right\}. \qquad (2)$$

The Bonferroni test, which rejects $H_k$ with a type I error probability of $\alpha/(K-1)$, is too conservative.

From Appendix B, for large $n$ under $H_0$, $(U_1/\hat\sigma_1,\ldots,U_{K-1}/\hat\sigma_{K-1})$ is approximately normal with means $0$, variances $1$, and correlation coefficients

$$\hat\rho_{kk'} = \frac{w_k w_{k'}\bar w_K}{\sqrt{(w_K^2\bar w_k + w_k^2\bar w_K)(w_K^2\bar w_{k'} + w_{k'}^2\bar w_K)}}$$

for $1 \le k < k' \le K-1$. Hence, we can obtain $c = c_\alpha$ from (2) by using a numerical method (Genz & Bretz 2000; Gassmann et al. 2002) or a simulation method (Bang et al. 2005). We focus on the former method in this paper. Let $\phi_{K-1}(u_1,\ldots,u_{K-1})$ denote the joint probability density function (PDF) of the $(K-1)$-variate normal distribution with marginal means $0$, variances $1$, and correlation coefficients $\hat\rho_{kk'}$. Then, from (2), we obtain $c = c_\alpha$ by solving

$$\alpha = 1 - \int_{-c}^{c}\cdots\int_{-c}^{c}\phi_{K-1}(u_1,\ldots,u_{K-1})\,du_1\cdots du_{K-1} \qquad (3)$$

with respect to c.

If $(Z_1, Z_2)$ is bivariate normal with means $(\mu_1,\mu_2)$, variances $(\sigma_1^2,\sigma_2^2)$, and correlation coefficient $\rho$, then it is well known that $Z_1$ conditional on $Z_2 = z_2$ is normal with mean $\mu_1 + \rho(\sigma_1/\sigma_2)(z_2-\mu_2)$ and variance $\sigma_1^2(1-\rho^2)$. Hence, for $K = 3$, (3) can be expressed using a one-dimensional integration

$$\alpha = 1 - \int_{-c}^{c}\phi(u)\left\{\Phi\left(\frac{c - \hat\rho_{12}u}{\sqrt{1-\hat\rho_{12}^2}}\right) - \Phi\left(\frac{-c - \hat\rho_{12}u}{\sqrt{1-\hat\rho_{12}^2}}\right)\right\}du,$$

where $\phi(\cdot)$ and $\Phi(\cdot)$ are the probability density function and the cumulative distribution function of $N(0,1)$.

Our Dunnett-type test is conducted as follows.

  1. Calculate the weights (wki, i = 1,…, nk, k = 1,…,K) using the multinomial logistic regression or decision tree method.

  2. By solving (3), obtain the critical value cα.

  3. For k = 1,…,K − 1, we reject $H_k$ if $|U_k/\hat\sigma_k| > c_\alpha$.

Given an observed test statistic $z_k = U_k/\hat\sigma_k$ for group $k (= 1,\ldots,K-1)$, the FWER-adjusted p-value (Jung et al. 2005), $p_k$, is given as

$$p_k = 1 - \int_{-|z_k|}^{|z_k|}\cdots\int_{-|z_k|}^{|z_k|}\phi_{K-1}(u_1,\ldots,u_{K-1})\,du_1\cdots du_{K-1},$$

which is simplified to

$$p_k = 1 - \int_{-|z_k|}^{|z_k|}\phi(u)\left\{\Phi\left(\frac{|z_k| - \hat\rho_{12}u}{\sqrt{1-\hat\rho_{12}^2}}\right) - \Phi\left(\frac{-|z_k| - \hat\rho_{12}u}{\sqrt{1-\hat\rho_{12}^2}}\right)\right\}du$$

for K = 3.
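
For $K = 3$, the one-dimensional forms above are easy to evaluate numerically. The following sketch (with assumed argument names, using scipy) computes $\hat\rho_{12}$ from the weights of the two case groups and the control, the critical value $c_\alpha$ from (3), and the FWER-adjusted p-value.

import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import norm

def rho_hat(w1, w2, wK):
    """Estimated correlation rho_hat_12 from the weight arrays of groups 1, 2 and the control K."""
    s1, s2, sK = w1.sum(), w2.sum(), wK.sum()                     # w_k
    b1, b2, bK = (w1**2).sum(), (w2**2).sum(), (wK**2).sum()      # wbar_k
    return (s1 * s2 * bK) / np.sqrt((sK**2 * b1 + s1**2 * bK) * (sK**2 * b2 + s2**2 * bK))

def _coverage(c, rho):
    """P(|Z1| <= c, |Z2| <= c) for a standard bivariate normal with correlation rho."""
    f = lambda u: norm.pdf(u) * (norm.cdf((c - rho * u) / np.sqrt(1 - rho**2)) -
                                 norm.cdf((-c - rho * u) / np.sqrt(1 - rho**2)))
    return quad(f, -c, c)[0]

def dunnett_critical_value(rho, alpha=0.05):
    """Critical value c_alpha solving equation (3) for K = 3."""
    return brentq(lambda c: 1.0 - _coverage(c, rho) - alpha, 1e-6, 8.0)

def adjusted_pvalue(z_k, rho):
    """FWER-adjusted p-value p_k = 1 - P(|Z1| <= |z_k|, |Z2| <= |z_k|)."""
    return 1.0 - _coverage(abs(z_k), rho)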

4. Numerical Studies

4.1. Simulations

We conduct simulation studies to evaluate the performance of the K-sample comparison methods combined with the propensity analysis methods under various settings. We consider comparing the survival distributions of K = 3 groups using the ANOVA or Dunnett test combined with the IPW approach. The third group (k = 3) is the control group for the Dunnett test. The weights are estimated by the multinomial logistic regression or decision tree method.

For patient i, the predictor $z_i$ is generated from U(0, 3). Given a predictor value $z_i$, the allocation proportion for each group is determined by one of the following models.

(A1) (Multinomial logistic model)

$$\gamma_1 = \gamma_2 = \frac{\exp(0.15 z_i)}{1 + 2\exp(0.15 z_i)}, \qquad \gamma_3 = 1 - 2\gamma_1$$

(A2) (Multinomial logistic model with an interval with controls only)

$$\gamma_1 = \gamma_2 = \frac{\exp(0.15 z_i)}{1 + 2\exp(0.15 z_i)}, \qquad \gamma_3 = 1 - 2\gamma_1$$

if zi ∈ [0, 2], and γ3 = 1 if zi ∈ (2, 3].

The survival distribution of patient i in treatment group k with predictor zi is generated from an exponential distribution with an annual hazard rate of

$$\lambda_i = 0.2\exp(\beta_k + 0.5 z_i).$$

Note that βk denotes the effect of group k and 0.5zi denotes the impact of the predictor on the survival outcome. We consider a null hypothesis H0 : β1 = β2 = β3 = 0, and alternative hypotheses H1 : (β1, β2, β3) = (−0.35,−0.35,0.05) and H2 : (β1, β2, β3) = (−0.45,−0.25,0.15).

We consider another model with allocation probabilities non-monotone in zi.

(A3) (A non-multinomial logistic model)

$$\gamma_1 = \gamma_2 = \frac{1.25}{\sqrt{2\pi}}\exp\left\{-\frac{(z_i-1.5)^2}{2}\right\}, \qquad \gamma_3 = 1 - 2\gamma_1$$

For (A3), the propensity score for group 3 is close to 1 for $z_i$ close to 0 or 3, and close to 0 for $z_i = 1.5$. This propensity model is matched with an exponential survival distribution whose hazard rate is non-monotone in $z_i$,

$$\lambda_i = 0.4\exp\{\beta_k + 0.3 - 0.6\,I(1 < z_i < 2)\}.$$

The censoring variable is generated from U(2, 7), mimicking a study with 5 years of patient accrual and 2 years of additional follow-up, to generate about 20% censoring. The parameter values for the propensity models and survival distributions are chosen to give at least 80% power for the decision tree method.
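
For concreteness, one simulated data set under allocation model (A1), with the exponential outcome model and U(2, 7) censoring described above, could be generated as in the following sketch (illustrative only; the random-number details are our assumptions).

import numpy as np

rng = np.random.default_rng(2024)

def simulate_a1(n=1000, beta=(0.0, 0.0, 0.0)):
    """One simulated data set: predictor z, group label, censored time X, and event indicator."""
    z = rng.uniform(0.0, 3.0, n)
    g1 = np.exp(0.15 * z) / (1.0 + 2.0 * np.exp(0.15 * z))     # gamma_1 = gamma_2 under (A1)
    probs = np.column_stack([g1, g1, 1.0 - 2.0 * g1])          # allocation probabilities
    group = np.array([rng.choice([1, 2, 3], p=p) for p in probs])
    lam = 0.2 * np.exp(np.asarray(beta)[group - 1] + 0.5 * z)  # exponential hazard lambda_i
    t = rng.exponential(1.0 / lam)                             # latent survival times
    c = rng.uniform(2.0, 7.0, n)                               # censoring times
    return z, group, np.minimum(t, c), (t <= c).astype(int)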

For the decision tree method, we use α1 = 0.05 for classification and α2 = 0.3 for pooling, and extreme cutoff values resulting in fewer than 5 patients in one stratum are not considered. We consider classification only (I), classification and pooling (I+II), classification and discarding the strata with 0-frequency groups (I+III), or all three steps (I+II+III), and report the average number of strata D for each of these procedures.

Under each simulation setting, we generate 10,000 simulation data sets of size n = 1,000, and apply the ANOVA or Dunnett test using the weights estimated by multinomial logistic regression or decision tree method to each sample. The empirical type I error rate and power are estimated by the proportion of simulation samples rejecting the null hypothesis with nominal α = 0.05 among 10,000 simulation samples.

Table 2(a) summarizes the simulation results. If the treatment selection follows a multinomial logistic model (A1), then the weighted rank tests using the weights from both multinomial logistic and decision tree methods control the type I error rate closely under H0. Under allocation model (A1), the decision tree method does not seem to need steps (II) and (III) in addition to classification (I). If there exists an interval of predictor zi with control patients only (A2) or if the allocation probabilities do not have monotone trends in zi (A3), then the weighted rank tests using the propensity scores from multinomial logistic regression do not control the type I error accurately.

Table 2.

Empirical rejection probabilities of the weighted log-rank tests based on the multinomial logistic regression and decision tree propensity methods. For the decision tree method, we also report the average number of strata (D) after the first classification (I), after pooling the strata with similar allocation proportions (I+II), after discarding the strata with no frequency in any of the three treatment groups (I+III), and after both pooling the strata with similar allocation proportions and discarding the strata with no frequency in any of the three treatment groups (I+II+III).

(a) When K = 3

Allocation  Censoring  Multinomial     Tree (I)             Tree (I+II)          Tree (I+III)          Tree (I+II+III)
                       ANO    Dun      ANO    Dun    D      ANO    Dun    D      ANO    Dun    D       ANO    Dun    D
(i) Under H0: β1 = β2 = β3
A1          19.6%      0.034  0.034    0.039  0.039  4.1    0.039  0.040  3.5    0.039  0.039  3.8     0.039  0.039  3.2
A2          19.6%      0.098  0.109    0.664  0.680  4.2    0.665  0.679  3.8    0.047  0.044  3.0     0.047  0.045  2.6
A3          17.4%      0.798  0.834    0.083  0.085  11.7   0.083  0.085  8.0    0.060  0.063  10.3    0.059  0.064  7.0
(ii) Under H1: β1 = β2 < β3
A1          26.6%      0.962  0.971    0.942  0.955  4.1    0.940  0.955  3.5    0.942  0.955  3.8     0.941  0.955  3.2
A2          24.4%      0.998  0.998    1.000  1.000  4.2    1.000  1.000  3.8    0.921  0.936  2.9     0.920  0.935  2.6
A3          23.4%      1.000  1.000    0.854  0.866  11.7   0.853  0.864  8.0    0.847  0.871  10.3    0.846  0.870  7.0
(iii) Under H2: β1 < β2 < β3
A1          24.6%      0.900  0.912    0.871  0.884  4.1    0.869  0.883  3.5    0.972  0.884  3.8     0.870  0.883  3.2
A2          21.8%      0.999  0.999    1.000  1.000  4.2    1.000  1.000  3.8    0.964  0.973  3.0     0.964  0.972  2.6
A3          21.2%      1.000  1.000    0.897  0.839  11.7   0.898  0.839  8.1    0.891  0.842  10.3    0.892  0.844  7.0

Under the (A2) and (A3) allocation models, the decision tree method controls the type I error accurately when step (III) is incorporated. From the average number of strata D, as expected, step (III) greatly decreases the number of strata for model (A2), and step (II) greatly decreases the number of strata for model (A3).

Under H1 and H2, the empirical powers of the cases where the type I error rate is accurately controlled are boldfaced. The Dunnett-type test is slightly more powerful than the ANOVA-type test in most of the simulation settings, except for model (A3) under H2. If a multinomial logistic allocation model is valid, then the weighted rank tests based on the multinomial logistic regression method have slightly higher power than those based on the decision tree method.

A reviewer suggested adding simulations with more general survival distributions, such as the Weibull, comparisons among K = 4 treatment groups, and a setting similar to that of the real data example presented below. We generate m = 4 correlated covariates as follows. First, a random vector $(u_1,u_2,u_3,u_4)$ is generated from the multivariate normal distribution with marginal means 0, variances 1, and an exchangeable dependency with correlation coefficient 0.5. These random variables are transformed as $z_1 = u_1$, $z_2 = \langle u_2\rangle$, $z_3 = \Phi(u_3)$, and $z_4 = I(u_1 \le 0.5)$, where $\langle a\rangle$ denotes $a$ rounded down. Note that, marginally, $z_3$ is U(0,1) and $z_4$ is a Bernoulli(0.5) random variable. The treatment selection follows a multinomial logistic model with $\beta_{0k} = 0$, $\beta_{1k} = \beta_{2k} = 0.3$, and $\beta_{3k} = \beta_{4k} = 0.2$ for k = 1, 2, 3. By this multinomial logistic model, about 25.8% of patients are allocated to each of groups 1, 2, and 3, and the remaining 22.5% are allocated to the control group 4. The survival time for a patient in treatment group k with covariate values $(z_1,z_2,z_3,z_4)$ is generated from the Weibull hazard function given as

$$\lambda(t\,|\,z) = \nu\lambda_0(\lambda_0 t)^{\nu-1}\exp(\beta_k + \theta_1 z_1 + \theta_2 z_2 + \theta_3 z_3 + \theta_4 z_4)$$

with $\theta_1 = \theta_2 = \theta_3 = \theta_4 = 0.25$ and $\nu = 0.5$. We consider a null hypothesis $H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$, and alternative hypotheses $H_1: (\beta_1,\beta_2,\beta_3,\beta_4) = (-0.65,-0.65,-0.65,0)$ and $H_2: (\beta_1,\beta_2,\beta_3,\beta_4) = (-0.75,-0.5,-0.25,0)$. Under each hypothesis, $\lambda_0$ is selected to give about 33% censoring with the U(2,7) censoring distribution. The other simulation parameters are set identically to those of Table 2(a) for K = 3. Table 2(b) reports simulation results with n = 400. We observe that, under $H_0$, the weighted ANOVA and Dunnett tests control the type I error rate close to the nominal level α = 0.05 using multinomial logistic regression and the decision tree with its different options, while the unweighted rank tests are severely anticonservative. Using multinomial logistic regression, the weighted ANOVA test is slightly more powerful than the weighted Dunnett test under both $H_1$ and $H_2$, while, using the decision tree method, the weighted Dunnett test is slightly more (less) powerful than the weighted ANOVA test under $H_1$ ($H_2$). The decision tree method has similar power across its options and is slightly less powerful than the multinomial logistic regression method.

Table 2.(b).

When K = 4 (including results for unweighted ANOVA and Dunnett tests)

Hypothesis  Unweighted     Multinomial    Tree (I)            Tree (I+II)         Tree (I+II+III)
            ANO    Dun     ANO    Dun     ANO    Dun    D     ANO    Dun    D     ANO    Dun    D
H0          0.123  0.153   0.037  0.037   0.040  0.047  6.4   0.043  0.045  5.0   0.039  0.038  3.8
H1          0.764  0.815   0.948  0.966   0.867  0.909  6.5   0.866  0.904  5.0   0.866  0.894  3.9
H2          0.857  0.796   0.947  0.957   0.904  0.888  6.3   0.897  0.886  4.9   0.896  0.879  3.8

4.2. A Real Data Example

The proposed methods are applied to a real clinical observational study. Postoperative analgesic methods are suggested to have an impact on long-term prognosis after cancer surgery through opioid-induced immune suppression. Lee et al. (2017) report analysis results comparing the recurrence-free survival (RFS), defined as the time from surgery to cancer recurrence or death, among three analgesic methods (K = 3): intravenous patient controlled analgesia (PCA, group 1), paravertebral block (PVB, group 2), and thoracic epidural analgesia (TEA, group 3), for lung cancer surgery. Excluding cases with missing data, our analysis includes a total of 363 patients (111 for PCA, 137 for PVB, and 115 for TEA), among whom 111 patients had disease recurrence. We consider four covariates, body mass index (BMI), smoking, cancer stage, and blood transfusion during surgery (BT), that are known to be important predictors for the outcome in this study population.

We estimate the weights by the multinomial logistic regression and decision tree methods. Table 3 reports the regression estimates, standard errors, and p-values from the multinomial logistic propensity analysis. Since the regression estimates for smoking from the two logistic regressions have negative signs, smokers tend to belong to the TEA group. Figure 1 shows the strata identified by the decision tree method. In the decision tree analysis, we used α1 = 0.2 for a slightly finer classification and α2 = 0.3 for pooling. In Figure 1, each leaf represents a stratum of patients with similar propensity scores, and the frequencies of the three groups within each stratum are given in a box. The p-value at each branch is from the χ² test with 2 (= (3−1)×(2−1)) degrees of freedom for classification or pooling. The classification step identified 8 strata, but two of them are combined during the pooling step, resulting in a total of 7 strata to be used for comparing RFS among the three treatment groups. Figure 2 reports the weighted Kaplan-Meier curves based on multinomial logistic regression (Figure 2a) and the decision tree (Figure 2b). The IPW method using the multinomial logistic regression for testing the three groups gives p-value = 0.040 by the ANOVA test, and FWER-adjusted p-values = 0.022 between PCA and TEA and 0.704 between PVB and TEA by the Dunnett test. On the other hand, the IPW method using the decision tree gives p-value = 0.025 by the ANOVA test, and FWER-adjusted p-values = 0.011 between PCA and TEA and 0.469 between PVB and TEA by the Dunnett test. From these analyses, it seems that, compared to TEA, PCA has significantly longer RFS, while PVB seems to have similar RFS. All p-values in this analysis are 2-sided.

Table 3:

Regression estimates (EST), standard errors (SE) and p-values (PVAL) of logistic regression models for propensity analysis using TEA group as the reference group

PCA vs. TEA PVB vs. TEA
Parameter EST SE PVAL EST SE PVAL
Intercept −0.636 0.690 0.178 1.039 0.5766 0.036
BMI (<18.5 vs. 18.5–25 vs. ≥ 25) 0.235 0.275 0.197 −0.336 0.240 0.081
Smoking (yes/no) −0.706 0.340 0.019 −0.631 0.303 0.019
Stage (<3 vs. ≥ 3) −0.973 0.351 0.003 0.134 0.276 0.313
BT (yes/no) 1.353 0.316 0.000 −0.034 0.314 0.457

Figure 1.

Decision tree analysis of Lee et al. (2017). Each leaf, denoting a stratum, shows the frequencies of the PCA, PVB, and TEA treatment groups.

Figure 2.

Weighted Kaplan-Meier curves: (a) using multinomial logistic regression; (b) using the decision tree method.

We also analyze the data after balanced (1-to-1-to-1) matching. Using the multinomial logistic regressions, we partitioned the propensity scores of each group into J1 = J2 = 3 intervals, resulting in 234 matched observations. From Figure 1, the seven strata from the decision tree method result in 240 matched observations when all patients of the smallest group within each stratum are kept. From Table 4, we observe that the distributions of the predictors are well balanced among the three groups using both the multinomial regression and decision tree methods. Figure 3 displays the Kaplan-Meier curves of RFS among the three matched groups using the multinomial regression and decision tree methods. We observe that the Kaplan-Meier curves for matched data using the multinomial logistic regression method (Figure 3a) are similar to the weighted Kaplan-Meier curves of Figure 2a, while those of the PVB and TEA groups using the decision tree method (Figure 3b) are more separated than the corresponding weighted Kaplan-Meier curves in Figure 2b. Using the propensity scores from multinomial logistic regression, the matched data give p-value = 0.066 by the (unweighted) ANOVA test (Jung & Hui 2002), and FWER-adjusted p-values = 0.052 between PCA and TEA and 0.979 between PVB and TEA by the (unweighted) Dunnett test (Jung et al. 2008). Using the propensity analysis from the decision tree, the matched data give p-value = 0.057 by the ANOVA test, and FWER-adjusted p-values = 0.029 between PCA and TEA and 0.500 between PVB and TEA by the Dunnett test. With the decreased sample sizes, the unweighted statistical tests using matched data give slightly less significant results than, but similar conclusions to, the weighted tests using the full data.

Table 4:

Matched data: distributions of covariates among three patient groups and p-values from χ2 tests

                         Multinomial Regression              Decision Tree
Covariate                PCA   PVB   TEA   p-value           PCA   PVB   TEA   p-value
BMI       <18.5            4     3     4   0.889               3     3     1   0.861
          18.5–25         55    53    55                      52    50    52
          ≥25             27    22    19                      25    27    27
Smoking   No              72    68    66   0.815              66    65    63   0.828
          Yes             14    10    12                      14    15    17
Stage     <3              66    58    59   0.939              62    62    62   1.000
          ≥3              20    20    19                      18    18    18
BT        No              62    61    53   0.351              58    60    60   0.917
          Yes             24    17    25                      22    20    20

Figure 3.

Kaplan-Meier curves of matched data: (a) based on propensity analysis using multinomial logistic regression; (b) based on propensity analysis using the decision tree method.

The clinical study was designed to show that, compared to the local anesthetic based analgesic methods (TEA and PVB), the opioid based analgesic method (PCA) would decrease RFS. Because of the high failure rate and unstable hemodynamics frequently observed with the local anesthetic based methods, however, these methods were shown to have shorter RFS than the opioid based method.

4.3. Discussions and Conclusions

We have investigated K-group comparisons for observational studies with a survival endpoint. The weighted rank tests proposed in this paper can be easily modified for other types of endpoints, such as binary or continuous types. In order to estimate the weights (propensity scores) of the rank tests, we reviewed the multinomial logistic regression and proposed a decision tree method as an alternative.

Through simulations, the proposed weighted rank tests were shown to perform well as long as the weights were accurately estimated by the propensity analysis methods. We found that the decision tree method provided very robust propensity scores overall if it incorporates a step to discard strata with zero frequency for any treatment group, which ensures comparability among the K treatment groups for all strata included in the analysis. The popular multinomial logistic model was found to provide powerful and robust propensity scores if the allocation proportion to each group is monotone in the predictors and there is no range of predictor values with zero frequency for any group. When there exists a range with no observations for some treatment groups, Zanutto et al. (2005) propose to find a common support of the (K − 1)-dimensional propensity scores and discard the observations not in the common support. In a real data analysis, however, it is not always easy to identify a common support since the estimated propensity scores are discrete in nature. We can do this if we partition the range of propensity scores and define strata as in our real data analysis.

Zhu & Lu (2015) propose a Dunnett-type test derived from a Cox regression model

$$\lambda_j(t) = \lambda_{0j}\exp(\beta_1 x_1 + \cdots + \beta_{K-1} x_{K-1})$$

for patients in propensity stratum j with covariates $(x_1,\ldots,x_{K-1})$, where $x_k = 1$ if a patient belongs to treatment group $k$ and $x_k = 0$ otherwise. They claim that $\beta_k$ measures the difference in survival distribution between group $k$ and the control group $K$, but this is not true. Actually, $\beta_k$ measures the difference between group $k$ and the remaining $K-1$ groups combined. Furthermore, they assume a specific covariance structure, called the one-factor structure, among the regression estimators to derive the multiplicity-adjusted critical value $c$ using Hsu's (1992) approach. Our method does not require this assumption, which may not hold for a given data set.

We have focused on the IPW method in this paper, but one may want to use standard (unweighted) test statistics with matched data. In this case, the decision tree method automatically and optimally defines strata, which makes data matching easy, while the multinomial logistic regression method requires an additional step to define strata. In the analysis of our example, we simply partitioned each dimension of the propensity scores so that each interval has similar propensity scores, but we could use a more technical (K − 1)-dimensional clustering method, such as a decision tree or machine learning, e.g. Westreich et al. (2010), Abadie and Imbens (2006), and Lee et al. (2009). The computer programs are written in Fortran and are available upon request from the first author.

Appendices: Asymptotic Distribution of Weighted Log-Rank Tests

In the following appendices, we use the notation $\psi_k = \lim_{n\to\infty} w_k/n$, $\psi = \lim_{n\to\infty} w/n = \sum_{k=1}^K\psi_k$, and $\bar\psi_k = \lim_{n\to\infty}\bar w_k/n$.

Appendix A. An ANOVA-type Test

Under H0, we have

$$W_k = \frac{\sqrt n}{w_k}\int_0^\infty Y_k(t)\{d\hat\Lambda_k(t) - d\hat\Lambda(t)\} = \frac{\sqrt n}{w_k}\left\{\int_0^\infty dM_k(t) - \int_0^\infty\frac{Y_k(t)}{Y(t)}\,dM(t)\right\} = \frac{\sqrt n}{w_k}\sum_{l=1}^K\int_0^\infty\left\{\xi_{kl} - \frac{Y_k(t)}{Y(t)}\right\}\sum_{i=1}^{n_l} dM_{li}(t).$$

Note that $\{M_{ki}(t), i = 1,\ldots,n_k, k = 1,\ldots,K\}$ are independent 0-mean martingales under $H_0$, and $R_k(t) = Y_k(t)/Y(t)$ is a predictable process that converges uniformly to $r_k(t) = \psi_k S_k(t)/\sum_{l=1}^K\psi_l S_l(t)$. Hence, by the martingale central limit theorem (e.g. Fleming & Harrington 1991), $W = (W_1,\ldots,W_{K-1})^T$ is asymptotically normal with mean $0$ and variance $V = (v_{k,k'})_{(K-1)\times(K-1)}$, where

$$v_{k,k'} = \frac{1}{\psi_k\psi_{k'}}\sum_{l=1}^K\int_0^\infty\{\xi_{kl} - r_k(t)\}\{\xi_{k'l} - r_{k'}(t)\}\,\bar y_l(t)\,d\Lambda(t).$$

Here $\bar y_k(t) = \lim_{n\to\infty}\bar Y_k(t)/n = \bar\psi_k S_k(t)G(t)$, and $G(t)$ is the survivor function of the censoring distribution.

$V$ can be consistently estimated by replacing $\psi_k$, $\bar y_k(t)$, $r_k(t)$ and $\Lambda(t)$ with their consistent estimators $w_k/n$, $\bar Y_k(t)/n$, $R_k(t)$ and $\hat\Lambda(t)$, respectively, i.e. $\hat V = (\hat v_{k,k'})_{(K-1)\times(K-1)}$ with

$$\hat v_{k,k'} = \frac{n}{w_k w_{k'}}\sum_{l=1}^K\int_0^\infty\left\{\xi_{kl} - \frac{Y_k(t)}{Y(t)}\right\}\left\{\xi_{k'l} - \frac{Y_{k'}(t)}{Y(t)}\right\}\bar Y_l(t)\,d\hat\Lambda(t).$$

Appendix B. A Dunnett-Type Test

Under H0 : Λ1(t) = ⋯ = ΛK(t)(= Λ(t)), for k = 1,…,K − 1, we have

$$U_k = \frac{\sqrt n\,(w_k+w_K)}{w_k w_K}\int_0^\infty\frac{Y_k(t)Y_K(t)}{Y_k(t)+Y_K(t)}\left\{\frac{dM_K(t)}{Y_K(t)} - \frac{dM_k(t)}{Y_k(t)}\right\},$$

which can be expressed as

$$U_k = \frac{\psi_k+\psi_K}{\sqrt n\,\psi_k\psi_K}\int_0^\infty\frac{y_k(t)y_K(t)}{y_k(t)+y_K(t)}\left\{\frac{dM_K(t)}{y_K(t)} - \frac{dM_k(t)}{y_k(t)}\right\} + o_p(1). \qquad (A1)$$

Using the martingale central limit theorem, we can show that, under H0, Uk is asymptotically normal with mean 0 and variance

$$\sigma_k^2 = \frac{(\psi_k+\psi_K)^2}{\psi_k^2\psi_K^2}\int_0^\infty\frac{y_k^2(t)\bar y_K(t) + y_K^2(t)\bar y_k(t)}{\{y_k(t)+y_K(t)\}^2}\,d\Lambda(t)$$

which can be consistently estimated by

$$\hat\sigma_k^2 = \frac{n(w_k+w_K)^2}{w_k^2 w_K^2}\int_0^\infty\frac{Y_k^2(t)\bar Y_K(t) + Y_K^2(t)\bar Y_k(t)}{\{Y_k(t)+Y_K(t)\}^2}\,d\hat\Lambda_{(k)}(t)$$

by replacing the parameters with their estimators obtained using the data of groups k and K.

By the multivariate martingale central limit theorem applied to (A1), $(U_1,\ldots,U_{K-1})$ is approximately normal with mean $0$ and variance-covariance matrix $\Sigma = (\sigma_{kk'})_{(K-1)\times(K-1)}$ with $\sigma_{kk} = \sigma_k^2$ and

$$\sigma_{kk'} = \frac{(\psi_k+\psi_K)(\psi_{k'}+\psi_K)}{\psi_k\psi_{k'}\psi_K^2}\int_0^\infty\frac{y_k(t)y_{k'}(t)\bar y_K(t)}{\{y_k(t)+y_K(t)\}\{y_{k'}(t)+y_K(t)\}}\,d\Lambda(t)$$

for kk′. Under H0, we have yk(t) = ψkS(t)G(t) and y¯k(t)=ψ¯kS(t)G(t), where S(t) = exp{−Λ(t)} denotes the common survivor function. Hence, under H0, we have

$$\sigma_k^2 = -\left(\frac{\bar\psi_k}{\psi_k^2} + \frac{\bar\psi_K}{\psi_K^2}\right)\int_0^\infty G(t)\,dS(t)$$

and

$$\sigma_{kk'} = -\frac{\bar\psi_K}{\psi_K^2}\int_0^\infty G(t)\,dS(t)$$

for 1 ≤ k < k′ ≤ K − 1, so that the correlation coefficient between Uk and Uk′ is given as

$$\rho_{kk'} = \frac{\psi_k\psi_{k'}\bar\psi_K}{\sqrt{(\psi_K^2\bar\psi_k + \psi_k^2\bar\psi_K)(\psi_K^2\bar\psi_{k'} + \psi_{k'}^2\bar\psi_K)}}.$$

Note that the correlation coefficients depend only on the weights, but not on the censoring or survival distribution.

By replacing $\psi_k$ and $\bar\psi_k$ with their consistent estimators $w_k/n$ and $\bar w_k/n$, respectively, we obtain a consistent estimator of $\rho_{kk'}$,

$$\hat\rho_{kk'} = \frac{w_k w_{k'}\bar w_K}{\sqrt{(w_K^2\bar w_k + w_k^2\bar w_K)(w_K^2\bar w_{k'} + w_{k'}^2\bar w_K)}}$$

for 1 ≤ k < k′ ≤ K − 1.

REFERENCES

  1. Aalen OO (1978). Nonparametric inference for a family of counting processes. Annals of Statistics, 6, 701–726.
  2. Abadie A & Imbens GW (2006). Large sample properties of matching estimators for average treatment effects. Econometrica, 74, 235–267.
  3. Bang HJ, Jung SH, & George SL (2005). A simulation-based multiple testing procedure and sample size calculation. Journal of Biopharmaceutical Statistics, 15, 957–967.
  4. Breslow NE, Lumley T, Ballantyne CM, Chambless LE, & Kulich M (2009). Using the whole cohort in the analysis of case-cohort data. American Journal of Epidemiology, 169, 1398–1405.
  5. Cole SR & Hernan MA (2004). Adjusted survival curves with inverse probability weights. Computer Methods and Programs in Biomedicine, 75, 45–49.
  6. Curtis LH, Hammill BG, Eisenstein EL, Kramer JM, & Anstrom KJ (2007). Using inverse probability-weighted estimators in comparative effectiveness analyses with observational databases. Medical Care, 45, S103–S107.
  7. Dunnett CW (1955). A multiple comparison procedure for comparing several treatments with a control. Journal of the American Statistical Association, 50, 1096–1121.
  8. Fleming TR & Harrington DP (1991). Counting Processes and Survival Analysis. Wiley: New York.
  9. Galimberti S, Sasieni P, & Valsecchi MG (2002). A weighted Kaplan-Meier estimator for matched data with application to the comparison of chemotherapy and bone-marrow transplant in leukaemia. Statistics in Medicine, 21, 3847–3864.
  10. Gassmann HI, Deak I, & Szantai T (2002). Computing multivariate normal probabilities: A new look. Journal of Computational and Graphical Statistics, 11, 920–949.
  11. Genz A & Bretz F (2000). Numerical computation of critical values for multiple comparison problems. ASA Proceedings of the Sections on Statistical Computing and Statistical Graphics, 84–87.
  12. Hsu JC (1992). The factor analytic approach to simultaneous inference in the general linear model. Journal of Computational and Graphical Statistics, 1, 151–168.
  13. Imbens GW (2000). The role of the propensity score in estimating dose-response functions. Biometrika, 87, 706–710.
  14. Jung SH, Bang H, & Young S (2005). Sample size calculation for multiple testing in microarray data analysis. Biostatistics, 6, 157–169.
  15. Jung SH, Chen Y, & Ahn H (2014). Type I error control for tree classification. Cancer Informatics, 13, 11–18.
  16. Jung SH & Hui S (2002). Sample size calculations to compare K different survival distributions. Lifetime Data Analysis, 8, 361–373.
  17. Jung SH, Kim C, & Chow SC (2008). Sample size calculation for the log-rank tests for multi-arm trials with a common control. Journal of the Korean Statistical Society, 37, 11–22.
  18. Kaplan EL & Meier P (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53, 457–481.
  19. Lee BK, Lessler J, & Stuart EA (2009). Improving propensity score weighting using machine learning. Statistics in Medicine, 29, 337–346.
  20. Lee EK, Ahn HJ, Zo JI, Kim K, Jung DM, & Park JH (2017). Paravertebral block does not reduce cancer recurrence, but is related to higher overall survival in lung cancer surgery: A retrospective cohort study. Anesthesia and Analgesia, 125, 1322–1328.
  21. Lopez MJ & Gutman R (2017). Estimation of causal effects with multiple treatments: a review and new ideas. arXiv:1701.05132 [stat.ME].
  22. McCaffrey DF, Griffin BA, Almirall D, Slaughter ME, Ramchand R, & Burgette LF (2013). A tutorial on propensity score estimation for multiple treatments using generalized boosted models. Statistics in Medicine, 32, 3388–3414.
  23. Nelson W (1969). Hazard plotting for incomplete failure data. Journal of Quality Technology, 1, 27–52.
  24. Peto R & Peto J (1972). Asymptotically efficient rank invariant test procedures (with discussion). Journal of the Royal Statistical Society, Series A, 135, 185–206.
  25. Pruzek RM & Cen L (2002). Propensity score analysis with graphics: A comparison of two kinds of gallbladder surgery. Paper presented at the annual meeting of the Society for Multivariate Experimental Psychology, Charlottesville, VA.
  26. Rosenbaum PR & Rubin DB (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55.
  27. Rosenbaum PR & Rubin DB (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79, 516–524.
  28. Rubin DB (1998). Estimation from nonrandomized treatment comparisons using subclassification on propensity scores. In Nonrandomized Comparative Clinical Studies, ed. Abel U and Koch A, 85–100. Dusseldorf, Germany: Symposion.
  29. Stone RA, Obrosky DS, Singer DE, Kapoor WN, & Fine MJ (1995). Propensity score adjustment for pretreatment differences between hospitalized and ambulatory patients with community-acquired pneumonia. Medical Care, 33, AS56–66.
  30. Westreich D, Lessler J, & Funk MJ (2010). Propensity estimation: neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. Journal of Clinical Epidemiology, 63, 826–833.
  31. Xie J & Liu C (2005). Adjusted Kaplan-Meier estimator and log-rank test with inverse probability of treatment weighting for survival data. Statistics in Medicine, 24, 3089–3110.
  32. Zanutto E, Lu B, & Hornik R (2005). Using propensity score subclassification for multiple treatment doses to evaluate a national antidrug media campaign. Journal of Educational and Behavioral Statistics, 30, 59–73.
  33. Zhu H & Lu B (2015). Multiple comparisons for survival data with propensity score adjustment. Computational Statistics and Data Analysis, 86, 42–51.
