Author manuscript; available in PMC: 2019 Jun 1.
Published in final edited form as: Biom J. 2019 Jan 7;61(3):698–713. doi: 10.1002/bimj.201800049

K-Sample Comparisons using Propensity Analysis

Sin-Ho Jung 1, Sang Ah Chi 2, Hyun Joo Ahn 3
PMCID: PMC6461520  NIHMSID: NIHMS1002903  PMID: 30614546

Abstract

In this paper, we investigate K-group comparisons on survival endpoints for observational studies. In clinical databases for observational studies, treatments for patients are chosen with probabilities that vary depending on their baseline characteristics. This often results in non-comparable treatment groups because of imbalance in the baseline characteristics of patients among treatment groups. In order to overcome this issue, we conduct a propensity analysis and either match the subjects with similar propensity scores across treatment groups or compare weighted group means (or weighted survival curves for censored outcome variables) using inverse probability weighting (IPW). To this end, multinomial logistic regression has been a popular propensity analysis method for estimating the weights. We propose to use the decision tree method as an alternative propensity analysis because of its simplicity and robustness. We also propose IPW rank statistics, called the Dunnett-type test and the ANOVA-type test, to compare 3 or more treatment groups on survival endpoints. Using simulations, we evaluate the finite sample performance of the weighted rank statistics combined with these propensity analysis methods. We demonstrate these methods with a real data example. The IPW method also allows unbiased estimation of the population parameters of each treatment group. In this paper, we limit our discussion to survival outcomes, but all the methods can be easily modified for any type of outcome, such as binary or continuous variables.

Keywords: ANOVA, Decision tree, Dunnett test, Inverse probability weighting, Multinomial logistic regression

1. Introduction

In a prospective study, such as a phase III clinical trial, patients are randomly assigned to treatment groups independently of their baseline characteristics, also called predictors, so that the distributions of the predictors are well balanced among treatment groups and the statistical test comparing the efficacy of the treatments controls the type I error rate accurately even without adjusting for the predictors. We often use a stratified randomization method for a perfect balance of the predictor distributions among treatment groups, selecting some important predictors as the stratification factors.

In the data set of an observational study, however, treatment group is usually confounded with the predictors. In this case, we may conduct a multivariable regression analysis including the group identifier and the predictors as covariates to compare the groups while adjusting for the potential bias due to confounding. In the presence of confounding, clinical investigators are often uncomfortable with a direct comparison among treatment groups, so they want to generate a subset of the data with balanced predictor distributions among treatment groups, as if it came from a randomized trial.

In non-randomized (observational) studies, the use of propensity analysis has received much attention in clinical research for comparing clinical endpoints between groups. A propensity score is a measure of how the treatment of each patient is selected depending on the values of the predictors. If we can find an equal number of patients from each group with the same propensity score, then we can make up a perfectly balanced data set, called matched data, by selecting only those patients for an efficacy analysis. With matched data, we can conduct a K-sample comparison using a standard univariate analysis method, but we often have to discard a large part of the data set during the matching procedure. An alternative approach is to use a weighted test statistic for comparing groups with inverse probability weighting (IPW), keeping all of the original data, e.g. Curtis et al. (2007) and Breslow et al. (2009).

There have been many publications on propensity methods since the original works by Rosenbaum and Rubin (1983, 1984). Most of these publications, however, focus on comparing two patient groups from non-randomized studies, and the extension to K(≥ 3) groups has not been fully investigated yet, especially for survival analysis. Logistic regression has been widely used in propensity analysis for two treatment groups. Use of the multinomial logistic regression method for estimating propensity scores in K-sample comparison problems was proposed by Rubin (1998) and has been studied by many investigators, including Imbens (2000) and McCaffrey et al. (2013). In a K-sample comparison, accurate estimation of the propensity scores is a key component. If the treatment selection probabilities have a non-monotone trend in a predictor, it is not easy to estimate the propensity scores accurately using a model-based regression method. To overcome this issue, some investigators (Stone et al. 1995; Pruzek & Cen 2002; Westreich et al. 2010) have applied classification tree methods for propensity analysis as an alternative to logistic regression.

McCaffrey et al. (2013) propose to repeatedly apply the 2-sample tree method between each experimental treatment group and a common control, and to adjust for the resulting (K − 1)-dimensional propensity scores in K-sample comparisons. They report, however, that the K-sample comparison result can differ depending on which group is chosen as the control. We propose a decision tree method for the propensity analysis of a K-sample comparison through a direct classification of the K treatment groups that does not require specification of a control group. We use a pruning method based on χ²-tests of whether two propensity strata have different treatment selection probabilities.

We also propose two types of weighted rank statistics to compare K groups on survival outcomes, using ANOVA-type and Dunnett-type (Dunnett, 1955) testing methods combined with IPW. We evaluate, through simulations, the performance of the weighted tests combined with the two propensity analysis methods used to estimate the weights, the multinomial logistic regression method and the decision tree method. We apply the proposed IPW method and a matched data method, using the propensity scores estimated by the two propensity analysis methods, to a real data example. We limit our discussion to survival outcomes, but the results can be easily modified for other types of variables, such as continuous or binary.

2. Propensity Analysis

Suppose that there are $n_k$ patients in treatment group $k (= 1,\ldots,K)$ and $n = \sum_{k=1}^K n_k$. From each patient, we have data on group identity, $m$ covariates $(z_{1i},\ldots,z_{mi})^T$, and a survival outcome. In the first step of a propensity analysis, we estimate propensity scores using the predictors and the group identity. Outcome data are not used in this step.

2.1. Multinomial Logistic Regression Method

First, we review the multinomial logistic regression method that was proposed by Rubin (1998). Let $(P_{1i},\ldots,P_{Ki})$ be the propensities of the $K$ groups for patient $i (= 1,\ldots,n)$ with predictors $z_i = (1, z_{1i},\ldots,z_{mi})^T$. We consider the logistic models

$$\log\frac{P_{ki}}{P_{Ki}} = \beta_k^T z_i = \beta_{0k} + \beta_{1k} z_{1i} + \cdots + \beta_{mk} z_{mi} \qquad (1)$$

for $k = 1,\ldots,K-1$, where $\beta_k = (\beta_{0k},\beta_{1k},\ldots,\beta_{mk})^T$ is a vector of regression coefficients. For each $k (= 1,\ldots,K-1)$, $\beta_k$ is estimated by fitting the logistic regression (1) using data from groups $k$ and $K$, as in a propensity analysis for a two-sample comparison. For $i = 1,\ldots,n_k+n_K$, the binary response variable is $y_i = 1$ ($= 0$) if patient $i$ belongs to group $k$ (group $K$).

Since $\sum_{k=1}^K P_{ki} = 1$ and $P_{ki} = P_{Ki}\exp(\beta_k^T z_i)$ for $k = 1,\ldots,K-1$ from (1), we have

$$P_{Ki} = \frac{1}{1 + \sum_{k=1}^{K-1}\exp(\beta_k^T z_i)}$$

and

$$P_{ki} = \frac{\exp(\beta_k^T z_i)}{1 + \sum_{k'=1}^{K-1}\exp(\beta_{k'}^T z_i)} \qquad \text{for } k = 1,\ldots,K-1.$$

For $k = 1,\ldots,K-1$, let $\hat\beta_k$ denote the estimator of $\beta_k$ from the regression analysis between treatment groups $k$ and $K$.

Now, we rearrange the data so that $\tilde z_{ki}$ denotes the covariate vector for patient $i (= 1,\ldots,n_k)$ belonging to treatment group $k (= 1,\ldots,K)$. Using this notation, the allocation probability for this patient is estimated by

$$\hat P_{ki} = \frac{\exp(\hat\beta_k^T \tilde z_{ki})}{1 + \sum_{k'=1}^{K-1}\exp(\hat\beta_{k'}^T \tilde z_{ki})} \qquad \text{if } 1 \le k \le K-1$$

and

$$\hat P_{Ki} = \frac{1}{1 + \sum_{k'=1}^{K-1}\exp(\hat\beta_{k'}^T \tilde z_{Ki})}.$$

We use $\hat P_{ki}$ as a propensity score. For propensity matching, we may define strata by grouping patients with similar values of $(\hat P_{1i},\ldots,\hat P_{K-1,i})$ using a clustering method. A simple clustering method, for example with $K = 3$, is to partition the ranges of $\hat P_{1i}$ and $\hat P_{2i}$ from the $n$ patients into $J_1$ and $J_2$ intervals, respectively, and construct $J_1 \times J_2$ strata, each consisting of patients with similar propensity scores $(\hat P_1,\hat P_2,\hat P_3)$. Within each stratum, we can randomly select a certain number of patients from each of the three groups so that the set of allocation proportions $(\gamma_1,\gamma_2,\gamma_3)$ with $\sum_{k=1}^3\gamma_k = 1$ is identical across the $J_1 \times J_2$ strata. This procedure is called vector matching by Abadie and Imbens (2006). Lopez and Gutman (2017) provide a review of various matching methods for multiple treatment comparisons.

If we want an inference based on IPW, then we use $w_{ki} = 1/\hat P_{ki}$ as the weight for patient $i$ in group $k$.
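
As an illustration (not the authors' Fortran program), the propensity scores $\hat P_{ki}$ and IPW weights $w_{ki} = 1/\hat P_{ki}$ could be computed from the $K-1$ pairwise logistic fits described above roughly as in the following sketch; the function and variable names are assumptions made for this sketch, and scikit-learn is used only for the binary logistic regressions.

import numpy as np
from sklearn.linear_model import LogisticRegression

def pairwise_propensity(Z, group, K):
    """Propensity scores from K-1 binary logistic fits of group k versus the reference group K.

    Z     : (n, m) numpy array of predictors
    group : (n,) integer numpy array of treatment labels 1, ..., K
    Returns the (n, K) matrix of estimated propensity scores and the IPW weights 1 / P_hat.
    """
    n = len(group)
    exp_eta = np.ones((n, K - 1))                    # exp(beta_k' z_i) for k = 1, ..., K-1
    for k in range(1, K):
        sub = np.isin(group, [k, K])                 # use only groups k and K, as in Section 2.1
        y = (group[sub] == k).astype(int)            # 1 for group k, 0 for the reference group K
        fit = LogisticRegression(max_iter=1000).fit(Z[sub], y)
        exp_eta[:, k - 1] = np.exp(fit.intercept_[0] + Z @ fit.coef_[0])
    denom = 1.0 + exp_eta.sum(axis=1)
    probs = np.column_stack([exp_eta / denom[:, None], 1.0 / denom])   # columns are groups 1..K
    p_own = probs[np.arange(n), group - 1]           # propensity of the treatment actually received
    return probs, 1.0 / p_own                        # IPW weights w_ki = 1 / P_hat_ki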

2.2. Decision Tree Method

The multinomial logistic regression is a parametric approach, so the final K-group comparison results, as well as the propensity analysis results, can be sensitive to the assumed regression model. The decision tree is a robust alternative that avoids such issues of regression methods.

We define a stratum based on propensity scores that are associated with the relative frequencies of the K groups. Ideally, each stratum should be homogeneous, so that the propensity scores of the patients within a stratum are similar. On the other hand, the relative frequencies of the K groups should be very different between any two different strata. The further apart two propensity strata are, the more different the relative frequencies between the two strata will be. We derive a propensity score estimation method based on this concept.

A decision tree is a tool to find the best decision for an object based on the values of its predictors. While regression trees usually have a continuous or binomial outcome variable, decision trees have a nominal categorical outcome variable. If there are two possible decisions (or groups to be compared), then a decision tree is identical to the binary regression tree. We consider decision trees with K(≥ 3) possible decisions that are called treatment groups in this paper. At a node, we go through all possible cutoff values for every predictor and identify the cutoff value of a predictor that gives the most significant partitioning in terms of the proportions among K groups. As the level of nodes goes down, the possible cutoffs of a predictor will be limited to a smaller range. If the cutoff value is too extreme, then one of two strata partitioned by the cutoff will be too small, especially with a continuous predictor. Since we do not want to consider too small strata, we may not consider extreme cutoff values.

In order to measure the significance of a classification, we propose to use the p-value from the chi-squared test for a 2×K frequency table (see Table 1) comprising two strata identified by a cutoff value of a predictor and K groups. We continue the classification procedure until there exist no more significant cutoff values for any predictors for all nodes. When this procedure is completed, each terminal node consists of a subset of patients with similar propensity scores. In this sense, we call each terminal node a propensity stratum, and the frequencies of the K groups within each stratum can be used for propensity matching.

Table 1.

2 × K frequency table defined by cutoff value c for predictor z

Propensity stratum    Treatment group
                      1      2      …      K
z ≤ c                 n11    n12    …      n1K
z > c                 n21    n22    …      n2K

For the purpose of statistical trimming of regression trees, Jung et al. (2014) propose to control the familywise error rate (FWER) in determining the significance of the 2 × K tables, accounting for the multiplicity of the predictors and of the cutoff values of each predictor. This rule, however, will produce a very short decision tree, and each of the resulting branches (i.e. strata) will consist of patients with a very wide range of propensity scores. So, in this paper, we propose to control the marginal type I error rate based on the p-values of χ²-tests with K − 1 degrees of freedom. If the total sample size is large and we want each stratum to have very homogeneous propensity scores, then we use a large type I error rate, called α1.

When there is no more significant classification, each of the leaves, also called final nodes, is counted as a stratum consisting of subjects with similar propensity scores, and the frequencies among the K groups are summarized by a K × 1 table. A decision tree may possibly result in over-classification. If there are terminal nodes with similar relative frequencies, then we may combine them into a single propensity stratum for efficient estimation of propensity scores. We make this decision if the p-value of the χ²-test comparing the relative frequencies between two strata is larger than a prespecified type I error rate, α2. We repeat this procedure until there are no more pairs of strata with a χ²-test p-value larger than the specified α2 level.

We can consider matching and inverse probability weighting (IPW) approaches for an unbiased comparison of the outcome among the K groups based on the final decision tree. For matching, we randomly select patients from each stratum using certain proportions among the K groups so that the set of proportions is identical across the strata. A simple example is to select an equal number of patients from each group within each stratum, corresponding to 1-to-1 matching in a two-group propensity matching. Oftentimes, control groups have many more subjects than case groups. If some groups have many more subjects than the others in all strata, then we may consider an unbalanced matching among the K groups, corresponding to 1-to-m matching in a two-group propensity matching. More specifically, suppose that we want allocation proportions of $(\gamma_1,\ldots,\gamma_K)$ among the K groups with $\gamma_k > 0$ and $\sum_{k=1}^K\gamma_k = 1$. Then, in a propensity matching, we randomly select subjects from the large groups to satisfy the allocation proportions $(\gamma_1,\ldots,\gamma_K)$ within each stratum.

If stratum $j$ has $n_{jk}$ patients in treatment group $k$, then the propensity score for these patients is estimated by $\hat P_{jk} = n_{jk}/\sum_{k'=1}^K n_{jk'}$ and the weights for IPW are given as $w_{ki} = 1/\hat P_{jk} = n_{jk}^{-1}\sum_{k'=1}^K n_{jk'}$. In summary, a decision tree for propensity analysis may proceed as follows; the first node is the whole data set.

  • [I] Classification procedure

    • [Ia] For each predictor, partition a node into two nodes using a value of the predictor as a cutoff value, and calculate the p-value of the $\chi^2_{K-1}$ test.

    • [Ib] Repeat [Ia] for all possible cutoff values with respect to all predictors. If the smallest p-value is smaller than α1, then split the node into two nodes using the corresponding cutoff value.

    • [Ic] If there exist no p-values smaller than α1, then the classification procedure ends.

  • [II] Pooling procedure: Suppose that a classification procedure resulted in D1 nodes (or strata).

    • [IIa] For each of the D1(D1 − 1)/2 pairs of nodes, calculate the p-value of the $\chi^2_{K-1}$ test comparing the allocation proportions between the two nodes.

    • [IIb] In [IIa], if the largest p-value is larger than α2, combine the corresponding pair of nodes into one.

    • [IIc] Repeat [IIa] and [IIb] until there exists no pair of nodes with p-value larger than α2, to result in D2 strata.

  • [III] Among the D2 strata, discard the strata with 0-frequency for any treatment group, to result in D3 strata.

Note that, while steps [I] and [II] do not change the sample size, step [III] can reduce the sample size for the final analysis. Step [I] corresponds to variable addition and step [II] to variable deletion in a stepwise regression.
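
A simplified sketch of steps [I]–[III] is given below. It is an illustration under assumed inputs (a numeric predictor matrix Z and integer group labels 1,…,K), not the authors' Fortran implementation; the last function computes the stratum-based IPW weights $w_{ki} = n_{jk}^{-1}\sum_{k'} n_{jk'}$.

import numpy as np
from scipy.stats import chi2_contingency

def _freq(group, idx, K):
    """Frequencies of treatment groups 1..K among the patients indexed by idx."""
    return np.bincount(group[idx], minlength=K + 1)[1:]

def _pval(tab):
    """p-value of the chi-squared test for a 2 x K table, ignoring all-zero columns."""
    tab = tab[:, tab.sum(axis=0) > 0]
    return chi2_contingency(tab)[1] if tab.shape[1] > 1 else 1.0

def grow(Z, group, idx, K, alpha1=0.05, min_size=5):
    """Step [I]: recursively split a node while some cutoff has p-value below alpha1."""
    best = (alpha1, None, None)
    for j in range(Z.shape[1]):
        for c in np.unique(Z[idx, j])[:-1]:                      # candidate cutoff values
            left, right = idx[Z[idx, j] <= c], idx[Z[idx, j] > c]
            if min(len(left), len(right)) < min_size:            # skip extreme cutoffs
                continue
            p = _pval(np.vstack([_freq(group, left, K), _freq(group, right, K)]))
            if p < best[0]:
                best = (p, left, right)
    if best[1] is None:                                          # step [Ic]: no significant split
        return [idx]
    return (grow(Z, group, best[1], K, alpha1, min_size) +
            grow(Z, group, best[2], K, alpha1, min_size))

def pool_and_trim(strata, group, K, alpha2=0.3):
    """Steps [II] and [III]: merge strata with similar allocation, drop strata missing a group."""
    merged = True
    while merged and len(strata) > 1:
        merged, best = False, (alpha2, None)
        for a in range(len(strata)):
            for b in range(a + 1, len(strata)):
                p = _pval(np.vstack([_freq(group, strata[a], K), _freq(group, strata[b], K)]))
                if p > best[0]:
                    best, merged = (p, (a, b)), True
        if merged:
            a, b = best[1]
            strata[a] = np.concatenate([strata[a], strata[b]])   # step [IIb]: pool the closest pair
            del strata[b]
    return [s for s in strata if np.all(_freq(group, s, K) > 0)] # step [III]

def ipw_from_strata(strata, group, K, n):
    """IPW weights w_ki = (sum over k' of n_jk') / n_jk within each final stratum j."""
    w = np.full(n, np.nan)                                       # NaN marks discarded patients
    for s in strata:
        w[s] = len(s) / _freq(group, s, K)[group[s] - 1]
    return w

# Usage sketch: strata = pool_and_trim(grow(Z, group, np.arange(len(group)), K), group, K)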

3. Weighted Test Statistics

The propensity analysis discussed in the previous section uses the group identity and predictors, but not the outcome data. Once a propensity analysis is completed, we want to conduct a statistical test to compare the groups on an outcome variable incorporating the propensity analysis result. Cole and Hernan (2004) and Xie and Liu (2005) propose a weighted 2-sample log-rank test for survival outcomes.

In a K-sample test, the null hypothesis is that all K groups have the same survival distribution, i.e. $H_0: S_1(t) = \cdots = S_K(t)$ for $t \ge 0$, where $S_k(t)$ denotes the survivor function of the population when all patients receive treatment $k$. The statistical testing method for this null hypothesis differs depending on the study design and the study objective. For example, if all K groups are cases, then we may compare their survival distributions using a one-way ANOVA-type test. On the other hand, if one group is a control and the remaining K − 1 groups are cases, then we may use a Dunnett-type test to compare each case group with the control. Jung and Hui (2002) and Jung et al. (2008) propose sample size formulas for ANOVA-type and Dunnett-type rank tests, respectively, for comparing the survival distributions among K groups when the distributions of the predictors are balanced among treatment groups. In this section, we propose weighted versions of these rank tests and derive their asymptotic distributions for large $n$, assuming that $n_k/n \to \gamma_k\,(> 0)$ for $k = 1,\ldots,K$.

We observe the censored survival time $X_{ki}$, the minimum of the survival and censoring times, and the event indicator $\delta_{ki}$, taking 1 if an event is observed and 0 otherwise, for patient $i (= 1,\ldots,n_k)$ in group $k (= 1,\ldots,K)$, who has a weight $w_{ki}$ from a propensity analysis. For each patient, the censoring time is independent of the survival time. Let $N_{ki}(t) = \delta_{ki} I(X_{ki} \le t)$ and $Y_{ki}(t) = I(X_{ki} \ge t)$ denote the event and at-risk processes for patient $i$ in group $k$, and let $N_k(t) = \sum_{i=1}^{n_k} w_{ki} N_{ki}(t)$, $Y_k(t) = \sum_{i=1}^{n_k} w_{ki} Y_{ki}(t)$, $\bar Y_k(t) = \sum_{i=1}^{n_k} w_{ki}^2 Y_{ki}(t)$, $N(t) = \sum_{k=1}^K N_k(t)$, and $Y(t) = \sum_{k=1}^K Y_k(t)$. Further, let $M_{ki}(t) = w_{ki}\int_0^t\{dN_{ki}(s) - Y_{ki}(s)\,d\Lambda_k(s)\}$, $M_k(t) = \sum_{i=1}^{n_k} M_{ki}(t)$, and $M(t) = \sum_{k=1}^K M_k(t)$. Let $w_k = \sum_{i=1}^{n_k} w_{ki}$, $w = \sum_{k=1}^K w_k = \sum_{k=1}^K\sum_{i=1}^{n_k} w_{ki}$, and $\bar w_k = \sum_{i=1}^{n_k} w_{ki}^2$.

Before discussing K-sample comparison problems, we briefly investigate a weighted Kaplan-Meier estimate as a 1-sample problem.

3.1. Weighted Kaplan-Meier estimate

A weighted estimator of the survival function $S_k(t)$ of group $k$, called the weighted Kaplan-Meier (Kaplan & Meier, 1958) estimator by Galimberti et al. (2002), is obtained as

$$\hat S_k(t) = \prod_{s\le t}\left\{1 - \frac{\Delta N_k(s)}{Y_k(s)}\right\} = \prod_{\{i\le n_k:\,X_{ki}\le t\}}\left\{1 - \frac{\sum_{i'=1}^{n_k} w_{ki'}\,\delta_{ki}\delta_{ki'}\,I(X_{ki'} = X_{ki})}{\sum_{i'=1}^{n_k} w_{ki'}\,I(X_{ki'} \ge X_{ki})}\right\},$$

where $\Delta N_k(t) = N_k(t) - N_k(t-)$. Note that $M_{ki}(t)$ is a 0-mean martingale with variance $w_{ki}^2\int_0^t Y_{ki}(s)\,d\Lambda_k(s)$. Using arguments similar to those of Fleming and Harrington (1991) for the Kaplan-Meier estimator, we can show that $\hat S_k(t)$ is approximately normal with mean $S_k(t)$ and variance $\sigma_k^2(t)$ that can be consistently estimated by

$$\hat\sigma_k^2(t) = \hat S_k^2(t)\int_0^t \frac{\bar Y_k(s)}{Y_k(s)^2}\,d\hat\Lambda_k(s) = \hat S_k^2(t)\sum_{i=1}^{n_k}\frac{\delta_{ki} w_{ki} I(X_{ki}\le t)\sum_{i'=1}^{n_k} w_{ki'}^2 I(X_{ki'}\ge X_{ki})}{\left\{\sum_{i'=1}^{n_k} w_{ki'} I(X_{ki'}\ge X_{ki})\right\}^3},$$

where $\hat\Lambda_k(t) = \int_0^t Y_k(s)^{-1}\,dN_k(s)$ is a weighted version of the Nelson-Aalen estimator (Nelson 1969; Aalen 1978) of the cumulative hazard function $\Lambda_k(t) = -\log S_k(t)$ for group $k$.
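
As a minimal illustration (under assumed array names: x for the censored times $X_{ki}$, delta for the event indicators $\delta_{ki}$, and w for the IPW weights $w_{ki}$ of one treatment group), the weighted Kaplan-Meier and weighted Nelson-Aalen estimates can be computed as follows.

import numpy as np

def weighted_km(x, delta, w):
    """Weighted Kaplan-Meier and Nelson-Aalen estimates at the distinct observed event times."""
    x, delta, w = map(np.asarray, (x, delta, w))
    times = np.unique(x[delta == 1])                 # distinct observed event times
    surv, cumhaz, s, H = [], [], 1.0, 0.0
    for t in times:
        dN = w[(x == t) & (delta == 1)].sum()        # weighted number of events at t
        Y = w[x >= t].sum()                          # weighted number at risk at t
        s *= 1.0 - dN / Y                            # product-limit term of S_hat_k(t)
        H += dN / Y                                  # increment of Lambda_hat_k(t)
        surv.append(s)
        cumhaz.append(H)
    return times, np.array(surv), np.array(cumhaz)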

3.2. One-Way ANOVA-Type Tests

In an ANOVA-type test, we want to test $H_0: \Lambda_1(t) = \cdots = \Lambda_K(t)\,(= \Lambda(t))$ for $t \ge 0$ against $H_1: \Lambda_k(t) \ne \Lambda_{k'}(t)$ for some $k \ne k'$.

A weighted version of the Nelson-Aalen estimator for the common cumulative hazard function $\Lambda(t)$ under $H_0$ is given as $\hat\Lambda(t) = \int_0^t Y(s)^{-1}\,dN(s)$. For testing $H_0$, we consider a weighted log-rank test $W = (W_1,\ldots,W_{K-1})^T$, where

$$W_k = \frac{\sqrt n}{w_k}\int_0^\infty Y_k(t)\{d\hat\Lambda_k(t) - d\hat\Lambda(t)\} = \frac{\sqrt n}{w_k}\left\{\int_0^\infty dN_k(t) - \int_0^\infty\frac{Y_k(t)}{Y(t)}\,dN(t)\right\} = \frac{\sqrt n}{w_k}\left\{\sum_{i=1}^{n_k}\delta_{ki}w_{ki} - \sum_{l=1}^K\sum_{i=1}^{n_l}\delta_{li}w_{li}\,\frac{\sum_{i'=1}^{n_k} w_{ki'} I(X_{ki'}\ge X_{li})}{\sum_{l'=1}^K\sum_{i'=1}^{n_{l'}} w_{l'i'} I(X_{l'i'}\ge X_{li})}\right\}.$$

Appendix A shows that, under $H_0$, $W$ is asymptotically normal with mean $0$ and variance $V$ that can be consistently estimated by $\hat V = (\hat v_{k,k'})_{(K-1)\times(K-1)}$ with

$$\hat v_{k,k'} = \frac{n}{w_k w_{k'}}\sum_{l=1}^K\int_0^\infty\left\{\xi_{kl} - \frac{Y_k(t)}{Y(t)}\right\}\left\{\xi_{k'l} - \frac{Y_{k'}(t)}{Y(t)}\right\}\frac{\bar Y_l(t)}{Y(t)}\,dN(t),$$

where $\xi_{k,k'} = I(k = k')$ and the integral over $dN(t)$ is evaluated as a sum over all observed events, the event of patient $i$ in group $l$ contributing mass $\delta_{li}w_{li}$ at time $X_{li}$, so that the empirical form of $\hat v_{k,k'}$ parallels the last expression for $W_k$ above. Hence, with a specified type I error rate $\alpha$, we can reject $H_0$ if $Q = W^T\hat V^{-1}W$ is larger than $\chi^2_{K-1,1-\alpha}$.
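
The statistic $Q$ can be assembled directly from these formulas by summing over the observed event times. The sketch below is an illustrative, unoptimized implementation under assumed inputs (arrays x, delta, w and integer group labels 1,…,K as in the notation above); it is not the authors' program.

import numpy as np
from scipy.stats import chi2

def anova_weighted_logrank(x, delta, w, group, K):
    """ANOVA-type weighted log-rank test: returns Q = W' V^{-1} W and its chi-squared p-value."""
    x, delta, w, group = map(np.asarray, (x, delta, w, group))
    n = len(x)
    wk = np.array([w[group == k].sum() for k in range(1, K + 1)])       # w_k
    W, V = np.zeros(K), np.zeros((K, K))
    for i in np.where(delta == 1)[0]:                # each observed event contributes mass w_i to dN
        atrisk = x >= x[i]
        Yk = np.array([w[(group == k) & atrisk].sum() for k in range(1, K + 1)])         # Y_k(t)
        Ybk = np.array([(w[(group == k) & atrisk] ** 2).sum() for k in range(1, K + 1)]) # Ybar_k(t)
        Y = Yk.sum()
        r = Yk / Y                                                      # Y_k(t) / Y(t)
        xi_l = (np.arange(1, K + 1) == group[i]).astype(float)          # xi_{k,l} for this event's group l
        W += w[i] * (xi_l - r)
        A = np.eye(K) - r[:, None]                                      # A[k, l] = xi_{kl} - Y_k/Y
        V += w[i] * (A * (Ybk / Y)) @ A.T                               # sum over l of A[k,l] A[k',l] Ybar_l / Y
    W = np.sqrt(n) * W[:K - 1] / wk[:K - 1]
    V = n * V[:K - 1, :K - 1] / np.outer(wk[:K - 1], wk[:K - 1])
    Q = float(W @ np.linalg.solve(V, W))
    return Q, chi2.sf(Q, K - 1)                                         # reject H0 if p-value < alpha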

3.3. Dunnett-Type Tests

In K-sample problems, oftentimes one of the groups is a control and the remaining groups are cases. We assume that group $K$ is the control. In this case, we usually want to test whether each of the $K-1$ case groups is more efficacious than the control group. For $k = 1,\ldots,K-1$, we want to test the null hypothesis $H_k: \Lambda_k(t) = \Lambda_K(t)$ against the alternative hypothesis $\bar H_k: \Lambda_k(t) \ne \Lambda_K(t)$. Denote $H_0 = \cap_{k=1}^{K-1}H_k$ and $H_a = \cup_{k=1}^{K-1}\bar H_k$. For $k = 1,\ldots,K-1$, the weighted log-rank statistic (Peto & Peto 1972; Jung & Hui 2002) for testing $H_k$ is given by

$$U_k = \frac{\sqrt n\,(w_k+w_K)}{w_k w_K}\int_0^\infty\frac{Y_k(t)Y_K(t)}{Y_k(t)+Y_K(t)}\{d\hat\Lambda_K(t) - d\hat\Lambda_k(t)\} = \frac{\sqrt n\,(w_k+w_K)}{w_k w_K}\left\{\sum_{i=1}^{n_K}\delta_{Ki}w_{Ki}\frac{Y_k(X_{Ki})}{Y_k(X_{Ki})+Y_K(X_{Ki})} - \sum_{i=1}^{n_k}\delta_{ki}w_{ki}\frac{Y_K(X_{ki})}{Y_k(X_{ki})+Y_K(X_{ki})}\right\}.$$

Note that a positive $U_k$ value implies that group $k$ has longer survival than group $K$. By Appendix B, under $H_k$, $U_k$ is approximately normal with mean $0$ and variance $\sigma_k^2$ that can be consistently estimated by

$$\hat\sigma_k^2 = \frac{n(w_k+w_K)^2}{w_k^2 w_K^2}\int_0^\infty\frac{Y_k^2(t)\bar Y_K(t) + Y_K^2(t)\bar Y_k(t)}{\{Y_k(t)+Y_K(t)\}^2}\,d\hat\Lambda_{(k)}(t) = \frac{n(w_k+w_K)^2}{w_k^2 w_K^2}\left[\sum_{i=1}^{n_k}\delta_{ki}w_{ki}\frac{Y_k^2(X_{ki})\bar Y_K(X_{ki}) + Y_K^2(X_{ki})\bar Y_k(X_{ki})}{\{Y_k(X_{ki})+Y_K(X_{ki})\}^3} + \sum_{i=1}^{n_K}\delta_{Ki}w_{Ki}\frac{Y_k^2(X_{Ki})\bar Y_K(X_{Ki}) + Y_K^2(X_{Ki})\bar Y_k(X_{Ki})}{\{Y_k(X_{Ki})+Y_K(X_{Ki})\}^3}\right],$$

where $\hat\Lambda_{(k)}(t) = \int_0^t\{Y_k(s)+Y_K(s)\}^{-1}\{dN_k(s)+dN_K(s)\}$ denotes the weighted Nelson-Aalen estimator under $H_k$.
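
Analogously, the standardized statistic $U_k/\hat\sigma_k$ for one case group against the control can be computed by summing over the events of groups $k$ and $K$; the following is an illustrative sketch under the same assumed inputs as above.

import numpy as np

def dunnett_weighted_logrank(x, delta, w, group, k, K):
    """Standardized weighted log-rank statistic U_k / sigma_hat_k for group k versus control group K."""
    x, delta, w, group = map(np.asarray, (x, delta, w, group))
    n = len(x)
    wk, wK = w[group == k].sum(), w[group == K].sum()
    U, Vsum = 0.0, 0.0
    for i in np.where(np.isin(group, [k, K]) & (delta == 1))[0]:   # events in group k or K only
        atrisk = x >= x[i]
        Yk = w[(group == k) & atrisk].sum()                        # Y_k(t)
        YK = w[(group == K) & atrisk].sum()                        # Y_K(t)
        Ybk = (w[(group == k) & atrisk] ** 2).sum()                # Ybar_k(t)
        YbK = (w[(group == K) & atrisk] ** 2).sum()                # Ybar_K(t)
        if group[i] == K:
            U += w[i] * Yk / (Yk + YK)                             # event in the control group
        else:
            U -= w[i] * YK / (Yk + YK)                             # event in case group k
        Vsum += w[i] * (Yk**2 * YbK + YK**2 * Ybk) / (Yk + YK) ** 3
    scale = np.sqrt(n) * (wk + wK) / (wk * wK)                     # common factor of U_k
    sigma = np.sqrt(n * (wk + wK) ** 2 / (wk**2 * wK**2) * Vsum)   # sigma_hat_k
    return scale * U / sigma                                       # compare |U_k / sigma_hat_k| with c_alpha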

If we reject each $H_k$ with a type I error probability of $\alpha$, the probability of falsely rejecting the global null hypothesis $H_0$ will be larger than the nominal $\alpha$ level because of multiple testing. To avoid this issue, statisticians control the FWER. For a chosen critical value $c$, we reject $H_k$ if $|U_k/\hat\sigma_k| > c$. Dunnett-type tests control the FWER at $\alpha$ by choosing $c = c_\alpha$ satisfying

$$\alpha = P\left\{\left.\bigcup_{k=1}^{K-1}\left(|U_k/\hat\sigma_k| > c\right)\,\right|\,H_0\right\}. \qquad (2)$$

The Bonferroni test, which rejects $H_k$ with a type I error probability of $\alpha/(K-1)$, is too conservative.

From Appendix B, for large $n$ under $H_0$, $(U_1/\hat\sigma_1,\ldots,U_{K-1}/\hat\sigma_{K-1})$ is approximately normal with means $0$, variances $1$, and correlation coefficients

$$\hat\rho_{kk'} = \frac{w_k w_{k'}\bar w_K}{\sqrt{(w_K^2\bar w_k + w_k^2\bar w_K)(w_K^2\bar w_{k'} + w_{k'}^2\bar w_K)}}$$

for $1 \le k < k' \le K-1$. Hence, we can obtain $c = c_\alpha$ from (2) by using a numerical method (Genz & Bretz 2000; Gassmann et al. 2002) or a simulation method (Bang et al. 2005). We focus on the former method in this paper. Let $\phi_{K-1}(u_1,\ldots,u_{K-1})$ denote the joint probability density function (PDF) of the $(K-1)$-variate normal distribution with marginal means $0$, variances $1$, and correlation coefficients $\hat\rho_{kk'}$. Then, from (2), we obtain $c = c_\alpha$ by solving

$$\alpha = 1 - \int_{-c}^{c}\cdots\int_{-c}^{c}\phi_{K-1}(u_1,\ldots,u_{K-1})\,du_1\cdots du_{K-1} \qquad (3)$$

with respect to c.

If $(Z_1, Z_2)$ is bivariate normal with means $(\mu_1,\mu_2)$, variances $(\sigma_1^2,\sigma_2^2)$, and correlation coefficient $\rho$, then it is well known that $Z_1$ conditional on $Z_2 = z_2$ is normal with mean $\mu_1 + \rho(\sigma_1/\sigma_2)(z_2-\mu_2)$ and variance $\sigma_1^2(1-\rho^2)$. Hence, for $K = 3$, (3) can be expressed using a one-dimensional integration

$$\alpha = 1 - \int_{-c}^{c}\phi(u)\left\{\Phi\left(\frac{c - \hat\rho_{12}u}{\sqrt{1-\hat\rho_{12}^2}}\right) - \Phi\left(\frac{-c - \hat\rho_{12}u}{\sqrt{1-\hat\rho_{12}^2}}\right)\right\}du,$$

where $\phi(\cdot)$ and $\Phi(\cdot)$ are the probability density function and the cumulative distribution function of $N(0,1)$.

Our Dunnett-type test is conducted as follows.

  1. Calculate the weights (wki, i = 1,…, nk, k = 1,…,K) using the multinomial logistic regression or decision tree method.

  2. By solving (3), obtain the critical value cα.

  3. For k = 1,…,K − 1, we reject $H_k$ if $|U_k/\hat\sigma_k| > c_\alpha$.

Given an observed test statistic $z_k = U_k/\hat\sigma_k$ for group $k (= 1,\ldots,K-1)$, the FWER-adjusted p-value (Jung et al. 2005), $p_k$, is given as

$$p_k = 1 - \int_{-|z_k|}^{|z_k|}\cdots\int_{-|z_k|}^{|z_k|}\phi_{K-1}(u_1,\ldots,u_{K-1})\,du_1\cdots du_{K-1},$$

which is simplified to

$$p_k = 1 - \int_{-|z_k|}^{|z_k|}\phi(u)\left\{\Phi\left(\frac{|z_k| - \hat\rho_{12}u}{\sqrt{1-\hat\rho_{12}^2}}\right) - \Phi\left(\frac{-|z_k| - \hat\rho_{12}u}{\sqrt{1-\hat\rho_{12}^2}}\right)\right\}du$$

for K = 3.
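
For $K = 3$, the one-dimensional forms above are easy to evaluate numerically. The following sketch (with assumed argument names, using scipy) computes $\hat\rho_{12}$ from the weights of the two case groups and the control, the critical value $c_\alpha$ from (3), and the FWER-adjusted p-value.

import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import norm

def rho_hat(w1, w2, wK):
    """Estimated correlation rho_hat_12 from the weight arrays of groups 1, 2 and the control K."""
    s1, s2, sK = w1.sum(), w2.sum(), wK.sum()                     # w_k
    b1, b2, bK = (w1**2).sum(), (w2**2).sum(), (wK**2).sum()      # wbar_k
    return (s1 * s2 * bK) / np.sqrt((sK**2 * b1 + s1**2 * bK) * (sK**2 * b2 + s2**2 * bK))

def _coverage(c, rho):
    """P(|Z1| <= c, |Z2| <= c) for a standard bivariate normal with correlation rho."""
    f = lambda u: norm.pdf(u) * (norm.cdf((c - rho * u) / np.sqrt(1 - rho**2)) -
                                 norm.cdf((-c - rho * u) / np.sqrt(1 - rho**2)))
    return quad(f, -c, c)[0]

def dunnett_critical_value(rho, alpha=0.05):
    """Critical value c_alpha solving equation (3) for K = 3."""
    return brentq(lambda c: 1.0 - _coverage(c, rho) - alpha, 1e-6, 8.0)

def adjusted_pvalue(z_k, rho):
    """FWER-adjusted p-value p_k = 1 - P(|Z1| <= |z_k|, |Z2| <= |z_k|)."""
    return 1.0 - _coverage(abs(z_k), rho)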

4. Numerical Studies

4.1. Simulations

We conduct simulation studies to evaluate the performance of the K-sample comparison methods combined with the propensity analysis methods under various settings. We consider comparing the survival distributions of K = 3 groups using the ANOVA or Dunnett test combined with the IPW approach. The third group (k = 3) is the control group for the Dunnett test. The weights are estimated by the multinomial logistic regression or decision tree method.

For patient i, the predictor $z_i$ is generated from U(0, 3). Given a predictor value $z_i$, the allocation proportion for each group is determined by one of the following models.

(A1) (Multinomial logistic model)

$$\gamma_1 = \gamma_2 = \frac{\exp(0.15 z_i)}{1 + 2\exp(0.15 z_i)}, \qquad \gamma_3 = 1 - 2\gamma_1$$

(A2) (Multinomial logistic model with an interval with controls only)

$$\gamma_1 = \gamma_2 = \frac{\exp(0.15 z_i)}{1 + 2\exp(0.15 z_i)}, \qquad \gamma_3 = 1 - 2\gamma_1$$

if zi ∈ [0, 2], and γ3 = 1 if zi ∈ (2, 3].

The survival distribution of patient i in treatment group k with predictor zi is generated from an exponential distribution with an annual hazard rate of

$$\lambda_i = 0.2\exp(\beta_k + 0.5 z_i).$$

Note that βk denotes the effect of group k and 0.5zi denotes the impact of the predictor on the survival outcome. We consider a null hypothesis H0 : β1 = β2 = β3 = 0, and alternative hypotheses H1 : (β1, β2, β3) = (−0.35,−0.35,0.05) and H2 : (β1, β2, β3) = (−0.45,−0.25,0.15).

We consider another model with allocation probabilities non-monotone in zi.

(A3) (A non-multinomial logistic model)

$$\gamma_1 = \gamma_2 = \frac{1.25}{\sqrt{2\pi}}\exp\left\{-\frac{(z_i-1.5)^2}{2}\right\}, \qquad \gamma_3 = 1 - 2\gamma_1$$

For (A3), the propensity score for group 3 is close to 1 for $z_i$ close to 0 or 3, and close to 0 for $z_i = 1.5$. This propensity model is matched with an exponential survival distribution whose hazard rate is non-monotone in $z_i$,

$$\lambda_i = 0.4\exp\{\beta_k + 0.3 - 0.6\,I(1 < z_i < 2)\}.$$

The censoring variable is generated from U(2, 7), mimicking a study with 5 years of patient accrual and 2 years of additional follow-up, to generate about 20% censoring. The parameter values for the propensity models and survival distributions are chosen to give at least 80% power for the decision tree method.
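
For concreteness, one simulated data set under allocation model (A1), with the exponential outcome model and U(2, 7) censoring described above, could be generated as in the following sketch (illustrative only; the random-number details are our assumptions).

import numpy as np

rng = np.random.default_rng(2024)

def simulate_a1(n=1000, beta=(0.0, 0.0, 0.0)):
    """One simulated data set: predictor z, group label, censored time X, and event indicator."""
    z = rng.uniform(0.0, 3.0, n)
    g1 = np.exp(0.15 * z) / (1.0 + 2.0 * np.exp(0.15 * z))     # gamma_1 = gamma_2 under (A1)
    probs = np.column_stack([g1, g1, 1.0 - 2.0 * g1])          # allocation probabilities
    group = np.array([rng.choice([1, 2, 3], p=p) for p in probs])
    lam = 0.2 * np.exp(np.asarray(beta)[group - 1] + 0.5 * z)  # exponential hazard lambda_i
    t = rng.exponential(1.0 / lam)                             # latent survival times
    c = rng.uniform(2.0, 7.0, n)                               # censoring times
    return z, group, np.minimum(t, c), (t <= c).astype(int)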

For the decision tree method, we use α1 = 0.05 for classification and α2 = 0.3 for pooling, and extreme cutoff values resulting in fewer than 5 patients in one stratum are not considered. We consider classification only (I), classification and pooling (I+II), classification and discarding the strata with 0-frequency groups (I+III), or all three steps (I+II+III), and report the average number of strata D for each of these procedures.

Under each simulation setting, we generate 10,000 simulation data sets of size n = 1,000, and apply the ANOVA or Dunnett test using the weights estimated by multinomial logistic regression or decision tree method to each sample. The empirical type I error rate and power are estimated by the proportion of simulation samples rejecting the null hypothesis with nominal α = 0.05 among 10,000 simulation samples.

Table 2(a) summarizes the simulation results. If the treatment selection follows a multinomial logistic model (A1), then the weighted rank tests using the weights from both multinomial logistic and decision tree methods control the type I error rate closely under H0. Under allocation model (A1), the decision tree method does not seem to need steps (II) and (III) in addition to classification (I). If there exists an interval of predictor zi with control patients only (A2) or if the allocation probabilities do not have monotone trends in zi (A3), then the weighted rank tests using the propensity scores from multinomial logistic regression do not control the type I error accurately.

Table 2.

Empirical rejection probabilities of the weighted log-rank tests based on the multinomial logistic regression and decision tree propensity methods. For the decision tree method, we also report the average number of strata (D) after the first classification (I), after pooling the strata with similar allocation proportions (I+II), after discarding the strata with no frequency in any of the three treatment groups (I+III), and after both pooling the strata with similar allocation proportions and discarding the strata with no frequency in any of the three treatment groups (I+II+III).

(a) When K = 3

Allocation  Censoring  Multinomial     Tree (I)             Tree (I+II)          Tree (I+III)          Tree (I+II+III)
                       ANO    Dun      ANO    Dun    D      ANO    Dun    D      ANO    Dun    D       ANO    Dun    D
(i) Under H0: β1 = β2 = β3
A1          19.6%      0.034  0.034    0.039  0.039  4.1    0.039  0.040  3.5    0.039  0.039  3.8     0.039  0.039  3.2
A2          19.6%      0.098  0.109    0.664  0.680  4.2    0.665  0.679  3.8    0.047  0.044  3.0     0.047  0.045  2.6
A3          17.4%      0.798  0.834    0.083  0.085  11.7   0.083  0.085  8.0    0.060  0.063  10.3    0.059  0.064  7.0
(ii) Under H1: β1 = β2 < β3
A1          26.6%      0.962  0.971    0.942  0.955  4.1    0.940  0.955  3.5    0.942  0.955  3.8     0.941  0.955  3.2
A2          24.4%      0.998  0.998    1.000  1.000  4.2    1.000  1.000  3.8    0.921  0.936  2.9     0.920  0.935  2.6
A3          23.4%      1.000  1.000    0.854  0.866  11.7   0.853  0.864  8.0    0.847  0.871  10.3    0.846  0.870  7.0
(iii) Under H2: β1 < β2 < β3
A1          24.6%      0.900  0.912    0.871  0.884  4.1    0.869  0.883  3.5    0.972  0.884  3.8     0.870  0.883  3.2
A2          21.8%      0.999  0.999    1.000  1.000  4.2    1.000  1.000  3.8    0.964  0.973  3.0     0.964  0.972  2.6
A3          21.2%      1.000  1.000    0.897  0.839  11.7   0.898  0.839  8.1    0.891  0.842  10.3    0.892  0.844  7.0

Under the (A2) and (A3) allocation models, the decision tree method controls the type I error accurately when step (III) is incorporated. From the average number of strata D, as expected, step (III) greatly decreases the number of strata for model (A2), and step (II) greatly decreases the number of strata for model (A3).

Under H1 and H2, the empirical powers of the cases where the type I error rate is accurately controlled are boldfaced. The Dunnett-type test is slightly more powerful than the ANOVA-type test in most of the simulation settings, except for model (A3) under H2. If a multinomial logistic allocation model is valid, then the weighted rank tests based on the multinomial logistic regression method have slightly higher power than those based on the decision tree method.

A reviewer suggested adding simulations with more general survival distributions, such as the Weibull, comparisons among K = 4 treatment groups, and a setting similar to that of the real data example presented below. We generate m = 4 correlated covariates as follows. First, a random vector $(u_1,u_2,u_3,u_4)$ is generated from the multivariate normal distribution with marginal means 0, variances 1, and an exchangeable dependency with correlation coefficient 0.5. These random variables are transformed as $z_1 = u_1$, $z_2 = \langle u_2\rangle$, $z_3 = \Phi(u_3)$, and $z_4 = I(u_1 \le 0.5)$, where $\langle a\rangle$ denotes $a$ rounded down. Note that, marginally, $z_3$ is U(0,1) and $z_4$ is a Bernoulli(0.5) random variable. The treatment selection follows a multinomial logistic model with $\beta_{0k} = 0$, $\beta_{1k} = \beta_{2k} = 0.3$, and $\beta_{3k} = \beta_{4k} = 0.2$ for k = 1, 2, 3. By this multinomial logistic model, about 25.8% of patients are allocated to each of groups 1, 2, and 3, and the remaining 22.5% are allocated to the control group 4. The survival time for a patient in treatment group k with covariate values $(z_1,z_2,z_3,z_4)$ is generated from the Weibull hazard function given as

$$\lambda(t\,|\,z) = \nu\lambda_0(\lambda_0 t)^{\nu-1}\exp(\beta_k + \theta_1 z_1 + \theta_2 z_2 + \theta_3 z_3 + \theta_4 z_4)$$

with $\theta_1 = \theta_2 = \theta_3 = \theta_4 = 0.25$ and $\nu = 0.5$. We consider a null hypothesis $H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$, and alternative hypotheses $H_1: (\beta_1,\beta_2,\beta_3,\beta_4) = (-0.65,-0.65,-0.65,0)$ and $H_2: (\beta_1,\beta_2,\beta_3,\beta_4) = (-0.75,-0.5,-0.25,0)$. Under each hypothesis, $\lambda_0$ is selected to give about 33% censoring with the U(2,7) censoring distribution. The other simulation parameters are set identically to those of Table 2(a) for K = 3. Table 2(b) reports simulation results with n = 400. We observe that, under $H_0$, the weighted ANOVA and Dunnett tests control the type I error rate close to the nominal level α = 0.05 using multinomial logistic regression and the decision tree with its different options, while the unweighted rank tests are severely anticonservative. Using multinomial logistic regression, the weighted ANOVA test is slightly more powerful than the weighted Dunnett test under both $H_1$ and $H_2$, while, using the decision tree method, the weighted Dunnett test is slightly more (less) powerful than the weighted ANOVA test under $H_1$ ($H_2$). The decision tree method has similar power across its options and is slightly less powerful than the multinomial logistic regression method.

Table 2.(b).

When K = 4 (including results for unweighted ANOVA and Dunnett tests)

Hypothesis  Unweighted     Multinomial    Tree (I)            Tree (I+II)         Tree (I+II+III)
            ANO    Dun     ANO    Dun     ANO    Dun    D     ANO    Dun    D     ANO    Dun    D
H0          0.123  0.153   0.037  0.037   0.040  0.047  6.4   0.043  0.045  5.0   0.039  0.038  3.8
H1          0.764  0.815   0.948  0.966   0.867  0.909  6.5   0.866  0.904  5.0   0.866  0.894  3.9
H2          0.857  0.796   0.947  0.957   0.904  0.888  6.3   0.897  0.886  4.9   0.896  0.879  3.8

4.2. A Real Data Example

The proposed methods are applied to a real clinical observational study. Postoperative analgesic methods are suggested to have an impact on long-term prognosis after cancer surgery through opioid-induced immune suppression. Lee et al. (2017) report analysis results comparing the recurrence-free survival (RFS), defined as the time from surgery to cancer recurrence or death, among three analgesic methods (K = 3): intravenous patient controlled analgesia (PCA, group 1), paravertebral block (PVB, group 2), and thoracic epidural analgesia (TEA, group 3), for lung cancer surgery. Excluding cases with missing data, our analysis includes a total of 363 patients (111 for PCA, 137 for PVB, and 115 for TEA), among whom 111 patients had disease recurrence. We consider four covariates, body mass index (BMI), smoking, cancer stage, and blood transfusion during surgery (BT), that are known to be important predictors for the outcome in this study population.

We estimate the weights by the multinomial logistic regression and decision tree methods. Table 3 reports the regression estimates, standard errors, and p-values from the multinomial logistic propensity analysis. Since the regression estimates for smoking from the two logistic regressions have negative signs, smokers tend to belong to the TEA group. Figure 1 shows the strata identified by the decision tree method. In the decision tree analysis, we used α1 = 0.2 for a slightly finer classification and α2 = 0.3 for pooling. In Figure 1, each leaf represents a stratum of patients with similar propensity scores, and the frequencies of the three groups within each stratum are given in a box. The p-value at each branch is from the χ² test with 2 (= (3−1)×(2−1)) degrees of freedom for classification or pooling. The classification step identified 8 strata, but two of them are combined during the pooling step, resulting in a total of 7 strata to be used for comparing RFS among the three treatment groups. Figure 2 reports the weighted Kaplan-Meier curves based on multinomial logistic regression (Figure 2a) and the decision tree (Figure 2b). The IPW method using the multinomial logistic regression for testing the three groups gives p-value = 0.040 by the ANOVA test, and FWER-adjusted p-values = 0.022 between PCA and TEA and 0.704 between PVB and TEA by the Dunnett test. On the other hand, the IPW method using the decision tree gives p-value = 0.025 by the ANOVA test, and FWER-adjusted p-values = 0.011 between PCA and TEA and 0.469 between PVB and TEA by the Dunnett test. From these analyses, it seems that, compared to TEA, PCA has significantly longer RFS, while PVB seems to have similar RFS. All p-values in this analysis are 2-sided.

Table 3:

Regression estimates (EST), standard errors (SE) and p-values (PVAL) of logistic regression models for propensity analysis using TEA group as the reference group

PCA vs. TEA PVB vs. TEA
Parameter EST SE PVAL EST SE PVAL
Intercept −0.636 0.690 0.178 1.039 0.5766 0.036
BMI (<18.5 vs. 18.5–25 vs. ≥ 25) 0.235 0.275 0.197 −0.336 0.240 0.081
Smoking (yes/no) −0.706 0.340 0.019 −0.631 0.303 0.019
Stage (<3 vs. ≥ 3) −0.973 0.351 0.003 0.134 0.276 0.313
BT (yes/no) 1.353 0.316 0.000 −0.034 0.314 0.457

Figure 1.

Decision tree analysis of Lee et al. (2017). Each leaf, denoting a stratum, shows the frequencies of the PCA, PVB, and TEA treatment groups.

Figure 2.

Weighted Kaplan-Meier curves: (a) using multinomial logistic regression; (b) using the decision tree method.

We also analyze the data after balanced (1-to-1-to-1) matching. Using the multinomial logistic regressions, we partitioned the propensity scores of each group into J1 = J2 = 3 intervals, resulting in 234 matched observations. From Figure 1, the seven strata from the decision tree method result in 240 matched observations when all patients of the smallest group within each stratum are kept. From Table 4, we observe that the distributions of the predictors are well balanced among the three groups using both the multinomial regression and decision tree methods. Figure 3 displays the Kaplan-Meier curves of RFS among the three matched groups using the multinomial regression and decision tree methods. We observe that the Kaplan-Meier curves for matched data using the multinomial logistic regression method (Figure 3a) are similar to the weighted Kaplan-Meier curves of Figure 2a, while those of the PVB and TEA groups using the decision tree method (Figure 3b) are more separated than the corresponding weighted Kaplan-Meier curves in Figure 2b. Using the propensity scores from multinomial logistic regression, the matched data give p-value = 0.066 by the (unweighted) ANOVA test (Jung & Hui 2002), and FWER-adjusted p-values = 0.052 between PCA and TEA and 0.979 between PVB and TEA by the (unweighted) Dunnett test (Jung et al. 2008). Using the propensity analysis from the decision tree, the matched data give p-value = 0.057 by the ANOVA test, and FWER-adjusted p-values = 0.029 between PCA and TEA and 0.500 between PVB and TEA by the Dunnett test. With the decreased sample sizes, the unweighted statistical tests using matched data give slightly less significant results than, but similar conclusions to, the weighted tests using the full data.

Table 4:

Matched data: distributions of covariates among three patient groups and p-values from χ2 tests

                         Multinomial Regression              Decision Tree
Covariate                PCA   PVB   TEA   p-value           PCA   PVB   TEA   p-value
BMI       <18.5            4     3     4   0.889               3     3     1   0.861
          18.5–25         55    53    55                      52    50    52
          ≥25             27    22    19                      25    27    27
Smoking   No              72    68    66   0.815              66    65    63   0.828
          Yes             14    10    12                      14    15    17
Stage     <3              66    58    59   0.939              62    62    62   1.000
          ≥3              20    20    19                      18    18    18
BT        No              62    61    53   0.351              58    60    60   0.917
          Yes             24    17    25                      22    20    20

Figure 3.

Kaplan-Meier curves of matched data: (a) based on propensity analysis using multinomial logistic regression; (b) based on propensity analysis using the decision tree method.

The clinical study was designed to show that, compared to the local anesthetic based analgesic methods (TEA and PVB), the opioid based analgesic method (PCA) would decrease RFS. Because of the high failure rate and unstable hemodynamics frequently observed with the local anesthetic based methods, however, these methods were shown to have shorter RFS than the opioid based method.

4.3. Discussions and Conclusions

We have investigated K-group comparisons for observational studies with a survival endpoint. The weighted rank tests proposed in this paper can be easily modified for other types of endpoints, such as binary or continuous types. In order to estimate the weights (propensity scores) of the rank tests, we reviewed the multinomial logistic regression and proposed a decision tree method as an alternative.

Through simulations, the proposed weighted rank tests were shown to perform well as long as the weights were accurately estimated by the propensity analysis methods. We found that the decision tree method provided very robust propensity scores overall if it incorporates a step to discard strata with zero frequency for any treatment group, which ensures comparability among the K treatment groups for all strata included in the analysis. The popular multinomial logistic model was found to provide powerful and robust propensity scores if the allocation proportion to each group is monotone in the predictors and there is no range of predictor values with zero frequency for any group. When there exists a range with no observations for some treatment groups, Zanutto et al. (2005) propose to find a common support of the (K − 1)-dimensional propensity scores and discard the observations not in the common support. In a real data analysis, however, it is not always easy to identify a common support since the estimated propensity scores are discrete in nature. We can do this if we partition the range of propensity scores and define strata as in our real data analysis.

Zhu & Lu (2015) propose a Dunnett-type test derived from a Cox regression model

$$\lambda_j(t) = \lambda_{0j}\exp(\beta_1 x_1 + \cdots + \beta_{K-1} x_{K-1})$$

for patients in propensity stratum j with covariates $(x_1,\ldots,x_{K-1})$, where $x_k = 1$ if a patient belongs to treatment group $k$ and $x_k = 0$ otherwise. They claim that $\beta_k$ measures the difference in survival distribution between group $k$ and the control group $K$, but this is not true. Actually, $\beta_k$ measures the difference between group $k$ and the remaining $K-1$ groups combined. Furthermore, they assume a specific covariance structure, called the one-factor structure, among the regression estimators to derive the multiplicity-adjusted critical value $c$ using Hsu's (1992) approach. Our method does not require this assumption, which may not hold for a given data set.

We have focused on the IPW method in this paper, but one may want to use standard (unweighted) test statistics with matched data. In this case, the decision tree method automatically and optimally defines strata, which makes data matching easy, while the multinomial logistic regression method requires an additional step to define strata. In the analysis of our example, we simply partitioned each dimension of the propensity scores so that each interval has similar propensity scores, but we could use a more technical (K − 1)-dimensional clustering method, such as a decision tree or machine learning, e.g. Westreich et al. (2010), Abadie and Imbens (2006), and Lee et al. (2009). The computer programs are written in Fortran and are available upon request from the first author.

Appendices: Asymptotic Distribution of Weighted Log-Rank Tests

In the following appendices, we use the notation $\psi_k = \lim_{n\to\infty} w_k/n$, $\psi = \lim_{n\to\infty} w/n = \sum_{k=1}^K\psi_k$, and $\bar\psi_k = \lim_{n\to\infty}\bar w_k/n$.

Appendix A. An ANOVA-type Test

Under H0, we have

$$W_k = \frac{\sqrt n}{w_k}\int_0^\infty Y_k(t)\{d\hat\Lambda_k(t) - d\hat\Lambda(t)\} = \frac{\sqrt n}{w_k}\left\{\int_0^\infty dM_k(t) - \int_0^\infty\frac{Y_k(t)}{Y(t)}\,dM(t)\right\} = \frac{\sqrt n}{w_k}\sum_{l=1}^K\int_0^\infty\left\{\xi_{kl} - \frac{Y_k(t)}{Y(t)}\right\}\sum_{i=1}^{n_l} dM_{li}(t).$$

Note that $\{M_{ki}(t), i = 1,\ldots,n_k, k = 1,\ldots,K\}$ are independent 0-mean martingales under $H_0$, and $R_k(t) = Y_k(t)/Y(t)$ is a predictable process that converges uniformly to $r_k(t) = \psi_k S_k(t)/\sum_{l=1}^K\psi_l S_l(t)$. Hence, by the martingale central limit theorem (e.g. Fleming & Harrington 1991), $W = (W_1,\ldots,W_{K-1})^T$ is asymptotically normal with mean $0$ and variance $V = (v_{k,k'})_{(K-1)\times(K-1)}$, where

$$v_{k,k'} = \frac{1}{\psi_k\psi_{k'}}\sum_{l=1}^K\int_0^\infty\{\xi_{kl} - r_k(t)\}\{\xi_{k'l} - r_{k'}(t)\}\,\bar y_l(t)\,d\Lambda(t).$$

Here $\bar y_k(t) = \lim_{n\to\infty}\bar Y_k(t)/n = \bar\psi_k S_k(t)G(t)$, and $G(t)$ is the survivor function of the censoring distribution.

$V$ can be consistently estimated by replacing $\psi_k$, $\bar y_k(t)$, $r_k(t)$ and $\Lambda(t)$ with their consistent estimators $w_k/n$, $\bar Y_k(t)/n$, $R_k(t)$ and $\hat\Lambda(t)$, respectively, i.e. $\hat V = (\hat v_{k,k'})_{(K-1)\times(K-1)}$ with

$$\hat v_{k,k'} = \frac{n}{w_k w_{k'}}\sum_{l=1}^K\int_0^\infty\left\{\xi_{kl} - \frac{Y_k(t)}{Y(t)}\right\}\left\{\xi_{k'l} - \frac{Y_{k'}(t)}{Y(t)}\right\}\bar Y_l(t)\,d\hat\Lambda(t).$$

Appendix B. A Dunnett-Type Test

Under H0 : Λ1(t) = ⋯ = ΛK(t)(= Λ(t)), for k = 1,…,K − 1, we have

$$U_k = \frac{\sqrt n\,(w_k+w_K)}{w_k w_K}\int_0^\infty\frac{Y_k(t)Y_K(t)}{Y_k(t)+Y_K(t)}\left\{\frac{dM_K(t)}{Y_K(t)} - \frac{dM_k(t)}{Y_k(t)}\right\},$$

which can be expressed as

$$U_k = \frac{\psi_k+\psi_K}{\sqrt n\,\psi_k\psi_K}\int_0^\infty\frac{y_k(t)y_K(t)}{y_k(t)+y_K(t)}\left\{\frac{dM_K(t)}{y_K(t)} - \frac{dM_k(t)}{y_k(t)}\right\} + o_p(1). \qquad (A1)$$

Using the martingale central limit theorem, we can show that, under H0, Uk is asymptotically normal with mean 0 and variance

$$\sigma_k^2 = \frac{(\psi_k+\psi_K)^2}{\psi_k^2\psi_K^2}\int_0^\infty\frac{y_k^2(t)\bar y_K(t) + y_K^2(t)\bar y_k(t)}{\{y_k(t)+y_K(t)\}^2}\,d\Lambda(t)$$

which can be consistently estimated by

$$\hat\sigma_k^2 = \frac{n(w_k+w_K)^2}{w_k^2 w_K^2}\int_0^\infty\frac{Y_k^2(t)\bar Y_K(t) + Y_K^2(t)\bar Y_k(t)}{\{Y_k(t)+Y_K(t)\}^2}\,d\hat\Lambda_{(k)}(t)$$

by replacing the parameters with their estimators obtained using the data of groups k and K.

By the multivariate martingale central limit theorem applied to (A1), $(U_1,\ldots,U_{K-1})$ is approximately normal with mean $0$ and variance-covariance matrix $\Sigma = (\sigma_{kk'})_{(K-1)\times(K-1)}$ with $\sigma_{kk} = \sigma_k^2$ and

$$\sigma_{kk'} = \frac{(\psi_k+\psi_K)(\psi_{k'}+\psi_K)}{\psi_k\psi_{k'}\psi_K^2}\int_0^\infty\frac{y_k(t)y_{k'}(t)\bar y_K(t)}{\{y_k(t)+y_K(t)\}\{y_{k'}(t)+y_K(t)\}}\,d\Lambda(t)$$

for kk′. Under H0, we have yk(t) = ψkS(t)G(t) and y¯k(t)=ψ¯kS(t)G(t), where S(t) = exp{−Λ(t)} denotes the common survivor function. Hence, under H0, we have

$$\sigma_k^2 = -\left(\frac{\bar\psi_k}{\psi_k^2} + \frac{\bar\psi_K}{\psi_K^2}\right)\int_0^\infty G(t)\,dS(t)$$

and

$$\sigma_{kk'} = -\frac{\bar\psi_K}{\psi_K^2}\int_0^\infty G(t)\,dS(t)$$

for 1 ≤ k < k′ ≤ K − 1, so that the correlation coefficient between Uk and Uk′ is given as

$$\rho_{kk'} = \frac{\psi_k\psi_{k'}\bar\psi_K}{\sqrt{(\psi_K^2\bar\psi_k + \psi_k^2\bar\psi_K)(\psi_K^2\bar\psi_{k'} + \psi_{k'}^2\bar\psi_K)}}.$$

Note that the correlation coefficients depend only on the weights, but not on the censoring or survival distribution.

By replacing $\psi_k$ and $\bar\psi_k$ with their consistent estimators $w_k/n$ and $\bar w_k/n$, respectively, we obtain a consistent estimator of $\rho_{kk'}$,

$$\hat\rho_{kk'} = \frac{w_k w_{k'}\bar w_K}{\sqrt{(w_K^2\bar w_k + w_k^2\bar w_K)(w_K^2\bar w_{k'} + w_{k'}^2\bar w_K)}}$$

for 1 ≤ k < k′ ≤ K − 1.

REFERENCES

  1. Aalen OO (1978). Nonparametric inference for a family of counting processes. Annals of Statistics, 6, 701–726.
  2. Abadie A & Imbens GW (2006). Large sample properties of matching estimators for average treatment effects. Econometrica, 74, 235–267.
  3. Bang HJ, Jung SH, & George SL (2005). A simulation-based multiple testing procedure and sample size calculation. Journal of Biopharmaceutical Statistics, 15, 957–967.
  4. Breslow NE, Lumley T, Ballantyne CM, Chambless LE, & Kulich M (2009). Using the whole cohort in the analysis of case-cohort data. American Journal of Epidemiology, 169, 1398–1405.
  5. Cole SR & Hernan MA (2004). Adjusted survival curves with inverse probability weights. Computer Methods and Programs in Biomedicine, 75, 45–49.
  6. Curtis LH, Hammill BG, Eisenstein EL, Kramer JM, & Anstrom KJ (2007). Using inverse probability-weighted estimators in comparative effectiveness analyses with observational databases. Medical Care, 45, S103–S107.
  7. Dunnett CW (1955). A multiple comparison procedure for comparing several treatments with a control. Journal of the American Statistical Association, 50, 1096–1121.
  8. Fleming TR & Harrington DP (1991). Counting Processes and Survival Analysis. Wiley: New York.
  9. Galimberti S, Sasieni P, & Valsecchi MG (2002). A weighted Kaplan-Meier estimator for matched data with application to the comparison of chemotherapy and bone-marrow transplant in leukaemia. Statistics in Medicine, 21, 3847–3864.
  10. Gassmann HI, Deak I, & Szantai T (2002). Computing multivariate normal probabilities: A new look. Journal of Computational and Graphical Statistics, 11, 920–949.
  11. Genz A & Bretz F (2000). Numerical computation of critical values for multiple comparison problems. ASA Proceedings of the Sections on Statistical Computing and Statistical Graphics, 84–87.
  12. Hsu JC (1992). The factor analytic approach to simultaneous inference in the general linear model. Journal of Computational and Graphical Statistics, 1, 151–168.
  13. Imbens GW (2000). The role of the propensity score in estimating dose-response functions. Biometrika, 87, 706–710.
  14. Jung SH, Bang H, & Young S (2005). Sample size calculation for multiple testing in microarray data analysis. Biostatistics, 6, 157–169.
  15. Jung SH, Chen Y, & Ahn H (2014). Type I error control for tree classification. Cancer Informatics, 13, 11–18.
  16. Jung SH & Hui S (2002). Sample size calculations to compare K different survival distributions. Lifetime Data Analysis, 8, 361–373.
  17. Jung SH, Kim C, & Chow SC (2008). Sample size calculation for the log-rank tests for multi-arm trials with a common control. Journal of the Korean Statistical Society, 37, 11–22.
  18. Kaplan EL & Meier P (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53, 457–481.
  19. Lee BK, Lessler J, & Stuart EA (2009). Improving propensity score weighting using machine learning. Statistics in Medicine, 29, 337–346.
  20. Lee EK, Ahn HJ, Zo JI, Kim K, Jung DM, & Park JH (2017). Paravertebral block does not reduce cancer recurrence, but is related to higher overall survival in lung cancer surgery: A retrospective cohort study. Anesthesia and Analgesia, 125, 1322–1328.
  21. Lopez MJ & Gutman R (2017). Estimation of causal effects with multiple treatments: a review and new ideas. arXiv:1701.05132 [stat.ME].
  22. McCaffrey DF, Griffin BA, Almirall D, Slaughter ME, Ramchand R, & Burgette LF (2013). A tutorial on propensity score estimation for multiple treatments using generalized boosted models. Statistics in Medicine, 32, 3388–3414.
  23. Nelson W (1969). Hazard plotting for incomplete failure data. Journal of Quality Technology, 1, 27–52.
  24. Peto R & Peto J (1972). Asymptotically efficient rank invariant test procedures (with discussion). Journal of the Royal Statistical Society, Series A, 135, 185–206.
  25. Pruzek RM & Cen L (2002). Propensity score analysis with graphics: A comparison of two kinds of gallbladder surgery. Paper presented at the annual meeting of the Society for Multivariate Experimental Psychology, Charlottesville, VA.
  26. Rosenbaum PR & Rubin DB (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55.
  27. Rosenbaum PR & Rubin DB (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79, 516–524.
  28. Rubin DB (1998). Estimation from nonrandomized treatment comparisons using subclassification on propensity scores. In Nonrandomized Comparative Clinical Studies, ed. Abel U and Koch A, 85–100. Dusseldorf, Germany: Symposion.
  29. Stone RA, Obrosky DS, Singer DE, Kapoor WN, & Fine MJ (1995). Propensity score adjustment for pretreatment differences between hospitalized and ambulatory patients with community-acquired pneumonia. Medical Care, 33, AS56–66.
  30. Westreich D, Lessler J, & Funk MJ (2010). Propensity estimation: neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. Journal of Clinical Epidemiology, 63, 826–833.
  31. Xie J & Liu C (2005). Adjusted Kaplan-Meier estimator and log-rank test with inverse probability of treatment weighting for survival data. Statistics in Medicine, 24, 3089–3110.
  32. Zanutto E, Lu B, & Hornik R (2005). Using propensity score subclassification for multiple treatment doses to evaluate a national antidrug media campaign. Journal of Educational and Behavioral Statistics, 30, 59–73.
  33. Zhu H & Lu B (2015). Multiple comparisons for survival data with propensity score adjustment. Computational Statistics and Data Analysis, 86, 42–51.
