Biometrics. 2002 Mar;58(1):21–29. doi: 10.1111/j.0006-341x.2002.00021.x

Principal Stratification in Causal Inference

Constantine E Frangakis 1, Donald B Rubin 2

Summary

Many scientific problems require that treatment comparisons be adjusted for posttreatment variables, but the estimands underlying standard methods are not causal effects. To address this deficiency, we propose a general framework for comparing treatments adjusting for posttreatment variables that yields principal effects based on principal stratification. Principal stratification with respect to a posttreatment variable is a cross-classification of subjects defined by the joint potential values of that posttreatment variable under each of the treatments being compared. Principal effects are causal effects within a principal stratum. The key property of principal strata is that they are not affected by treatment assignment and therefore can be used just as any pretreatment covariate, such as age category. As a result, the central property of our principal effects is that they are always causal effects and do not suffer from the complications of standard posttreatment-adjusted estimands. We discuss briefly that such principal causal effects are the link between three recent applications with adjustment for posttreatment variables: (i) treatment noncompliance, (ii) missing outcomes (dropout) following treatment noncompliance, and (iii) censoring by death. We then attack the problem of surrogate or biomarker endpoints, where we show, using principal causal effects, that all current definitions of surrogacy, even when perfectly true, do not generally have the desired interpretation as causal effects of treatment on outcome. We go on to formulate estimands based on principal stratification and principal causal effects and show their superiority.

Keywords: Biomarker, Causal inference, Censoring by death, Missing data, Posttreatment variable, Principal stratification, Quality of life, Rubin causal model, Surrogate

1. Background

Decisions in medicine, public health, and social policy depend critically on appropriate evaluation of competing treatments and policies. The extraction of information about such comparisons, which we can broadly view as causal inference, has been a growing area of statistical research in recent years. A statistical framework for causal inference that has received especially increasing attention is the one based on potential outcomes, originally introduced by Neyman (1923) for randomized experiments and randomization-based inference and generalized and extended by Rubin (1974, 1977, 1978) for nonrandomized studies and alternative forms of inference. Fundamentally, in this framework, often termed Rubin’s causal model (Holland, 1986), a unit (e.g., a patient) is considered at a particular place and time; treatments are interventions, each of which can be potentially applied to each unit; and potential outcomes are all the outcomes that would be observed when each of the treatments would be applied to each of the units. Then a causal comparison between, say, two treatments is a comparison of the potential outcomes of the same group of units under the two treatment conditions.

A major difference between the potential outcomes and other frameworks for causal inference (e.g., simultaneous equations; Goldberger, 1972; Heckman, 1978) is that, in the former, the definition of causal effects is separated from any probability models about the way in which units are assigned to treatments, namely the assignment mechanism (Rubin, 1978), and this separation is regarded broadly (though not uniformly; cf., Dawid, 2000) as useful. This clarifying role of potential outcomes has been important in research, including, e.g., the earlier works on the concept of ignorable assignment (Rubin, 1974, 1977, 1978), propensity scores (Rosenbaum and Rubin, 1983a), the concept of sequential ignorability and associated methods (Rubin, 1978; Robins, 1986), and others. More recently, methods also became available to address treatment noncompliance using potential outcomes, starting mainly with work by Baker and Lindeman (1994), Imbens and Rubin (1994), Robins and Greenland (1994), Angrist, Imbens, and Rubin (1996), and are currently receiving even more attention (e.g., Frangakis and Rubin, 1999; Hirano et al., 2000), although earlier related work was discussed by Sommer and Zeger (1991).

In Section 2, we discuss the more general problem of how to formulate comparisons of treatments adjusting for a posttreatment outcome variable that is not the primary endpoint. We document that the current estimands called net-treatment comparisons are not causal effects, as noted by Rosenbaum (1984). We also discuss that other current estimands in this problem (e.g., Robins and Greenland, 1992) assume the posttreatment variable is controllable and thus are difficult to interpret when the posttreatment variable is not directly controlled.

In Section 3, we present a general framework for comparing treatments where the estimands are adjusted for posttreatment variables and yet are always causal effects, i.e., principal effects using principal stratification. A principal stratification with respect to a posttreatment variable is a cross-classification of the units based on their joint potential values of that variable under each of the treatments being compared. Principal effects are comparisons of treatments within principal strata. The key property of a principal stratification is that it is not affected by treatment and therefore can be used as a pretreatment covariate. Thus, the central property of our principal effects is that they are always causal effects. In Section 4, we discuss briefly that principal causal effects link three recent applications.

In Section 5, we discuss the problem of surrogate endpoints and show, using principal effects, that all current definitions of surrogacy, even when true, do not generally define causal effects of treatment on outcome. We go on to formulate estimands based on principal stratification and principal effects and show their superiority. Section 6 provides remarks and directions for further research; throughout, we focus on the fundamental issue of definition of estimands rather than methods of estimation.

2. Adjusting Causal Effects for Posttreatment Variables: Goal and Standard Approaches

Consider a group of units i = 1, …, n where each can be potentially assigned either a standard treatment (z = 1) or a new treatment (z = 2). (For more treatments, see Section 6.) The objective is to measure an outcome Y (e.g., survival status) at a specific time after assignment of each unit. Let Yi(z) be the value of Y if unit i is assigned treatment z for z = 1, 2. Then a causal effect of assignment on the outcome Y is defined to be a comparison between the ordered sets of potential outcomes on a common set of units, e.g., a comparison between

\[
\{Y_i(1) : i \in \mathrm{set}_1\} \quad \text{and} \quad \{Y_i(2) : i \in \mathrm{set}_2\}, \qquad (2.1)
\]

provided the groups of units, set1 and set2, being compared are identical (Neyman, 1923; Rubin, 1974, 1978). Examples include a comparison of the means of Yi(1) and Yi(2) or the median of Yi(2) − Yi(1) for i = 1, …, n. The potential outcomes and the causal effects are generally not all observable, even with random assignment, although such assignment simplifies estimation. When additional covariates are measured prior to the assignment, comparisons in the subgroup of units with a given covariate value describe subgroup causal effects of the assignment. With no loss of generality and to avoid extra notation, we will subsequently assume we are already within cells defined by observed pretreatment variables and will ignore the issue of sampling of units from a population.

In many types of studies, after each unit i gets assigned treatment Zi, a posttreatment variable Si^obs is measured, in addition to measuring the main outcome Y. For simplicity of notation, we assume the variable Si^obs is binary (e.g., 1 = low, 2 = high), although our approach can be immediately extended (cf., Section 6). Important types of studies where such posttreatment variables arise include

  • clinical trials, where a posttreatment variable S^obs is a measure of subjects’ compliance with the originally assigned treatment,

  • studies with long follow-up, where whether or not the subject drops out is a posttreatment variable (missingness of outcome),

  • studies where the outcome intended to be recorded can be censored by death,

  • studies comparing drugs for AIDS patients, where surrogate markers of progression, such as CD4 count and measures of viral load (Prentice, 1989; Lin, Fleming, and De Gruttola, 1997; Buyse et al., 2000), are posttreatment variables.

The first three are discussed briefly in Section 4 and the fourth is discussed at length in Section 5.

The variable S^obs generally encodes characteristics of the unit as well as of the treatment. For instance, in the example of clinical trials above, posttreatment noncompliance encodes information both about efficacy (the effect of taking the treatment) and about the compliance behavior of individual subjects. In such cases, an important study goal, and our objective, is to compare the effects of treatments on Y “after adjusting” for the posttreatment characteristics in a way that the adjusted estimands are causal effects.

A standard method adjusts for the posttreatment variable using a comparison (e.g., difference in means) between the distributions

\[
\mathrm{pr}\{Y_i^{\mathrm{obs}} \mid S_i^{\mathrm{obs}} = s,\; Z_i = 1\}
\]

and

\[
\mathrm{pr}\{Y_i^{\mathrm{obs}} \mid S_i^{\mathrm{obs}} = s,\; Z_i = 2\}, \qquad (2.2)
\]

where Yi^obs = Yi(Zi), the observed outcome. Comparison (2.2) is called the net-treatment effect of assignment Z adjusting for the posttreatment variable S^obs (Cochran, 1957; Rosenbaum, 1984) and compares outcomes under standard versus new treatment for subjects who got a common value s (e.g., s = high) of S^obs. For example, the current definitions for a surrogate endpoint by Prentice (1989), Freedman, Graubard, and Schatzkin (1992), Lin et al. (1997, equation (2)), Buyse and Molenberghs (1998), and Buyse et al. (2000) are all based on regressions in the sense of (2.2). The key to understanding such adjustments is to recognize that Si^obs is Si(Zi), i.e., the observed value of one of the two potential values Si(1), Si(2), depending on treatment assignment.

Assume for simplicity the condition that the treatment assignment Zi is completely randomized, i.e., pr(Zi = 1 | Si(1), Si(2), Yi(1), Yi(2)) is a common constant across subjects. Then the net treatment comparison (2.2) is equivalent to the comparison between

\[
\mathrm{pr}\{Y_i(1) \mid S_i(1) = s\} \quad \text{and} \quad \mathrm{pr}\{Y_i(2) \mid S_i(2) = s\}. \qquad (2.3)
\]

Comparison (2.2) is problematic if the treatment has any effect on the posttreatment variable (Rosenbaum, 1984) because the groups {i : Si(1) = s} (i.e., who get posttreatment value s under standard treatment) and {i : Si(2) = s} (i.e., who get posttreatment value s under new treatment) are not the same groups of subjects. Then, according to definition (2.1), the comparison (2.2) is not a causal effect. This concern is known to epidemiologists as posttreatment selection bias in estimating causal effects (cf., Rosenbaum, 1984; Robins and Greenland, 1992).
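To see the selection issue numerically, the following minimal sketch (with hypothetical stratum proportions and outcome means of our own choosing, not data from any study cited here) constructs units whose joint potential values (Si(1), Si(2)) are known. The two sides of (2.3) condition on different sets of units, so the net comparison can look harmful even though no unit's outcome is lowered by the new treatment, whereas the comparison within the group with Si(1) = Si(2) = L (anticipating the principal strata of Section 3) compares the same units under both treatments.

```python
# Minimal numerical sketch (hypothetical numbers) of why comparison (2.2)/(2.3)
# is not a causal effect when treatment affects the posttreatment variable S.
import numpy as np

rng = np.random.default_rng(0)
n = 30_000

# Basic principal strata (S(1), S(2)): "LL", "LH", "HH"; no "HL" units for simplicity.
strata = rng.choice(["LL", "LH", "HH"], size=n, p=[0.3, 0.4, 0.3])
S1 = np.where(strata == "HH", "H", "L")          # S under standard treatment
S2 = np.where(strata == "LL", "L", "H")          # S under new treatment
mean_Y1 = {"LL": 10.0, "LH": 20.0, "HH": 40.0}   # mean Y(1) by stratum
mean_Y2 = {"LL": 10.0, "LH": 30.0, "HH": 40.0}   # mean Y(2): never lower than Y(1)
Y1 = np.array([mean_Y1[g] for g in strata]) + rng.normal(0, 1, n)
Y2 = np.array([mean_Y2[g] for g in strata]) + rng.normal(0, 1, n)

# Net-treatment comparison (2.3) at s = "L": the two sides condition on DIFFERENT units.
net = Y2[S2 == "L"].mean() - Y1[S1 == "L"].mean()
print(f"net comparison at s = L: {net:+.1f}")    # about -5.7: looks like harm

# Comparison within the common group {i : S(1) = S(2) = L}: same units on both sides.
within = Y2[strata == "LL"].mean() - Y1[strata == "LL"].mean()
print(f"comparison within the L,L group: {within:+.1f}")   # about 0.0
```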

Potential values Si(1) and Si(2) were also used by Robins and Greenland (1992) (RG), but, like Rosenbaum (1984), RG did not use those values to define causal effects adjusted for the posttreatment variable. Instead, RG used a framework where both the treatment and the posttreatment variable are controllable, and defined a priori counterfactual values of outcomes Y that would have been observed under assignment to treatment z and if the posttreatment variable somehow were simultaneously forced to attain a value s. This framework, with its a priori counterfactual estimands, is not compatible with the studies we consider, which do not directly control the posttreatment variable. Specifically, most of the values of outcomes in this framework are not just unobserved existent values, but are nonexistent (a priori counterfactual) in the studies we consider. For example, consider a subject who, when assigned the standard treatment, yields a low value of the posttreatment CD4: for that subject, the value of the outcome Y if assignment to standard treatment were to yield a high value of the posttreatment CD4 is nonexistent (i.e., a priori counterfactual) in the study (see also Section 5.2). Thus, existing approaches either define adjusted estimands that are not causal effects or rely on a priori counterfactuals that do not exist in the studies we consider.

3. Principal Stratification and Principal Causal Effects

Our proposal for adjustment for the posttreatment variable always generates causal effects because it always compares potential outcomes for a common set of people. Consider all the potential values of the posttreatment variable jointly and construct the following partitions.

Definition: (a) The basic principal stratification P0 with respect to posttreatment variable S is the partition of units i = 1, …, n such that, within any set of P0, all units have the same vector (Si(1), Si(2)).

(b) A principal stratification P with respect to posttreatment variable S is a partition of the units whose sets are unions of sets in the basic principal stratification P0.

An example of a principal stratification P is the partition of subjects into the set whose posttreatment variable is unaffected by treatment in this study (i.e., with Si(2) = Si(1)) and into the remaining subjects (i.e., with Si(2) ≠ Si(1)). It is important to note that, generally, we cannot directly observe the principal stratum to which a subject belongs because we cannot directly observe both Si(1) and Si(2) for any i. For example, a subject with Si(1) = 2 may belong to either stratum {i : Si(1) = 2, Si(2) = 1} or stratum {i : Si(1) = 2, Si(2) = 2}. It is, nevertheless, also important at this stage to act as if we knew both S(1) and S(2) in order to determine which quantities are causal.

Generally, a principal stratification generates the following estimands.

Definition: Let P be a principal stratification with respect to the posttreatment variable S and let Si^P indicate the stratum of P to which unit i belongs. Then a principal effect with respect to that principal stratification is defined as a comparison of potential outcomes under standard versus new treatment within a principal stratum ς in P, i.e., a comparison between the ordered sets

\[
\{Y_i(1) : S_i^{P} = \varsigma\} \quad \text{and} \quad \{Y_i(2) : S_i^{P} = \varsigma\}. \qquad (3.1)
\]

The importance of principal effects derives from their conditioning on principal strata. Although the potential variable Si(1) generally differs from Si(2), the value of the ordered pair (Si(1), Si(2)) is, by definition, not affected by treatment, just like the pair (birth date, gender). Therefore, we have the following.

Property 1: The stratum Si^P, to which unit i belongs, is unaffected by treatment for any principal stratification P.

And, by definition (2.1), we have the following.

Property 2: Any principal effect, as defined in (3.1), is a causal effect.

Expressed in epidemiologists’ terminology, if memberships Si^P were known, stratification of the subjects by Si^P would adjust for the personal characteristics reflected in the posttreatment variable without introducing treatment selection bias, for any principal stratification P.

The standard net-treatment comparisons (2.2) are functions of the basic principal causal effects and the distribution of the corresponding principal strata. Thus, if we have the basic principal causal effects and the counts of units in each basic principal stratum, we learn more, not less, about the problem than if we have only net-treatment comparisons. Moreover, because principal effects are causal effects, their estimation is critical for understanding the process by which treatments act on subjects and, in some situations, is also useful for more reliable generalization of results, as we shall see.
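One way to see this relation, sketched here in our notation for the binary S of Section 2, is to write each arm of (2.3) as a mixture over the basic principal strata by the law of total probability:

\[
\mathrm{pr}\{Y_i(2) \mid S_i(2) = s\}
= \sum_{s'} \mathrm{pr}\{Y_i(2) \mid S_i(1) = s',\, S_i(2) = s\}\,
\frac{\mathrm{pr}\{S_i(1) = s',\, S_i(2) = s\}}{\sum_{s''} \mathrm{pr}\{S_i(1) = s'',\, S_i(2) = s\}},
\]

and similarly for pr{Yi(1) | Si(1) = s}. The conditional distributions on the right are those compared by the basic principal effects within each stratum, and the weights are the proportions of the basic principal strata; the two arms of (2.3) generally weight different strata, which is another way of seeing why their comparison is not a causal effect.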

Setting principal causal effects to be the goal also helps focus the role of inference. Inference about the principal effects, e.g., in P0, requires prediction of the subjects’ missing memberships to the principal strata, as determined by S^mis = {Si(z) : all i, z ≠ Zi}, as well as prediction of the subjects’ missing potential outcomes Y^mis = {Yi(z) : all i, z ≠ Zi}. Specifically, the observed data are H^obs = (Y^obs, S^obs, Z) and the likelihood is

\[
L(H^{\mathrm{obs}}; \theta_S, \theta_Y) = \int\!\!\int \mathrm{pr}\{Z \mid (Y(1), Y(2)), S^{P_0}\} \times \mathrm{pr}(S^{P_0} \mid \theta_S) \times \mathrm{pr}\{(Y(1), Y(2)) \mid S^{P_0}; \theta_Y\}\, dY^{\mathrm{mis}}\, dS^{\mathrm{mis}}, \qquad (3.2)
\]

where θS and θY denote parameters governing the proportions of the basic principal strata and the distribution of potential outcomes in these strata, respectively. In (3.2), omission of the unit subscript i means collection over all subjects in the data, integration over Y^mis operates on the decomposition Y = (Y^obs, Y^mis), and integration over S^mis operates on (S^obs, S^mis) that determine membership to the principal strata. (Note that, in problems where a principal stratum implies that the outcome Y^obs itself is missing, e.g., as in those discussed in Section 4, the likelihood is a modification of (3.2).)

The likelihood function (3.2) can be used for estimation of principal causal effects as functions of θY and θS with either likelihood or Bayesian inference. With no additional assumptions, there is generally no unique maximum likelihood estimate of (θY, θS). Nevertheless, we can often build plausible restrictions to capitalize on the scientific structure of each problem, e.g., using covariates to predict principal strata, including information on dose-response curves within principal strata or information on lag until and length of time for a treatment action based on pharmacokinetics. The framework can also host a combination of estimation with sensitivity analyses for the causal effects, e.g., in the sense of exploring ranges of unobserved quantities, as done, in different contexts, in Rosenbaum and Rubin (1983b), Scharfstein, Rotnitzky, and Robins (1999), and Goetghebeur et al. (2000), and whose extreme application results in the use of bounds (e.g., Manski, 1990; Balke and Pearl, 1997).
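As a concrete illustration of how (3.2) is evaluated, the following sketch assumes a completely randomized binary treatment, binary S, and binary Y; the parameterization (dictionaries pi for θS and p for θY) and the function name are ours, not the authors' software. Each unit contributes a sum over the latent basic principal strata that are consistent with its observed S under its assigned treatment; under complete randomization the assignment factor in (3.2) is a constant and is omitted.

```python
# A sketch of the observed-data likelihood (3.2) for completely randomized binary
# treatment (z = 1, 2), binary S (1 = low, 2 = high), and binary Y (0/1).
import numpy as np

STRATA = [(1, 1), (1, 2), (2, 1), (2, 2)]  # possible values of (S(1), S(2))

def observed_data_loglik(data, pi, p):
    """data: iterable of (z, s_obs, y_obs); pi: dict stratum -> probability;
    p: dict (stratum, z) -> pr(Y = 1 | stratum, z)."""
    ll = 0.0
    for z, s_obs, y_obs in data:
        # Sum over latent strata consistent with the observed S(z) = s_obs.
        unit_like = 0.0
        for stratum in STRATA:
            if stratum[z - 1] != s_obs:
                continue
            py = p[(stratum, z)]
            unit_like += pi[stratum] * (py if y_obs == 1 else 1.0 - py)
        ll += np.log(unit_like)
    return ll

# Illustrative call with arbitrary parameter values:
pi = {(1, 1): 0.4, (1, 2): 0.4, (2, 1): 0.0, (2, 2): 0.2}
p = {(g, z): 0.3 + 0.2 * (z - 1) for g in STRATA for z in (1, 2)}
print(observed_data_loglik([(1, 1, 1), (2, 2, 0)], pi, p))
```

With no restrictions beyond randomization, many values of (pi, p) give the same value of this likelihood, which is the nonidentifiability noted above; context-specific restrictions or sensitivity analyses are what resolve it.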

4. Brief Review of Principal Effects in Three Examples

We briefly review three examples of recently worked problems involving posttreatment variables: (i) treatment noncompliance, (ii) missing outcomes following treatment noncompliance, and (iii) censoring by death.

An example of recent methods for addressing treatment noncompliance appears in Imbens and Rubin (1994, 1997), who reanalyzed a study on vitamin A by Sommer and Zeger (1991). In that study, (a) the controlled intervention was randomization of children to receive vitamin A or not and the outcome was mortality, (b) the uncontrolled posttreatment variable was the actual taking of vitamin A, and (c) interest focused on formulating and estimating the effect of taking versus not taking vitamin A (as opposed to the effect of being assigned or not assigned to take vitamin A). To address (c), Imbens and Rubin (1997) estimated the complier average causal effect (CACE), which is a causal effect of assignment on all the study subjects who would comply if assigned treatment and would comply if assigned control (full compliers). Therefore, this approach to adjusting for noncompliance is a special case of the framework of Section 3, where the full compliers are a stratum in the principal stratification with respect to the posttreatment compliance behavior. In that and related applications dealing with noncompliance, CACE is a special case of a principal effect. Thus, the following comparison of CACE to other estimands when faced with noncompliance shows the strengths of our framework.

First, CACE is, by Property 2, always a well-defined causal effect. In contrast, a standard estimand to evaluate the actual taking of treatment compares the observed outcomes of subjects taking new treatment (vitamin A) to the observed outcomes of subjects taking control, within treatment assignment arm; i.e., it compares pr(Y^obs | S^obs = 2, Z = z) to pr(Y^obs | S^obs = 1, Z = z) for z = 1, 2, which, in analogy to (2.2), is a net-treatment effect of the new treatment adjusted for assignment. The comparison between these estimands for z = 1, 2, also known as an as-treated estimand, is not a causal effect without the exchangeability of prognosis for subjects who take and those who do not take new treatment within assignment arm. Quite generally, however, practitioners and regulatory agencies (e.g., the U.S. Food and Drug Administration) do not trust such exchangeability assumptions for uncontrolled compliance (e.g., Coronary Drug Project Research Group, 1980; Zelen, 1990). Other estimands to address the actual taking of treatment are defined by comparing subjects’ outcomes that, for a fixed level of the controllable assignment z, would have been observed under two scenarios: first, if all subjects (including noncompliers) would have somehow been forced to take the new treatment; second, if the same subjects would have somehow been forced to take the standard treatment. These estimands, therefore, involve outcome values that are a priori counterfactual (see also Frangakis, Rubin, and Zhou, 2002, rejoinder), i.e., they do not exist as functions of the controllable factor (z) alone and therefore their meaning as causal effects is not well defined.
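For the special case of all-or-none compliance with monotonicity (no units who would take the treatment only if assigned control) and exclusion restrictions for noncompliers, assumptions used by Angrist et al. (1996) but not required by the general framework of Section 3, the CACE can be estimated by the familiar instrumental-variables moment ratio. The sketch below is our own illustration, with a hypothetical 0/1 coding of assignment and receipt.

```python
# Moment (instrumental-variables) estimate of CACE under monotonicity and exclusion
# restrictions (Angrist et al., 1996); hypothetical 0/1 coding, our own illustration.
import numpy as np

def cace_iv_estimate(z, d, y):
    """z: 0/1 randomized assignment; d: 0/1 treatment actually taken; y: outcome."""
    z, d, y = (np.asarray(a, dtype=float) for a in (z, d, y))
    itt_y = y[z == 1].mean() - y[z == 0].mean()   # intention-to-treat effect on outcome
    itt_d = d[z == 1].mean() - d[z == 0].mean()   # estimated proportion of compliers
    return itt_y / itt_d                          # effect of assignment among compliers
```

Under those exclusion restrictions, the effect of assignment among compliers coincides with the effect of taking the treatment among compliers, which is the estimand discussed above.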

A considerable literature on methods to better address noncompliance has grown following, or independently of, Imbens and Rubin (1994, 1997) (e.g., Baker and Lindeman, 1994; Robins and Greenland, 1994; Angrist et al., 1996; Goetghebeur and Molenberghs, 1996; Robins, 1998; Rubin, 1998). On the other hand, we are aware of no previous work that has linked such recent approaches for noncompliance to the more general class of problems with posttreatment variables.

An important such problem was recently reported by Barnard et al. (2001) in a large experiment to evaluate school-choice programs, where (a) the randomized intervention was the offering of school vouchers to children of low-income parents and (b) uncontrolled posttreatment variables were both the actual use of vouchers and the subsequent taking of tests to measure achievement. For such cases, Frangakis and Rubin (1997, 1999) had shown that, in order to estimate even the intention-to-treat effect of randomized treatment on achievement ability (i.e., an effect that ignores compliance), (i) it is not appropriate to use standard intention-to-treat analyses (i.e., analyses that ignore compliance data) and (ii) the principal strata defined by both compliance and missingness of outcome must be used. Barnard et al. (2001) took into account these principal strata and thereby proposed a more appropriate method of estimation of intention-to-treat effects as well as of other effects.

Another important such problem is discussed by Rubin (1998, Section 6; 2000), i.e., censoring by death: subjects are assigned to treatments, the intended outcome is quality of life at one year after assignment, and the posttreatment variable indicates death before the first year. Quality of life is “missing” for such cases not because a nonnull value exists and is unobserved, as often treated by standard approaches, but simply because a nonnull value does not exist. Formulating causal effects of treatment on quality of life is subtle, first because such comparisons are restricted by the life of subjects and second because life, as a posttreatment variable, can be affected by treatment. The outline described in Rubin (2000) to address this problem is another special case of principal stratification.

Other types of posttreatment censoring can also be addressed using principal stratification and effects. For example, Frangakis and Rubin (2001) use a related formulation to address design and estimation of survival curves using double sampling in the combined presence of administrative censoring and loss to follow-up (see also Baker, Wax, and Patterson, 1993).

5. Defining Surrogate Endpoints Using Principal Causal Effects

5.1 The Two Goals of Surrogate Endpoints and Previous Approaches Revisited

Often in therapeutic trials, comparison of treatments for the outcome of primary importance, e.g., survival time, may require a long and practically infeasible follow-up. Nevertheless, if there exist variables measurable early in the follow-up that are known to be linked to the effect of the treatments on survival, then such variables can arguably help understand the effect of treatments on the outcome. There is currently a growing literature on such “surrogate” or “biomarker” endpoint variables (e.g., Prentice, 1989; Freedman et al., 1992; Lin et al., 1997; Buyse et al., 2000). The most fundamental question is the definition of a surrogate endpoint so that it has an appropriate interpretation and can be used reliably for prediction.

To help fix ideas, consider a study where the treatments are standard (z = 1) and new (z = 2) therapy for AIDS patients. If patient i is assigned treatment z, let Yi(z) denote the outcome of survival time (the primary endpoint) and let Si(z) denote the measurement of CD4 count (H = high, L = low) at two months after treatment assignment. Also, to better present our arguments in a simple setting, we assume that no patient dies before two months so that Si^obs = Si(Zi) is measured for all subjects, that treatment assignments {Zi} are completely randomized, and that Yi^obs is measured for all subjects, thereby creating what we call a validation study.

In order to have an appropriate interpretation as a surrogate, the posttreatment variable S should possess two properties.

Property 3 (Causal Necessity): S is necessary for the effect of treatment on the outcome Y in the sense that an effect of treatment on Y can occur only if an effect of treatment on S has occurred.

Property 4 (Statistical Generalizability): S^obs should predict Y^obs well in an application study, in which we do not wait for measurements of Y^obs.

The property of causal necessity is important because it tells us if the treatment can act on the outcome without acting on the surrogate. This information is central for improving the focus of therapy or drug development. The property of generalizability is important when it is not feasible to wait for the primary outcome in the application study.

In an early effort to satisfy these properties, Prentice (1989) defined S^obs to be a surrogate if it satisfies certain criteria, mainly that the observed outcome Yi^obs (= Yi(Zi)) should be conditionally independent of the assigned treatment Zi given the observed value Si^obs of the posttreatment variable in the validation study. (Prentice (1989) used a hazard regression parameterization for multiple-time measurements on S^obs. For clarity, we discuss the single-time measurement case—the generalization is simple but notationally tedious.) When exact independence is not expected, related approaches have been proposed that compare results of the regression of the outcome on treatment before and after conditioning on the variable S^obs, as with comparison of parameter coefficients (Freedman et al., 1992; Lin et al., 1997) and more recently with comparison of coefficients of determination (e.g., Buyse and Molenberghs, 1998; Buyse et al., 2000; Gail et al., 2000).

More generally, all these approaches are based on net-treatment comparisons (Section 2), where S^obs is considered a surrogate if treatment Z is no longer a good predictor (relative to S^obs) of outcome Y^obs when conditioning on both S^obs and Z in the validation study. Thus, with respect to the way of adjusting for S^obs, we can collectively regard all such current definitions as variants generated from Prentice’s main criterion for defining a statistical surrogate as follows.

Definition (Statistical Surrogate in a Randomized Experiment): S is a statistical surrogate for a comparison of the effect of z = 1 versus z = 2 on Y if, for all fixed s, that comparison of the distributions in (2.2) results in equality.

It may appear that this definition of surrogate based on net-treatment comparisons is sufficient for both properties, causal necessity and generalizability. In contrast with standard beliefs (e.g., Prentice, 1989; individual-level surrogacy of Buyse et al., 2000), however, none of the approaches based on the definition of statistical surrogacy satisfies Property 3 of causal necessity. As we will show in the next section, in a study where the posttreatment S is a statistical surrogate, there will generally exist units for which treatment has no causal effect on the statistical surrogate but nevertheless has a causal effect on the outcome. Conversely, in a study where there is no causal effect of treatment on outcome unless it occurs together with a causal effect of treatment on the surrogate, S will generally not be a statistical surrogate.

We offer a new criterion for surrogacy using principal stratification and principal causal effects. We show that the new criterion does satisfy Property 3 of causal necessity. Moreover, in Section 5.3, we also discuss the role of principal stratification for better satisfying Property 4, statistical generalizability of the surrogate.

5.2 Definition of Principal Surrogate and Property of Causal Necessity

Principal surrogate

Consider the basic principal strata of the simple study example of Section 5.1:

  • subjects whose CD4 count would be low and unaffected by treatment {i : Si(1) = Si(2) = L}, whom we label for simplicity as sicker patients;

  • subjects whose CD4 count would be high and unaffected by treatment {i : Si(1) = Si(2) = H} and whom we label as healthier;

  • subjects whose CD4 count under new treatment would be higher than under standard treatment {i : Si(1) = L and Si(2) = H} and whom we label as normal;

  • subjects whose CD4 count under new treatment would be lower than under standard treatment {i : Si(1) = H and Si(2) = L} and whom we label as special.

We propose the following definition of a surrogate based on principal stratification.

Definition: S is a principal surrogate for a comparison of the effect of z = 1 versus z = 2 on Y if, for all fixed s, that comparison between the ordered sets

\[
\{Y_i(1) : S_i(1) = S_i(2) = s\}
\]

and

\[
\{Y_i(2) : S_i(1) = S_i(2) = s\} \qquad (5.1)
\]

results in equality.

That is, causal effects of treatment on outcome Y may only exist when causal effects of treatment on the posttreatment variable S exist. Thus, our criterion based on principal stratification immediately satisfies Property 3 of causal necessity of the previous section.

Although definition (5.1) does not involve an assumption about the assignment model for Zi, under randomization, criterion (5.1) implies that the same comparison applied to

\[
\mathrm{pr}\{Y_i^{\mathrm{obs}} \mid S_i(1) = S_i(2) = s,\; Z_i = 1\}
\]

and

\[
\mathrm{pr}\{Y_i^{\mathrm{obs}} \mid S_i(1) = S_i(2) = s,\; Z_i = 2\} \qquad (5.2)
\]

also results in equality. The following result then asserts that Property 3 is not shared by a statistical surrogate.

Result 1: In a randomized experiment and with respect to any comparison, we have that:

  1. If the posttreatment variable S is a principal surrogate, then it is not generally a statistical surrogate.

  2. If the posttreatment variable S is a statistical surrogate, then it is not generally a principal surrogate.

To better understand the implications of Result 1, we offer a proof by discussing the two examples of Figure 1 for the comparison of averages (to show the result, in the figures, we need only consider scenarios with no special subjects). First consider Figure 1a. The subgroups of patients who experience no causal effect of treatment on the CD4 counts (sicker and healthier) experience no causal effect of treatment on survival. Therefore, by criterion (5.2), CD4 count is a principal surrogate in this study.

Figure 1. Distinction between statistical and principal surrogates. Dashed boxes represent missing information, solid boxes represent observed information.

However, when s = L, the subgroup {i : Si^obs = L, Zi = 1} of subjects in the upper side of (2.2) is the mixture of sicker and normal patients under standard treatment, whereas the subgroup {i : Si^obs = L, Zi = 2} is, in fact, a different group of subjects—the sicker patients only—under new treatment. Using the numbers of Figure 1a, the upper side of (2.2) has mean 20 months, whereas the lower side of (2.2) has mean 10 months. It follows that CD4 is not a statistical surrogate. Therefore, although the standard interpretation would be that the new treatment decreases survival whenever it cannot change a low value of the surrogate, that conclusion is incorrect, as the principal surrogacy of S clearly indicates.

Consider now Figure 1b. For the sicker patients, the new treatment has no causal effect on their CD4 count but does have a 10-month causal effect on increasing survival (comparing sicker patients’ survival under new versus standard treatment). Similarly, a 10-month increase in survival holds for the healthier patients in the study. Therefore, CD4 count is not a principal surrogate, that is, there can be an effect of treatment on survival when there is no effect of treatment on the surrogate. Using the criterion of statistical surrogacy, however, for s = L, we obtain that both the upper and lower sides of (2.2) have mean 20 months. Similarly, for s = H, both the upper and lower sides of (2.2) have mean 50 months, so CD4 is, by definition, a statistical surrogate for the average comparison. Therefore, although, here, the standard interpretation would be that treatment does not change survival without changing the surrogate, this conclusion is incorrect. The discrepancy indicated in Result 1 occurs more generally because a statistical surrogate does not generally involve causal effects.
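The following sketch reproduces the logic of Result 1 numerically, using hypothetical stratum proportions and mean survival times chosen only to match the patterns described in the text for Figures 1a and 1b; they are not the figures' exact entries, and the function names are ours.

```python
# Numerical sketch of Result 1 for the mean comparison. Strata: "sicker" = (L,L),
# "normal" = (L,H), "healthier" = (H,H); no "special" (H,L) subjects.

S1 = {"sicker": "L", "normal": "L", "healthier": "H"}   # S under standard treatment
S2 = {"sicker": "L", "normal": "H", "healthier": "H"}   # S under new treatment

def net_comparison(pi, m1, m2, s):
    """Difference (new minus standard) of the two sides of (2.2) at S^obs = s;
    pi: stratum proportions, m1/m2: mean Y(1)/Y(2) by stratum."""
    def arm_mean(m, S):
        groups = [g for g in pi if S[g] == s]
        return sum(pi[g] * m[g] for g in groups) / sum(pi[g] for g in groups)
    return arm_mean(m2, S2) - arm_mean(m1, S1)

def effects_where_S_unaffected(m1, m2):
    """Principal effects (new minus standard) in the strata with S(1) = S(2),
    i.e., the comparisons equated by the principal-surrogate criterion (5.1)."""
    return {g: m2[g] - m1[g] for g in ("sicker", "healthier")}

pi = {"sicker": 1/3, "normal": 1/3, "healthier": 1/3}

# Pattern of Figure 1a: a principal surrogate that is not a statistical surrogate.
m1a = {"sicker": 10, "normal": 30, "healthier": 50}
m2a = {"sicker": 10, "normal": 40, "healthier": 50}
print(effects_where_S_unaffected(m1a, m2a))       # {'sicker': 0, 'healthier': 0}
print(net_comparison(pi, m1a, m2a, "L"))          # -10.0: the sides of (2.2) differ

# Pattern of Figure 1b: a statistical surrogate that is not a principal surrogate.
m1b = {"sicker": 10, "normal": 30, "healthier": 50}
m2b = {"sicker": 20, "normal": 40, "healthier": 60}
print(effects_where_S_unaffected(m1b, m2b))       # {'sicker': 10, 'healthier': 10}
print(net_comparison(pi, m1b, m2b, "L"),
      net_comparison(pi, m1b, m2b, "H"))          # 0.0 0.0: (2.2) balances at each s
```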

Associative and dissociative effects

We also propose, more generally than assessing principal surrogacy, to evaluate the effects of treatment on outcome that are associative and dissociative with effects on the posttreatment variable in the validation study. An effect on outcome that is dissociative with an effect on surrogate is defined as a comparison between the ordered sets

\[
\{Y_i(1) : S_i(1) = S_i(2)\} \quad \text{and} \quad \{Y_i(2) : S_i(1) = S_i(2)\} \qquad (5.3)
\]

that were equated in (5.1). An effect on outcome that is associative with an effect on surrogate is defined as a comparison between the ordered sets

\[
\{Y_i(1) : S_i(1) \neq S_i(2)\} \quad \text{and} \quad \{Y_i(2) : S_i(1) \neq S_i(2)\}. \qquad (5.4)
\]

Both (5.3) and (5.4) can, in principle, be further stratified on basic principal strata.

Both the associative effect (5.4) and the dissociative effect (5.3) are causal effects, by Property 2 of Section 3. If the dissociative effect is large (small), then we are to conclude that there is a large (small) causal effect of treatment on outcome for subjects for whom treatment does not affect CD4. Similarly, if the associative effect is large (small), then we are to conclude that there is a large (small) causal effect of treatment on outcome for subjects for whom treatment does affect CD4. A comparison between (5.4) and (5.3) then measures the degree to which a causal effect of treatment on outcome occurs together with a causal effect of treatment on the surrogate. For example, if this association is high, it can indicate that developing a drug to target biophysiological characteristics of the surrogate may be a good way to target the clinical endpoint Y. It is important to note that causal interpretation of the latter association is not automatic, in contrast with (5.4) and (5.3), and should be examined experimentally in a new (perhaps laboratory) study where an intervention manipulating a factor in addition to z would be applied, e.g., to increase CD4. For that new study, the potential outcomes would be regarded as functions not of the uncontrolled posttreatment CD4 but of the new factorial interventions used to change it.

Finally, we emphasize that the approach we present is applicable to continuous posttreatment variables as well, where analogous comparisons are formulated as the conditional distributions of the causal effect of treatment on outcome given principal strata of the posttreatment variable, which differ from the individual-level comparisons of Buyse et al. (2000, Section 4.2) (the latter still being net-treatment comparisons).

5.3 Principal Stratification and Property of Statistical Generalizability

We now examine the use of principal stratification to predict outcomes in a randomized application study. Here we distinguish the distributions of principal strata and of outcomes given principal strata between a validation and an application study, respectively,

\[
\mathrm{pr}_V\{S(1), S(2)\}, \quad \mathrm{pr}_V\{Y^{\mathrm{obs}} \mid S(1), S(2), Z\}, \qquad (5.5)
\]
\[
\mathrm{pr}_A\{S(1), S(2)\}, \quad \mathrm{pr}_A\{Y^{\mathrm{obs}} \mid S(1), S(2), Z\}, \qquad (5.6)
\]

and assume all distributions are available except pr_A{Y^obs | S(1), S(2), Z}.

Before the outcomes Yi^obs in the application study are known, they could be predicted by their predictive distribution, denoted by pr_A(Y^obs | S^obs, Z). Because the distributions (5.6) determine the distributions of all observable data in that study, we have (under randomization)

\[
\mathrm{pr}_A(Y^{\mathrm{obs}} \mid S^{\mathrm{obs}}, Z)
= \frac{\int \mathrm{pr}_A\{Y^{\mathrm{obs}} \mid S(1), S(2), Z\}\, \mathrm{pr}_A\{S(1), S(2)\}\, dS^{\mathrm{mis}}}
       {\int \mathrm{pr}_A\{S(1), S(2)\}\, dS^{\mathrm{mis}}}. \qquad (5.7)
\]

Without waiting for any outcome Y^obs, however, the correct predictive distribution is not available because pr_A{Y^obs | S(1), S(2), Z} is not available. To address this, the standard approach predicts the outcomes Y^obs in the application study using the predictive distribution from the validation study, pr_V(Y^obs | S^obs, Z), effectively replacing in (5.7) both distributions of (5.6) with those of (5.5). But the application study can differ from the validation study in either the distribution of principal strata or the distribution of potential outcomes given principal strata, in which case the validation predictive distribution will be incorrect for the application study. This may help explain empirical evidence that regressions pr_V(Y^obs | S^obs, Z) fitted in one validation study can be quite different in another study with the same type of treatment, outcome, and surrogate (e.g., Fleming and DeMets, 1996).

Consider, alternatively, replacing only the outcome component in the right side of (5.7) with that of the validation study, to obtain the synthetic predictive distribution defined as

\[
\mathrm{pr}_{\mathrm{SYN}}(Y^{\mathrm{obs}} \mid S^{\mathrm{obs}}, Z)
= \frac{\int \mathrm{pr}_V\{Y^{\mathrm{obs}} \mid S(1), S(2), Z\}\, \mathrm{pr}_A\{S(1), S(2)\}\, dS^{\mathrm{mis}}}
       {\int \mathrm{pr}_A\{S(1), S(2)\}\, dS^{\mathrm{mis}}}. \qquad (5.8)
\]

By any measure, it is more likely that the right side of (5.6) equals the right side of (5.5) than it is that both the right side and the left side of (5.6) equal, respectively, those in (5.5). Therefore, using the synthetic predictive distribution (5.8) should be a more plausible approximation to the correct predictive distribution in the application study than the predictive distribution from the validation study.
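A minimal sketch of (5.8) for a binary posttreatment variable is given below: validation-study outcome probabilities within basic principal strata are combined with application-study stratum proportions, and the sum over strata consistent with the observed S plays the role of the integrals over S^mis. The names pi_A and p_V, and the illustrative numbers, are ours.

```python
# Sketch of the synthetic predictive distribution (5.8) for binary S and binary Y.
# Strata are (S(1), S(2)); pi_A: application-study stratum proportions;
# p_V: validation-study pr(Y = 1 | stratum, z).

def synthetic_predictive(pi_A, p_V, s_obs, z):
    """pr_SYN(Y = 1 | S^obs = s_obs, Z = z): the sum over strata with S(z) = s_obs
    plays the role of the integrals over S^mis in (5.8)."""
    consistent = [g for g in pi_A if g[z - 1] == s_obs]
    denom = sum(pi_A[g] for g in consistent)
    return sum(pi_A[g] * p_V[(g, z)] for g in consistent) / denom

pi_A = {("L", "L"): 0.5, ("L", "H"): 0.3, ("H", "H"): 0.2}
p_V = {(("L", "L"), 1): 0.2, (("L", "L"), 2): 0.2,
       (("L", "H"), 1): 0.4, (("L", "H"), 2): 0.6,
       (("H", "H"), 1): 0.7, (("H", "H"), 2): 0.7}
print(synthetic_predictive(pi_A, p_V, s_obs="H", z=2))   # mixes ("L","H") and ("H","H")
```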

6. Remarks and Extensions

For comparing treatment effects on outcomes adjusting for posttreatment variables, we focused on estimands before estimation by formulating principal causal effects. We compared our estimands with existing estimands and separated this discussion from issues of their estimation, which can only be relevant when the estimands are relevant.

As discussed in Section 3, membership of subjects to the principal strata is not generally fully observed and so estimation must involve techniques for incomplete (missing) data. Moreover, because with no restrictions there is generally a range of parameter values that maximize the likelihood, it is important to couple our framework with plausible additional assumptions specific to each context. Such explicit restrictions (e.g., latent ignorability of outcome missingness or the compound exclusion restriction; Frangakis and Rubin, 1999) can be scientifically more plausible than the implicit assumptions of standard approaches and can also lead to increased precision of estimated principal causal effects. It is therefore a distinct advantage that our framework formalizes why and what types of assumptions are needed and how to incorporate them to make inference in these problems.

Although we concentrated on examples with two treatments and a binary posttreatment variable, the framework is immediately applicable to posttreatment variables that are multivariate (e.g., as in the experiment on school choice, Section 4), time-dependent, or continuous (end of Section 5.2) and to multiple treatments, say z = 1, …, k. In the latter case, the basic principal strata with respect to S are subgroups of subjects with the same vector (Si(1), …, Si(k)). Then, as in (3.1), principal causal effects are comparisons of the potential outcomes within strata that are unions of the basic principal strata.
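For instance, with a binary S and k treatments, the basic principal strata are simply the 2^k possible vectors (Si(1), …, Si(k)); a small illustrative enumeration:

```python
# The basic principal strata for k treatments and a binary S are the 2^k possible
# vectors (S(1), ..., S(k)); this enumeration is illustrative only.
from itertools import product

def basic_principal_strata(k, s_levels=("L", "H")):
    return list(product(s_levels, repeat=k))

print(len(basic_principal_strata(3)))   # 8 strata for k = 3
```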

In summary, continued use of the current frameworks in problems with posttreatment variables (e.g., surrogate endpoints) in principle makes incorrect attributions of effects of treatments. As Buyse (unpublished) noted recently about the comparison of our framework to the existing ones for surrogate endpoints: “Until now, we had always thought that the roles of biology and statistics did not mix in these complex problems. But principal causal effects set the framework for allowing biological assumptions in statistical methods and vice versa.” We hope that this article provokes the development and dissemination of more principled frameworks.

Acknowledgements

We thank the editor, the associate editor, two anonymous reviewers, Stuart Baker, Marc Buyse, Steve Goodman, Jennifer Hill, Sue Marcus, Susan Murphy, Dan Scharfstein, and Scott Zeger for constructive comments and the H.-C. Yang Memorial Fund, the U.S. National Institute of Child Health and Human Development (R01 HD38209), and the National Science Foundation for partial support. Link to related work and software: http://biosun01.biostat.jhsph.edu/~cfrangak/papers.

Contributor Information

Constantine E. Frangakis, Department of Biostatistics, Johns Hopkins University, 615 N. Wolfe Street, Baltimore, Maryland 21205, U.S.A. cfrangak@jhsph.edu

Donald B. Rubin, Department of Statistics, Harvard University, 1 Oxford Street, Cambridge, Massachusetts 02138, U.S.A. rubin@stat.harvard.edu

References

  1. Angrist J, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables (with discussion) Journal of the American Statistical Association. 1996;91:444–472. [Google Scholar]
  2. Baker SG, Lindeman KS. The paired availability design: A proposal for evaluating epidural analgesia during labor. Statistics in Medicine. 1994;13:2269–2278. doi: 10.1002/sim.4780132108. [DOI] [PubMed] [Google Scholar]
  3. Baker SG, Wax Y, Patterson BH. Regression analysis of grouped survival data: Informative censoring and double sampling. Biometrics. 1993;49:379–389. [PubMed] [Google Scholar]
  4. Balke A, Pearl J. Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association. 1997;92:1171–1176. [Google Scholar]
  5. Barnard J, Frangakis CE, Hill JL, Rubin DB, et al. School choice in NY City: A Bayesian analysis of an imperfect randomized experiment. In: Gatsonis C, editor. Case Studies in Bayesian Statistics (with discussion) New York: Springer-Verlag; 2001. in press. [Google Scholar]
  6. Buyse M, Molenberghs G. The validation of surrogate endpoints in randomized experiments. Biometrics. 1998;54:1014–1029. [PubMed] [Google Scholar]
  7. Buyse M, Molenberghs G, Burzykowski T, Renard D, Geys H. The validation of surrogate endpoints in meta-analyses of randomized experiments. Biostatistics. 2000;1:49–68. doi: 10.1093/biostatistics/1.1.49. [DOI] [PubMed] [Google Scholar]
  8. Cochran WG. Analysis of covariance: Its nature and uses. Biometrics. 1957;13:261–281. [Google Scholar]
  9. Coronary Drug Project Research Group. Influence of adherence to treatment and response of cholesterol on mortality in the coronary drug project. New England Journal of Medicine. 1980;303:1038–1041. doi: 10.1056/NEJM198010303031804. [DOI] [PubMed] [Google Scholar]
  10. Dawid AP. Causal inference without counterfactuals (with discussion) Journal of the American Statistical Association. 2000;95:407–448. [Google Scholar]
  11. Fleming TR, DeMets DL. Surrogate end points in clinical trials: Are we being misled? Annals of Internal Medicine. 1996;125:605–613. doi: 10.7326/0003-4819-125-7-199610010-00011. [DOI] [PubMed] [Google Scholar]
  12. Frangakis CE, Rubin DB. A new approach to the idiosyncratic problem of drug-noncompliance with subsequent loss to follow-up. American Statistical Association, Proceedings of the Biopharmacy Section. 1997:206–211. [Google Scholar]
  13. Frangakis CE, Rubin DB. Addressing complications of intention-to-treat analysis in the combined presence of all-or-none treatment-noncompliance and subsequent missing outcomes. Biometrika. 1999;86:365–379. [Google Scholar]
  14. Frangakis CE, Rubin DB. Addressing an idiosyncrasy in estimating survival curves using double-sampling in the presence of self-selected right censoring (with discussion) Biometrics. 2001;57:333–353. doi: 10.1111/j.0006-341x.2001.00333.x. [DOI] [PubMed] [Google Scholar]
  15. Frangakis CE, Rubin DB, Zhou XH. Clustered encouragement design with individual non-compliance: Bayesian Inference and application to Advance Directive forms (with discussion) Biostatistics. 2002 doi: 10.1093/biostatistics/3.2.147. in press. [DOI] [PubMed] [Google Scholar]
  16. Freedman LS, Graubard BI, Schatzkin A. Statistical validation of intermediate endpoints for chronic diseases. Statistics in Medicine. 1992;11:167–178. doi: 10.1002/sim.4780110204. [DOI] [PubMed] [Google Scholar]
  17. Gail M, Pfeiffer R, Houwelingen H, Carroll RJ. On meta-analytic assessment of surrogate outcomes. Biostatistics. 2000;1:231–246. doi: 10.1093/biostatistics/1.3.231. [DOI] [PubMed] [Google Scholar]
  18. Goetghebeur E, Molenberghs G. Causal inference in a placebo-controlled clinical trial with binary outcome and ordered compliance. Journal of the American Statistical Association. 1996;91:928–934. [Google Scholar]
  19. Goetghebeur E, Kenward M, Molenberghs G, Vansteelandt S. Proceedings of the Annual Meeting. Indianapolis, Indiana: American Statistical Association; 2000. Inferential tools for sensitivity analysis and noncompliance in clinical trials. [Google Scholar]
  20. Goldberger AS. Structural equation methods in the social sciences. Econometrica. 1972;40:979–1001. [Google Scholar]
  21. Heckman JJ. Dummy endogenous variables in a simultaneous equation system. Econometrica. 1978;46:931–959. [Google Scholar]
  22. Hirano K, Imbens G, Rubin DB, Zhou X-H. Estimating the effect of an influenza vaccine in an encouragement design. Biostatistics. 2000;1:69–88. doi: 10.1093/biostatistics/1.1.69. [DOI] [PubMed] [Google Scholar]
  23. Holland P. Statistics and causal inference. Journal of the American Statistical Association. 1986;81:945–970. [Google Scholar]
  24. Imbens GW, Rubin DB. Causal inference with instrumental variables. Cambridge, Massachusetts: Harvard Institute of Economic Research; 1994. Discussion Paper 1676. [Google Scholar]
  25. Imbens GW, Rubin DB. Bayesian inference for causal effects in randomized experiments with noncompliance. Annals of Statistics. 1997;25:305–327. [Google Scholar]
  26. Lin DY, Fleming TR, De Gruttola V. Estimating the proportion of treatment effect explained by a surrogate marker. Statistics in Medicine. 1997;16:1515–1527. doi: 10.1002/(sici)1097-0258(19970715)16:13<1515::aid-sim572>3.0.co;2-1. [DOI] [PubMed] [Google Scholar]
  27. Manski CF. Non-parametric bounds on treatment effects. American Economic Review, Papers and Proceedings. 1990;80:319–323. [Google Scholar]
  28. Neyman J. On the application of probability theory to agricultural experiments: Essay on principles, Section 9. Translated in Statistical Science. 1923;5:465–480. (1990). [Google Scholar]
  29. Prentice RL. Surrogate endpoints in clinical trials: Definition and operational criteria. Statistics in Medicine. 1989;8:431–440. doi: 10.1002/sim.4780080407. [DOI] [PubMed] [Google Scholar]
  30. Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods—Application to control of the healthy worker survivor effect. Mathematical Modelling. 1986;7:1393–1512. [Google Scholar]
  31. Robins JM. Correction for non-compliance in equivalence trials. Statistics in Medicine. 1998;17:269–302. doi: 10.1002/(sici)1097-0258(19980215)17:3<269::aid-sim763>3.0.co;2-j. [DOI] [PubMed] [Google Scholar]
  32. Robins JM, Greenland S. Identifiability and exchangeability of direct and indirect effects. Epidemiology. 1992;3:143–155. doi: 10.1097/00001648-199203000-00013. [DOI] [PubMed] [Google Scholar]
  33. Robins JM, Greenland S. Adjusting for differential rates of prophylaxis therapy for PCP in high- versus low-dose AZT treatment arms in an AIDS randomized trial. Journal of the American Statistical Association. 1994;89:737–749. [Google Scholar]
  34. Rosenbaum PR. The consequences of adjustment for a concomitant variable that has been affected by the treatment. The Journal of the Royal Statistical Society, Series A. 1984;147:656–666. [Google Scholar]
  35. Rosenbaum P, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983a;70:41–55. [Google Scholar]
  36. Rosenbaum PR, Rubin DB. Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society, Series B. 1983b;45:212–218. [Google Scholar]
  37. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology. 1974;66:688–701. [Google Scholar]
  38. Rubin DB. Assignment to a treatment group on the basis of a covariate. Journal of Educational Statistics. 1977;2:1–26. [Google Scholar]
  39. Rubin DB. Bayesian inference for causal effects. Annals of Statistics. 1978;6:34–58. [Google Scholar]
  40. Rubin DB. More powerful randomization-based p-values in double-blind trials with noncompliance (with discussion) Statistics in Medicine. 1998;17:371–389. doi: 10.1002/(sici)1097-0258(19980215)17:3<371::aid-sim768>3.0.co;2-o. [DOI] [PubMed] [Google Scholar]
  41. Rubin DB. Comment on “Causal inference without counterfactuals”. In: Dawid AP, editor. Journal of the American Statistical Association. Vol. 95. 2000. pp. 435–437. [Google Scholar]
  42. Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion) Journal of the American Statistical Association. 1999;94:1096–1146. [Google Scholar]
  43. Sommer A, Zeger S. On estimating efficacy from clinical trials. Statistics in Medicine. 1991;10:45–52. doi: 10.1002/sim.4780100110. [DOI] [PubMed] [Google Scholar]
  44. Zelen M. Discussion of presidential address “Biostatistical collaboration in medical research”. In: Ellenberg JH, editor. Biometrics. Vol. 46. 1990. pp. 28–29. [PubMed] [Google Scholar]
