The Permanente Journal. 2012 Fall;16(4):e100–e120. doi: 10.7812/tpp/12-046

Analysis of Nonintervention Studies: Technical Supplement

Mikel Aickin
PMCID: PMC3523943  PMID: 23251125

Abstract

Methods for analyzing data in nonintervention clinical studies are substantially different from those that are appropriate for randomized clinical trials. Although the latter methods are well known, the former are not. A systematic approach for dealing with statistical confounding in nonintervention research has been developed over the past 30 to 40 years, and the essence of this theory constitutes the contents of this article. An accompanying, less technical article explains the implications of these results for clinical research.

Introduction

This supplement provides the technical probability arguments that justify statements in the print exposition.1 It presents the details explaining why certain analytical steps are necessary with nonintervention data and indicates how they can be carried out. Understanding these fundamental arguments is important if nonintervention studies are to have their proper role in medical research.2 The assumption here is that the reader is reasonably familiar with the mathematical manipulations of probability theory and statistics. In addition, some acquaintance with the work of either Donald Rubin3,4 or James Heckman4,5 would be helpful, but not necessary since the most relevant topics will be covered here.

Overview

Most of the challenging parts of this article are in the underlying concepts of probability distributions, independence, and conditional independence. The actual derivations of results that have implications for analysis are generally straightforward. For this purpose the notation for probability manipulations is set out in detail and rigidly followed throughout.

Probability can be used to describe associations, but it is not a strong enough language to incorporate causal ideas. Because one wants to discover the effects of treatments in a causal sense, it is appropriate to base methods of inference on a concept of causation. Since this path threatens to become both laborious and contentious, the position taken here is that the idea of alternative realities (also called counterfactuals) is an adequate basis.

The central problem of nonintervention studies is the uncertainty about how a given treatment was selected for a given patient. The concern is that the process of treatment choice may create an artifactual association between treatment and outcome, or that it may hide a true causal association. Thus, the analysis of data from a nonintervention study should focus on the mechanism by which treatments are selected. Within the alternative realities framework it is possible to formulate a condition under which the treatment selection mechanism can be safely ignored, for the purposes of estimating causal effects.

Most versions of causation identify common causes (of both treatment selection and outcome) as the only source of artifactual associations. This leads to the idea that if one could somehow hold all of the common causes fixed, then any remaining relationship between treatment and outcome would be causal. In probability theory, “holding fixed” is accomplished by conditioning. In practice, this means forming groups of patients who are identical (or more realistically, similar) on all the common causes, and within such a patient group estimating the treatment-outcome causal effect. This is a rather old method in statistical theory, and it will be advocated here.

It is not difficult to imagine that in medical situations the list of common causes may be rather long, and that the idea of matching patients on a long list might present some difficulties. Very surprisingly, it turns out that in the two-treatment situation one can compute a single number, the propensity score, such that forming patient groups homogeneous on the propensity score removes the ill effects of all of the common causes that go into the computation of the score, within each patient group.

All of the analyses that emerge from this development take into account the matching that forms patient subgroups. In a sense, this amounts to remaining agnostic about the specifics of how the common causes bring about treatment choices. A selection model makes this connection specific and then leads to specific analyses that do not depend on matching or subgrouping.

The steps for analysis in nonintervention studies that follow from this development are stated in the following guidelines.

  1. Try to determine the common causes of treatment and outcome.

  2. Form patient subgroups homogeneous on the common causes, and analyze treatment-outcome relationships within these subgroups, summarizing across subgroups if possible.

  3. If there are too many common causes for simple matching, then compute propensity scores from the common causes, form patient subgroups homogeneous on the propensity score, and proceed as in 2.

  4. As an alternative or additional analysis, assume a selection model and employ the samplewide inferential procedure that follows from it, usually assuming homogeneous treatment effects across patients.

Notation

Before delving into the theory, it is worthwhile to establish a small amount of notation. I will use y for an outcome variable, meaning some measure of benefit or harm that can be made on a patient, in some well-defined way that is specified by a research protocol. For the most part the theory can be developed in reference to a single patient and then extended to a population of patients. In other cases it is necessary to explicitly bring in collections of patients, and then yi is taken to be the outcome for the i-th patient (where i is simply a label to distinguish different patients). Note that y could be a list of outcome variables rather than a single measured variable. I will use t for a binary treatment indicator, which means that for an untreated or control patient we have t = 0, and for a treated patient, t = 1. The association of 0 with control and 1 with active treatment is merely a convention for making it easier to talk about the theory. The 0 treatment could itself be an alternative active form of treatment. The point is that for most of what will be presented here, there are only two treatment options. As with y, ti would denote the treatment received by patient i, and in greater generality t could stand for a list of treatments. Other letters will be used for variables that play various other roles in the theoretical development.

For any two variables u and v, we say that u and v are independent and write u ╧ v to mean that pr(u,v) = pr(u)pr(v). The “pr” notation is a general template for talking about probabilities. The reason is that there are at least three different levels of generality with which one can understand probability statements, and the different notations that are used in these variants are somewhat inconsistent with each other. The template notation is general enough that it can be translated into any of the three viewpoints, depending on the desires of the reader. Thus pr(u,v) = pr(u)pr(v) signifies that the probability of event u happening simultaneously with event v is the product of their individual probabilities. Because the conditional probability of u given v is defined by pr(u|v) = pr(u,v)/pr(v), u ╧ v is the same as pr(u|v) = pr(u). In general, E[...] is used for the mean or expectation of the variable replacing the dots. We say that u and v are uncorrelated and write u ⊥ v to mean that E[uv] = E[u]E[v]. This is equivalent to saying that the covariance of u and v is 0: E[uv] – E[u]E[v] = 0. In general, u ╧ v implies u ⊥ v, but the reversed implication only holds in special cases. If u and v are binary variables (that is, they assume only the values 0 or 1), then u ⊥ v means the same as u ╧ v.

The conditional expectation of u given v is E[u|v]. It has two defining properties. First, it is a function of v. Second, the residual u – E[u|v] is uncorrelated with every function of v (which is denoted (u – E[u|v]) ⊥ fctn(v)). The linear regression operator is L[u|v]. Its defining properties are first that it is a linear function of v, and second, the residual u – L[u|v] is uncorrelated with v. If u is a binary variable and v is any variable, then u ⊥ fctn(v) is equivalent to u ╧ v. Variables u and v are conditionally independent given variable w if pr(u,v|w) = pr(u|w)pr(v|w). This is written as u ╧ v | w, and it is the same as pr(u|v,w) = pr(u|w). As this indicates, all conditions expressed in terms of probability distributions can be expressed for conditional distributions. Thus E[uv|w] = E[u|w]E[v|w] means that u and v are conditionally uncorrelated given w, written u ⊥ v | w.
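
As a concrete (if artificial) illustration of the difference between u ⊥ v and u ╧ v, and of the residual property of E[u|v], here is a small numerical sketch in Python; the simulated variables and the use of numpy are illustrative choices, not part of the formal development.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# u and v are uncorrelated but not independent:
# with v standard normal and u = v**2 - 1, E[uv] = E[v**3] - E[v] = 0,
# yet u is a deterministic function of v.
v = rng.standard_normal(n)
u = v**2 - 1
print("cov(u, v)   ~ 0 :", np.cov(u, v)[0, 1])            # near zero
print("cov(u, v^2) != 0:", np.cov(u, v**2)[0, 1])          # clearly nonzero

# Residual property of the conditional expectation: with independent noise e,
# u2 = v**2 - 1 + e has E[u2|v] = v**2 - 1, and the residual u2 - E[u2|v]
# is uncorrelated with every function of v.
e = rng.standard_normal(n)
u2 = v**2 - 1 + e
resid = u2 - (v**2 - 1)
print("cov(resid, v)   :", np.cov(resid, v)[0, 1])         # near zero
print("cov(resid, v**3):", np.cov(resid, v**3)[0, 1])      # near zero
```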

It is perhaps worth mentioning that theoretical results are often stated and proved under the exacting conditions given above. In practice, however, one may often take a more lenient view, by saying (for example) that u is nearly independent of v or nearly uncorrelated with v given w, and similarly for other kinds of statements. This reflects an underlying belief that when an exact conclusion is deduced from an exact assumption about associations, then the conclusion remains approximately true when the assumptions hold approximately.

Causation

Although the language of probability is widely used in discussing issues of analysis in biomedical research, in a certain way it falls short of what is needed. In essence, probability distributions only describe variability of measurements over a population of patients, or associations among measurements, again observed as we look over a patient population. There is nothing in probability theory that explains why or how these associations arise, and this is the source of the deficiency. Specifically, we want an association between treatment t and outcome y to be causal, in the sense that it represents a type of relationship that can be ported to other circumstances. Again, there is nothing in probability theory that distinguishes a causal from a casual relationship.

An appealing and intuitive way to talk about causal relationships is based on the method of alternative realities. This applies to the treatment-outcome paradigm by declaring that there is a functional relationship of the form:

y = η(t, u)

Here y and t continue to stand for the outcome and treatment for a given patient. The variable u actually stands for all other variables that are necessary to determine y. There is a further implication that if one were to replace u with some simplification of itself, the above equation would not continue to hold. In other words, u is a minimal representation of the causes of y. If one leaves u out entirely, then the model says that t is the only cause of y. Since this is usually quite unrealistic, in general there will be a u variable. Sometimes, for notational convenience, u is left out even though it is implied. This is confusing, but standard.

It is very important to understand what the equation of the alternative realities model says. It does not simply state a relationship among observed quantities. That is, if the model were true, then as we went through the patient population recording the values of y, t, and u, we would always find that their relationship was given by the function η. But this is not the whole of what is being asserted. The additional piece is that if the universe had turned out in some other way such that the patients had different values for y, t, and u than they actually had, we would still find the relationship specified by η to hold. In the statistical and causation literature the alternative realities model is called the “counterfactual” model. In my view this is a problematic term. It is designed to capture the fact that the model specifies what would happen under conditions that did not actually happen. In this sense, counterfactual means counter to what actually happened. But the term is easily misinterpreted to mean counter to some fact—that is, false—which makes it sound like an absurd theory. In fact, the alternative realities theory asserts the relationship y = η(t,u) as the important causal “fact,” making “counterfactual” sound like “self-contradictory.”

Because the notion of alternative realities only becomes intuitive after one has some experience with it, there is value in looking at the simplest possible case, where y is binary. We can then write out all possible functions η in Table 1:

Table 1.

Possible η functions

Type                   η(0)   η(1)
Impervious Negative     0      0
Impervious Positive     1      1
Susceptible Negative    1      0
Susceptible Positive    0      1

For the sake of exposition I have oversimplified here to a case in which t is the sole cause of y. Each patient has a hypothetical response profile consisting of the pair η(0), the response to control, and η(1), the response to treatment. A response of 0 is taken to be a failure, and 1 is taken to be a success. As shown in Table 1, there are only four possibilities. This means that the patient population is divided into four subpopulations whose types are defined in the leftmost column of the table. The Impervious Negative patients are those who will fail regardless of how they are treated, while the Impervious Positive patients will succeed independently of treatment. From the standpoint of treatment effectiveness, one would like the Susceptible Positive subpopulation to be as large as possible, since these are the patients who would be helped by treatment. Correspondingly, we would like the Susceptible Negative group to be as small as possible, since these patients will be harmed by treatment.

As this example indicates, the alternative realities model does not impose any constraints on our observations. That is, it is not a testable theory in the sense that some conceivable experiment might prove it wrong. Instead it is a theoretical framework for thinking and talking about what we mean by “causal relationship.” It is designed to clarify our thinking about the real world, not to explain how the real world works in some cosmic sense.

There is an additional, very important part of the alternative realities model. It is that the model only holds for certain ways of determining the value of t. This issue is invisible in probability theory, where we would write pr(y|t) for the probability distribution of y given t (or maybe E[y|t] or L[y|t] if we were concerned with means). The expression pr(y|t) says nothing about how t was determined, because it is derived from pr(y,t) and pr(t), both of which are equally silent about the ontogeny of t. To incorporate this notion into the notation, I would write:

pr(y | t : σ)

for the conditional distribution of y given t, where t was determined by a mechanism σ. I would want to make statements about this conditional probability for a collection of mechanisms S. For example, if I wanted to say that y was independent of t for all mechanisms σ in S (which is written σ ∈ S), then I would write:

y ╧ t : σ, for all σ ∈ S

Note that this does not say what would happen if t were to be determined by some mechanism not in S. The idea that causal laws can only be stated along with the conditions under which they obtain is due to the philosopher John Mackie.6,7 This was a substantial step forward in causal thinking, since most writers before Mackie took the implicit position that the causation they were talking about was universal. For the alternative realities version of causation, we should write:

y = η(t, u) : σ ∈ S

to explicitly include the treatment-determining mechanism (or mechanisms) for which causality is being asserted.

An example of a mechanism in medicine is the practitioner who provides a particular therapy, or the clinic within which it is provided. This is important because different causal laws can be in place in these different settings. Another example is a therapy provided with the intent of producing a benefit, as is true in natural clinical settings, in contrast to a therapy given to discover its effect, as is true in clinical trials. In fact, the central dogma of randomized clinical trials is that results obtained under the second mechanism can be ported to the first mechanism, which is an assertion of causality. Many behavioral interventions have several different mechanisms. If the treatment is weight loss, then it can be produced by an unexplained loss of appetite, conscious calorie restriction according to a specific regimen, bariatric surgery, or an exercise program. Relating weight loss to a medical outcome in a causal sense means asserting a relationship in some or all of these different circumstances.

Whatever one thinks of the alternative realities model for causation, it has at least a few virtues. First, it is transparent. Rather than leaving definitions up in the air with vacuous ambiguities, the theory says exactly how it considers causal relationships. Even further, when it says that y is caused by other variables, then they should all be listed explicitly. Second, it makes the important point that universal causal laws may not exist and therefore need not be discussed. Every causal rule, expressed as a hypothetical response profile, is taken to hold only under specified conditions, in particular, under conditions specifying how the treatment was determined. This latter point brings up one of the embarrassments of the alternative realities model, that specifying a “mechanism” for how t is determined sounds very much like a circumlocution for “what causes t,” which introduces a certain amount of circularity into the model. Although philosophically suboptimal, in practice this does not cause any problems, as we will see.

Ignorability

Throughout biomedicine it is taken as given that the way to quantify a treatment effect is by calculation of E[y|t], or perhaps L[y|t]. This is the expected outcome as a function of treatment t, or a linear approximation of it. If this quantity actually depends on t, then one would like to say that this dependency is causal and use the result for making treatment decisions. To be sure, there are variations on this theme. One of the most prominent is the case where y is binary, with failure denoted 0 and success denoted 1. In this special case, E[y|t] = pr(y = 1|t). The analysis that is almost universally used in this case is logistic regression. I will have comments below about whether this is wise. In other cases there are other statistical models, but in general they involve E[y|t] either directly or indirectly.

I will now focus on the binary outcome case, because it simplifies the exposition, and extensions to more complex outcomes will be given toward the end of the discussion. So for the moment, when I write pr(y) I am, for all practical purposes, really only referring to pr(y = 1). (This is because pr(y = 0) = 1 − pr(y = 1), so it is redundant.)

To use pr(y|t) in some causal sense, I need to put it into the context of the alternative realities model. The most obvious way to do this is with an assertion:

pr(y | t : σ) = pr(η(t)), for all σ ∈ S

This means that for the treatment-determining mechanisms in S, the probability distribution of observed outcomes in the two treatment groups is the same as the probability distribution of hypothetical outcomes under the alternative realities model. In the binary case, this is equivalent to:

pr(y = 1 | t : σ) = pr(η(t) = 1)

This means that the probability of an observed success given treatment t is the same as the probability that η(t) denotes a success, for any fixed value of t. Note that the probability distribution of η(t) is derived from the population of patients, not from some chance mechanism operating for each individual patient. The important experimental implication of the equation above is that an estimate of the left side is also an estimate of the right side. Thus, when this condition holds we can say that the fraction of successes in the treated group estimates the fraction of the population for whom η(1) = 1, and likewise the fraction of successes in the control group estimates the fraction of the population for whom η(0) = 1. Clearly, this is a very favorable set of circumstances, which is why it does not happen naturally.

Donald Rubin provided a very simple condition under which it will happen.8 In 1976, he said that the treatment mechanism was ignorable, provided:

[t = τ] ╧ η(τ) : σ, for all τ

He introduced this idea in the context where t indicated whether the value of a variable was observed or not, but the same principle is directly applicable to treatments, as noted in later work with Paul Rosenbaum.9

Ignorability means that the event t = τ is independent of the binary outcome η(τ) for the treatment-determining mechanism σ, and for any possible τ. This is the same as:

pr(η(τ) = 1 | t = τ : σ) = pr(η(τ) = 1 : σ)

The reason this is clever is that it gives us:

pr(y = 1 | t = τ : σ) = pr(η(τ) = 1 | t = τ : σ) = pr(η(τ) = 1 : σ)

The first equality comes from the alternative realities model, and the second is ignorability of σ. Thus, Rosenbaum and Rubin's main result is that if the treatment-determining mechanism is ignorable, then we can use the sample fraction of successes in the treatment (or control) sample to estimate the population fraction of successes η(1) = 1 (or successes η(0) = 1). Or even more simply, ignorable mechanisms can be ignored.

Followers of the ignorability argument then use pr(η(1) = 1 : σ) – pr(η(0) = 1 : σ) as the measure of treatment effect. A treatment is thus effective to the extent that the patient population contains more occurrences of η(1) = 1 than it does of η(0) = 1.

There is an innocent-seeming generalization of ignorability: that for some variable w (or collection of variables w) we have:

[t = τ] ╧ η(τ) | w : σ, for all τ

or again more simply in the binary outcome case:

pr(η(τ) = 1 | t = τ, w : σ) = pr(η(τ) = 1 | w : σ)

Thus, although we do not have σ ignorable across the entire patient population, within groups determined by w we do have ignorability. This is called conditional ignorability. It points toward the strategy of stratifying the sample and the analysis, justifying some of the earliest work on this approach by William Cochran.10 The fraction of successes on treatment τ within a group of patients having a common value of w estimates the fraction of cases where η(τ) = 1 in that patient subgroup. The treatment effect pr(η(1) = 1 | w : σ) – pr(η(0) = 1 | w : σ) is now defined separately for each value of w. Nearly all analysts then pool these different treatment effects in some way. I will question whether this is a good idea below, but for now let us merely accept it.

Before moving on, we may note that the idea of ignorability provides a rationale for randomization in clinical trials. The presumption is that when the treatment is determined by a computer generated random number, the selected treatment is independent of all clinically measured quantities. This means that [t = τ] is essentially independent of “everything else,” and so it is independent of η(τ). That is, the treatment-determining mechanism of randomization is ignorable. Obviously the fundamental issue for nonintervention research is whether there are instances of ignorability other than randomization.
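
The following small simulation sketch (in Python, with hypothetical potential outcomes and a single binary common cause w; the numbers are arbitrary) illustrates both points: under randomization the naive difference in success fractions estimates pr(η(1) = 1) − pr(η(0) = 1), while under a w-driven selection mechanism it does not, and stratifying on w restores the estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# A binary common cause w and hypothetical response profiles (eta0, eta1)
# whose distribution depends on w (sicker patients, w = 1, do worse).
w = rng.binomial(1, 0.5, n)
eta0 = rng.binomial(1, np.where(w == 1, 0.2, 0.6))   # outcome if untreated
eta1 = rng.binomial(1, np.where(w == 1, 0.4, 0.7))   # outcome if treated
true_effect = (eta1 - eta0).mean()

def observed_effect(t, y):
    return y[t == 1].mean() - y[t == 0].mean()

# Ignorable mechanism: randomization ignores everything, including w.
t_rand = rng.binomial(1, 0.5, n)
y_rand = np.where(t_rand == 1, eta1, eta0)

# Non-ignorable mechanism: sicker patients are treated more often.
t_sel = rng.binomial(1, np.where(w == 1, 0.8, 0.2))
y_sel = np.where(t_sel == 1, eta1, eta0)

print("true effect              :", round(true_effect, 3))
print("randomized, naive        :", round(observed_effect(t_rand, y_rand), 3))
print("selected, naive          :", round(observed_effect(t_sel, y_sel), 3))

# Conditional ignorability: within w-strata the naive contrast is unbiased,
# and the strata can be recombined with population weights.
strata = [observed_effect(t_sel[w == k], y_sel[w == k]) for k in (0, 1)]
weights = [(w == k).mean() for k in (0, 1)]
print("selected, stratified on w:", round(np.dot(strata, weights), 3))
```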

Common causes

Assuming again that we are in the nonintervention situation, and randomization is out of the question, the issue is whether there are ever variables that lead to conditional ignorability. To see that there are, let us specify the alternative realities model as:

y = η(t, w, u)

with the additional condition:

[t = τ] ╧ u | w : σ, for all τ

This says that w is part of the cause of y, and it allows w to also be part of the cause of t. Although u may be another part of the common cause, when we condition on w there is no relationship between u and t. Thus we may say that u has been removed as a common cause. We take this setup to mean that w consists of the important common causes of y and t, and that u is a removable part of common cause. This is again a notion of a minimal set of common causes, in the sense that by conditioning on them, all other potential common causes are not in fact common causes, because they are (conditionally) independent of treatment.

The immediate result of our definition is:

[t = τ] ╧ η(τ, w, u) | w : σ, for all τ

This shows that the assumption that one can condition on variables that will make σ ignorable is the same as assuming that there are common causes of y and t when σ is the mechanism determining t.

One of the ways in which this result is not fully satisfying is that the list of variables in w that make up the common causes may be rather lengthy. The question then becomes whether there is any way to simplify things.

The propensity score

Rosenbaum and Rubin introduced the propensity score as the probability of treatment, given a battery of explanatory factors.9,11 Again relying on the binary treatment case for the purposes of exposition, the propensity score can be written as:

pr(t = 1 | w : σ)

Note two simple facts. First, there are multiple propensity scores, one for every choice of a prediction battery w. Second, there is also a propensity score for each treatment-determining mechanism. For these reasons, I can define the notation for the propensity score by:

π(w : σ) = pr(t = 1 | w : σ)

This is to emphasize that it is a function of w and σ. Consequently, if we think of w as a chance variable (one with a probability distribution), then the propensity score is also a chance variable. This is, in fact, the view taken by propensity score analysis.

Rubin and Rosenbaum in effect make use of a highly intuitive, but somewhat strange result. It can be stated for general chance variables as:

u ⊥ fctn(v) | E[u|v]

That is, u is uncorrelated with any function of v given the value of E[u|v]. Recalling that E[u|v] is a function of v, we understand this to mean that E[u|v] carries all the correlation information that v carries about u. Another way of saying this is the seeming tautology that there is no information in v that can improve E[u|v] as a v-based predictor of u. The proof of the above result is an immediate consequence of properties of conditional expectation: letting g be any function of v:

E[u g(v) | E[u|v]] = E[E[u g(v) | v] | E[u|v]] = E[E[u|v] g(v) | E[u|v]] = E[u|v] E[g(v) | E[u|v]] = E[u | E[u|v]] E[g(v) | E[u|v]]

(As a digression, this result considerably extends the propensity argument to multiple treatments and even quantitative treatments. See Appendix 1.)

When u is a binary variable, then the Rubin-Rosenbaum result can be strengthened to:

u ╧ v | E[u|v]

This result is now applied very easily. First, suppose that we take the alternative realities model to be:

y = η(t, w)

Recalling that π(w : σ) = E[t | w : σ] = pr(t = 1 | w : σ), we have:

[t = 1] ╧ w | π(w : σ)

That is, within a group of patients having a common value of π(w: σ), w is conditionally independent of [t = 1]. It is easy to see that this applies to [t = 0], and so it applies to any [t = τ]. It follows immediately from this that

[t = τ] ╧ η(τ, w) | π(w : σ), for all τ

This says that the treatment mechanism is conditionally ignorable given the propensity score.

We can see with a small amount of additional work that this extends to a more general situation. Again take:

y = η(t, w, u)

with w as the minimal common causes, so that [t = τ] ╧ u | w : σ. Then (leaving off the trailing : σ), for any function g(w,u):

E[t g(w,u) | π(w)] = E[E[t g(w,u) | w] | π(w)] = E[E[t | w] E[g(w,u) | w] | π(w)] = π(w) E[g(w,u) | π(w)] = E[t | π(w)] E[g(w,u) | π(w)]

Thus:

[t = τ] ╧ (w, u) | π(w), for all τ

and so:

[t = τ] ╧ η(τ, w, u) | π(w), for all τ

We have, therefore, a very powerful way to justify simple estimation of treatment effects in a nonintervention situation. Just compute a propensity score for each patient in the sample, on the basis of the minimal common causes. Within each patient group having the same propensity score, estimate the treatment effect, which might be pr(η(1) = 1 | w : σ) – pr(η(0) = 1 | w : σ). It is, of course, extremely natural to imagine that this effect does depend on w but that it is free of σ. That is, how the treatment is determined has nothing to do with the population distribution of the hypothetical responses. This would allow us to remove σ in the above notation, but this may not be as critically important as it seems, because in specific cases we are always dealing with some specified treatment-determining mechanism.

Here is a way of summarizing the situation to make it more applicable. In the whole patient population we have common causes that threaten to undermine our attempts to see the causal relationship between treatment and outcome. If we can identify a minimal subset of the common causes, then we can compute the propensity score on the basis of that minimal subset. Then within each patient stratum where the propensity score is constant, the common causes are no longer common causes, and so we can properly estimate treatment effects within each of these homogeneous groups.
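
A sketch of this workflow, assuming simulated data and scikit-learn's logistic regression for the score (any method of estimating pr(t = 1 | w) could be substituted), might look as follows; the quintile strata and variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 50_000

# Several common causes w, a treatment whose selection depends on them,
# and a binary outcome that depends on both w and treatment.
w = rng.standard_normal((n, 3))
lin = 0.8 * w[:, 0] - 0.6 * w[:, 1] + 0.4 * w[:, 2]
t = rng.binomial(1, 1 / (1 + np.exp(-lin)))
p_outcome = 1 / (1 + np.exp(-(-0.5 + 0.7 * t + 0.9 * w[:, 0] - 0.5 * w[:, 1])))
y = rng.binomial(1, p_outcome)

# Step 1: estimate the propensity score pi(w) = pr(t = 1 | w).
# (A large C means essentially no regularization.)
ps = LogisticRegression(C=1e6).fit(w, t).predict_proba(w)[:, 1]

# Step 2: form strata that are (approximately) homogeneous on the score.
strata = np.digitize(ps, np.quantile(ps, [0.2, 0.4, 0.6, 0.8]))

# Step 3: estimate the treatment effect within each stratum, then summarize
# across strata with population weights.
effects, weights = [], []
for s in np.unique(strata):
    m = strata == s
    if t[m].min() == t[m].max():       # need both treatments in the stratum
        continue
    effects.append(y[m & (t == 1)].mean() - y[m & (t == 0)].mean())
    weights.append(m.mean())

# The naive contrast mixes the treatment effect with the influence of w;
# the stratified summary removes most of that contamination.
print("naive difference   :", round(y[t == 1].mean() - y[t == 0].mean(), 3))
print("stratified on pi(w):", round(np.average(effects, weights=weights), 3))
```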

In doing all of this, we should leave out of the propensity score any variable that causes treatment selection but does not influence the outcome (given the common causes we have identified). We will see below why this exclusion is important. To see that it is warranted, suppose that y ╧ v | w but pr(t = τ | w, v : σ) = π(w,v) genuinely depends on v. In this case v is not part of the common cause, and the question is about the consequences of including it in the propensity score anyway. Compute covariances:

cov(y, v | π(w,v)) = E[cov(y, v | w, v) | π(w,v)] + cov(E[y | w, v], E[v | w, v] | π(w,v)) = cov(E[y | w], v | π(w,v))

There is no guarantee here that w ╧ v | π(w,v), and therefore the right side can be nonzero. But this says that, conditional on π(w,v), y and v are correlated. Thus, within each patient group that is homogeneous with respect to propensity score π(w,v), v has been turned into a common cause. In other words, if a cause of t which is not a cause of y is included in the propensity score, then even though it was not a common cause initially, the analysis conditioning on the propensity score can turn it into a common cause. This possibility suggests that propensity score development should try to avoid causes of t that are not common causes. There is another reason for this that I will explain below.
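
The induced-correlation phenomenon can be checked numerically. In the sketch below (simulated data; the names are illustrative), v influences treatment but not outcome: y and v are marginally uncorrelated, yet within propensity strata built from (w, v) they become correlated.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

w = rng.standard_normal(n)            # common cause of y and t
v = rng.standard_normal(n)            # causes t only, independent of w and y
y = w + rng.standard_normal(n)        # outcome depends on w but not on v
ps = 1 / (1 + np.exp(-(w + v)))       # propensity computed from (w, v)

print("marginal cov(y, v):", round(np.cov(y, v)[0, 1], 3))   # near zero

# Within narrow propensity strata, w and v must move in opposite directions
# to keep pi(w, v) fixed, so y (which tracks w) becomes correlated with v.
strata = np.digitize(ps, np.quantile(ps, np.linspace(0.1, 0.9, 9)))
within = [np.cov(y[strata == s], v[strata == s])[0, 1] for s in np.unique(strata)]
print("within-stratum cov(y, v):", np.round(within, 3))      # clearly nonzero
```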

The pooling fallacy

Many of the analysts who have used propensity scores have fallen into a modeling trap. The easiest way to see the trap is in the context of an overall treatment effect model, and for simplicity of exposition we restrict to the case of linear models. Thus, suppose that:

y = β(w)t + fctn(w) + fctn(u)

Assume that w is the measured common cause. Since I am not interested in the details of its relation to outcome, I can represent it somewhat ambiguously, as above. I am taking u to be the determinants of y that are not common causes, or are removable. I am also allowing the very real possibility that the treatment effect depends on w. Now suppose that a linear model is to be employed for analysis:

y = β(π(w))t + fctn(π(w)) + fctn(u)

This strategy (or one with logistic regression when y is binary) is very commonly used, especially with the substitution of L[...|...] in place of E[...|...]. It seems to be the kind of analysis that follows from the fact that conditioning on the propensity score removes the effects of common causes. The last two terms on the right create no problems, because I am not interested in estimating the effects of common causes. In fact, this is one of the things I am giving up in the attempt to estimate the true treatment effects. The problem is that the coefficient of t, the π(w)-specific treatment effect, need not be the correct effect. In fact, it will only be the case that the true conditional treatment effect is correctly estimated when it is a function of the propensity score, which seems like a rather strained condition. Since this feature is not built in to the propensity score, it is certainly not guaranteed.

It is very often assumed (usually implicitly) that in models like those above, the treatment effect β(w) is a constant β, despite the reduction in clinical relevance that this entails. Then the fact that E[w|t, π(w)] = E[w|π(w)] removes treatment effects from the contribution of w to the outcome, justifying the conditioning on t and π(w) rather than on t and w. If the model and constant treatment effect assumptions are valid then this approach makes sense, but it seems unwise to invoke untested assumptions in a process that appears to be driven by the attempt to avoid them. (There is an additional refinement12 that suggests analysis conditional on the treatment residual; see Appendix 2).

Similar considerations apply to other commonly used models that are based on linear combinations of treatment indicators and other factors. This includes logistic regression. The lesson is that a proper propensity score analysis is only guaranteed to be correct when one actually stratifies the patient sample on the propensity score and carries out a stratified analysis. Simply including the propensity score as an additional factor in some partly linear model (like regression, or logistic regression, or proportional hazards modeling) may not eliminate the bias caused by common causes. The reason this is concerning is that this latter procedure is so widely used.
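
The contrast between a pooled model that includes the propensity score as a covariate and a genuinely stratified analysis can be seen in a simulation such as the following (illustrative only; the true score is used for clarity, and the treatment effect is made to vary with w). The pooled coefficient of t is an implicitly reweighted average that can sit far from the population-average effect, while the stratified, size-weighted summary comes close to it.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

w = rng.standard_normal(n)                      # single common cause
ps = 1 / (1 + np.exp(-(1.5 * w - 1)))           # true propensity score
t = rng.binomial(1, ps)
beta_w = 1 + 2 * w                              # treatment effect varies with w
y = beta_w * t + w + rng.standard_normal(n)     # continuous outcome
print("population-average effect  :", 1.0)      # E[beta(w)] = 1

# Pooled model: regress y on an intercept, t, and the propensity score.
X = np.column_stack([np.ones(n), t, ps])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("pooled OLS coefficient of t:", round(coef[1], 3))   # pulled away from 1

# Stratified analysis: effect within propensity strata, weighted by stratum size.
strata = np.digitize(ps, np.quantile(ps, np.linspace(0.05, 0.95, 19)))
eff, wt = [], []
for s in np.unique(strata):
    m = strata == s
    if 0 < t[m].mean() < 1:
        eff.append(y[m & (t == 1)].mean() - y[m & (t == 0)].mean())
        wt.append(m.mean())
print("stratified estimate        :", round(np.average(eff, weights=wt), 3))
```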

Here is another procedure that has become virtually standard in propensity-based analysis but which can be shown to be problematic. It is based on the notion that if we have rendered treatment and common causes unrelated in each propensity group, then it is reasonable to expect that they will be unrelated in the entire sample. That is, if we ignore the grouping then we should see no relationship between treatment and common causes. Thus when patients have been matched on propensity scores, the matching is discarded and an analysis is conducted to see whether the distribution of common causes is the same in both treatment groups. If it is concluded that this is true, then that is used to justify a conventional two-group analysis, just as if we had done a clinical trial.

To see why this approach is self-contradictory, we first need a simple but poorly known result from elementary probability. Let A, B, and C denote three events. Suppose that A ╧ B | C and A ╧ B. Then, either A ╧ C or B ╧ C. To see this, first compute:

pr(A) = pr(A|B) = pr(A|B,C)pr(C|B) + pr(A|B,C*)pr(C*|B) = pr(A|C)pr(C|B) + pr(A|C*)pr(C*|B)

where C* is the complement of C. I can use the same argument with B* in place of B to have:

pr(A) = pr(A|B*) = pr(A|C)pr(C|B*) + pr(A|C*)pr(C*|B*)

Subtract this from the last equation above to get:

0 = pr(A|C)[pr(C|B) − pr(C|B*)] + pr(A|C*)[pr(C*|B) − pr(C*|B*)] = [pr(A|C) − pr(A|C*)][pr(C|B) − pr(C|B*)]

which is equivalent to the conclusion. To apply this to propensity scores, we certainly have [t = τ] ╧ w | π(w), because of how the score is constructed. The practice that I am condemning then looks to see whether [t = τ] ╧ w. If the analyst “succeeds” by finding this, he/she has also found that either [t = τ] ╧ π(w) or w ╧ π(w). The second alternative happens only when π(w) is constant, in which case it is surely independent of t. Thus, in either case [t = τ] ╧ π(w). But if the propensity score is independent of what it is trying to predict, then certainly it is useless.

The practice of verifying balance on common causes between treatment groups (ignoring propensity groups) probably has its roots in the concern that balancing on propensity scores need not balance on any substantive common causes, which seems to present a problem about the patient groups being comparable. As the above argument demonstrates, seeking to show such balance amounts to verifying that the construction of the propensity score has utterly failed in its purpose. It is difficult to criticize researchers for conducting this meaningless ritual, however, because it is one of the few things that statisticians are unanimous in recommending.13–16 I will say below what I think should be done about checking individual variables after matching.

There is yet a further reason to avoid pooling the propensity strata and fitting an overall model. In general, we must have that t and π(w) are correlated. If this were not true, pr(t = 1 | w) would fail to be a predictor of t, so that w does not really consist of common causes. Now it is well known that when t and some other variable highly correlated with t appear in the same linear model, the precision of the coefficient of t suffers. Specifically, the sampling standard deviation of the treatment effect estimate becomes elevated when a correlated variable is added to the model. As the correlation rises above 0.90, so that the prediction of t becomes very good, the estimate of the treatment effect may be so imprecise as to be useless.

This exemplifies a very old dilemma in statistical parameter estimation. In general, if one modifies an estimation procedure to reduce its sampling variability, then a bias is introduced or magnified. Conversely, if a modification is put into place to reduce or eliminate the bias, then the sampling variability is raised. This is called “the trade-off between bias and precision”. The irony in propensity score development is that if one includes a long list of variables in the score, then one is in effect increasing the correlation between the propensity score and the treatment, and thereby reducing the precision of the treatment effect that will appear in an overall model applied to the pooled data. This consideration argues very strongly for the strategy of stratified analysis (based on propensity score groups) as opposed to premature statistical modeling.
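
A quick numerical check of the precision point (simulated, with no confounding at all, so that both fits are unbiased and only the standard errors differ):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000

w = rng.standard_normal(n)
ps = 1 / (1 + np.exp(-6 * w))         # a score that predicts t very well
t = rng.binomial(1, ps)
y = 1.0 * t + rng.standard_normal(n)  # no confounding: both fits are unbiased

def ols_se(X, y):
    """OLS coefficients and conventional standard errors."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return coef, se

X1 = np.column_stack([np.ones(n), t])
X2 = np.column_stack([np.ones(n), t, ps])
c1, s1 = ols_se(X1, y)
c2, s2 = ols_se(X2, y)
print("corr(t, ps)              :", round(np.corrcoef(t, ps)[0, 1], 3))
print("SE of t, without ps      :", round(s1[1], 4))
print("SE of t, adjusting for ps:", round(s2[1], 4))
# The second SE is inflated by roughly 1/sqrt(1 - R^2), where R^2 is from
# regressing t on the score; the stronger the score, the worse the inflation.
```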

The effect gap

From the standpoint of patient-centered analysis, we need to reconsider how a treatment effect should be defined. Recall that the Rosenbaum-Rubin definition is pr(η(1) = 1 | w : σ) – pr(η(0) = 1 | w : σ), and that if w can be observed then this can be estimated from a sample. (There is an obvious additional practicality condition here: that π(w : σ) is neither 0 nor 1, since either case would provide no basis for comparing treatments within a homogeneous w-group.) This definition makes sense from the population or health policy perspective, since it indicates that one should recommend treatment (to everyone) only when there are more hypothetical favorable responses to treatment than there are unfavorable responses. But it is unclear what this means in terms of individual patients.

Table 2.

Probability of η functions

Type                   η(0)   η(1)   Probability
Impervious Negative     0      0         a
Impervious Positive     1      1         b
Susceptible Negative    1      0         c
Susceptible Positive    0      1         d

In fact, the situation is much worse, since the conventional definition of treatment effect does not connect with the causal interpretation. Returning to the simplest treatment-outcome causal model, recall that this approach divided the patient population into four subpopulations. Let the fractions (or probabilities) of these subgroups be as shown above. Then:

pr(η(1) = 1) = b + d
pr(η(0) = 1) = b + c
pr(η(1) = 1) − pr(η(0) = 1) = d − c

This shows that both of these “benefit” probabilities contain fractions b of patients who are not benefited. Although it is true that their difference eliminates b, it also diminishes clinical meaning. Suppose that the conventional treatment benefit were 0.05. If c = 0.05 and d = 0.10, then 85% of the population is impervious. It now makes an enormous difference to the patient whether b = 0 (there is a large chance the patient is impervious and will do poorly) or b = 0.85 (there is a large chance the patient is impervious and will do well). In the first case it is most probable that the treatment will not help a desperate case, and in the second case it is most probable that the treatment is unneeded. On the other hand, if c = 0.40 and d = 0.45 (still a difference of 0.05), then only 15% of the population is impervious and the chance of benefiting from treatment is substantial. Considerations like these extend to more complex cases and suggest that we need to estimate the probability distribution of the pair (η(0), η(1)) rather than the marginal distributions of its parts, η(0) and η(1).
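
The arithmetic of this example is easy to reproduce; the little Python function below simply restates it, with b, c, and d labeled as in Table 2.

```python
# Fractions of the four hypothetical response profiles:
#   a = Impervious Negative, b = Impervious Positive,
#   c = Susceptible Negative, d = Susceptible Positive.
def summarize(b, c, d):
    a = 1 - b - c - d
    p1 = b + d            # pr(eta(1) = 1): "benefit" probability under treatment
    p0 = b + c            # pr(eta(0) = 1): "benefit" probability under control
    return dict(conventional_effect=round(p1 - p0, 3), impervious=round(a + b, 3),
                helped=d, harmed=c)

# The scenarios below all have the same conventional effect, 0.05, yet very
# different clinical meaning (values taken from the worked example in the text).
print(summarize(b=0.00, c=0.05, d=0.10))   # 85% impervious, all of them negative
print(summarize(b=0.85, c=0.05, d=0.10))   # 85% impervious, all of them positive
print(summarize(b=0.00, c=0.40, d=0.45))   # only 15% impervious
```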

This line of thinking leads in another direction, the possibility that patient-centered outcome analysis should be the real gold standard for medical research. This would mean that we would like to estimate η(1) – η(0) (or η(1,w) – η(0,w)) on a patient-by-patient basis. For a patient population we would want (at a minimum) the mean of these (which we have if the propensity score is based on the actual common causes) and their standard deviation (which the propensity analysis does not provide). The natural way to try to solve this problem is by matching patients.

Matching

If I had the correct alternative realities model:

y = η(t, w, u)

with w the measured common causes and u the measured influences on outcome (but not treatment), then a matching strategy seems plausible for solving our problem. That is, take two patients (i = 1, 2) who share the same values of w and u but have treatments ti. Then the pair (y1,y2) = (η(t1,w,u), η(t2,w,u)) under our assumptions. If it happened that t1 = τ and t2 = 1–τ, then we would have seen the complete hypothetical response profile (η(τ,w,u), η(1–τ,w,u)).

The most we can hope for in a population sense is to estimate the probability distribution of (η (0), η (1)). To have a proper estimate, we need to define σ, which now means the mechanism of matching. Recall that before, σ stood for the mechanism that determined who got which treatment. Now I am saying that beyond that mechanism, if I am going to form pairs of patients for analysis, I need to take into account how I match them.

Expanding the Rosenbaum-Rubin terminology, and citing another of their results,17 I would say that the matching mechanism σ is ignorable provided:

(t1, t2) ╧ (η(0), η(1)) | w : σ

This formulation assumes that we would condition on the common causes w, and in effect presumes that the causes of y (but not of t) are either unknown or not taken into consideration. The conclusion (based on the same reasoning as for ignorability of treatment-determining mechanisms) is

pr(y1, y2 | t1 = 0, t2 = 1, w : σ) = pr(η(0) = y1, η(1) = y2 | w)

That is, any sample-based estimation procedure that is legitimate when conditioning on the common value of w can be used to estimate the (joint) probability distribution of the hypothetical response profile, again conditioning on w.

Perhaps the most obvious way of proceeding here is to sample pairs from the patient sample at random, using stratified random choice within homogeneous w-groups. This makes sense because, using the same argument that justifies treatment randomization, random sampling from the sample makes the resulting pairing “independent of everything.” Given adequate sample size, this would give us estimates of the four subpopulation fractions determined by the four hypothetical response profiles, within each w-group. Cases in which both patients got the same treatment provide indications of how strong the unmeasured u-variables are in determining the outcome (or alternatively, one could include u-variables in the strata, or estimate within w-strata effects in the context of a model that included the u-variables). Cases with opposite treatments provide information about within-w-group treatment effects, but again one needs to consider the extent to which u-variables are ignored or taken into account. It may be worthwhile to point out that this resampling analysis can replace patients after choosing them once. This means that if the sample is relatively small, one can look at all possible pairs of patients in w-strata in building up the estimates of the probabilities for the hypothetical response profiles. The problem with this strategy is that it is difficult to estimate the standard deviations of any resulting treatment effect estimates.
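
A sketch of the resampling idea is given below (Python; the function and variable names are hypothetical). It tabulates the (control outcome, treated outcome) pairs drawn within one w-stratum; as discussed above, reading this table as the distribution of (η(0), η(1)) leans on the ignorability of the matching and on the u-variables being negligible or separately accounted for.

```python
import numpy as np
from collections import Counter

def profile_table(y, t, w_stratum, n_pairs=10_000, rng=None):
    """Resample (control, treated) pairs within one w-stratum and tabulate the
    implied response profiles (y_control, y_treated).  This is only a sketch:
    it treats a randomly drawn control patient as standing in for a treated
    patient's unobserved eta(0), which ignores any remaining u-variables."""
    rng = rng or np.random.default_rng()
    controls = np.flatnonzero(w_stratum & (t == 0))
    treated = np.flatnonzero(w_stratum & (t == 1))
    pairs = zip(rng.choice(controls, n_pairs), rng.choice(treated, n_pairs))
    counts = Counter((y[i], y[j]) for i, j in pairs)
    # keys (0,0), (1,1), (0,1), (1,0) correspond to Impervious Negative,
    # Impervious Positive, Susceptible Positive, and Susceptible Negative.
    return {k: v / n_pairs for k, v in sorted(counts.items())}

# Minimal simulated usage: a single stratum, outcome driven by treatment only.
rng = np.random.default_rng(8)
t = rng.binomial(1, 0.5, 5_000)
y = rng.binomial(1, np.where(t == 1, 0.7, 0.4))
print(profile_table(y, t, np.ones_like(t, dtype=bool), rng=rng))
```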

It is worthwhile to note a strange practice regarding matching that has developed in the applied propensity score literature: matching patients on propensities and then ignoring the matching. The argument is that by matching we are creating two comparable groups of patients, one receiving treatment and one receiving the control. “Comparable” here means that the two groups have the same distribution of common causes. The analysis then proceeds exactly as if one had done an intervention to produce the two groups. Thus, after computing a tentative propensity score, there is an analysis to see whether there are differences in the common causes between the groups, and perhaps to look for a different propensity score if there is. Again following convention, it is almost always the averages of common causes that are compared in the two groups. In theory, however, this practice makes no sense. Recall that a propensity score only renders treatment independent of common causes within propensity strata (patients with the same propensity score). The practice described above looks to see whether treatment is independent of common causes in the whole sample (pooling the strata). But as shown above, this can only happen when treatment selection is independent of the propensity score, which means that the presumed common causes are not actually common causes. Thus, one suspects that the practice of preferring propensity scores that make treatment relatively independent of common causes in the whole sample is actually a way of selecting propensity scores that do a poor job of predicting treatment.

Mis-specifications

There are two basic problems that have to be solved in performing a propensity-based analysis. The first is to determine a minimal set of common causes. Recall that “minimal” here means that treatment selection is independent of the other common causes, given the minimal set. For most medical conditions there are some fairly obvious candidates, but in the case of good electronic medical records (EMR) systems the list of potential common causes can become considerable. It is important to bring in all possible sources of information here, including not only statistical analysis, but also clinical experience and general medical knowledge. Because the favorable results of conditioning on common causes only apply if one has in fact correctly identified them, this is a critical step of the analysis.

The second issue concerns the actual mathematical form of the propensity score. For binary treatment, it is virtually universal to use logistic regression to obtain a propensity score. The basic reason for this is probably inertia, in that logistic regression is a well-known and widely used statistical model. But if the actual conditional probability of treatment does not follow the logistic form, then one is not actually conditioning on the correct propensity score, and again there is no guarantee that any of the favorable results of a stratified analysis will be realized. This problem is not just statistical, since clinical and medical knowledge might contribute to an understanding of how common causes lead to treatment decisions.

This second problem (correct form of the propensity score) can often be reduced or eliminated in studies of large patient samples from EMR systems. The reason is that if the list of minimal common causes is not too long, it may actually be feasible to form strata that are homogeneous for all of them simultaneously. A subgroup of patients who share values on a battery of common causes is called a matched comparison group (MCG). One could then simply compute the fraction of patients selecting treatment within each MCG and use this as a nonparametric propensity score. However, since one of the main benefits of the propensity score is that it eases the stratum-forming process by reducing the conditioning to a single variable (as opposed to the whole list of common causes), and I have for the moment assumed that this was not actually a problem, the necessity of the propensity score vanishes. In fact, the propensity score was devised to deal with the relatively small samples that are typical of intervention studies, so its extension into patient-rich research settings is perhaps neither necessary nor desirable.
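
A sketch of the MCG idea with pandas (the column names and the simulated data are purely illustrative): group on the full battery of discrete common causes, keep groups in which both treatments occur, and compare treatments within each group.

```python
import numpy as np
import pandas as pd

def mcg_analysis(df, common_causes):
    """Form matched comparison groups (MCGs) on the listed common-cause
    columns of df (which must also hold binary columns 'y' and 't'),
    and estimate the treatment contrast within each group."""
    groups = df.groupby(common_causes)
    out = groups.apply(lambda g: pd.Series({
        "n": len(g),
        "nonparametric_ps": g["t"].mean(),                       # pr(t=1 | MCG)
        "effect": g.loc[g.t == 1, "y"].mean() - g.loc[g.t == 0, "y"].mean(),
    }))
    # Keep only MCGs in which both treatments actually occur.
    usable = out[(out.nonparametric_ps > 0) & (out.nonparametric_ps < 1)]
    summary = np.average(usable["effect"], weights=usable["n"])
    return usable, summary

# Example with simulated discrete common causes.
rng = np.random.default_rng(6)
n = 50_000
df = pd.DataFrame({"age_band": rng.integers(0, 5, n),
                   "sex": rng.integers(0, 2, n),
                   "comorbidity": rng.integers(0, 3, n)})
lin = 0.4 * df["age_band"] - 0.5 * df["sex"] + 0.3 * df["comorbidity"] - 1
df["t"] = rng.binomial(1, 1 / (1 + np.exp(-lin.to_numpy())))
p_y = 0.5 * df["t"] + 0.3 * df["age_band"] - 0.4 * df["comorbidity"]
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-p_y.to_numpy())))

strata, pooled = mcg_analysis(df, ["age_band", "sex", "comorbidity"])
print(strata.head())
print("size-weighted summary effect:", round(pooled, 3))
```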

Multiple treatments

The reasoning carried out up to this point extends remarkably easily to the case of more than two treatment options. Let T denote the (finite) collection of possible treatments. Common causes are still the variables that influence both outcome and a choice of treatment from T. If one is in a sufficiently patient-rich setting so that the formation of MCGs is possible, then one can make any comparisons between treatments within an MCG with confidence that the effects of common causes have been removed, and summarization across MCGs is conceptually the same as in the case of two treatments. If propensity scores are to be used, then they are defined by:

π(τ, w : σ) = pr(t = τ | w : σ), for τ ∈ T

If one wants to resort to modeling, there is a multiple-outcome version of logistic regression that can be used to estimate them. Knowledge of the values π(τ, w : σ) for all but one τ is equivalent to knowledge of all of them, because their sum is 1. The argument leading to [t = τ] ╧ η(τ, w) | π(τ, w: σ) (τ ∈ T) follows from precisely the same form of reasoning that worked in the case of two treatments (see Appendix 3). Thus, for the formation of propensity-score groups, one only needs to match on several propensity probabilities simultaneously, instead of matching on only one. Presumably there would usually be many fewer treatments than common causes, so even in this extended case propensity matching would be easier than matching on the common causes themselves.

Summary on stratification

The above exposition has been made to clarify the theoretical issues that underlie stratification and patient matching as strategies for reducing the biases in treatment effectiveness assessment that are produced by common causes. The development has been untempered by practical experience. Thus, this summary distills the principles that one should use in nonintervention analysis as they are revealed by the theory.

The first step is to identify a minimal set of common causes. Every variable in this list should be associated with both outcome and treatment selection. The reason is that it is only the set of common causes that produce biased estimation of treatment effects.

Any variable that influences outcome but is conditionally independent of treatment given the minimal set should not be used to stratify or to compute propensity scores. This is because a bloated list of common causes will needlessly reduce the usefulness of any model-based propensity score computation.

Nor should any variable that influences treatment decision but not outcome be used. This is due to the possibility that a variable of this type that is not a common cause will be turned into a common cause, even in the stratified analysis.

Form patient strata based on the actual minimal common causes, if possible. This amounts to a “nonparametric” version of propensity scoring, which reduces the risk of getting the wrong form for the conditional probability of treatment.

Failing the above possibility, form patient strata based on propensity scores. Because of the uncertainties about the correct form for this score, several different choices might be explored.

Carry out the analysis at the MCG or propensity-group level. Be sensitive to the possibility of different treatment effects in different strata. Consider modeling treatment effects in terms of stratum characteristics (the values of the common causes for an MCG). Avoid being too aggressive in pooling estimates to a single number when there is evidence of heterogeneity. In the case of propensity groups, since patients in the same group can have different values for individual variables that went into the propensity calculation, it may be desirable to use at least some of them as adjustment variables (in the usual linear modeling sense), but again at the group rather than the patient level.

Do not pool the data and apply overall statistical modeling. The theory of propensity scores only guarantees reduction of the unwanted effects of common causes within propensity strata; this does not extend across the pooled strata. The risk of bias is elevated. Avoid the risk of trivial, nonsignificant results due to high correlation between treatment and propensity score by not including the propensity score as a statistical adjustment variable.

An additional use of propensity scores has been proposed in which the analysis is weighted by the inverse of the propensity scores.18

Selection Modeling

The above warning about premature pooling of strata with an overall analysis results from the fact that propensity-based analysis does not model the treatment selection process directly. When a plausible selection model is worth consideration, then the method of analysis invented by James Heckman5 might be used. The development of this model is more challenging than what has been given above for stratification, but the final method of analysis is simpler.

Heckman's model poses two equations:

y = α + βt + δx + u
t = 1 if γw + v > 0, and t = 0 otherwise

The first equation is a linear model that explains how the outcome y is related to treatment t, to one or more variables symbolized here by x, and to a residual term u. Although this appears to be a regression equation, it is not. The problem is that u, representing influences on y that are not measured, is assumed to be associated with t. The suggestions here are that x consists of measured causes of y unrelated to t, and that u represents the common causes that are the source of all the problems. The second equation is the mathematical way of saying that the treatment will be given (t = 1) when the value γw + v exceeds 0. Here, w represents one or more measured common causes, γ is one or more parameters, and v is a chance variable consisting of unmeasured factors that influence t. In practice, γw is a linear combination of variables (so γ stands for a list of coefficients and w stands for a list of variables).

The key assumption in Heckman's model is E[u|v] = ρv, with ρ nonzero. That is, the part of the association between outcome and treatment that is due to common causes is modeled as a correlation between the residual u in the first equation, and the unmeasured determinant of t in the second equation.

The objective of this approach is to find a method of estimating the treatment effect β, and to do this it will be necessary to estimate the parameter (or parameters) γ. It will turn out that there will automatically be an estimate of ρ, which is worthwhile for a reality check on the selection part of the model.

We turn first to the estimation of γ. The final assumption in Heckman's model is a probability distribution for v:

pr(v ≤ s) = F(s)

Because v is unmeasured, the form of F is not critical; it is merely proposed because without some specification of a distribution for v, no further progress is possible. We (harmlessly) assume that F is symmetric around 0, and then:

pr(t = 1 | w : σ) = pr(v > −γw) = F(γw)

This provides a probability model for the estimation of γ. A convenient choice for F is the standard logistic distribution:

F(s) = exp(s)/(1 + exp(s))

Thus, the γ coefficient(s) can be estimated easily using logistic regression. (Note that the fitted model would have a constant term as well, again because the centering of v is actually immaterial.)

The computation will be simplified a bit if we define the entropy of the prediction as:

H(w) = −[F(γw) log F(γw) + (1 − F(γw)) log(1 − F(γw))]

Returning to Heckman's first equation, for an untreated patient:

E[y | t = 0, x, w] = α + δx + E[u | t = 0, x, w]

It is tacitly assumed that the final term on the right is free of x (that is, u ╧ x | t,w). With some effort one can then show that:

E[u | t = 0, w] = ρ E[v | t = 0, w] = ρ E[v | v ≤ −γw]

With even more effort one can derive:

E[v | v ≤ −γw] = (1/F(−γw)) ∫_{s ≤ −γw} s dF(s) = −γw + log F(γw)/(1 − F(γw)) = −H(w)/(1 − F(γw))

where the first equality is generally true, and the second two equalities use special features of the logistic distribution.

Returning to Heckman's first equation, but now for a treated patient (making the same assumption):

    E[y | t = 1, x, w] = β + αx + E[u | t = 1, x, w]

Then:

    E[u | t = 1, w] = ρ E[v | t = 1, w] = ρ E[v | v > −γw]

and:

    E[v | v > −γw] = −γw + (1/F(γw)) ∫_{−γw}^{∞} [1 − F(s)] ds
                   = −γw − (ln(1 − F(γw))) / F(γw)
                   = H(w) / F(γw)

where again the first equality is general, and the second two equalities are specific to the logistic distribution.
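
These truncated-mean expressions are easy to check numerically; the sketch below compares simulated standard-logistic draws with the formulas just derived (the particular value chosen for γw is arbitrary, and all names are ours).

import numpy as np

rng = np.random.default_rng(1)
v = rng.logistic(size=2_000_000)     # standard logistic draws
gw = 0.7                             # an arbitrary value of the linear predictor gamma*w

F = 1 / (1 + np.exp(-gw))            # F(gamma*w) = pr(t = 1 | w)
H = -(F * np.log(F) + (1 - F) * np.log(1 - F))    # entropy of the prediction

print(v[v <= -gw].mean(), -H / (1 - F))   # both approximately -1.92
print(v[v >  -gw].mean(),  H / F)         # both approximately  0.95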

We can now put Heckman's two equations together:

    E[y | t, x, w] = βt + αx + ρ H(w) [ t / F(γw) − (1 − t) / (1 − F(γw)) ]

There are several things worth noting. First, this is just the original equation for the outcome, with one term added on. Second, the part of this additional term that depends on w does so only through F(γw). Given that the first part of the computation is a logistic regression, this is nothing more than the estimated probability of treatment, which any good statistics program will provide. Third, ρ appears in this equation as a regression parameter, and thus it will be estimated in the second part of the computation. If it is near 0, then either there is no common-cause problem, or else the battery of factors w does not comprise (all of) the minimal common causes. On the other hand, a nonzero ρ estimate is an indication that the analysis is at least going in the right direction.

Summary on selection

We can see that although Heckman's approach involves more mathematics (basically some easy calculus), it yields a rather simple method. Like stratification, it requires that the common causes w be identified; this is, of course, the problematic part of any method that tries to remove the artifacts produced by common causes. It also gives the option of identifying an additional set of variables x, which might include causes of y that are unrelated to t. In practice, some of the components of w might be included in x, the idea being that causes of y might appear in the final equation apart from their role in determining t. The computations consist of three steps (a small computational sketch follows the list):

  • Perform a logistic regression of t on w.

  • Compute the estimated prediction of t (that is, E[t|w] = pr(t = 1|w)).

  • Regress the outcome on treatment (t), potentially other factors (x), and on the additional term derived by Heckman involving t and w.
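
A minimal end-to-end sketch of these three steps, continuing the simulated example given earlier (statsmodels is used for both fits; the variable names and the use of ordinary least squares in the final step are choices made for illustration):

import numpy as np
import statsmodels.api as sm

# Step 1: logistic regression of t on w (arrays y, t, x, w from the earlier simulation).
W = sm.add_constant(w)
sel = sm.Logit(t, W).fit(disp=0)

# Step 2: the estimated probability of treatment, pr(t = 1 | w).
pi = sel.predict(W)

# Step 3: regress y on t, x, and the additional term involving t and F(gamma*w).
H = -(pi * np.log(pi) + (1 - pi) * np.log(1 - pi))   # entropy of the prediction
adj = H * (t / pi - (1 - t) / (1 - pi))              # adjustment term from the combined equation above
X = sm.add_constant(np.column_stack([t, x, adj]))
out = sm.OLS(y, X).fit()
print(out.params)   # the t, x, and adj coefficients track beta, alpha, and rho in the simulation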

The selection model approach differs from the stratification or matching approach in that there is an explicit modeling assumption about how the treatment selection takes place. It is this assumption (or set of assumptions) that leads to the simple, overall regression equation. It may seem that no price needs to be paid for this, but that is not quite true. The additional term that Heckman's model includes is almost certainly correlated with treatment, and that correlation will degrade the precision with which the treatment effect is estimated. If the correlation is high enough, then an analysis may be powerless unless the sample size is rather large. Again, in the context of patient-rich EMRs, sample size may not be the limiting factor. Another price to be paid is in getting the common causes wrong, or simply not being able to measure important common causes. There is nothing, of course, in the selection model that protects against the consequences of this. Perhaps a more severe defect with the selection model is the assumption of a homogeneous treatment effect. One of the advantages of stratification/matching is the opportunity to look specifically and in some detail at how treatment effects may vary across the patient population. Needless to say, there is in general no reason to avoid adding a selection model analysis to a stratified analysis.

Conclusions

Attempts to estimate therapeutic effects from nonintervention data, such as EMRs, cannot simply use the statistical analyses that have become traditional in clinical trials. The alternative methods that have been developed all involve identifying common causes, preferably a minimal set of common causes. From the standpoint of patient-centered research, the most attractive method is matched comparison groups, with the analysis focusing on potential heterogeneity of response, and modeling or pooling MCG-level effects when that is possible. If good matching in terms of common causes is not feasible, then forming patient strata on the basis of a propensity score, or perhaps several propensity scores, seems like an attractive route; the analytic objectives here would be the same as in an MCG analysis. If one wants to fit an overall treatment-effect model, then it is unwise simply to inject a propensity score into an explanatory model; it seems more reasonable to take the approach of Heckman's selection model, or some other selection model. In all analyses of nonintervention data, it is sensible to use a multiplicity of approaches and to look for general convergence of the acceptable approaches on a common conclusion, while not getting lost in excessive, overpowered hypothesis testing that emphasizes trivial differences between treatments at the cost of missing the main point.

Disclosure Statement

The author(s) have no conflicts of interest to disclose.

Acknowledgments

Leslie E Parker, ELS, provided editorial assistance.

Appendix 1

The fact that t ⊥ fctn(v) | E[t|v] is central to the propensity argument. It also forms the basis for propensity-based methods that go considerably beyond the two-treatment case. If t stands for a list of binary indicators of treatments (whether mutually exclusive or not), then they are conditionally uncorrelated with v given E[t|v]. This means that if ti (for indices i) are the components of t, then conditioning on the collection of propensity scores pr(ti = 1|v) renders treatments uncorrelated with v. Propensity matching in this case amounts to matching on more than one propensity score.

In the case where t is a quantitative measure of treatment (dose, for example), the relevant fact is t ⊥ v | L[t|v]. Thus treatment (dose) t is uncorrelated with v conditional on the linear regression prediction L[t|v]. This result holds without any modeling assumptions. It also extends to doses of multiple treatments, and even to complex treatments that contain both binary and quantitative aspects.

Appendix 2

Assume the linear model:

    y = βt + αw + e

where, as usual, y is the outcome, t is binary treatment, and w is the common cause. The treatment residual is t – E[t|w] = t – π(w), and of course t – π(w) ⊥ fctn(w). Thus:

    L[y | t − π(w)] = E[y] + β (t − π(w))

under the sole assumption that e ⊥ t − π(w). The argument for instead using E[y|t, π(w)] requires at least e ⊥ t | π(w) (that e consist of at most removable common causes), whereas the condition used here says only that e is independent of any treatment beyond (or short of) what would be predicted. This alternative procedure (regressing on the treatment residual) is even simpler than regression on treatment and propensity score. Since its appearance in the literature,12 it does not seem to have been employed in medical research. It is intriguing in part because it appears to bear some relationship to the adjustment term suggested by Heckman's selection model.
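
As a hypothetical illustration of the treatment-residual procedure, the following sketch simulates a setting in which the only confounding runs through a measured common cause w (all numerical values and names are ours); regressing y on t − π(w) then recovers the treatment effect, even though the ordinary regression of y on t alone is confounded. In practice π(w) would itself be estimated, for example by logistic regression; the true propensity is used here for clarity.

import numpy as np

rng = np.random.default_rng(2)
n = 20_000
beta, alpha = 1.0, 2.0                     # illustrative parameter values

w = rng.normal(size=n)                     # measured common cause
pi = 1 / (1 + np.exp(-1.5 * w))            # true propensity pr(t = 1 | w)
t = (rng.uniform(size=n) < pi).astype(float)
e = rng.normal(size=n)                     # residual, independent of t - pi(w)
y = beta * t + alpha * w + e

r = t - pi                                  # the treatment residual
print(np.polyfit(t, y, 1)[0])               # naive slope on t: badly confounded
print(np.polyfit(r, y, 1)[0])               # slope on the residual: close to beta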

Appendix 3

Many authors state that randomization is the only way to balance unmeasured prognostic factors across treatment groups. This assertion has been shown to be false by both mathematical arguments19 and statistical simulations.20 Correspondingly, in the literature on nonintervention studies, it is often asserted that forming propensity-score groups (and by implication forming matched comparison groups) cannot balance unmeasured confounders. This assertion is also false, as the following argument demonstrates.

Let w denote the measured common causes, and let u denote an unmeasured common cause. Select two patients independently at random from the population, and let t1, u1, w1 and t2, u2, w2 be their treatments and values of u and w. If t ⊥ u | w, then:

    E[u1 − u2 | t1, t2, w1, w2] = E[u | w1] − E[u | w2]

If we only retain the two patients for paired analysis when they have the same values for w and opposite values for t, then the right side is 0, and since the left side is a measure of imbalance on u for the pair, balance (in expectation) has been achieved for an unmeasured common cause. This shows that with an appropriate matching method the formation of an MCG can balance an unmeasured confounder, in the same sense that randomization does so. In fact, the weaker condition E[u|t,w] = E[u|w] would suffice. This again underscores the importance of identifying a complete, minimal set of common causes.
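
The balance claim can be illustrated with a small simulation (entirely hypothetical, with w made discrete so that exact matching is possible): when t is independent of u given w, pairs matched on w with opposite treatments are balanced on the unmeasured u in expectation, even though the unmatched treatment groups are not.

import numpy as np

rng = np.random.default_rng(3)
n = 50_000
w = rng.integers(0, 5, size=n)                 # measured common cause, 5 levels
u = w + rng.normal(size=n)                     # unmeasured variable driven by w
pi = 1 / (1 + np.exp(-(w - 2.0)))              # pr(t = 1 | w); t is independent of u given w
t = (rng.uniform(size=n) < pi).astype(int)

print(u[t == 1].mean() - u[t == 0].mean())     # unmatched groups: clearly imbalanced on u

diffs = []
for level in range(5):                         # exact matching on w, opposite treatments
    treated = u[(w == level) & (t == 1)]
    control = u[(w == level) & (t == 0)]
    m = min(len(treated), len(control))        # simple 1:1 pairing within the stratum
    diffs.append(treated[:m] - control[:m])
print(np.concatenate(diffs).mean())            # matched pairs: close to 0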

The same result holds for propensity matching, assuming only E[u|t, π(w)] = E[u| π(w)], where π(w) is the propensity score. Thus it is also the case that propensity matching can balance an unmeasured confounder. It is, of course, true that when E[u|t,w] or E[u|t, π(w)] genuinely depends on t, then matching may not remove confounding by u, and in fact it can make the confounding worse.

The situation can be summed up by saying that matching is not guaranteed to balance unmeasured confounders. The same is true of randomization, however, since it does not balance unmeasured confounders; it only distributes them haphazardly.

The warning for any matching method is that there may be common causes that have not been removed by conditioning, and of course the chief worry here is about unmeasured common causes, since they will by their nature elude consideration.

References

  • 1.Aickin M, Elder C. From medical records to clinical science. Perm J. 2012 Fall;16(4):67–74. doi: 10.7812/tpp/12-047. DOI: http://dx.doi.org/10.7812/TPP/12-047. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Black N. Why we need observational studies to evaluate the effectiveness of health care. BMJ. 1996 May 11;312(7040):1215–8. doi: 10.1136/bmj.312.7040.1215. DOI: http://dx.doi.org/10.1136/bmj.312.7040.1215. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Rubin DB. New York: Cambridge University Press; 2006. Matched sampling for causal effects. DOI: http://dx.doi.org/10.1017/CBO9780511810725. [Google Scholar]
  • 4.Guo S, Fraser MW. Thousand Oaks, CA: Sage Publications, Inc; 2009. Propensity score analysis. [Google Scholar]
  • 5.Heckman JJ. Sample selection bias as a specification error. Econometrica. 1979 Jan;47(1):153–61. DOI: http://dx.doi.org/10.2307/1912352. [Google Scholar]
  • 6.Mackie JL. Causes and conditions. American Philosophical Quarterly. 1965 Oct;2(4):245–64. [Google Scholar]
  • 7.Mackie JL. 1st ed. Oxford, UK: Oxford University Press; 1974 May 2. The cement of the universe: a study of causation. [Google Scholar]
  • 8.Rubin DB. Inference and missing data. Biometrika. 1976 Dec;63(3):581–92. DOI: http://dx.doi.org/10.2307/2335739. [Google Scholar]
  • 9.Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983 Apr;70(1):41–55. DOI: http://dx.doi.org/10.2307/2335942. [Google Scholar]
  • 10.Cochran WG. The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics. 1968 Jun;24(2):295–313. DOI: http://dx.doi.org/10.2307/2528036. [PubMed] [Google Scholar]
  • 11.Rubin DB. Estimating causal effects from large data sets using propensity scores. Ann Intern Med. 1997 Oct 15;127(8 Pt 2):757–63. doi: 10.7326/0003-4819-127-8_part_2-199710151-00064. DOI: http://dx.doi.org/10.1017/CBO9780511810725.035. [DOI] [PubMed] [Google Scholar]
  • 12.Brumback B, Greenland S, Redman M, Kiviat N, Diehr P. The intensity-score approach to adjusting for confounding. Biometrics. 2003 Jun;59(2):274–85. doi: 10.1111/1541-0420.00034. DOI: http://dx.doi.org/10.1111/1541-042000034. [DOI] [PubMed] [Google Scholar]
  • 13.Austin PC. A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003. Stat Med. 2008 May 30;27(12):2037–49. doi: 10.1002/sim.3150. DOI: http://dx.doi.org/10.1002/sim.3150. [DOI] [PubMed] [Google Scholar]
  • 14.Hill J. Discussion of research using propensity-score matching: comments on 'A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003' by Peter Austin, Statistics in Medicine. Stat Med. 2008 May 30;27(12):2055–61. doi: 10.1002/sim.3245. discussion 2066–9. DOI: http://dx.doi.org/10.1002/sim.3245. [DOI] [PubMed] [Google Scholar]
  • 15.Hansen BB. The essential role of balance tests in propensity-matched observational studies: comments on 'A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003' by Peter Austin, Statistics in Medicine. Stat Med. 2008 May 30;27(12):2050–4. doi: 10.1002/sim.3208. discussion 2066–9. DOI: http://dx.doi.org/10.1002/sim.3208. [DOI] [PubMed] [Google Scholar]
  • 16.Stuart EA. Developing practical recommendations for the use of propensity scores: discussion of 'A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003' by Peter Austin, Statistics in Medicine. Stat Med. 2008 May 30;27(12):2062–5. doi: 10.1002/sim.3207. discussion 2066–9. DOI: http://dx.doi.org/10.1002/sim.3207. [DOI] [PubMed] [Google Scholar]
  • 17.Rosenbaum PR, Rubin DB. The bias due to incomplete matching. Biometrics. 1985 Mar;41(1):103–16. DOI: http://dx.doi.org/10.2307/2530647. [PubMed] [Google Scholar]
  • 18.Hirano K, Imbens GW. Estimation of causal effects using propensity score weighting: an application to data on right heart catheterization. Health Services and Outcomes Research Methodology. 2001;2(3–4):259–78. DOI: http://dx.doi.org/10.1023/A:1020371312283. [Google Scholar]
  • 19.Aickin M. Beyond randomization. J Altern Complement Med. 2002 Dec;8(6):765–72. doi: 10.1089/10755530260511775. DOI: http://dx.doi.org/10.1089/10755530260511775. [DOI] [PubMed] [Google Scholar]
  • 20.Aickin M. A simulation study of the validity and efficiency of design-adaptive allocation to two groups in the regression situation. Int J Biostat. 2009 May 29;5(1) doi: 10.2202/1557-4679.1144. Article 19. DOI: http://dx.doi.org/10.2202/1557-4679.1144. [DOI] [PMC free article] [PubMed] [Google Scholar]
