Sensitivity Analyses for Means or Proportions with Missing Outcome Data

Stephen R Cole; Paul N Zivich; Jessie K Edwards; Bonnie E Shook-Sa; Michael G Hudgens

doi:10.1097/EDE.0000000000001627

Author manuscript; available in PMC: 2024 Sep 1.

Published in final edited form as: Epidemiology. 2023 May 9;34(5):645–651. doi: 10.1097/EDE.0000000000001627

PMCID: PMC10524136 NIHMSID: NIHMS1898356 PMID: 37155639

Abstract

We describe an approach to sensitivity analysis introduced by Robins and colleagues, for the setting where the outcome is missing for some observations. This flexible approach focuses on the relationship between the outcomes and missingness, where data can be missing completely at random, missing at random given observed data, or missing not at random. We provide examples from HIV that include the sensitivity of the estimation of a mean and proportion under different missingness mechanisms. The approach illustrated provides a method for examining how the results of epidemiologic studies might shift as a function of bias due to missing data.

Keywords: Bias, Missing Data, Sensitivity Analysis

Introduction

Previous work has illustrated several approaches to account for missing outcome data (e.g., (1)), but these approaches may fail to provide valid results when missingness depends on unobserved variables. There are many methods for sensitivity analysis in such settings, but sensitivity analyses remain underused. Around the turn of the century, Robins and colleagues formalized a general sensitivity analysis approach (2). While important papers have emanated from this seminal work, e.g., (3–7), a wider audience may benefit from a better understanding of the general approach, particularly in a simple form. This improved understanding will make it easier to adapt sensitivity analyses to specific circumstances or generalize sensitivity analyses.

One reason to focus on the approach of Robins et al. is that, in this approach, the parameter of interest is nonparametrically point-identified given the assumptions. This means that the approach makes assumptions strong enough to point-identify the parameter without placing any restrictions on the observed data distribution. A parameter is point-identified when it can be written as a function of the observed data distribution (8), and nonparametrically point-identified when there are no restrictions placed on the observed data distribution. Here, the central theorem of (2), which provides a foundation for sensitivity analysis, is restated with a proof provided, and the approach is illustrated, with data and code provided, for three examples where missingness of the outcome variable depends on the outcome itself (i.e., missing not at random).

The Approach

Let $Y_{i}$ denote a measurement, of say CD4 cell count, for unit $i$ , where $i = 1, \dots, n$ . For context, the CD4 cell count is a measure of immune function relevant for determining HIV disease progression. CD4 counts per mm³ of blood are generally between 0 and 2000 cells, with a normal count between 500 and 1500 cells/mm³. CD4 counts lower than 500 are indicative of poor immune function, while counts lower than 200 cells/mm³ indicate high risk of opportunistic infections and mortality. The parameter of interest is the distribution function, or risk, $F_{Y} (y) = P (Y \leq y)$ ; or a functional of the risk (such as the mean), where $Y$ may be discrete or continuous (9). Here, we illustrate a sensitivity analysis to estimate the parameter of interest from a sample when some $Y_{i}$ are missing.

For the $n$ sample units, let $S_{i} = 1$ indicate that $Y_{i}$ is observed, and $S_{i} = 0$ otherwise. The observed sample data are then $O_{i} = \{S_{i}, S_{i} Y_{i}\}$ . Below, subscripts $i$ may be omitted for brevity. Assumption A is that the $n$ units are randomly sampled from the target population, i.e., $O_{1}, \dots, O_{n}$ are independent and identically distributed. Assumption B is the positivity assumption that $P (S = 1| Y = y) > 0$ for all $y$ with $f (y) > 0$ , where $f$ denotes the density or probability mass function of $Y$ . The positivity assumption supposes there is a non-zero probability of $Y$ being observed for all possible values of $Y$ in the target population. When $Y$ is discrete and all possible values are observed in the sample, then assumption B is met. On the other hand, when $Y$ is continuous, it is not possible to verify that assumption B holds.

Let $F_{S, Y} (s, y) = P (S \leq s, Y \leq y)$ be the joint distribution function for the full data. Likewise, let $F_{S, S Y} (s, y) = P (S \leq s, S Y \leq y)$ be the distribution function for the observed data. If there were no missing data (i.e., $P (S = 1) = 1$ ), then $F_{Y} (y)$ would be point-identified and consistently estimated by the empirical distribution function of $Y$ because $S Y = Y$ for all units. However, because of the missing data, $F_{Y} (y)$ is only partially identifiable without additional assumptions, in the sense that we can identify nonparametric bounds (10–12). Point identification may be achieved through additional assumptions (13). One possible additional assumption to gain point identification is as follows.

Let $H (x)$ be an arbitrary but known, continuous, monotone increasing distribution function, such that $\lim_{x \to - \infty} H (x) = 0$ and $\lim_{x \to \infty} H (x) = 1$ (e.g., logistic function, log-log function). Also, let $q (Y; α)$ be a user-specified sensitivity function (e.g., $α Y, α^{Y}$ ), where $α$ is a sensitivity parameter. The choice of the sensitivity function is discussed below.

Theorem 1 (Robins et al.). Given the observed data distribution $F_{S, S Y}, H, q (Y; α)$ , and assumptions A and B, there exists a unique full data joint distribution $F_{S, Y}$ that implies the observed data distribution $F_{S, S Y}$ and

$$P (S = 1 \mid Y) = E (S \mid Y) = H [k + q (Y; α)] \qquad [1]$$

holds for some $k \in (- \infty, \infty]$ . A proof of Theorem 1 is provided in the Appendix.

Theorem 1 implies that $F_{Y} (y)$ is point-identified. The implication is that under [1] we can consistently estimate $F_{Y} (y)$ and then vary the sensitivity parameter $α$ to see how our estimate changes. At extreme values of $α$ , when $q (Y; α)$ tends to positive or negative infinity, the sensitivity analysis results coincide with estimates of the nonparametric bounds. When $q (Y; α)$ is a constant, we are assuming the likelihood of $Y$ being missing is independent of the value of $Y$ (i.e., data are missing completely at random).

Equation [1] is a model for the missing data mechanism, like that used with inverse probability weights (IPW) (14). Specifically, with IPW for missingness, we model the probability of being observed as a function of observed covariates. A standard conditional exchangeability assumption would license us to ignore $Y$ in such a model. But here we have a model for the probability of being observed that is explicitly conditional on $Y$ , so that we can directly explore the sensitivity of results to missing data.
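As a minimal sketch of model [1], assuming the logistic function for $H$ and $q (Y; α) = α Y$ (the choices used in the examples), the probability of being observed can be evaluated for a few outcome values. The intercept `k`, the CD4 values, and the function names below are hypothetical and chosen only for illustration.

```python
import math

def expit(x):
    """Logistic distribution function H(x) = 1 / (1 + exp(-x)), computed stably."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def q_linear(y, alpha):
    """Sensitivity function q(Y; alpha) = alpha * Y."""
    return alpha * y

def prob_observed(y, k, alpha):
    """Model [1]: P(S = 1 | Y = y) = H[k + q(y; alpha)]."""
    return expit(k + q_linear(y, alpha))

# Hypothetical intercept and CD4 values, for illustration only.
k = 1.0
for alpha in (-0.01, 0.0, 0.01):
    probs = [round(prob_observed(y, k, alpha), 3) for y in (100, 400, 800)]
    print(alpha, probs)
```

With `alpha = 0.0` the three probabilities are identical, which corresponds to data missing completely at random; with a nonzero `alpha` the probability of being observed changes with the (possibly unseen) outcome.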

Note that $F_{Y} (y)$ may be written as

$$E \left\{ S I (Y \leq y) / H [k + q (Y; α)] \right\}, \qquad [2]$$

by replacing $H [k + q (Y; α)]$ with $P (S = 1 | Y)$ based on [1], and then using iterated expectation. By replacing the expectation with its empirical counterpart, equation [2] suggests the following IPW estimator for $F_{Y} (y)$ , namely

$$n^{- 1} \sum_{i = 1}^{n} S_{i} I (Y_{i} \leq y) / H [k + q (Y_{i}; α)], \qquad [3]$$

for a specified value of $α$ . In practice $k$ is not known and instead $F_{Y} (y)$ and $k$ can be jointly estimated by solving $\sum_{i = 1}^{n} g (O_{i}; θ) = 0$ using a root-finding algorithm (e.g., Levenberg-Marquardt) (15), where

$$g (O_{i}; θ) = \begin{pmatrix} g_{k} (S_{i}; k) \\ g_{β} (O_{i}; k, β) \end{pmatrix} = \begin{pmatrix} S_{i} / H_{i} - 1 \\ S_{i} I (Y_{i} \leq y) / H_{i} - β \end{pmatrix}, \qquad [4]$$

and $H_{i} = H [k + q (Y_{i}; α)], β = F_{Y} (y)$ , and $θ = (k, β)$ . Because the estimating equations corresponding with [4] are unbiased (i.e., have expectation zero) it follows under standard regularity conditions that the estimator is uniform statistically consistent and asymptotically normal ((16), Theorem 7.1, page 327) allowing asymptotically valid Wald-type 95% confidence intervals to be constructed using the estimated covariance matrix. The covariance of the estimators $\hat{k}, \hat{β}$ can be consistently estimated using the empirical sandwich covariance matrix.
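The point estimates from [4] can be sketched with the standard library alone. This is not the authors' SAS or delicatessen code: it assumes a logistic $H$ and $q (Y; α) = α Y$, swaps in simple bisection for the root-finder (the first equation in [4] involves only $k$ and is decreasing in $k$), and uses simulated gamma-distributed values as a hypothetical stand-in for CD4 counts. The function names are illustrative, and the sandwich variance is omitted.

```python
import math
import random

def inv_h(x):
    """Inverse weight 1 / H(x) for the logistic H, which equals 1 + exp(-x)."""
    return 1.0 + math.exp(-x)

def solve_k(y_obs, n, alpha, tol=1e-10):
    """Solve sum_i (S_i / H_i - 1) = 0 for k by bisection, with q(Y; alpha) = alpha * Y.

    Only units with S_i = 1 enter the sum, so the observed outcomes suffice.
    The left-hand side is decreasing in k, which makes bisection safe.
    """
    def g(k):
        return sum(inv_h(k + alpha * y) for y in y_obs) - n

    ay = [alpha * y for y in y_obs]
    lo, hi = -40.0 - max(ay), 40.0 - min(ay)  # g(lo) > 0 > g(hi) when some Y are missing
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def ipw_cdf(y_obs, n, y_cut, alpha):
    """Return (k_hat, beta_hat), where beta_hat estimates F_Y(y_cut) as in [3]."""
    k_hat = solve_k(y_obs, n, alpha)
    beta_hat = sum(inv_h(k_hat + alpha * y) for y in y_obs if y <= y_cut) / n
    return k_hat, beta_hat

# Hypothetical CD4-like data (not the study data), with outcome-dependent missingness.
random.seed(1)
n = 2000
y_full = [random.gammavariate(2.0, 200.0) for _ in range(n)]
alpha0, k0 = 0.005, -0.5
y_obs = [y for y in y_full if random.random() < 1.0 / inv_h(k0 + alpha0 * y)]

full_prop = sum(y <= 200 for y in y_full) / n
cc_prop = sum(y <= 200 for y in y_obs) / len(y_obs)
k_hat, beta_hat = ipw_cdf(y_obs, n, 200, alpha0)
print(f"full data {full_prop:.3f}, complete case {cc_prop:.3f}, IPW at alpha0 {beta_hat:.3f}")
```

Setting `alpha` to zero in `ipw_cdf` gives every observed unit the same weight, so the estimate reduces to the complete-case proportion.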

The approach above can be extended to allow the probability of being observed to depend on both the outcome $Y$ and observed covariates. In the presence of observed covariates, one may wish to conduct a sensitivity analysis for the possible bias due to unmeasured covariates after controlling for the measured covariates. In particular, the $H$ term is now defined to be $H [Z γ + q (Y; α)]$ , where $Z = (1, W)$ , with covariates $W$ , and $γ$ is a vector of model parameters (including the intercept). Estimates of $θ = (γ, β)$ can be obtained by solving the estimating equation $\sum_{i = 1}^{n} g (O_{i}; θ) = 0$ , where

$$g (O_{i}; θ) = \begin{pmatrix} g_{k} (S_{i}, Z_{i}; γ) \\ g_{β} (O_{i}; γ, β) \end{pmatrix} = \begin{pmatrix} (S_{i} / H_{i} - 1) Z_{i}^{\prime} \\ S_{i} I (Y_{i} \leq y) / H_{i} - β \end{pmatrix}, \qquad [5]$$

$O_{i} = (Z_{i}, S_{i}, S_{i} Y_{i})$ , and $H_{i} = H [Z_{i} γ + q (Y_{i}; α)]$ . Note that here the positivity assumption B above is revised to be $P (S = 1| Y = y, W = w) > 0$ for all possible values of $y, w$ in the target population. Data and code (SAS using the IML procedure and Python using “delicatessen” (15)) are provided on GitHub (https://github.com/pzivich/publications-code).
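For intuition about [5], consider the special case of a single binary covariate $W$ with $Z = (1, W)$ . The two equations $\sum_{i} (S_{i} / H_{i} - 1) = 0$ and $\sum_{i} (S_{i} / H_{i} - 1) W_{i} = 0$ then hold exactly when $\sum_{i} (S_{i} / H_{i} - 1) = 0$ within each stratum of $W$ , so each stratum intercept can be found separately. The sketch below relies on that simplification, a logistic $H$ , and $q (Y; α) = α Y$ ; it is not the general solver in the authors' repository, and the simulated data and function names are hypothetical.

```python
import math
import random

def inv_h(x):
    """Inverse weight 1 / H(x) for the logistic H, which equals 1 + exp(-x)."""
    return 1.0 + math.exp(-x)

def solve_intercept(y_obs, n_units, alpha, tol=1e-10):
    """Bisection for the intercept k solving sum(S / H - 1) = 0 within one stratum."""
    def g(k):
        return sum(inv_h(k + alpha * y) for y in y_obs) - n_units

    ay = [alpha * y for y in y_obs]
    lo, hi = -40.0 - max(ay), 40.0 - min(ay)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def ipw_cdf_binary_w(data, y_cut, alpha):
    """data holds (w, s, y) triples with w in {0, 1} and y = None when s = 0.

    Returns ((gamma0_hat, gamma1_hat), beta_hat) for H[gamma0 + gamma1 * w + alpha * y].
    """
    n = len(data)
    intercepts = {}
    weighted = 0.0
    for w in (0, 1):
        n_w = sum(1 for wi, _, _ in data if wi == w)
        y_obs = [y for wi, s, y in data if wi == w and s == 1]
        intercepts[w] = solve_intercept(y_obs, n_w, alpha)
        weighted += sum(inv_h(intercepts[w] + alpha * y) for y in y_obs if y <= y_cut)
    gamma_hat = (intercepts[0], intercepts[1] - intercepts[0])
    return gamma_hat, weighted / n

# Hypothetical data: W is a binary covariate (e.g., an age indicator), Y is CD4-like.
random.seed(2)
alpha0 = 0.005
data = []
for _ in range(2000):
    w = 1 if random.random() < 0.7 else 0
    y = random.gammavariate(2.0, 200.0)
    p_obs = 1.0 / inv_h(-1.0 + math.log(4.0) * w + alpha0 * y)
    s = 1 if random.random() < p_obs else 0
    data.append((w, s, y if s == 1 else None))

gamma_hat, beta_hat = ipw_cdf_binary_w(data, 200, alpha0)
print(gamma_hat, beta_hat)
```

With continuous or multiple covariates the equations no longer separate by stratum, and a multivariate root-finder is needed.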

Example 1: Mean CD4 cell count

Let $Y$ be the nadir CD4 count (cells/mm³) in a sample of $n$ = 1164 HIV positive adult women participating in the Women’s Interagency HIV Study (17). The Women’s Interagency HIV Study protocol was approved by participating site Institutional Review Boards and participants provided written informed consent. The full data are depicted as a boxplot in Figure 1, called dataset 1. In this example, the parameter of interest is the population mean CD4 count. In the full data, the 1164 women had a sample mean CD4 count of 394 (standard deviation 7.7) cells/mm³.

Figure 1. — Boxplots of CD4 cell count for 1164 participants in the Women’s Interagency HIV Study, 2009.

To illustrate this sensitivity analysis approach, we set the value of $Y$ to missing for approximately 1/4 of the data under three scenarios with true values $α_{0} \in \{- 0.01, 0, 0.01\}$ , where $q (Y; α_{0}) = α_{0} Y$ , and $H$ is a logistic function, such that $α_{0}$ is the log odds ratio of $Y$ being observed given a one-unit difference in $Y$ . Note that whether $Y$ is observed depends on the value of $Y$ whenever $α_{0} \neq 0$ . The data are depicted as boxplots in Figure 1 as datasets 2, 3, and 4 where $α_{0}$ is −0.01, 0, and 0.01, respectively. When $α_{0} = 0$ , the data are missing completely at random, and the sample mean CD4 count was 385 (standard deviation 9.0) cells/mm³ in the 857 observed units (Figure 1, dataset 3). When $α_{0} \neq 0$ , the data are missing not at random. Here, the chance of missingness depends on the possibly unseen value of $Y$ itself. For example, when $α_{0} = 0.01$ the odds of $Y$ being observed increase by a factor of $e^{100 α_{0}} = 2.72$ for every additional 100 CD4 cells/mm³. In this scenario, we expect those with low CD4 counts to be underrepresented in the observed data and indeed the sample mean CD4 count was 459 (standard deviation 8.7) cells/mm³ in the 890 observed units (Figure 1, dataset 4). For completeness, when $α_{0} = - 0.01$ , we expect those with low CD4 counts to be overrepresented in the observed data and the sample mean CD4 count was 294 (standard deviation 5.5) cells/mm³ in the 869 observed units (Figure 1, dataset 2).
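The way missingness was imposed can be sketched as follows. The text does not say how the intercept was chosen, so this sketch assumes it is calibrated by bisection so that the average probability of being observed is 3/4. The outcome values are simulated (hypothetical), not the study data, and the function names are illustrative.

```python
import math
import random

def expit(x):
    """Logistic distribution function, computed stably."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def calibrate_intercept(y, alpha0, target, tol=1e-10):
    """Bisection for k such that the mean of expit(k + alpha0 * Y) equals target."""
    lo, hi = -60.0, 60.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(expit(mid + alpha0 * yi) for yi in y) / len(y) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def impose_missingness(y, alpha0, target=0.75, seed=0):
    """Draw S_i ~ Bernoulli(expit(k + alpha0 * Y_i)); return the intercept and indicators."""
    rng = random.Random(seed)
    k = calibrate_intercept(y, alpha0, target)
    s = [1 if rng.random() < expit(k + alpha0 * yi) else 0 for yi in y]
    return k, s

# Hypothetical CD4-like outcomes.
random.seed(3)
y_full = [random.gammavariate(2.0, 200.0) for _ in range(2000)]
full_mean = sum(y_full) / len(y_full)

cc_means = {}
for alpha0 in (-0.01, 0.0, 0.01):
    k, s = impose_missingness(y_full, alpha0)
    observed = [yi for yi, si in zip(y_full, s) if si == 1]
    cc_means[alpha0] = sum(observed) / len(observed)
    print(alpha0, len(observed), round(cc_means[alpha0], 1))

# Odds ratio for being observed per additional 100 cells/mm^3 when alpha0 = 0.01.
odds_ratio_per_100 = math.exp(100 * 0.01)
print(round(odds_ratio_per_100, 2))
```

The complete-case mean drifts upward for positive `alpha0` and downward for negative `alpha0`, mirroring the pattern reported for datasets 2 through 4.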

Figure 2 depicts the three sensitivity analyses where $α_{0} \in \{- 0.01, 0, 0.01\}$ in panels A, B, and C, respectively. The only adaptation required to the above estimating equation to accommodate the mean (rather than the proportion) is to replace the indicator term $I (Y_{i} \leq y)$ in the second estimating equation in [4] with $Y_{i}$ , such that the second estimating equation becomes $S_{i} Y_{i} / H_{i} - β$ . In each scenario, the estimated mean is plotted along with a 95% confidence interval, and the estimated nonparametric bounds are plotted as dotted horizontal lines. These bounds estimate the most extreme logically possible values (i.e., best- and worst-case scenarios) of the parameter given the observed data (11, 12). Note that these bounds can be estimated, under assumption B, by replacing all missing values with the observed data minimum and maximum values. As expected, the estimated mean asymptotes at the nonparametric bounds in each scenario. In each scenario, the complete case estimator of the mean coincides with the sensitivity analysis where $α = 0$ . In Figure 2 panel C, where $α_{0} = 0.01$ , the sensitivity function approximately equals the full sample mean of 394 at approximately $α = 0.01$ , and likewise at approximately $α = - 0.01$ in panel A, where $α_{0} = - 0.01$ . Here and in the following examples, one would not expect the sensitivity analysis to return the exact full data estimate because the data analyzed are a single random draw subject to sampling error.
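A sweep over $α$ for the mean, together with the estimated bounds, might look like the sketch below. It assumes a logistic $H$ and $q (Y; α) = α Y$ , uses bisection in place of a general root-finder, omits confidence intervals, and runs on simulated (hypothetical) data; the function names are illustrative.

```python
import math
import random

def inv_h(x):
    """Inverse weight 1 / H(x) for the logistic H, which equals 1 + exp(-x)."""
    return 1.0 + math.exp(-x)

def solve_k(y_obs, n, alpha, tol=1e-10):
    """Bisection for k solving sum_i (S_i / H_i - 1) = 0."""
    def g(k):
        return sum(inv_h(k + alpha * y) for y in y_obs) - n

    ay = [alpha * y for y in y_obs]
    lo, hi = -40.0 - max(ay), 40.0 - min(ay)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def ipw_mean(y_obs, n, alpha):
    """Solve the second equation with Y_i in place of I(Y_i <= y): beta = sum(S Y / H) / n."""
    k_hat = solve_k(y_obs, n, alpha)
    return sum(y * inv_h(k_hat + alpha * y) for y in y_obs) / n

def mean_bounds(y_obs, n):
    """Estimated bounds: set every missing value to the observed minimum, then maximum."""
    n_miss = n - len(y_obs)
    total = sum(y_obs)
    return (total + n_miss * min(y_obs)) / n, (total + n_miss * max(y_obs)) / n

# Hypothetical CD4-like data with about one quarter of outcomes missing.
random.seed(4)
n = 1000
y_full = [random.gammavariate(2.0, 200.0) for _ in range(n)]
y_obs = [y for y in y_full if random.random() < 0.75]

lower, upper = mean_bounds(y_obs, n)
alphas = [a / 1000.0 for a in range(-50, 51, 10)]  # -0.05, -0.04, ..., 0.05
estimates = [ipw_mean(y_obs, n, a) for a in alphas]
for a, est in zip(alphas, estimates):
    print(f"alpha={a:+.2f}  mean={est:7.1f}  bounds=({lower:.1f}, {upper:.1f})")
```

Because every weight is at least one and the weights sum to $n$ , each estimate necessarily falls between the two bounds, and the estimate at `alpha = 0` is the complete-case mean.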

Example 2: Proportion of CD4 counts below 200 cells/mm³

In this next example, the parameter of interest is the proportion of women with CD4 count $\leq$ 200 cells/mm³, i.e., $F (200)$ . In the full data, 23% (estimated standard error [SE] = 1.2%) of women had a CD4 count $\leq$ 200 cells/mm³. When $α_{0} = 0$ , 24% (SE = 1.5%) of the 857 observed CD4 counts were $\leq$ 200 cells/mm³ (Figure 1, dataset 3). When $α_{0} = 0.01$ , 12% (SE = 1.1%) of the 890 observed CD4 counts were $\leq$ 200 cells/mm³ (Figure 1, dataset 4). When $α_{0} = - 0.01$ , 30% (SE = 1.6%) of the 869 observed CD4 counts were $\leq$ 200 cells/mm³ (Figure 1, dataset 2).

Figure 3 depicts the three sensitivity analyses where the true values $α_{0} \in \{- 0.01, 0, 0.01\}$ are shown in panels A, B, and C, respectively. Again, the estimated nonparametric bounds are plotted as dotted horizontal lines and the (logistic) sensitivity function asymptotes at the estimated nonparametric bounds in each scenario. Note that the nonparametric lower and upper bounds can be estimated using plug-in estimators of $P (S = 1) P (Y \leq y | S = 1)$ and $P (S = 1) P (Y \leq y | S = 1) + P (S = 0)$ , respectively. In panel C, where $α_{0} = 0.01$ , the sensitivity function approximately equals the full sample proportion of 23% when $α = 0.01$ , likewise for panels A and B where $α_{0}$ = −0.01 and 0, respectively.

Example 3: Proportion of CD4 counts below 200 cells/mm³ with observed covariates

Next, we extend example 2 to the setting where we have additional data on covariates, say $W$ , measured on all $n$ units, which might be shared causes of observation and the outcome of interest. In such settings, one might wish the sensitivity analysis to account for measured covariates (14).

To extend the above example, we consider $W$ to include data on years of age (sample mean 36.4, standard deviation 8.2) as $W_{1}$ and an indicator of nonwhite race (672 of 1164) as $W_{2}$ . We take a random sample of the 1164 women to be the observed data (i.e., 742, 64%), with $S = 1$ , where the probability of being observed was based on a logistic model that depended on $α_{0} Y_{i}$ and $\log (4) I (W_{1 i} > 30)$ , with the intercept chosen such that approximately 2/3 of the data were observed ( $S = 1$ ) and $α_{0} = 0.01$ . Therefore, both ages larger than 30 and higher (possibly unobserved) CD4 counts increased the probability of being observed. Note that whether the outcome is missing depends on $Y$ and $W_{1}$ but not $W_{2}$ . Data are shown in Figure 1, dataset 5.

In the observed data on 742 women, the mean CD4 count was 497 cells/mm³, and 7.4% (SE = 1.0) had a CD4 count < 200 cells/mm³. The estimated nonparametric bounds for the percent CD4 count < 200 cells/mm³ were 4.7% and 41.0%, setting all missing CD4 counts to the extreme observed values of 1933 or 0, respectively.
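The plug-in bounds for a proportion, lower bound $P (S = 1) P (Y \leq y | S = 1)$ and upper bound equal to the lower bound plus $P (S = 0)$ , can be checked against the figures reported above. The short sketch below uses only the rounded values given in the text (742 of 1164 observed, 7.4% below the threshold); the function name is illustrative.

```python
def proportion_bounds(n_obs, n, p_le_obs):
    """Plug-in nonparametric bounds for F(y) = P(Y <= y).

    n_obs    : number of units with the outcome observed
    n        : total number of units
    p_le_obs : proportion of observed outcomes at or below the threshold
    """
    p_s1 = n_obs / n                 # estimate of P(S = 1)
    lower = p_s1 * p_le_obs          # all missing outcomes above the threshold
    upper = lower + (1.0 - p_s1)     # all missing outcomes at or below the threshold
    return lower, upper

lower, upper = proportion_bounds(n_obs=742, n=1164, p_le_obs=0.074)
print(f"{100 * lower:.1f}% to {100 * upper:.1f}%")  # prints 4.7% to 41.0%
```

The result agrees with the reported bounds of 4.7% and 41.0%.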

The sensitivity analysis results are shown in Figure 4. Using the same $H$ and $q$ functions as in examples 1 and 2, at large positive and negative values for the sensitivity parameter $α$ , the sensitivity analysis estimates were approximately equal to the estimated nonparametric bounds (shown as dotted horizontal lines on Figure 4). When $α = 0.01$ , which was the correct value $α_{0}$ from the data generating mechanism, the sensitivity analysis estimate of the proportion with CD4 below 200 cells/mm³ was 20.6% (SE = 2.04), which was close to the full data estimate of 23%.

When $α = 0$ , the sensitivity analysis estimate was 7.2% (SE = 0.93). If $α = 0$ , the probability the outcome is missing is conditionally independent of the outcome given the covariates, i.e., the outcome is missing at random. In the missing at random setting, a standard IPW estimating function (18), using the same notation as above, is

$$g (O_{i}; θ) = \begin{pmatrix} [S_{i} - P (S = 1 \mid W_{i}; γ)] Z_{i}^{\prime} \\ \frac{S_{i} I (Y_{i} \leq y)}{P (S = 1 \mid W_{i}; γ)} - β \end{pmatrix}, \qquad [6]$$

where $θ = (γ, β)$ . Returning to the CD4 data, assuming a logistic model for $P (S = 1 | W; γ)$ with predictors age greater than 30 and nonwhite race, the standard IPW estimate was 7.2% (robust SE = 0.93, assuming weights known) for the percent CD4 < 200 cells/mm³. Note that the sensitivity analysis with $α = 0$ and the standard IPW estimator assuming missing at random give similar results. But the forms of the estimating functions differ. The second equation in [6] is equivalent to the second equation in [5], with $P (S = 1 | W_{i}; γ)$ replacing $H_{i}$ in [5]. However, the first equation in [6] differs from the first equation in [5], and although both estimating equations have expectation zero, they are not equivalent.

Discussion

We described and illustrated an approach to sensitivity analysis where missingness of the outcome variable depended on the outcome itself (i.e., missing not at random). The sensitivity analyses detailed here follow (at least) three simple desiderata, which provide a foundation for sensitivity analyses (2). First, the sensitivity analysis places no restrictions on the distribution of the observed data, but given the observed data and a chosen sensitivity function, the parameter of interest is point-identified. We can learn about a parameter, in the sense of point and interval estimates, when the parameter is both identified and estimable (19). In the illustrated sensitivity analyses, the parameter of interest is identified and estimable at the standard $\sqrt{n}$ rate. Second, if the sensitivity function $q$ tends to infinity (negative infinity) as $α$ becomes large (small), as with the choice of $H$ above, then at the extremes of the sensitivity function $q$ , the sensitivity analysis estimator will be approximately equal to the estimated nonparametric bounds of the full data parameter of interest (11). These bounds are the minimum and maximum possible values for the target parameter, given the observed data. Here we used the observed minimum and maximum values of $Y$ to estimate the bounds. But if there is external knowledge that values outside the observed data range are possible, then the estimation of the bounds can incorporate such knowledge and the bounds will be appropriately widened. In that case, the sensitivity analysis described here will continue to approximately equal the narrower estimated observed data bounds as $α$ tends to positive and negative infinity, under the condition that the sensitivity function $q$ tends to infinity (negative infinity) as $α$ becomes large (small). Third, at $q = 0$ , which corresponds to the assumption of data missing completely at random, the sensitivity analysis detailed here in examples 1 and 2 collapses to a standard maximum likelihood estimator (20). In more complicated settings, as shown in example 3, the proposed sensitivity analysis can be extended to semiparametric estimators and allow for covariates, such that at $q = 0$ the approach corresponds to the assumption of data missing at random (21, 22).

As noted above, the estimating function illustrated in example three [5] differs from the standard IPW estimating function [6]. The first estimating equation in [6] corresponds to the score equation for a logistic regression model for $P (S = 1 | W)$ , and is justified because score equations have mean zero. However, it is unclear how to extend the approach in [6] to the setting where the outcome is not missing at random because this first estimating equation in [6] cannot be evaluated when $S_{i} = 0$ (since we do not observe $Y_{i}$ ). On the other hand, the first estimating equation in [5] is $(S_{i} / H_{i} - 1) Z_{i}^{'}$ , which only requires evaluating $H_{i}$ when $S_{i} = 1$ .

In the examples above, we chose $H$ to be the logistic function and $q$ to be $α Y,$ but other choices could be made. The logistic function is a natural choice because logistic regression models are typically used to construct inverse probability (of missingness) weights. Moreover, $α Y$ is a natural first choice for $q$ because measured covariates typically enter the logistic model with main terms. Regardless, the sensitivity analysis depends on the choice of $H$ and $q$ . While the observed data distribution is not restricted by $H$ or $q$ , the full data distribution is restricted by $H$ and $q$ , and point identification therefore depends on correctly specifying $H$ and $q$ . The choice of $H$ and $q$ also affects whether the sensitivity analysis approaches the nonparametric bounds. In example 3, where covariates are observed, one could conduct a sensitivity analysis ignoring the covariates, as in examples 1 and 2. But in the presence of observed covariates, one often wishes to conduct a sensitivity analysis for the possible bias due to unmeasured covariates after controlling for the measured covariates.

The classical approach to sensitivity analysis (23) specifies the relationships between an unmeasured variable and both selection and the outcome (24), and is particularly useful when there is a suspected unmeasured variable that is a common cause of selection and the outcome. The present (2) and similar (25) approaches to sensitivity analysis specify only the relationship between the outcome and selection. The classical approach to sensitivity analysis ignores $P (S | Y)$ but requires specification of the relationships between an unmeasured variable and both $S$ and $Y$ , which may constrain the class of allowable structures that can be handled. In contrast, the present approach requires a model for $P (S | Y)$ but places no restrictions on the observed data distribution. The present approach can be extended to a Bayesian sensitivity analysis by sampling the $q$ function from a prior distribution (2). Finally, the present approach can be extended to estimate average causal effects, where potential outcomes are missing, by conducting the sensitivity analysis for each level of treatment (or exposure), where $α_{a}$ is the sensitivity parameter for treatment $A = a$ , and then contrasting results (21).

This approach to sensitivity analysis provides a framework to think about how the results of epidemiologic studies might shift as a function of bias due to missing data. To close, we echo (26) that while no universally optimal sensitivity analysis is possible, principled bias analyses can help epidemiologists make better choices.

Acknowledgments:

We thank Drs. Alex Breskin and Robert Platt for expert advice.

Funding Sources:

This work was supported by grants R01AI157758 and P30AI50410 from NIH.

Funding:

NIH grants R01AI157758, U01HL146194, P30AI50410.

Appendix: Proof of the Theorem.

We observe data $O = (S, S Y)$ , such that $F_{O} (o) = F_{S, S Y} (s, y) = P (S \leq s, S Y \leq y)$ and $f_{S} (s) = P (S = s)$ are identified, but $F_{S, Y} (s, y) = P (S \leq s, Y \leq y)$ is not identified. Define subdistribution functions $f_{S, Y} (s, y) = P (S = s, Y \leq y)$ and $f_{S, S Y} (s, t) = P (S = s, S Y \leq t)$ . Note that

$$f_{S, S Y} (s, t) = \begin{cases} 0 & s \notin \{0, 1\} \\ P (S = 0, 0 \leq t) = f_{S} (0) I (0 \leq t) & s = 0 \\ P (S = 1, Y \leq t) = f_{S, Y} (1, t) & s = 1 \end{cases}$$

Therefore, any joint distribution $F_{S, Y}$ with marginal mass function $f_{S} (0)$ and subdistribution function $f_{S, Y} (1, y)$ will give rise to the same distribution of the observed data $O$ . Moreover, $f_{S, Y} (0, y) = f_{S} (0) P (Y \leq y | S = 0)$ is only partially identified because the observed data $O$ provide no information about $P (Y \leq y | S = 0)$ .

Suppose we assume [1], i.e.,

$$P (S = 1 \mid Y = y) = H [k + q (y)], \qquad [1]$$

where $H$ is a known, continuous, increasing distribution function; $k \in (- \infty, \infty]$ ; and $q$ is a known function of $y$ . Let $f_{Y | S} (y | S = s) = \partial P (Y \leq y | S = s) / \partial y$ . Then, by Bayes' theorem,

$$P (S = 1 \mid Y = y) = \frac{f_{Y | S} (y \mid S = 1) f_{S} (1)}{f_{Y | S} (y \mid S = 0) f_{S} (0) + f_{Y | S} (y \mid S = 1) f_{S} (1)} .$$

Therefore, using [1] to substitute $H$ for $P (S = 1 | Y)$ , and recalling the algebraic identity $a / (a + b) = 1 - {[1 + \frac{a}{b}]}^{- 1}$ , we have

$$f_{Y | S} (y \mid S = 0) = f_{Y | S} (y \mid S = 1) \frac{f_{S} (1)}{f_{S} (0)} \left\{ \frac{1}{H [k + q (y)]} - 1 \right\},$$

which is identified because all quantities on the right side are either identified or assumed known. In particular, $q$ is assumed known, and the three densities on the right side are identified. Below we show that $k$ is also identified. Therefore, $F_{S, Y}$ is identified. In other words, there is a unique distribution $F_{S, Y}$ with marginal mass function $f_{S} (s)$ and subdistribution $f_{S, Y} (1, y)$ such that $P (S = 1 | Y = y) = H [k + q (y)]$ .
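The algebra behind the expression for $f_{Y | S} (y | S = 0)$ can be spelled out. The shorthand $a$ and $b$ below is introduced only for this worked step and is not part of the original proof. Write $a = f_{Y | S} (y | S = 1) f_{S} (1)$ and $b = f_{Y | S} (y | S = 0) f_{S} (0)$ , so that Bayes' theorem together with [1] reads $H [k + q (y)] = a / (a + b)$ . Then

$$\begin{aligned} \frac{1}{H [k + q (y)]} &= \frac{a + b}{a} = 1 + \frac{b}{a} \\ b &= a \left\{ \frac{1}{H [k + q (y)]} - 1 \right\} \\ f_{Y | S} (y \mid S = 0) &= \frac{b}{f_{S} (0)} = f_{Y | S} (y \mid S = 1) \frac{f_{S} (1)}{f_{S} (0)} \left\{ \frac{1}{H [k + q (y)]} - 1 \right\} \end{aligned}$$

The last line requires $f_{S} (0) > 0$ , i.e., that some outcomes are missing; otherwise there is nothing to identify.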

The only remaining step in the proof entails showing that $k$ is identified. Let $E_{O} [.]$ denote expectation with respect to $F_{O}$ and $f_{Y}$ denote the marginal density of $Y$ , then

$$E_{O} \left[ \frac{S}{H [k + q (S Y)]} \right] = \int \frac{s}{H [k + q (t)]} d F_{S, S Y} (s, t) = \int \frac{1}{H [k + q (t)]} P (S = 1 \mid Y = t) f_{Y} (t) d t = \int f_{Y} (t) d t = 1.$$

Therefore, $k$ is the solution to $E_{O} \{ S / H [k + q (S Y)] \} = 1$ , which is unique because $H$ is an increasing function.
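The uniqueness claim can be unpacked in one step, under the added assumption (made here for illustration) that $H$ is strictly increasing. Because $S = 0$ contributes nothing to the expectation,

$$E_{O} \left[ \frac{S}{H [k + q (S Y)]} \right] = f_{S} (1) \, E \left[ \frac{1}{H [k + q (Y)]} \,\middle|\, S = 1 \right] .$$

For $k_{1} < k_{2}$ we have $H [k_{1} + q (y)] < H [k_{2} + q (y)]$ for every $y$ , so $1 / H [k_{1} + q (y)] > 1 / H [k_{2} + q (y)]$ , and the right side is strictly decreasing in $k$ . A strictly decreasing function equals 1 at no more than one value of $k$ , which gives uniqueness.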

Footnotes

Conflicts: None

Data availability: Data and code are provided on GitHub (https://github.com/pzivich/publications-code).

References

1. Cole SR, Zivich PN, Edwards JK, et al. Missing Outcome Data in Epidemiologic Studies. Am J Epidemiol 2022.
2. Robins JM, Rotnitzky A, Scharfstein DO. Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In: Halloran ME, Berry D, eds. Statistical Models in Epidemiology, the Environment, and Clinical Trials. New York: Springer, 1999.
3. Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semiparametric nonresponse models. JASA 1999;94:1096–146.
4. VanderWeele TJ, Ding P. Sensitivity Analysis in Observational Research: Introducing the E-Value. Ann Intern Med 2017;167(4):268–74.
5. Robins JM, Finkelstein DM. Correcting for noncompliance and dependent censoring in an AIDS Clinical Trial with inverse probability of censoring weighted (IPCW) log-rank tests. Biometrics 2000;56(3):779–88.
6. Greenland S. Multiple-bias modelling for analysis of observational data. JRSS A 2005;168:267–306.
7. VanderWeele TJ, Arah OA. Bias formulas for sensitivity analysis of unmeasured confounding for general outcomes, treatments, and confounders. Epidemiology 2011;22(1):42–52.
8. Aronow PM, Robins JM, Saarinen T, et al. Nonparametric identification is not enough, but randomized controlled trials are. arXiv 2021;2108.11342v1.
9. Cole SR, Hudgens MG, Brookhart MA, et al. Risk. Am J Epidemiol 2015;181(4):246–50.
10. Robins JM. The analysis of randomized and nonrandomized AIDS treatment trials using a new approach to causal inference in longitudinal studies. In: Sechrest L, Freeman H, Mulley A, eds. Health Service Research Methodology: A Focus on AIDS. Washington, DC: US Public Health Service, 1989:113–59.
11. Manski CF. Nonparametric bounds on treatment effects. The American Economic Review 1990;80:319–23.
12. Cole SR, Hudgens MG, Edwards JK, et al. Nonparametric Bounds for the Risk Function. Am J Epidemiol 2019;188(4):632–6.
13. Breskin A, Westreich D, Cole SR, et al. Using Bounds to Compare the Strength of Exchangeability Assumptions for Internal and External Validity. Am J Epidemiol 2019.
14. Seaman SR, White IR. Review of inverse probability weighting for dealing with missing data. Stat Methods Med Res 2013;22(3):278–95.
15. Zivich PN, Klose M, Cole SR, et al. Delicatessen: M-estimation in Python. arXiv 2022;2203.11300.
16. Boos DD, Stefanski LA. Essential Statistical Inference. New York: Springer; 2013.
17. Lau B, Cole SR, Gange SJ. Competing risk regression models for epidemiologic data. Am J Epidemiol 2009;170(2):244–56.
18. Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000;11:550–60.
19. Hadamard J. Sur les problèmes aux dérivées partielles et leur signification physique. Princeton University Bulletin 1902;13:49–52.
20. Cole SR, Chu H, Greenland S. Maximum likelihood, profile likelihood, and penalized likelihood: a primer. Am J Epidemiol 2014;179(2):252–60.
21. Brumback BA, Hernán MA, Haneuse SJ, et al. Sensitivity analyses for unmeasured confounding assuming a marginal structural model for repeated measures. Stat Med 2004;23(5):749–67.
22. Cole SR, Hernán MA, Margolick JB, et al. Marginal structural models for estimating the effect of highly active antiretroviral therapy initiation on CD4 cell count. Am J Epidemiol 2005;162(5):471–8.
23. Cornfield J, Haenszel W, Hammond E, et al. Smoking and lung cancer: recent evidence and a discussion of some questions. Journal of the National Cancer Institute 1959;22:173–203.
24. Greenland S. Basic methods for sensitivity analysis of biases. Int J Epidemiol 1996;25(6):1107–16.
25. Rosenbaum PR. Observational Studies. New York: Springer-Verlag; 1995.
26. Lesko CR, Cole SR, Schisterman EF. Editorial: Robust Sensitivities. Am J Epidemiol 2021;190(8):1437–48.

