Summary
According to the law of likelihood, statistical evidence is represented by likelihood functions and its strength measured by likelihood ratios. This point of view has led to a likelihood paradigm for interpreting statistical evidence, which carefully distinguishes evidence about a parameter from error probabilities and personal belief. Like other paradigms of statistics, the likelihood paradigm faces challenges when data are observed incompletely, due to non-response or censoring, for instance. Standard methods to generate likelihood functions in such circumstances generally require assumptions about the mechanism that governs the incomplete observation of data, assumptions that usually rely on external information and cannot be validated with the observed data. Without reliable external information, the use of untestable assumptions driven by convenience could potentially compromise the interpretability of the resulting likelihood as an objective representation of the observed evidence. This paper proposes a profile likelihood approach for representing and interpreting statistical evidence with incomplete data without imposing untestable assumptions. The proposed approach is based on partial identification and is illustrated with several statistical problems involving missing data or censored data. Numerical examples based on real data are presented to demonstrate the feasibility of the approach.
Keywords: Censoring, evidential analysis, law of likelihood, likelihood paradigm, missing data, partial identification
1 Introduction
Statisticians are routinely asked to interpret observed data as statistical evidence, even though there is no consensus on what constitutes statistical evidence and how to measure its strength. Hacking (1965) suggests that the likelihood function is the mathematical representation of statistical evidence and that the likelihood ratio (LR) measures the strength of statistical evidence for one hypothesis versus another. This point of view has led to a likelihood paradigm for interpreting statistical evidence that carefully distinguishes evidence about a parameter from error probabilities and personal belief (Royall, 1997; Blume, 2002). Royall (2000) analyzes the probability of observing misleading evidence under parametric models, and Blume (2008) provides a parallel analysis for sequential trials. Royall & Tsou (2003) and Blume et al. (2007) develop adjusted likelihood functions with certain robustness properties under model failure.
Like other paradigms of statistics, the likelihood paradigm faces challenges when data are observed incompletely, due to non-response or censoring, for instance. Such problems, discussed extensively in the frequentist estimation theory as well as in Bayesian statistics, do not appear to have been considered in the likelihood paradigm. This is not surprising given the relative sparsity of the likelihood paradigm literature, which is nonetheless growing rapidly. Some of the estimation methods for incomplete data are likelihood-based and would be easily adaptable to the likelihood paradigm. However, the existing methods, likelihood-based or not, generally require assumptions about the mechanism that governs the incomplete observation of data. In the case of missing data, it is common practice to assume missing-completely-at-random (MCAR) or missing-at-random (MAR) in the sense of Rubin (1976) or specify a model for the missing-data mechanism. With censored data, one might assume independent censoring, possibly conditional on a set of covariates, or specify a model for the censoring mechanism. Such assumptions usually rely on external information and cannot be validated with the observed data. Without reliable external information, the use of untestable assumptions driven by convenience could potentially compromise the interpretability of the resulting likelihood as an objective representation of the observed evidence.
Example (Proportion). To fix ideas, consider the problem of making inference about a proportion with missing data. Suppose that X is a Bernoulli variable, and that the parameter of interest is γ = P(X = 1), a population proportion. If a random sample of X is completely observed, then the evidence about γ can be represented with a standard binomial likelihood. Suppose, however, that X is missing on some subjects, which can happen because of patient dropout in clinical trials, for instance. Let R be the observation indicator, so R = 1 if X is observed and 0 otherwise. If X is assumed MCAR in the sense that R and X are independent, then the likelihood for γ is proportional to a binomial likelihood based on the complete cases only. Alternatively, a selection model could be specified for the conditional distribution of R given X, and the resulting likelihood would involve both the complete and the incomplete cases (Little & Rubin, 2002). None of these assumptions can be tested with the observed data, and it is natural to ask exactly what the data suggest about γ without these untestable assumptions.
The goal of this work is to develop, for a class of incomplete data problems including the above example, a likelihood method to describe and quantify the observed evidence without imposing untestable assumptions. A notable consequence of eliminating such assumptions is that the parameter of interest may become partially identified. Manski (2003) gives an extensive discussion on the notion of partial identification, which has originated and become increasingly popular in the econometrics literature. Early research in this area has focused on characterization of identification regions, and recent research efforts have been directed toward confidence sets for identification regions as well as for the parameters of interest (e.g., Horowitz & Manski, 1998, 2000; Imbens & Manski, 2004; Beresteanu & Molinari, 2008; Zhang, 2009). The framework of Zhang (2009) for partial identification (PI) with incomplete data will be adopted in the present paper. The focus here, of course, is on developing a likelihood method for evidential analysis instead of constructing confidence sets.
The rest of the paper is organized as follows. Section 2 gives a brief overview of the likelihood paradigm. Section 3 describes the incomplete data problem in some generality and discusses the associated identification problem, using the notation and terminology of Zhang (2009). A profile likelihood approach is then proposed in Section 4. The proportion example introduced earlier will be used as an illustration throughout this discussion. The practical implications of the proposed approach are considered in Section 5, with an asymptotic discussion followed by a simulation study. The approach will be further illustrated in Sections 6–8 with additional statistical problems involving missing data or censored data. In most cases the illustration includes a numerical example based on real data. Section 9 gives concluding remarks.
2 The Likelihood Paradigm
The likelihood paradigm is concerned with interpreting statistical evidence and answering questions like “what do the data say?” with regard to two competing hypotheses. Hypothesis tests and posterior probabilities are commonly used to answer such questions in practice. Royall (1997) points out that the Neyman–Pearson theory for testing hypotheses is really a decision theory designed to answer such questions as “what should I do?”. Serious logical inconsistencies can arise when hypothesis tests are used to represent statistical evidence (Royall, 1997, Chapter 2). Posterior probabilities, on the other hand, incorporate prior information as well as the data and may not provide an objective representation of the observed evidence. The Bayesian approach may be more appropriate for questions like “what should I believe?”. A proper concept of evidence is missing from the major statistical theories that dominate today's statistical practice.
The missing concept of evidence can be found in the law of likelihood (Hacking, 1965), which can be stated as follows. Suppose a random variable Y* follows a distribution with density f(·; γ), which is known up to a parameter γ taking values in Γ. Then the likelihood for γ, based on the observation Y* = y*, is given by L(γ) = f(y*; γ). According to the law of likelihood, the data provide evidence supporting one hypothesis H1 : γ = γ1 over another hypothesis H2 : γ = γ2 if L(γ1) > L(γ2), and the strength of that evidence is measured by the ratio L(γ1)/L(γ2). Unlike conventional hypothesis tests, such a comparison does not require the hypotheses to be pre-specified. Royall (1997) and Blume (2002) discuss the implications of the law of likelihood and compare the resulting likelihood paradigm with other paradigms of statistics. Royall (1997) also suggests benchmarks for describing the strength of evidence. Specifically, a LR exceeding 8 (respectively, 32) is said to represent moderate (respectively, strong) evidence.
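The law of likelihood is simple enough to apply in a few lines of code. The following sketch compares two simple hypotheses about a binomial success probability; the data values are illustrative and not taken from any study.

```python
from math import comb

def binom_lik(gamma, y, n):
    """Binomial likelihood L(gamma) for y successes in n trials."""
    return comb(n, y) * gamma ** y * (1 - gamma) ** (n - y)

# Illustrative data: y = 7 successes in n = 10 trials.
y, n = 7, 10

# Evidence for H1: gamma = 0.7 versus H2: gamma = 0.3.
# The binomial coefficient cancels in the ratio, so LR = (0.7/0.3)**7 * (0.3/0.7)**3.
lr = binom_lik(0.7, y, n) / binom_lik(0.3, y, n)

# Royall's benchmarks: LR > 8 is moderate evidence, LR > 32 is strong evidence.
```

Here the LR works out to (7/3)^4, about 29.6: moderate but not quite strong evidence for H1 over H2, with no reference to pre-specified error rates.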
When γ is a vector, some of its components may be of greater interest than others. Suppose γ = (η, ν), where η is of primary interest and ν is a nuisance parameter. The law of likelihood does not prescribe a specific procedure to deal with nuisance parameters. Under special circumstances, one may be able to obtain a marginal, conditional or partial likelihood for η alone (Royall, 1997, Chapter 7). Another approach, based on the profile likelihood L̃(η) = supν L(η, ν), is more generally applicable. Royall (2000) shows that L̃ has reasonable asymptotic properties, even though it does not correspond to a real probability density function.
While the likelihood paradigm requires a reasonable likelihood function to operate on, it does not necessarily require correct specification of a parametric model. There are different approaches to robust inference under the likelihood paradigm. Royall & Tsou (2003) start with a working parametric model for which the object of inference (i.e., the probability limit of the maximum likelihood estimator) remains of interest even when the model fails. They apply an exponential adjustment to the likelihood function based on the working model and show that the adjusted likelihood is robust in an asymptotic sense. Alternatively, one could focus on defining the parameter of interest via estimating functions without specifying the entire distribution of Y*, because the relevant evidence about the parameter can then be represented with an empirical likelihood function, which has asymptotic properties parallel to those of the various parametric likelihood functions discussed earlier. As will be illustrated in Section 7, nonparametric likelihood functions can be very helpful in robust evidential analysis.
3 Incomplete Data and Partial Identification
Suppose now that Y* is not completely observed; instead we observe Y = h(Y*), where h is a known function. Let P* denote the set of all possible distributions of Y*; then the induced model for Y is P ={P*h−1 : P* ∈ P*}. It is often convenient to parameterize P as {Pθ : θ ∈ Θ}, where Θ may be finite- or infinite-dimensional. Suppose θ is identifiable in the usual sense that θ1 = θ2 whenever Pθ1 = Pθ2. This implies that θ is determined by P*, the distribution of Y*, so we can write θ = S(P*) for a known functional S : P* → Θ. The parameter of interest will be written as γ = T (P*) for another (known) functional T. There is no need to consider a nuisance component of γ now because γ is not required to determine P*.
Example (Proportion, Continued). The latent variable Y* in this example is a collection of independent copies of (X, R), and its distribution P* is defined by the probabilities pxr = P*(X = x, R = r), x, r = 0, 1. The parameter of interest is then

γ = P*(X = 1) = p10 + p11.
The observable Y may be thought of as independent copies of (RX, R), since X can be recovered from RX if and only if R = 1. Note that Y is essentially a trinomial variable and its distribution is defined by

θ = (p01, p11, p·0) = (P(RX = 0, R = 1), P(RX = 1, R = 1), P(R = 0)).
Here and in the sequel, dot (·) in place of a subscript denotes summation over that subscript; in particular, p·0 = p00 + p10.
A pair of values (θ, γ) will be said to be compatible if they can possibly obtain together, that is, if there exists P* ∈ P* such that S(P*) = θ and T(P*) = γ. If the value of θ is known, then γ must be one of those values compatible with θ. The collection of such values is called the identification region for γ given θ. Formally, this is written as

Γ|θ = T(S−1(θ)),
where the superscript −1 denotes inverse image. Conversely, if the value of γ is given, then θ must be one of those values compatible with γ. This latter set of values is called the inverse identification region for θ given γ and is denoted by

Θ|γ = S(T−1(γ)).
The parameter γ is said to be completely identified if Γ|θ is a singleton, unidentified if Γ|θ = Γ, and partially identified if Γ|θ is a proper subset of Γ consisting of more than one element. This terminology refines the standard terminology by drawing a distinction between the partially identified case and the truly unidentified case.
Example (Proportion, Continued). In this example, the identification region for γ given θ is easily seen to be

Γ|θ = [p11, p11 + p·0],
and the inverse identification region for θ given γ is

Θ|γ = {θ = (p01, p11, p·0) : p11 ≤ γ ≤ p11 + p·0}.
The identification of γ depends on p·0, the probability that X is missing. Specifically, γ is completely identified if p·0 = 0 (X always observed), unidentified if p·0 = 1 (X always missing), and partially identified otherwise.
If completely identified, γ is essentially a function of θ and evidence about θ easily translates into evidence about γ. On the other hand, in the unidentified case, γ is completely arbitrary even if θ is known and there is little hope to make any meaningful inference about γ without additional information. This paper is primarily concerned with evidential interpretations about a partially identified γ.
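The estimated identification region in the proportion example follows directly from the observed counts. A minimal sketch (the function name and input format are my own; the counts in the usage line are those of the coronary bypass graft trial analysed in Section 4):

```python
def identification_region(n01, n11, n_miss):
    """Estimated identification region [p11, p11 + p.0] for gamma = P(X = 1),
    from counts n01 (X = 0 observed), n11 (X = 1 observed) and n_miss
    (X missing), with no assumption on the missing-data mechanism."""
    n = n01 + n11 + n_miss
    p11_hat = n11 / n
    pdot0_hat = n_miss / n
    return (p11_hat, p11_hat + pdot0_hat)

# With no missing data the region collapses to the sample proportion;
# as the missing fraction grows, the region widens accordingly.
lo, hi = identification_region(54, 32, 24)
```

The width of the region equals the estimated missing-data probability p·0, making the information cost of the missingness explicit.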
4 A Profile Likelihood Approach
We shall take as our starting point a likelihood L for θ based on Y, which can be parametric, semiparametric or nonparametric, depending on the nature of θ. At this stage, there is no need to be specific about this likelihood, as long as it provides a natural representation of the observed evidence about θ. For motivation, let us consider the completely identified case where γ is a function of θ (but θ may not be a function of γ ). In this case, each value of γ corresponds to a distinct θ-set given by Θ|γ, and it seems reasonable to represent the evidence about γ with the profile likelihood:
L̃(γ) = sup{L(θ) : θ ∈ Θ|γ}.    (1)
Without complete identification, the hypothesis that γ equals some fixed value γ1 is no longer equivalent to the hypothesis θ ∈ Θ|γ1. The former implies the latter but not vice versa. Nonetheless, it seems natural to continue to use the profile likelihood L̃ given by (1) to represent the evidence about γ. As a consequence of partial identification, L̃ does not generally have a unique maximizer. If θ̂ is a maximum likelihood estimate of θ, then L̃ attains its maximum at every point in Γ|θ̂. On the other hand, the profile likelihood can take on lower values outside Γ|θ̂ so as to provide an informative representation of evidence, unless γ is unidentified (in which case L̃ is constant). The profile likelihood L̃ can be summarized numerically through support sets of the form Sk ={γ : L̃(γ) > sup L̃(Γ)/k}, k > 1. This looks similar to the likelihood ratio confidence sets derived by Zhang (2009), but the interpretations are very different. It can be shown that Sk is the smallest subset of Γ supported by a factor of k over its own complement. Confidence sets, on the other hand, do not have a direct evidential interpretation (Royall, 1997).
Example (Proportion, Continued). Let (Xi, Ri), i = 1,…, n, be independent copies of (X, R). Recall that Y is essentially a trinomial variable with probability vector θ = (p01, p11, p·0). A standard likelihood for θ is

L(θ) = p01^N01 · p11^N11 · p·0^N·0,

where N01 = #{i : Ri = 1, Xi = 0}, N11 = #{i : Ri = 1, Xi = 1}, and N·0 = #{i : Ri = 0}. It follows that θ̂ = (p̃01, p̃11, p̃·0) = n−1(N01, N11, N·0). To avoid trivialities, all components of θ̂ are assumed positive. Then the profile likelihood for γ is given by
L̃(γ) = sup{p01^N01 · p11^N11 · p·0^N·0 : p11 ≤ γ ≤ p11 + p·0, p01 + p11 + p·0 = 1}.    (2)
Without missing data, this reduces to the usual binomial likelihood.
The method can be illustrated numerically with real data from the Obstructed Coronary Bypass Graft Trial, discussed previously by Hollis (2002). In this trial, 110 patients with obstructed coronary bypass grafts were treated with a stent, and the outcome of interest was restenosis at 6 months (an undesirable event). However, 24 of these patients dropped out of the study with unknown 6-month restenosis status. Of the remaining 86 patients, 32 had restenosis at 6 months while 54 did not. Based on these data, Figure 1 displays the profile likelihood for the 6-month restenosis rate given by (2) as well as a binomial likelihood from a complete-case analysis (restricted to patients with observed outcomes). In this and the subsequent likelihood plots, each likelihood function is divided by its maximum value so the peak value is invariably 1. Obviously, the profile likelihood with no assumptions about the missing-data mechanism appears “flatter” than the complete-case likelihood assuming MCAR. This is a natural consequence of partial identification, as will be discussed in Section 5.
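The maximization in (2) admits a simple piecewise closed form: the constraint p11 = γ binds below the estimated identification interval and the constraint p11 + p·0 = γ binds above it, with the remaining probability split in proportion to the counts. The following sketch, which assumes this derivation (the function name is mine), reproduces the profile likelihood of Figure 1 up to plotting:

```python
import math

def profile_loglik(gamma, n01, n11, n0):
    """Profile log-likelihood for gamma = P(X = 1) from counts n01 (X = 0
    observed), n11 (X = 1 observed) and n0 (X missing), maximizing the
    trinomial likelihood over all theta compatible with gamma."""
    n = n01 + n11 + n0
    lo, hi = n11 / n, (n11 + n0) / n

    def term(count, p):
        return count * math.log(p) if count > 0 else 0.0

    if lo <= gamma <= hi:
        # gamma inside the estimated identification interval: unconstrained MLE
        return term(n01, n01 / n) + term(n11, n11 / n) + term(n0, n0 / n)
    if gamma < lo:
        # constraint p11 = gamma binds; split 1 - gamma proportionally
        c = (1 - gamma) / (n01 + n0)
        return term(n11, gamma) + term(n01, n01 * c) + term(n0, n0 * c)
    # gamma > hi: constraint p11 + p.0 = gamma binds
    c = gamma / (n11 + n0)
    return term(n11, n11 * c) + term(n0, n0 * c) + term(n01, 1 - gamma)

# Obstructed Coronary Bypass Graft Trial, stent arm: 32 restenosis, 54 without,
# 24 unknown.  The profile is flat on [32/110, 56/110] and falls off outside.
peak = profile_loglik(32 / 110, 54, 32, 24)
ratio = math.exp(profile_loglik(0.2, 54, 32, 24) - peak)  # standardized LR at 0.2
```

With n0 = 0 the function reduces to the usual binomial log-likelihood, and support sets Sk can be read off by thresholding the standardized ratio at 1/k.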
Figure 1.
Likelihood functions for the restenosis rate at 6 months after stent placement in the Obstructed Coronary Bypass Graft Trial: the proposed profile likelihood (solid line) vs. a standard binomial likelihood from a complete-case analysis (dashed line).
5 Practical Implications
We now consider the practical implications of the proposed profile likelihood method, first asymptotically and then empirically. The following asymptotic discussion involves standard conditions and elementary arguments, which are omitted in order to avoid distractions. For a completely identified parameter, a reasonable likelihood is expected to be consistent in the sense that the true value γ0 eventually dominates every other fixed value. This can be formalized as L̃(γ)/L̃(γ0) → 0 in probability for any γ ≠ γ0. The same property would be unrealistic to expect in the partially identified case, because the observed data provide no information to distinguish among the values in Γ|θ0, where θ0 is the true value of θ. This limitation applies to any likelihood method that avoids making untestable identifying assumptions, and the profile likelihood L̃ is no exception. As a consequence, L̃(γ0) may fail to dominate L̃(γ) if γ ≠ γ0 lies inside or on the boundary of Γ|θ0. On the other hand, no value of γ can dominate γ0 asymptotically under L̃. This is in contrast to likelihood methods based on untestable identifying assumptions, which are consistent when the assumptions hold but can be dangerously inconsistent, allowing γ0 to be dominated by other values, when the identifying assumptions fail. In large samples, L̃ will be largely proportional to the indicator function of Γ|θ0, which represents all that could possibly be learned about γ0 from the data without external information. In this sense, the profile likelihood method provides an objective representation of the observed evidence without additional assumptions. A sharper likelihood function can be obtained by invoking untestable identifying assumptions. While this is often reasonable to do, one should recognize that the resulting likelihood represents not only the observed evidence but also the identifying assumptions.
The effect of the identifying assumptions is basically shrinking the identification region into a single value, which may or may not be equal (or close) to the true value, depending on the validity of the assumptions.
The following simulation study sheds some light on the finite-sample behaviour of the proposed approach in comparison with standard methods based on identifying assumptions. The study is conducted in the setting of the proportion example, under a selection model given by
logit P*(R = 1 | X = x) = α + βx, x = 0, 1.    (3)
Note that β determines the (non-)ignorability of the missing-data mechanism, with β = 0 corresponding to MCAR. In this study, data are simulated with n = 100, γ = P*(X = 1) = 0.5, β ∈ {0, 1,−1}, and α chosen such that p·0 = P(R = 0) = 0.1 or 0.3. In each scenario (combination of parameter values), 1,000 replicate samples are generated. The proposed method will be compared with a more standard likelihood method that can be described as follows. Note that model (3) is not identifiable without additional assumptions or constraints. One way to constrain the model is to fix the value of β, which allows us to obtain a standard profile likelihood for γ as
L̃β(γ) = supα γ^N11 (1 − γ)^N01 πα(1)^N11 πα(0)^N01 [γ{1 − πα(1)} + (1 − γ){1 − πα(0)}]^N·0,    (4)

where πα(x) = P*(R = 1 | X = x) as a function of α with β held fixed. Note that, for β = 0 (MCAR), the last term no longer involves γ and the profile likelihood reduces to a binomial likelihood based on the complete cases. The maximization in (4) can be carried out with an EM algorithm which naturally treats (N00, N10) as the missing data. Given the current value α(k), the E-step amounts to computing

Ñ10 = N·0 γ{1 − π(k)(1)} / [γ{1 − π(k)(1)} + (1 − γ){1 − π(k)(0)}],  Ñ00 = N·0 − Ñ10,

where π(k)(x) denotes πα(x) evaluated at α = α(k), and the M-step sets α(k+1) to the maximizer of the complete-data log-likelihood

N11 log πα(1) + Ñ10 log{1 − πα(1)} + N01 log πα(0) + Ñ00 log{1 − πα(0)},

which can be found using a Newton–Raphson algorithm. To address the uncertainty about β, L̃β will be calculated for several different values of β (0, 1, −1). This will be referred to as a sensitivity analysis (SA) approach. The proposed method based on PI will be abbreviated as PI.
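The SA computation can be sketched as follows, assuming the logit parameterization of model (3) with β fixed; the function and variable names are mine, and the inner Newton–Raphson loop plays the role of the M-step.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def profile_loglik_beta(gamma, beta, n01, n11, n0, n_em=100):
    """Profile log-likelihood for gamma under the selection model
    logit P(R = 1 | X = x) = alpha + beta * x with beta held fixed,
    maximized over alpha by EM: the n0 incomplete cases are split
    between X = 0 and X = 1 in the E-step."""
    alpha = 0.0
    for _ in range(n_em):
        pi0, pi1 = expit(alpha), expit(alpha + beta)
        # E-step: expected split of the incomplete cases at the current alpha
        w1 = gamma * (1 - pi1)
        w0 = (1 - gamma) * (1 - pi0)
        m1 = n0 * w1 / (w1 + w0)   # expected count with X = 1, R = 0
        m0 = n0 - m1
        # M-step: Newton-Raphson on the concave complete-data log-likelihood
        for _ in range(25):
            pi0, pi1 = expit(alpha), expit(alpha + beta)
            score = n01 * (1 - pi0) + n11 * (1 - pi1) - m0 * pi0 - m1 * pi1
            info = ((n01 + m0) * pi0 * (1 - pi0)
                    + (n11 + m1) * pi1 * (1 - pi1))
            alpha += score / info
    pi0, pi1 = expit(alpha), expit(alpha + beta)
    return (n11 * math.log(gamma * pi1)
            + n01 * math.log((1 - gamma) * pi0)
            + n0 * math.log(gamma * (1 - pi1) + (1 - gamma) * (1 - pi0)))
```

For β = 0 this reproduces the complete-case binomial shape, and evaluating it over a grid of working β values gives the SA curves that Table 1 compares with PI.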
Suppose the true value γ0 = 0.5 is to be compared with the alternative γ1 = 0.3. Such a comparison can be made via LRs, and Table 1 gives the frequency distribution of the LRs for each method in each scenario. In Table 1, the LRs for γ0 versus γ1 are grouped into five intervals: 0–1/32 (strong evidence for γ1), 1/32–1/8 (moderate evidence for γ1), 1/8–8 (weak evidence), 8–32 (moderate evidence for γ0), and 32+ (strong evidence for γ0). It is easy to see that, under the SA approach, the mode of the likelihood (i.e., the maximum likelihood estimate) decreases with βguess, the working value of β. Indeed, Table 1 shows that smaller values of βguess (e.g., −1) are more likely to produce large LRs for γ0 versus γ1, whereas larger values of βguess (e.g., 1) are more prone to false claims of strong evidence for γ1. Setting βguess equal to βtrue, the true value of β, is not necessarily advantageous for comparing a given pair of hypotheses. The PI method is reliably more conservative than SA (for any βguess) and less likely to indicate strong evidence in either direction. This may be desirable, especially when comparing against SA with βguess = 1 in the last scenario. Increasing p·0 from 0.1 to 0.3 widens the identification interval and magnifies the difference between PI and SA. Overall, the simulation results suggest that the PI method operates like a systematic and exhaustive SA seeking consensus among all possible identifying assumptions (hence the conservative behaviour).
Table 1.
Empirical comparison of the proposed method based on PI with a SA approach in terms of the frequency distribution of LRs for comparing the true value γ0 = 0.5 against γ1 = 0.3 in the setting of the proportion example. Here p·0 = P(R = 0) is the proportion of missing data, βtrue is the true value of β, βguess is the value used to implement SA, and Γ|θ0 is the identification interval for γ. See Section 5 for details.
The last five columns give the Freq. Dist. (%) of LRs for γ0 = 0.5 vs. γ1 = 0.3.

| p·0 | βtrue | Γ\|θ0 | Method | βguess | 0–1/32 | 1/32–1/8 | 1/8–8 | 8–32 | 32+ |
|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 0 | [0.45, 0.55] | SA | 0 | 0.1 | 0.5 | 6.5 | 5.4 | 87.5 |
| 0.1 | 0 | [0.45, 0.55] | SA | 1 | 1.1 | 0.9 | 13.2 | 7.7 | 77.1 |
| 0.1 | 0 | [0.45, 0.55] | SA | −1 | 0.0 | 0.1 | 2.7 | 2.9 | 94.3 |
| 0.1 | 0 | [0.45, 0.55] | PI | — | 0.0 | 0.0 | 13.8 | 16.1 | 70.1 |
| 0.1 | 1 | [0.47, 0.57] | SA | 0 | 0.1 | 0.0 | 1.9 | 3.2 | 94.8 |
| 0.1 | 1 | [0.47, 0.57] | SA | 1 | 0.2 | 0.3 | 5.3 | 4.9 | 89.3 |
| 0.1 | 1 | [0.47, 0.57] | SA | −1 | 0.0 | 0.1 | 0.5 | 0.7 | 98.7 |
| 0.1 | 1 | [0.47, 0.57] | PI | — | 0.0 | 0.0 | 5.4 | 10.4 | 84.2 |
| 0.1 | −1 | [0.43, 0.53] | SA | 0 | 0.7 | 1.1 | 15.0 | 8.8 | 74.4 |
| 0.1 | −1 | [0.43, 0.53] | SA | 1 | 2.6 | 2.3 | 22.0 | 13.1 | 60.0 |
| 0.1 | −1 | [0.43, 0.53] | SA | −1 | 0.3 | 0.3 | 7.1 | 6.2 | 86.1 |
| 0.1 | −1 | [0.43, 0.53] | PI | — | 0.1 | 0.3 | 25.0 | 21.2 | 53.4 |
| 0.3 | 0 | [0.35, 0.65] | SA | 0 | 0.4 | 0.6 | 12.9 | 9.8 | 76.3 |
| 0.3 | 0 | [0.35, 0.65] | SA | 1 | 6.6 | 5.7 | 36.4 | 14.0 | 37.3 |
| 0.3 | 0 | [0.35, 0.65] | SA | −1 | 0.0 | 0.0 | 1.1 | 1.4 | 97.5 |
| 0.3 | 0 | [0.35, 0.65] | PI | — | 0.0 | 0.0 | 82.7 | 12.3 | 5.0 |
| 0.3 | 1 | [0.40, 0.70] | SA | 0 | 0.0 | 0.1 | 1.4 | 1.2 | 97.3 |
| 0.3 | 1 | [0.40, 0.70] | SA | 1 | 0.5 | 0.5 | 10.6 | 8.3 | 80.1 |
| 0.3 | 1 | [0.40, 0.70] | SA | −1 | 0.0 | 0.0 | 0.1 | 0.2 | 99.7 |
| 0.3 | 1 | [0.40, 0.70] | PI | — | 0.0 | 0.0 | 45.9 | 23.4 | 30.7 |
| 0.3 | −1 | [0.30, 0.60] | SA | 0 | 6.7 | 6.7 | 38.3 | 15.0 | 33.3 |
| 0.3 | −1 | [0.30, 0.60] | SA | 1 | 33.7 | 13.7 | 39.8 | 6.3 | 6.5 |
| 0.3 | −1 | [0.30, 0.60] | SA | −1 | 0.4 | 0.4 | 12.2 | 9.8 | 77.2 |
| 0.3 | −1 | [0.30, 0.60] | PI | — | 0.0 | 0.0 | 98.3 | 1.5 | 0.2 |
6 Comparing Two Proportions
The proposed profile likelihood approach will be further illustrated with other statistical problems in the rest of the paper. The general notation of Sections 3 and 4 will be used throughout, although its exact meaning will depend on the particular problem. In this section, we extend the proportion example slightly and consider the problem of comparing two proportions with missing data. For j = 1, 2, let {Xji : i = 1,…, nj} denote a random sample from a Bernoulli distribution with success probability γj. The two samples are assumed independent, as in randomized clinical trials. As in the previous example, let Rji be the observation indicator for Xji, so Rji = 1 if Xji is observed and 0 otherwise. Thus Y* consists of (Xji, Rji), i = 1,…, nj, j = 1, 2, and the distribution of Y* is characterized by pjxr = P*(Xj1 = x, Rj1 = r), j = 1, 2, x = 0, 1, r = 0, 1. Now, we can write γj = pj1· = pj10 + pj11, j = 1, 2. The observable Y consists of (RjiXji, Rji), i = 1,…, nj, j = 1, 2, and the identifiable parameter can be written as θ = (θ1, θ2), where θj = (pj01, pj11, pj·0), j = 1, 2. Between θ and γ = (γ1, γ2), the compatibility relation can be expressed as pj11 ≤ γj ≤ pj11 + pj·0, j = 1, 2.
A standard likelihood for θ based on Y is given by L(θ) = L1(θ1)L2(θ2), where Lj(θj) = pj01^Nj01 · pj11^Nj11 · pj·0^Nj·0, with Nj01 = #{i : Rji = 1, Xji = 0}, Nj11 = #{i : Rji = 1, Xji = 1}, and Nj·0 = #{i : Rji = 0}. Let θ̂ = (θ̂1, θ̂2) denote the maximum likelihood estimate of θ. For each γj, the profile likelihood L̃j involves only Lj(θj) and takes the form (2) with an additional subscript j attached to each relevant quantity. Each L̃j is maximized on the interval [γ̱j, γ̄j], where γ̱j = p̂j11 and γ̄j = p̂j11 + p̂j·0, and its maximum value is just Lj(θ̂j). The profile likelihood for γ is simply L̃(γ) = L̃1(γ1)L̃2(γ2), which is maximized on [γ̱1, γ̄1] × [γ̱2, γ̄2].
Let γe = e(γ) denote an effect measure for comparing γ1 with γ2. Common choices of γe include the difference γ2 − γ1, the ratio γ2/γ1 and the odds ratio γ2(1 − γ1)/{γ1(1 − γ2)}. The profile likelihood for γe can be computed as L̃e(γe) = sup{L̃(γ) : e(γ) = γe}. It is easy to verify that this is identical to the original definition of L̃e(γe) according to (1). Suppose e(γ) = γ2 − γ1; then L̃e(γe) = supγ1 L̃(γ1, γ1 + γe) is maximized on [γ̱2 − γ̄1, γ̄2 − γ̱1] and its maximum value is equal to L(θ̂). For γe > γ̄2 − γ̱1, a moment of thought reveals that the maximization of L̃(γ1, γ1 + γe) over γ1 can be restricted to the interval [(γ̄2 − γe) ∨ 0, (1 − γe) ∧ γ̱1], where ∨ denotes maximum and ∧ denotes minimum. The maximizer can be found analytically as in Miettinen & Nurminen (1985) and Farrington & Manning (1990) or numerically using a bisection method (Zhang, 2006). For γe < γ̱2 − γ̄1, L̃e can be evaluated in a similar fashion by maximizing L̃(γ1, γ1 + γe) over γ1 ∈ [(−γe) ∨ γ̄1, (γ̱2 − γe) ∧ 1].
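A brute-force sketch of L̃e for the difference γ2 − γ1 follows; a bisection search as in the references above would be more efficient, but a grid over γ1 makes the profiling transparent. The one-sample profile formula repeats the piecewise form sketched in Section 4, and all names are mine.

```python
import math

def prof_ll(gamma, n01, n11, n0):
    """One-sample profile log-likelihood for gamma = P(X = 1) (piecewise form)."""
    n = n01 + n11 + n0
    lo, hi = n11 / n, (n11 + n0) / n
    term = lambda c, p: c * math.log(p) if c > 0 else 0.0
    if lo <= gamma <= hi:
        return term(n01, n01 / n) + term(n11, n11 / n) + term(n0, n0 / n)
    if gamma < lo:
        c = (1 - gamma) / (n01 + n0)
        return term(n11, gamma) + term(n01, n01 * c) + term(n0, n0 * c)
    c = gamma / (n11 + n0)
    return term(n11, n11 * c) + term(n0, n0 * c) + term(n01, 1 - gamma)

def prof_ll_diff(d, arm1, arm2, grid=2000):
    """Profile log-likelihood for gamma_e = gamma2 - gamma1, maximizing
    prof_ll for arm 1 at g1 plus prof_ll for arm 2 at g1 + d over a grid."""
    best = -math.inf
    for k in range(1, grid):
        g1 = k / grid
        g2 = g1 + d
        if 0 < g2 < 1:
            best = max(best, prof_ll(g1, *arm1) + prof_ll(g2, *arm2))
    return best

# Obstructed Coronary Bypass Graft Trial: arm = (n01, n11, n_missing).
stent, angio = (54, 32, 24), (43, 37, 30)
ll0 = prof_ll_diff(0.0, angio, stent)   # difference = stent - angioplasty
```

The resulting curve is flat over an interval of differences containing zero, which is the numerical counterpart of Figure 3's neutrality about superiority.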
As an example, the Obstructed Coronary Bypass Graft Trial is actually a randomized study comparing stent placement with balloon angioplasty. The angioplasty arm consisted of another 110 patients. In this arm, the 6-month restenosis status was confirmed for 37 patients, ruled out for 43, and unknown for the other 30. Figure 2 shows a contour plot of the joint profile likelihood for the 6-month restenosis rates in the two treatment groups. Figure 3 presents two profile likelihood functions for the difference between the two rates: the proposed one based on PI and a more standard one assuming MAR. The proposed profile likelihood is largely neutral about the superiority of stent placement to balloon angioplasty, with both regions (superiority and non-superiority) containing plausible values that maximize the likelihood. Therefore, any evidence for superiority that might be found in the MAR-based profile likelihood should be attributed to the MAR assumption.
Figure 2.
Contour plot of the joint profile likelihood for the 6-month restenosis rates in the two treatment groups in the Obstructed Coronary Bypass Graft Trial.
Figure 3.
Profile likelihood functions based on PI (solid line) and the MAR assumption (dashed line) for the difference in 6-month restenosis rate (stent-angioplasty) in the Obstructed Coronary Bypass Graft Trial, with the dotted line separating regions of superiority and non-superiority.
7 Means of Bounded Variables
Suppose now that X follows an arbitrary distribution supported by the known interval [a, b] and is not always observed, with R the observation indicator as usual. Then the latent variable Y* consists of independent copies (Xi, Ri), i = 1,…, n. The distribution of Y* can be characterized by (p, F0, F1), where p = P(R = 1) and Fr denotes the conditional distribution of X given R = r, r = 0, 1. Let F denote the marginal distribution of X; then F = (1 − p)F0 + pF1. Suppose the target parameter is γ = EX = M(F), where M denotes the mean functional: M(F) = ∫x dF(x). This allows us to write γ = (1 − p)M(F0) + pM(F1). As in previous examples, the observable can be conceptualized as Y = {(RiXi, Ri) : i = 1,…, n}, a transformation of Y*. The distribution of Y identifies p and F1 but not F0, so θ = (p, F1). Obviously, θ and γ are compatible if and only if
(1 − p)a + pM(F1) ≤ γ ≤ (1 − p)b + pM(F1),    (5)
and γ is partially identified if p ∈ (0, 1) and −∞ < a < b < ∞. The boundedness of X is often reasonable to assume, although precise specification of the bounds may be difficult in practice.
A nonparametric likelihood for θ based on Y can be defined as
L(θ) = ∏i [pF1{Xi}]^Ri (1 − p)^(1−Ri),    (6)
where F1{x} denotes the point mass of F1 at x. Maximizing this likelihood yields the maximum likelihood estimate θ̂ = ( p̂, F̂1), where p̂ is the sample proportion of complete cases and F̂1 is the empirical distribution of the observed values of X. Maximizing the likelihood (6) subject to (5) leads to the profile likelihood for γ , which can also be written as
l̃(γ) = n log n + sup{∑i log wi : wi ≥ 0, ∑i wi = 1, ∑i wi Xai ≤ γ ≤ ∑i wi Xbi},    (7)
where Xai = RiXi + (1 − Ri)a and Xbi = RiXi + (1 − Ri)b, i = 1,…, n (Zhang, 2009, Appendix A.2). The constant n log n on the right-hand side of (7) serves to standardize the log-likelihood by setting its maximum value at 0. The maximum of l̃ is attained at γ ∈ [X̄a, X̄b], where

X̄a = n−1 ∑i Xai and X̄b = n−1 ∑i Xbi.
For γ < X̄a, the profile log-likelihood is easily seen to be

l̃(γ) = sup{∑i log(nwi) : wi ≥ 0, ∑i wi = 1, ∑i wi Xai = γ},

that is, the empirical log-likelihood for the pseudo-observations Xai (with missing X's replaced by a). Following Owen (1988, section 3), we have

l̃(γ) = −∑i log{1 + t(γ)(Xai − γ)},

where t(γ) satisfies

∑i (Xai − γ) / {1 + t(γ)(Xai − γ)} = 0.    (8)
For γ ∈ (a, X̄a), t(γ) must lie in (−(max1≤i≤n Xai − γ)−1, −(min1≤i≤n Xai − γ)−1), on which the left side of (8) is strictly decreasing in t(γ), so l̃(γ) can be evaluated using a bisection method or, more efficiently, a safeguarded zero finding algorithm (Owen, 1988). The computation for γ ∈ (X̄b, b) is completely analogous. Obviously, l̃ reduces to the empirical log-likelihood for the Xi if they are completely observed.
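The bisection step can be sketched as follows, solving (8) for t(γ) and assembling the pieces. All names are my own, and the code assumes γ lies strictly inside the range of the relevant pseudo-observations so that a root exists.

```python
import math

def el_loglik(x, gamma):
    """Empirical log-likelihood sum(log(n*w_i)) for the mean of observations x
    at the value gamma, solving Owen's estimating equation by bisection.
    Requires min(x) < gamma < max(x)."""
    n = len(x)
    if abs(gamma - sum(x) / n) < 1e-12:
        return 0.0
    eps = 1e-10
    lo = -1.0 / (max(x) - gamma) + eps   # keep 1 + t*(x_i - gamma) > 0
    hi = -1.0 / (min(x) - gamma) - eps

    def score(t):
        return sum((xi - gamma) / (1 + t * (xi - gamma)) for xi in x)

    for _ in range(200):                  # score is strictly decreasing in t
        mid = (lo + hi) / 2
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    t = (lo + hi) / 2
    return -sum(math.log(1 + t * (xi - gamma)) for xi in x)

def bounded_mean_profile_ll(x_obs, n_miss, a, b, gamma):
    """Profile log-likelihood (7) for the mean of an [a, b]-bounded variable
    with n_miss missing values: empirical likelihood of the pseudo-observations
    with missing X's set to a (left tail) or b (right tail)."""
    xa = list(x_obs) + [a] * n_miss
    xb = list(x_obs) + [b] * n_miss
    if sum(xa) / len(xa) <= gamma <= sum(xb) / len(xb):
        return 0.0
    return el_loglik(xa, gamma) if gamma < sum(xa) / len(xa) else el_loglik(xb, gamma)
```

A production version would use a safeguarded zero finder as suggested above; plain bisection keeps the sketch self-contained.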
The preceding discussion can be easily extended to the two-sample situation as in Section 6, with the parameter space for γ = (γ1, γ2) replaced by [a, b]2.
8 Application to Censored Data
Besides missing data, the proposed method can also be used to deal with censored data. Let U be a failure time and V a censoring time with distributions F and G, respectively. These are not completely observed because of censoring. Instead we observe X = U ∧ V together with a censoring indicator Δ = I(U ≤ V). As usual, suppose the scientific interest lies in the distribution function F(u) or, equivalently, the survival function 1 − F(u). (To simplify notation, the same symbols are used here to denote probability distributions and the associated distribution functions.) A common assumption for handling censored data is independent censoring, namely that U and V are independent, which cannot be validated with the observed data (Tsiatis, 1975). The available methods that allow dependent censoring typically rely on other untestable assumptions about the censoring mechanism. In contrast, the proposed method can be used to represent the observed evidence about F(u) without imposing untestable assumptions.
This problem can be cast in the framework of Section 3 by taking Y* = {(Ui, Vi) : i = 1,…, n} and Y = {(Xi, Δi) : i = 1,…, n}. Let p = P(Δ = 1) and let Fδ (respectively, Gδ) denote the conditional distribution of U (respectively, V) given Δ = δ, δ = 0, 1. It follows that F = (1 − p)F0 + pF1, G = (1 − p)G0 + pG1, and X ~ (1 − p)G0 + pF1. The distribution of Y identifies θ = (p, G0, F1), while the target parameter is
γ = F(u)
for a given time u. The compatibility relation between θ and γ can be characterized as
pF1(u) ≤ γ ≤ pF1(u) + (1 − p)G0(u)
| (9) |
(Peterson, 1976; Zhang, 2009). A nonparametric likelihood for θ based on Y is given by
L(θ) = ∏i=1n {pF1({Xi})}Δi {(1 − p)G0({Xi})}1−Δi,
| (10) |
which is maximized at θ̂ = (p̂, Ĝ0, F̂1), where p̂ = n−1∑i=1n Δi, Ĝ0 is the empirical distribution of {Xi : Δi = 0}, and F̂1 is the empirical distribution of {Xi : Δi = 1}. The corresponding profile likelihood for γ is just the maximum of (10) subject to (9). Let
N1 = ∑i=1n I(Xi ≤ u, Δi = 1) and N = ∑i=1n I(Xi ≤ u),
where I(·) is the indicator function. Obviously, l̃ attains its maximum value, 0, at γ ∈ [N1/n, N/n]. For the other values of γ, it is easy to see that
l̃(γ) = N1 log(nγ/N1) + (n − N1) log{n(1 − γ)/(n − N1)} for γ < N1/n,
l̃(γ) = N log(nγ/N) + (n − N) log{n(1 − γ)/(n − N)} for γ > N/n.
If no censoring has occurred by time u (i.e., N1 = N), then l̃ is simply the binomial log-likelihood based on N successes out of n trials. This can again be extended to the two-sample situation along the lines of Section 6.
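The piecewise form of l̃ is straightforward to compute from the pairs (Xi, Δi): it is flat at 0 on [N1/n, N/n] and follows a standardized binomial profile in each tail. A minimal Python sketch, using hypothetical toy data and the convention 0·log 0 = 0 for empty counts:

```python
import math

def censored_profile_loglik(x, delta, u, gamma):
    """Profile log-likelihood for gamma = F(u) from censored data (x_i, delta_i),
    without assuming independent censoring. Flat (= 0) on [N1/n, N/n];
    standardized binomial-profile tails outside that interval."""
    n = len(x)
    n1 = sum(1 for xi, di in zip(x, delta) if xi <= u and di == 1)  # failures by u
    nu = sum(1 for xi in x if xi <= u)                              # observations by u
    if n1 / n <= gamma <= nu / n:
        return 0.0  # gamma compatible with the data without any extra assumption
    k = n1 if gamma < n1 / n else nu  # which identification bound is violated

    def term(c, q):
        # contribution c * log(n q / c), with the convention 0 * log(0) = 0
        return 0.0 if c == 0 else c * math.log(n * q / c)

    return term(k, gamma) + term(n - k, 1.0 - gamma)
```

A likelihood for the survival probability 1 − F(u), as plotted in Figure 4, follows by the reparameterization γ ↦ 1 − γ.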
To illustrate this approach, let us consider a clinical study of patients with renal insufficiency, initially reported by Nahman et al. (1992). The outcome of interest in this study was the time (in months) from catheter placement to first exit-site infection, which could have been censored due to catheter failure. Figure 4 displays the profile likelihood for the probability of staying infection-free for 6 months, labelled as the 6M survival probability for brevity, in the subgroup of patients who had their catheters placed surgically.
Figure 4.
Profile likelihood for the 6-month survival probability in patients with surgically placed catheters in the study of Nahman et al. (1992).
9 Concluding Remarks
This paper presents a general profile likelihood method to represent statistical evidence with incomplete data. In contrast to standard methods for incomplete data, the proposed method does not require untestable assumptions that help identify the parameter, and provides an objective representation of evidence in the observed data alone. This is not to suggest that untestable identifying assumptions should not be made. Where appropriate, such assumptions can lead to sharpened inference and improved efficiency, as we have seen in Table 1 and Figures 1 and 3. One should be mindful, however, that such assumptions cannot be validated with the data and must be based on external information. Without total confidence in any set of identifying assumptions, it may be helpful to use the proposed method to understand “what the data say”.
As we have seen in the numerical examples, the proposed method typically yields likelihood functions that appear “flatter” than those resulting from additional assumptions. The difference becomes more apparent as the incomplete observation of data becomes more severe (with a larger proportion of missing data, e.g.). These observations should not be interpreted as an indication of inefficiency or even uselessness of the proposed method. Rather, they indicate just how little information there is in the incompletely observed data and how strong the additional assumptions really are. Imposing additional assumptions has the effect of mixing external information with the data. The resulting likelihood may appear more concentrated, but it obscures the two sources of information. The proposed method can help to clarify the contributions of different sources of information.
Acknowledgements
This research was supported in part by the Intramural Research Program of the NIH, Eunice Kennedy Shriver National Institute of Child Health and Human Development. The author would like to thank two anonymous referees whose comments have improved the manuscript greatly.
References
- Beresteanu A, Molinari F. Asymptotic properties for a class of partially identified models. Econometrica. 2008;76:763–814.
- Blume JD. Likelihood methods for measuring statistical evidence. Stat. Med. 2002;21:2563–2599.
- Blume JD. How often likelihood ratios are misleading in sequential trials. Commun. Stat. Theory Meth. 2008;37:1193–1206.
- Blume JD, Su L, Olveda RM, McGarvey ST. Statistical evidence for GLM regression parameters: A robust likelihood approach. Stat. Med. 2007;26:2919–2936.
- Farrington CP, Manning G. Test statistics and sample size formulae for comparative binomial trials with null hypothesis of non-zero risk difference or non-unity relative risk. Stat. Med. 1990;9:1447–1454.
- Hacking I. Logic of Statistical Inference. Cambridge University Press; New York: 1965.
- Hollis S. A graphical sensitivity analysis for clinical trials with nonignorable missing binary outcome. Stat. Med. 2002;21:3823–3834.
- Horowitz JL, Manski CF. Censoring of outcomes and regressors due to nonresponse: identification and estimation using weights and imputations. J. Econometrics. 1998;84:37–58.
- Horowitz JL, Manski CF. Nonparametric analysis of randomized experiments with missing covariate and outcome data. J. Amer. Statist. Assoc. 2000;95:77–81.
- Imbens GW, Manski CF. Confidence intervals for partially identified parameters. Econometrica. 2004;72:1845–1857.
- Little RJA, Rubin DB. Statistical Analysis with Missing Data. 2nd ed. Wiley; New York: 2002.
- Manski CF. Partial Identification of Probability Distributions. Springer-Verlag; New York: 2003.
- Miettinen O, Nurminen M. Comparative analysis of two rates. Stat. Med. 1985;4:213–226.
- Nahman NS, Middendorf DF, Bay WH, McElligott R, Powell S, Anderson J. Modification of the percutaneous approach to peritoneal dialysis catheter placement under peritoneoscopic visualization: clinical results in 78 patients. J. Am. Soc. Nephrol. 1992;3:103–107.
- Owen AB. Empirical likelihood ratio confidence intervals for a single functional. Biometrika. 1988;75:237–249.
- Peterson AV. Bounds for a joint distribution function with fixed sub-distribution functions: applications to competing risks. Proc. Natl. Acad. Sci. USA. 1976;73:11–13.
- Royall R. Statistical Evidence: A Likelihood Paradigm. Chapman & Hall; Boca Raton, FL: 1997.
- Royall R. On the probability of observing misleading statistical evidence. J. Amer. Statist. Assoc. 2000;95:760–768.
- Royall R, Tsou T-S. Interpreting statistical evidence by using imperfect models: Robust adjusted likelihood functions. J. R. Stat. Soc. Ser. B Stat. Methodol. 2003;65:391–404.
- Rubin DB. Inference and missing data. Biometrika. 1976;63:581–592.
- Tsiatis AA. A nonidentifiability aspect of the problem of competing risks. Proc. Natl. Acad. Sci. USA. 1975;72:20–22.
- Zhang Z. Non-inferiority testing with a variable margin. Biom. J. 2006;48:948–965.
- Zhang Z. Likelihood-based confidence sets for partially identified parameters. J. Statist. Plann. Inference. 2009;139:696–710.