Abstract
In this paper, we carry out an in-depth theoretical investigation of the existence of maximum likelihood estimates for the Cox model (Cox, 1972, 1975), both in the full data setting and in the presence of missing covariate data. The main motivation for this work arises from missing data problems, where models can easily become difficult to estimate with certain missing data configurations or large missing data fractions. We establish necessary and sufficient conditions for existence of the maximum partial likelihood estimate (MPLE) for completely observed data (i.e., no missing data) settings as well as sufficient conditions for existence of the maximum likelihood estimate (MLE) for survival data with missing covariates via a profile likelihood method. Several theorems are given to establish these conditions. A real dataset from a cancer clinical trial is presented to further illustrate the proposed methodology.
Keywords: Missing at random (MAR), Monte Carlo EM algorithm, Existence of partial maximum likelihood estimate, Necessary and sufficient conditions, Partial likelihood, Proportional hazards model
1 Introduction
There is a vast literature on parameter estimation in the Cox model in the presence of missing covariates, including Schluchter and Jackson (1989), Lin and Ying (1993), Lipsitz and Ibrahim (1996, 1998, 2000), Paik (1997), Paik and Tsai (1997), Chen and Little (1999), Herring and Ibrahim (2001), Leong, Lipsitz, and Ibrahim (2001), Chen (2002), Pons (2002), Herring, Ibrahim, and Lipsitz (2002, 2004), and Chen, Ibrahim, and Shao (2006). However, there is very little literature addressing specific theoretical conditions for the existence of MLE’s of the Cox model in either the full data case or in the presence of missing covariate data, and we are not aware of any work that establishes theoretical results for existence of such estimates in the missing covariate setting. This is what we set out to do in this paper. Specifically, we provide necessary and sufficient conditions for existence of the Maximum Partial Likelihood Estimate (MPLE) with no missing data as well as sufficient conditions for existence of the Maximum Likelihood Estimate (MLE) with Missing at Random (MAR) covariate data via the profile likelihood method. The methodology proposed here is quite new and will shed light on the characterizations of existence of the MPLE or MLE for the Cox model with complete data as well as with missing covariate data. The profile likelihood method for obtaining the MLE in the presence of MAR covariates is quite different from the other parametric and semiparametric approaches seen in the literature. The profile likelihood method is genuinely non-parametric in estimating the cumulative baseline hazard and does not require a semi-parametric estimate of the baseline hazard as is required in Lipsitz and Ibrahim (1998) and Herring and Ibrahim (2001).
We mention that Jacobsen (1989) establishes a necessary and sufficient condition for existence of the MPLE for the Cox model without missing covariate data, Chen, Ibrahim, and Shao (2004) consider issues in posterior propriety and characterize conditions for existence of the MLE in generalized linear models with MAR covariate data, and Huang, Chen, and Ibrahim (2005) carry out a detailed investigation of posterior propriety in generalized linear models with nonignorably missing covariate data. The methods and models considered in those papers are quite different from the Cox model setting. In the Cox model, i) we no longer have independence between the observations in the construction of the partial likelihood, that is, the complete data log-likelihood is not a sum of n independent observations, ii) the Cox regression model, and in particular, Cox’s partial likelihood, is an inherently semiparametric model, and thus a profile likelihood method considered here is quite different than the fully parametric models considered in Chen, Ibrahim, and Shao (2004) and Huang, Chen, and Ibrahim (2005), and iii) right censoring and tied observations require new theory not developed in Chen, Ibrahim, and Shao (2004) and Huang, Chen, and Ibrahim (2005). Thus, i) – iii) will require new theory for characterizing conditions for existence of the MPLE and MLE of the regression coefficients in the Cox model allowing for tied observations.
The significance of this work thus has two aspects. First, the proposed methodology will allow the data analyst to determine, for a given dataset, whether the MPLE or MLE exists before carrying out the analysis. Such a methodology is critical since it is not always clear from the computer output in an analysis whether the MPLE or MLE exists or not. Second, such conditions will be useful for determining suitable starting values for EM-type algorithms when fitting these models. Thus, the practical consequence of the proposed methodology is that we provide valuable tools for checking existence of the MPLE or MLE as well as inferential and computational tools for maximum likelihood based inference for the Cox model with or without MAR covariates.
The rest of this article is organized as follows. Section 2 presents several motivating examples. We give necessary and sufficient conditions for the existence of the MPLE with no missing data in Section 3 and give sufficient conditions for existence of the MLE in the presence of MAR covariate data in Section 4. The computational development involving the Monte Carlo EM (MCEM) algorithm is given in Section 5. Section 6 presents a detailed analysis of a lung cancer dataset to further illustrate the proposed methodology. Proofs of all theorems are given in the Appendix.
2 Motivating Examples
To fix ideas, let yi denote the minimum of the censoring time Ci and the survival time Ti, and let xi = (xi1,…,xip)′ be the p × 1 vector of covariates associated with yi for the ith subject. Denote by β = (β1,…, βp)′ the p × 1 vector of regression coefficients. Also, δi = 1{Ti = yi} is the indicator for the event for i = 1, 2, …, n, where n is the total number of observations and ℛ(t) = {i : yi ≥ t} is the set of subjects at risk at time t. Then, the partial likelihood of Cox (1975) is given by
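$$L_p(\beta \mid D_{obs}) = \prod_{i=1}^{n} \left[ \frac{\exp(x_i'\beta)}{\sum_{j \in \mathcal{R}(y_i)} \exp(x_j'\beta)} \right]^{\delta_i}, \tag{2.1}$$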
where Dobs = {(yi, δi, xi) : i = 1, 2, …, n} is the observed univariate right censored survival data. As usual, we assume throughout that xi does not include an intercept, since the intercept is not estimable in the Cox partial likelihood, and that given xi, Ti and Ci are independent. For the completely observed data Dobs, the maximum partial likelihood estimate (MPLE) is defined as β̂ = arg max Lp(β|Dobs). The asymptotic properties of β̂ have been well studied in the literature, and in fact, the MPLE can be computed via standard statistical software, such as the SAS procedure, PROC PHREG. However, it remains unclear when the MPLE exists and when it does not for a given dataset. To motivate the proposed methodology, we consider the following examples.
Example 1: A Simple Illustration
Suppose n = 3, y1 and y2 are two failure times, y3 is a right censored survival time, and we have one binary covariate x. Let x1, x2, and x3 denote the three observed values of x. Assuming y1 < y2 < y3, the partial likelihood of Cox (1975) is then given by
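$$L_p(\beta_1 \mid D_{obs}) = \frac{\exp(x_1\beta_1)}{\exp(x_1\beta_1) + \exp(x_2\beta_1) + \exp(x_3\beta_1)} \times \frac{\exp(x_2\beta_1)}{\exp(x_2\beta_1) + \exp(x_3\beta_1)},$$
where Dobs = {(yi, xi), i = 1, 2, 3}. We consider two special cases.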
Case 1
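x1 = x2 = 0 and x3 = 1. In this case, we have $L_p(\beta_1 \mid D_{obs}) = 1/\{(2 + e^{\beta_1})(1 + e^{\beta_1})\}$, which is strictly decreasing in β1. Its supremum, 1/2, is approached only as β1 → −∞ and is not attained at any finite value. Thus, the MPLE does not exist.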
Case 2
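x1 = 0, x2 = 1, and x3 = 0. In this case, we have $L_p(\beta_1 \mid D_{obs}) = e^{\beta_1}/\{(2 + e^{\beta_1})(1 + e^{\beta_1})\}$. Then, the MPLE does exist. In fact, setting the derivative of the log partial likelihood to zero gives $e^{2\beta_1} = 2$, so the MPLE of β1 is $\hat{\beta}_1 = \tfrac{1}{2}\log 2 \approx 0.347$.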
In Example 1, the partial likelihood function behaves quite differently by simply switching two observed values of the covariate: one leads to the existence of the MPLE and the other does not. Thus, a natural question is what are general if and only if conditions for the existence of the MPLE in the Cox model? From this illustrative example, we can see that this is not an easy problem to solve, as it requires an in-depth theoretical investigation to find such conditions.
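The contrast between the two cases can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function name `partial_likelihood`, the time values 1, 2, 3 (any values with y1 < y2 < y3 would do), and the grid of β1 values are illustrative assumptions. It evaluates the partial likelihood on a grid, confirms that it is strictly decreasing in Case 1, and locates the grid maximizer in Case 2, which should lie close to (1/2) log 2.

```python
import math

def partial_likelihood(beta, y, delta, x):
    """Cox partial likelihood for one scalar covariate, assuming no tied times."""
    value = 1.0
    n = len(y)
    for i in range(n):
        if delta[i] == 1:
            # Risk set at y[i]: subjects whose observed time is at least y[i].
            denom = sum(math.exp(x[j] * beta) for j in range(n) if y[j] >= y[i])
            value *= math.exp(x[i] * beta) / denom
    return value

# Hypothetical times with y1 < y2 < y3; subjects 1 and 2 fail, subject 3 is censored.
y = [1.0, 2.0, 3.0]
delta = [1, 1, 0]
case1 = [0, 0, 1]
case2 = [0, 1, 0]

grid = [k / 100.0 for k in range(-1000, 1001)]  # beta1 from -10 to 10

# Case 1: the likelihood keeps increasing as beta1 decreases, so no finite maximizer.
case1_values = [partial_likelihood(b, y, delta, case1) for b in grid]
case1_decreasing = all(a > b for a, b in zip(case1_values, case1_values[1:]))

# Case 2: the grid maximizer sits near log(2)/2.
case2_argmax = max(grid, key=lambda b: partial_likelihood(b, y, delta, case2))

print(case1_decreasing, case2_argmax, 0.5 * math.log(2))
```

A grid search is used only to keep the sketch dependency-free; it illustrates the behavior but does not replace the existence conditions developed in Section 3.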
Example 2: Prostate Cancer Data
We consider data consisting of n = 550 men who were treated with radiation therapy with six months of short-course androgen suppression therapy for localized prostate cancer with at least one adverse risk factor (prostate-specific antigen [PSA] > 10 ng/mL, biopsy Gleason score 7 to 10, or 2002 American Joint Commission on Cancer (AJCC) clinical tumor category T2b or T2c) between 1989 and 2002. The outcome variable (yi) in years was time to prostate cancer death, which is continuous and subject to right censoring, and δi = pfail denotes the censoring indicator which equals 1 if the ith subject died due to prostate cancer, and 0 otherwise. The goal of this study was to determine whether the number of risk factors present was associated with time to prostate cancer death (Tsai et al., 2006).
Define A = I {PSA > 10}, B = I {Gleason ≥ 7}, and C = I {T2b or T2c}. We consider five covariates: AB, AC, BC, ABC, and age. There are no missing values in this data set. A Cox proportional hazards model was fitted to this data set. The following outputs were produced by SAS Procedure PHREG:
Variable | DF | Parameter Estimate | Standard Error | Chi-Square | Pr > ChiSq |
---|---|---|---|---|---|
AB | 1 | 0.39759 | 1.23355 | 0.1039 | 0.7472 |
AC | 1 | −14.30314 | 2107 | 0.0000 | 0.9946 |
BC | 1 | 0.59060 | 1.22714 | 0.2316 | 0.6303 |
ABC | 1 | 2.22155 | 0.80450 | 7.6253 | 0.0058 |
age | 1 | 0.02262 | 0.04821 | 0.2201 | 0.6390 |
From the above results, we see that although SAS Procedure PHREG does produce estimates for all five covariates, there is clearly an identifiability problem with the covariate AC, as its estimate is large in magnitude and its standard error is huge compared to those of all other covariates. Now, the question is: do the MPLEs really exist in this Cox model?
Example 3: Small Cell Lung Cancer Data
We consider data from a phase III advanced non-small-cell lung cancer (SCLC) clinical trial conducted by the University of North Carolina at Chapel Hill (LCCC 9719). The results of this study have been published in Socinski et al. (2002). The goal of this trial was to compare a defined duration of therapy (A) to continuous therapy followed by second line therapy (B) in order to determine optimal duration of therapy in SCLC patients. LCCC 9719 had n = 230 patients. We consider here five prognostic factors: x1 = treatment (2 arms: A and B, coded as 1 and 0), x2 = gender (female and male, coded as 0 and 1), x3 = age in years, x4 = highest grade toxicity (recorded by cycle) (2 levels: 0 versus > 0, coded as 0 and 1), and x5 = quality of life (QOL) FACTG score. For these five prognostic factors, x4 and x5 had missing information and x1, x2, and x3 were completely observed for all cases. In this dataset, there is a total missing covariate data fraction of 51.74% (119 of 230 patients) on these two covariates. The outcome variable (yi in months) is time to progression, which is continuous and subject to right censoring, and δi denotes the censoring indicator which equals 1 if the ith subject had disease progression, and 0 otherwise. The median follow up time is 3.94 months and the range of the follow up time is (0.10, 12.26) months. There are d = 102 distinct progression times and ties are present in the dataset. A summary of the dataset is given in Table 1. In the presence of missing covariates, a joint probability distribution must be specified for the progression time and the missing covariates, and a profile likelihood method is hence proposed for obtaining the MLE in Section 4, as a partial likelihood approach in this context may not be as desirable.
Table 1.
Variable | Category or statistic | Value |
---|---|---|
Completely observed variables | | |
x1 (frequency) | A | 114 |
x1 (frequency) | B | 116 |
x2 (frequency) | Male | 144 |
x2 (frequency) | Female | 86 |
x3 (years) | mean | 62.24 |
x3 (years) | std dev | 10.17 |
y (frequency) | censored | 83 |
y (frequency) | relapsed | 147 |
Missing covariates | | |
x4 (frequency) | 0 | 155 |
x4 (frequency) | 1 | 10 |
x4 (frequency) | missing | 65 |
x5 (QOL score) | mean | 78.14 |
x5 (QOL score) | std dev | 15.31 |
x5 (frequency) | missing | 81 |
both x4 and x5 (frequency) | missing | 27 |
one of x4 or x5 (frequency) | missing | 119 |
3 Existence of the MPLE With No Missing Data
In this section, we characterize very general conditions for the existence of the MPLE of β for a given dataset Dobs under the Cox model with no missing covariate data. Define X* to be
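$$X^* = \big(\delta_i (x_j - x_i)',\; j \in \mathcal{R}(y_i),\; 1 \le i \le n\big)'. \tag{3.1}$$
Let ki denote the number of subjects in ℛ(yi) for i = 1, 2,…, n. Also let $K = \sum_{i=1}^{n} k_i$. Then, X* is a K × p matrix. Using X*, we are led to the following theorem.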
Theorem 3.1
The MPLE of β in (2.1) exists if the following conditions are satisfied:
(C1) X* is of full rank p; and
(C2) There exists a positive vector v, i.e., each component of v is positive, such that
$$(X^*)' v = 0. \tag{3.2}$$
In addition, if (C1) is satisfied, then (C2) is a necessary condition for the existence of MPLE for β.
The proof of Theorem 3.1 is given in the Appendix.
Remark 3.1
In X* defined by (3.1), the rows corresponding to δi = 0 or xj = xi are zero and can be excluded. Thus, the effective number of rows in X* can be reduced substantially. Specifically, let $k_i^* = \delta_i \sum_{j \in \mathcal{R}(y_i)} 1\{x_j \neq x_i\}$, where the indicator function 1{xj ≠ xi} = 1 if xj ≠ xi and 0 otherwise. Then, the effective number of rows in X* is given by $K^* = \sum_{i=1}^{n} k_i^*$.
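The construction of X* and the row reduction described in Remark 3.1 can be sketched as follows. This is an illustrative sketch only: it assumes that X* stacks the rows δi(xj − xi)′ for j in the risk set ℛ(yi), ordered by i and then by j, which is the form consistent with the worked values in Example 1 (revisited); the function name `build_x_star` and its `effective_only` flag are hypothetical.

```python
def build_x_star(y, delta, X, effective_only=False):
    """Stack rows delta_i * (x_j - x_i) for j in the risk set of y_i (i, then j).

    y: observed times, delta: event indicators, X: list of covariate rows.
    With effective_only=True, rows with delta_i = 0 or x_j = x_i are dropped.
    """
    n, p = len(y), len(X[0])
    rows = []
    for i in range(n):
        for j in range(n):
            if y[j] < y[i]:
                continue  # subject j is no longer at risk at time y_i
            if effective_only and (delta[i] == 0 or X[j] == X[i]):
                continue  # this row would be identically zero
            rows.append([delta[i] * (X[j][k] - X[i][k]) for k in range(p)])
    return rows

# Case 1 of Example 1: y1 < y2 < y3, delta = (1, 1, 0), x = (0, 0, 1).
y, delta = [1.0, 2.0, 3.0], [1, 1, 0]
X = [[0], [0], [1]]

full = build_x_star(y, delta, X)
effective = build_x_star(y, delta, X, effective_only=True)
print(full)            # the K = 6 rows of X*
print(len(effective))  # the effective number of rows K*
```

For large n the full matrix has on the order of n² rows, so in practice only the effective rows would be stored.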
Remark 3.2
When ties are present, as discussed in Klein and Moeschberger (2003, Chapter 8), the partial likelihood may be defined as
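$$L_p(\beta \mid D_{obs}) = \prod_{i=1}^{d} \frac{\exp(z_i'\beta)}{\left[\sum_{j \in \mathcal{R}(y_i)} \exp(x_j'\beta)\right]^{d_i}}, \tag{3.3}$$
where y1 < y2 < ··· < yd denote the d distinct event times, $z_i = \sum_{j \in \mathcal{D}_i} x_j$, di = the number of events at yi, and $\mathcal{D}_i$ is the set of all individuals who have the event at time yi. We can thus rewrite (3.3) as
$$L_p(\beta \mid D_{obs}) = \prod_{i=1}^{d} \prod_{l \in \mathcal{D}_i} \frac{\exp(x_l'\beta)}{\sum_{j \in \mathcal{R}(y_i)} \exp(x_j'\beta)},$$
and Theorem 3.1 can be easily extended to the cases when ties are present. Note that the partial likelihood given by (3.3) is the likelihood of Breslow (1974), and the Breslow likelihood is the default choice in SAS to handle ties in the failure times.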
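To make the treatment of ties concrete, the following sketch evaluates the log of the Breslow partial likelihood directly from its definition: at each distinct event time, the summed linear predictors of the subjects failing at that time enter the numerator, and the risk-set sum is raised to the number of events. The function name and the small dataset are hypothetical and serve only as a check of the formula.

```python
import math

def breslow_log_partial_likelihood(beta, y, delta, X):
    """Log Breslow partial likelihood; ties in the event times are allowed."""
    n = len(y)

    def linear_predictor(i):
        return sum(b * xv for b, xv in zip(beta, X[i]))

    loglik = 0.0
    for t in sorted({y[i] for i in range(n) if delta[i] == 1}):
        events = [i for i in range(n) if y[i] == t and delta[i] == 1]  # D_i
        at_risk = [j for j in range(n) if y[j] >= t]                   # R(y_i)
        # z_i' beta: linear predictors summed over the d_i tied events.
        loglik += sum(linear_predictor(i) for i in events)
        # d_i * log of the risk-set sum.
        risk_sum = sum(math.exp(linear_predictor(j)) for j in at_risk)
        loglik -= len(events) * math.log(risk_sum)
    return loglik

# Hypothetical data: two subjects fail at time 1 (a tie), one at time 2, one censored.
y = [1.0, 1.0, 2.0, 3.0]
delta = [1, 1, 1, 0]
X = [[0.0], [1.0], [1.0], [0.0]]

value = breslow_log_partial_likelihood([0.5], y, delta, X)
print(value)
```

When all event times are distinct, each set of tied events has size one and the function reduces to the log of (2.1).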
Remark 3.3
Suppose y1 ≤ y2 ≤ ··· ≤ yn. Then, from condition (C2), it is easy to observe that if there exists a j such that x1j ≤ x2j ≤ ··· ≤ xnj, the MPLE of β does not exist. Also, when one of the components of xi, say, xij, is binary and the xij’s take the same value for δi = 1 or the xij’s take the same value for δi = 0, then the MPLE of β does not exist.
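A quick screening check in the spirit of Remark 3.3 can be sketched as follows. It relies on one observation, stated here as the assumption behind the sketch: if, for some covariate, the differences xj − xi over all pairs with δi = 1 and j in the risk set ℛ(yi) never change sign, then the corresponding column of X* is either identically zero, so that (C1) fails, or has a nonzero inner product with every positive v, so that (C2) fails. The monotone pattern in Remark 3.3 and a binary covariate that is constant across all events are both special cases. The function name and the toy data are hypothetical, and the check is a screen only: passing it does not establish existence.

```python
def one_signed_columns(y, delta, X):
    """Return indices k for which x_jk - x_ik never changes sign over all pairs
    (i, j) with delta_i = 1 and j in the risk set of y_i."""
    n, p = len(y), len(X[0])
    flagged = []
    for k in range(p):
        diffs = [X[j][k] - X[i][k]
                 for i in range(n) if delta[i] == 1
                 for j in range(n) if y[j] >= y[i]]
        if all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs):
            flagged.append(k)
    return flagged

# Hypothetical data: covariate 0 is binary and equals 0 for every event;
# covariate 1 is continuous with no monotone pattern.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
delta = [1, 0, 1, 0, 1]
X = [[0, 2.0], [1, 1.0], [0, 3.0], [1, 0.5], [0, 1.5]]

print(one_signed_columns(y, delta, X))  # covariate 0 is flagged

# The two cases of Example 1.
case1_flags = one_signed_columns([1.0, 2.0, 3.0], [1, 1, 0], [[0], [0], [1]])
case2_flags = one_signed_columns([1.0, 2.0, 3.0], [1, 1, 0], [[0], [1], [0]])
```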
Remark 3.4
When conditions (C1) and (C2) are satisfied for a subset of the data, the MPLE still does exist. To see this, we assume that the subset consists of the first n* observations. Then we have
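$$L_p(\beta \mid D_{obs}) \le \prod_{i=1}^{n^*} \left[ \frac{\exp(x_i'\beta)}{\sum_{j \in \mathcal{R}(y_i),\, j \le n^*} \exp(x_j'\beta)} \right]^{\delta_i},$$
since each factor of (2.1) is at most 1 and removing terms from a denominator can only increase the corresponding factor. The existence of the MPLE can be obtained by simply applying Theorem 3.1 to the above upper bound. These subset conditions are only sufficient but not necessary. However, this result is particularly useful for large datasets, for which checking conditions (C1) and (C2) may not be computationally feasible.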
Remark 3.5
Jacobsen (1989) also characterizes a necessary and sufficient condition for the existence of the MPLE. His condition can be stated as follows: there is no a ∈ Rp such that a′δi(xj − xi) ≥ 0 for j ∈ ℛ(yi) and 1 ≤ i ≤ n. According to Lemma A.1, we can see that Jacobsen’s condition implies (C2). Thus, (C2) is necessary for existence and for uniqueness. We note that the conditions stated in Theorem 3.1 are sufficient. However, compared to Jacobsen’s condition, the conditions (C1) and (C2) given in Theorem 3.1 are easier to check. First, it is straightforward to check condition (C1) that X* has full column rank. Second, as discussed in Appendix A of Roy and Hobert (2007), condition (C2) can be checked with a simple linear program using the ‘simplex’ function from the ‘boot’ library in the R programming language.
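The linear programming check can also be sketched in Python. The sketch below is not the R code referred to above; it is an analogous formulation that assumes NumPy and SciPy are available and uses `scipy.optimize.linprog`. The formulation itself is an assumption of this sketch: maximize a slack t subject to (X*)′v = 0, vk ≥ t, and 0 ≤ vk ≤ 1. Because the condition (X*)′v = 0 is unchanged when v is rescaled, a positive solution exists exactly when the optimal t is positive. The function name `check_c1_c2` and the tolerance are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def check_c1_c2(x_star, tol=1e-7):
    """Return (C1 holds, C2 holds) for a K x p matrix X* given as nested lists."""
    A = np.asarray(x_star, dtype=float)
    K, p = A.shape

    # (C1): X* has full column rank p.
    c1 = bool(np.linalg.matrix_rank(A) == p)

    # (C2): variables are (v_1, ..., v_K, t); linprog minimizes, so the cost is -t.
    cost = np.zeros(K + 1)
    cost[-1] = -1.0
    A_eq = np.hstack([A.T, np.zeros((p, 1))])         # (X*)' v = 0
    b_eq = np.zeros(p)
    A_ub = np.hstack([-np.eye(K), np.ones((K, 1))])   # t - v_k <= 0
    b_ub = np.zeros(K)
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, 1.0)] * (K + 1), method="highs")
    c2 = bool(res.status == 0 and res.x[-1] > tol)
    return c1, c2

# The two cases of Example 1, using the X* values reported in the text.
print(check_c1_c2([[0], [0], [1], [0], [1], [0]]))    # Case 1
print(check_c1_c2([[0], [1], [0], [0], [-1], [0]]))   # Case 2
```

Zero rows of X* do not affect the equality constraints, so the check can equally be run on the effective rows only, which keeps the linear program small.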
Example 1: A Simple Illustration (revisited)
Recall that in Example 1, we have n = 3, y1 < y2 < y3, δ1 = δ2 = 1, and δ3 = 0. For Case 1 in which x1 = x2 = 0 and x3 = 1, we have k1 = 3, k2 = 2, k3 = 1, and K = 6. Thus, using (3.1), X* = (0, 0, 1, 0, 1, 0)′, which is a 6 × 1 matrix. After excluding the rows corresponding to δi = 0 or xj = xi, the effective number of rows in X* is K* = 2. It is easy to see that X* is of full rank, which is 1, so condition (C1) holds. However, for any v = (v1, v2, v3, v4, v5, v6)′ such that vi > 0, (X*)′v = v3 + v5 > 0, so condition (C2) fails. Since (C2) is necessary when (C1) holds, the MPLE does not exist by Theorem 3.1.
For Case 2, where x1 = 0, x2 = 1, and x3 = 0, using (3.1), we have (X*)′ = (0, 1, 0, 0, −1, 0). Taking v = (v1, v2, v3, v4, v2, v5)′ with vj > 0 for j = 1, 2, …, 5, we obtain (X*)′v = v2 − v2 = 0, so condition (C2) holds. Obviously, X* is of full rank. Thus, the MPLE does exist by Theorem 3.1.
In general, we have X* = (0, x2 − x1, x3 − x1, 0, x3 − x2, 0)′ and (3.2) reduces to
$$v_2 (x_2 - x_1) + v_3 (x_3 - x_1) + v_5 (x_3 - x_2) = 0. \tag{3.4}$$
Condition (C1) requires that at least two of x1, x2, and x3 are different. If x1 = x2 and condition (C1) holds, then there is no positive solution v to (3.4) regardless of the value of x3. Thus, the MPLE never exists when x1 = x2. However, if x1 < x2, then the MPLE exists if x3 < x2 and does not exist if x3 ≥ x2. Similarly, if x2 < x1, the MPLE exists if x3 > x2 and does not exist if x3 ≤ x2. One interesting observation is that even if x3 < x1 < x2, the MPLE still exists although x3, for which δ3 = 0, is distinct from {x1, x2} in the sense that δ1 = δ2 = 1. Thus, the condition for existence of the MPLE cannot be characterized by the value of δi alone by fitting, for example, a binary regression model to δi while treating (1, xi)′ as a vector of covariates.
Example 2: Prostate Cancer Data (revisited)
After we further examined the data, we found that
Variable | δ= 0 | δ= 1 | Total |
---|---|---|---|
Only one of A, B, C | 253 | 2 | 255 |
Only AB not C | 116 | 1 | 117 |
Only AC not B | 35 | 0 | 35 |
Only BC not A | 64 | 1 | 65 |
ABC | 71 | 7 | 78 |
| |||
Total | 539 | 11 | 550 |
From the above table, “only AC not B” is the only group in which there are no events. This explains why we obtained the unusual estimate and standard error for the regression coefficient corresponding to AC. From Remark 3.3, it becomes apparent that the MPLEs do not exist for this dataset if we fit the five covariates in the Cox model. One way to fix this problem is to combine AB, AC, and BC as one variable, which was called the two-factors only variable in Tsai et al. (2006).
4 Profile Maximum Likelihood Estimation in the Presence of Missing Covariates
When there are missing covariates, we assume that the distribution of the censoring time Ci does not depend on the missing covariates and the missingness is MAR. In this case, we cannot directly use the Cox partial likelihood since we need to model the failure time and the covariates jointly. Thus, instead of the partial likelihood approach, we use a profile likelihood approach when we have MAR covariates.
For notational simplicity, we assume that all failure times are distinct and let y1, y2, …, yd be d distinct failure times. Let h0(y) ≥ 0 denote the baseline hazard function and also let $H_0(y) = \int_0^y h_0(u)\,du$ denote the baseline cumulative hazard function. Let and Dobs = (yi, δi, xi,obs, i = 1,2, …, n). Also let D = (yi, δi, xi,obs, xi,mis, i = 1, 2,…, n) denote the complete data. In addition, let ri = (ri1, ri2, …, rip)′ be the vector of the p missing covariate indicators such that ril = 0 when xil is missing and ril = 1 when xil is observed for i = 1, 2, ···, n and l = 1, 2, ···, p. Since we assume ignorable missingness in the covariates (i.e., MAR covariates and the parameters of the missing data mechanism are distinct from the sampling model), we do not need to model ri. Also, we assume that the parameters of the distributions for the censoring times Ci’s are distinct from the sampling model. Thus, for ignorably missing covariates, ignoring the parts adhering to censoring and the missing data mechanism, the observed data likelihood function based on the Cox model (Cox, 1972) is given by
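$$L(\beta, h_0, \alpha \mid D_{obs}) = \int \prod_{i=1}^{n} \left[h_0(y_i)\exp(x_i'\beta)\right]^{\delta_i} \exp\{-H_0(y_i)\exp(x_i'\beta)\}\, f(x_{i,mis}, x_{i,obs} \mid \alpha)\, dx_{mis}, \tag{4.1}$$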
where xmis = (xi,mis, i = 1, 2, …, n), f(xi,mis, xi,obs|α) denotes the joint distribution of xi, and α is the vector of parameters for the covariate distribution.
It is well known that the partial likelihood can be expressed as a profile likelihood (Johansen, 1983) by substituting a nonparametric maximum likelihood estimator for the cumulative baseline hazard function H0(y), which is a function of the fixed coefficients β, and that this nonparametric maximum likelihood estimator is necessarily a pure-jump estimator with jumps precisely at the observed event times. Following the profile likelihood approach (see, for example, Klein and Moeschberger (2003, Chapter 8)), we have
(4.2) |
We note that in (4.2), the function
is maximized when h0(yj) = 0 except for the times at which events occur. Thus, the MLE of(β,α) exists if the upper bound in the right-hand side of (4.2) goes to zero when . Write
(4.3) |
The following theorem characterizes the conditions for existence of the MLE of (β, h0, α) when the xij’s are bounded.
Theorem 4.1
If the xij’s are bounded, i.e., ai ≤ xij ≤ bi, define X** to be , j ∈ℛ(yi), δj = 0, 1 ≤ i ≤ n)′, where and each component of is equal to either or for all i. Then, the MLE of (β, h0, α) in (4.1) exists if the following conditions are satisfied: (C1*) lim||α||→∞ L(α|Dobs) = 0; (C2*) X** is of full rank; and (C3*) there exists a positive vector v such that X**′v = 0.
The proof of Theorem 4.1 is given in the Appendix. The main intuition behind Theorem 4.1 is that when the MLE exists under conditions (C2*) and (C3*) for the most extreme possible values of the missing covariates, then the MLE also exists for any intermediate values of the missing covariates, and averaging over the missing values will not affect the existence of the MLE. In Theorem 4.1, the elements of the matrix X* corresponding to the missing covariates are “filled-in” by either or , where and are in fact the two possible extreme values of the missing covariates when the xij’s are bounded.
The next theorem gives sufficient conditions for existence of the MLE of (β, h0, α) when the xij’s are unbounded.
Theorem 4.2
If the xij’s are unbounded, the MLE of (β, h0, α) in (4.1) exists if condition (C1*) in Theorem 4.1 and conditions (C1) and (C2) in Theorem 3.1 are satisfied for the completely observed cases.
The proof of Theorem 4.2 is given in the Appendix. Theorem 4.2 is practically useful as the conditions stated in this theorem are easier to check than those given in Theorem 4.1. We note that in Theorem 4.2, we are not doing a complete case analysis. Instead, we use the subset of the data consisting of the completely observed cases only to establish the sufficient conditions for the existence of the MLE when the missing covariates are unbounded.
Remark 4.1
Assume that the maximum number of missing components of xi, i = 1, …, n, is pi. Then, to verify the conditions given in Theorem 4.1, we need to check only the conditions (C2*) and (C3*) for at most $2^{p_i}$ possible X**’s.
Remark 4.2
When there are no missing covariates, it is easy to observe that the profile maximum likelihood estimate of β reduces to the MPLE, while the profile maximum likelihood estimate of α is the MLE.
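The reduction in Remark 4.2 can be illustrated numerically. The sketch below assumes distinct event times, a scalar covariate, and a baseline hazard that jumps only at event times. For fixed β, the jump that maximizes the full-data likelihood at an event time yi is the reciprocal of the risk-set sum Σj∈ℛ(yi) exp(xjβ), which is the standard Breslow-type profile estimator. Substituting these jumps gives a profile log-likelihood equal to the log partial likelihood minus the number of events, so the two functions share the same maximizer in β. All function names and data values are hypothetical.

```python
import math

def log_partial_likelihood(beta, y, delta, x):
    """Log of the Cox partial likelihood (2.1) for a scalar covariate, no ties."""
    n = len(y)
    total = 0.0
    for i in range(n):
        if delta[i] == 1:
            risk_sum = sum(math.exp(beta * x[j]) for j in range(n) if y[j] >= y[i])
            total += beta * x[i] - math.log(risk_sum)
    return total

def profiled_full_loglik(beta, y, delta, x):
    """Full-data log-likelihood with h0 replaced by its maximizer for fixed beta."""
    n = len(y)
    # Maximizing jump of the baseline hazard at each event time.
    jump = {i: 1.0 / sum(math.exp(beta * x[j]) for j in range(n) if y[j] >= y[i])
            for i in range(n) if delta[i] == 1}
    total = 0.0
    for i in range(n):
        # H0(y_i): sum of the jumps at event times up to and including y_i.
        H0 = sum(h for l, h in jump.items() if y[l] <= y[i])
        if delta[i] == 1:
            total += math.log(jump[i]) + beta * x[i]
        total -= H0 * math.exp(beta * x[i])
    return total

# Hypothetical data with three events and one censored observation.
y = [1.0, 2.0, 3.0, 4.0]
delta = [1, 1, 0, 1]
x = [0.0, 1.0, 0.0, 2.0]

beta = 0.3
gap = profiled_full_loglik(beta, y, delta, x) - log_partial_likelihood(beta, y, delta, x)
print(gap)  # equals minus the number of events, whatever the value of beta
```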
Remark 4.3
Ibrahim, Lipsitz and Chen (1999) and Chen and Ibrahim (2001) provide a comprehensive set of guidelines for specifying the joint distribution of the covariate vector xi through a series of one dimensional conditional distributions. Condition (C1*) stated in Theorem 4.1 holds for many covariate distributions considered in Ibrahim, Lipsitz and Chen (1999) and Chen and Ibrahim (2001).
Remark 4.4
When there are ties in the event times, similar to Remark 3.2, the upper bound given in (4.2) can be modified as
where K > 0 is independent of β, α, and xi, and zi and di are defined in (3.3). Thus, all the theory developed in this subsection is still valid in the presence of ties.
Next, we consider an interesting special case where each missing component of xi is discrete and bounded.
Corollary 4.1
Assume that each missing component of xi is discrete and bounded. Then condition (C3*) given in Theorem 4.1 is also necessary for the existence of the MLE for (β, h0) if condition (C2*) is satisfied.
The proof of Corollary 4.1 directly follows from the fact that when each missing component of xi is discrete and bounded, we have
Thus, details of the proof are omitted for brevity.
5 Computational Development
When there are no missing covariates, computing the MPLE of β is straightforward and, in fact, the MPLE can be computed via standard statistical software, such as the SAS procedure, PROC PHREG. In the presence of missing covariates, the EM algorithm is required. Martinussen (1999) proposes an efficient EM algorithm for computing the MLE and its standard error in the presence of discrete missing covariates. When xi,mis is continuous or mixed continuous and categorical, we need to develop a Monte Carlo EM (MCEM) algorithm, which is an extension of Martinussen’s algorithm for computing the MLE’s of β, h0, and α as well as their standard errors.
To implement the MCEM algorithm, let γ = (β, h0, α). Let γ(t) denote the parameter estimate of γ at the tth EM iteration. In the E-step, we take an MCMC sample of size , from
for i = 1, 2, …, n. Note that this conditional distribution is log-concave as long as f(xi,mis, xi,obs |α(t)) is log-concave in each component of xi,mis. We then compute
(5.1) |
where and H0(yj) = Σyl≤yj,δl=1 h0(yl). In the M-step, we compute
(5.2) |
(5.3) |
and
Following Booth and Hobert (1999), in the MCEM algorithm, we take m(t+1) = m(t) + Δm, where Δm > 0. With this dynamic MCMC sample size m(t), the MCEM algorithm requires much less computational time. Also a large m(t) is not needed in early iterations of the algorithm since γ(t) is still far from the MLE γ̂ and the algorithm is not near convergence. As t increases m(t) increases, and a more computationally accurate estimate of Q(γ|γ (t)) is obtained in the E-step.
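The effect of the increasing sample size schedule can be illustrated with simple arithmetic. The sketch below uses the values m(0) = 500 and Δm = 50 and the 25 iterations reported for the analysis in Section 6; the function name is hypothetical, and the comparison with a fixed sample size is an illustration of the computational saving rather than a result from the paper.

```python
def mcem_sample_sizes(m0, delta_m, n_iterations):
    """Return [m(0), m(1), ..., m(n_iterations)] with m(t+1) = m(t) + delta_m."""
    sizes = [m0]
    for _ in range(n_iterations):
        sizes.append(sizes[-1] + delta_m)
    return sizes

sizes = mcem_sample_sizes(m0=500, delta_m=50, n_iterations=25)

# Draws per subject over iterations t = 0, ..., 24 with the increasing schedule,
# compared with using the final sample size at every one of those iterations.
dynamic_draws = sum(sizes[:25])
fixed_draws = 25 * sizes[-1]

print(sizes[-1], dynamic_draws, fixed_draws)
```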
When xi,mis is categorical, the E-step at the (t + 1)st iteration reduces to the EM by the Method of Weights (Ibrahim, 1990). With the EM by the Method of Weights, a similar M-step can be developed. We refer to Ibrahim (1990) and Martinussen (1999) for the detailed development of the EM algorithm in this case. It is easy to see from (5.2) that when there are no missing covariates, β(t+1) is the MPLE of β, which is consistent with Remark 4.2.
Let γ̂ denote the estimate of γ at EM convergence. Using Louis’s method (Louis, 1982), the estimated observed information matrix of γ based on the observed data is not difficult to compute. Note that the complete-data likelihood function can be written as
(5.4) |
Thus, the log-likelihood function for the ith observation is given by
(5.5) |
Write the gradient vector of Q(γ|γ(t)) as
and write the matrix of second derivatives of Q(γ|γ(t)) as
In addition, write the complete data score vector as
Then, the estimated observed information matrix of γ̂ is given by
(5.6) |
where is an MCMC sample of size , from f(xi,mis|xi,obs, γ̂), and . Thus, the estimate of the asymptotic covariance matrix of γ̂ is [ℐ(γ̂)]−1.
Finally, we note that when there are ties in the failure times, (5.1), (5.2), and (5.3) can be modified as
(5.7) |
where ,
(5.8) |
and
(5.9) |
The calculation of ℐ(γ̂) needs to be modified accordingly in the presence of ties. Again, the above formulation can be easily extended to the case where xi,mis is categorical.
6 Analysis of Small Cell Lung Cancer Data
For the LCCC 9719 data discussed in Section 2, we use the proposed methods to estimate the regression coefficients assuming the missing covariates are MAR. We consider a Cox regression model for [yi |xi, β, h0] allowing for right censoring. Thus, we have
where xi = (xi1,…, xi5)′ is a 5 × 1 vector of covariates, i = 1, 2, …, n, β = (β1,…,β5)′ is the vector of the corresponding regression coefficients, and h0(yi) and H0(yi) denote the baseline hazard function and the cumulative baseline hazard function, respectively. Since xi1, xi2, and xi3 are always observed, they do not need to be modeled. Thus, we only need to model two missing covariates (x4, x5) conditioning on the completely observed covariates throughout. We consider two models: [x4|x1, x2, x3][x5|x1, x2, x3, x4] and [x4|x1, x2, x3, x5][x5|x1, x2, x3] for (x4, x5). We use a logistic regression model for xi4 and a normal linear regression model for xi5. Specifically, for example, for [x4|x1, x2, x3][x5|x1, x2, x3, x4], we have
where α4 = (α40, α41, α42, α43)′, and
where α5 = (α50, α51, …, α55)′.
To illustrate how to apply the theorems presented in Sections 3 and 4, we consider a subset of the LCCC 9719 data, which is given in Table 2. Since all of the covariates are observed in this subset, using (3.1) after excluding the rows corresponding to δi = 0 or xj = xi, X* is a 35 × 5 matrix. The first 8 rows are given by $(x_{i+1} - x_1)'$ for i = 1, 2, …, 8, and the last row is given by $(x_9 - x_7)'$. Using Maple (Version 8) linsolve, with Maple code “linsolve(X*, v);”, after loading a linalg package, we obtain a closed form solution for X*′v = 0 and find that there indeed exists a positive vector v > 0 satisfying X*′v = 0. Also, |X*′X*| = $9.2344 \times 10^{10}$ > 0. Thus, conditions (C1) and (C2) given in Theorem 3.1 are met for this subset. As discussed in Remark 3.4, when the conditions (C1) and (C2) are satisfied for a subset of the data, these two conditions hold for the entire set of completely observed cases. In addition, we can show that lim||α||→∞ L(α|Dobs) = 0, where L(α|Dobs) is defined by (4.3), using the results established in Chen and Shao (2001) and hence, details are omitted here for brevity. Thus, based on Theorem 4.1, the MLE does exist for the entire dataset.
Table 2.
Obs (i) | yi | δi | xi1 | xi2 | xi3 | xi4 | xi5 |
---|---|---|---|---|---|---|---|
1 | 0.394 | 1 | 1 | 0 | 68 | 0 | 54 |
2 | 1.083 | 1 | 0 | 0 | 81 | 0 | 79 |
3 | 1.116 | 1 | 1 | 1 | 82 | 0 | 64 |
4 | 1.149 | 1 | 0 | 1 | 58 | 1 | 86 |
5 | 1.313 | 1 | 1 | 1 | 52 | 1 | 54 |
6 | 3.973 | 1 | 1 | 0 | 69 | 1 | 92 |
7 | 6.665 | 1 | 0 | 0 | 54 | 1 | 83 |
8 | 9.521 | 0 | 1 | 0 | 62 | 0 | 67 |
9 | 14.380 | 0 | 0 | 1 | 81 | 0 | 80 |
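The dimension of X* for the subset in Table 2 can be verified with a short script. The sketch below assumes that X* stacks the rows δi(xj − xi)′ for j in the risk set ℛ(yi), ordered by i and then by j, and that rows with δi = 0 or xj = xi are dropped; the function name `effective_rows` is hypothetical. It reproduces the row count only and does not check conditions (C1) and (C2).

```python
# Rows of Table 2: (y, delta, x1, x2, x3, x4, x5).
table2 = [
    (0.394, 1, 1, 0, 68, 0, 54),
    (1.083, 1, 0, 0, 81, 0, 79),
    (1.116, 1, 1, 1, 82, 0, 64),
    (1.149, 1, 0, 1, 58, 1, 86),
    (1.313, 1, 1, 1, 52, 1, 54),
    (3.973, 1, 1, 0, 69, 1, 92),
    (6.665, 1, 0, 0, 54, 1, 83),
    (9.521, 0, 1, 0, 62, 0, 67),
    (14.380, 0, 0, 1, 81, 0, 80),
]

def effective_rows(data):
    """Rows x_j - x_i over events i and subjects j at risk at y_i with x_j != x_i."""
    rows = []
    for y_i, d_i, *x_i in data:
        if d_i == 0:
            continue  # censored subjects contribute only zero rows
        for y_j, _, *x_j in data:
            if y_j >= y_i and x_j != x_i:
                rows.append([b - a for a, b in zip(x_i, x_j)])
    return rows

rows = effective_rows(table2)
print(len(rows), len(rows[0]))  # number of effective rows and number of covariates
print(rows[0])                  # x_2 - x_1
print(rows[-1])                 # x_9 - x_7
```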
Since the MPLE of β and the MLE of (β, h0, α) exist for this dataset, we can compute various estimates of β and α4 and α5. We standardized age and QOL score in order to make the numerical computations more stable. We used the SAS procedure PHREG to obtain the MPLE of β for the complete case (CC) analysis (i.e., an analysis deleting all cases with missing values). The MCEM algorithm discussed in Section 5 was implemented using FORTRAN 77 with IMSL. The estimated observed information matrix ℐ(γ̂) given by (5.6) is of dimension (102+15)×(102+15), and its inverse was computed via the IMSL subroutine DLINDS. The Gibbs sampling algorithm was used to generate the Monte Carlo sample with 500 “burn-in” iterations at each MCEM iteration. In the MCEM, we took m(0) = 500 and Δm = 50. The convergence criterion for the MCEM algorithm for obtaining the MLE was that the squared distance between the tth and (t + 20)th iterations was less than $10^{-3}$. The MCEM algorithm for obtaining the MLE of (β, h0, α) required only 25 iterations using m(t) = 1750 at convergence.
The resulting MPLEs and MLEs are shown in Tables 3, 4, and 5 for the complete case (CC) analysis as well as analyses incorporating all of the cases with two different models for (x4, x5). In the tables, standard errors (SEs), Z-statistics, p-values, and 95% confidence intervals for β are also reported. We can see some differences between the estimates in Tables 3 and 4. In the CC analysis, the 95% confidence interval for β1 is (−0.024, 0.967) while the 95% confidence interval is (0.133, 0.820) in the analysis incorporating all of the cases, which indicates that the regression coefficient for treatment is not significant at the 0.05 level in the CC analysis, but significant in the analysis incorporating all of the cases. This indicates that continuous therapy followed by second line therapy may have a strong effect (i.e., more beneficial) compared to defined duration of therapy with respect to time to progression. Also, the SEs from the analysis incorporating all of the cases are consistently smaller than those from the CC analysis for all of the βj’s. This is expected since more information is used in the all case analysis.
Table 3.
Parameter | MPLE | SE | Z-statistic | p-value | 95% CI |
---|---|---|---|---|---|
β1 | 0.471 | 0.253 | 1.864 | 0.062 | (−0.024, 0.967) |
β2 | 0.068 | 0.243 | 0.280 | 0.780 | (−0.409, 0.545) |
β3 | −0.020 | 0.130 | −0.154 | 0.878 | (−0.275, 0.235) |
β4 | 0.878 | 0.411 | 2.140 | 0.032 | (0.074, 1.684) |
β5 | −0.138 | 0.119 | −1.158 | 0.247 | (−0.372, 0.096) |
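Table 4. MLEs of β for the analysis incorporating all of the cases, under the first of the two models for (x4, x5).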
Parameter | MLE | SE | Z-statistic | p-value | 95% CI |
---|---|---|---|---|---|
β1 | 0.477 | 0.175 | 2.723 | 0.006 | (0.133, 0.820) |
β2 | 0.174 | 0.180 | 0.966 | 0.334 | (−0.179, 0.528) |
β3 | −0.021 | 0.090 | −0.238 | 0.812 | (−0.198, 0.155) |
β4 | 0.914 | 0.381 | 2.400 | 0.016 | (0.168, 1.661) |
β5 | −0.052 | 0.105 | −0.490 | 0.624 | (−0.258, 0.155) |
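Table 5. MLEs of β for the analysis incorporating all of the cases, under the second of the two models for (x4, x5).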
Parameter | MLE | SE | Z-statistic | p-value | 95% CI |
---|---|---|---|---|---|
β1 | 0.477 | 0.175 | 2.722 | 0.006 | (0.133, 0.820) |
β2 | 0.173 | 0.180 | 0.959 | 0.338 | (−0.181, 0.527) |
β3 | −0.021 | 0.090 | −0.233 | 0.816 | (−0.197, 0.155) |
β4 | 0.914 | 0.388 | 2.356 | 0.018 | (0.154, 1.674) |
β5 | −0.053 | 0.106 | −0.501 | 0.616 | (−0.261, 0.155) |
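The Z-statistics, p-values, and confidence intervals in Tables 3 to 5 appear to be of the usual Wald form, that is, Z = estimate/SE with a two-sided normal p-value and an interval of estimate ± 1.96·SE. The sketch below recomputes them from a tabulated estimate and SE under that assumption; `wald_summary` is a hypothetical helper, not part of the original analysis.

```python
import math


def wald_summary(estimate, se, z_crit=1.959964):
    """Wald Z-statistic, two-sided normal p-value, and 95% confidence interval.

    ASSUMPTION: the tabulated quantities are Wald-type summaries.
    """
    z = estimate / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    ci = (estimate - z_crit * se, estimate + z_crit * se)
    return z, p_value, ci


# beta_1 from the all case analysis (estimate 0.477, SE 0.175)
z, p_value, (lower, upper) = wald_summary(0.477, 0.175)
print(f"Z = {z:.3f}, p = {p_value:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Because the inputs are the rounded values from the table, the recomputed quantities can differ from the tabulated ones in the third decimal place.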
We considered two models for (x4, x5) because there are two possible ways to model the joint covariate distribution as a sequence of one-dimensional conditional distributions. As Ibrahim, Lipsitz, and Chen (1999) point out, it is important to conduct a sensitivity analysis to examine whether inference about the parameters of primary interest, which are the βj’s in this case, is robust with respect to the order of conditioning in the covariate distributions. From Tables 4 and 5, both the estimates and the SEs for all of the βj’s are very close under these two joint covariate distributions. Thus, inference about β is quite robust with respect to these two different orders of conditioning.
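As a quick check of how close the two fits are, the sketch below computes the largest absolute difference between Tables 4 and 5 in the estimates and in the SEs. The numbers are copied from the tables; the variable names are hypothetical.

```python
# (estimate, SE) pairs for beta_1..beta_5, copied from Tables 4 and 5.
table4 = {
    "beta1": (0.477, 0.175),
    "beta2": (0.174, 0.180),
    "beta3": (-0.021, 0.090),
    "beta4": (0.914, 0.381),
    "beta5": (-0.052, 0.105),
}
table5 = {
    "beta1": (0.477, 0.175),
    "beta2": (0.173, 0.180),
    "beta3": (-0.021, 0.090),
    "beta4": (0.914, 0.388),
    "beta5": (-0.053, 0.106),
}

# Largest absolute change in the point estimates and in the standard errors.
max_est_diff = max(abs(table4[k][0] - table5[k][0]) for k in table4)
max_se_diff = max(abs(table4[k][1] - table5[k][1]) for k in table4)

print(f"max |estimate difference| = {max_est_diff:.3f}")
print(f"max |SE difference|       = {max_se_diff:.3f}")
```

Both differences are small relative to the SEs themselves, which is the sense in which the inference is insensitive to the order of conditioning.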
Finally, the estimated baseline hazard functions h0(y) are plotted in Figure 1 for the complete case analysis and for the analysis incorporating all of the cases, labeled Complete Cases and All Cases, respectively. For the all case analysis, we used the model [x4|x1, x2, x3][x5|x1, x2, x3, x4] for (x4, x5), since the model [x4|x1, x2, x3, x5][x5|x1, x2, x3] gave an almost identical estimated baseline hazard function. Strikingly, the CC analysis yielded a much larger estimate of the baseline hazard than the all case analysis, which further demonstrates the importance of incorporating all of the cases into the analysis.
Acknowledgments
The authors wish to thank the Editor, the Associate Editor, and two referees for helpful comments and suggestions which have improved the paper. Dr. Chen and Dr. Ibrahim’s research was partially supported by NIH grants #GM 70335 and #CA 74015. Dr. Shao’s research was partially supported by HKUST DAG05/06.SC27 and RGC 602206.
Appendix: Proofs of Theorems
We first establish a useful result, which is formally stated in the following lemma.
Lemma A.1
Let X* be an n* × p matrix (p < n*), and let Rn* denote the n*-dimensional Euclidean space. If there is no positive vector v = (v1, v2, …, vn*)′ ∈ Rn* (denoted by v > 0, i.e., vi > 0 for i = 1, 2, …, n*) such that
X*′v = 0, (A.1)
then there exists a non-zero vector b ∈ Rp such that
xi*′b ≤ 0, i = 1, 2, …, n*, (A.2)
where xi*′ is the ith row of X*.
Proof
Let 𝒞 = {X*′v: v > 0, v ∈ Rn*}. Then 𝒞 is a convex cone in Rp (see Theorem 2.6 in Rockafellar (1970)). Since (A.1) does not hold for any v > 0, we have 0 ∉ 𝒞. By Corollary 11.7.3 of Rockafellar (1970), there exists some non-zero vector b such that b′X*′v ≤ 0 for all v > 0 and hence, by continuity, b′X*′v ≤ 0 for all v ≥ 0. In particular, taking v to be the ith unit vector for i = 1, 2, …, n* shows that (A.2) holds.
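As a small worked illustration of the dichotomy in Lemma A.1, take p = 1 and n* = 2, and read (A.1) as X*′v = 0 and (A.2) as xi*′b ≤ 0 for all i, which are the forms used in the proof above. The two matrices below are hypothetical and chosen only for illustration. For the first, a positive v solving (A.1) exists. For the second, no such v exists, and a non-zero b satisfying (A.2) can be written down directly.

```latex
\begin{align*}
X^*_{(1)} &= \begin{pmatrix} 1 \\ -1 \end{pmatrix}:
  & X^{*\prime}_{(1)} v &= v_1 - v_2 = 0
  \quad \text{for } v = (1, 1)' > 0, \\
X^*_{(2)} &= \begin{pmatrix} 1 \\ 2 \end{pmatrix}:
  & X^{*\prime}_{(2)} v &= v_1 + 2 v_2 > 0
  \quad \text{for every } v > 0, \\
  & & x_1^{*\prime} b &= -1 \le 0, \quad x_2^{*\prime} b = -2 \le 0
  \quad \text{with } b = -1 \ne 0 .
\end{align*}
```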
Proof of Theorem 3.1
Observe that for δ = 0 or 1 and x > −1
(A.3) |
Without loss of generality (WLOG), assume y1 ≤ y2 ≤ … ≤ yn. Then
(A.4) |
where t = (t1, t2, …, tn)′, R+n = R+ ×…×R+ with R+ = (0, ∞), and F (u) = exp(−exp(−u)).
Sufficiency
WLOG, we assume that Lp(β|Dobs) ≢ 0. Then, there exists a β0 such that Lp(β0|Dobs) > 0. Let M > 1 such that
For β satisfying max1≤i≤n,j>i δi(xj − xi)′β > M, there exist i0 and j0 achieving the maximum such that δi0(xi0−xj0)′β < −M. Since F is a nondecreasing distribution function, we have
(A.5) |
When max1≤i≤n,j>i δi(xj−xi)′β ≤ M, following Lemma 4.1 in Chen and Shao (2001), conditions (C1) and (C2) imply that
(A.6) |
Combining (A.5) and (A.6) leads to supβ Lp(β|Dobs) = supβ: ||β||≤D Lp(β|Dobs). Since Lp(β|Dobs) is a continuous and bounded function, there exists a β̂ such that
and hence the MPLE exists.
Necessity
Assume that the MPLE of β exists. Then, there is a β* such that
Assume that condition (C2) does not hold. Then, by Lemma A.1, there exists a non-zero vector b such that δi(xj − xi)′b ≤ 0 for all 1 ≤ i ≤ n and j > i. Thus,
which is an increasing function of s when condition (C1) holds. This is a contradiction. This shows that condition (C2) is necessary for the existence of the MPLE for β if condition (C1) is satisfied.
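The necessity argument can be illustrated numerically. The sketch below assumes the usual Cox partial likelihood without tied event times and uses a hypothetical four-subject dataset in which every subject fails and a larger covariate value always fails earlier. With the subjects ordered by time, δi(xj − xi)b ≤ 0 for all j > i when b = 1, so the log partial likelihood along β = s·b keeps increasing in s and no finite maximizer exists. The data and the function name are assumptions made for illustration.

```python
import math


def log_partial_likelihood(beta, y, delta, x):
    """Cox log partial likelihood for one scalar covariate.

    ASSUMPTION: no tied observed times. Each event contributes
    -log(sum over the risk set of exp((x_j - x_i) * beta)).
    """
    order = sorted(range(len(y)), key=lambda i: y[i])
    total = 0.0
    for pos, i in enumerate(order):
        if delta[i] == 1:
            # Risk set at y_i: subjects whose observed time is at least y_i.
            risk_set = order[pos:]
            total -= math.log(
                sum(math.exp((x[j] - x[i]) * beta) for j in risk_set)
            )
    return total


# Hypothetical data: all events, and larger x always fails earlier.
y = [1.0, 2.0, 3.0, 4.0]
delta = [1, 1, 1, 1]
x = [4.0, 3.0, 2.0, 1.0]

s_grid = (0.0, 1.0, 2.0, 5.0, 10.0)
values = [log_partial_likelihood(s, y, delta, x) for s in s_grid]
for s, v in zip(s_grid, values):
    print(f"s = {s:5.1f}   log Lp = {v:.6f}")
```

The printed values increase toward 0 without reaching it, which is the monotone behaviour the contradiction in the proof relies on.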
Proof of Theorem 4.1
Write
(A.7) |
It is sufficient to prove that
(A.8) |
Observe that
is an increasing function in for δj = 1 and a decreasing function in for δj = 0. For 1 ≤ l ≤ p, let if βl ≥ 0 and if βl < 0. Write and . Let ℛi = ℛ(yi) − {i}. Then we have
(A.9) |
It directly follows from (A.7), (A.9) and (4.3) that
Following the proof of Theorem 3.1, if conditions (C2*) and (C3*) are satisfied. Consequently, we obtain (A.8) under condition (C1*).
Proof of Theorem 4.2
Let 1 = (1, 1, …, 1)′. Then we have
Therefore, the above inequality, condition (C1*) in Theorem 4.1, and conditions (C1) and (C2) stated in Theorem 3.1 directly yield the existence of the MLE of (β, h0, α).
Footnotes
AMS 2000 subject classifications. Primary 62N02, 62F15; secondary 62N99, 65C05.
References
- Booth JG, Hobert JP. Maximizing Generalized Linear Mixed Model Likelihoods with an Automated Monte Carlo EM Algorithm. Journal of the Royal Statistical Society, Series B. 1999;61:265–285.
- Breslow NE. Covariance Analysis of Censored Survival Data. Biometrics. 1974;30:89–99.
- Chen HY. Double Nonparametric Likelihood Method for the Cox Regression Model with Missing Covariates. Journal of the American Statistical Association. 2002;97:565–576.
- Chen HY, Little RJA. Proportional Hazards Regression with Missing Covariates. Journal of the American Statistical Association. 1999;94:896–908.
- Chen MH, Ibrahim JG. Maximum Likelihood Methods for Cure Rate Models with Missing Covariates. Biometrics. 2001;57:43–52. doi: 10.1111/j.0006-341x.2001.00043.x.
- Chen MH, Ibrahim JG, Shao QM. On Propriety of the Posterior Distribution and Existence of the Maximum Likelihood Estimator for Regression Models with Covariates Missing at Random. Journal of the American Statistical Association. 2004;99:421–438.
- Chen MH, Ibrahim JG, Shao QM. Posterior Propriety and Computation for the Cox Regression Model with Applications to Missing Covariates. Biometrika. 2006;93:791–807.
- Chen MH, Shao QM. Propriety of Posterior Distribution for Dichotomous Quantal Response Models With General Link Functions. Proceedings of the American Mathematical Society. 2001;129:293–302.
- Cox DR. Regression Models and Life Tables (with Discussion). Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
- Cox DR. Partial Likelihood. Biometrika. 1975;62:269–276.
- Herring AH, Ibrahim JG. Likelihood-based Methods for Missing Covariates in the Cox Proportional Hazards Model. Journal of the American Statistical Association. 2001;96:292–302.
- Herring AH, Ibrahim JG, Lipsitz SR. Frailty Models With Missing Covariates. Biometrics. 2002;58:98–109. doi: 10.1111/j.0006-341x.2002.00098.x.
- Herring AH, Ibrahim JG, Lipsitz SR. Nonignorably Missing Covariate Data in Survival Analysis: A Case Study of an International Breast Cancer Study Group Trial. Applied Statistics. 2004;53:293–310.
- Huang L, Chen M-H, Ibrahim JG. Bayesian Analysis for Generalized Linear Models with Nonignorably Missing Covariates. Biometrics. 2005;61:729–737. doi: 10.1111/j.1541-0420.2005.00338.x.
- Ibrahim JG. Incomplete Data in Generalized Linear Models. Journal of the American Statistical Association. 1990;85:765–769.
- Ibrahim JG, Lipsitz SR, Chen MH. Missing Covariates in Generalized Linear Models When the Missing Data Mechanism is Nonignorable. Journal of the Royal Statistical Society, Series B. 1999;61:173–190.
- Jacobsen M. Existence and Unicity of MLEs in Discrete Exponential Family Distributions. Scandinavian Journal of Statistics. 1989;16:335–349.
- Johansen S. An Extension of Cox’s Regression Model. International Statistical Review. 1983;51:258–262.
- Klein JP, Moeschberger ML. Survival Analysis. 2nd ed. New York: Springer-Verlag; 2003.
- Lin DY, Ying Z. Cox Regression With Incomplete Covariate Measurements. Journal of the American Statistical Association. 1993;88:1341–1349.
- Lipsitz SR, Ibrahim JG. Using the EM Algorithm for Survival Data with Incomplete Categorical Covariates. Lifetime Data Analysis. 1996;2:5–14. doi: 10.1007/BF00128467.
- Lipsitz SR, Ibrahim JG. Estimating Equations with Incomplete Categorical Covariates in the Cox Model. Biometrics. 1998;54:1002–1013.
- Lipsitz SR, Ibrahim JG. Estimation with Correlated Censored Survival Data with Missing Covariates. Biostatistics. 2000;1:315–327. doi: 10.1093/biostatistics/1.3.315.
- Louis T. Finding the Observed Information Matrix When Using the EM Algorithm. Journal of the Royal Statistical Society, Series B. 1982;44:226–233.
- Martinussen T. Cox Regression with Incomplete Covariate Measurements Using the EM-Algorithm. Scandinavian Journal of Statistics. 1999;26:479–491.
- Paik MC. Multiple Imputation for the Cox Proportional Hazards Model with Missing Covariates. Lifetime Data Analysis. 1997;3:289–298. doi: 10.1023/a:1009657116403.
- Paik MC, Tsai WY. On Using the Cox Proportional Hazards Model with Missing Covariates. Biometrika. 1997;84:579–593.
- Pons O. Estimation in the Cox Model with Missing Covariate Data. Journal of Nonparametric Statistics. 2002;14:223–247.
- Rockafellar RT. Convex Analysis. Princeton, N.J: Princeton University Press; 1970.
- Roy V, Hobert JP. Convergence Rates and Asymptotic Standard Errors for Markov Chain Monte Carlo Algorithms for Bayesian Probit Regression. Journal of the Royal Statistical Society, Series B. 2007;69:607–623.
- Schluchter M, Jackson K. Log-linear Analysis of Censored Survival Data with Partially Observed Covariates. Journal of the American Statistical Association. 1989;84:42–52.
- Socinski MA, Schell MJ, Peterman A, Bakri K, Yates S, Gitten R, Unger P, Lee J, Lee Ji, Tynan M, Moore M, Kies M. Phase III Trial Comparing Defined Duration of Therapy Versus Continuous Therapy Followed by Second-Line Therapy in Advanced-Stage IIIB/IV Non-Small-Cell Lung Cancer. Journal of Clinical Oncology. 2002;20:1335–1343. doi: 10.1200/JCO.2002.20.5.1335.
- Tsai HK, Chen M-H, McLeod DG, Carroll PR, Richie JP, D’Amico AV. Cancer-Specific Mortality Following Radical Prostatectomy or Radiation Therapy with Short-Course Hormonal Therapy in Men with Localized, Unfavorable-Risk Prostate Cancer. 2006 doi: 10.1002/cncr.22279. Submitted.