Information criteria for Firth's penalized partial likelihood approach in Cox regression models

Kengo Nagashima; Yasunori Sato

doi:10.1002/sim.7368

Statistics in Medicine. 2017 Jun 12;36(21):3422–3436. doi: 10.1002/sim.7368

PMCID: PMC6084330 PMID: 28608396

Abstract

In the estimation of Cox regression models, maximum partial likelihood estimates might be infinite in a monotone likelihood setting, where partial likelihood converges to a finite value and parameter estimates converge to infinite values. To address monotone likelihood, previous studies have applied Firth's bias correction method to Cox regression models. However, model selection criteria for Firth's penalized partial likelihood approach have not yet been studied, although a heuristic AIC‐type information criterion is available in a statistical package. Application of the heuristic information criterion to data obtained from a prospective observational study of patients with multiple brain metastases indicated that the heuristic information criterion selects models with many parameters and ignores the adequacy of the model. Moreover, we showed that the heuristic information criterion tends to select models with many regression parameters as the sample size increases. Therefore, in the present study, we propose an alternative AIC‐type information criterion based on the risk function. A BIC‐type information criterion was also evaluated. Further, the presented simulation results confirm that the proposed criteria performed well in a monotone likelihood setting. The proposed AIC‐type criterion was applied to prospective observational study data. © 2017 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd

Keywords: Akaike's information criterion, model selection, monotone likelihood, penalized partial likelihood, survival analysis

1. Introduction

The Cox regression model 1 is one of the most useful and widely used tools in survival analysis. In Cox regression model estimations, maximum partial likelihood estimates may be infinite in monotone likelihood situations 2. In such cases, partial likelihood converges to a finite value, and the parameter estimates and standard errors converge to infinite values; hence, these results are not interpretable. Such problems arise, for example, in the presence of unbalanced covariates, large parameter effects, and/or heavy censoring.

To address this monotone likelihood problem, Heinze and Schemper proposed Firth's penalized partial likelihood approach 3. They directly applied Firth's bias correction method 4, which aims to remove asymptotic bias from maximum likelihood estimates in exponential families with canonical link functions, to the Cox regression model. Firth's penalized partial likelihood approach reduces asymptotic bias and addresses the monotone likelihood problem 3, 5. Firth's bias correction method was also applied to logistic regression models to address the separation problem 5, 6, 7, which is similar to the monotone likelihood problem; in that setting, it likewise reduces asymptotic bias and overcomes the separation problem.

In this study, we discuss the model selection criteria for Firth's penalized partial likelihood approach based on Akaike's information criterion (AIC) 8 and Bayesian information criterion (BIC) 9. Although model selection is an important issue in data analysis, the model selection criteria for Firth's penalized partial likelihood approach have never been studied. To the best of our knowledge, only the SAS PHREG procedure can be used to obtain an AIC‐type heuristic information criterion, AIC^∗=−2log(maximum penalized partial likelihood)+2p, where p is the number of regression parameters; other major statistical software (e.g., Stata and R) can only output the log penalized partial likelihood. However, AIC^∗ is not theoretically justified; in particular, we find that AIC^∗ tends to select a model that has a large number of regression parameters as n→∞, where n is the sample size. That is, AIC^∗ does not have the important property of avoiding over‐fitting. This result indicates that AIC^∗ is not a suitable model selection criterion. Similarly, the SAS PHREG procedure implements a BIC‐type heuristic information criterion, BIC^∗=−2log(maximum penalized partial likelihood)+p log d, where d is the number of events. BIC^∗ is also not theoretically justified. Therefore, we consider alternative model selection criteria in this setting.

The remainder of this paper is organized as follows. In Section 2, we introduce motivating data and issues of AIC^∗. Section 3 briefly reviews Firth's bias correction method and penalized partial likelihood approach, discusses the fundamental problems of AIC^∗ and BIC^∗, and proposes appropriate information criteria. Section 4 presents the simulation results to demonstrate the performance of the criteria and to check that the property of AIC^∗ holds. Section 5 applies the proposed method to real data, and Section 6 concludes the paper with a brief discussion.

2. Motivating example

Yamamoto et al. 10 collected time‐to‐event (e.g., death, local recurrence, and leptomeningeal dissemination) data for 1194 cancer patients with multiple brain metastases. Secondary end points of this study include time to leptomeningeal dissemination (the data on 928 patients were censored, while the MRI results of 121 patients (10%), that is, those who suffered an early death or who deteriorated markedly soon after stereotactic radiosurgery, were not available; 145 patients had an event). We analyzed the following covariates: age (<65, ⩾65), sex (female, male), kps (Karnofsky performance status; ⩾80, ⩽70), ntumor (number of tumors; 1, 2–4, 5–10), diameter (maximum diameter of largest tumor; <1.6 cm, ⩾1.6 cm), volume (cumulative tumor volume; <1.9 mL, ⩾1.9 mL), ptumor (primary tumor category; lung, breast, gastrointestinal, kidney, other), status (extracerebral disease status; not controlled, controlled), and neuro (neurological symptoms; no, yes). To analyze the competing risk end point, leptomeningeal dissemination, we used cause‐specific proportional hazard models, which are identical to usual Cox regression models 11.

The descriptive statistics for the study data are shown in Table 1. The data have heavy censoring, and the number of events differs considerably for the primary tumor categories. In particular, the kidney cancer group has no events. Further, as we illustrate below, monotone likelihood was observed in these data because of the primary tumor categories: the parameter estimate of the kidney cancer group converges to −∞.

Table 1.

Number of events (leptomeningeal dissemination) and censored values for the study data (n=1073).

Covariate	Group	Event	Censored	% Censored
age	<65	69	393	85.1
	⩾65	76	535	87.6
sex	Female	62	370	85.6
	Male	83	558	87.1
kps	⩾80	132	810	86.0
	⩽70	13	118	90.1
ntumor	1	49	365	88.2
	2–4	61	412	87.1
	5–10	35	151	81.2
diameter	<1.6 cm	76	455	85.7
	⩾1.6 cm	69	473	87.3
volume	<1.9 mL	75	459	86.0
	⩾1.9 mL	70	469	87.0
ptumor	Lung	116	705	85.9
	Breast	17	95	84.8
	Gastrointestinal	9	66	88.0
	Kidney	0	32	100.0
	Other	3	30	90.9
status	Not controlled	103	634	86.0
	Controlled	42	294	87.5
neuro	No	105	656	86.2
	Yes	40	272	87.2
Total		145	928	86.5

Note: ntumor, number of tumors; kps, Karnofsky performance status; diameter, maximum diameter of largest tumor; volume, cumulative tumor volume; ptumor, primary tumor category; status, extracerebral disease status; neuro, neurological symptoms.

Next, we consider the model selection based on AIC^∗, the results of which are shown in Table 2. The full model, which includes all the covariates, was selected by AIC^∗ as the best. In the best model, the hazard ratios were estimated by using Firth's penalized partial likelihood approach (Table 3). To illustrate the problem of monotone likelihood, the hazard ratios estimated by using a usual Cox regression model are also shown. For the usual Cox regression model, when monotone likelihood occurred, the hazard ratio of the kidney cancer group was 0.00 (= exp(−∞)), standard error was 543.30, p‐value of the Wald test was 0.98, and p‐value of the likelihood ratio test was <0.01. Although the number of events for lung cancer was considerably larger than that for kidney cancer (Table 1), a large p‐value was observed in the Wald test. On the contrary, the results derived by using Firth's penalized partial likelihood approach were plausible. The hazard ratio was 0.12, and standard error was 1.43 (Table 3); therefore, usual Cox regression models were unsatisfactory in the presence of monotone likelihood.

Table 2.

The top five models based on AIC^∗ for the study data.

Model									AIC^∗
age	sex	kps	ntumor	diameter	volume	ptumor	status	neuro	1731.50
age	sex		ntumor	diameter	volume	ptumor	status	neuro	1731.80
age	sex	kps	ntumor	diameter		ptumor	status	neuro	1731.85
age	sex	kps	ntumor		volume	ptumor	status	neuro	1731.85
age	sex		ntumor		volume	ptumor	status	neuro	1732.16

Table 3.

Parameter estimates of the best model based on AIC^∗ for the study data.

Covariate	Usual Cox regression model						Firth's penalized partial likelihood approach
	HR	SE	95% CI		p (Wald)	p (LR)	HR	SE	95% CI		p (LR)
age (⩾65 vs. <65)	1.01	0.17	0.72	1.40	0.98	0.98	1.00	0.17	0.72	1.40	0.98
sex (male vs. female)	1.20	0.19	0.83	1.73	0.34	0.33	1.19	0.19	0.83	1.72	0.34
ntumor
1 vs. 2–4	0.72	0.20	0.49	1.06	0.09	0.09	0.72	0.20	0.49	1.06	0.09
5–10 vs. 2–4	1.58	0.22	1.03	2.41	0.04	0.04	1.59	0.22	1.04	2.43	0.04
kps (⩾80 vs. ⩽70)	1.04	0.32	0.55	1.96	0.90	0.90	1.07	0.32	0.57	2.00	0.83
diameter (⩾1.6 vs. <1.6)	0.90	0.33	0.47	1.69	0.73	0.73	0.90	0.33	0.47	1.70	0.74
volume (⩾1.9 vs. <1.9)	1.10	0.32	0.59	2.08	0.76	0.76	1.11	0.32	0.59	2.08	0.76
ptumor
Breast vs. lung	1.03	0.29	0.58	1.83	0.91	0.91	1.05	0.29	0.60	1.87	0.85
GI vs. lung	1.55	0.37	0.75	3.21	0.24	0.26	1.61	0.37	0.79	3.31	0.21
Kidney vs. lung	0.00	543.30	0.00	–	0.98	<0.01	0.12	1.43	0.01	1.91	0.02
Others vs. lung	0.77	0.59	0.24	2.45	0.66	0.64	0.89	0.55	0.30	2.63	0.83
status (nc. vs. controlled)	1.02	0.19	0.71	1.47	0.92	0.92	1.03	0.19	0.71	1.48	0.89
neuro (yes vs. no)	1.50	0.23	0.96	2.36	0.07	0.08	1.51	0.23	0.97	2.37	0.07

Note: HR, hazard ratio; kps, Karnofsky performance status; ntumor, number of tumors; diameter, maximum diameter of largest tumor; volume, cumulative tumor volume; ptumor, primary tumor category; status, extracerebral disease status; neuro, neurological symptoms; GI, gastrointestinal; nc., not controlled; LR, likelihood ratio; SE, standard error.

Now, we return to the model selection result based on AIC^∗ when using Firth's penalized partial likelihood approach. As shown in Table 2, model selection based on AIC^∗ tends to select models that have many parameters; to support this statement, we discuss the theoretical property of AIC^∗ in Section 3.3. In Table 3, the best model under AIC^∗ includes variables that have considerably small effects and very large p‐values, such as age (HR=1.00, p‐value=0.98) and status (HR=1.03, p‐value=0.89). Although these variables have little association with the time‐to‐event, they were included in the best model and in the subsequent models ranked in the top 5. Because model selection based on AIC^∗ performs badly, we propose an alternative approach herein to address this problem.

3. Information criteria for Firth's penalized partial likelihood approach

3.1. Cox regression model

We consider Cox regression models 1 with Andersen and Gill's 12 counting process formulation. A triplet $(Ω, \mathcal{F}, P)$ is a probability space, and $\{\mathcal{F}_{t}, t \in [0, 1]\}$ is an increasing right continuous family of sub σ‐algebras of $\mathcal{F}$ that includes failure time and covariate histories to scaled time t and censoring histories to $t^{+}$. Let $N = (N_{1}, \ldots, N_{i}, \ldots, N_{n})^{T}$ for i=1,…,n be an n‐component multivariate counting process, where $N_{i}(t)$ counts the number of failures (0 or 1) for the ith individual in scaled time t∈[0,1]. The sample paths $N_{1}, \ldots, N_{i}, \ldots, N_{n}$ are step functions, with 0 at time 0 and no two components having simultaneous jumps. Now, suppose that $N_{i}(t)$ has a random intensity process $h_{i} (t) = Y_{i} (t) h_{0} (t) \exp \{β_{0}^{T} Z_{i} (t)\}$ , where $h_{0}(t)$ is a baseline hazard function, $Y_{i}(t)$ is a predictable process taking the value of 1 if the ith individual is at risk at time t and 0 otherwise, $β_{0} = (β_{01}, \ldots, β_{0p})^{T}$ is a p‐dimensional vector of the true regression parameters, and the p‐dimensional vector $Z_{i} = (Z_{i1}, \ldots, Z_{ip})^{T}$ is the predictable covariate process for the ith individual. Note that the superscript ‘T’ indicates the transpose of a matrix or a vector. We assume that $(N_{i}, Y_{i}, Z_{i})$ are independent and identically distributed. In this case, the processes $M_{i} (t) = N_{i} (t) - \int_{0}^{t} h_{i} (x) d x$ are independent local square integrable martingales on the scaled time interval [0,1]. $N_{i}$, $Y_{i}$, and $Z_{i}$ are assumed to be adapted to $\{\mathcal{F}_{t}, t \in [0, 1]\}$ .

Under these settings, the log‐partial likelihood function is defined as

l (N, Y, Z; β) = \sum_{i = 1}^{n} \int_{0}^{1} [β^{T} Z_{i} (x) - \log \{n S^{(0)} (β, x)\}] d N_{i} (x),

the score function is defined as

U (β) = \frac{\partial l (N, Y, Z; β)}{\partial β} = \sum_{i = 1}^{n} \int_{0}^{1} \{Z_{i} (x) - \frac{S^{(1)} (β, x)}{S^{(0)} (β, x)}\} d N_{i} (x),

and the observed information matrix is defined as

I (β) = - \frac{\partial^{2} l (N, Y, Z; β)}{\partial β \partial β^{T}} = \sum_{i = 1}^{n} \int_{0}^{1} [\frac{S^{(2)} (β, x)}{S^{(0)} (β, x)} - {\{\frac{S^{(1)} (β, x)}{S^{(0)} (β, x)}\}}^{\otimes 2}] d N_{i} (x),

where $Y = (Y_{1}, \ldots, Y_{n})^{T}$, $Z = (Z_{1}, \ldots, Z_{n})^{T}$, and the scalar function $S^{(0)}$, the vector function $S^{(1)}$, and the matrix function $S^{(2)}$ are defined as $S^{(k)} (β, t) = n^{- 1} \sum_{i = 1}^{n} Y_{i} (t) Z_{i} {(t)}^{\otimes k} \exp \{β^{T} Z_{i} (t)\}$ for k=0,1,2. Here, for a vector b, $b^{\otimes 0} = 1$, $b^{\otimes 1} = b$, and $b^{\otimes 2} = b b^{T}$. By using this notation, the usual Cox regression estimator ${\hat{β}}_{Cox}$ is obtained by solving U(β)=0.
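The three quantities defined above can be evaluated directly for a small dataset. The following is a minimal sketch, assuming time-fixed covariates and no tied failure times; the function name `cox_quantities` and the toy data are hypothetical and are not taken from the paper. The unnormalized risk-set sums n S^(k) are used because the factor n cancels in every ratio.

```python
import math

def cox_quantities(times, events, Z, beta):
    """Return (l, U, I) at beta for untied data with time-fixed covariates."""
    n, p = len(times), len(beta)
    eta = [sum(b * z for b, z in zip(beta, Z[i])) for i in range(n)]
    loglik = 0.0
    score = [0.0] * p
    info = [[0.0] * p for _ in range(p)]
    for i in range(n):
        if events[i] == 0:
            continue  # only failure times contribute (jumps of N_i)
        risk = [j for j in range(n) if times[j] >= times[i]]  # subjects with Y_j(t) = 1
        w = {j: math.exp(eta[j]) for j in risk}
        s0 = sum(w.values())                                        # n * S^(0)
        s1 = [sum(Z[j][a] * w[j] for j in risk) for a in range(p)]  # n * S^(1)
        loglik += eta[i] - math.log(s0)
        for a in range(p):
            score[a] += Z[i][a] - s1[a] / s0
            for b in range(p):
                s2_ab = sum(Z[j][a] * Z[j][b] * w[j] for j in risk)  # n * S^(2)
                info[a][b] += s2_ab / s0 - (s1[a] / s0) * (s1[b] / s0)
    return loglik, score, info

# Hypothetical toy data: six subjects, two binary covariates, no tied times.
times = [2.0, 1.0, 4.0, 3.0, 5.0, 6.0]
events = [1, 1, 0, 1, 1, 0]
Z = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]]
beta = [0.3, -0.2]
loglik, score, info = cox_quantities(times, events, Z, beta)
```

A finite-difference check of this sketch confirms that the score is the gradient of l and that the information matrix is the negative Hessian of l.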

3.2. Firth's bias correction method for Cox regression models

Heinze and Schemper 3 directly applied Firth's bias correction method 4 to Cox regression models to overcome monotone likelihood. They proposed an estimation method based on the penalized log‐partial likelihood, $l^{*} (N, Y, Z; β) = l (N, Y, Z; β) + 0.5 \log | I (β) |$, and the modified score function $U^{*} (β) = U (β) + a (β)$, where |I(β)| is the determinant of the observed information matrix, $a (β) = \{a_{1} (β), \ldots, a_{p} (β)\}^{T}$ are modification terms, and $a_{j} (β) = tr [\{I (β)\}^{- 1} \{\partial I (β) / \partial β_{j}\}] / 2$. The penalized partial likelihood estimator $\hat{β}$ is obtained by solving $U^{*} (β) = 0$, which is different from the usual Cox regression estimator, ${\hat{β}}_{Cox}$ . They only assessed the empirical performance of these methods by using simulation studies. These simulation results confirm the satisfactory performance of the penalized likelihood ratio test and profile penalized likelihood confidence interval under monotone likelihood.
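To illustrate why the penalty term removes monotone likelihood, the sketch below evaluates l(β) and l^∗(β) for a single binary covariate whose z=1 group has no events. The data are hypothetical toy values, the grid search merely stands in for solving U^∗(β)=0, and with p=1 the determinant |I(β)| is simply the scalar I(β).

```python
import math

# Hypothetical toy data: the z = 1 group has no events (monotone likelihood).
times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
events = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]
z = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

def loglik_and_info(beta):
    """Log-partial likelihood l(beta) and scalar observed information I(beta)."""
    loglik, info = 0.0, 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i == 0:
            continue
        risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
        s0 = sum(math.exp(beta * z[j]) for j in risk)
        s1 = sum(z[j] * math.exp(beta * z[j]) for j in risk)
        s2 = sum(z[j] ** 2 * math.exp(beta * z[j]) for j in risk)
        loglik += beta * z[i] - math.log(s0)
        info += s2 / s0 - (s1 / s0) ** 2
    return loglik, info

def penalized_loglik(beta):
    """l*(beta) = l(beta) + 0.5 log|I(beta)| for a single covariate."""
    loglik, info = loglik_and_info(beta)
    return loglik + 0.5 * math.log(info)

# Crude grid search over beta in [-10, 5]; an assumption made for brevity.
grid = [-10 + 0.01 * k for k in range(1501)]
beta_firth = max(grid, key=penalized_loglik)
```

In this toy example, l(β) keeps increasing as β decreases, so no finite maximizer exists, whereas l^∗(β) attains its maximum in the interior of the grid because 0.5 log I(β) tends to −∞ as β→−∞.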

The modification term a(β) can be derived by using an asymptotic expansion of E[U^∗(β)]. It will be convenient to employ the notation of Cox and Snell 13 and Firth 4. Let $U_{j} (β) = \partial l (N, Y, Z; β) / \partial β_{j}$, $U_{jk} (β) = \partial^{2} l (N, Y, Z; β) / \partial β_{j} \partial β_{k}$, and $U_{jkl} (β) = \partial^{3} l (N, Y, Z; β) / \partial β_{j} \partial β_{k} \partial β_{l}$, and let the null cumulants be $κ_{j,k} = n^{- 1} E [U_{j} (β_{0}) U_{k} (β_{0})]$, $κ_{jk} = n^{- 1} E [U_{jk} (β_{0})]$, $κ_{j,kl} = n^{- 1} E [U_{j} (β_{0}) U_{kl} (β_{0})]$, $κ_{j,k,l} = n^{- 1} E [U_{j} (β_{0}) U_{k} (β_{0}) U_{l} (β_{0})]$, and $κ_{jkl} = n^{- 1} E [U_{jkl} (β_{0})]$. Based on the asymptotic expansion, the bias of the estimator of the mth regression parameter is given by

E [n^{- 1 / 2} ({\hat{β}}_{m} - β_{0 m})] = n^{- 1} κ^{m, j} \{- \frac{1}{2} κ^{k, l} (κ_{j, k, l} + κ_{j, k l}) + α_{j} (β_{0})\} + O (n^{- 3 / 2}), \tag{1}

where $κ^{k,l}$ denotes the (k,l) element of the inverse of the Fisher information matrix, Einstein summation convention is applied, and $α_{j} (β_{0}) = E [a_{j} (β_{0})]$. From Eq. 1, if $α_{j} (β_{0}) = κ^{k,l} (κ_{j,k,l} + κ_{j,kl}) / 2$, then the first‐order bias term disappears. Moreover, if $κ_{j,k} + κ_{jk} = 0$, $κ_{j,k,l} + κ_{j,kl} + κ_{k,jl} + κ_{l,jk} + κ_{jkl} = 0$, and $κ_{j,kl} = 0$, then

α_{j} (β) = \frac{1}{2} κ^{k, l} (κ_{j, k, l} + κ_{j, k l}) = \frac{1}{2} κ^{k, l} κ_{j, k, l} = - \frac{1}{2} κ^{k, l} κ_{j k l} .

From the aforementioned result and paying attention to the summation convention, the modification term can be written as

a_{j} (β) = \frac{1}{2} tr [{I (β)}^{- 1} {\partial I (β) / \partial β_{j}}] .

However, the relationships $κ_{j,k,l} + κ_{j,kl} + κ_{k,jl} + κ_{l,jk} + κ_{jkl} = 0$ and $κ_{j,kl} = 0$ are nontrivial in Cox regression models and have not previously been verified. Nevertheless, we proved that these relationships hold in Cox regression models under the independent and identically distributed assumption (see Appendix A for more details).

3.3. Problem in heuristic information criteria

As noted earlier, although model selection is an important issue in data analysis, the model selection criteria for the penalized partial likelihood approach have never been studied. To the best of our knowledge, only the SAS PHREG procedure can be used to obtain an AIC‐type heuristic information criterion, ${AIC}^{*} = - 2 l^{*} (N, Y, Z; \hat{β}) + 2 p = - 2 l (N, Y, Z; \hat{β}) + 2 p - \log | I (\hat{β}) |$ . Moreover, other major statistical software (e.g., Stata and R) can only output the penalized log‐partial likelihood. However, AIC^∗ is not theoretically justified.

Now, we discuss a property of AIC^∗. After some algebra,

{AIC}^{*} = - 2 l (N, Y, Z; \hat{β}) + (2 - \log n) p - \log \{n^{- p} | I (\hat{β}) |\} .

The last term on the right‐hand side, $- \log \{n^{- p} | I (\hat{β}) |\}$ , converges to a constant because $n^{- 1} I (\hat{β}) \overset{P}{\to} Σ (β_{0})$ (see Appendix B for more details) and $| n^{- 1} I (\hat{β}) | = n^{- p} | I (\hat{β}) | \overset{P}{\to} | Σ (β_{0}) |$ as n→∞, where $Σ (β_{0}) = \int_{0}^{1} v (β_{0}, x) s^{(0)} (β_{0}, x) h_{0} (x) d x$ , $v = (s^{(2)} / s^{(0)}) - \{s^{(1)} / s^{(0)}\}^{\otimes 2}$, $s^{(0)} (β, t) = E [S^{(0)} (β, t)]$, $s^{(1)} (β, t) = E [S^{(1)} (β, t)]$, and $s^{(2)} (β, t) = E [S^{(2)} (β, t)]$. If n⩾8, then 2−log n is negative. Because AIC^∗ includes the term (2−log n)p, this criterion tends to select models with large p as n→∞. Importantly, this result indicates that AIC^∗ does not avoid over‐fitting.
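The decomposition above can be checked numerically. The sketch below is only illustrative: the function names and the input values used to exercise them are hypothetical, and natural logarithms are assumed throughout.

```python
import math

def aic_star(loglik, logdet_info, p):
    """AIC* = -2 l*(beta_hat) + 2p with l* = l + 0.5 log|I(beta_hat)|."""
    return -2.0 * (loglik + 0.5 * logdet_info) + 2.0 * p

def aic_star_decomposition(loglik, logdet_info, p, n):
    """Split AIC* into -2 l, (2 - log n) p, and -log{n^(-p) |I(beta_hat)|}."""
    fit = -2.0 * loglik
    penalty = (2.0 - math.log(n)) * p
    remainder = -(logdet_info - p * math.log(n))
    return fit, penalty, remainder

def per_parameter_penalty(n):
    """Coefficient of p in the decomposition; negative once log n exceeds 2."""
    return 2.0 - math.log(n)
```

Because e^2 is approximately 7.39, `per_parameter_penalty(7)` is still positive while `per_parameter_penalty(8)` is negative, which matches the threshold n⩾8 stated above.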

Similarly, the SAS PHREG procedure implements a BIC‐type heuristic information criterion, $BIC^{*} = - 2 l^{*} (N, Y, Z; \hat{β}) + p \log d$ , where d is the number of events 14. Let c=1−d/n∈(0,1] be the proportion of censoring; then, because d=n(1−c), $BIC^{*} = - 2 l (N, Y, Z; \hat{β}) + p \log (1 - c) - \log \{n^{- p} | I (\hat{β}) |\}$ . Because log(1−c)<0, BIC^∗ has a negative penalty term in proportion to the number of regression parameters.
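For completeness, the algebra behind this expression can be written out as follows. This is a worked restatement using only the definitions given above, namely $l^{*} = l + 0.5 \log | I (\hat{β}) |$ and d=n(1−c).

```latex
\begin{aligned}
BIC^{*} &= -2 l (N, Y, Z; \hat{\beta}) - \log | I (\hat{\beta}) | + p \log d \\
        &= -2 l (N, Y, Z; \hat{\beta}) - \left[ \log \{ n^{-p} | I (\hat{\beta}) | \} + p \log n \right] + p \{ \log n + \log (1 - c) \} \\
        &= -2 l (N, Y, Z; \hat{\beta}) + p \log (1 - c) - \log \{ n^{-p} | I (\hat{\beta}) | \} .
\end{aligned}
```

The p log n terms cancel, so the only remaining term proportional to p is p log(1−c).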

3.4. Proposed criteria

As an alternative approach to address the issue discussed in Section 3.3, we propose a criterion termed herein AIC for Firth's penalized partial likelihood approach (AICF). AIC is a model selection criterion used to measure the goodness of fit of a model by using the risk function based on Kullback–Leibler (KL) information between the true model and the candidate model, which is a measure of discrepancy from the true model.

Xu et al. 15 provided a theoretical justification for the use of partial likelihood in AIC under usual Cox regression models, which was also extended to proportional hazards mixed models. These authors developed a profile AIC 16 for selecting a model with minimum KL information based on the profile likelihood under Cox regression models. It is well known that partial likelihood can be considered as profile likelihood 17, 18, 19. Suppose that f denotes the true distribution and $g_{β,λ} = g (·; β, λ)$ denotes candidate models, where λ∈Λ is the nuisance parameter and Λ is the parameter space of λ. The KL information can be written as $KL (f, g_{β,λ}) = E_{(N,Y,Z) ∼ f} [\log f (N, Y, Z) - \log g_{β,λ} (N, Y, Z)]$. Here and subsequently, we write $E_{N} = E_{(N,Y,Z) ∼ f}$ for convenience. Focusing on the regression parameters β alone and ignoring the constant term $E_{N} [\log f (N, Y, Z)]$ in KL, the minimum KL information is given at $β_{0}$ such that $E_{N} [\log g_{β_{0}} (N, Y, Z)] = \max_{β} E_{N} [\log g_{β} (N, Y, Z)]$ , where $E_{N} [\log g_{β} (N, Y, Z)] = \max_{λ} E_{N} [\log g_{β,λ} (N, Y, Z)]$. If the model is correctly specified (i.e., $f = g_{β_{0}}$), $E_{N} [\log g_{β_{0}} (N, Y, Z)] = \int_{Ω} \log g_{β_{0}} (N, Y, Z) d P$ . Under Cox regression models, the log profile likelihood can be written as $\max_{λ} \sum_{i = 1}^{n} \log g (N_{i}, Y_{i}, Z_{i}; β, λ) = l (N, Y, Z; β)$ . Xu et al. 15 considered the risk function $E_{N} E_{\tilde{N}} [- 2 l (\tilde{N}, \tilde{Y}, \tilde{Z}; {\hat{β}}_{Cox})]$ based on the log profile likelihood and showed that the profile AIC, $- 2 l (N, Y, Z; {\hat{β}}_{Cox}) + 2 p$ , is an approximately unbiased estimator of this risk function, where $(\tilde{N}, \tilde{Y}, \tilde{Z})$ is a future observation. Based on Akaike 8, the minimum risk function corresponds to the minimum KL information using a future observation.

Therefore, we consider a partial likelihood‐based risk function, $RISK = E_{N} E_{\tilde{N}} [- 2 l (\tilde{N}, \tilde{Y}, \tilde{Z}; \hat{β})]$ , and derive AICF as an approximately unbiased estimator of RISK. In the definition of RISK, the estimator $\hat{β}$ is not the usual Cox regression estimator ${\hat{β}}_{Cox}$ but rather Firth's penalized partial likelihood estimator. When we simply estimate RISK by $- 2 l (N, Y, Z; \hat{β})$ , we need to correct the bias B. Here, B is defined as

\begin{matrix} B & = E_{N} E_{\tilde{N}} [2 l (N, Y, Z; \hat{β}) - 2 l (\tilde{N}, \tilde{Y}, \tilde{Z}; \hat{β})] \\ & = b_{1} + b_{2} + b_{3}, \end{matrix}

where

\begin{matrix} b_{1} & = E_{N} [E_{Ñ} [2 l (\tilde{N}, \tilde{Y}, \tilde{Z}; β_{0})] - E_{Ñ} [2 l (\tilde{N}, \tilde{Y}, \tilde{Z}; \hat{β})]], \\ b_{2} & = E_{N} [2 l (N, Y, Z; β_{0})] - E_{Ñ} [2 l (\tilde{N}, \tilde{Y}, \tilde{Z}; β_{0})], \end{matrix}

and

b_{3} = E_{N} [2 l (N, Y, Z; \hat{β}) - 2 l (N, Y, Z; β_{0})] .

According to this definition, B includes the true parameter vector $β_{0}$. Therefore, we need to approximate B. A second‐order Taylor expansion of $E_{\tilde{N}} [l (\tilde{N}, \tilde{Y}, \tilde{Z}; \hat{β})]$ around $\hat{β} = β_{0}$ gives

E_{\tilde{N}} [l (\tilde{N}, \tilde{Y}, \tilde{Z}; \hat{β})] \approx E_{\tilde{N}} [l (\tilde{N}, \tilde{Y}, \tilde{Z}; β_{0})] - \frac{1}{2} \{\sqrt{n} {(\hat{β} - β_{0})}^{T}\} Σ (β_{0}) \{\sqrt{n} (\hat{β} - β_{0})\}, \tag{2}

a first‐order Taylor expansion of $U^{*} (\hat{β}) = 0$ around $\hat{β} = β_{0}$ gives

\sqrt{n} (\hat{β} - β_{0}) \approx \{n^{- 1} I^{*} (β_{0})\}^{- 1} \{n^{- 1 / 2} U^{*} (β_{0})\}, \tag{3}

and a second‐order Taylor expansion of l(N,Y,Z;β ₀) around $β_{0} = \hat{β}$ gives

l (N, Y, Z; β_{0}) \approx l (N, Y, Z; \hat{β}) - \frac{1}{2} \{\sqrt{n} {(\hat{β} - β_{0})}^{T}\} \{n^{- 1} I (\hat{β})\} \{\sqrt{n} (\hat{β} - β_{0})\} + {a (\hat{β})}^{T} (\hat{β} - β_{0}), \tag{4}

where I ^∗(β)=−∂ U ^∗(β)/∂ β ^T=I(β)−∂ a(β)/∂ β ^T. From the fact E[ f(D)]=E[ tr {f(D)}] for a scalar function f and a random vector D, and by substituting Eqs 2‐4 into b ₁, we can show that

b_{1} \approx E_{N} [tr {Σ (β_{0}) {n^{- 1} I^{*} (β_{0})}^{- 1} {n^{- 1} J^{*} (β_{0})} {n^{- 1} I^{*} (β_{0})}^{- 1}}],

where $J^{*} (β) = \sum_{i = 1}^{n} {L_{i}^{*} (β)}^{\otimes 2}$ , $L_{i}^{*} (β) = L_{i} (β) - a (β) / n$ , and $L_{i} (β_{0}) = \int_{0}^{1} {Z_{i} (x) - S^{(1)} (β_{0}, x) / S^{(0)} (β_{0}, x)} d M_{i} (x)$ . Similarly,

b_{3} \approx E_{N} [tr {{n^{- 1} I (\hat{β})} {n^{- 1} I^{*} (β_{0})}^{- 1} {n^{- 1} J^{*} (β_{0})} {n^{- 1} I^{*} (β_{0})}^{- 1} - 2 {a (\hat{β})}^{T} (\hat{β} - β_{0})}] .

Under the true model $n^{- 1} J^{*} (β_{0}) \overset{P}{\to} Σ (β_{0})$ , $n^{- 1} I^{*} (β_{0}) \overset{P}{\to} Σ (β_{0})$ , $n^{- 1} I (\hat{β}) \overset{P}{\to} Σ (β_{0})$ , and $a (\hat{β}) = O_{p} (1)$ (see Data S1 and Appendix B). Therefore, by applying the continuous mapping theorem, we obtain

\begin{matrix} tr {Σ (β_{0}) {n^{- 1} I^{*} (β_{0})}^{- 1} {n^{- 1} J^{*} (β_{0})} {n^{- 1} I^{*} (β_{0})}^{- 1}} \\ \overset{P}{\to} tr {Σ (β_{0}) {Σ (β_{0})}^{- 1} Σ (β_{0}) {Σ (β_{0})}^{- 1}} = p, \end{matrix}

and

\begin{matrix} tr {{n^{- 1} I (\hat{β})} {n^{- 1} I^{*} (β_{0})}^{- 1} {n^{- 1} J^{*} (β_{0})} {n^{- 1} I^{*} (β_{0})}^{- 1} - 2 {a (\hat{β})}^{T} (\hat{β} - β_{0})} \\ \overset{P}{\to} tr {Σ (β_{0}) {Σ (β_{0})}^{- 1} Σ (β_{0}) {Σ (β_{0})}^{- 1}} = p, \end{matrix}

as n→∞. Moreover, it is obvious that b ₂=0. Further details are presented in Data S1. Hence, b ₁≈p, b ₃≈p, and bias B=b ₁+b ₂+b ₃ can be approximated by 2p. From the aforementioned results, we define AICF as

AICF = - 2 l (N, Y, Z; \hat{β}) + 2 p .

The AICF does not include the penalty term of AIC^∗, $0.5 l o g | I (\hat{β}) |$ . Even in the penalized partial likelihood setting, non‐penalized likelihood should be used for risk estimation. Sometimes, penalty terms for parameter estimation have a strong effect on a model selection criterion.

Similarly, based on the results of a previous study 14, we propose BIC for Firth's penalized partial likelihood approach

BICF = - 2 l (N, Y, Z; \hat{β}) + p l o g d .

The detailed derivation is omitted, but is similar to that described previously 14.

Note that SAS and R programs for AICF and BICF are provided in Data S3.
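As a minimal sketch (separate from the SAS and R programs in Data S3), the four criteria discussed in this section can be computed from the same fitted quantities. The function name and the numeric inputs below are hypothetical values chosen only to show how the log‐determinant term can reverse a ranking; they are not results from the study data.

```python
import math

def information_criteria(loglik, logdet_info, p, d):
    """Heuristic (AIC*, BIC*) and proposed (AICF, BICF) criteria.

    loglik      : unpenalized log-partial likelihood at the Firth estimate
    logdet_info : log|I(beta_hat)|
    p           : number of regression parameters
    d           : number of events
    """
    pen_loglik = loglik + 0.5 * logdet_info  # penalized log-partial likelihood
    return {
        "AIC*": -2.0 * pen_loglik + 2.0 * p,
        "BIC*": -2.0 * pen_loglik + p * math.log(d),
        "AICF": -2.0 * loglik + 2.0 * p,
        "BICF": -2.0 * loglik + p * math.log(d),
    }

# Hypothetical fits of a small model (p = 3) and a large model (p = 6).
small = information_criteria(loglik=-870.0, logdet_info=4.0, p=3, d=145)
large = information_criteria(loglik=-869.5, logdet_info=12.0, p=6, d=145)
```

With these made‐up inputs, AICF prefers the small model, whereas AIC^∗ prefers the large one, because log|I| typically grows with p and enters AIC^∗ with a negative sign.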

4. Simulation

4.1. Simulation conditions

Simulation studies were conducted to investigate the performance of model selection criteria (AICF, BICF, AIC^∗, and BIC^∗) in a monotone likelihood setting and to check that the property of AIC^∗ discussed in Section 3.3 holds. We set the simulation conditions by referring to 3 and generated observations $\{N_{i}, Y_{i}, Z_{i}\}$ from exponential distributions with hazard functions $h_{i} (t) = h_{0} (t) γ_{i} (t) = h_{0} (t) \exp \{β_{0}^{T} Z_{i}\}$ , where $h_{0} (t) = 1$, $β_{0} = (\log θ, \log θ, \log θ, 0, 0)^{T}$, $Z_{i} = (Z_{i1}, Z_{i2}, Z_{i3}, Z_{i4}, Z_{i5})^{T}$, and $Z_{ij} ∼ Bernoulli (q)$. We further set the proportion of covariates as q=0.5 or 0.8, the regression parameters as θ=1, 2, 4, or 16, the proportion of censoring as c=0, 50, or 90 (%), and the total sample size as n=100, 200, or 1000. We generated data under simple type I censoring; the observations of each individual were censored at a suitable time τ for each simulation. Time τ was determined to achieve an expected 50% or 90% censoring. We find monotone likelihood in situations of high censoring and high parameter values. For each data configuration, we generated R=20,000 simulations. For each simulation, we calculated AICF, AIC^∗, BICF, and BIC^∗ for the following 11 models:

Model 1: log γ_i(t) = β₁Z_i1
Model 2: log γ_i(t) = β₄Z_i4
Model 3: log γ_i(t) = β₁Z_i1 + β₂Z_i2
Model 4: log γ_i(t) = β₁Z_i1 + β₄Z_i4
Model 5: log γ_i(t) = β₄Z_i4 + β₅Z_i5
Model 6: log γ_i(t) = β₁Z_i1 + β₂Z_i2 + β₃Z_i3 (the true model)
Model 7: log γ_i(t) = β₁Z_i1 + β₂Z_i2 + β₄Z_i4
Model 8: log γ_i(t) = β₁Z_i1 + β₄Z_i4 + β₅Z_i5
Model 9: log γ_i(t) = β₁Z_i1 + β₂Z_i2 + β₃Z_i3 + β₄Z_i4
Model 10: log γ_i(t) = β₁Z_i1 + β₂Z_i2 + β₄Z_i4 + β₅Z_i5
Model 11: log γ_i(t) = β₁Z_i1 + β₂Z_i2 + β₃Z_i3 + β₄Z_i4 + β₅Z_i5 (the full model)

Model 6 is the true model, and Model 11 is the full model, that is, the model with maximum p.
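The data‐generating step and the candidate models described above can be sketched as follows. This is not the authors' simulation code: the function names are hypothetical, and the way τ is obtained here (bisection on the expected censoring proportion under the stated covariate distribution) is our assumption, since the text only states that τ was determined to achieve the expected censoring proportion.

```python
import math
import random

# Candidate models as tuples of covariate indices (1-based, as in the list above).
MODELS = {
    1: (1,), 2: (4,), 3: (1, 2), 4: (1, 4), 5: (4, 5), 6: (1, 2, 3),
    7: (1, 2, 4), 8: (1, 4, 5), 9: (1, 2, 3, 4), 10: (1, 2, 4, 5),
    11: (1, 2, 3, 4, 5),
}

def censoring_prob(tau, theta, q):
    """P(T > tau): the hazard is theta**k when k of Z1..Z3 equal 1 (h0 = 1)."""
    return sum(math.comb(3, k) * q ** k * (1 - q) ** (3 - k)
               * math.exp(-(theta ** k) * tau) for k in range(4))

def solve_tau(c, theta, q):
    """Bisection for the type I censoring time giving expected censoring c."""
    if c == 0:
        return math.inf  # no censoring
    lo, hi = 0.0, 1.0
    while censoring_prob(hi, theta, q) > c:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if censoring_prob(mid, theta, q) > c:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def simulate(n, theta, q, tau, rng):
    """Return a list of (observed time, event indicator, covariate vector)."""
    beta0 = [math.log(theta)] * 3 + [0.0, 0.0]
    data = []
    for _ in range(n):
        z = [1 if rng.random() < q else 0 for _ in range(5)]
        rate = math.exp(sum(b * x for b, x in zip(beta0, z)))
        t = rng.expovariate(rate)  # exponential failure time with this hazard
        data.append((min(t, tau), int(t <= tau), z))
    return data
```

For example, `simulate(100, 4.0, 0.5, solve_tau(0.9, 4.0, 0.5), random.Random(1))` would produce one dataset under the c=90%, θ=4, n=100, q=0.5 configuration.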

Because it is well known that AIC is designed for optimal prediction and BIC is designed to identify the true model, we assessed the predictive performance of AICF and AIC^∗. We evaluated the mean of the difference between the estimated mean risk (MR) and value of the information criterion in each model and its 5 and 95 percentiles, as well as the estimated MR for the selected model based on new data. The MR and its estimator, $\hat{MR}$ , are defined as

MR = \frac{1}{R} \sum_{r = 1}^{R} E_{Ñ} [- 2 l (\tilde{N}, \tilde{Y}, \tilde{Z}; {\hat{β}}_{r})]

and

\hat{MR} = - \frac{2}{R} \sum_{r = 1}^{R} \sum_{i = 1}^{n} \int_{0}^{1} [{\hat{β}}_{r}^{T} {\tilde{Z}}_{r i} (x) - l o g \{\sum_{j = 1}^{n} {\tilde{Y}}_{r j} (x) \exp {{\hat{β}}_{r}^{T} {\tilde{Z}}_{r j} (x)}\}] d Ñ_{r i} (x),

where ( $\tilde{N}$ , $\tilde{Y}$ , $\tilde{Z}$ ) is another dataset of the same size, ${\hat{β}}_{r}$ is the model estimate of each replication, and ${\tilde{Z}}_{r i}$ is the covariate vector in each model and replication. The absolute value of the mean difference between $\hat{MR}$ and the value of the information criterion should be small because an AIC‐type criterion is an estimator of the risk function; thus, the mean difference can be regarded as an empirical bias of an AIC‐type criterion. The estimated MR for the selected model, ${\hat{MR}}_{s e l}$ , is defined as

{\hat{MR}}_{s e l} = - \frac{2}{R} \sum_{r = 1}^{R} \sum_{i = 1}^{n} \int_{0}^{1} [{\hat{β}}_{r, s e l}^{T} {\tilde{Z}}_{r i} (x) - l o g \{\sum_{j = 1}^{n} {\tilde{Y}}_{r j} (x) \exp {{\hat{β}}_{r, s e l}^{T} {\tilde{Z}}_{r j} (x)}\}] d Ñ_{r i} (x),

where ${\hat{β}}_{r, s e l}$ is an estimate of the selected model. The ${\hat{MR}}_{s e l}$ should be small and is considered to be a performance indicator for prediction, as it measures the goodness of fit of the selected model to future data as a mean deviance.

To assess the performance of the criteria, we evaluated model selection probability $P_{m}$, where m=1,2,…,11. We estimate the model selection probability by ${\hat{P}}_{m} = (\text{number of replications in which model } m \text{ is selected}) / R$ with R=20,000, that is, the relative frequency with which model m minimizes the information criterion.
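The estimator of the model selection probability amounts to a relative frequency of the minimizing model. A minimal sketch follows; the function name and the four toy replications are hypothetical, and ties are broken in favor of the model listed first, which is an assumption not addressed in the text.

```python
from collections import Counter

def selection_probabilities(criterion_values, n_models=11):
    """Estimate P_m from a list with one {model: criterion value} dict per replication."""
    counts = Counter(min(rep, key=rep.get) for rep in criterion_values)
    n_rep = len(criterion_values)
    return {m: counts.get(m, 0) / n_rep for m in range(1, n_models + 1)}

# Hypothetical criterion values for three candidate models in four replications.
reps = [
    {1: 10.0, 6: 8.0, 11: 9.0},
    {1: 7.0, 6: 7.5, 11: 9.0},
    {1: 10.0, 6: 8.0, 11: 8.5},
    {1: 10.0, 6: 9.0, 11: 8.0},
]
p_hat = selection_probabilities(reps)
```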

4.2. Simulation results

The mean of the difference between the estimated MR and the value of the information criterion, together with its 5 and 95 percentiles, for Models 6 and 11 with q=0.5 is shown in Table 4. The absolute mean differences for AICF were smaller than those for AIC^∗ under all conditions. The magnitude of the mean differences for AIC^∗ increased with the number of events. The 5 and 95 percentiles for AICF were approximately symmetric around 0, whereas the percentiles for AIC^∗ were not symmetric. For instance, in the case with c=0%, θ=1, n=100, and Model 6 (true model), the mean difference and its percentiles for AICF and AIC^∗ were −0.4 (−9.9, 5.9) and −9.8 (−19.2, −3.5), respectively; in the case with c=0%, θ=1, n=100, and Model 11 (full model), they were −1.1 (−11.2, 6.0) and −16.7 (−26.7, −9.7), respectively. Thus, AIC^∗ is clearly biased downward, as AIC^∗ includes unnecessary negative terms (Section 3.3). Larger bias was observed in models with large p. Therefore, AIC^∗ cannot adequately estimate the risk function. The estimated MR for the selected model based on another new dataset, ${\hat{MR}}_{s e l}$ , for q=0.5 is also shown in Table 4. The ${\hat{MR}}_{s e l}$ for AICF was smaller than that for AIC^∗, except in two cases: c=50%, θ=4, and n=1000; and c=0%, θ=1, and n=200. Thus, the model selection based on AICF showed small deviance for future data. These results revealed that AICF shows better prediction performance.

Table 4.

The mean of the difference between the estimated MR and the value of the information criterion in each model and its 5 and 95 percentiles, and the estimated MR for the selected model (the proportion of covariates: q=0.5; the number of simulations: R=20,000).

c (%)	θ	n	Mean difference (5 percentile, 95 percentile)				${\hat{MR}}_{s e l}$	
			Model 6 (true model)		Model 11 (full model)			
			AICF	AIC^∗	AICF	AIC^∗	AICF	AIC^∗
90	1	100	−0.4 (−43.9, 45.9)	−2.7 (−44.2, 42.1)	−1.1 (−44.4, 45.5)	−4.8 (−44.7, 39.2)	94.38	95.05
		200	0.1 (−71.4, 74.6)	−4.4 (−74.8, 68.9)	−0.1 (−71.6, 74.9)	−7.7 (−77.1, 65.5)	212.87	214.55
		1000	1.6 (−206.1, 218.1)	−8.0 (−215.2, 208.0)	1.6 (−206.9, 218.2)	−14.4 (−222.1, 201.4)	1373.64	1375.61
	2	100	−0.4 (−43.8, 45.8)	−2.7 (−44.1, 42.0)	−1.0 (−44.2, 45.6)	−4.7 (−44.6, 39.3)	94.40	95.04
		200	−0.6 (−72.0, 74.4)	−5.2 (−75.4, 68.8)	−0.9 (−72.4, 74.6)	−8.5 (−77.8, 65.2)	212.82	214.48
		1000	1.3 (−205.9, 217.1)	−8.3 (−215.1, 207.0)	1.3 (−206.6, 217.4)	−14.8 (−221.8, 200.6)	1375.21	1377.21
	4	100	−0.7 (−44.2, 45.7)	−3.0 (−44.5, 41.9)	−1.4 (−44.6, 45.4)	−5.1 (−45.0, 39.1)	94.73	95.36
		200	0.0 (−71.6, 74.2)	−4.6 (−74.9, 68.6)	−0.3 (−71.5, 74.3)	−7.9 (−77.1, 65.0)	212.88	214.50
		1000	−1.8 (−216.9, 214.8)	−11.4 (−226.1, 204.7)	−1.9 (−215.8, 213.4)	−17.9 (−231.0, 196.7)	1376.07	1378.04
	16	100	−0.7 (−43.4, 45.9)	−3.0 (−43.8, 42.1)	−1.4 (−44.0, 45.2)	−5.1 (−44.4, 38.9)	94.49	95.15
		200	−0.6 (−71.3, 74.0)	−5.2 (−74.7, 68.3)	−0.9 (−71.5, 74.0)	−8.5 (−77.1, 64.6)	212.88	214.54
		1000	0.1 (−207.5, 217.3)	−9.5 (−216.6, 207.3)	0.1 (−207.6, 217.4)	−15.9 (−222.8, 200.6)	1374.48	1376.42
50	1	100	0.1 (−64.8, 62.8)	−7.4 (−71.8, 54.9)	−0.4 (−65.8, 62.3)	−12.8 (−77.4, 49.1)	432.37	434.56
		200	−1.1 (−110.0, 104.1)	−10.7 (−119.2, 94.2)	−1.3 (−109.7, 104.7)	−17.2 (−125.1, 88.2)	1000.80	1002.75
		1000	3.0 (−322.5, 325.7)	−11.5 (−336.8, 311.1)	2.9 (−322.5, 325.9)	−21.2 (−346.3, 301.5)	6602.19	6603.65
	2	100	−0.4 (−64.5, 63.2)	−7.8 (−71.4, 55.2)	−0.9 (−65.6, 62.8)	−13.3 (−77.1, 49.7)	432.28	434.21
		200	0.9 (−108.1, 107.4)	−8.7 (−117.3, 97.4)	0.6 (−108.1, 106.9)	−15.3 (−123.5, 90.4)	999.82	1001.35
		1000	−2.9 (−322.7, 319.4)	−17.4 (−337.0, 304.8)	−3.0 (−322.5, 318.6)	−27.1 (−346.4, 294.3)	6601.73	6601.85
	4	100	−0.9 (−66.0, 61.9)	−8.4 (−73.0, 54.0)	−1.6 (−66.9, 61.3)	−14.0 (−78.4, 48.2)	432.62	434.21
		200	0.1 (−107.2, 104.3)	−9.5 (−116.4, 94.4)	−0.3 (−107.3, 103.9)	−16.3 (−122.6, 87.4)	999.69	1000.66
		1000	−3.0 (−323.1, 312.3)	−17.4 (−337.4, 297.7)	−3.2 (−323.1, 311.9)	−27.3 (−347.0, 287.6)	6595.55	6595.42
	16	100	0.1 (−64.3, 63.4)	−7.3 (−71.2, 55.5)	−0.7 (−65.6, 62.4)	−13.1 (−77.0, 49.3)	431.82	433.14
		200	−1.1 (−107.8, 104.4)	−10.6 (−117.0, 94.5)	−1.6 (−108.1, 104.4)	−17.6 (−123.4, 87.9)	999.22	999.75
		1000	0.8 (−320.5, 319.0)	−13.7 (−334.8, 304.4)	0.5 (−321.0, 318.5)	−23.6 (−344.8, 294.1)	6589.13	6589.34
0	1	100	−0.4 (−9.9, 5.9)	−9.8 (−19.2, −3.5)	−1.1 (−11.2, 6.0)	−16.7 (−26.7, −9.7)	728.44	728.71

200

−0.1 (−12.1, 8.8)

−11.7 (−23.7,−2.8)

−0.5 (−13.2, 9.0)

−19.8 (−32.4,−10.3)

1722.43

1722.27

1000

0.2 (−23.8, 21.1)

−16.2 (−40.3, 4.6)

0.1 (−24.3, 21.3)

−27.4 (−51.8,−6.3)

11779.42

11780.44

100

−0.5 (−17.2, 14.1)

−9.6 (−26.2, 4.7)

−1.2 (−18.4, 13.8)

−16.6 (−33.6,−1.7)

705.35

706.37

200

−0.2 (−23.6, 20.8)

−11.6 (−34.8, 9.3)

−0.5 (−23.9, 20.5)

−19.5 (−42.9, 1.4)

1675.67

1676.76

1000

0.4 (−48.9, 48.0)

−15.9 (−65.2, 31.7)

0.3 (−49.3, 48.2)

−27.0 (−76.6, 20.8)

11548.20

11549.24

100

−0.5 (−24.1, 21.7)

−9.2 (−32.6, 12.8)

−1.1 (−24.9, 21.4)

−16.0 (−39.5, 6.2)

657.00

658.21

200

−0.5 (−33.3, 30.8)

−11.4 (−44.1, 19.8)

−0.9 (−34.0, 30.7)

−19.5 (−52.4, 12.0)

1578.08

1579.22

1000

−0.4 (−72.1, 69.9)

−16.2 (−87.8, 54.0)

−0.4 (−72.1, 69.9)

−27.3 (−98.8, 43.0)

11053.44

11054.48

100

−0.3 (−26.8, 26.5)

−7.9 (−34.0, 18.4)

−0.8 (−27.7, 26.2)

−14.6 (−41.0, 12.0)

575.12

576.30

200

−0.1 (−37.4, 37.8)

−10.0 (−47.0, 27.6)

−0.4 (−37.8, 37.6)

−17.9 (−55.1, 19.8)

1411.05

1412.15

1000

0.4 (−81.5, 84.4)

−14.4 (−96.3, 69.4)

0.4 (−82.0, 84.4)

−25.5 (−107.8, 58.5)

10203.57

10204.62

Note: MR, mean risk; Mean difference (5 percentile, 95 percentile), mean of the difference between the estimated mean risk, $\hat{MR}$, and the value of the information criterion in each model and its 5 and 95 percentiles; $\hat{MR}_{sel}$, estimated MR for the selected model based on new data; c, proportion of random censoring; θ, regression parameters; n, total sample size. The values that are superior to the others are highlighted.
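As a minimal sketch of how one cell of Table 4 could be summarized, suppose hypothetical lists `mr_hat` and `criterion` hold, for each simulation replicate, the estimated MR and the criterion value of a single model. The function names, the nearest-rank percentile rule, and the sign convention (criterion minus estimated MR, so that a downward-biased criterion gives negative differences) are assumptions made for illustration, not the authors' code.

```python
import math
import statistics


def percentile(values, q):
    """Nearest-rank percentile of a non-empty sequence, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_difference(mr_hat, criterion):
    """Mean, 5th and 95th percentile of (criterion - mr_hat) over replicates."""
    diffs = [ic - mr for mr, ic in zip(mr_hat, criterion)]
    return statistics.mean(diffs), percentile(diffs, 5), percentile(diffs, 95)


# Hypothetical toy values for three replicates (not simulation output).
mr_hat = [100.0, 110.0, 120.0]
criterion = [99.0, 111.0, 118.0]
print(summarize_difference(mr_hat, criterion))
```

With R=20,000 replicates, the choice of percentile rule should have little effect on the reported values.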

Table 5 shows the selection probability of Model 6 (the true model) and Model 11 (the full model) for AICF and AIC^∗ when q=0.5. For larger parameter values, larger sample sizes, and less censoring, the selection probability of the true model is larger for AICF than for AIC^∗. For smaller parameter values, smaller sample sizes, and more censoring, it is larger for AIC^∗ than for AICF. The selection probability of the full model is smaller for AICF than for AIC^∗, and for AIC^∗ it increases with the number of events. In particular, for n=1000, the selection probability of the full model for AIC^∗ is equal to one because of the term $(2 - \log n)p$ in AIC^∗, as discussed in Section 3.3.
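The sign of the term $(2 - \log n)p$ can be checked with simple arithmetic. The sketch below assumes the natural logarithm and takes n to be whatever quantity enters the term in Section 3.3; the function name is hypothetical. The term is negative whenever n exceeds e² (about 7.4), and at n=1000 it is roughly −4.9 per parameter, so each additional parameter lowers the criterion instead of penalizing it.

```python
import math


def aic_star_extra_term(n, p):
    """Value of (2 - log n) * p, using the natural logarithm (an assumption)."""
    return (2.0 - math.log(n)) * p


# Tabulate the term for the sample sizes used in the simulations.
for n in (100, 200, 1000):
    print(n, [round(aic_star_extra_term(n, p), 2) for p in (1, 5, 10)])
```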

Table 5.

The selection probability (the proportion of covariates: q=0.5; the number of simulations: R=20,000).

c (%)	θ	n	Model 6 (true model)				Model 11 (full model)
			AICF	AIC^∗	BICF	BIC^∗	AICF	AIC^∗	BICF	BIC^∗
90	1.3	100	0.060	0.106	0.050	0.087	0.008	0.039	0.007	0.016
90	1.3	200	0.058	0.121	0.026	0.094	0.005	0.239	0.001	0.016
90	1.3	1000	0.059	0.000	0.005	0.094	0.005	1.000	0.000	0.019
90	2	100	0.063	0.110	0.052	0.092	0.008	0.041	0.007	0.019
90	2	200	0.060	0.116	0.024	0.093	0.005	0.238	0.001	0.018
90	2	1000	0.062	0.000	0.006	0.096	0.004	1.000	0.000	0.018
90	4	100	0.061	0.106	0.052	0.086	0.008	0.042	0.007	0.018
90	4	200	0.061	0.115	0.026	0.092	0.006	0.239	0.001	0.018
90	4	1000	0.055	0.000	0.005	0.090	0.005	1.000	0.000	0.019
90	16	100	0.062	0.110	0.052	0.091	0.008	0.043	0.006	0.018
90	16	200	0.058	0.119	0.023	0.093	0.006	0.234	0.001	0.017
90	16	1000	0.060	0.000	0.005	0.098	0.005	1.000	0.000	0.020
50	1.3	100	0.062	0.000	0.011	0.095	0.007	1.000	0.000	0.021
50	1.3	200	0.064	0.000	0.006	0.104	0.006	1.000	0.000	0.022
50	1.3	1000	0.091	0.000	0.003	0.135	0.007	1.000	0.000	0.027
50	2	100	0.082	0.000	0.018	0.122	0.010	1.000	0.000	0.027
50	2	200	0.099	0.000	0.015	0.140	0.010	1.000	0.000	0.031
50	2	1000	0.258	0.000	0.028	0.294	0.025	1.000	0.000	0.070
50	4	100	0.109	0.000	0.030	0.148	0.017	1.000	0.000	0.039
50	4	200	0.154	0.000	0.031	0.194	0.020	1.000	0.000	0.052
50	4	1000	0.469	0.000	0.142	0.462	0.053	1.000	0.000	0.116
50	16	100	0.133	0.000	0.044	0.172	0.022	1.000	0.002	0.050
50	16	200	0.203	0.000	0.054	0.242	0.029	1.000	0.001	0.066
50	16	1000	0.596	0.000	0.305	0.544	0.072	1.000	0.000	0.143
0	1.3	100	0.270	0.000	0.087	0.304	0.033	1.000	0.001	0.067
0	1.3	200	0.468	0.000	0.188	0.468	0.054	1.000	0.001	0.108
0	1.3	1000	0.778	0.000	0.919	0.658	0.085	1.000	0.001	0.155
0	2	100	0.745	0.000	0.820	0.662	0.091	1.000	0.007	0.145
0	2	200	0.778	0.000	0.968	0.670	0.086	1.000	0.003	0.150
0	2	1000	0.786	0.000	0.991	0.670	0.081	1.000	0.001	0.149
0	4	100	0.772	0.000	0.959	0.676	0.089	1.000	0.007	0.144
0	4	200	0.773	0.000	0.971	0.664	0.085	1.000	0.004	0.150
0	4	1000	0.786	0.000	0.990	0.670	0.081	1.000	0.001	0.149
0	16	100	0.774	0.000	0.958	0.690	0.088	1.000	0.007	0.137
0	16	200	0.779	0.000	0.974	0.679	0.085	1.000	0.003	0.144
0	16	1000	0.788	0.000	0.991	0.668	0.081	1.000	0.001	0.151

Note: c, proportion of random censoring; θ, regression parameters; n, total sample size. The values that are superior to the others are highlighted.
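The selection probabilities in Table 5 are proportions of replicates in which a given model attains the smallest criterion value. A minimal sketch of that tally follows, assuming each replicate is stored as a hypothetical dictionary from model name to criterion value; the data layout and function name are illustrative only.

```python
def selection_probability(replicates):
    """Proportion of replicates in which each model has the smallest criterion.

    replicates: non-empty list of dicts mapping model name -> criterion value.
    Ties are resolved in favour of the first model in dictionary order.
    """
    counts = {model: 0 for model in replicates[0]}
    for values in replicates:
        best = min(values, key=values.get)
        counts[best] += 1
    total = len(replicates)
    return {model: count / total for model, count in counts.items()}


# Hypothetical criterion values for four replicates and two candidate models.
toy = [
    {"Model 6": 1.0, "Model 11": 2.0},
    {"Model 6": 3.0, "Model 11": 2.5},
    {"Model 6": 0.5, "Model 11": 0.7},
    {"Model 6": 1.0, "Model 11": 4.0},
]
print(selection_probability(toy))
```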

Table 5 also shows the selection probability of Model 6 (true model) and Model 11 (full model) for BICF and BIC^∗ when q=0.5. For larger parameter values, larger sample sizes, and less censoring, the selection probability of the true model was larger for BICF than for BIC^∗. For smaller parameter values, smaller sample sizes, and more censoring, it was larger for BIC^∗ than for BICF. The selection probability of the full model was smallest for BICF. Although a BIC‐type criterion is designed to identify the true model as n→∞, the selection probability of the true model for BIC^∗ was smaller than that for BICF for a large number of samples. For instance, in the case with c=0%, θ=16, and n=1000, the selection probabilities of the true model for BICF and BIC^∗ were 0.991 and 0.668, respectively, while those of the full model were 0.001 and 0.151. This is because of the properties of BIC^∗ discussed in Section 3.3.

The results of the other models and conditions are presented in Data S2 (Tables S1–S8 for bias, Table S9 for prediction performance, and Tables S10–S17 for selection probability); these results reveal the same tendencies as discussed previously.

In summary, AICF can estimate the risk function and shows better prediction performance than AIC^∗. By contrast, because AIC^∗ selects models that have many parameters and ignores the adequacy of the model as the number of events increases, it is less efficient and is not recommended for model selection. BICF can select the true model as n→∞. In contrast, because the selection probability of the true model for BIC^∗ was smaller than that for BICF for a large number of samples, BIC^∗ is less efficient in selecting the true model.

5. Application of the prospective observational study data

Next, we apply the proposed method to the study introduced in Section 2. We select a model based on AICF in the same manner as in that section; the results are shown in Table 6. Whereas AIC^∗ tends to select models that have many parameters (Table 2), AICF selected the model that includes ntumor (number of tumors), ptumor (primary tumor category), and neuro (neurological symptoms) as the best. The best model under AICF does not include unnecessary covariates such as age and status, whose hazard ratios were almost equal to one (Table 3).

Table 6.

The top five models based on AICF for the study data.

Model						AICF
	ntumor		ptumor		neuro	1753.84
	ntumor				neuro	1754.72
sex	ntumor		ptumor		neuro	1754.90
	ntumor	diameter	ptumor		neuro	1755.83
	ntumor		ptumor	status	neuro	1755.83

Note: ntumor, number of tumors; diameter, maximum diameter of largest tumor; ptumor, primary tumor category; status, extracerebral disease status; neuro, neurological symptoms; AICF, AIC for Firth's penalized partial likelihood approach.

The estimated hazard ratios for the best model are shown in Table 7. The best model under AICF included ntumor (1 vs. 2–4: HR=0.71, p‐value=0.08; 5–10 vs. 2–4: HR=1.57, p‐value=0.04), ptumor (breast vs. lung: HR=0.94, p‐value=0.82; gastrointestinal vs. lung: HR=1.60, p‐value=0.21; kidney vs. lung: HR=0.12, p‐value=0.02; others vs. lung: HR=0.87, p‐value=0.79), and neuro (HR=1.53, p‐value=0.04). The covariates selected under AICF are more strongly associated with survival than age (Table 3; HR=1.00, p‐value=0.98) and status (Table 3; HR=1.03, p‐value=0.85). Therefore, the best model under AICF seems plausible and may be more appropriate than the best model under AIC^∗. Additionally, the best model under AICF is clinically interpretable because the selected variables are prognostic factors for brain metastases 20, 21, 22.

Table 7.

Parameter estimates of the best model based on AICF for the study data.

Covariate	HR	SE	95% CI		p (LR)
ntumor
1 vs. 2–4	0.71	0.19	0.49	1.05	0.08
5–10 vs. 2–4	1.57	0.21	1.03	2.38	0.04
ptumor
Breast vs. lung	0.94	0.26	0.57	1.57	0.82
GI vs. lung	1.60	0.36	0.79	3.26	0.21
Renal cell vs. lung	0.12	1.43	0.01	1.99	0.02
Others vs. lung	0.87	0.55	0.30	2.56	0.79
neuro (yes vs. no)	1.53	0.20	1.04	2.24	0.04

Note: HR, hazard ratio; ntumor, number of tumors; ptumor, primary tumor category; neuro, neurological symptoms; GI, gastrointestinal; LR, likelihood ratio; SE, standard error.
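The source does not state how the 95% CIs in Table 7 were constructed. As a rough illustration, the sketch below computes a Wald‐type interval on the log hazard ratio scale, assuming that the SE column is the standard error of the log hazard ratio; the function name is hypothetical. Under that assumption, the tabulated limits are approximately reproduced up to rounding of the inputs, but this is not confirmation of the method used.

```python
import math


def wald_ci(hr, se_log_hr, z=1.959964):
    """Wald-type 95% interval for a hazard ratio, built on the log scale."""
    log_hr = math.log(hr)
    return math.exp(log_hr - z * se_log_hr), math.exp(log_hr + z * se_log_hr)


# Renal cell vs. lung row of Table 7 (HR=0.12, SE=1.43).
lower, upper = wald_ci(0.12, 1.43)
print(round(lower, 2), round(upper, 2))
```

Because the p‐values in Table 7 come from likelihood ratio tests, they need not agree with an interval of this type, as the renal cell row shows (p‐value=0.02 with an interval that includes one).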

6. Discussion and conclusion

One solution to the monotone likelihood problem, which is an important issue in Cox regression models, is Firth's penalized partial likelihood approach. However, the model selection criteria for this approach are yet to be studied, and heuristic criteria, AIC^∗ and BIC^∗, are used in the SAS PHREG procedure. Therefore, in this study, we proposed alternative criteria, AICF and BICF, which work for Firth's penalized partial likelihood approach. Moreover, we discussed the justification for adopting Firth's bias correction method in Cox regression models.

We showed that AICF, an estimator of the risk function based on KL information, does not include the penalty term of AIC^∗, $0.5 \log | I(\hat{β}) |$. Even in the penalized partial likelihood setting, the non‐penalized likelihood should be used for risk estimation. In addition, AICF is more efficient than AIC^∗ in simulations and works well when addressing monotone likelihood. The simulation results revealed systematic bias in AIC^∗, and model selection based on AICF showed superior predictive performance. In any case, AIC^∗ is not recommended for model selection. The application to real data also indicated that AICF has better properties than AIC^∗ and that the latter leads to incorrect results. Moreover, we showed that BIC^∗ has a negative penalty term in proportion to the number of regression parameters. The selection probability of the true model for BIC^∗ was smaller than that of BICF for a large number of samples, indicating that BIC^∗ is less efficient for selecting the true model. In summary, we showed that AICF or BICF is appropriate for model selection in monotone likelihood cases. AICF should be used for prediction, while BICF can be used for selecting the true model.
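The contrast between the two AIC‐type criteria can be sketched as follows. This is an illustration under stated assumptions, not the definitions from Section 3: it assumes both criteria take the familiar form of minus twice a log‐likelihood plus 2p, that AICF uses the non‐penalized partial log‐likelihood evaluated at the Firth estimate, and that AIC^∗ uses the penalized version $l^{*} = l + 0.5 \log | I(\hat{β}) |$. The function and argument names are hypothetical.

```python
def aic_f(loglik, p):
    """AIC-type value from the non-penalized partial log-likelihood (assumed form)."""
    return -2.0 * loglik + 2.0 * p


def aic_star(loglik, logdet_info, p):
    """AIC-type value from the penalized log-likelihood l + 0.5*log|I| (assumed form)."""
    penalized = loglik + 0.5 * logdet_info
    return -2.0 * penalized + 2.0 * p


# Hypothetical inputs: partial log-likelihood, log-determinant of the
# information matrix, and number of regression parameters.
loglik, logdet_info, p = -870.0, 10.0, 7
print(aic_f(loglik, p), aic_star(loglik, logdet_info, p))
```

Under these assumptions the two values differ by exactly $-\log | I(\hat{β}) |$, which is the extra term that AICF omits.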

Because we obtained impressive results with alternative criteria, future studies should aim to examine other model evaluation criteria such as the C‐index 23, 24. Moreover, a similar problem occurs if one uses AIC and BIC based on the penalized log‐likelihood under separation in logistic regression models.

Appendix A. Justification for using Firth's bias correction method in Cox regression models

Firth's bias correction method is based on the relationships $κ_{j,k} + κ_{jk} = 0$, $κ_{j,k,l} + κ_{j,kl} + κ_{k,jl} + κ_{l,jk} + κ_{jkl} = 0$, and $κ_{j,kl} = 0$, as discussed in Section 3.2. However, the last two relationships have not yet been evaluated in Cox regression models. Here, we prove only that $κ_{j,kl} = 0$: it is well known that $κ_{j,k} + κ_{jk} = 0$, and it can easily be shown that $κ_{j,k,l} + κ_{jkl} = 0$, which together with $κ_{j,kl} = 0$ gives the second relationship.

Lemma 1 ($κ_{j,kl} = 0$.)

Evaluating the functions $U_{j}(β)$ and $U_{kl}(β)$ at $β_{0}$ gives

$U_{j} (β_{0}) = \sum_{i = 1}^{n} \int_{0}^{1} H_{i j} (x) d M_{i} (x),$

$H_{i j} (t) = Z_{i j} (t) - \frac{S_{j}^{(1)} (β_{0}, t)}{S^{(0)} (β_{0}, t)},$

and

$U_{k l} (β_{0}) = ⟨ U_{k l} (β_{0}) ⟩ (1) + \sum_{i = 1}^{n} \int_{0}^{1} G_{k l} (x) d M_{i} (x),$

$G_{k l} (t) = \frac{S_{k l}^{(2)} (β_{0}, t)}{S^{(0)} (β_{0}, t)} - \frac{S_{k}^{(1)} (β_{0}, t) S_{l}^{(1)} (β_{0}, t)}{{S^{(0)} (β_{0}, t)}^{2}},$

where $S_{j}^{(1)}(β, t) = n^{-1} \sum_{i=1}^{n} Y_{i}(t) Z_{ij}(t) \exp\{β^{T} Z_{i}(t)\}$, $S_{kl}^{(2)}(β, t) = n^{-1} \sum_{i=1}^{n} Y_{i}(t) Z_{ik}(t) Z_{il}(t) \exp\{β^{T} Z_{i}(t)\}$, and $⟨U_{kl}(β_{0})⟩(t) = n \int_{0}^{t} G_{kl}(x) S^{(0)}(β_{0}, x) h_{0}(x) \, dx$.

We note that $H_{ij}$ and $G_{kl}$ are predictable processes according to the assumption made in Section 3.1. Here,

$\begin{aligned} U_{j}(β_{0}) U_{kl}(β_{0}) = {} & ⟨U_{kl}(β_{0})⟩(1) \sum_{i=1}^{n} \int_{0}^{1} H_{ij}(x) \, dM_{i}(x) \\ & + \left\{ \sum_{i=1}^{n} \int_{0}^{1} H_{ij}(x) \, dM_{i}(x) \right\} \left\{ \sum_{i=1}^{n} \int_{0}^{1} G_{kl}(x) \, dM_{i}(x) \right\} . \end{aligned}$

From theorem 2.4.4 of 25, we have

$\begin{aligned} κ_{j,kl} = {} & n^{-1} E \Big[ ⟨U_{kl}(β_{0})⟩(1) \sum_{i=1}^{n} \int_{0}^{1} H_{ij}(x) \, dM_{i}(x) \\ & + \sum_{h=1}^{n} \sum_{i=1}^{n} \int_{0}^{1} H_{hj}(x) G_{kl}(x) \, d⟨M_{h}, M_{i}⟩(x) \Big] . \end{aligned}$

Because M _i(t) is the counting process martingale,

$E [\sum_{i = 1}^{n} \int_{0}^{1} H_{i j} (x) d M_{i} (x)] = 0,$

and from the orthogonality of martingales

$⟨M_{h}, M_{i}⟩(t) = \begin{cases} \int_{0}^{t} Y_{h}(x) \exp\{β_{0}^{T} Z_{h}(x)\} h_{0}(x) \, dx, & h = i, \\ 0, & h \neq i. \end{cases}$

It follows that

$\begin{aligned} κ_{j,kl} & = 0 + n^{-1} E \left[ \int_{0}^{1} \left[ \sum_{i=1}^{n} \left\{ Z_{ij}(x) - \frac{S_{j}^{(1)}(β_{0}, x)}{S^{(0)}(β_{0}, x)} \right\} Y_{i}(x) \exp\{β_{0}^{T} Z_{i}(x)\} \right] G_{kl}(x) h_{0}(x) \, dx \right] \\ & = 0 . \end{aligned}$
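The final equality can be verified directly, because the inner sum over i vanishes for every x. As a short check, assuming the usual definition $S^{(0)}(β, t) = n^{-1} \sum_{i=1}^{n} Y_{i}(t) \exp\{β^{T} Z_{i}(t)\}$ (it is used above but defined outside this appendix) together with the definition of $S_{j}^{(1)}$ given earlier in the proof:

$\begin{aligned} \sum_{i=1}^{n} \left\{ Z_{ij}(x) - \frac{S_{j}^{(1)}(β_{0}, x)}{S^{(0)}(β_{0}, x)} \right\} Y_{i}(x) \exp\{β_{0}^{T} Z_{i}(x)\} & = n S_{j}^{(1)}(β_{0}, x) - \frac{S_{j}^{(1)}(β_{0}, x)}{S^{(0)}(β_{0}, x)} \, n S^{(0)}(β_{0}, x) \\ & = 0 . \end{aligned}$

Hence the integrand is identically zero, and so is its expectation.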

Therefore, the relationships $κ_{j,k} + κ_{jk} = 0$, $κ_{j,k,l} + κ_{j,kl} + κ_{k,jl} + κ_{l,jk} + κ_{jkl} = 0$, and $κ_{j,kl} = 0$ are also true in Cox regression models.

□

Appendix B. Consistency of $\hat{β}$

In this section, we discuss the consistency of $\hat{β}$ . The following list of conditions will be assumed:

(A) $\int_{0}^{1} h_{0}(x) \, dx < \infty$.

(B) There exists a neighborhood $B$ of $β_{0}$ and a scalar function, $s^{(0)}$, a vector function, $s^{(1)}$, and a matrix function, $s^{(2)}$, defined on $B \times [0, 1]$ such that for k=0,1,2, $\sup_{t \in [0, 1], β \in B} \| S^{(k)}(β, t) - s^{(k)}(β, t) \| \overset{P}{\to} 0$.

(C) There exists δ>0 such that $n^{-1/2} \sup_{t \in [0, 1], 1 ⩽ i ⩽ n} | Z_{i}(t) | Y_{i}(t) 1_{\{β_{0}^{T} Z_{i}(t) > -δ | Z_{i}(t) |\}} \overset{P}{\to} 0$, where $1_{\{\cdot\}}$ is an indicator function.

(D) For all $β \in B$, t∈[0,1]: $s^{(0)}(β, t)$, $s^{(1)}(β, t) = ∂s^{(0)}(β, t)/∂β$, and $s^{(2)}(β, t) = ∂s^{(1)}(β, t)/∂β^{T}$ are continuous functions of $β \in B$, uniformly in t∈[0,1]; $s^{(0)}$, $s^{(1)}$, and $s^{(2)}$ are bounded on $B \times [0, 1]$; $s^{(0)}$ is bounded away from zero on $B \times [0, 1]$; and the matrix $Σ(β_{0})$ is positive definite.

These conditions are identical to those given by Andersen and Gill 12.

Lemma 2 ($\hat{β} \overset{P}{\to} β_{0}$.)

Consider the process

$\begin{aligned} X_{n}^{*}(β, 1) & = n^{-1} \{ l^{*}(N, Y, Z; β) - l^{*}(N, Y, Z; β_{0}) \} \\ & = X_{n}(β, 1) + 0.5 n^{-1} \{ \log | I(β) | - \log | I(β_{0}) | \}, \end{aligned}$

where $X_{n}(β, 1) = n^{-1} \{ l(N, Y, Z; β) - l(N, Y, Z; β_{0}) \}$. To prove the consistency of the usual Cox regression estimator $\hat{β}_{Cox}$, Andersen and Gill 12 showed that $X_{n}(β, 1)$ is a concave function and proved that $X_{n}(β, 1)$ converges in probability to

$A(β, 1) = \int_{0}^{1} \left[ (β - β_{0})^{T} s^{(1)}(β_{0}, x) - \log \left\{ \frac{s^{(0)}(β, x)}{s^{(0)}(β_{0}, x)} \right\} s^{(0)}(β_{0}, x) \right] h_{0}(x) \, dx,$

for each $β \in B$. The first derivative of $A(β, 1)$ is zero at $β = β_{0}$, and the second derivative is minus a positive definite matrix. In other words, $A(β, 1)$ is a concave function of β with a unique maximum at $β = β_{0}$. If $X_{n}(β, 1)$ is a concave function, then $\hat{β}_{Cox} \overset{P}{\to} β_{0}$ by applying theorem II.1 of 12.

In the same manner, if $X_{n}^{*} (β, 1)$ is concave and $X_{n}^{*} (β, 1)$ converges to a concave function of β with a unique maximum at β=β ₀, the consistency of $\hat{β}$ can be shown by applying theorem II.1 of 12.

In monotone likelihood settings, the partial log‐likelihood function converges to a finite value. Fixing $β_{1}, β_{2}, \dots, β_{j-1}, β_{j+1}, \dots, β_{p}$ to be $\hat{β}_{1}, \hat{β}_{2}, \dots, \hat{β}_{j-1}, \hat{β}_{j+1}, \dots, \hat{β}_{p}$, for the parameter $β_{j}$, a real number c, and a constant d,

$\exists c, \forall β_{j} ⩾ c, l (N, Y, Z; {\hat{β}}_{1}, {\hat{β}}_{2}, \dots, {\hat{β}}_{j - 1}, β_{j}, {\hat{β}}_{j + 1}, \dots, {\hat{β}}_{p}) = d,$

or

$\exists c, \forall β_{j} ⩽ c, l (N, Y, Z; {\hat{β}}_{1}, {\hat{β}}_{2}, \dots, {\hat{β}}_{j - 1}, β_{j}, {\hat{β}}_{j + 1}, \dots, {\hat{β}}_{p}) = d,$

in monotone likelihood settings. Therefore, X _n(β,1) is a concave function. Note that l(N,Y,Z;β ₀) and log|I(β ₀)| are obviously finite constants. Thus, it is sufficient to show that log|I(β)| is a concave function. Now, I(β) is a positive semidefinite matrix (see also Prentice 26) because l(N,Y,Z;β) is a concave function. Generally, the function log|C|, where |C| is the determinant of a positive semidefinite matrix C, is concave 27. Therefore, $X_{n}^{*} (β, 1)$ is a sum of concave functions.
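The matrix property cited from 27 can be illustrated numerically. The sketch below is an illustration only: it draws random symmetric positive definite 2×2 matrices (strict definiteness is assumed so that the logarithm is finite) and checks midpoint concavity of the log‐determinant. All function names are hypothetical, and the check concerns the matrix function log|C| alone; it says nothing about how I(β) varies with β.

```python
import math
import random


def logdet2(m):
    """Log-determinant of a 2x2 matrix with positive determinant."""
    (a, b), (c, d) = m
    return math.log(a * d - b * c)


def random_spd2(rng):
    """Random symmetric positive definite 2x2 matrix, built as G G^T + 0.1 I."""
    g = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
    a = g[0][0] ** 2 + g[0][1] ** 2 + 0.1
    b = g[0][0] * g[1][0] + g[0][1] * g[1][1]
    d = g[1][0] ** 2 + g[1][1] ** 2 + 0.1
    return [[a, b], [b, d]]


def midpoint_gap(c1, c2):
    """log|(C1 + C2)/2| - (log|C1| + log|C2|)/2, nonnegative under concavity."""
    mid = [[(c1[i][j] + c2[i][j]) / 2 for j in range(2)] for i in range(2)]
    return logdet2(mid) - 0.5 * (logdet2(c1) + logdet2(c2))


rng = random.Random(0)
gaps = [midpoint_gap(random_spd2(rng), random_spd2(rng)) for _ in range(1000)]
print(min(gaps) >= -1e-9)
```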

We next show that $X_{n}^{*} (β, 1)$ converges in probability to A(β,1). According to the aforementioned result, X _n(β,1) converges in probability to A(β,1). Conditions (B) and (D) imply that, for each $β \in B$ , $n^{- 1} I (β) \overset{P}{\to} Σ (β)$ . Thus,

$\begin{aligned} \log | I(β) | - \log | I(β_{0}) | & = \log | n \cdot n^{-1} I(β) | - \log | n \cdot n^{-1} I(β_{0}) | \\ & = \log | n^{-1} I(β) | - \log | n^{-1} I(β_{0}) | \\ & = \log \{ n^{-p} | I(β) | \} - \log \{ n^{-p} | I(β_{0}) | \} \end{aligned}$

converges in probability to some finite quantity. Therefore,

$0.5 n^{-1} \{ \log | I(β) | - \log | I(β_{0}) | \} \overset{P}{\to} 0,$

and $X_{n}^{*} (β, 1)$ converges in probability to A(β,1), which is a concave function of β with a unique maximum at β=β ₀.

These facts establish that $\hat{β} \overset{P}{\to} β_{0}$ as n→∞.

Moreover, from this consistency, we can apply theorem 3.2 of 12; therefore, $n^{- 1} I (\hat{β}) \overset{P}{\to} Σ (β_{0})$ in monotone likelihood settings.

□

Supporting information

Supplementary Table S1: The mean of the difference between the estimated MR and AICF, and its 5 and 95 percentiles (q = 0.5, Model 1–6)

Supplementary Table S2: The mean of the difference between the estimated MR and AICF, and its 5 and 95 percentiles (q = 0.5, Model 7–11)

Supplementary Table S3: The mean of the difference between the estimated MR and AIC^*, and its 5 and 95 percentiles (q = 0.5, Model 1–6)

Supplementary Table S4: The mean of the difference between the estimated MR and AIC^*, and its 5 and 95 percentiles (q = 0.5, Model 7–11)

Supplementary Table S5: The mean of the difference between the estimated MR and AICF, and its 5 and 95 percentiles (q = 0.8, Model 1–6)

Supplementary Table S6: The mean of the difference between the estimated MR and AICF, and its 5 and 95 percentiles (q = 0.8, Model 7–11)

Supplementary Table S7: The mean of the difference between the estimated MR and AIC^*, and its 5 and 95 percentiles (q = 0.8, Model 1–6)

Supplementary Table S8: The mean of the difference between the estimated MR and AIC^*, and its 5 and 95 percentiles (q = 0.8, Model 7–11)

Supplementary Table S9: The estimated MR for the selected model $\hat{MR}$

Supplementary Table S10: The selection probability of AICF (q = 0.5)

Supplementary Table S11: The selection probability of AIC^* (q = 0.5)

Supplementary Table S12: The selection probability of BICF (q = 0.5)

Supplementary Table S13: The selection probability of BIC^* (q = 0.5)

Supplementary Table S14: The selection probability of AICF (q = 0.8)

Supplementary Table S15: The selection probability of AIC^* (q = 0.8)

Supplementary Table S16: The selection probability of BICF (q = 0.8)

Supplementary Table S17: The selection probability of BIC^* (q = 0.8)

Code 1: A SAS program for calculating AICF and BICF

Code 2: An R program for calculating AICF and BICF

Acknowledgements

The authors would like to thank the associate editor and two reviewers for very insightful and constructive comments that substantially improved the article. The authors would like to thank Professor Masaaki Yamamoto of Katsuta Hospital Mito Gamma House and Dr. Toru Serizawa of Tsukiji Neurological Clinic for providing the dataset of the JLGK0901 study. This work was supported by the Japan Society for the Promotion of Science KAKENHI grant numbers 26870099 and 16K16014 and the MHLW‐funded Core Clinical Research Centers.

Nagashima, K., and Sato, Y. (2017) Information criteria for Firth's penalized partial likelihood approach in Cox regression models. Statist. Med., 36: 3422–3436. doi: 10.1002/sim.7368.

The copyright line for this article was changed on 20 November 2017 after original online publication.

References

1. Cox DR. Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B 1972; 34(2):187‐220.
2. Bryson MC, Johnson ME. The incidence of monotone likelihood in the Cox model. Technometrics 1981; 23(4):381‐383.
3. Heinze G, Schemper M. A solution to the problem of monotone likelihood in Cox regression. Biometrics 2001; 57(1):114‐119.
4. Firth D. Bias reduction of maximum likelihood estimates. Biometrika 1993; 80(1):27‐38.
5. Heinze G. The application of Firth's procedure to Cox and logistic regression, Technical Report 10/1999, update in January 2001, Section of Clinical Biometrics, Department of Medical Computer Sciences, University of Vienna, 2001.
6. Heinze G. A comparative investigation of methods for logistic regression with separated or nearly separated data. Statistics in Medicine 2006; 25(24):4216‐4226.
7. Heinze G, Schemper M. A solution to the problem of separation in logistic regression. Statistics in Medicine 2002; 21(16):2409‐2419.
8. Akaike H. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, Petrov BN, Csaki F, eds. Akademiai Kiado: Budapest, 1973; 267‐281.
9. Schwarz G. Estimating the dimension of a model. The Annals of Statistics 1978; 6(2):461‐464.
10. Yamamoto M, Serizawa T, Shuto T, Akabane A, Higuchi Y, Kawagishi J, Yamanaka K, Sato Y, Jokura H, Yomo S, Nagano O, Kenai H, Moriki A, Suzuki S, Kida Y, Iwai Y, Hayashi M, Onishi H, Gondo M, Sato M, Akimitsu T, Kubo K, Kikuchi Y, Shibasaki T, Goto T, Takanashi M, Mori Y, Takakura K, Saeki N, Kunieda E, Aoyama H, Momoshima S, Tsuchiya K. Stereotactic radiosurgery for patients with multiple brain metastases (JLGK0901): a multi‐institutional prospective observational study. The Lancet Oncology 2014; 15(4):387‐395.
11. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. 2nd ed. John Wiley & Sons: New Jersey, 2002.
12. Andersen PK, Gill RD. Cox's regression model for counting processes: a large sample study. The Annals of Statistics 1982; 10(4):1100‐1120.
13. Cox DR, Snell EJ. A general definition of residuals. Journal of the Royal Statistical Society, Series B 1968; 30(2):248‐275.
14. Volinsky CT, Raftery AE. Bayesian information criterion for censored survival models. Biometrics 2000; 56(1):256‐262.
15. Xu R, Vaida F, Harrington DP. Using profile likelihood for semiparametric model selection with application to proportional hazards mixed models. Statistica Sinica 2009; 19(2):819‐842.
16. Claeskens G, Carroll RJ. An asymptotic theory for model selection inference in general semiparametric problems. Biometrika 2007; 94(2):249‐265.
17. Bailey KR. The asymptotic joint distribution of regression and survival parameter estimates in the Cox regression model. The Annals of Statistics 1983; 11(1):39‐48.
18. Johansen S. An extension of Cox's regression model. International Statistical Review 1983; 51(2):165‐174.
19. Murphy SA, van der Vaart AW. On profile likelihood. Journal of the American Statistical Association 2000; 95(450):449‐465.
20. Joseph J, Adler JR, Cox RS, Hancock SL. Linear accelerator‐based stereotaxic radiosurgery for brain metastases: the influence of number of lesions on survival. Journal of Clinical Oncology 1996; 14(4):1085‐1092.
21. Balm M, Hammack J. Leptomeningeal carcinomatosis. Presenting features and prognostic factors. Archives of Neurology 1996; 53(7):626‐632.
22. Sperduto PW, Kased N, Roberge D, Xu Z, Shanley R, Luo X, Sneed PK, Chao ST, Weil RJ, Suh J, Bhatt A, Jensen AW, Brown PD, Shih HA, Kirkpatrick J, Gaspar LE, Fiveash JB, Chiang V, Knisely JP, Sperduto CM, Lin N, Mehta M. Summary report on the graded prognostic assessment: an accurate and facile diagnosis‐specific tool to estimate survival for patients with brain metastases. Journal of Clinical Oncology 2012; 30(4):419‐425.
23. Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine 1996; 15(4):361‐387.
24. Uno H, Cai T, Pencina MJ, D'Agostino RB, Wei LJ. On the C‐statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Statistics in Medicine 2011; 30(10):1105‐1117.
25. Fleming TR, Harrington DP. Counting Processes and Survival Analysis. John Wiley & Sons: New York, 1991.
26. Prentice RL, Self SG. Asymptotic distribution theory for Cox‐type regression models with general relative risk form. The Annals of Statistics 1983; 11(3):804‐813.
27. Cover TM, Thomas JA. Determinant inequalities via information theory. SIAM Journal on Matrix Analysis and Applications 1988; 9(3):384‐392.
