Abstract
This paper considers generalized partially linear models. We propose empirical likelihood based statistics to construct confidence regions for the parametric and nonparametric components. The resulting statistics are shown to be asymptotically chi-squared distributed. Finite sample performance of the proposed statistics is assessed by simulation experiments. The proposed methods are applied to a dataset from an AIDS clinical trial.
Keywords: Confidence region, generalized additive models, least favorable curve, local linear regression, semiparametric estimation
1 Introduction
Generalized partially linear models (GPLM), a generalization of partially linear models to possibly non-Gaussian responses, assume that the conditional expectation of the response variables given the covariates can be represented as
\[ E(Y \mid X, T) = \mu, \qquad \operatorname{var}(Y \mid X, T) = \sigma^2 V(\mu), \tag{1} \]
where μ = μ{X′β + θ(T)}, μ(·) is a known link function, V(·) is a known function, β is an unknown p × 1 vector, T ∈ R^q, θ is an unknown smooth function, and σ² is an unknown scalar parameter. Assume that T takes values in 𝒯, a closed rectangle in R^q. We assume that (Yi, Xi, Ti), i = 1, 2, ⋯, n, are independent and identically distributed data from model (1). GPLMs allow easier interpretation of the effect of each variable and are preferable to general nonparametric models (Stone, 1980) since they provide a partial remedy to the “curse of dimensionality,” especially when q is small, as is often the case. GPLMs are more flexible than the standard GLM because they combine parametric and nonparametric components, which is appropriate when E(Y|X, T) is believed to depend on X linearly but to be nonlinearly related to the other covariates, T.
A special class of GPLMs, partially linear models, has been intensively studied in the literature. See, for example, Engle et al. (1986), Speckman (1988), Härdle, Liang & Gao (2000) and references therein. For GPLMs, Severini & Staniswalis (1994) applied the quasilikelihood principle proposed by Severini & Wong (1992), and Carroll et al. (1997) proposed two different estimation algorithms based on quasilikelihood and local kernel methods. Related topics have recently been studied by Lin & Carroll (2001) for longitudinal data, and Liang & Ren (2005) for measurement errors. It is worth pointing out that the quasilikelihood approach for the GPLM is different from the kernel-based smoothing method for partially linear models. The latter is simple and noniterative because the estimators are available in closed form, while the former needs an iterative algorithm and an undersmoothing bandwidth.
Under mild regularity conditions, Severini & Staniswalis (1994) derived the asymptotics for the estimators of β and θ(t) that they proposed. In principle, these asymptotic results can be used to construct asymptotically correct confidence intervals for the parameters and pointwise confidence intervals for the nonparametric function. The finite-sample performance of the resulting confidence intervals may not be appealing because of the complex structure of the covariance matrix, which must be estimated by plugging in estimates of several parameters. In this paper, we propose an alternative for constructing confidence regions for β and θ(t) using the empirical likelihood principle, which was originally studied by Hartley & Rao (1968) for sample surveys and by Thomas & Grunkemeier (1975) for survival analysis. Owen (2001) gave a comprehensive survey of empirical likelihood methods and related topics. The empirical likelihood method has many advantages over its competitors such as the normal-approximation-based method and the bootstrap method (see Hall & La Scala, 1990). These advantages include improvement of the confidence region, increased accuracy of coverage through the use of auxiliary information, easy implementation, no need to estimate variances, and automatic studentising. Because of these features, the applications of empirical likelihood in parametric and nonparametric models have received a great amount of attention.
More recently, empirical likelihood based inference has been developed for semiparametric models, e.g., by Zhu & Xue (2006), who developed empirical likelihood confidence regions for the parameters of partially linear single-index models. However, most research on empirical likelihood inference for semiparametric models has focused on the finite-dimensional parameter and has assumed a continuous response variable. In this paper, we study empirical likelihood inference for both the finite-dimensional parameters and the nonparametric functions in semiparametric models, and we allow the response variable to be discrete. Our procedure is a generalization of the empirical likelihood procedure to a combination of generalized linear models and nonparametric regression. This generalization is by no means straightforward. In Section 2, we define the empirical likelihood ratio statistics for β and θ(t), derive the asymptotic distributions of the resulting empirical likelihood statistics, and explain how to establish the corresponding confidence regions. In Section 3 we report the results of a simulation experiment to explore the finite sample performance of the proposed confidence intervals. The proposed methods are used to analyze a real dataset in Section 4. Section 5 gives a discussion. All technical derivations are given in the Appendix.
2 Empirical Likelihood Methods
Several authors have applied empirical likelihood to partially linear models, a special case of the GPLM. For example, Shi & Lau (1999) proposed an empirical likelihood based confidence interval for the parameters of a partially linear model. Qin & Jing (2001) and Wang & Li (2002) considered the case in which the response variables Yi are randomly censored. These authors proposed an empirical likelihood ratio for β and derived its asymptotic distribution, which is a sum of independent chi-squared distributions with unknown weights.
We first briefly review the quasi-likelihood estimators of β and θ(t) proposed by Severini & Staniswalis (1994). Denote the quasi-likelihood function by
\[ Q(\mu, y) = \int_{\mu}^{y} \frac{s - y}{V(s)}\, ds. \]
Under some regularity conditions, ∑i Q(μ, Yi) behaves like a log-likelihood function for μ based on Y1, ⋯, Yn and Q(μ, y) behaves like the logarithm of a density function for Y. Let K denote a kernel on R^q, and h = hn denote a sequence of bandwidths. For each fixed t and β, let θ̂β(t) denote the solution in η of
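\[ \sum_{i=1}^{n} K\{(t - T_i)/h\}\, \frac{\partial}{\partial \eta} Q[\mu(X_i'\beta + \eta), Y_i] = 0. \tag{2} \]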
Let 𝒯0 denote a compact subset of int(𝒯) and let Ii = 1 if Ti ∈ 𝒯0 and 0 otherwise. Given the estimator θ̂β(t), an estimator β̂ of β is then obtained by solving
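\[ \sum_{i=1}^{n} I_i\, \frac{\partial}{\partial \beta} Q[\mu\{X_i'\beta + \hat\theta_\beta(T_i)\}, Y_i] = 0. \tag{3} \]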
The quasi-likelihood estimator of θ(t) is given by θ̂β̂(t). The trimming by Ii of data near the boundary is employed to reduce boundary bias, which, for kernel regression estimators, can be quite serious and converges to zero at a slower rate than in the interior. In the univariate case, when q = 1, either a boundary-corrected kernel estimator or locally linear kernel estimator may be used instead. Although either of these methods may be extended to the multivariate case, the resulting technical details for the development of the asymptotic theory become cumbersome. For ease of notation, we present our results for the case q = 1 in the remainder of this paper.
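As a concrete illustration of the local estimating equation (2), the following minimal sketch assumes a logistic model with scalar X and q = 1, in which case the score contribution in η reduces to Y − expit(Xβ + η). The function names, the choice of the quartic kernel, and the use of bisection as the root finder are illustrative assumptions, not the authors' implementation.

```python
import math

def quartic_kernel(u):
    """Quartic (biweight) kernel supported on [-1, 1]."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0

def expit(u):
    """Numerically stable inverse logit."""
    if u >= 0.0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def local_score(eta, t, beta, X, Y, T, h):
    """Kernel-weighted logistic score in eta at the point t (assumed form of (2))."""
    return sum(
        quartic_kernel((Ti - t) / h) * (Yi - expit(Xi * beta + eta))
        for Xi, Yi, Ti in zip(X, Y, T)
    )

def theta_hat(t, beta, X, Y, T, h, lo=-20.0, hi=20.0, tol=1e-10):
    """Solve local_score(eta) = 0 by bisection; the score is decreasing in eta."""
    if (local_score(lo, t, beta, X, Y, T, h) <= 0.0
            or local_score(hi, t, beta, X, Y, T, h) >= 0.0):
        raise ValueError("no root in [lo, hi]; the window may contain only one class")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if local_score(mid, t, beta, X, Y, T, h) > 0.0:
            lo = mid  # score still positive: the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For fixed β, calling `theta_hat(t, beta, X, Y, T, h)` over a grid of t values traces out the curve θ̂β(t). Bisection is used here only because the logistic score is monotone in η, which makes the sketch robust without derivative code.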
2.1 Confidence region for β
Let β0 denote the true value of β. Write
Based on the estimating equation (3) for β, we propose the empirical likelihood ratio statistic for β as follows.
where pi, i = 1, ⋯, n, are nonnegative numbers which satisfy
By the Lagrange multiplier method, it can be shown that
where λ1 is determined by
(4) |
The asymptotic distribution of the empirical likelihood ratio statistic ℓ1(β0) is established in Theorem 2.1. Its proof is given in the Appendix.
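For orientation, the standard empirical likelihood construction (Owen, 1990) has the following generic form. Here ω_{1i}(β) is shorthand, assumed for illustration only, for the i-th summand of the estimating equation (3); the authors' exact definition of this quantity is not reproduced here.

```latex
\begin{align*}
\ell_1(\beta) &= -2 \max\Big\{ \sum_{i=1}^{n} \log(n p_i) :
   p_i \ge 0,\ \sum_{i=1}^{n} p_i = 1,\
   \sum_{i=1}^{n} p_i\, \omega_{1i}(\beta) = 0 \Big\}, \\
p_i &= \frac{1}{n\{1 + \lambda_1' \omega_{1i}(\beta)\}}, \\
\ell_1(\beta) &= 2 \sum_{i=1}^{n} \log\{1 + \lambda_1' \omega_{1i}(\beta)\}, \\
0 &= \frac{1}{n} \sum_{i=1}^{n}
   \frac{\omega_{1i}(\beta)}{1 + \lambda_1' \omega_{1i}(\beta)} .
\end{align*}
```

The second line gives the maximizing weights obtained from the Lagrange multiplier argument, and the last line is the equation that determines the multiplier λ1.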
Theorem 2.1
Suppose that nh⁴ → 0 and the conditions (a)–(e) in the Appendix are satisfied. Then, as n → ∞, \(\ell_1(\beta_0) \to \chi^2_p\) in distribution, where β0 is the true parameter value and \(\chi^2_p\) is a chi-square distributed random variable with p degrees of freedom.
Therefore, CIβ = {β | ℓ1(β) ≤ cα} is an asymptotic 1 − α confidence region for β0, where cα satisfies \(P(\chi^2_p \le c_\alpha) = 1 - \alpha\).
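The computation behind such a statistic can be sketched for the scalar case p = 1. The code below takes the values g_1, ..., g_n of a generic scalar estimating function at a candidate parameter value and returns the −2 log empirical likelihood ratio. The function name and the bisection solver are assumptions made for illustration; the sketch does not construct the paper's specific estimating function.

```python
import math

CHI2_1_95 = 3.841458820694124  # 0.95 quantile of the chi-square law with 1 df

def el_log_ratio(g):
    """-2 log empirical likelihood ratio for scalar values g_1, ..., g_n."""
    gmin, gmax = min(g), max(g)
    if not (gmin < 0.0 < gmax):
        # Zero lies outside the convex hull of the g_i: no valid weights exist.
        return math.inf
    # The multiplier must keep every 1 + lam * g_i strictly positive.
    lo, hi = -1.0 / gmax, -1.0 / gmin
    eps = 1e-10 * (hi - lo)
    lo, hi = lo + eps, hi - eps

    def score(lam):
        return sum(gi / (1.0 + lam * gi) for gi in g)

    # score is strictly decreasing in lam, so bisection finds its unique root.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 2.0 * sum(math.log(1.0 + lam * gi) for gi in g)
```

As a simple usage example, for a sample `ys` and a candidate mean `m`, `el_log_ratio([y - m for y in ys]) <= CHI2_1_95` indicates that `m` lies in the 95% empirical likelihood region for the mean.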
2.2 Pointwise confidence region for θ(t)
Let η = θ(t) for fixed t ∈ 𝒯0, and let β̂ be a √n-consistent estimator of β0. Denote
where K(·) is a kernel function and h is a bandwidth. Based on the estimating equation (2) for η, we propose the empirical likelihood ratio statistic for η:
where pi, i = 1, ⋯, n, are nonnegative numbers which satisfy
A direct calculation implies that
where λ2 is determined by
(5) |
The asymptotic distribution of the empirical likelihood ratio statistic ℓ2(η) is given in Theorem 2.2. Its proof is given in the Appendix.
Theorem 2.2
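Suppose that nh⁴ → 0 and that the conditions (a)–(e) in the Appendix are satisfied. Then, as n → ∞, \(\ell_2\{\theta(t)\} \to \chi^2_1\) in distribution, where θ(t) denotes the true value of the function at t.
From Theorem 2.2, the confidence region for η with asymptotic coverage probability 1 − α (0 < α < 1) can be constructed as CIη = {η | ℓ2(η) ≤ cα}, where cα satisfies \(P(\chi^2_1 \le c_\alpha) = 1 - \alpha\).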
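In practice a region of the form {η | ℓ2(η) ≤ cα} can be approximated by evaluating the statistic on a grid. The sketch below assumes the caller supplies a function `log_ratio` (a hypothetical name) that returns the empirical likelihood ratio statistic at a candidate value; it reports the smallest and largest accepted grid points.

```python
CHI2_1_95 = 3.841458820694124  # 0.95 quantile of the chi-square law with 1 df

def el_interval(log_ratio, grid, cutoff=CHI2_1_95):
    """Return (min, max) of the grid points whose statistic is at most cutoff.

    If the accepted set is not an interval, this reports its hull only.
    Returns None when no grid point is accepted.
    """
    accepted = [v for v in grid if log_ratio(v) <= cutoff]
    if not accepted:
        return None
    return min(accepted), max(accepted)
```

The resolution of the reported endpoints is limited by the grid spacing, so a finer grid near the boundary may be worthwhile when interval lengths are compared across methods.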
Remark 1
Theorems 2.1 and 2.2 indicate that undersmoothing is still needed, as for the normal approximation theory (Carroll et al., 1997). To meet this requirement, we use existing bandwidth selection techniques to obtain the optimal bandwidth, ĥopt. An ad hoc bandwidth is then generated by ĥopt × n^{−1/20}(log n)^{−1/5}, which ensures that the bandwidth has the order required in Theorems 2.1 and 2.2.
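The adjustment in Remark 1 is a one-line computation. The sketch below assumes `h_opt` has already been obtained from some bandwidth selector (which one is not specified here) and uses a hypothetical function name.

```python
import math

def undersmoothed_bandwidth(h_opt, n):
    """Shrink a data-driven bandwidth h_opt by n^(-1/20) * (log n)^(-1/5)."""
    if n <= 1:
        raise ValueError("n must exceed 1 so that log(n) > 0")
    return h_opt * n ** (-1.0 / 20.0) * math.log(n) ** (-1.0 / 5.0)
```

If ĥopt is of the usual order n^{−1/5}, the adjusted bandwidth is of order n^{−1/4}(log n)^{−1/5}, so that nh⁴ is of order (log n)^{−4/5} and tends to zero as the theorems require.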
3 Simulations
To illustrate the numerical performance of the proposed method, we conducted a small simulation experiment in which n = 80, 100, 120. We generated data from a logistic model
\[ \operatorname{logit}\{P(Y_i = 1 \mid X_i, T_i)\} = X_i\beta + \theta(T_i), \]
where Xi is independent uniform (−0.5, 0.5) component and Ti is uniformly distributed on (0, 2). The parameter β is equal to 1, and the nonparametric function is θ(z) = sin{(z − a)/(b − a)π} with .
In our nonparametric estimation implementation, to save computational time, we tried the simple bandwidth h = a n^{−1/4}(log n)^{−1/5} for a = 0.75, 1, 1.25, 1.5, 2, all of which satisfy the condition in Theorems 2.1 and 2.2. We finally selected the bandwidth h = 1.5 n^{−1/4}(log n)^{−1/5}. The numerical results are fairly stable with respect to the choice of bandwidth. We used the quartic kernel, K(u) = (15/16)(1 − u²)² I(|u| ≤ 1). We generated 200 data sets in each configuration. The empirical likelihood based and normal approximation based confidence intervals for β are reported in Table 1. The lower and upper endpoints (columns “LE” and “RE”) are the averages of the 200 simulated lower and upper values. The columns “AL” give the average length of the confidence intervals, while the columns “CP(%)” give the corresponding coverage probabilities over the 200 simulated datasets. The pointwise confidence intervals for the nonparametric function θ(t) at the four selected points t = 0.3, 0.8, 1.5 and 1.9 are presented in Table 2. A referee asked how the proposed confidence intervals compare to bootstrap confidence intervals, for which we used the naive bootstrap, i.e., resampled (X, Y, T) 500 times. We provide the results for β in Table 1 and for the nonparametric component θ(t) in Table 2. These results basically coincide with those based on the normal approximation, and deviate slightly from those based on the method proposed in this paper. However, the bootstrap implementation took significantly longer than the empirical likelihood method. From the tables we may conclude that the coverage probabilities based on the empirical likelihood method are mostly closer to the nominal level than those based on the normal approximation method, while the empirical likelihood based intervals are mostly slightly shorter than those based on the normal approximation method. The length of the estimated confidence intervals for β decreases as the sample size increases.
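The simulation design can be sketched as follows. The values of a and b in θ(z) are not available in this text, so the defaults below (a = 0, b = 2) are placeholders chosen purely for illustration, as are the function names and the seeding scheme.

```python
import math
import random

def simulate_gplm(n, beta=1.0, a=0.0, b=2.0, seed=0):
    """Draw (X, Y, T) from a logistic partially linear design.

    a and b are placeholder values, not the ones used in the paper.
    """
    rng = random.Random(seed)
    X = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    T = [rng.uniform(0.0, 2.0) for _ in range(n)]
    Y = []
    for x, t in zip(X, T):
        theta = math.sin(math.pi * (t - a) / (b - a))
        p = 1.0 / (1.0 + math.exp(-(x * beta + theta)))
        Y.append(1 if rng.random() < p else 0)
    return X, Y, T

def simulation_bandwidth(n, c=1.5):
    """Bandwidth h = c * n^(-1/4) * (log n)^(-1/5); c = 1.5 is the selected value."""
    return c * n ** (-0.25) * math.log(n) ** (-0.2)
```

Repeating `simulate_gplm(n, seed=s)` for s = 0, ..., 199 gives 200 replications per sample size, matching the number of data sets described above.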
Table 1.
n | Normal CP(%) | Normal LE | Normal RE | Normal AL | BC CP(%) | BC LE | BC RE | BC AL | EL CP(%) | EL LE | EL RE | EL AL |
---|---|---|---|---|---|---|---|---|---|---|---|---|
80 | 93.0 | −0.568 | 2.889 | 3.457 | 97.0 | −0.547 | 2.904 | 3.451 | 96.0 | −0.604 | 2.823 | 3.427 |
100 | 97.0 | −0.563 | 2.493 | 3.056 | 96.5 | −0.541 | 2.415 | 2.955 | 95.0 | −0.514 | 2.468 | 2.982 |
120 | 97.0 | −0.352 | 2.384 | 2.736 | 97.0 | −0.362 | 2.385 | 2.747 | 94.5 | −0.295 | 2.398 | 2.692 |
Table 2.
t | n | Normal CP(%) | Normal LE | Normal RE | Normal AL | BC CP(%) | BC LE | BC RE | BC AL | EL CP(%) | EL LE | EL RE | EL AL |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.3 | 80 | 94.5 | −1.091 | 0.793 | 1.885 | 95.0 | −1.038 | 0.829 | 1.867 | 94.0 | −0.841 | 0.914 | 1.756 |
0.3 | 100 | 97.5 | −0.995 | 0.586 | 1.581 | 97.0 | −0.999 | 0.587 | 1.586 | 96.0 | −0.893 | 0.589 | 1.482 |
0.3 | 120 | 98.5 | −0.844 | 0.418 | 1.262 | 96.5 | −0.840 | 0.428 | 1.267 | 96.5 | −0.823 | 0.442 | 1.265 |
0.8 | 80 | 94.7 | −0.057 | 1.711 | 1.768 | 95.5 | −0.044 | 1.671 | 1.715 | 95.5 | −0.033 | 1.648 | 1.681 |
0.8 | 100 | 96.5 | 0.144 | 1.613 | 1.469 | 97.0 | 0.133 | 1.614 | 1.482 | 95.5 | 0.093 | 1.613 | 1.519 |
0.8 | 120 | 97.0 | 0.103 | 1.448 | 1.345 | 96.5 | 0.100 | 1.427 | 1.327 | 94.5 | 0.110 | 1.336 | 1.225 |
1.5 | 80 | 94.0 | −1.064 | 0.431 | 1.495 | 94.5 | −1.096 | 0.404 | 1.500 | 96.5 | −1.032 | 0.417 | 1.449 |
1.5 | 100 | 97.5 | −1.070 | 0.278 | 1.348 | 97.0 | −1.048 | 0.283 | 1.332 | 93.5 | −0.786 | 0.303 | 1.090 |
1.5 | 120 | 96.5 | −1.037 | 0.236 | 1.273 | 97.0 | −1.002 | 0.265 | 1.267 | 96.0 | −0.965 | 0.259 | 1.223 |
1.9 | 80 | 96.0 | −2.030 | 0.157 | 2.187 | 96.5 | −1.974 | 0.172 | 2.146 | 96.0 | −1.962 | 0.160 | 2.123 |
1.9 | 100 | 97.0 | −1.957 | 0.043 | 2.000 | 96.5 | −1.973 | 0.030 | 2.003 | 95.5 | −1.629 | 0.284 | 1.913 |
1.9 | 120 | 97.5 | −1.963 | 0.024 | 1.987 | 97.0 | −1.953 | 0.032 | 1.985 | 95.5 | −1.630 | 0.102 | 1.732 |
4 Real Data Analysis
In recent years, one of the areas focused upon by AIDS researchers has been the relationship between viral load and CD4+ cell counts (Liang, Wu & Carroll, 2003; Liang, et al., 2004). This relationship is used to investigate the concordance and discordance between virologic and immunologic variables, which may help clinicians more deeply understand AIDS pathogenesis and improve therapy. Although antiretroviral therapy for HIV-1 infected patients has greatly improved in recent years, and administration of drug cocktails consisting of three or more drugs can reduce and maintain the viral load below the detection limit in many patients, it is unlikely that any combination of therapies can eradicate HIV in infected patients because of the existence of long-lived infected cells and sites within the body where drugs may not be effective. With the success of highly active antiretroviral therapy (HAART) against HIV infection, viral load (measured as viral RNA copies/mL) is suppressed and maintained at magnitudes that are below the limit of quantification, and the infection is considered chronic. Clinicians and patients are therefore nowadays more interested in achieving a viral load that is below the detection limit and in monitoring the immunologic system (measured by CD4+ cell counts).
In this section we analyze a dataset from the AIDS study PACTG 345 (Scott et al., 2001). Let Y be the indicator of an undetectable viral load level, let X be the CD4 cell count, and let T be the treatment time. In this study, 33 patients were enrolled as cohort II. Specimens were obtained on days 0, 1, 3, 7, 14, 28, 56, then irregularly through day 1155. A total of 559 HIV-1 RNA measurements were obtained, with 256 of these below the detection limit of 400 copies/mL. Thus, about 46% of the viral loads were observed to be suppressed below the detection limit. Figure 1 presents the individual observations of plasma HIV RNA concentration (viral load) after initial antiretroviral treatments. A main objective of the treatment is to suppress the viral load below the limit of detection.
We are interested in the relationship between the binary viral load measurement and CD4+ cell counts. A parsimonious model of this relationship is biologically and clinically important because these variables are good biomarkers for anti-HIV treatment and may be used to evaluate antiretroviral therapies. An obvious model is logistic regression, with X and T having linear effects on the logit scale, because it is easily implemented and interpreted. A concern, however, is whether this model can appropriately capture curvature in the effect of T due to drug resistance or noncompliance. To address this concern, we used the method of Härdle, Mammen & Müller (1998) to check whether a logistic model is appropriate, and obtained a p-value less than 10⁻⁴, which indicates that traditional logistic regression is not flexible enough to fit this data set well. We therefore used a partially linear logistic model, described in (6), to fit the dataset and used the proposed method to obtain confidence intervals for the parametric and nonparametric components.
\[ \operatorname{logit}\{P(Y = 1 \mid X, T)\} = X\beta + \theta(T), \tag{6} \]
where θ(t) is an unknown smooth function. The estimate of β is 0.216, the positive value of which reflects the increased chance of RNA below the detection limit at higher levels of CD4+ counts. The 95% confidence intervals for β based on the normal approximation and the proposed empirical likelihood methods are (−0.202, 0.634) and (0.081, 0.514), respectively. These two confidence intervals convey different messages. The former interval contains zero and so indicates that the chance of RNA below the detection limit is not statistically significantly related to CD4 cell count, whereas the latter interval excludes zero and suggests the opposite. We prefer the conclusion based on the empirical likelihood method, given its biological plausibility and the simulation performance of this method. The pointwise estimates of θ(t) and associated confidence regions based on these two methods are shown in Figure 2, in which the solid line is the estimated pattern of θ(t), and the dotted lines and broken lines are the confidence regions based on the empirical likelihood and normal approximation methods, respectively. The former gives a narrower region than the latter.
5 Discussion
To simplify inference for GPLMs, we proposed an empirical likelihood based approach to constructing confidence regions for β and θ(t). The proposed approach is considerably simpler than its counterpart based on the asymptotic normality of quasilikelihood estimators (Severini & Staniswalis, 1994) and is easy to implement. The finite-sample performance of the proposed statistics shows promise. In this article, we used local constant kernel regression to handle the nonparametric function θ(t). There are many alternatives to the local constant kernel regression in (2), including higher degree local polynomial kernel methods, smoothing splines, and regression splines. The details for these methods need further investigation in our setting. We chose local constant kernel regression because the theoretical results can be derived (Severini & Staniswalis, 1994).
Model (1) may be extended to a generalized additive partially linear model in the form of
where Ti = (T1,i, …, TK,i)′ is a K-dimensional vector. The study of this model is interesting and requires additional efforts, but it is beyond the scope of this paper.
Acknowledgements
The research of Liang and Qin was supported by NIH/NIAID grants. Zhang’s research was partially supported by grants from the National Natural Science Foundation of China. Ruppert’s research was supported by NSF and NIH grants. The authors thank the Editor and two referees for their insightful comments that improved an earlier version of this paper.
Appendix
Conditions
The following assumptions are standard in studies of GPLMs, and we assume these hold throughout the article. Write ρ1(u) = {dμ(u)/du} V^{−1}{μ(u)}, and q1(u, y) = {y − μ(u)}ρ1(u).
(a) The density function f(t) of T is positive and continuous at the point t0 ∈ 𝒯.
(b) The function μ(u) is twice differentiable in u.
(c) The function θ^{(2)}(t) is continuous at the point t0 ∈ 𝒯.
(d) With , and are twice differentiable in t.
(e) , for some δ > 2.
Proof of Theorem 2.1
Denote AA′ by A⊗2, for i = 1, …, n. Ξn = max1≤i≤n ‖ω1{β0, θ̂β0(Ti), Yi, Xi, Ti}‖. We first show that
(7) |
(8) |
and
(9) |
where Г is a positive definite matrix in form of
Recall that Q(μ, y) behaves like the logarithm of a density function for Y, and that θβ(t) is a least favorable curve, so that Proposition 2 of Severini & Wong (1992) holds, as shown in the proof of Proposition 1 of Severini & Staniswalis (1994). Accordingly, applying (2) in Section 6 of Severini & Wong (1992) (here our corresponds to of Severini & Wong (1992)), we obtain
(10) |
Furthermore,
(11) |
(7) follows from (10), (11) and a central limit theorem. The proofs of (8) and (9) are trivial.
From (7), (8) and (9), and the arguments similar to the proof of (2.14) in Owen (1990), we can show that
(12) |
Recall (4). It is readily seen by a direct calculation and (12) that
Thus, using Taylor expansion, we have
Proof of Theorem 2.2
Denote by f0(·) the probability density function of T. Write
We first show that
(13) |
(14) |
where Г0 = ∫ K²(u)du · f0(t)E{H²(Y, X, T)|T = t}.
From Taylor expansion and the fact that
it can be shown that
(15) |
Moreover,
(16) |
Thus (13) follows from (15), (16) and a central limit theorem. The proof of (14) is trivial.
Write
From (15), using a central limit theorem, it can be shown that
Combining with (5), (13) and (14), we have
Furthermore, by Taylor expansion, we obtain
(17) |
Contributor Information
Hua Liang, University of Rochester.
Yongsong Qin, Guangxi Normal University.
Xinyu Zhang, Chinese Academy of Sciences.
David Ruppert, Cornell University.
References
- Carroll RJ, Fan J, Gijbels I, Wand MP. Generalized partially linear single-index models. J. Am. Statist. Assoc. 1997;92:477–489.
- Chen SX, Qin YS. Empirical likelihood confidence intervals for local linear smoothers. Biometrika. 2000;87:946–953.
- Engle RF, Granger CWJ, Rice J, Weiss A. Semiparametric estimates of the relation between weather and electricity sales. J. Am. Statist. Assoc. 1986;81:310–320.
- Hall P, La Scala B. Methodology and algorithms of empirical likelihood. Int. Statist. Rev. 1990;58:109–127.
- Hartley HO, Rao JNK. A new estimation theory for sample surveys. Biometrika. 1968;55:547–557.
- Härdle W, Mammen E, Müller M. Testing parametric versus semiparametric modeling in generalized linear models. J. Am. Statist. Assoc. 1998;93:1461–1474.
- Härdle W, Liang H, Gao J. Partially Linear Models. Heidelberg: Springer Physica-Verlag; 2000.
- Liang H, Ren HB. Generalized partially linear measurement error models. J. Comp. Graph. Statist. 2005;14:237–250.
- Liang H, Wu HL, Carroll RJ. The relationship between virologic and immunologic responses in AIDS clinical research using mixed-effect varying-coefficient semiparametric models with measurement error. Biostatistics. 2003;4:297–312. doi: 10.1093/biostatistics/4.2.297.
- Liang H, Wang S, Robins JM, Carroll RJ. Estimation in partially linear models with missing covariates. J. Am. Statist. Assoc. 2004;99:357–367.
- Lin XH, Carroll RJ. Semiparametric regression for clustered data using generalized estimating equations. J. Am. Statist. Assoc. 2001;96:1045–1056.
- Owen AB. Empirical likelihood ratio confidence regions. Ann. Statist. 1990;18:90–120.
- Owen AB. Empirical likelihood. New York: Chapman and Hall; 2001.
- Qin GS, Jing BY. Censored partial linear models and empirical likelihood. J. Mult. Anal. 2001;78:37–61.
- Qin J. Empirical likelihood ratio based confidence intervals for mixture proportions. Ann. Statist. 1999;27:1368–1384.
- Qin J, Lawless J. Empirical likelihood and general estimating equations. Ann. Statist. 1994;22:300–325.
- Scott ZA, Chadwick EG, Gibson LL, et al. Infrequent detection of HIV-1-specific, but not cytomegalovirus-specific, CD8+T cell responses in young HIV-1-infected infants. J. Immunology. 2001;167:7134–7140. doi: 10.4049/jimmunol.167.12.7134.
- Severini TA, Staniswalis JG. Quasilikelihood estimation in semiparametric models. J. Am. Statist. Assoc. 1994;89:501–511.
- Severini TA, Wong WH. Profile likelihood and conditionally parametric models. Ann. Statist. 1992;20:1768–1802.
- Shi J, Lau TS. Empirical likelihood for partially linear models. J. Mult. Anal. 1999;72:132–148.
- Speckman P. Kernel smoothing in partial linear models. J. R. Statist. Soc. B. 1988;50:413–436.
- Stone CJ. Optimal rates of convergence for nonparametric estimators. Ann. Statist. 1980;8:1348–1360.
- Thomas DR, Grunkemeier GL. Confidence interval estimation of survival probabilities for censored data. J. Am. Statist. Assoc. 1975;70:865–871.
- Wang QH, Li G. Empirical likelihood semiparametric regression analysis under random censorship. J. Mult. Anal. 2002;83:469–486.
- Zhu LX, Xue LG. Empirical likelihood confidence regions in a partially linear single-index model. J. R. Statist. Soc. B. 2006;68:549–570.