Abstract
Semiparametric generalized varying coefficient partially linear models with longitudinal data arise in contemporary biology, medicine, and life science. In this paper, we consider a variable selection procedure based on the combination of the basis function approximations and quadratic inference functions with SCAD penalty. The proposed procedure simultaneously selects significant variables in the parametric components and the nonparametric components. With appropriate selection of the tuning parameters, we establish the consistency, sparsity, and asymptotic normality of the resulting estimators. The finite sample performance of the proposed methods is evaluated through extensive simulation studies and a real data analysis.
1. Introduction
Identifying the significant variables is an important task in any regression analysis. In practice, a number of variables are available for an initial analysis, but many of them may not be significant and should be excluded from the final model in order to increase the accuracy of prediction. Various procedures and criteria, such as stepwise selection and subset selection with the Akaike information criterion (AIC), Mallows Cp, and the Bayesian information criterion (BIC), have been developed. Nevertheless, these selection methods are computationally expensive. Many shrinkage methods have been developed for the sake of computational efficiency, e.g., the nonnegative garrote [1], the LASSO [2], the bridge regression [3], the SCAD [4], and the one-step sparse estimator [5]. Among these, the SCAD possesses the virtues of continuity, unbiasedness, and sparsity. There are a number of works on SCAD estimation methods in various regression models, e.g., [6–9]. Zhao and Xue [8] proposed a variable selection method that selects significant variables in the parametric components and the nonparametric components simultaneously for varying coefficient partially linear models (VCPLMs).
On the other hand, longitudinal data occurs frequently in biology, medicine, and life science, in which it is often necessary to make repeated measurements of subjects over time. The responses from different subjects are independent, but the responses from the same subject are very likely to be correlated. This feature is called “within-cluster correlation”. Qu et al. [10] proposed a method of quadratic inference functions (QIFs) to treat the longitudinal data. The QIF can efficiently take the within-cluster correlation into account and is more efficient than the generalized estimating equation (GEE) [11] approach when the working correlation is misspecified. The QIF approach has been applied to many models, including varying coefficient models (VCM) [12, 13], partially linear models (PLM) [14], varying coefficient partially linear models (VCPLMs) [15], and generalized partially linear models (GPLM) [16]. Wang et al. [13] proposed a group SCAD procedure for variable selection of VCM with longitudinal data. More recently, Tian et al. [15] proposed a QIF-based SCAD penalty for the variable selection for VCPLM with longitudinal data.
As introduced in Li and Liang [17], the generalized partially linear varying coefficient model (GPLVCM) combines the flexibility of a nonparametric regression model with the explanatory power of a generalized linear regression model, and it arises naturally in the presence of categorical covariates. Many models are special cases of the GPLVCM, e.g., the VCM, VCPLM, PLM, and GLM. Li and Liang [17] studied variable selection for the GPLVCM, where the parametric components are identified via the SCAD but the nonparametric components are selected via a generalized likelihood ratio test instead of shrinkage. In this paper, we extend the QIF-based group SCAD variable selection procedure to the GPLVCM with longitudinal data, and B-spline methods are adopted to approximate the nonparametric components of the model. With suitably chosen tuning parameters, the proposed variable selection procedure is consistent, and the estimators of the regression coefficients have the oracle property, i.e., the estimators of the nonparametric components achieve the optimal convergence rate, and the estimators of the parametric components have the same asymptotic distribution as those based on the correct submodel.
The rest of this paper is organized as follows. In Section 2, we propose a variable selection procedure for the GPLVCM with longitudinal data. Asymptotic properties of the resulting estimators and an iteration algorithm are presented in Section 3. In Section 4, we carry out simulation studies to assess the finite sample performance of the method. A real data analysis is given in Section 5 to illustrate the proposed methodology. The details of proofs are provided in the appendix.
2. Methodology
2.1. GPLVCM with Longitudinal Data
In this article, we consider a longitudinal study with n subjects and mi observations over time for the ith subject (i = 1, ⋯, n), for a total of N = ∑i=1n mi observations. Each observation consists of a response variable Yij and the predictor variables (Xij, Zij, Uij), where Xij ∈ Rp, Zij ∈ Rq, and Uij is a scalar. We assume that observations from different subjects are independent, but those within the same subject are dependent. The generalized partially linear varying coefficient model (GPLVCM) with longitudinal data takes the form
μij = E(Yij∣Xij, Zij, Uij) = h(XijTβ + ZijTα(Uij)),  (1)
where μij is the expectation of Yij when Xij, Zij, and Uij are given, β = (β1,⋯,βp)T is an unknown p × 1 regression coefficient vector, h(·) is a known smooth link function, and α(u) = (α1(u), α2(u),⋯,αq(u))T is a q × 1 unknown monotonic smooth function vector. Without loss of generality, we assume U ~ U[0, 1].
We approximate α(·) by B-spline basis functions B(u) = (B1(u),⋯,BL(u))T with the order of M, where L = K + M + 1 and K is the number of interior knots, i.e.,
αk(u) ≈ B(u)Tγk = ∑l=1L Bl(u)γkl,  k = 1, ⋯, q,  (2)
where γk = (γk1,⋯,γkL)T is a L × 1 vector of unknown regression coefficients. Accordingly, μij is approximated by
μij ≈ h(XijTβ + ZijT(Iq ⊗ B(Uij)T)γ),  (3)
where γ = (γ1T, ⋯,γqT)T and “⊗” is the Kronecker product. We use the B-spline basis functions because they are numerically stable and have bounded support [18]. The spline approach also treats a nonparametric function as a linear function with the basis functions as pseudodesign variables, and thus, any computational algorithm for the generalized linear models can be used for the GPLVCMs.
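To make the construction concrete, the following R sketch (not the authors' supplementary code) builds a B-spline basis with splines::bs and the pseudodesign matrix for the varying-coefficient part; the knot number, spline order, and dimensions are illustrative choices.

```r
# A sketch (not the authors' code): B-spline basis and pseudodesign for the
# varying-coefficient part; knot number, order, and dimensions are illustrative.
library(splines)

K <- 3; M <- 4                                 # 3 interior knots, cubic B-splines
u <- runif(50)                                 # index variable U on [0, 1]
knots <- quantile(u, probs = (1:K) / (K + 1))  # sample-quantile interior knots
B <- bs(u, knots = knots, degree = M - 1, intercept = TRUE,
        Boundary.knots = c(0, 1))              # 50 x L basis matrix

# Pseudodesign for Z^T alpha(U): row ij is (Z_ij1 B(U_ij)^T, ..., Z_ijq B(U_ij)^T),
# so that Z_ij^T alpha(U_ij) is approximated by the linear form W_ij^T gamma.
q <- 2
Z <- matrix(rnorm(50 * q), 50, q)
W <- do.call(cbind, lapply(1:q, function(k) Z[, k] * B))  # 50 x (q * L)
```

With this pseudodesign, the spline coefficients γ enter the model linearly, which is exactly what allows standard generalized linear model algorithms to be reused.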
To incorporate the within-cluster correlation, we apply the QIFs to estimate β and γ, respectively. Denote θ = (βT, γT)T, we define the extended score gN(θ) as follows:
gN(θ) = (1/n)∑i=1n gi(θ),  gi(θ) = (μ̇iTAi−1/2M1Ai−1/2(Yi − μi); ⋯; μ̇iTAi−1/2MsAi−1/2(Yi − μi)),  (4)
where μ̇i = ∂μi/∂θT, Ai = diag(Var(Yi1), ⋯, Var(Yimi)) is the marginal variance matrix of Yi, and M1, ⋯, Ms are the base matrices used to represent the inverse of the working correlation matrix R in the GEE approach. Following Qu et al. [10], we define the quadratic inference function to be
Qn(θ) = ngN(θ)TΩn(θ)−1gN(θ),  (5)
where Ωn(θ) = (1/n)∑i=1n gi(θ)gi(θ)T. Note that Ωn depends on θ. The QIF estimate is then given by
θ̃ = argminθ Qn(θ).  (6)
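As an illustration of (4)–(6), here is a minimal R sketch of the extended score and the QIF objective for a Bernoulli response with logit link and compound-symmetry basis matrices M1 = I and M2 = 11T − I; the per-subject lists Ylist, Xlist, and Wlist are hypothetical inputs, and the objective can be handed to a generic optimizer.

```r
# A minimal sketch of the extended score (4) and the QIF objective (5) for a logit
# link with CS basis matrices M1 = I and M2 = 1 1^T - I. Ylist, Xlist, Wlist are
# hypothetical per-subject responses and design blocks (parametric and spline parts).
qif_objective <- function(theta, Ylist, Xlist, Wlist) {
  n <- length(Ylist)
  g_list <- lapply(seq_len(n), function(i) {
    Pi  <- cbind(Xlist[[i]], Wlist[[i]])        # m_i x (p + qL) design
    eta <- drop(Pi %*% theta)
    mu  <- plogis(eta)                          # mean under the logit link
    Ai_inv_sqrt <- diag(1 / sqrt(mu * (1 - mu)), length(mu))
    Di  <- t(Pi * (mu * (1 - mu)))              # (d mu_i / d theta)^T
    res <- Ylist[[i]] - mu
    m   <- length(mu)
    M1  <- diag(m)
    M2  <- matrix(1, m, m) - diag(m)
    c(Di %*% Ai_inv_sqrt %*% M1 %*% Ai_inv_sqrt %*% res,
      Di %*% Ai_inv_sqrt %*% M2 %*% Ai_inv_sqrt %*% res)
  })
  gbar  <- Reduce("+", g_list) / n              # g_N(theta)
  Omega <- Reduce("+", lapply(g_list, tcrossprod)) / n
  n * drop(t(gbar) %*% solve(Omega, gbar))      # Q_n(theta), assuming Omega invertible
}
# theta_hat <- optim(theta0, qif_objective, Ylist = Ylist, Xlist = Xlist,
#                    Wlist = Wlist, method = "BFGS")$par   # minimizer as in (6)
```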
2.2. Penalized QIF
In real data analysis, the true regression model is always unknown. An overfitted model lowers the efficiency of estimation while an underfitted one leads to a biased estimator. A popular approach to identify the relevant predictors while estimating the nonzero parameters and functions in model (1) simultaneously is to exert some kind of “penalty” on the original objective function. Here, we choose the smoothly clipped absolute deviation (SCAD) penalty because it has several advantages such as unbiasedness, sparsity, and continuity. The SCAD-penalized quadratic inference function (PQIF) is defined as follows:
Qnp(θ) = Qn(θ) + n∑k=1q pλ1(‖γk‖H) + n∑l=1p pλ2(|βl|),  (7)
where ‖γk‖H = (γkTHγk)1/2, H = (hij)L×L with hij = ∫01 Bi(u)Bj(u)du, and pλ is the SCAD penalty function, whose derivative is defined as
pλ′(ω) = λ{I(ω ≤ λ) + (aλ − ω)+/((a − 1)λ)I(ω > λ)},  (8)
where a > 2, ω > 0, pλ(0) = 0; here, we choose a = 3.7 as in [4].
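For reference, a one-function R sketch of the SCAD derivative in (8) with a = 3.7:

```r
# A minimal sketch of the SCAD first derivative in (8), with a = 3.7 and omega >= 0.
scad_deriv <- function(w, lambda, a = 3.7) {
  lambda * (w <= lambda) + pmax(a * lambda - w, 0) / (a - 1) * (w > lambda)
}
scad_deriv(c(0.05, 0.5, 2), lambda = 0.1)   # coefficients beyond a*lambda get zero penalty slope
```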
Note that
‖γk‖H2 = γkTHγk = ∫01 {B(u)Tγk}2du ≈ ∫01 αk2(u)du.  (9)
This groupwise penalization ensures that the spline coefficients belonging to the same nonparametric component are treated as a single group in model selection.
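A minimal R sketch of the group norm ‖γk‖H: the matrix H collects the L2 inner products of the B-spline basis functions and is approximated here by numerical integration on a grid; the knot placement is illustrative.

```r
# A sketch of the group norm ||gamma_k||_H: H holds the L2 inner products of the
# B-spline basis functions, approximated here by a Riemann sum on a fine grid.
library(splines)
K <- 3; M <- 4
knots <- (1:K) / (K + 1)                       # illustrative equally spaced interior knots
ugrid <- seq(0, 1, length.out = 1001)
Bgrid <- bs(ugrid, knots = knots, degree = M - 1, intercept = TRUE,
            Boundary.knots = c(0, 1))
H <- crossprod(Bgrid) * (ugrid[2] - ugrid[1])  # h_ij ~ integral of B_i(u) B_j(u) du
gamma_k <- rnorm(ncol(Bgrid))                  # one spline coefficient block
norm_H  <- sqrt(drop(t(gamma_k) %*% H %*% gamma_k))   # ||gamma_k||_H as used in (7)
```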
Denote θ̂ = (β̂T, γ̂T)T to be the penalized estimator obtained by minimizing the penalized objective function (7). Then, β̂ is the estimator of the parameter β, and the estimator of the nonparametric function αk(u) is calculated by α̂k(u) = B(u)Tγ̂k, where γ̂ = (γ̂1T, ⋯, γ̂qT)T.
3. Asymptotic Properties
3.1. Oracle Property
We next establish the asymptotic properties of the resulting penalized QIF estimators. We first introduce some notation. Let β0 and α0(·) denote the true values of β and α(·). In addition, γ0 is the spline coefficient vector from the spline approximation to α0(·). Without loss of generality, we assume that β0l ≠ 0 for l = 1, ⋯, p1 and β0l = 0 for l = p1 + 1, ⋯, p, i.e., only the first p1 components of β0 are nonzero. Similarly, we assume that α0k(·) ≠ 0 for k = 1, ⋯, q1 and α0k(·) = 0 for k = q1 + 1, ⋯, q, i.e., only the first q1 components of α0(·) are nonzero. For convenience and simplicity, let C denote a positive constant that may take different values at each appearance throughout this paper, and let ‖A‖ denote the modulus of the largest singular value of a matrix or vector A. Before proving our main theorems, we list some regularity conditions used in this paper.
Assumption 1 (A1). —
The spline regression parameter γ is identifiable, that is, γ0 is the spline coefficient vector from the spline approximation to α0(·). In addition, there is a unique θ0 = (β0, γ0) ∈ S satisfying E{gN(θ0)} = 0, where S is the parameter space.
Assumption 2 (A2). —
The weight matrix Ωn = (1/n)∑i=1n gi(θ)giT(θ) converges almost surely to a constant matrix Ω0, where Ω0 is invertible.
Assumption 3 (A3). —
The covariate matrices Xi and Zi, i = 1, ⋯, n, satisfy supiE‖Xi‖4 < ∞ and supiE‖Zi‖4 < ∞.
Assumption 4 (A4). —
The error εi = Yi − μi satisfies E(εiεiT) = Vi, supi‖Vi‖ < ∞, and there exists a positive constant δ such that supiE‖εi‖2+δ < ∞.
Assumption 5 (A5). —
All marginal variances Ai ≥ 0 and supi‖Ai‖ < ∞.
Assumption 6 (A6). —
{mi} is a bounded sequence of positive integers.
Assumption 7 (A7). —
αi(u), i = 1, 2, ⋯, q, is r times continuously differentiable on (0, 1), where r ≥ 2.
Assumption 8 (A8). —
The interior knots {ci, i = 1, ⋯, K} satisfy
max1≤i≤K+1 |hi+1 − hi| = o(K−1),  max1≤i≤K+1 hi / min1≤i≤K+1 hi ≤ C,  (10)
where hi = ci − ci−1, c0 = 0, and cK+1 = 1.
Assumption 9 (A9). —
The link function h(·) is twice continuously differentiable, and E{h2+δ} < ∞ for some δ > 2.
Assumption 10 (A10). —
an = O(n−1/2) and bn⟶0 as n⟶∞, where
an = max{p′λ2(|β0l|), p′λ1(‖γ0k‖H): β0l ≠ 0, γ0k ≠ 0},  bn = max{p″λ2(|β0l|), p″λ1(‖γ0k‖H): β0l ≠ 0, γ0k ≠ 0}.  (11)
Theorem 1 indicates that the estimators of the nonparametric components achieve the optimal convergence rate.
Theorem 1 . —
Assume that Assumptions (A1)–(A10) hold and that the number of knots satisfies K = O(N1/(2r+1)). Then,
‖β̂ − β0‖ = Op(n−1/2)  and  ‖α̂k(·) − α0k(·)‖ = Op(n−r/(2r+1)),  k = 1, ⋯, q.  (12)
Furthermore, under suitable conditions, Theorem 2 shows that the penalized QIF estimator has the sparsity property.
Theorem 2 . —
Assume that the conditions in Theorem 1 hold and that λmax⟶0 and n1/2λmin⟶∞ as n⟶∞, where λmax = max{λ1, λ2} and λmin = min{λ1, λ2}. Then, with probability approaching 1,
β̂l = 0, l = p1 + 1, ⋯, p,  and  α̂k(·) ≡ 0, k = q1 + 1, ⋯, q.  (13)
Theorems 1 and 2 indicate that, with the tuning parameters suitably chosen, the proposed selection method possesses model selection consistency. Next, we establish the asymptotic property of the estimators of the nonzero parametric components. Let β∗ = (β1, ⋯, βp1)T and α∗(·) = (α1∗(·), ⋯, αq1∗(·))T, and let β0∗ and α0∗(·) denote their true values, respectively. In addition, let γ∗ = (γ1T, ⋯, γq1T)T and γ0∗ = (γ01T, ⋯, γ0q1T)T denote the spline coefficient vectors of α∗(·) and α0∗(·), respectively, and let Xi∗ and Zi∗, i = 1, ⋯, n, denote the corresponding covariates. Let Γ and Δ be defined by
(14) where Δ⊗2 = ΔΔT and τ = (τij)n×n is an n × n block matrix with its (i, j) block taking the form
(15) Theorem 3 states that β̂∗ is asymptotically normally distributed.
Theorem 3 . —
Suppose that Assumptions (A1)–(A9) hold and that the number of knots satisfies K = O(N1/(2r+1)). Then,
n1/2(β̂∗ − β0∗) ⟶d N(0, Σ),  (16)
where Σ = (ΓΔ−1Γ)−1 and ⟶d represents convergence in distribution.
3.2. Selection of Tuning Parameters
Theorems 1–3 imply that the proposed variable selection procedure possesses the oracle property. However, this attractive feature relies on the choice of the tuning parameters λ1 and λ2. Popular criteria for choosing them include cross-validation, generalized cross-validation, AIC, and BIC. Wang et al. [19] suggested using BIC for the SCAD estimator in linear models and partially linear models and proved its model selection consistency, i.e., the optimal parameter chosen by BIC can identify the true model with probability tending to one. Tian et al. [15] proved an analogous result for varying coefficient partially linear models with longitudinal data. Hence, we adopt BIC to choose the optimal {λ1, λ2}. Following [19–21], we simplify the tuning parameters as
λ1k = λ0/‖γ̃k‖H, k = 1, ⋯, q,  λ2l = λ0/|β̃l|, l = 1, ⋯, p,  (17)
where β̃l and γ̃k are the unpenalized QIF estimates. Consequently, the original two-dimensional problem becomes a univariate problem in λ0, which can be selected according to the following BIC-type criterion:
BICλ = Qn(θ̂λ) + log(n) · dfλ,  (18)
where θ̂λ = (β̂λT, γ̂λT)T is the regression coefficient estimator obtained by minimizing the penalized QIF (7) for a given λ, and dfλ is the number of nonzero components of β̂λ and nonzero groups of γ̂λ. Thus, the tuning parameter λ is obtained by
λ̂ = argminλ BICλ.  (19)
From Theorem 4 of Tian et al. [15], the BIC tuning parameter selector enables us to select the true model consistently.
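A schematic R sketch of this BIC-type tuning step is given below; fit_pqif is a hypothetical wrapper for the penalized QIF fit of Section 2.2, and the scaling of the per-coefficient tuning parameters follows the simplification in (17) as described above.

```r
# A schematic sketch of the BIC-type tuning selection: lambda0 is scanned on a grid,
# per-coefficient tuning parameters are scaled by the unpenalized QIF estimates as in
# (17), and the value minimizing the criterion in (18) is retained. fit_pqif() is a
# hypothetical wrapper returning the penalized fit, its Q_n value, and the group norms.
select_lambda0 <- function(lambda0_grid, beta_u, gamma_u_norms, data, n) {
  bic <- sapply(lambda0_grid, function(l0) {
    lambda2 <- l0 / abs(beta_u)                 # parametric components
    lambda1 <- l0 / gamma_u_norms               # nonparametric (group) components
    fit <- fit_pqif(data, lambda1 = lambda1, lambda2 = lambda2)   # hypothetical
    df  <- sum(fit$beta != 0) + sum(fit$group_norms != 0)
    fit$Qn + log(n) * df                        # BIC-type criterion
  })
  lambda0_grid[which.min(bic)]
}
```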
3.3. An Algorithm Using Local Quadratic Approximation
Based on Fan and Li's local quadratic approximation approach [4], we propose an iterative algorithm to minimize the PQIF (7). Similar to Tian et al. [15], we choose the unpenalized QIF estimator as the initial estimator. Let θk = (β1k, ⋯, βpk, γ1kT, ⋯, γqkT)T be the value of θ at the kth iteration. If βlk (or γlk) is close to 0, i.e., |βlk| ⩽ ϵ (or ‖γlk‖H ⩽ ϵ) for some small threshold value ϵ, then we set βlk = 0 (or γlk = 0). We use ϵ = 10−6 in our simulations.
Suppose that βlk+1 = 0 for l = pk + 1, ⋯, p, and γlk+1 = 0 for l = qk + 1, ⋯, q, and write βk+1 = (β1k+1, ⋯, βpkk+1, βpk+1k+1, ⋯, βpk+1)T = ((βNk+1)T, (βZk+1)T)T, where βNk+1 = (β1k+1, ⋯, βpkk+1)T collects the nonzero parametric components and βZk+1 = (βpk+1k+1, ⋯, βpk+1)T = 0. Similarly, let γk+1 = ((γ1k+1)T, ⋯, (γqkk+1)T, (γqk+1k+1)T, ⋯, (γqk+1)T)T = ((γNk+1)T, (γZk+1)T)T, where γNk+1 = ((γ1k+1)T, ⋯, (γqkk+1)T)T and γZk+1 = ((γqk+1k+1)T, ⋯, (γqk+1)T)T correspond to the qk nonzero functions and the q − qk zero functions, respectively. Let θ = (βNT, βZT, γNT, γZT)T denote a vector with the same length and the same partition as θk+1.
For the parametric term, if |βlk| > ϵ, the penalty function at βl ≈ βlk is approximated by
pλ2(|βl|) ≈ pλ2(|βlk|) + (1/2){p′λ2(|βlk|)/|βlk|}(βl2 − (βlk)2),  (20)
Similarly, for the nonparametric components, if ‖γlk‖H > ϵ, the penalty function at γl ≈ γlk is approximated by
pλ1(‖γl‖H) ≈ pλ1(‖γlk‖H) + (1/2){p′λ1(‖γlk‖H)/‖γlk‖H}(‖γl‖H2 − ‖γlk‖H2),  (21)
where p′λ is the first-order derivative of the penalty function pλ. This leads to the local approximation of the PQIF 𝒬np(θ) by a quadratic function:
| (22) |
where ω11 = (βNT, γNT)T collects the components not currently set to zero, and
| (23) |
Minimizing the quadratic function (22), we obtain ω11k+1. The Newton-Raphson method then iterates the following process to convergence:
| (24) |
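The following R sketch illustrates one LQA/Newton-Raphson update of this type for the parametric part only; grad_Qn and hess_Qn are hypothetical functions returning the gradient and Hessian of Qn, and the n-scaling of the penalty matches our reading of (7).

```r
# A sketch of one LQA / Newton-Raphson update for the parametric part only.
# grad_Qn() and hess_Qn() are hypothetical functions returning the gradient and
# Hessian of Q_n; the n * Sigma_lambda scaling follows our reading of (7).
lqa_step <- function(beta_k, lambda2, n, grad_Qn, hess_Qn, eps = 1e-6, a = 3.7) {
  beta_k[abs(beta_k) <= eps] <- 0                      # threshold tiny coefficients to zero
  active <- abs(beta_k) > 0
  scad_d <- function(w) lambda2 * (w <= lambda2) +
    pmax(a * lambda2 - w, 0) / (a - 1) * (w > lambda2)
  sig <- numeric(length(beta_k))
  sig[active] <- scad_d(abs(beta_k[active])) / abs(beta_k[active])
  H <- hess_Qn(beta_k)[active, active, drop = FALSE] +
    n * diag(sig[active], sum(active))                 # curvature of the LQA penalty
  g <- grad_Qn(beta_k)[active] + n * sig[active] * beta_k[active]
  beta_new <- beta_k
  beta_new[active] <- beta_k[active] - solve(H, g)     # one Newton-Raphson step
  beta_new
}
```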
4. Simulation Studies
4.1. Assessing Rule
In this section, we conduct simulation studies to assess the finite sample performance of the proposed procedures. Following [17], the performance of the estimator β̂ is assessed by the generalized mean square error (GMSE), defined as
GMSE = (β̂ − β0)TE(XXT)(β̂ − β0).  (25)
The performance of the estimator α̂(·) is assessed by the square root of the average squared errors (RASE),
RASE = {M−1∑v=1M ‖α̂(uv) − α(uv)‖2}1/2,  (26)
where uv, v = 1, ⋯, M are the grid points where the function is evaluated. In our simulation, M = 300 is used.
To assess the variable selection performance, we use "C" to denote the average number of zero regression coefficients correctly estimated as zero and "IC" to denote the average number of nonzero regression coefficients erroneously set to zero. The closer "C" is to the number of true zero coefficients in the model and the closer "IC" is to zero, the better the variable selection procedure performs.
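A small R sketch of these assessment measures (with E(XXT) in the GMSE replaced by its sample analogue) may clarify how they are computed:

```r
# A sketch of the assessment measures: GMSE for beta_hat (with E(XX^T) replaced by
# its sample analogue), RASE for alpha_hat on a grid of evaluation points, and the
# C / IC counts of correctly and incorrectly zeroed coefficients.
gmse <- function(beta_hat, beta0, X) {
  d <- beta_hat - beta0
  drop(t(d) %*% (crossprod(X) / nrow(X)) %*% d)
}
rase <- function(alpha_hat_grid, alpha0_grid) {        # M x q matrices of function values
  sqrt(mean(rowSums((alpha_hat_grid - alpha0_grid)^2)))
}
c_ic <- function(est_zero, true_zero) {                # logical vectors of estimated / true zeros
  c(C = sum(est_zero & true_zero), IC = sum(est_zero & !true_zero))
}
```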
In our simulations, we use the sample quantiles of Uij as knots and take the number of interior knots to be 3, that is, O(N1/5). This choice is consistent with the asymptotic theory in Section 3 and performs well in the simulations. For each simulated dataset, the penalized QIF estimators with the SCAD and LASSO penalty functions are computed. The tuning parameters λ1, λ2 for the penalty functions are chosen by BIC from 50 equispaced grid points in [−15, 5]. For each method, the average number of zero coefficients over the 500 simulated datasets is reported.
4.2. Study 1 (Partial Penalty)
Consider a Bernoulli response
| (27) |
where β = (2, 1.5, 0.7, 017T)T, m = 6, Xij ~ N(0, I20), α(Uij) = 0.4cos((π/2)Uij), and the Uij are drawn independently from U[0, 1]. The response Yij with a compound symmetry (CS) correlation structure is generated according to Oman [22]. In our simulation study, we consider ρ = 0.25 and 0.75, representing weak and strong correlations, respectively. In some situations, we prefer not to shrink certain components in the variable selection procedure because prior information is available; a partial penalty arises naturally in such cases. In this example, we only penalize the parametric component, i.e., the coefficient β. In this situation, the PQIF (7) becomes
Qnp(θ) = Qn(θ) + n∑l=1p pλ2(|βl|).  (28)
The variable selection results are reported in Tables 1 and 2.
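For readers who wish to reproduce a setting of this kind, the following R sketch generates data in the spirit of study 1; note that it replaces Oman's [22] construction of correlated Bernoulli responses with a simpler Gaussian-copula threshold, which only approximates the target within-cluster correlation, and all settings are illustrative.

```r
# A sketch of a Study-1-style data-generating step. The paper generates correlated
# Bernoulli responses by Oman's [22] construction; as a simpler stand-in, this sketch
# thresholds an exchangeable Gaussian copula, which only approximates the target
# within-cluster correlation. All settings are illustrative.
set.seed(1)
n <- 150; m <- 6; p <- 20; rho <- 0.25
beta0  <- c(2, 1.5, 0.7, rep(0, 17))
alpha0 <- function(u) 0.4 * cos(pi / 2 * u)

gen_subject <- function() {
  X  <- matrix(rnorm(m * p), m, p)                 # X_ij ~ N(0, I_20)
  U  <- runif(m)
  mu <- drop(plogis(X %*% beta0 + alpha0(U)))      # marginal success probabilities
  Rc <- chol((1 - rho) * diag(m) + rho)            # exchangeable (CS) correlation
  z  <- drop(crossprod(Rc, rnorm(m)))              # correlated latent normals
  Y  <- as.integer(pnorm(z) < mu)                  # Bernoulli(mu_ij) marginals
  list(Y = Y, X = X, U = U)
}
dat <- replicate(n, gen_subject(), simplify = FALSE)
```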
Table 1.
Variable selection for the parametric components under different methods.
| ρ | Method | n = 150: GMSE | C | IC | n = 200: GMSE | C | IC | n = 300: GMSE | C | IC |
|---|--------|---------------|---|----|---------------|---|----|---------------|---|----|
| ρ = 0.75 | SCAD | 0.0011 | 15.83 | 0 | 0.0006 | 16.246 | 0 | 0.0005 | 16.746 | 0 |
| ρ = 0.75 | LASSO | 0.0006 | 14.81 | 0 | 0.0005 | 15.346 | 0 | 0.0004 | 15.574 | 0 |
| ρ = 0.25 | SCAD | 0.0011 | 15.75 | 0 | 0.0006 | 16.70 | 0 | 0.0004 | 16.846 | 0 |
| ρ = 0.25 | LASSO | 0.0007 | 14.82 | 0 | 0.0006 | 14.96 | 0 | 0.0005 | 15.35 | 0 |
Table 2.
RASE of α̂(u) under different methods.
| ρ | Method | n = 150 | n = 200 | n = 300 |
|---|--------|---------|---------|---------|
| ρ = 0.75 | SCAD | 0.1920 | 0.2051 | 0.1054 |
| ρ = 0.75 | LASSO | 0.0999 | 0.0840 | 0.1064 |
| ρ = 0.25 | SCAD | 0.2449 | 0.2460 | 0.0694 |
| ρ = 0.25 | LASSO | 0.1399 | 0.1205 | 0.1033 |
Tables 1 and 2 show that the performance of the proposed variable selection approach improves as n increases: the number of correctly identified zero coefficients approaches the number of true zero coefficients in the model, and the GMSE of β̂ decreases as n increases. In addition, the RASE of α̂(u) also decreases as n increases, indicating that the estimated curve fits the true curve α(u) better as the sample size grows. Moreover, the SCAD penalty outperforms the LASSO penalty in terms of the correct variable selection rate, which reduces model uncertainty and complexity.
4.3. Study 2 (Fixed-Dimensional Setup)
In this example, we generate data from the following model:
| (29) |
where β = (2, 1.5, 0.7, 07T)T and α(u) = (α1(u), α2(u), 05T)T with α1(u) = 0.8cos((π/2)u) and α2(u) = 1.5 + u2. The covariates Xij and Zij (j = 1, ⋯, 6) come from a multivariate normal distribution with mean zero, marginal variance 1, and correlation coefficient 0.5, and Uij ~ U(0, 1). The response Yij with a compound symmetry (CS) correlation structure is generated by the same method as in study 1, and we again consider ρ = 0.25 and 0.75, representing weak and strong correlations, respectively. We generated 500 datasets for each pair of (N, ρ). The results are reported in Tables 3 and 4.
Table 3.
Variable selection for the parametric components under different methods.
| ρ | Method | n = 150: GMSE | C | IC | n = 200: GMSE | C | IC | n = 300: GMSE | C | IC |
|---|--------|---------------|---|----|---------------|---|----|---------------|---|----|
| ρ = 0.75 | SCAD | 0.0048 | 6.76 | 0 | 0.0036 | 6.846 | 0 | 0.0030 | 6.864 | 0 |
| ρ = 0.75 | LASSO | 0.0039 | 4.694 | 0 | 0.0033 | 4.766 | 0 | 0.0028 | 5.074 | 0 |
| ρ = 0.25 | SCAD | 0.0047 | 6.76 | 0 | 0.0035 | 6.718 | 0 | 0.0028 | 6.846 | 0 |
| ρ = 0.25 | LASSO | 0.0038 | 4.814 | 0 | 0.0035 | 4.98 | 0 | 0.0029 | 5.048 | 0 |
Table 4.
Variable selection for the nonparametric components under different methods.
| ρ | Method | n = 150: RASE | C | IC | n = 200: RASE | C | IC | n = 300: RASE | C | IC |
|---|--------|---------------|---|----|---------------|---|----|---------------|---|----|
| ρ = 0.75 | SCAD | 0.1696 | 4.35 | 0 | 0.1221 | 4.66 | 0 | 0.0812 | 4.83 | 0 |
| ρ = 0.75 | LASSO | 0.1932 | 4.38 | 0 | 0.1540 | 4.36 | 0 | 0.1235 | 4.57 | 0 |
| ρ = 0.25 | SCAD | 0.1636 | 4.42 | 0 | 0.1076 | 4.72 | 0 | 0.0344 | 4.85 | 0 |
| ρ = 0.25 | LASSO | 0.1982 | 4.40 | 0 | 0.1160 | 4.68 | 0 | 0.0398 | 4.76 | 0 |
Table 3 reports the variable selection results for the parametric components: the performance improves as n increases, e.g., the number of correctly identified zero coefficients (the column labeled "C") approaches the true number of zero regression coefficients in the model, while the GMSE decreases steadily as n increases. Table 4 shows that, for the nonparametric components, the performance of the proposed variable selection method is similar to that for the parametric components. As n increases, the RASE of the estimated nonparametric functions also becomes smaller, reflecting that the estimated curves fit the corresponding true curves better as the sample size increases. Moreover, the SCAD penalty outperforms the LASSO penalty in terms of the correct variable selection rate, which reduces model uncertainty and complexity.
To study the influence of a misspecified correlation structure on the proposed approach, we perform variable selection when the working correlation structure is specified to be CS and first-order autoregressive (AR-1), respectively. The results are listed in Table 5. It is known that the QIF estimator is insensitive to misspecification of the correlation structure. Table 5 shows that the proposed variable selection procedure gives similar results even when the correlation structure is misspecified, indicating that our method is robust.
Table 5.
Variable selection when the true correlation structure R is CS and n = 300.
| ρ | Working R | Method | GMSE (β) | C (β) | IC (β) | RASE (α(·)) | C (α(·)) | IC (α(·)) |
|---|-----------|--------|----------|-------|--------|-------------|----------|-----------|
| ρ = 0.75 | CS | SCAD | 0.0030 | 6.864 | 0 | 0.0812 | 4.83 | 0 |
| ρ = 0.75 | CS | LASSO | 0.0028 | 5.074 | 0 | 0.1235 | 4.57 | 0 |
| ρ = 0.75 | AR-1 | SCAD | 0.0033 | 6.856 | 0 | 0.0935 | 4.82 | 0 |
| ρ = 0.75 | AR-1 | LASSO | 0.0034 | 4.924 | 0 | 0.1230 | 4.57 | 0 |
| ρ = 0.25 | CS | SCAD | 0.0028 | 6.846 | 0 | 0.0344 | 4.85 | 0 |
| ρ = 0.25 | CS | LASSO | 0.0029 | 5.048 | 0 | 0.0398 | 4.76 | 0 |
| ρ = 0.25 | AR-1 | SCAD | 0.0030 | 6.846 | 0 | 0.0354 | 4.86 | 0 |
| ρ = 0.25 | AR-1 | LASSO | 0.0031 | 5.048 | 0 | 0.0411 | 4.75 | 0 |
4.4. Study 3 (High-Dimensional Setup)
In this example, we examine how the proposed variable selection procedure behaves in a "large n, diverging p/q" setup for longitudinal models by extending study 2 to higher dimensions. In this simulation, we take n = 300, m = 6, p = 20 = O(N1/4), and q = 10 = O(N1/4). The true coefficient vector is β = (2, 1.5, 0.7, 017T)T and α(u) = (α1(u), α2(u), 010T)T, where α1(u) and α2(u) are defined in study 2. The other settings are the same as in study 2. The results are reported in Table 6. The proposed variable selection procedure correctly identifies the true model and works well in this "large n, diverging p/q" setup.
Table 6.
Variable selection under high-dimensional setup.
| ρ | Method | GMSE (β) | C (β) | IC (β) | RASE (α(·)) | C (α(·)) | IC (α(·)) |
|---|--------|----------|-------|--------|-------------|----------|-----------|
| ρ = 0.75 | SCAD | 0.0036 | 16.664 | 0 | 0.1148 | 9.656 | 0 |
| ρ = 0.75 | LASSO | 0.0033 | 15.574 | 0 | 0.1239 | 9.546 | 0 |
| ρ = 0.25 | SCAD | 0.0034 | 16.846 | 0 | 0.1047 | 9.875 | 0 |
| ρ = 0.25 | LASSO | 0.0039 | 15.35 | 0 | 0.1138 | 9.802 | 0 |
5. Application to Infectious Disease Data
We apply the proposed method to analyze an infectious disease dataset (indon.dat), which has been analyzed by many authors, such as [16, 23–27]. In this study, a total of 275 preschool children were examined every three months for 18 months. The response is the presence of respiratory infection (1 = yes, 0 = no). The primary interest is in the relationship between the risk of respiratory infection and vitamin A deficiency (1 = yes, 0 = no).
In our study, we consider the following GPLVCM model
| (30) |
where t is age, X1 is vitamin A deficiency, X2 and X3 are the seasonal cosine and seasonal sine variables, respectively, which indicate the season when the examinations took place, X4 is gender (1 = female, 0 = male), X5 is height, X6 is stunting status (1 = yes, 0 = no), and Z1 = X52 is the square of height. The within-cluster correlation structure is assumed to be exchangeable, i.e., compound symmetric. This structure is also used in [16, 26, 27].
We apply the proposed QIF-based group SCAD variable selection procedure to the above model and identify five nonzero coefficients and one nonzero function α0(t), where β1 = 0.842, β2 = −0.685, β3 = −0.309, β4 = −0.554, and β6 = 0.966. The results are generally consistent with previous studies, but our results show that height has no significant impact on the infection rate and can be removed from the model. Figure 1 shows the baseline age function α0(t) estimated by the proposed QIF-based group SCAD, together with the estimates obtained by the unpenalized QIF and by the QIF-based SCAD with a partial penalty on β in [16], where the GPLM without the varying coefficient term is used. Figure 1 implies that the probability of having a respiratory infection increases at a very early age, then decreases steadily, and declines dramatically after 5.5 years of age. This also coincides with previous results [16, 26, 27].
Figure 1. The estimated function of age for the infectious disease data.
6. Conclusion and Discussion
We proposed a QIF-based group SCAD variable selection procedure for generalized partially linear varying coefficient models with longitudinal data. This procedure selects significant variables in the parametric components and the nonparametric components simultaneously. Under mild conditions, the estimators of the regression coefficients have the oracle property. Simulation studies indicate that the proposed procedure is effective in selecting significant variables and estimating the regression coefficients.
In this paper, we assume that the dimensions of the covariates X and Z are fixed. Study 3 in the simulations shows that the proposed approach still performs well when the dimensions p and q grow with n. In the ultrahigh-dimensional case, however, the proposed variable selection procedure may no longer work well. As a future research topic, it would be interesting to consider variable selection for generalized partially linear varying coefficient models with ultrahigh-dimensional covariates.
Acknowledgments
The research is funded by the National Natural Science Foundation of China (11571025) and the Beijing Natural Science Foundation (1182008). This support is greatly appreciated.
Appendix
A. Proofs of the Main Results
For convenience and simplicity, let C denote a positive constant that may have different values at each appearance throughout this paper and ‖A‖ denote the modulus of the largest singular value of matrix or vector A.
Let ηij = XijTβ + ZijT · Iq ⊗ B(Uij)Tγ, then μij = h(ηij). Let ηi = (ηi1, ⋯,ηim)T, μi = (μi1, ⋯,μim)T, and θ = (βT, γT)T, Yi = (Yi1, ⋯,Yim)T, Xi = (Xi1, ⋯,Xim)T.
Similarly, let Wij = B(Uij) ⊗ Iq · Zij, Pij = (XijT, WijT)T, and Wi = (Wi1, ⋯, Wim)T, Pi = (Pi1, ⋯, Pim)T = (Xi, W(Ui)); then, ηij = PijTθ, ηi = Piθ, and ∂ηij/∂θ = Pij, ∂ηi/∂θ = PiT.
Let h′(t) = dh(t)/dt, then ∂μij/∂θ = h′(ηij)Pij. Let
| (A.1) |
Then,
| (A.2) |
Proof of Theorem 1. —
Let δ = n−1/2, β = β0 + δD1, γ = γ0 + δD2, and D = (D1T, D2T)T. We first show that for any given ε > 0, there exists a large constant C such that
(A.3) Note that β0l = 0 for all l = p1 + 1, ⋯, p, and γ0k = 0 for all k = q1 + 1, ⋯, q; together with Assumption (A1) and pλ(0) = 0, we have
(A.4) By Taylor expansion and Assumption (A4), we have
(A.5) Invoking the proof of Theorem 2 in Zhang and Xue [16],
(A.6) By choosing a sufficiently large C, I1 dominates I2. Similarly, I1 dominates I3 for a sufficiently large C. Thus (A.3) holds, i.e., with probability at least 1 − ε, there exists a local minimizer θ̂ such that ‖θ̂ − θ0‖ = Op(n−1/2). Therefore, ‖β̂ − β0‖ = Op(n−1/2) and ‖γ̂ − γ0‖ = Op(n−1/2). Let Rk(u) = αk(u) − B(u)Tγ0k, where γ0k denotes the spline coefficient vector from the spline approximation to αk(·). From Assumptions (A7) and (A8) and Theorem 12.7 in [18], we get that ‖Rk(u)‖ = O(K−r). Therefore,
(A.7) Thus, we complete the proof of Theorem 1.
Proof of Theorem 2. —
By Theorem 1, in order to prove the first part of Theorem 2, we only need to show that, for any γ satisfying ‖γ − γ0‖ = Op(n−1/2) and for any βl satisfying |βl − β0l| = Op(n−1/2), l = 1, ⋯, p1, there exists a certain ϵ = Cn−1/2 such that, as n⟶∞, with probability tending to 1:
(A.8)
(A.9) These imply that the PQIF 𝒬np(β, γ) reaches its minimum at βl = 0, l = p1 + 1, ⋯, p.
Following Lemmas 3 and 4 of [16], we have
(A.10) According to (8), the expression of the derivative of the SCAD penalty, it is easy to see that limn→∞ liminfβl→0+ λ2−1p′λ2(|βl|) = 1. Together with Assumption (A10) and λ2n1/2 ≥ λminn1/2⟶∞, it is clear that the sign of (A.10) is determined by that of βl. This implies that (A.8) and (A.9) hold. Thus, we complete the proof of the first part.
Similarly, we can prove that, with probability tending to 1, γ̂k = 0 for k = q1 + 1, ⋯, q. Note that ‖B(u)‖ = O(1) and α̂k(u) = B(u)Tγ̂k; the second part of Theorem 2 follows. Thus, we complete the proof of Theorem 2.
Proof of Theorem 3. —
Let θ∗ = (β∗T, γ∗T)T and let Pi∗ = (Xi∗T, Wi∗T)T, i = 1, ⋯, n denote the covariates corresponding to θ∗. Denote and to be the first derivatives of the PQIF 𝒬np with respect to β and γ, respectively, i.e.,
(A.11) By Theorems 1 and 2, and satisfies that
(A.12) By the Taylor expansion, we have
(A.13) where the intermediate value lies between ((β0∗T, 0T)T, (γ0∗T, 0T)T) and the penalized estimator (β̂T, γ̂T)T. Applying the Taylor expansion once more, we obtain
(A.14) By Assumption (A10), p″λ2(|β0l|) = op(1). Note that p′λ2(|β0l|) = 0 as λmax⟶0; therefore, by Lemma 4 of [16] and through some calculation, we have
(A.15) where [Ω−1]lk denotes the (l, k) block of Ω−1, and
(A.16) Similarly, we have
(A.17) where . Hence,
(A.18) Following the proof of Theorem 2 in [16], we prove (16). Thus, we complete the proof of Theorem 3.
Data Availability
The data can be downloaded from https://content.sph.harvard.edu/xlin/dat/indon.dat.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Supplementary Materials
The R code presented in Word format for the real data analysis is included in the supplementary file.
References
- 1. Breiman L. Better subset regression using the nonnegative garrote. Technometrics. 1995;37(4):373–384. doi: 10.1080/00401706.1995.10484371.
- 2. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288. doi: 10.1111/j.2517-6161.1996.tb02080.x.
- 3. Fu W. J. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics. 1998;7(3):397–416. doi: 10.1080/10618600.1998.10474784.
- 4. Fan J., Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96(456):1348–1360. doi: 10.1198/016214501753382273.
- 5. Zou H., Li R. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics. 2008;36:1509–1533. doi: 10.1214/009053607000000802.
- 6. Fan J., Li R. New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. Journal of the American Statistical Association. 2004;99(467):710–723. doi: 10.1198/016214504000001060.
- 7. Fan J., Zhang W. Statistical methods with varying coefficient models. Statistics and Its Interface. 2008;1(1):179–195. doi: 10.4310/SII.2008.v1.n1.a15.
- 8. Zhao P. X., Xue L. G. Variable selection for semiparametric varying coefficient partially linear models. Statistics & Probability Letters. 2009;79(20):2148–2157. doi: 10.1016/j.spl.2009.07.004.
- 9. Xue L., Qu A., Zhou J. Consistent model selection for marginal generalized additive model for correlated data. Journal of the American Statistical Association. 2010;105(492):1518–1530. doi: 10.1198/jasa.2010.tm10128.
- 10. Qu A., Lindsay B. G., Li B. Improving generalised estimating equations using quadratic inference functions. Biometrika. 2000;87(4):823–836. doi: 10.1093/biomet/87.4.823.
- 11. Liang K. Y., Zeger S. L. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22. doi: 10.1093/biomet/73.1.13.
- 12. Qu A., Li R. Quadratic inference functions for varying coefficient models with longitudinal data. Biometrics. 2006;62(2):379–391. doi: 10.1111/j.1541-0420.2005.00490.x.
- 13. Wang L., Li H., Huang J. Z. Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. Journal of the American Statistical Association. 2008;103:1556–1569. doi: 10.1198/016214508000000788.
- 14. Bai Y., Zhu Z. Y., Fung W. K. Partial linear models for longitudinal data based on quadratic inference functions. Scandinavian Journal of Statistics. 2008;35(1):104–118. doi: 10.1111/j.1467-9469.2007.00578.x.
- 15. Tian R. Q., Xue L. G., Liu C. L. Penalized quadratic inference functions for semiparametric varying coefficient partially linear models with longitudinal data. Journal of Multivariate Analysis. 2014;132:94–110. doi: 10.1016/j.jmva.2014.07.015.
- 16. Zhang J. H., Xue L. G. Quadratic inference functions for generalized partially linear models with longitudinal data. Chinese Journal of Applied Probability and Statistics. 2017;33:417–432.
- 17. Li R., Liang H. Variable selection in semiparametric regression modeling. The Annals of Statistics. 2008;36(1):261–286. doi: 10.1214/009053607000000604.
- 18. Schumaker L. L. Spline Functions: Basic Theory. New York, NY, USA: Wiley; 1981.
- 19. Wang H., Li R., Tsai C. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94(3):553–568. doi: 10.1093/biomet/asm053.
- 20. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101(476):1418–1429. doi: 10.1198/016214506000000735.
- 21. Wang H. S., Xia Y. C. Shrinkage estimation of the varying coefficient model. Journal of the American Statistical Association. 2009;104(486):747–757. doi: 10.1198/jasa.2009.0138.
- 22. Oman S. D. Easily simulated multivariate binary distributions with given positive and negative correlations. Computational Statistics & Data Analysis. 2009;53(4):999–1005. doi: 10.1016/j.csda.2008.11.017.
- 23. Zeger S. L., Karim M. R. Generalized linear models with random effects: a Gibbs sampling approach. Journal of the American Statistical Association. 1991;86:79–86. doi: 10.1080/01621459.1991.10475006.
- 24. Diggle P. J., Liang K. Y., Zeger S. L. Analysis of Longitudinal Data. Oxford, England: Oxford University Press; 1994.
- 25. Lin X. H., Carroll R. J. Nonparametric function estimation for clustered data when the predictor is measured without/with error. Journal of the American Statistical Association. 2000;95:520–534. doi: 10.1080/01621459.2000.10474229.
- 26. Lin X. H., Carroll R. J. Semiparametric regression for clustered data using generalized estimating equations. Journal of the American Statistical Association. 2001;96(455):1045–1056. doi: 10.1198/016214501753208708.
- 27. He X., Fung W., Zhu Z. Robust estimation in generalized partial linear models for clustered data. Journal of the American Statistical Association. 2005;100(472):1176–1184. doi: 10.1198/016214505000000277.