Sparse regression for low-dimensional time-dynamic varying coefficient models with application to air quality data

Jinwen Liang; Maozai Tian

doi:10.1080/02664763.2022.2028131

. 2022 Feb 3;50(6):1378–1399. doi: 10.1080/02664763.2022.2028131

Sparse regression for low-dimensional time-dynamic varying coefficient models with application to air quality data

Jinwen Liang ^a, Maozai Tian ^a,^b,^CONTACT

PMCID: PMC10071989 PMID: 37025280

Abstract

Time dynamic varying coefficient models play an important role in applications of biology, medicine, environment, finance, etc. Traditional methods, such as kernel smoothing and spline smoothing, are popular. But explicit expressions are unavailable using these methods, and the convergence rate of coefficient function estimators is slow. To address these problems, we expand the varying component with appropriate basis functions. And then we solve a sparse regression problem via a sequential thresholded least-squares estimator. The “parameterization” leads to explicit expressions and fast computation speed. Convergence of the sequential thresholded least squares algorithm is guaranteed. The asymptotic distribution of the coefficient function estimator is derived under certain assumptions. Our simulation shows the proposed method has higher precision and computing speed. Finally, our proposed method is applied to the study of PM $_{2.5}$ concentration in Beijing. We analyze the relationship between PM $_{2.5}$ and other impact factors.

Keywords: Sparse regression, varying coefficient models, basis expansion

1. Introduction

There has been much research on varying coefficient models since they were proposed by [9]. Varying coefficient model is an extension of the classical linear model. Their objective is to examine how regression coefficients change over time or other covariates. Park et al. [16] gave an overview of methodological and theoretical developments of earlier studies. To construct a varying coefficient model, one strategy is to let all regression coefficients depend on a mutual covariate. Zhang and Fan [25] proposed a two-step local polynomial least squares estimation procedure. Chiang et al. [5] proposed a componentwise smoothing spline method. The other is to let different regression coefficients depend on different covariates. Marginal integration technique [13] and smooth backfitting technique [14] are used frequently.

The problem, how a covariate $X_{j}$ affects response Y over time in the dynamical longitudinal analysis, is of interest. Varying coefficient models are useful and play an important role to solve the problem. The simplest functional regression model is

Y (t) = β_{1} (t) X_{1} (t) + \dots + β_{p} (t) X_{p} (t) + ε (t),

(1)

where $β_{j} s (j = 1, \dots, p)$ are unknown coefficient functions and ε is a mean zero stochastic process. Here the coefficient functions $β_{j}$ depend only on time t. Lee et al. [11] studied a varying coefficient model with the response variable dependent on the covariates as follows

Y = β_{1} (X_{1}) Z_{1} + \dots + β_{p} (X_{p}) Z_{p} + σ (X, Z) ε,

(2)

where $X = (X_{1}, \dots, X_{p})$ , $Z = (Z_{1}, \dots, Z_{p})$ , $E (ε | X, Z) = 0$ , and $Var (ε | X, Z) = 1.$ Zhang et al. [26] proposed the following model

Y (t) = β_{1} (t, Z_{1} (t)) + \dots + β_{p} (t, Z_{d} (t)) + ε (t),

(3)

and a new backfitting algorithm was developed to estimate the unknown regression functions with the convergence rate $n^{1 / 3}$ . Lee et al. [10] studied a new varying coefficient model

Y (t) = β_{1} (t, Z_{1} (t)) X_{1} (t) + \dots + β_{p} (t, Z_{p} (t)) X_{p} (t) + ε (t),

(4)

where ε is a stochastic process such that $E (ε | X_{1}, Z_{1}, \dots, X_{p}, Z_{p}) = 0$ . Different from model (1), the coefficient functions change with time and a covariate. Kernel smoothing technique was developed to estimate this model. And the convergence rate of the coefficient function estimators was also $n^{1 / 3}$ . Although similar to model (3), model (4) takes nonlinear interaction effects into account.

The estimation procedure of kernel smoothing depends on the selection of bandwidths. Spline smoothing methods depend on knots and smoothing parameters. And the backfitting algorithms are computationally cumbersome. Furthermore, varying coefficient models are semiparametric and have bad interpretability. We are likely to estimate such models by deep learning methods [1] due to their ability to mine nonlinear structures hidden in the data. However, we will not develop this point here for lack of interpretability.

Brunton et al. [3] combined sparsity-promoting and machine learning techniques to discover governing equations from noisy measurement data. Their proposed data-driven modeling framework, called sparse identification of nonlinear dynamics (short for SINDy), could identify nonlinear low-order models. Motivated by this, we propose our data-driven estimator for time-dynamic varying coefficient models. We expand functional regression coefficients globally by constructing a library of candidate basis functions. This is a basis expansion method. The basis functions are chosen more flexibly based on domain knowledge. Then, the time-dynamic varying coefficient model becomes a sparse linear regression model. One advantage is that an explicit functional expression may be found with strong interpretability. Even though the constructed functions are not in the correct functional space in some cases, the proposed procedure can still obtain an approximate estimator.

We can use regularization techniques, such as LASSO [21], to fit a sparse linear regression model. There is ample literature on high dimensional linear regression, see [17] for an overview. Besides, [3] proposed a sequentially least squares algorithm to compute the sparse regression problem. Zhang and Schaeffer [22] provided some theoretical results on the behavior and convergence of SINDy. In further study, [18] found that LASSO performed poorly with highly correlated data. And they approximated the problem to a ridge regression problem with hard thresholding, called sequential threshold ridge regression (STRidge). We use sequential thresholded least squares estimator to estimate the varying functional regression coefficients in this research.

This paper is organized as follows. Section 2 introduces time dynamic varying coefficient models and our proposed estimation procedure. $\sqrt{n}$ consistency and some other theoretical properties are shown in Section 3. Simulation studies are conducted to verify the efficiency of this approach in Section 4. We analyze the air quality data set using our method in Section 5. Finally, we draw some conclusions and directions for future research in Section 6.

2. Model

In this section, we take the model (4) for an example to introduce our estimation procedure, then extend it to model (1) and (3). The observed response and covariates in the longitudinal setting satisfy the model

Y (T) = β_{1} (T, Z_{1} (T)) X_{1} (T) + \dots + β_{p} (T, Z_{p} (T)) X_{p} (T) + ε (T),

(5)

where T denotes the random variable that represents the independent time points allowed to be different for different subjects, the processes $X = (X_{1}, \dots, X_{p})^{⊤}$ , $Z = (Z_{1}, \dots, Z_{p})^{⊤}$ , and Y are observed, the coefficients $β_{j} s (j = 1, \dots, p)$ are bivariate functions and vary with time T. Some terms, such as polynomials, often appear in $β_{j}$ in practice. If we can construct possible basis functions, the bivariate function $β_{j}$ will expand in a linear combination of the basis functions.

2.1. Sparse identification of nonlinear dynamics

Sparse identification of nonlinear dynamics (SINDy) is developed for discovering ordinary differential equations and partial differential equations in engineering. Let's have a brief review of sparse identification of nonlinear dynamics. Consider the dynamical system of some time series data $u (t)$ as follows

\dot{u} = f (u) .

SINDy combines sparsity regression and nonlinear function libraries to identify a nonlinear and dynamical system from time-series data. The main idea is to convert the nonlinear model identification problem to a linear system

A ω = b,

(6)

where the column full rank matrix $A \in R^{r \times q} (r \geq q)$ is a data-driven dictionary whose columns are candidate functions of the given data u, the unknown vector $ω \in R^{q}$ represents the coefficients of the selected terms in the governing equation f, and the vector $b \in R^{r}$ is an approximation to the first-order time derivative $\dot{u}$ . A sparse vector ω which approximately solves (6) is generated by the following iterative scheme:

\begin{aligned} S^{k} & = {1 \leq j \leq q : | ω_{j}^{k} | \geq λ}, \end{aligned}

(7)

\begin{aligned} ω^{k + 1} & = \arg min_{ω \in R^{q} : s u p p (x) \subset S^{k}} ‖ A ω - b ‖_{2}^{2}; \end{aligned}

(8)

where $λ > 0$ is a thresholding parameter, and $s u p p (ω)$ is the support set of ω. The calculating process is illustrated in Algorithm 1 [3].

2.1.

2.2. Data-Driven estimator for dynamic varying coefficient models

We rewrite the model (5) as follows

y^{i j} (t^{i j}) = β_{1} (t^{i j}, z_{1}^{i j} (t^{i j})) x_{1}^{i j} (t^{i j}) + \dots + β_{p} (t^{i j}, z_{p}^{i j} (t^{i j})) x_{p}^{i j} (t^{i j}) + ε (t^{i j}),

(9)

where $i = 1, \dots, M$ , $j = 1, \dots, M_{i}$ . Let y be the matrix

(\begin{array}{ccc} y^{11} & \dots & y^{1 M_{1}} \\ ⋮ & ⋱ & ⋮ \\ y^{M 1} & \dots & y^{M M_{M}} \end{array}) .

We vectorize the response y to be $(y^{11}, \dots, y^{1 M_{1}}, \dots, y^{M 1}, \dots, y^{M M_{M}})^{⊤}$ , let $M_{l} = (Z_{l}, T), l = 1, \dots, p$ , a library of candidate nonlinear functions $Θ (M_{l})$ can be constructed as follows

Θ (M_{l}) = [\begin{array}{ccc} f_{1} (M_{l}), & \dots & , f_{m} (M_{l}) \end{array}],

(10)

where $f_{1}, \dots, f_{m}$ are univariate functions or bivariate functions from $R^{2}$ to $R$ . These functions can be chosen with prior knowledge. Here, prior information can be viewed as domain knowledge. We can cooperate with domain experts or search the literature to know prior knowledge. For example, the right-hand side of ordinary differential equations often appears polynomials, then we can choose polynomials as candidate functions. If we want to discover the dynamic transmission model of COVID-19 [12], SIR (susceptible, infective, and recovered), SEIR (susceptible, exposed, infective, and recovered), SEIJR (susceptible, exposed, infective, diagnosed, and recovered), and SIRD (susceptible, infective, recovered and dead) models, which are designed for infectious diseases, can be regarded as our prior knowledge. We select polynomials in this case. Even though the constructing functions in some cases are not in the correct functional space, the proposed procedure can still gain an approximate estimator.

Remark 2.1

Any continuous function can be well approximated by either polynomial base or trigonometric base. It is also available to include both in the library. We aim to get a global functional expression of the model of interest. Basis functions, such as B-spline functions, cannot be used in our research. Because B-spline methods recover data interval by interval, leading to a local solution.

We take Example 4.1 in Section 4 [10] for an example, three varying coefficients are specified as follows

\begin{aligned} β_{1} (t, z_{1}) & = t (- {(z_{1} - 0.5)}^{2} + 2 z_{1} - 0.5 t), \\ β_{2} (t, z_{2}) & = 2 (z_{2} - 0.5) (t - z_{2} + 3), \\ β_{3} (t, z_{3}) & = 3.5 (- z_{3} + (t - 0.5)^{2} + 0.4) . \end{aligned}

(11)

An equivalent formulation of (11) is

\begin{aligned} β_{1} (t, z_{1}) & = - z_{1}^{2} t + 3 z_{1} t - 0.5 t^{2} - 0.25 t, \\ β_{2} (t, z_{2}) & = 2 (- z_{2}^{2} + 3.5 z_{2} + z_{2} t - 0.5 t - 1.5), \\ β_{3} (t, z_{3}) & = 3.5 (- z_{3} + t^{2} - t + 0.65), \end{aligned}

it is interesting to find each $β_{j}$ is polynomial. Then

\begin{aligned} y (t) & = (- z_{1}^{2} t + 3 z_{1} t - 0.5 t^{2} - 0.25 t) x_{1} (t) + 2 (- z_{2}^{2} + 3.5 z_{2} + z_{2} t - 0.5 t - 1.5) x_{2} (t) \\ + (3.5 (- z_{3} + t^{2} - t + 0.65)) x_{3} (t) + ε (t) . \end{aligned}

If each varying coefficient is expanded in three-order polynomial functions, we get

Θ (M_{i}) = [\begin{array}{cccccccccc} 1 & T & Z_{i} & T^{2} & T Z_{1} & Z_{i}^{2} & T^{3} & T^{2} Z_{i} & T Z_{i}^{2} & Z_{i}^{3} \end{array}] .

Thus the original varying coefficient model now becomes a 30-dimensional linear sparse regression model. The true coefficients are illustrated in Table 1.

Table 1.

Basis functions and corresponding coefficients in Example 4.1.

$Θ (M)$	$β_{1}$	$β_{2}$	$β_{3}$
1	0	$-$ 3	2.275
t	$-$ 0.25	$-$ 1	$-$ 3.5
z	0	7	$-$ 3.5
tt	$-$ 0.5	0	3.5
tz	3	2	0
zz	0	$-$ 2	0
ttt	0	0	0
ttz	0	0	0
tzz	$-$ 1	0	0
zzz	0	0	0

Open in a new tab

Remark 2.2

We assume the constructed library of candidate functions contains true function $β_{j} (T, Z_{j})$ . It will be feasible as long as the set of basis functions can have more terms than $β_{j} (T, Z_{j})$ . We tend to pick a moderately large set. The computation cost will be expensive if the set is too large. Basis functions should be three-order or greater than three-order in (11). The higher the order is, the harder it is for computation.

We assume true functions are a subset of the functions in (10), then model (5) is

y^{i j} (t^{i j}) = Θ (M_{1}^{i j}) \circ x_{1}^{i j} (t^{i j}) ψ_{1}^{*} + \dots + Θ (M_{p}^{i j}) \circ x_{p}^{i j} (t^{i j}) ψ_{p}^{*} + ε (t^{i j}),

where $ψ_{1}^{*}, \dots, ψ_{p}^{*} \in R^{m}$ , and ° denotes a new product

Θ (M_{l}^{i j}) \circ x_{l}^{i j} (t^{i j}) = [\begin{array}{ccc} f_{1} (M_{l}^{i j}) x_{l}^{i j} (t^{i j}), & \dots & , f_{m} (M_{l}^{i j}) x_{l}^{i j} (t^{i j}) \end{array}] .

(12)

Let $θ_{l} = (Θ (M_{l}^{11}) \circ x_{l}^{11} (t^{11}), \dots, Θ (M_{l}^{1 M_{1}}) \circ x_{l}^{1 M_{1}} (t^{1 M_{1}}), \dots, Θ (M_{l}^{M 1}) \circ x_{l}^{M 1} (t^{M 1}), \dots, Θ (M_{l}^{M M_{M}}) \circ x_{l}^{M M_{M}} (t^{M M_{M}}))^{⊤}, l = 1, \dots, p$ , $Θ = (θ_{1}, \dots, θ_{p})$ , $Ψ^{*} = ({ψ_{1}^{*}}^{⊤}, \dots, {ψ_{p}^{*}}^{⊤})^{⊤}$ , and $ε = (ε_{11}, \dots, ε_{1 M_{1}}, \dots, ε_{M 1}, \dots, ε_{M M_{M}})^{⊤}$ , we rewrite (9) as

y = Θ Ψ^{*} + ε,

(13)

where y is a vector of $(M_{1} + \dots + M_{M})$ observed outcomes, Θ is an $(M_{1} + \dots + M_{M}) \times m p$ matrix of predictors, $Ψ^{*} = (ψ_{1}^{*}, \dots, ψ_{m p}^{*})^{⊤}$ is the unknown parameter of interest, and ε is a vector of $i . i . d .$ additive noise with mean 0 and variance $σ^{2}$ .

As estimates for $Ψ^{*}$ , we use sequential thresholded least squares estimator

\begin{aligned} \hat{Ψ} = \underset{Ψ \in R^{m p \times 1}}{\arg min} {‖ y - Θ Ψ ‖}_{2}^{2}, \\ s . t . | Ψ_{l} | > λ, l = 1, \dots, m p . \end{aligned}

(14)

where λ is a tuning parameter, and the calculating process is illustrated in Algorithm 2. Refer to [3] for more details.

2.2.

For the tuning parameter λ, we can use k-fold cross-validation or AIC criteria [2] to determine it. Sequential thresholded least squares estimator in (14) is equivalent to the $ℓ_{0}$ penalized least squares as follows

\underset{Ψ \in R^{m p}}{\arg min} ‖ Θ Ψ - y ‖_{2}^{2} + λ^{2} ‖ Ψ ‖_{0},

where $λ^{2}$ is much smaller when λ is a small value, which has little effect on the results. We fix λ to a small value, such as 0.05, in our simulation. The results do not change much when λ is a small value.

Remark 2.3

We use sequential thresholded least squares estimator for the reason it is equivalent to $ℓ_{0}$ penalized regression. $ℓ_{0}$ penalized least squares estimator penalizes the number of variables in a regression model. It has an advantage over other penalties for sparse regressions in nature. More sparse solutions can be obtained using $ℓ_{0}$ penalty. There is an enormous amount of literature studying sparse regression in linear models. STRidge in [18] is one of them developed for sparse linear regression models. The calculation procedure is similar between sequential thresholded least squares and STRidge. But STRidge needs to specify one more hyperparameter.

2.3. Estimator of non-structured varying coefficient models

The estimation procedure of (4) is not difficult when we expand (4) in an appropriate basis. But it is hard to decide which covariate interacts with both time and another covariate arises. This problem won't occur in (1) and (3). To deal with it, we consider the following model

Y (T) = m (T, Z (T), X (T)) + ε (T),

(15)

where $m (\cdot)$ is an unknown function and ε is a stochastic process such that $E (ε | X_{1}, Z_{1}, \dots, X_{p}, Z_{p}) = 0$ . We expand the model (15) with appropriate basis functions. An explicit expression can still be obtained though there is severe sparsity of the basis functions. Furthermore, it would be helpful for us to decide which covariate interacts with time and another covariate if we just use three-order polynomials as basis functions. This is a pre-selection process. Then we update the estimation in model (4) using Algorithm 2.

3. Theoretical properties

In the work of [22], they showed the SINDy algorithm could find local minimizers of (16). Thus, it has theoretical guarantees summarized below in the setting of our dynamic varying coefficient models.

Lemma 3.1

Let $n = M_{1} + \dots + M_{M},$ $Θ \in R^{n \times m p}$ with $‖ Θ ‖_{2} = 1, Ψ \in R^{m p},$ and $λ \geq 0$ . Assume that $n \geq m p,$ ${\hat{Ψ}}^{(k)}$ is the sequence generated by Algorithm 2. Define the objective function F by

$F (Ψ) := ‖ Θ Ψ - y ‖_{2}^{2} + λ^{2} ‖ Ψ ‖_{0}, Ψ \in R^{m p} .$ (16)

We have

${\hat{Ψ}}^{(k)}$ converges to a fixed point of the iterative scheme defined by (16) at most mp steps;

a fixed point of the scheme is a local minimizer of F;

a global minimizer of F is a fixed point of the scheme;

the inequality ${\hat{Ψ}}^{(k + 1)} < {\hat{Ψ}}^{(k)}$ holds unless the iterations are stationary.

Proof.

For the proof, we refer the reader to [22].

In (13), let $S_{0} = {j : Ψ_{j}^{*} \neq 0, j = 1, \dots, m p}$ be the true model with size $d_{0}$ , $| S_{0} | = d_{0}$ , where $| S_{0} |$ denotes the cardinality of $S_{0}$ . For $S \subset {1, \dots, m p}$ , let $Ψ_{S} = (Ψ_{i})_{j \in S} \in R^{| S |}$ be the $| S |$ dimensional sub-vector of Ψ indexed by S, and let $Θ_{S}$ be the $n \times | S |$ matrix obtained from Θ by extracting columns corresponding to S. If the estimate $\hat{S} = {j : {\hat{Ψ}}_{j} \neq 0, j = 1, \dots, m p}$ of $S_{0}$ satisfies $P (\hat{S} \neq S_{0}) \to 0$ as $n, m p \to \infty$ , selection consistency is obtained. Then we can say SINDy consistently selects the correct model $S_{0}$ .

By Lemma 3.1, the solution of SINDy algorithm equals to $ℓ_{0}$ penalized least squares solution. The $ℓ_{0}$ penalty directly penalizes the number of nonzero components in (16), which is appealing in sparse regression. Though $ℓ_{0}$ penalized least squares has a good property, it is discontinuous and the computation complexity is NP-hard. To overcome such difficulty, many researchers [4,7,8] suggested to relax the $ℓ_{0}$ regularization and considered the $ℓ_{1}$ regularization instead, for example, the LASSO [21]. Dicker et al. [6] proposed the seamless- $ℓ_{0}$ (SELO) penalty, a smooth function on $[0, 1)$ , very close to the $ℓ_{0}$ penalty. They also showed the model selection consistency and asymptotical normality. Applying to our time dynamic varying coefficient models, Lemma 3.2 holds.

Lemma 3.2

Assume that true basis functions of each coefficient are a subset of the functions in (10) and the size of library m is fixed, then under

condition 1: $n \to \infty$ and $m p σ^{2} / n \to 0$ ;

condition 2: $ρ \sqrt{n / (m p σ^{2})} \to \infty,$ where $ρ = min_{l \in S} ‖ Ψ_{l} ‖$ ;

condition 3: There exist positive constants $r_{0}, R_{0} > 0$ such that $r_{0} \leq λ_{min} (n^{- 1} Θ^{⊤} Θ) < λ_{max} (n^{- 1} Θ^{⊤} Θ) < R_{0},$ where $λ_{min} (n^{- 1} Θ^{⊤} Θ)$ and $λ_{max} (n^{- 1} Θ^{⊤} Θ)$ are the smallest and largest eigenvalues of $n^{- 1} Θ^{⊤} Θ,$ respectively;

condition 4: $λ = O (1),$ $λ \sqrt{n / (m p σ^{2})} \to \infty;$

we have: $lim_{n \to \infty} P ({l; {\hat{Ψ}}_{l} \neq 0} = S_{0}) = 1$ .

Proof.

For the proof we refer the reader to [6].

According to Lemma 3.2, the SINDy algorithm in (16) approximately selects the true model with probability 1. Shen et al. [19] gave an error bound for a global minimizer of (16) using $ℓ_{0}$ penalty. Lemma 3.3 can be obtained for our time dynamic varying coefficient models.

Lemma 3.3

Let ${\hat{Ψ}}^{ℓ_{0}} = ({\hat{Ψ}}_{\hat{S}}^{ℓ_{0}}, 0)^{⊤}$ and ${\hat{Ψ}}^{o l} = ({\hat{Ψ}}_{S_{0}}^{o l}, 0_{S_{0}^{c}})^{⊤}$ under (16) with $α > 1,$ then

$\begin{aligned} P ({\hat{Ψ}}^{ℓ_{0}} \neq {\hat{Ψ}}^{o l}) & \leq 4 \exp (- (\frac{n C_{min}^{*}}{18 σ^{2}} - (α + 1) \log (m p + 1) - \frac{λ^{2}}{2 σ^{2}})) \\ + 4 \exp (- (\frac{(α - 1) λ^{2}}{3 α σ^{2}} - (\frac{1}{α} + 1) \log (m p + 1) - \frac{2}{3})), \end{aligned}$

where

$C_{min}^{*} \equiv min_{{Ψ_{S} : S \neq S_{0}, | S | \leq α d_{0}}} \frac{1}{n max (| S_{0} ∖ S |, 1)} ‖ Θ_{S_{0}} Ψ_{S_{0}}^{*} - Θ_{S} Ψ_{S} ‖ .$

Moreover, if $sup_{Ψ^{*} \in B_{0} (u, l)} \frac{1}{n} ‖ Θ Ψ^{*} ‖^{2} \leq c_{1} \exp (c_{2} d_{0})$ for some constant $c_{j}; j = 1, 2,$ then

$P ({\hat{Ψ}}^{ℓ_{0}} \neq {\hat{Ψ}}^{o l}) \leq \frac{e + 1}{e - 1} \exp (\frac{n}{18 σ^{2}} (C_{min}^{*} - 36 \frac{\log m p}{n} σ^{2})),$

where $B_{0} (u, l) = {Ψ : d_{0} = \sum_{j = 1}^{m p} I (Ψ_{j} \neq 0) \leq u, C_{min}^{*} (Ψ, Θ) \geq l}$ with $l = d_{1} σ^{2} \frac{\log m p}{n}$ and constant $d_{1} > 0,$ which is called an $ℓ_{0}$ -band with upper and lower radii u and l $(u > l > 0)$ .

Proof.

For the proof we refer the reader to [19].

According to Lemma 3.3, model selection consistency of (16) is clear. Now, we analyze the asymptotic normality of (16) in Theorem 3.4.

Theorem 3.4

If the model selection consistency of (16) mentioned in Lemmas 3.2 and 3.3 is guaranteed, then

$\sqrt{n} ({\hat{Ψ}}_{S_{0}} - Ψ_{S_{0}}^{*}) \to N (0, G) .$

in distribution, where $G = (Θ_{S_{0}}^{⊤} Θ_{S_{0}})^{- 1} σ^{2}$ .

Proof.

Let $L = \frac{1}{2 n} ‖ Y - Θ_{S_{0}} Ψ_{S_{0}} ‖ + λ^{2} d_{0}$ , the derivative of L to $Ψ_{S_{0}}$ is

$\frac{\partial L}{\partial Ψ_{S_{0}}} = (Ψ_{S_{0}})^{⊤} (Y - Θ_{S_{0}} Ψ_{S_{0}}),$

and solve $\frac{\partial L}{\partial Ψ_{S_{0}}} = 0$ , it is the ordinary least squares solution

${\hat{Ψ}}_{S_{0}} = (Θ_{S_{0}}^{⊤} Θ_{S_{0}})^{- 1} Θ_{S_{0}}^{⊤} Y .$

By basic calculus,

$\begin{aligned} {\hat{Ψ}}_{S_{0}} & = (Θ_{S_{0}}^{⊤} Θ_{S_{0}})^{- 1} Θ_{S_{0}}^{⊤} Y \\ = (Θ_{S_{0}}^{⊤} Θ_{S_{0}})^{- 1} Θ_{S_{0}}^{⊤} (Θ_{S_{0}} Ψ_{S_{0}}^{*} + ε) \\ = Ψ_{S_{0}}^{*} + (Θ_{S_{0}}^{⊤} Θ_{S_{0}})^{- 1} Θ_{S_{0}}^{⊤} ε . \end{aligned}$

Thus $E ({\hat{Ψ}}_{S_{0}} - Ψ_{S_{0}}^{*}) = 0$ , $V a r ({\hat{Ψ}}_{S_{0}} - Ψ_{S_{0}}^{*}) = (Θ_{S_{0}}^{⊤} Θ_{S_{0}})^{- 1} σ^{2}$ ,

$\sqrt{n} ({\hat{Ψ}}_{S_{0}} - Ψ_{S}) \to N (0, G) .$

in distribution, where $G = (Θ_{S_{0}}^{⊤} Θ_{S_{0}})^{- 1} σ^{2} .$

Denote ${\hat{S}}_{0 (i)} = {j : {\hat{Ψ}}_{j}^{*} \neq 0, j = m (i - 1) + 1, \dots, m i}$ , $i = 1, \dots, p$ , size of ${\hat{S}}_{0 (i)}$ is $d_{0 i}$ , $d_{00} = 0$ . Let $G_{i}^{*}$ be a submatrix of G,

G_{i}^{*} = G_{i j}, i = (\sum_{k = 0}^{i - 1} d_{0 k}) + 1, \dots, \sum_{k = 1}^{i} d_{0 k}, j = (\sum_{k = 0}^{i - 1} d_{0 k}) + 1, \dots, \sum_{k = 1}^{i} d_{0 k} .

Note $M_{l} = (Z_{l}, T), l = 1, \dots, p$ , and $Θ (M_{l})$ is a map from $R^{2}$ to $R^{m}$ . Let

\begin{aligned} U_{l} & = {(M_{l})}_{h}, h = k - m (l - 1), k \in {\hat{S}}_{0 (i)}, \\ ν_{l}^{*} & = {(ψ_{l}^{*})}_{k}, k \in {\hat{S}}_{0 (i)} . \end{aligned}

By Theorem 3.4 and the transformation $β_{l} (T, Z_{l}) = M_{l}^{⊤} ψ_{l}^{*} = U_{l}^{⊤} ν_{l}^{*}$ , it is evident that ${\hat{β}}_{l} (T, Z_{l}) - β_{l} (T, Z_{l}) (l = 1, \dots, p) \to N (0, U_{l}^{⊤} G_{l}^{*} U_{l})$ in distribution. So

\sqrt{n} (\begin{matrix} {\hat{β}}_{l} (T, Z_{l}) - β_{0} (T, Z_{l}) \\ ⋮ \\ {\hat{β}}_{l} (T, Z_{l}) - β_{0} (T, Z_{l}) \end{matrix}) \to N (0, d i a g (U_{l}^{⊤} G_{l}^{*} U_{l})),

where $0$ is a $p \times 1$ vector, and $d i a g (U_{l}^{⊤} G_{l}^{*} U_{l})$ is a $p \times p$ diagonal matrix with elements $U_{l}^{⊤} G_{l}^{*} U_{l}$ , $l = 1, \dots, p$ .

Thus the convergence rate of our proposed method is $n^{\frac{1}{2}}$ , while the convergence rate of backfitting algorithms is $n^{\frac{1}{3}}$ . Our proposed method has a faster convergence rate. According to asymptotic distributions, backfitting algorithms are biased, while our proposed method is unbiased.

4. Simulation

Simulation studies are conducted in this section to check the performance of the proposed estimator. We use matlab 2019a on a server with Intel(R) core(TM) i5-10210U CPU @1.60 GHz. Our proposed method aims at time dynamic varying coefficient regression models. So we compare with methods developed for time dynamic varying coefficient regression models. We focus on two types of varying coefficient models with Case 1 in [10] and Case 2 in [26].

In Examples 4.1 and 4.2, we generate 50 pseudo samples of sizes n = 200 and 400 from the model (4) with d = 3. The numbers of time points $M_{i}$ , $1 \leq i \leq n$ , are $i . i . d .$ having a discrete uniform distribution supported on the set ${2, \dots, 9}$ . The time points $T_{i l}$ , $1 \leq i \leq n$ , $1 \leq l \leq M_{i}$ , are sampled from $U (0, 1)$ , the uniform distribution on the unit interval $[0, 1]$ , independently across i and l. The values of the linear covariates $X_{j}$ at $T_{i l}$ are obtained according to

\begin{array}{l} X_{1}^{i l} = T_{i l} + V_{1 i} + V_{2 i} (T_{i l} - 0.5) + ζ_{1}^{i l}, \\ X_{2}^{i l} = {(T_{i l} - 0.5)}^{3} + V_{3 i} + V_{4 i} T_{i l} + ζ_{2}^{i l}, \\ X_{3}^{i l} \equiv 1, \end{array}

where V's and ζ's are random variables to be specified below. The values of the nonlinear covariates $Z_{j}$ at $T_{i l}$ are

\begin{array}{l} Z_{1}^{i l} = \exp (T_{i l} - 5) - 0.2 W_{1 i} + 0.1 W_{2 i} (T_{i l} - 0.5) + η_{1}^{i l}, \\ Z_{2}^{i l} = 0.5 {(T_{i l} - 0.5)}^{3} + 0.2 W_{3 i} + 0.1 W_{4 i} (T_{i l} - 0.5) + η_{2}^{i l}, \\ Z_{3}^{i l} = η_{3}^{i l}, \end{array}

where W's and η's are random variables to be specified below. The random vectors $(V_{1 i}, V_{2 i}, V_{3 i}, V_{4 i})$ are $i . i . d .$ $N (0, Σ_{X})$ with

Σ_{X} = (\begin{matrix} 0.07 & - 0.02 & 0.04 & - 0.02 \\ - 0.02 & 0.03 & - 0.02 & 0.01 \\ 0.04 & - 0.02 & 0.07 & - 0.02 \\ - 0.02 & 0.01 & - 0.02 & 0.03 \end{matrix}),

and $(W_{1 i}, W_{2 i}, W_{3 i}, W_{4 i})$ are $i . i . d .$ $N (0, Σ_{Z})$ with

Σ_{Z} = (\begin{matrix} 0.04 & - 0.02 & 0.01 & - 0.02 \\ - 0.02 & 0.04 & - 0.02 & 0.01 \\ 0.01 & - 0.02 & 0.04 & - 0.02 \\ - 0.02 & 0.01 & - 0.02 & 0.04 \end{matrix}) .

The random variables $ζ_{j}^{i l}$ are $i . i . d .$ $N (0, {0.3}^{2})$ , and $η_{j}^{i l}$ are $i . i . d .$ $U (0, 1)$ independent of $ζ_{j}^{i l}$ . They are sampled also independent of V's and W's. The errors $ϵ^{i l}$ in $ε^{i l} = ϵ^{i l} + δ^{i l}$ are generated by $ϵ^{i l} = R_{i} (\cos (2 π T^{i l}) + \sin (2 π T^{i l}))$ with $R_{i}$ being $i . i . d .$ $N (0, {0.5}^{2})$ , and $δ^{i l}$ are $i . i . d .$ $N (0, {0.1}^{2})$ independent of $R_{i}$ .

We consider two regression functions specified below.

Example 4.1

$\begin{aligned} β_{1} (t, z_{1}) & = t (- {(z_{1} - 0.5)}^{2} + 2 z_{1} - 0.5 t), \\ β_{2} (t, z_{2}) & = 2 (z_{2} - 0.5) (t - z_{2} + 3), \\ β_{3} (t, z_{3}) & = 3.5 (- z_{3} + (t - 0.5)^{2} + 0.4) . \end{aligned}$

Example 4.2

$\begin{aligned} β_{1} (t, z_{1}) & = 2 (- \sin (π t) + \cos (0.5 π z_{1})) \\ = - 2 \sin (π t) + 2 \cos (0.5 π z_{1})), \\ β_{2} (t, z_{2}) & = 2 (\sin (π t) \cdot \cos (π z_{2})), \\ β_{3} (t, z_{3}) & = 3.5 (- z_{3} + (t - 0.5)^{2} + 0.4) \\ = 3.5 (- z_{3} + t^{2} - t + 0.65) . \end{aligned}$

We use basis functions as illustrated in Table 1 in Example 4.1. And we add sine, cosine basis functions for each $β_{i} (i = 1, 2, 3)$ in Example 4.2 as follows $(\sin (0.5 π T), \sin (π T), \sin (1.5 π T), \sin (0.5 π Z_{i}), \sin (π Z_{i}), \sin (1.5 π Z_{i}), \cos (0.5 π T), \cos (π T), \cos (1.5 π T), \cos (0.5 π Z_{i}), \cos (π Z_{i}), \cos (1.5 π Z_{i})) .$ It can be seen the set of constructing candidate basis functions consists of the set of true functions in Example 4.1. But in Example 4.2, it is not the situation, so we can only give approximate results.

Figures 1 and 2 depict the true and the estimated coefficient functions for sample size n = 400 averaged over the 50 pseudo samples using our proposed method. The estimated surfaces in the figures suggest that our method estimates the true coefficients well. We compare our proposed method with that of [10], denoted by backfitting algorithm 1. We compute the mean integrated squared errors (MISE), the integrated squared biases (ISB), the integrated variances (IV), the number of iterations, and the computation time of the estimators. The formulas of $M I S E_{j}$ , $I S B_{j}$ , and $I V_{j}$ , j = 1, 2, 3, are given by

\begin{aligned} {MISE}_{j} & = E \int {({\hat{β}}_{j} (t, z_{j}) - β_{j} (t, z_{j}))}^{2} d z_{j} d t \\ = \int {(E {\hat{β}}_{j} (t, z_{j}) - β_{j} (t, z_{j}))}^{2} d z_{j} d t + \int E {({\hat{β}}_{j} (t, z_{j}) - E {\hat{β}}_{j} (t, z_{j}))}^{2} d z_{j} d t \\ \equiv I S B_{j} + I V_{j}, j = 1, 2, 3, \end{aligned}

and $MISE = {MISE}_{1} + {MISE}_{2} + {MISE}_{3}$ . The BFA $^{1}$ depends on the selected bandwidths. We select the optimal bandwidths $h_{k}$ , $0 \leq k \leq 3$ , that minimize the asymptotic mean integrated squared error (AMISE), where $A M I S E = \sum_{k = 1}^{3} A M I S E_{k}$ . For a given $h_{0}$ , we choose $h_{k} (h_{0})$ , dependent on $h_{0}$ , which minimizes $A M I S E_{k}$ . Denote such minima by $A M I S E_{k} (h_{0})$ . Then, we choose the optimal $h_{0}$ which minimizes $\sum_{k = 1}^{3} A M I S E_{k}$ . The optimal $h_{k}, 1 \leq k \leq 3$ , are correspondingly obtained. We use Epanechnikov kernel, the convergence criteria for $1 \leq k \leq d$ is

\int \int (β_{k}^{(r)} (t, z_{k}) - β_{k}^{(r - 1)} (t, z_{k}))^{2} d t d z_{k} < ϵ,

where $β_{k}^{(r)} (t, z_{k})$ is the tth iteration result of true $β_{k} (t, z_{k})$ , the threshold value ϵ is 0.00001. We summarize values of bandwidths in Figure 3.

Figure 1. — The lower panels are the true coefficient functions, and the upper panels are their estimates for n = 400 averaged over 50 pseudo samples in Example 4.1.

Figure 2. — The lower panels are the true coefficient functions, and the upper panels are their estimates for n = 400 averaged over 50 pseudo samples in Example 4.2.

Figure 3. — Values of bandwidth in Examples 4.1 and 4.2 in simulations.

Remark 4.1

The convergence threshold is a number close to zero in the numerical calculation, such as $10^{- 5}$ and $10^{- 7}$ . It depends on researcher's tolerence. Sometimes the convergence threshold is also called the tolerence. Researchers give threshold values without any justification in most cases. For example, the convergence threshold in [10] is $10^{- 5}$ ; the convergence threshold in [26] is $10^{- 4}$ . We set the threshold to $10^{- 5}$ considering the calculation time and accuracy. Higher accuracy requires more computation time. As we mentioned before, the convergence rate of backfitting algorithms is $n^{1 / 3}$ . Even though we set the threshold to $10^{- 20}$ , it is not going to make a qualitative difference.

Tables 2 and 3 summarize the results. The values of MISE, ISB, and IV decrease as the sample size n increases. Our method is higher in accuracy and more rapid in speed.

Table 2.

Comparison of the mean integrated squared errors (MISE), the integrated squared biases (ISB), the integrated variances (IV), the number of iterations, and the computing time in Example 4.1.

	n = 200		n = 400
	Our proposed method	Backfitting algorithm 1	Our proposed method	Backfitting algorithm 1
MISE $_{1}$	0.0105	0.0331	0.0033	0.0259
MISE $_{2}$	0.0209	0.0668	0.0096	0.0186
MISE $_{3}$	0.0032	0.0270	0.0024	0.0123
MISE	0.0346	0.1269	0.0153	0.0568
ISB $_{1}$	0.0019	0.0217	0.0010	0.0176
ISB $_{2}$	0.0074	0.0460	0.0038	0.0091
ISB $_{3}$	0.0013	0.0234	0.0010	0.0083
IV $_{1}$	0.0086	0.0114	0.0023	0.0083
IV $_{2}$	0.0135	0.0208	0.0058	0.0095
IV $_{3}$	0.0019	0.0036	0.0014	0.0041
Time	0.0327s	110.6247s	0.0422s	257.85s
Iterations	−	7.5200	−	7.7800

Open in a new tab

Table 3.

Comparison of the mean integrated squared errors (MISE), the integrated squared biases (ISB), the integrated variances (IV), the number of iterations, and the computing time in Example 4.2.

	n = 200		n = 400
	Our proposed method	Backfitting algorithm 1	Our proposed method	Backfitting algorithm 1
MISE $_{1}$	0.0281	0.0655	0.0102	0.0357
MISE $_{2}$	0.0781	0.1059	0.0234	0.0358
MISE $_{3}$	0.0041	0.0183	0.0009	0.0113
MISE	0.1103	0.1897	0.0345	0.0828
ISB $_{1}$	0.0106	0.0215	0.0016	0.0173
ISB $_{2}$	0.0502	0.0598	0.0105	0.0151
ISB $_{3}$	0.0007	0.0085	0.0007	0.0064
IV $_{1}$	0.0175	0.0439	0.0086	0.0184
IV $_{2}$	0.0279	0.0461	0.0129	0.0207
IV $_{3}$	0.0035	0.0098	0.0008	0.0049
Time	0.0766s	140.665s	0.1335s	330.8357s
Iterations	−	8.3800	−	8.5600

Open in a new tab

In Examples 4.3 and 4.4, two longitudinal covariates are generated by

\begin{aligned} Z_{i j 1} & = 2 T_{i j} - 0.5 + a_{i 1} + b_{i 1} T_{i j} + c_{i j 1}, \\ Z_{i j 2} & = 0.5 + 4 (T_{i j} - 0.5)^{3} + a_{i 2} + b_{i 2} T_{i j} + c_{i j 2}, \end{aligned}

where $(a_{i 1}, b_{i 1}, a_{i 2}, b_{i 2})^{⊤}$ are $i . i . d .$ having a multivariate normal distribution with $0$ mean and covariance

[\begin{matrix} 0.09 & - 0.03 & 0.045 & - 0.03 \\ - 0.03 & 0.04 & - 0.03 & 0.02 \\ 0.045 & - 0.03 & 0.09 & - 0.03 \\ - 0.03 & 0.02 & - 0.03 & 0.04 \end{matrix}],

and $c_{i j 1}$ and $c_{i j 2}$ are $i . i . d .$ $N (0, {0.4}^{2})$ and independent of $(α_{i 1}, β_{i 1}, α_{i 2}, β_{i 2})$ . We take d = 2 and consider the following two regression functions.

Example 4.3

$m (t, z) = 1 + (t - 0.5)^{2} + t z_{1}^{2} + (1 - t) z_{2}^{2}$ .

Example 4.4

$m (t, z) = 1 + (t - 0.5)^{2} + \sin (2 π t) z_{1}^{2} + (1 - \sin (2 π t)) z_{2}^{2} .$

The stochastic components $w_{i j}$ are generated by $w_{i j} = w_{i} (T_{i j}) = \sum_{q = 1}^{4} A_{i q} ϕ_{q} (T_{i j})$ , where $A_{i q}$ are independent random variables having $N (0, λ_{q})$ distribution with $λ_{q} = 1 / (q + 1)^{2}$ , $1 \leq q \leq 4$ , and

\begin{aligned} ϕ_{1} (t) & = \sqrt{2} \cos (2 π t), ϕ_{2} (t) = \sqrt{2} \sin (2 π t), \\ ϕ_{3} (t) & = \sqrt{2} \cos (4 π t), ϕ_{2} (t) = \sqrt{2} \sin (4 π t) . \end{aligned}

Then, the response data $Y_{i j} = m (T i j, Z_{i j}) + w_{i j} + e_{i j}$ are obtained, where $e_{i j}$ are i.i.d random noises having $N (0, {0.05}^{2})$ distribution.

If we use polynomials as basis functions, true coefficients in Example 4.3 are shown in Table 4. In Example 4.4, time T is expanded in third-order polynomials, sine and cosine functions. Sine and cosine functions include $(\sin (2 π T), \sin (4 π T), \sin (6 π T), \cos (2 π T), \cos (4 π T), \cos (6 π T))$ , $Z_{1}$ and $Z_{2}$ are expanded in third-order polynomials. So

Θ = (M (T), M (Z_{1}), M (Z_{2}), M (T) \circ M (Z_{1}), M (T) \circ M (Z_{2})),

where $M (T)$ consists of third-order polynomials, sine and cosine functions, $M (Z_{1})$ and $M (Z_{2})$ are third-order polynomials.

Table 4.

Basis functions and corresponding coefficients in Example 4.3.

$Θ (M)$	$μ_{0}$	$μ_{1}$	$μ_{2}$
1	1.25	0	0
t	$-$ 1	0	0
$z_{1}$	0	0	0
$z_{2}$	0	0	0
$t^{2}$	1	0	0
$t z_{1}$	0	0	0
$t z_{2}$	0	0	0
$z_{1}^{2}$	0	0	0
$z_{1} z_{2}$	0	0	0
$z_{2}^{2}$	0	0	1
$t^{3}$	0	0	0
$t^{2} z_{1}$	0	0	0
$t^{2} z_{2}$	0	0	0
$t z_{1}^{2}$	0	1	0
$t z_{1} z_{2}$	0	0	0
$t z_{2}^{2}$	0	0	$- 1$
$z_{1}^{3}$	0	0	0
$z_{1}^{2} z_{2}$	0	0	0
$z_{1} z_{2}^{2}$	0	0	0
$z_{2}^{3}$	0	0	0

Open in a new tab

Figures 4 and 5 depict the true and the estimated coefficient functions for sample size n = 400 averaged over the 50 pseudo samples using our proposed method. The estimated surfaces in the figures suggest that our method estimates the true coefficients well. We compare our proposed method with that of [26], denoted by backfitting algorithm 2. The bandwidths selection method and convergence criteria are the same as Examples 4.3 and 4.4. Values of bandwidths are illustrated in Figure 6.

Figure 4. — The lower panels are the true coefficient functions, and the upper panels are their estimates for n = 400 averaged over 50 pseudo samples in Example 4.3.

Figure 5. — The lower panels are the true coefficient functions, and the upper panels are their estimates for n = 400 averaged over 50 pseudo samples in Example 4.4.

Figure 6. — Values of bandwidth in Examples 4.3 and 4.4 in simulations.

Tables 5 and 6 summarize the results. The values of MISE, ISB, and IV decrease as the sample size n increases. Our method is higher in accuracy and more rapid in speed.

Table 5.

Comparison of the mean integrated squared errors (MISE), the integrated squared biases (ISB), the integrated variances (IV), the number of iterations, and the computing time in Example 4.3.

	n = 200		n = 400
	Our proposed method	Backfitting algorithm 2	Our proposed method	Backfitting algorithm 2
MISE $_{0}$	0.0296	−	0.0043	−
MISE $_{1}$	0.0088	0.0498	0.0029	0.0347
MISE $_{2}$	0.0076	0.0340	0.0008	0.0272
MISE	0.0164	0.0838	0.0037	0.0619
ISB $_{0}$	0.0026	−	0.0007	−
ISB $_{1}$	0.0032	0.0174	0.0005	0.0111
ISB $_{2}$	0.0009	0.0250	0.0005	0.0213
IV $_{0}$	0.0270	−	0.0036	−
IV $_{1}$	0.0056	0.0324	0.0024	0.0235
IV $_{2}$	0.0075	0.0090	0.0003	0.0059
Time	0.0048s	25.7134s	0.0062s	43.9962s
Iterations	−	4.5600	−	4.8800

Open in a new tab

Table 6.

Comparison of the mean integrated squared errors (MISE), the integrated squared biases (ISB), the integrated variances (IV), the number of iterations, and the computing time in Example 4.4.

	n = 200		n = 400
	Our proposed method	Backfitting algorithm 2	Our proposed method	Backfitting algorithm 2
MISE $_{0}$	0.0621	−	0.0077	−
MISE $_{1}$	0.0391	0.0895	0.0066	0.0692
MISE $_{2}$	0.0039	0.0222	0.0012	0.0160
MISE	0.0430	0.1117	0.0078	0.0852
ISB $_{0}$	0.0561	−	0.0023	−
ISB $_{1}$	0.0299	0.0518	0.0015	0.0496
ISB $_{2}$	0.0006	0.0162	0.0004	0.0148
IV $_{0}$	0.0059	−	0.0054	−
IV $_{1}$	0.0092	0.0377	0.0051	0.0196
IV $_{2}$	0.0033	0.0060	0.0008	0.0012
Time	0.0589s	48.2093s	0.0924s	127.2423s
Iterations	−	6.5600	−	7.2600

Open in a new tab

5. Real data analysis

The data set comes from the UCI machine learning repository, called Beijing Multi-Site Air-Quality Data Data Set [24]. It includes hourly air pollutants data from 12 nationally controlled air-quality monitoring sites. The air-quality data are collected by the Beijing Municipal Environmental Monitoring Center. Variables consist of $P M_{2.5}$ concentration $(μ g / m^{3})$ , $P M_{10}$ concentration $(μ g / m^{3})$ , ${SO}_{2}$ concentration $(μ g / m^{3})$ , ${NO}_{2}$ concentration $(μ g / m^{3})$ , CO concentration $(μ g / m^{3})$ and $O_{3}$ concentration $(μ g / m^{3})$ . The meteorological data in each air-quality site are matched with the nearest weather station from the China Meteorological Administration, including temperature (degree Celsius), pressure (hPa), dew point temperature (degree Celsius), precipitation, wind direction, and wind speed (m/s). The period is from March 1st, 2013 to February 28th, 2017. Missing data are denoted as NA. One of the interesting purposes is to determine the relationship between $P M_{2.5}$ and other impact factors.

According to domain knowledge and previous research, meteorology factors and gaseous pollutants are the main contributors to the formation process of $P M_{2.5}$ . For gaseous pollutants, ${SO}_{2}$ , ${NO}_{2}$ , and CO are selected in our models. The reason we do not consider $O_{3}$ is that [15] concluded $O_{3}$ had little effect on $P M_{2.5}$ . Besides, the correlation coefficient between $P M_{2.5}$ and $O_{3}$ is $- 0.18$ , weakly correlated. For meteorology factors, temperature and wind speed are chosen. Dew point temperature and pressure are also weakly correlated to the concentration of $P M_{2.5}$ , with correlation coefficients 0.02 and 0.01, respectively.

The original data set is from 12 nationally-controlled air-quality monitoring sites, we only analyze data from Wanshouxigong station in this study. There are 30666 hourly samples without missing values. Viewed as longitudinal data, there are n = 1461 daily observations with each day's time point $1 < M_{i} \leq 24$ during the period. We scale the interval of time point from $[0, 23]$ to $[0, 1]$ , and the concentration of CO is divided by 100. We split the data into a training set with 25000 randomly sampled data and a test set with the remaining data. We consider different types of varying coefficient models to analyze the relationship between the concentration of $P M_{2.5}$ and the selected covariates. The measure of assessment is the relative mean squared error (RMSE) given by

RMSE = \frac{\sum_{i, j} ({P M_{2.5}}_{i j} - {\hat{P M_{2.5}}}_{i j})^{2}}{\sum_{i, j} ({P M_{2.5}}_{i j} - {\bar{P M_{2.5}}}_{i j})^{2}},

It can been seen the smaller the RMSE is, the better the model fits. We do 100 simulations and calculate the mean RMSE as the final RMSE in test data.

Model 1: additive model, refer to [20].

P M_{2.5} (t) = f_{0} (t) + f_{1} ({SO}_{2}) + f_{2} ({NO}_{2}) + f_{3} (CO) + f_{4} (T) + f_{5} (W S) + n o i s e,

we use ‘gam’ in R package ‘mgcv’ to fit Model 1, the RMSE is 0.2998.

Model 2: varying coefficient model in (1)

\begin{aligned} P M_{2.5} (t) & = β_{0} (t) + β_{1} (t) ({SO}_{2}) + β_{2} (t) ({NO}_{2}) + β_{3} (t) (CO) \\ + β_{4} (t) (T) + β_{5} (t) (W S) + n o i s e, \end{aligned}

the RMSE is 0.2930 when basis functions of $β_{j}$ are three-order polynomials and sin functions.

Model 3: varying coefficient model in (3)

\begin{aligned} P M_{2.5} (t, z) & = μ_{0} (t) + μ_{1} (t, {SO}_{2}) + μ_{2} (t, {NO}_{2}) + μ_{3} (t, CO) \\ + μ_{4} (t, T) + μ_{5} (t, W S) + n o i s e, \end{aligned}

the RMSE is 0.2836 when basis functions of $μ_{j}$ are three-order polynomials and sin functions. Model 2 is a special case of Model 3. By comparing RMSE, Model 3 is better than Model 2. Figure 7 depicts the estimated surfaces of different components. The concentration of $P M_{2.5}$ increases as ${NO}_{2}$ , CO, and temperature increase. The result is consistent with the fact ${NO}_{2}$ and CO are pollutants that compose $P M_{2.5}$ . The temperature has a higher impact on the concentration of $P M_{2.5}$ at noon than in the morning and evening when the temperature is high. In contrast, it has a higher impact in the morning and evening than at noon when the temperature is low.

Figure 7. — Estimates of coefficient functions in Model 3.

Model 4: varying coefficient model in (4)

\begin{aligned} P M_{2.5} (t) & = β_{0} (t, W S (t)) + β_{1} (t, W S (t)) {SO}_{2} (t) + β_{2} (t, W S (t)) {NO}_{2} (t) \\ + β_{3} (t, W S (t)) CO (t) + β_{4} (t, W S (t)) (T (t)) + n o i s e, \end{aligned}

different from the above-mentioned models, the problems which covariate should be the linear component and which covariate should be the varying coefficient component, are unclear for Model 4. We use cubic polynomials as the library of functions for full data, and find out the coefficients of terms $w i n d s p e e d \times t \times CO$ and $w i n d s p e e d \times t \times {SO}_{2}$ are not zero. According to the result in reference [24], air quality will be better once the wind blows away $P M_{2.5}$ . Cooling is often accompanied by strong winds. This result is consistent with our life experience in Beijing. So Model 4 makes sense according to our data-driven pre-estimation result and research in [24].

If we only use cubic polynomials, the RMSE is 0.2935; if we use cubic polynomials and sine functions simultaneously, the RMSE is 0.2916; if we use second-order polynomials, the RMSE is 0.3020; if we use both second-order polynomial and sine functions, the RMSE is 0.3020. By comparing RMSE, we choose cubic polynomials and sine functions in this case. Figure 8 depicts the estimated surface of each component.

Figure 8. — Estimates of coefficient functions in Model 4.

By comparing RMSE of four models, Model 3 performs best.

6. Discussion

Designing efficient and accurate algorithms to estimate varying coefficient models is of significance as more and more data is available. We focus on different types of dynamic varying coefficient models and propose a new procedure to estimate the varying component efficiently in our research. The main idea is to expand the varying component with appropriate basis functions and then solve a sparse regression problem. Finally, we can get an explicit expression of the model under certain conditions. Furthermore, the proposed procedure can also find out an expression without model assumptions as shown in the real data analysis.

Constructing a library of functions, especially when the number of covariates is large is a challenge to us. To deal with this, we can incorporate domain knowledge and other related methods. Still taking the concentration of $P M_{2.5}$ as an example, the existing research results show meteorology factors and gaseous pollutants are the main contributors to the formation process of $P M_{2.5}$ , so the feature extraction process will depend on these two aspects. Curse of dimensionality will occur when the number of covariates is large. To handle this, before implementing the above-mentioned procedure, we can delete some unimportant variables as in the variable selection and feature screening procedure. This is based on the sparsity assumption. Sensitivity to noise is also troublesome. One technique is to use absolute loss function or check function in quantile regression to get robust estimation. And the other is to take advantage of Bayesian inference [23]. We will study robust estimation under the frame of SINDy in our future research.

Acknowledgements

We would like to thank three anonymous referees for their constructive comments and suggestions.

Funding Statement

The work was partially supported by the National Natural Science Foundation of China [grant number 11861042], and the China Statistical Research Project [grant number 2020LZ25].

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

1.Bauer B. and Kohler M., On deep learning as a remedy for the curse of dimensionality in nonparametric regression, Ann. Statist. 47 (2019), pp. 2261–2285. [Google Scholar]
2.Bozdogan H., Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions, Psychometrika 52 (1987), pp. 345–370. [Google Scholar]
3.Brunton S.L., Proctor J.L., and Kutz J.N., Discovering governing equations from data by sparse identification of nonlinear dynamical systems, Proc. Natl. Acad. Sci. U. S. A 113 (2016), pp. 3932–3937. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Cands E.J., Romberg J.K., and Tao T., Stable signal recovery from incomplete and inaccurate measurements, Commun. Pure Appl. Math 59 (2006), pp. 1207–1223. [Google Scholar]
5.Chiang C.T., Rice J.A., and Wu C.O., Smoothing spline estimation for varying coefficient models with repeatedly measured dependent variables, J. Am. Stat. Assoc 96 (2001), pp. 605–619. [Google Scholar]
6.Dicker L., Huang B., and Lin X., Variable selection and estimation with the seamless $ℓ_{0}$ penalty, Stat. Sin. 23 (2013), pp. 929–962. [Google Scholar]
7.Donoho D.L., Compressed sensing, IEEE Trans. Inf. Theor 52 (2006), pp. 1289–1306. [Google Scholar]
8.Donoho D.L., High-dimensional centrally symmetric polytopes with neighborliness proportional to dimension, Discret. Comput. Geom 35 (2005), pp. 617–652. [Google Scholar]
9.Hastie T. and Tibshirani R., Varying-coefficient models, J. R. Stat. Soc. Ser. B-Stat. Methodol 55 (1996), pp. 757–796. [Google Scholar]
10.Lee K., Lee Y.K., Park B.U., and Yang S.J., Time-dynamic varying coefficient models for longitudinal data, Comput. Stat. Data Anal. 123 (2018), pp. 50–65. [Google Scholar]
11.Lee Y.K., Mammen E., and Park B.U., Projection-type estimation for varying coefficient regression models, Bernoulli 18 (2012), pp. 177–205. [Google Scholar]
12.Liang J., Zhang X., Wang K., Tang M., and Tian M., Discovering dynamic models of covid-19 transmission, Transbound. Emerg. Dis. (2021). 10.1111/tbed.14263. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Linton O. and Nielsen J.P., A kernel method of estimating structured nonparametric regression based on marginal integration, Biometrika 82 (1995), pp. 93–100. [Google Scholar]
14.Mammen E., Linton O., and Nielsen J., The existence and asymptotic properties of a backfitting projection algorithm under weak conditions, Ann. Stat. 27 (2000), pp. 1443–1490. [Google Scholar]
15.Mei B. and Tian M., Analysis of influencing factors on pm2.5 in beijing based on spatio-temporal model, J. Appl. Statist. Manage. 37 (2018), pp. 571–586. [Google Scholar]
16.Park B.U., Mammen E., Lee Y.K., and Lee E.R., Varying coefficient regression models: A review and new developments, Int. Stat. Rev. 83 (2015), pp. 36–64. [Google Scholar]
17.Peter Bühlmann S.v.d.G., Statistics for High-Dimensional Data: Methods, Theory and Applications, Berlin, Heidelberg, Springer, 2011. [Google Scholar]
18.Rudy S.H., Brunton S.L., Proctor J.L., and Kutz J.N., Data-driven discovery of partial differential equations, Sci. Adv. 3 (2017), pp. e1602614. [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Shen X., Pan W., Zhu Y., and Zhou H., On constrained and regularized high-dimensional regression, Ann. Inst. Stat. Math 65 (2013), pp. 807–832. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Su J.G., Hopke P.K., Tian Y., Baldwin N., and Rich D.Q., Modeling particulate matter concentrations measured through mobile monitoring in a deletion/substitution/addition approach, Atmos. Environ. 122 (2015), pp. 477–483. [Google Scholar]
21.Tibshirani R., Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B-Stat. Methodol. 58 (1996), pp. 267–288. [Google Scholar]
22.Zhang L. and Schaeffer H., On the convergence of the Sindy algorithm, Multiscale Model. Simul. 17 (2019), pp. 948–972. [Google Scholar]
23.Zhang S. and Lin G., Robust data-driven discovery of governing physical laws with error bars, Proc. R. Soc. A-Math. Phys. Eng. Sci. 474 (2018), pp. 20180305. [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Zhang S., Guo B., Dong A., He J., Xu Z., and Chen S.X., Cautionary tales on air-quality improvement in Beijing, Proc. R Soc. Math. Phys. Eng. Sci. 473 (2017), pp. 20170457. [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Zhang W. and Fan J., Statistical estimation in varying coefficient models, Ann. Stat. 27 (1999), pp. 1491–1518. [Google Scholar]
26.Zhang X., Park B.U., and Wang J.L., Time-varying additive models for longitudinal data, J. Am. Stat. Assoc. 108 (2013), pp. 983–998. [Google Scholar]

[CIT0001] 1.Bauer B. and Kohler M., On deep learning as a remedy for the curse of dimensionality in nonparametric regression, Ann. Statist. 47 (2019), pp. 2261–2285. [Google Scholar]

[CIT0002] 2.Bozdogan H., Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions, Psychometrika 52 (1987), pp. 345–370. [Google Scholar]

[CIT0003] 3.Brunton S.L., Proctor J.L., and Kutz J.N., Discovering governing equations from data by sparse identification of nonlinear dynamical systems, Proc. Natl. Acad. Sci. U. S. A 113 (2016), pp. 3932–3937. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CIT0004] 4.Cands E.J., Romberg J.K., and Tao T., Stable signal recovery from incomplete and inaccurate measurements, Commun. Pure Appl. Math 59 (2006), pp. 1207–1223. [Google Scholar]

[CIT0005] 5.Chiang C.T., Rice J.A., and Wu C.O., Smoothing spline estimation for varying coefficient models with repeatedly measured dependent variables, J. Am. Stat. Assoc 96 (2001), pp. 605–619. [Google Scholar]

[CIT0006] 6.Dicker L., Huang B., and Lin X., Variable selection and estimation with the seamless $ℓ_{0}$ penalty, Stat. Sin. 23 (2013), pp. 929–962. [Google Scholar]

[CIT0007] 7.Donoho D.L., Compressed sensing, IEEE Trans. Inf. Theor 52 (2006), pp. 1289–1306. [Google Scholar]

[CIT0008] 8.Donoho D.L., High-dimensional centrally symmetric polytopes with neighborliness proportional to dimension, Discret. Comput. Geom 35 (2005), pp. 617–652. [Google Scholar]

[CIT0009] 9.Hastie T. and Tibshirani R., Varying-coefficient models, J. R. Stat. Soc. Ser. B-Stat. Methodol 55 (1996), pp. 757–796. [Google Scholar]

[CIT0010] 10.Lee K., Lee Y.K., Park B.U., and Yang S.J., Time-dynamic varying coefficient models for longitudinal data, Comput. Stat. Data Anal. 123 (2018), pp. 50–65. [Google Scholar]

[CIT0011] 11.Lee Y.K., Mammen E., and Park B.U., Projection-type estimation for varying coefficient regression models, Bernoulli 18 (2012), pp. 177–205. [Google Scholar]

[CIT0012] 12.Liang J., Zhang X., Wang K., Tang M., and Tian M., Discovering dynamic models of covid-19 transmission, Transbound. Emerg. Dis. (2021). 10.1111/tbed.14263. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CIT0013] 13.Linton O. and Nielsen J.P., A kernel method of estimating structured nonparametric regression based on marginal integration, Biometrika 82 (1995), pp. 93–100. [Google Scholar]

[CIT0014] 14.Mammen E., Linton O., and Nielsen J., The existence and asymptotic properties of a backfitting projection algorithm under weak conditions, Ann. Stat. 27 (2000), pp. 1443–1490. [Google Scholar]

[CIT0015] 15.Mei B. and Tian M., Analysis of influencing factors on pm2.5 in beijing based on spatio-temporal model, J. Appl. Statist. Manage. 37 (2018), pp. 571–586. [Google Scholar]

[CIT0016] 16.Park B.U., Mammen E., Lee Y.K., and Lee E.R., Varying coefficient regression models: A review and new developments, Int. Stat. Rev. 83 (2015), pp. 36–64. [Google Scholar]

[CIT0017] 17.Peter Bühlmann S.v.d.G., Statistics for High-Dimensional Data: Methods, Theory and Applications, Berlin, Heidelberg, Springer, 2011. [Google Scholar]

[CIT0018] 18.Rudy S.H., Brunton S.L., Proctor J.L., and Kutz J.N., Data-driven discovery of partial differential equations, Sci. Adv. 3 (2017), pp. e1602614. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CIT0019] 19.Shen X., Pan W., Zhu Y., and Zhou H., On constrained and regularized high-dimensional regression, Ann. Inst. Stat. Math 65 (2013), pp. 807–832. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CIT0020] 20.Su J.G., Hopke P.K., Tian Y., Baldwin N., and Rich D.Q., Modeling particulate matter concentrations measured through mobile monitoring in a deletion/substitution/addition approach, Atmos. Environ. 122 (2015), pp. 477–483. [Google Scholar]

[CIT0021] 21.Tibshirani R., Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B-Stat. Methodol. 58 (1996), pp. 267–288. [Google Scholar]

[CIT0022] 22.Zhang L. and Schaeffer H., On the convergence of the Sindy algorithm, Multiscale Model. Simul. 17 (2019), pp. 948–972. [Google Scholar]

[CIT0023] 23.Zhang S. and Lin G., Robust data-driven discovery of governing physical laws with error bars, Proc. R. Soc. A-Math. Phys. Eng. Sci. 474 (2018), pp. 20180305. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CIT0024] 24.Zhang S., Guo B., Dong A., He J., Xu Z., and Chen S.X., Cautionary tales on air-quality improvement in Beijing, Proc. R Soc. Math. Phys. Eng. Sci. 473 (2017), pp. 20170457. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CIT0025] 25.Zhang W. and Fan J., Statistical estimation in varying coefficient models, Ann. Stat. 27 (1999), pp. 1491–1518. [Google Scholar]

[CIT0026] 26.Zhang X., Park B.U., and Wang J.L., Time-varying additive models for longitudinal data, J. Am. Stat. Assoc. 108 (2013), pp. 983–998. [Google Scholar]

PERMALINK

Sparse regression for low-dimensional time-dynamic varying coefficient models with application to air quality data

Jinwen Liang

Maozai Tian

Abstract

1. Introduction

2. Model

2.1. Sparse identification of nonlinear dynamics

2.2. Data-Driven estimator for dynamic varying coefficient models

Remark 2.1

Table 1.

Remark 2.2

Remark 2.3

2.3. Estimator of non-structured varying coefficient models

3. Theoretical properties

Lemma 3.1

Proof.

Lemma 3.2

Proof.

Lemma 3.3

Proof.

Theorem 3.4

Proof.

4. Simulation

Example 4.1

Example 4.2

Figure 1.

Figure 2.

Figure 3.

Remark 4.1

Table 2.

Table 3.

Example 4.3

Example 4.4

Table 4.

Figure 4.

Figure 5.

Figure 6.

Table 5.

Table 6.

5. Real data analysis

Figure 7.

Figure 8.

6. Discussion

Acknowledgements

Funding Statement

Disclosure statement

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases