Author manuscript; available in PMC: 2024 Sep 30.
Published in final edited form as: Stat Med. 2023 Jun 27;42(22):3903–3918. doi: 10.1002/sim.9839

Varying-coefficients for regional quantile via KNN-based LASSO with applications to health outcome study

Seyoung Park 1, Eun Ryung Lee 1, Hyokyoung G Hong 2
PMCID: PMC11370892  NIHMSID: NIHMS2013625  PMID: 37365909

Abstract

Health outcomes, such as body mass index and cholesterol levels, are known to be dependent on age and exhibit varying effects with their associated risk factors. In this paper, we propose a novel framework for dynamic modeling of the associations between health outcomes and risk factors using varying-coefficients (VC) regional quantile regression via K-nearest neighbors (KNN) fused Lasso, which captures the time-varying effects of age. The proposed method has strong theoretical properties, including a tight estimation error bound and the ability to detect exact clustered patterns under certain regularity conditions. To efficiently solve the resulting optimization problem, we develop an alternating direction method of multipliers (ADMM) algorithm. Our empirical results demonstrate the efficacy of the proposed method in capturing the complex age-dependent associations between health outcomes and their risk factors.

Keywords: health outcome study, K-nearest neighbors, Lasso, regional quantile regression, varying-coefficients

1 |. INTRODUCTION

In recent years, health outcome research has played a crucial role in identifying disparities among different racial and ethnic groups, enabling policymakers and clinicians to make informed decisions for individuals from diverse socioeconomic backgrounds. Numerous studies have evaluated racial disparities in various health outcomes, such as body mass index (BMI), sleep duration, and cholesterol levels.1–3 BMI, in particular, has been widely used as a health risk indicator in clinical and public health research.4,5 To better understand the factors that drive BMI, researchers from various fields have explored statistical techniques to identify important predictors. For example, Huang et al6 proposed a group bridge approach for selecting risk factors of BMI, while Rehkopf et al7 used a random forest technique to rank risk factors according to their relative importance score. Gao et al8 proposed a variable selection method to identify relevant BMI risk factors, assuming that the impact of these determinants is an unknown function of categorical demographic variables.

Existing literature on BMI studies has primarily focused on modeling the relationships between risk factors and the mean BMI level from average individuals, while ignoring age or time dependency. This approach has limited insights regarding the BMI distribution in the population. In this paper, we propose a tailored statistical approach for detecting dynamic and heterogeneous associations between health outcomes and risk factors. Specifically, we adopt a varying-coefficient quantile model framework to explain associations between health outcomes and risk factors in the presence of varying effects of age.

Varying-coefficient (VC) models have gained popularity in both theoretical and practical aspects since their inception.9–14 To deal with high-dimensional variables, VC approaches have incorporated variable selection procedures.8,13,15–17 For instance, Gao et al8 considered the variable selection problem for the categorical varying-coefficient model, based on a penalized approach using group Lasso.18 Nonparametric approaches, based on basis functions, have also been widely used for estimation and variable selection in VC models.16,19–22

We consider a VC regional quantile model in this paper, which is defined as follows:

$$Y \mid T, X = X^\top \beta(T,\tau) + \epsilon(\tau) \quad \text{for } \tau \in \Delta,$$

where $P(\epsilon(\tau) \le 0 \mid X, T) = \tau$ for $\tau \in \Delta$, $\Delta \subset (0,1)$ is a quantile interval of interest, $T$ is an age index, $X$ are covariates, and $Y$ is an outcome variable. This model framework naturally arises in many real-life applications.23 Analyzing the behavior of VC over a range of quantiles is important in the field of regression analysis. When various quantile levels are of interest, a typical approach is to individually fit a quantile regression and obtain inference at each quantile level, which may result in a loss of estimation efficiency because regressions at adjacent quantile levels are expected to share similar features. In such cases, a regional quantile regression approach23,24 can be a useful alternative and may lead to a more efficient estimation procedure.

Despite the importance of both theoretical and practical aspects, there is a lack of literature on the selection and estimation of clustered patterns in the coefficient function under VC regional quantile settings. Our main objective is to detect regional clustered patterns in the regional quantile regression coefficients, β(T,τ), using K-nearest neighbors (KNN) fused Lasso. The proposed method can identify de-noised clustered patterns between the risk factors and health outcomes, select important determinants of the regional health outcome quantiles (such as the upper level of the BMI distribution), and simultaneously estimate varying coefficients across both age and quantiles of BMI.

Our work is related to Padilla et al,25 who combined fused Lasso with the KNN procedure in a mean-based regression model, and more recently, Ye and Padilla,26 who proposed a nonparametric quantile regression using KNN-based fused Lasso penalty. However, it is important to note that Ye and Padilla26 modeled the conditional quantile of the response variable as a nonparametric function of covariates at a fixed quantile level. Li and Sange27 considered the spatially clustered varying-coefficient model using the fused Lasso method, but it was based on a linear model and only allowed a simple tree structure in the graph. Yang et al28 adopted the parametric quantile regression method of Frumento and Bottai29 to analyze longitudinal BMI data. These approaches differ from our proposed method, which considers a structured nonparametric model for regional quantiles. The proposed model shares common structural information across adjacent quantile levels and may better detect patterns over both quantiles, τ, and index, T. Our work is also more general since we do not require parametric specifications for quantile coefficient functions, which is not a trivial assumption.

We summarize the key properties of the proposed method as follows: (1) The proposed approach based on quantile regression yields robust estimates despite the violation of normality assumption; (2) The conditional quantile framework allows us to explore the heteroscedastic relationship between different sublevels of the dependent variable and covariates, which cannot be captured by the standard regression approach; (3) By adopting a nonparametric approach to modeling the covariate effects under the regional quantile VC framework, our method can handle a nonlinear relationship between the dependent variable and covariates; (4) Compared to its local counterpart, the regional quantile approach provides more stable and interpretable results; (5) By leveraging the insight from the quantile KNN fused Lasso,26 our algorithm can detect underlying clustered patterns in the VC functions; and (6) The proposed optimization via the efficient alternating direction method of multipliers (ADMM) algorithm is computationally scalable since each updating step has a closed-form solution and utilizes parallel computing.

The remainder of this paper is organized as follows. In Section 2, we present the proposed VC quantile model via KNN fused Lasso method, its theoretical properties, and the ADMM algorithm. In Section 3, we evaluate the finite sample performance of the proposed methods using simulation studies. In Section 4, we apply the proposed method to two health outcome studies. Finally, Section 5 concludes the paper and discusses potential future research questions. Technical proofs, additional simulation results, and figures are provided in the Supplementary material.

2 |. VARYING-COEFFICIENT QUANTILE MODEL VIA KNN FUSED LASSO

In this section, we propose VC regional quantile regression via KNN fused Lasso, study its theoretical properties, and present an ADMM algorithm for computing it.

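In the VC regional quantile model, for units $i = 1, \ldots, n$ with a covariate vector $x_i = (x_{i1}, \ldots, x_{ip})^\top \in \mathbb{R}^p$, a response $y_i$ can be modeled as

$$y_i = \sum_{j=1}^{p} x_{ij}\, \beta_j^o(t_i, \tau) + \epsilon_i(\tau) \quad \text{for } \tau \in \Delta,$$

where $t_i \in [0,1]$ is a time index variable (for example, age in our applications), $\beta_j^o(t,\tau)$ is the underlying coefficient function for the $j$th covariate, $\Delta \subset (0,1)$ is the quantile interval of interest, and the conditional $\tau$-th quantile of the random error $\epsilon_i(\tau)$ given $(t_i, x_i)$ is zero.

We first construct the KNN graph $G$ based on $\{(t_i, \tau_i), 1 \le i \le n\}$ in the domain of the coefficient functions $\beta_j$, where the quantile levels $\tau_i$ are randomly chosen from $\Delta$, that is, $\tau_i \sim \text{uniform}(\Delta)$, which discretizes $\Delta$. Specifically, each $(t_i, \tau_i)$, for $i = 1, \ldots, n$, corresponds to a node in the graph $G$, and its edge set $E$ contains the pair $(i,k)$ for $i \ne k$ if and only if $(t_i, \tau_i)$ is among the K-nearest neighbors of $(t_k, \tau_k)$. We propose the regional quantile KNN fused Lasso (RQF) method in varying-coefficients models to estimate the coefficient function $\beta_j^o(t,\tau)$ as follows:

$$\min_{\beta_1, \ldots, \beta_p} \frac{1}{n} \sum_{i=1}^{n} \rho_{\tau_i}\Big(y_i - \sum_{j=1}^{p} x_{ij}\, \beta_j(t_i, \tau_i)\Big) + \lambda \sum_{j=1}^{p} \|H\beta_j\|_1, \tag{1}$$

where $\beta_j = (\beta_{j1}, \ldots, \beta_{jn})^\top \in \mathbb{R}^n$ with $\beta_{ji} := \beta_j(t_i, \tau_i)$ for $i = 1, \ldots, n$.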

Here, $H$ is an $|E| \times n$ oriented incidence matrix of the KNN graph $G$, and thus each row of $H$ corresponds to an edge $e \in E$.25 Specifically, if the $m$-th edge in $G$ connects the $i_m$-th and $k_m$-th nodes, then

$$H_{m,l} = \begin{cases} 1 & \text{if } l = i_m, \\ -1 & \text{if } l = k_m, \\ 0 & \text{otherwise,} \end{cases}$$

and $(H\beta_j)_m = \beta_{j,i_m} - \beta_{j,k_m}$, $\|H\beta_j\|_1 = \sum_m |\beta_{j,i_m} - \beta_{j,k_m}|$. In (1), we consider a single quantile level $\tau_i$ for each sample $i$ to reduce the computational cost. If we instead considered fixed multiple quantile levels $\tau_1, \ldots, \tau_K \in \Delta$ for every sample, as in composite quantile regression, the computation would be nearly infeasible, and a large sample size, as in our real data examples, would make it worse.
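The graph construction and the penalty term described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the helper names `knn_edges` and `incidence_matrix` are hypothetical, the neighbor search is brute force, and an edge is assumed to be kept whenever either endpoint is among the K nearest neighbors of the other.

```python
import numpy as np

def knn_edges(points, k):
    """Undirected KNN edges (i, m) with i < m, found by brute force."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)            # a node is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    edges = set()
    for i in range(len(points)):
        for m in nearest[i]:
            edges.add((min(i, int(m)), max(i, int(m))))
    return sorted(edges)

def incidence_matrix(edges, n):
    """Oriented incidence matrix H: row m has +1 at i_m and -1 at k_m."""
    H = np.zeros((len(edges), n))
    for row, (i, m) in enumerate(edges):
        H[row, i] = 1.0
        H[row, m] = -1.0
    return H

rng = np.random.default_rng(0)
n = 200
t = rng.uniform(0.0, 1.0, n)                  # age index t_i
tau = rng.uniform(0.05, 0.95, n)              # one random quantile level per unit
edges = knn_edges(np.column_stack([t, tau]), k=5)
H = incidence_matrix(edges, n)

beta_const = np.full(n, 2.0)                  # a single cluster
beta_step = np.where(tau > 0.5, 5.0, 0.0)     # two clusters split at tau = 0.5
penalty_const = np.abs(H @ beta_const).sum()  # ||H beta||_1 for the constant vector
penalty_step = np.abs(H @ beta_step).sum()    # only edges crossing tau = 0.5 contribute
```

A constant coefficient vector incurs no penalty, while a step function is penalized only through the edges that cross the jump, which is why the penalty favors clustered coefficient patterns.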

In (1), the fused Lasso penalty enforces sparsity of the difference between two edge-connected coefficients. This allows the estimation of coefficients with clustered patterns if the edge sets are selected appropriately. Using the obtained $\hat\beta_j$, we can estimate the value of the coefficient at a new point $(t,\tau)$ by averaging the estimated values over its KNN as follows:

$$\hat\beta_j(t,\tau) = \frac{1}{K} \sum_{i=1}^{n} \hat\beta_j(t_i,\tau_i)\, 1\{(t_i,\tau_i) \in N_K(t,\tau)\}, \tag{2}$$

where $N_K(t,\tau)$ is the set of KNN of $(t,\tau)$ in the training data $\{(t_i,\tau_i) : i = 1, \ldots, n\}$. Thus, it leads to smooth and locally adaptive VC estimates.
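As a small illustration of the averaging rule (2), the sketch below evaluates a fitted coefficient at new points. The function name `knn_predict` and the toy grid of fitted values are assumptions made for the example; Euclidean distance in the $(t,\tau)$ plane is also assumed.

```python
import numpy as np

def knn_predict(train_points, beta_hat, query, k):
    """Average of beta_hat over the k training nodes closest to `query`."""
    dist = np.linalg.norm(train_points - np.asarray(query), axis=1)
    neighbours = np.argsort(dist)[:k]
    return beta_hat[neighbours].mean()

# Toy fitted values on a regular grid: a step at tau = 0.5.
t_grid, tau_grid = np.meshgrid(np.linspace(0, 1, 11), np.linspace(0.05, 0.95, 10))
train_points = np.column_stack([t_grid.ravel(), tau_grid.ravel()])
beta_hat = np.where(train_points[:, 1] > 0.5, 5.0, 0.0)

upper = knn_predict(train_points, beta_hat, (0.3, 0.9), k=5)   # deep in the upper cluster
lower = knn_predict(train_points, beta_hat, (0.3, 0.1), k=5)   # deep in the lower cluster
```

Away from the jump, all K neighbors share the same fitted value, so the prediction reproduces the cluster level; near the jump, the average blends the two levels.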

Note that the tuning parameter $\lambda$ in (1) controls the number of clusters in the regression coefficients. When $\lambda = 0$, it reduces to the ordinary regional quantile regression; when $\lambda \to \infty$, RQF yields a nearly constant regression coefficient in that $\hat\beta_j(t_i,\tau_i) = \hat\beta_j(t_k,\tau_k)$ for $(i,k) \in E$. With an appropriate choice of $\lambda$, RQF produces clustered regression coefficients. In practice, we propose the following BIC to choose $\lambda$:

$$\text{BIC} = \log\Big(\sum_{i=1}^{n} \rho_{\tau_i}\Big(y_i - \sum_{j=1}^{p} x_{ij}\, \beta_j(t_i,\tau_i)\Big)\Big) + \frac{\log n}{n} \sum_{j=1}^{p} \|H\beta_j\|_0,$$

where $\|H\beta_j\|_0$ represents the number of nonzero values in $H\beta_j$.
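The BIC above can be evaluated directly from a fitted coefficient array. The following is a minimal sketch: the names `check_loss` and `bic` are hypothetical, the check loss is taken as $\rho_\tau(r) = r(\tau - 1\{r<0\})$, and the tolerance `tol` used to decide whether a difference is nonzero is an assumption, since the text does not specify one.

```python
import numpy as np

def check_loss(residual, tau):
    """Quantile check loss rho_tau(r) = r * (tau - 1{r < 0}), elementwise."""
    return residual * np.where(residual < 0, tau - 1.0, tau)

def bic(y, X, beta, tau, H, tol=1e-8):
    """log(sum_i rho_{tau_i}(r_i)) + (log n / n) * sum_j ||H beta_j||_0.

    X and beta are n x p, with beta[i, j] = beta_j(t_i, tau_i).
    """
    n = len(y)
    residual = y - (X * beta).sum(axis=1)
    fit = np.log(check_loss(residual, tau).sum())
    n_jumps = (np.abs(H @ beta) > tol).sum()   # nonzero edge differences (tol is assumed)
    return fit + np.log(n) / n * n_jumps

# Toy example on a chain graph with four nodes and an intercept-only design.
H = np.array([[1.0, -1.0, 0.0, 0.0],
              [0.0, 1.0, -1.0, 0.0],
              [0.0, 0.0, 1.0, -1.0]])
y = np.array([1.2, 0.8, 3.2, 2.8])
X = np.ones((4, 1))
tau = np.full(4, 0.5)
beta_two = np.array([[1.0], [1.0], [3.0], [3.0]])   # two clusters, one jump
beta_one = np.full((4, 1), 2.0)                     # one cluster, no jump
bic_two = bic(y, X, beta_two, tau, H)
bic_one = bic(y, X, beta_one, tau, H)
```

In this toy case the two-cluster fit pays for one jump but reduces the check loss enough to obtain the smaller BIC.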

2.1 |. Notations

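Throughout the paper, $\|A\|_{op}$ represents the operator norm of a matrix $A$, that is, the maximum singular value of $A$, and $\lambda_{\max}(B)$ is the maximum eigenvalue of a symmetric matrix $B$. We write $a \lesssim b$ if $a \le C_1 b$ for some positive constant $C_1$, $a \asymp b$ if $a \lesssim b$ and $b \lesssim a$, and $a \vee b$ and $a \wedge b$ to denote $\max(a,b)$ and $\min(a,b)$, respectively. For a vector $x$, let $\text{support}(x) = \{j : x_j \ne 0\}$ be the index set of nonzero entries of $x$, and let $\|x\|_0 = |\text{support}(x)|$ be the cardinality of $\text{support}(x)$. For a vector $x$ and an index set $S$, let $x_S$ be the subvector of $x$ with components in $S$. For a matrix $M = (M_{ij})$, let $\|M\|_F = (\sum_{ij} M_{ij}^2)^{1/2}$, $\|M\|_1 = \sum_{ij} |M_{ij}|$, $\|M\|_{\max} = \max_{i,j} |M_{ij}|$, $|||M|||_1 = \max_j \sum_i |M_{ij}|$, and $|||M|||_\infty = \max_i \sum_j |M_{ij}|$. For a random sample $t_1, \ldots, t_n$, let $E_n[t_i] := n^{-1} \sum_{i=1}^{n} t_i$.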

2.2 |. Theoretical properties

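For easier presentation, we introduce a new parameter $\theta$, which is a reparametrization of $\beta_j(t_i,\tau_i)$ for $i = 1, \ldots, n$ and $j = 1, \ldots, p$. Suppose that the KNN graph $G$ has $L \ge 1$ connected components, say $G_1, \ldots, G_L$, where the subgraph $G_l$ has a node set $V_l$ and an edge set $E_l$ with $|V_l| = n_l$, $\cup_{l=1}^{L} V_l = \{1, \ldots, n\}$, and $V_l \cap V_{\tilde l} = \emptyset$ for $l \ne \tilde l$. Note that given $K$ and $(t_i,\tau_i)$, $i = 1, \ldots, n$, the number of connected components $L$ and the graph $G$ are determined. For example, if $n$ is a square number and $(t_i,\tau_i)$, $i = 1, \ldots, n$, are $\sqrt{n} \times \sqrt{n}$ rectangular grid points in $[0,1] \times \Delta$, then the KNN graph $G$ with $K \ge 4$ is itself connected, that is, $L = 1$. By rearranging sample indices, let the $V_l$ be increasingly ordered sets, that is, $V_1 = \{1, \ldots, n_1\}$, $V_2 = \{n_1 + 1, \ldots, n_1 + n_2\}$, $\ldots$, $V_L = \{\sum_{l=1}^{L} n_l - n_L + 1, \ldots, \sum_{l=1}^{L} n_l\}$. We can write $H = \text{Block}(H_1, \ldots, H_L) \in \mathbb{R}^{|E| \times n}$ as a block diagonal matrix consisting of the $H_l$, where the rows of $H_l$ correspond to the edges of the $l$-th connected component $G_l$, so that $H_l \in \mathbb{R}^{|E_l| \times n_l}$. Define $\tilde H_l = [1_{n_l}/\sqrt{n_l}, H_l^\top]^\top \in \mathbb{R}^{(|E_l|+1) \times n_l}$, where $1_{n_l}$ is the $n_l$-dimensional vector of all ones. We can see that $\tilde H_l^\top \tilde H_l = H_l^\top H_l + \frac{1}{n_l} 1_{n_l} 1_{n_l}^\top$, where $H_l^\top H_l$ is the Laplacian matrix (graph Laplacian) of the component $G_l$. Because $G_l$ is connected, the null space of $H_l^\top H_l$ is spanned by $1_{n_l}$, on which the added rank-one term is strictly positive. Thus, $\tilde H_l^\top \tilde H_l \in \mathbb{R}^{n_l \times n_l}$ is invertible.

Without loss of generality, we write

$$\tilde H_l = \begin{bmatrix} \tilde H_l^{(1)} \\ \tilde H_l^{(2)} \end{bmatrix}, \quad \text{where } \tilde H_l^{(1)} \in \mathbb{R}^{n_l \times n_l} \text{ is invertible and } \tilde H_l^{(2)} \in \mathbb{R}^{(|E_l| + 1 - n_l) \times n_l}.$$

This can be achieved by rearranging the rows of $\tilde H_l$ such that the first $n_l$ rows correspond to the edges of the minimum spanning tree of $G_l$ and the vector $1_{n_l}/\sqrt{n_l}$. We can write the parameter $\beta_j$ as $\beta_j = ((\beta^j_{G_1})^\top, \ldots, (\beta^j_{G_L})^\top)^\top$, where $\beta^j_{G_l} = (\beta_j(t_i,\tau_i) : i \in V_l) \in \mathbb{R}^{n_l}$ is the subvector of $\beta_j$ corresponding to the nodes (indices) in the graph $G_l$. Then, $\beta^j_{G_l}$ can be rewritten using a new parameter $\theta^j_{G_l} \in \mathbb{R}^{n_l}$ as $\beta^j_{G_l} = [\tilde H_l^{(1)}]^{-1} \theta^j_{G_l}$. Let $\theta_j = ((\theta^j_{G_1})^\top, \ldots, (\theta^j_{G_L})^\top)^\top \in \mathbb{R}^n$ for $j = 1, \ldots, p$, and $\theta = (\theta_1^\top, \ldots, \theta_p^\top)^\top \in \mathbb{R}^{np}$.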

Then, the problem (1) can be rewritten as

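$$\min_{\theta \in \mathbb{R}^{np}} \frac{1}{n} \sum_{i=1}^{n} \rho_{\tau_i}(y_i - \tilde x_i^\top \theta) + \lambda \sum_{j=1}^{p} \|\theta_{j,B_1}\|_1 + \lambda \sum_{j=1}^{p} \|\ddot H \theta_j\|_1, \tag{3}$$

where $\ddot H$ is defined in Section A of the Supplementary material, $\theta_{j,B_1} = [(\theta^j_{G_1,B_1})^\top, \ldots, (\theta^j_{G_L,B_1})^\top]^\top \in \mathbb{R}^{n-L}$, and the $\theta^j_{G_l,B_1}$ are defined in the Supplementary material. From the estimate $\hat\theta$ of (3), we can obtain the estimate $\hat\beta_j = [(\hat\beta^j_{G_1})^\top, \ldots, (\hat\beta^j_{G_L})^\top]^\top$, where $\hat\beta^j_{G_l} = [\tilde H_l^{(1)}]^{-1} \hat\theta^j_{G_l}$.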
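The augmented incidence matrix and the reparametrization from $\beta$ to $\theta$ can be checked numerically on a small connected graph. The sketch below uses a path graph on four nodes, chosen purely for illustration because it is its own spanning tree, so that the augmented matrix is square and plays the role of the invertible block; all variable names are hypothetical.

```python
import numpy as np

# Path graph on 4 nodes: connected, with edges (0,1), (1,2), (2,3).
H = np.array([[1.0, -1.0, 0.0, 0.0],
              [0.0, 1.0, -1.0, 0.0],
              [0.0, 0.0, 1.0, -1.0]])
n_l = H.shape[1]
ones_row = np.ones((1, n_l)) / np.sqrt(n_l)
H_tilde = np.vstack([ones_row, H])            # normalized ones vector stacked on H

laplacian = H.T @ H                           # graph Laplacian of the component
gram = H_tilde.T @ H_tilde                    # Laplacian plus (1/n_l) * 1 1^T
rank_laplacian = np.linalg.matrix_rank(laplacian)
rank_gram = np.linalg.matrix_rank(gram)

# Reparametrization: theta = H_tilde beta, beta = H_tilde^{-1} theta.
beta = np.array([1.0, 1.0, 3.0, 3.0])         # two clusters
theta = H_tilde @ beta                        # scaled total, then edge differences
beta_back = np.linalg.solve(H_tilde, theta)
```

For the two-cluster vector, $\theta$ has a single nonzero edge difference, which illustrates how clustering in $\beta$ becomes ordinary sparsity in $\theta$.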

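Let $\theta^o = [(\theta_1^o)^\top, \ldots, (\theta_p^o)^\top]^\top \in \mathbb{R}^{np}$, $\theta_j^o \in \mathbb{R}^n$, and $\theta^o_{j,B_1} \in \mathbb{R}^{n-L}$ be the underlying vectors defined analogously to $\theta$, $\theta_j$, and $\theta_{j,B_1}$, respectively, which are functions of the $\beta_j^o(t_i,\tau_i)$. Let $S_{1j} := \text{support}(\theta^o_{j,B_1}) = \{l : (\theta^o_{j,B_1})_l \ne 0\}$, $S_{2j} := \text{support}(\ddot H \theta_j^o) = \{l : (\ddot H \theta_j^o)_l \ne 0\}$, $s_{1j} = |S_{1j}|$, $s_{2j} = |S_{2j}|$, $s_1 := \sum_j s_{1j}$, and $s_2 := \sum_j s_{2j}$. The underlying clustered pattern for each $j$ is explained by the support set $S_j := S_{1j} \cup S_{2j}$, indicating which edge-connected differences in $\beta_j^o$ are nonzero. For example, if $S_j = \emptyset$, then $\beta_j^o(t_i,\tau_i) = \beta_j^o(t_k,\tau_k)$ for $(i,k) \in E$. Let $S_{1U} = \cup_j S_{1j}$, $S_{2U} = \cup_j S_{2j}$, $S := S_{1U} \cup S_{2U}$, and $A := S_{1U} \cup B_2$, where $B_2$ is the index set corresponding to the vector $1_{n_l}/\sqrt{n_l}$. Let $s(j) = |S_j|$ and $s = \sum_{j=1}^{p} s(j) = s_1 + s_2$. We estimate $S_j$ by $\hat S_j = \text{support}(H\hat\beta_j)$ and $S$ by $\hat S = \cup_j \hat S_j$. To facilitate theoretical analyses, we impose the following conditions.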

Assumption 1.

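For any $\tau \in \Delta$ and $1 \le i \le n$, the conditional density function of the random error $\epsilon_i(\tau)$ at the $\tau$th quantile level, that is, $f_i^{(\tau)}(x)$, has a continuous derivative $(f_i^{(\tau)})'(x)$ and satisfies $\sup_i \sup_x f_i^{(\tau)}(x) \le \bar f$ and $\sup_i |(f_i^{(\tau)})'(x)| \le \bar f'$ for $|x| \le c_1$, for some positive constants $c_1$, $\bar f$, and $\bar f'$. Moreover, $\inf_i f_i^{(\tau)}(0) \ge \underline{f}$ for some positive constant $\underline{f}$.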

Assumption 2.

Define the following restricted set:

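$$\mathcal{C}_{A,S_{2U}} := \Big\{\delta = (\delta_1^\top, \ldots, \delta_p^\top)^\top \in \mathbb{R}^{np} : \delta_j \in \mathbb{R}^n,\ \|\delta_{A^c}\|_1 + \sum_{j=1}^{p} \|(\ddot H \delta_j)_{S_{2j}^c}\|_1 \le 3\Big(\|\delta_A\|_1 + \sum_{j=1}^{p} \|(\ddot H \delta_j)_{S_{2j}}\|_1\Big)\Big\}.$$

It holds that the design matrix $\tilde X$ satisfies

$$\kappa = \inf_{\delta \in \mathcal{C}_{A,S_{2U}},\, \delta \ne 0} \frac{E_n[(\tilde x_i^\top \delta)^2]}{\|\delta\|^2} > 0, \qquad q := \frac{3 \underline{f}^{3/2}}{8 \bar f'} \inf_{\delta \in \mathcal{C}_{A,S_{2U}},\, \delta \ne 0} \frac{E_n[(\tilde x_i^\top \delta)^2]^{3/2}}{E_n[|\tilde x_i^\top \delta|^3]} > 0.$$

Assumption 3.

The minimum nonzero signal difference in $\beta_j^o$ is greater than the order of $O_p(\sqrt{s \log(np)/n})$, that is, $\min_{l \in S} |(H\beta_j^o)_l| \gg \sqrt{s \log(np)/n}$.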

Assumption 4.

Let Q(θ)=E1ni=1nρτiyix˜iθ and M=012Q(θ+t(θˆθ))dtRnp×np. We assume |M¯(S1U)c,S1UM¯S1U,S1U1|11 and |H2|1|T¨|1, where H2 and T¨ are defined in the proof of Theorem 2 in Section C of the Supplementary material.

Assumption 1 is a common assumption used in quantile regression literature,24,30,31 which imposes only mild assumptions on the conditional density of the response variable given covariates, not imposing any normality or homoscedasticity assumptions. The first part in Assumption 2 is the restricted eigenvalue (RE) condition, which is analogous to the assumptions in the existing literature.30,32,33 The second part in Assumption 2 is the restricted nonlinear impact (RNI) condition,24,30 which controls the quality of minoration of the quantile regression objective function by a quadratic function over the restricted set. Assumption 3 is a beta-min type condition, which imposes a lower bound of the nonzero signal differences. Assumption 4 is a irrepresentable type condition,24,34,35 which restricts correlations among covariates.

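The mean squared error (MSE) of $\hat\theta$ is defined as $\|\hat\theta - \theta^o\|_n^2 := n^{-1} \|\hat\theta - \theta^o\|^2$. The following theorem presents an upper bound on the MSE of $\hat\theta$. All the technical proofs are deferred to Section C of the Supplementary material.

Theorem 1.

Suppose that Assumptions 1 and 2 hold. If $\lambda \asymp \sqrt{\log(pn)/n}$, then $\hat\theta$ satisfies $\|\hat\theta - \theta^o\|_n^2 \lesssim s \log(pn)/n^2$.

Theorem 1 implies that the MSE of $\hat\theta$ decreases asymptotically to zero as $n \to \infty$, provided that $s$ grows at the rate $o(n^2/\log(pn))$. Note that such a growth rate for $s$ is satisfied when $|\{j \in \{1, \ldots, p\} : s(j) \ne 0\}| = o(n/\log(pn))$, for example, when many $\beta_j^o$ are constant functions. Because $\|\hat\beta_j - \beta_j\|^2 = (\hat\theta_j - \theta_j)^\top H^\top H (\hat\theta_j - \theta_j)$, where $H$ is defined in Section A of the Supplementary material, Theorem 1 also gives the MSE of the $\hat\beta_j$ as follows:

$$\sum_{j=1}^{p} \|\hat\beta_j - \beta_j\|_n^2 := \sum_{j=1}^{p} \|\hat\beta_j - \beta_j\|^2 / n \le \lambda_{\max}(H^\top H) \sum_{j=1}^{p} \|\hat\theta_j - \theta_j\|^2 / n \lesssim \lambda_{\max}(H^\top H)\, s \log(pn)/n^2.$$

If $\lambda_{\max}(H^\top H) = O_p(n)$, the MSE rate is bounded by $s \log(pn)/n$, which is within logarithmic factors of the oracle rate that can be obtained with known $S$.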

The following theorem shows that RQF detects the underlying true set S.

Theorem 2.

Suppose that Assumptions 1, 2, 3, and 4 holds. If λlog(pn)/nands1nlog(pn), then

$$P(\hat S = S) \to 1.$$

Theorem 2 implies that RQF can detect the underlying subclusters in the graph $G$ constructed from the points $(t_i,\tau_i)$, $i = 1, \ldots, n$, as long as any nodes in the same underlying subcluster are connected in the graph $G$.

2.3 |. ADMM algorithm

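The optimization of (1) can be computed using the ADMM algorithm. Let $\beta_{ji} = \beta_j(t_i,\tau_i)$, $\beta_j = (\beta_{j1}, \ldots, \beta_{jn})^\top \in \mathbb{R}^n$, and $\beta = (\beta_1^\top, \ldots, \beta_p^\top)^\top \in \mathbb{R}^{np}$. Then, we can rewrite the optimization as follows by introducing auxiliary variables $z_j = (z_{j1}, \ldots, z_{jn})^\top \in \mathbb{R}^n$ and $z = (z_1^\top, \ldots, z_p^\top)^\top \in \mathbb{R}^{np}$,

$$\min_{\beta, z \in \mathbb{R}^{np}} \sum_{i=1}^{n} \rho_{\tau_i}\Big(y_i - \sum_{j=1}^{p} x_{ij} \beta_{ji}\Big) + \lambda \sum_{j=1}^{p} \|H z_j\|_1, \quad \text{where } z_j = \beta_j \text{ for } j = 1, \ldots, p.$$

By the augmented Lagrangian method, we consider the following

$$\sum_{i=1}^{n} \rho_{\tau_i}\Big(y_i - \sum_{j=1}^{p} x_{ij} \beta_{ji}\Big) + \lambda \sum_{j=1}^{p} \|H z_j\|_1 + \frac{\eta}{2} \sum_{j} \|\beta_j - z_j + u_j\|^2, \tag{4}$$

where $u_j = (u_{j1}, \ldots, u_{jn})^\top \in \mathbb{R}^n$ are the dual variables and $\eta > 0$ is a step-size parameter. In (4), we need to update $\beta$, $z$, and the $u_j$. Updating $\beta$ and $z$ requires some mathematical derivations, but the dual variables $u_j$ are simply updated according to the updated $\beta$ and $z$. Let $F(\beta, z, u)$ be the objective function of (4). We iteratively solve (4) as follows:

$$\beta^{(t)} = \arg\min_{\beta} F(\beta, z^{(t-1)}, u^{(t-1)}), \tag{5}$$
$$z^{(t)} = \arg\min_{z} F(\beta^{(t)}, z, u^{(t-1)}), \tag{6}$$
$$u_j^{(t)} = u_j^{(t-1)} + \eta(\beta_j^{(t)} - z_j^{(t)}) \quad \text{for } j = 1, \ldots, p.$$

For each step, we omit the superscripts $(t)$ and $(t-1)$ whenever this does not cause any confusion.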

Update β

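For each $i = 1, \ldots, n$, define $\beta_i := (\beta_{1i}, \ldots, \beta_{pi})^\top$, $z_i := (z_{1i}, \ldots, z_{pi})^\top$, and $u_i := (u_{1i}, \ldots, u_{pi})^\top$ as $p$-dimensional vectors. For each $i = 1, \ldots, n$, solving (5) is equivalent to solving the following:

$$\min_{\beta_i} \rho_{\tau_i}(y_i - x_i^\top \beta_i) + \frac{\eta}{2} \|\beta_i - z_i + u_i\|^2.$$

By the Karush-Kuhn-Tucker (KKT) conditions, the minimizer $\hat\beta_i$ must satisfy

$$-v_i x_i + \eta(\hat\beta_i - z_i + u_i) = 0 \quad \text{or} \quad \hat\beta_i = z_i - u_i + v_i x_i / \eta,$$

where

$$v_i = \begin{cases} \tau_i - 1 & \text{if } y_i < x_i^\top \hat\beta_i, \\ \tau_i & \text{if } y_i > x_i^\top \hat\beta_i, \end{cases} \qquad v_i \in [\tau_i - 1, \tau_i] \ \text{ if } y_i = x_i^\top \hat\beta_i.$$

Suppose that $y_i = x_i^\top \hat\beta_i$. Then, it must hold that

$$-\tau_i x_i^\top x_i / \eta \le x_i^\top (z_i - u_i) - y_i = -v_i x_i^\top x_i / \eta \le (1 - \tau_i) x_i^\top x_i / \eta,$$

which implies that if $x_i^\top (z_i - u_i) - y_i \in [-\tau_i x_i^\top x_i / \eta, (1 - \tau_i) x_i^\top x_i / \eta]$, then $\hat\beta_i = z_i - u_i + \hat v_i x_i / \eta$, where $\hat v_i = -\eta \{x_i^\top (z_i - u_i) - y_i\} / (x_i^\top x_i)$. Similarly, we can consider the remaining cases and obtain the following updates:

$$\hat\beta_i = \begin{cases} z_i - u_i + (\tau_i - 1) x_i / \eta & \text{if } x_i^\top (z_i - u_i) - y_i > (1 - \tau_i) x_i^\top x_i / \eta, \\ z_i - u_i + \tau_i x_i / \eta & \text{if } x_i^\top (z_i - u_i) - y_i < -\tau_i x_i^\top x_i / \eta, \\ z_i - u_i + \hat v_i x_i / \eta & \text{else.} \end{cases}$$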

This can be efficiently computed via a parallel implementation.
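The three-case update for $\hat\beta_i$ can be written as a short function and checked numerically. The sketch below is illustrative only: `beta_update` and `objective` are hypothetical names, and the check at the end simply confirms that random perturbations of the closed-form solution never decrease the per-sample objective, as expected for the minimizer of a strongly convex function.

```python
import numpy as np

def check_loss(r, tau):
    """Quantile check loss rho_tau(r) = r * (tau - 1{r < 0})."""
    return r * np.where(r < 0, tau - 1.0, tau)

def objective(b, y_i, x_i, z_i, u_i, tau_i, eta):
    """Per-sample objective rho_tau(y - x'b) + (eta/2) * ||b - z + u||^2."""
    return check_loss(y_i - x_i @ b, tau_i) + 0.5 * eta * np.sum((b - z_i + u_i) ** 2)

def beta_update(y_i, x_i, z_i, u_i, tau_i, eta):
    """Closed-form minimiser of the per-sample objective."""
    centre = z_i - u_i
    gap = x_i @ centre - y_i              # x_i'(z_i - u_i) - y_i
    scale = x_i @ x_i / eta               # x_i'x_i / eta
    if gap > (1.0 - tau_i) * scale:
        v = tau_i - 1.0                   # fitted value ends up above y_i
    elif gap < -tau_i * scale:
        v = tau_i                         # fitted value ends up below y_i
    else:
        v = -gap / scale                  # v_hat: fitted value equals y_i
    return centre + v * x_i / eta

# Numerical check: perturbing the closed-form solution never lowers the objective.
rng = np.random.default_rng(1)
worst = np.inf
for _ in range(200):
    x, z, u = rng.normal(size=(3, 3))
    y, tau, eta = rng.normal(), rng.uniform(0.05, 0.95), rng.uniform(0.5, 2.0)
    b = beta_update(y, x, z, u, tau, eta)
    for _ in range(20):
        b_alt = b + 0.1 * rng.normal(size=3)
        diff = objective(b_alt, y, x, z, u, tau, eta) - objective(b, y, x, z, u, tau, eta)
        worst = min(worst, diff)
```

Because the update for sample $i$ uses only $(y_i, x_i, z_i, u_i, \tau_i)$, the $n$ calls are independent of each other, which is what makes a parallel implementation possible.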

Update z

For each $j = 1, \ldots, p$, recall that $z_j = (z_{j1}, \ldots, z_{jn})^\top$. Then, for each $j = 1, \ldots, p$, solving (6) is equivalent to solving the following:

$$\min_{z_j} \frac{\eta}{2} \|z_j - \beta_j - u_j\|^2 + \lambda \|H z_j\|_1,$$

which corresponds to the KNN-fused Lasso25 and can be solved by the parametric max-flow algorithm.25,26,36 This also allows parallel implementation.

3 |. SIMULATION STUDIES

In this section, we consider simulation studies to illustrate the performance of RQF. We set $K = 5$ for sufficient information and efficient computation, as suggested in Padilla et al25 and Ye and Padilla.26 To examine the performance of RQF in capturing patterns under various scenarios, we design a case in which the underlying VC coefficients have different clustered or varying patterns. For comparison, we consider the sieve estimation method using B-splines (Bspline), which is a common nonparametric approach. Specifically, Bspline approximates $\beta_j(t,\tau)$ by using bivariate B-spline functions $\Pi(\tau,t) = [\pi_1(\tau,t), \ldots, \pi_{(m_n+2)^2}(\tau,t)]^\top$, where $\pi_k(\tau,t)$ is the product of two normalized B-spline basis functions of order 2 with $m_n$ quasi-uniform knots over the region $\Delta$ and $T$, that is, $\beta_j(\tau,t) \approx B_j^\top \Pi(\tau,t)$, where $B_j$ is estimated via the composite quantile regression framework23 and $m_n$ is chosen using the Bayesian information criterion (BIC).23

We consider the following varying random coefficient model:

$$Y_i = 1 - 7X_{i2} + \{10 + 2\sin(2\pi t_i)\} X_{i3} + (3 + 2U_i) X_{i4} + 5 \times 1\{U_i > 0.5\} X_{i5} + 5 \times 1\{t_i > 0.5\} X_{i6} + 3(U_i + t_i) X_{i7},$$

where $d$ is the number of covariates, $t_i, X_{i3}, X_{i6}, X_{i7} \sim \text{uniform}(0,1)$, $X_{i2}, X_{i8}, \ldots, X_{id} \sim N(0,1)$, $X_{i4}, X_{i5}$ are from the Bernoulli distribution with probability 1/2, and $U_i \sim \text{uniform}(0,1)$ is introduced to consider a random coefficient model. Accordingly, the underlying quantile coefficient functions, given the index $t$ and the quantile level $\tau$, are

$$\beta_1(t,\tau) = 1, \quad \beta_2(t,\tau) = -7, \quad \beta_3(t,\tau) = 10 + 2\sin(2\pi t), \quad \beta_4(t,\tau) = 3 + 2\tau,$$
$$\beta_5(t,\tau) = 5 \times 1\{\tau > 0.5\}, \quad \beta_6(t,\tau) = 5 \times 1\{t > 0.5\}, \quad \beta_7(t,\tau) = 3t + 3\tau, \quad \beta_8(t,\tau) = \cdots = \beta_d(t,\tau) = 0.$$

We use $n = 5{,}000$ and $d \in \{9, 25\}$ in the implementation, and the quantile levels $\tau_i$ are generated i.i.d. from the uniform distribution on [0.05, 0.95].
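The simulation design can be reproduced with a few lines of NumPy. This is a sketch rather than the authors' code: `simulate` and `true_beta` are hypothetical names, the first covariate is taken to be the intercept, and the coefficient of $X_{i6}$ is taken as $5 \times 1\{t_i > 0.5\}$ so that it matches $\beta_6(t,\tau)$ as listed above. Since every covariate that multiplies a $U_i$-dependent coefficient is nonnegative, the response is nondecreasing in $U_i$, so the fraction of responses falling below the $\tau$-th coefficient surface should be close to $\tau$.

```python
import numpy as np

def true_beta(t, tau, d):
    """n x d array of the quantile coefficient functions beta_j(t, tau)."""
    B = np.zeros((len(t), d))
    B[:, 0] = 1.0
    B[:, 1] = -7.0
    B[:, 2] = 10.0 + 2.0 * np.sin(2.0 * np.pi * t)
    B[:, 3] = 3.0 + 2.0 * tau
    B[:, 4] = 5.0 * (tau > 0.5)
    B[:, 5] = 5.0 * (t > 0.5)
    B[:, 6] = 3.0 * t + 3.0 * tau
    return B                                       # columns 8..d stay at zero

def simulate(n, d, rng):
    """Draw (y, X, t, U) from the varying random coefficient model."""
    t = rng.uniform(0, 1, n)
    U = rng.uniform(0, 1, n)
    X = rng.normal(size=(n, d))                    # X_2, X_8, ..., X_d ~ N(0, 1)
    X[:, 0] = 1.0                                  # X_1: intercept (assumed)
    X[:, [2, 5, 6]] = rng.uniform(0, 1, (n, 3))    # X_3, X_6, X_7 ~ uniform(0, 1)
    X[:, [3, 4]] = rng.integers(0, 2, (n, 2))      # X_4, X_5 ~ Bernoulli(1/2)
    y = (X * true_beta(t, U, d)).sum(axis=1)       # random coefficients beta_j(t_i, U_i)
    return y, X, t, U

rng = np.random.default_rng(2)
y, X, t, U = simulate(5000, 9, rng)
q30 = (X * true_beta(t, 0.3, 9)).sum(axis=1)       # conditional 0.3-quantile of y
coverage = np.mean(y <= q30)                       # should be close to 0.3
```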

Figure 1 depicts the underlying coefficient functions $\beta_j(t,\tau)$ for $j = 1, \ldots, 9$. We observe that $\beta_3(t,\tau)$, $\beta_4(t,\tau)$, and $\beta_7(t,\tau)$ have smoothly varying patterns; $\beta_5(t,\tau)$ and $\beta_6(t,\tau)$ are clustered with respect to the quantile $\tau$ and the time $t$, respectively; $\beta_1(t,\tau)$ and $\beta_2(t,\tau)$ are positive and negative constants, respectively; and $\beta_8(t,\tau)$ and $\beta_9(t,\tau)$ are zero. Figures 2 and 3 show the estimated coefficient functions $\hat\beta_j(t,\tau)$ obtained by RQF from a particular simulation when $d = 9$ and $d = 25$, respectively. Figures 4 and 5 show the estimated coefficient functions obtained by Bspline from a particular simulation when $d = 9$ and $d = 25$, respectively. We can observe that the overall patterns of RQF are highly consistent with the true regression coefficients, as shown in Figure 1. It successfully captures the underlying clustered patterns and smoothly varying patterns in the regression coefficients and also detects the abrupt changes across the boundaries of adjacent clusters. However, the estimates from Bspline are quite noisy, with artificially abrupt changes in coefficient values in some parts of the domain.

FIGURE 1. Underlying coefficient function $\beta_j(t,\tau)$ for $j = 1, \ldots, 9$. The x-axis and y-axis represent the index $T \in (0,1)$ and the quantile level $\tau \in (0.05, 0.95)$, respectively.

FIGURE 2. Estimated coefficient function $\hat\beta_j(t,\tau)$ for $j = 1, \ldots, 9$, produced from a particular simulation when $d = 9$.

FIGURE 3. Estimated coefficient function $\hat\beta_j(t,\tau)$ for $j = 1, \ldots, 9$, produced from a particular simulation with $d = 25$.

FIGURE 4. An example of the estimated coefficient function of the B-spline methods with $d = 9$.

FIGURE 5. An example of the estimated coefficient function of the B-spline methods with $d = 25$.

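We further examine the performance of RQF in terms of parameter estimation using $d \in \{9, 25\}$. As a performance measure, we consider the mean squared error of estimation (MSE), defined as

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} \{\hat\beta_j(t_i,\tau_i) - \beta_j^o(t_i,\tau_i)\}^2.$$

For an index $j \in \{1, 2, 5, 6, 8, 9\}$, each of the underlying coefficient functions $\beta_j^o$ has a clustered structure. Specifically, $\beta_1^o, \beta_2^o, \beta_8^o, \beta_9^o$ take a single value, i.e., have only one cluster, and $\beta_5^o, \beta_6^o$ have two subclusters. On the other hand, the coefficient functions with index $j \in \{3, 4, 7\}$ have smoothly varying structures. For an index $j \in \{1, 2, 5, 6, 8, 9\}$ involving clustered structures, we measure the structural consistency performance when $d \in \{9, 25\}$. Let $S_j = \{(i,i') : \beta_j^o(t_i,\tau_i) \ne \beta_j^o(t_{i'},\tau_{i'})\}$ and $\hat S_j = \{(i,i') : \hat\beta_j(t_i,\tau_i) \ne \hat\beta_j(t_{i'},\tau_{i'})\}$. If $|S_j| \ne 0$, i.e., $j \in \{5, 6\}$, we consider the Precision and Recall for each $j$, defined as

$$\text{Precision} = |S_j \cap \hat S_j| / |\hat S_j| \quad \text{and} \quad \text{Recall} = |S_j \cap \hat S_j| / |S_j|.$$

On the other hand, if $|S_j| = 0$, i.e., $j \in \{1, 2, 8, 9\}$, we consider the true negative rate (TNR) and negative predictive value (NPV), defined as

$$\text{TNR} = |S_j^c \cap \hat S_j^c| / |S_j^c| \quad \text{and} \quad \text{NPV} = |S_j^c \cap \hat S_j^c| / |\hat S_j^c|.$$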
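The four structural measures can be computed by comparing all pairs of nodes. The sketch below is an illustration under assumptions: `pair_differs` and `structure_metrics` are hypothetical names, pairs are enumerated once with $i < i'$, and two values are treated as different when they differ by more than a tolerance `tol`, which the text does not specify.

```python
import numpy as np

def pair_differs(values, tol=1e-6):
    """Boolean vector over pairs i < i': True where the two values differ."""
    i, k = np.triu_indices(len(values), k=1)
    return np.abs(values[i] - values[k]) > tol

def structure_metrics(beta_true, beta_hat, tol=1e-6):
    """Precision/Recall if the truth has differing pairs, else TNR/NPV."""
    s = pair_differs(beta_true, tol)          # S_j
    s_hat = pair_differs(beta_hat, tol)       # S_j hat
    if s.any():
        tp = (s & s_hat).sum()
        return {"precision": tp / s_hat.sum(), "recall": tp / s.sum()}
    tn = (~s & ~s_hat).sum()
    return {"tnr": tn / (~s).sum(), "npv": tn / (~s_hat).sum()}

# Two true clusters; the estimate splits the second cluster by mistake.
m_clustered = structure_metrics(np.array([0.0, 0.0, 5.0, 5.0]),
                                np.array([0.0, 0.0, 5.0, 4.9]))
# Constant truth; the estimate is slightly off at one node.
m_constant = structure_metrics(np.array([1.0, 1.0, 1.0]),
                               np.array([1.0, 1.0, 1.5]))
```

In the constant example the NPV equals 1 even though the TNR is only 1/3, because every pair estimated as equal is truly equal while two truly equal pairs are estimated as different; this is the same qualitative pattern as a noisy estimator that rarely returns exactly equal values.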

As shown in Table 1, all the Precision, Recall, TNR, and NPV values for RQF are close to 1, which implies that RQF has high structural consistency for each covariate $j$. Even with the larger $d$, RQF gives robust results. See Section B of the Supplementary material for detailed results. On the other hand, Bspline performs poorly in TNR and Precision because it tends to yield more false positives. This implies that Bspline does not capture the underlying structures of the model well. The difference in performance between Bspline and our proposed KNN-based Lasso method mainly stems from the fused Lasso term in our proposed method. This term shrinks the differences between neighboring points $(t_i,\tau_i)$ and $(t_j,\tau_j)$ if they are close. This shrinkage effect cannot be easily implemented in the Bspline method and, to our knowledge, is not considered in the existing literature.

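TABLE 1. Mean precision, recall, TNR, and NPV for RQF and Bspline over 100 simulations, under $d \in \{9, 25\}$.

| Method | Coefficient | Precision (d = 9) | Recall (d = 9) | Precision (d = 25) | Recall (d = 25) |
|---|---|---|---|---|---|
| RQF | β5 | 0.946 | 0.952 | 0.923 | 0.918 |
| RQF | β6 | 0.952 | 0.953 | 0.921 | 0.913 |
| Bspline | β5 | 0.504 | 0.991 | 0.521 | 0.971 |
| Bspline | β6 | 0.510 | 0.994 | 0.531 | 0.981 |

| Method | Coefficient | TNR (d = 9) | NPV (d = 9) | TNR (d = 25) | NPV (d = 25) |
|---|---|---|---|---|---|
| RQF | β1 | 0.996 | 0.995 | 0.945 | 0.952 |
| RQF | β2 | 0.991 | 0.994 | 0.939 | 0.941 |
| RQF | β8 | 0.985 | 0.983 | 0.941 | 0.938 |
| RQF | β9 | 0.989 | 0.990 | 0.950 | 0.941 |
| Bspline | β1 | 0.009 | 1.000 | 0.011 | 1.000 |
| Bspline | β2 | 0.012 | 1.000 | 0.015 | 1.000 |
| Bspline | β8 | 0.018 | 1.000 | 0.023 | 1.000 |
| Bspline | β9 | 0.012 | 1.000 | 0.024 | 1.000 |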

We also investigate the sensitivity of the proposed method with respect to the choice of the quantile levels $\tau_i$. Using the same $t_i$ as used for $\hat\beta$, we consider 500 different choices of the $\tau_i$ generated from uniform(0.05, 0.95) and obtain estimates $\tilde\beta_j^{(b)}$ for $b = 1, \ldots, 500$ to compare with $\hat\beta_j$ when $d = 9$ and $d = 25$, respectively. Then, we record the mean squared difference (MSD) between $\hat\beta_j$ and $\tilde\beta_j^{(b)}$ at the 9,000 fixed points $(t_k, \tau_l)$, where $t_k = 0.01k$ for $k = 1, \ldots, 100$ and $\tau_l = 0.05 + 0.01l$ for $l = 1, \ldots, 90$, defined as

$$\text{MSD}_b = \frac{1}{9000p} \sum_{k=1}^{100} \sum_{l=1}^{90} \sum_{j=1}^{p} \{\hat\beta_j(t_k,\tau_l) - \tilde\beta_j^{(b)}(t_k,\tau_l)\}^2,$$

where the $\hat\beta_j(t_k,\tau_l)$ and $\tilde\beta_j^{(b)}(t_k,\tau_l)$ are computed by (2). Figure B2 in the Supplementary material shows the boxplots of MSD obtained from 500 different quantile choices. We observe that most MSD values are close to 0, which implies that the obtained coefficients are not highly sensitive to the choices of the $\tau_i$.
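The evaluation grid and the MSD can be set up as follows. This is a sketch with hypothetical names; the two coefficient arrays are synthetic stand-ins for estimates evaluated on the grid, used only to show the bookkeeping.

```python
import numpy as np

t_grid = 0.01 * np.arange(1, 101)             # t_k = 0.01 k, k = 1, ..., 100
tau_grid = 0.05 + 0.01 * np.arange(1, 91)     # tau_l = 0.05 + 0.01 l, l = 1, ..., 90
T, TAU = np.meshgrid(t_grid, tau_grid, indexing="ij")   # 100 x 90 = 9,000 points

def msd(beta_a, beta_b):
    """Mean squared difference between two (100, 90, p) coefficient arrays."""
    return np.mean((beta_a - beta_b) ** 2)    # averages over 9000 * p entries

# Synthetic stand-ins for two sets of estimates evaluated on the grid (p = 2).
beta_a = np.stack([3.0 * T + 3.0 * TAU, 5.0 * (TAU > 0.5)], axis=-1)
beta_b = beta_a + 0.1
msd_ab = msd(beta_a, beta_b)
```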

Let $\hat\beta_j^{(K)}$ be the proposed estimate using $K$ nearest neighbors in the estimation. To perform the sensitivity analysis of the proposed method with respect to the choice of $K$, we compute the MSD between $\hat\beta_j^{(K)}$ and $\hat\beta_j^{(\tilde K)}$ at the fixed points $(t_k, \tau_l)$, where $t_k = 0.01k$ $(k = 1, \ldots, 100)$ and $\tau_l = 0.01l + 0.05$ $(l = 1, \ldots, 90)$. That is,

$$\text{MSD}(K, \tilde K) = \frac{1}{9000p} \sum_{k=1}^{100} \sum_{l=1}^{90} \sum_{j=1}^{p} \{\hat\beta_j^{(K)}(t_k,\tau_l) - \hat\beta_j^{(\tilde K)}(t_k,\tau_l)\}^2.$$

Figure B3 in the Supplementary material presents the heatmap of $\text{MSD}(K, \tilde K)$ with $K, \tilde K \in \{2, \ldots, 12\}$ when $d = 9$ and $d = 25$. The proposed method does not seem to be heavily impacted by the choice of $K$.

An alternative is to consider multiple predetermined quantile levels $\tau$ for each $t_i$ instead of a single, randomly chosen quantile level. To demonstrate this in our simulation, we used a setting with $d = 9$ and selected a predetermined grid of quantile levels for each $t_i$. This resulted in the points $(t_i, \tau_1), \ldots, (t_i, \tau_{19})$ for each $i$, where $\tau_k = 0.05k$ for $k = 1, \ldots, 19$. As a result, a total of $n \times 19 = 5{,}000 \times 19 = 95{,}000$ points were used to generate the 2D plot depicted in Figure B4 of the Supplementary material. It is worth noting that using multiple predetermined quantile levels for each $t_i$ necessitates the estimation of more parameters, thereby increasing computational time compared to the proposed method that uses randomly selected quantile levels. This latter approach employs only a single, randomly chosen quantile level, denoted as $\tau_i$, for each $t_i$, but still manages to yield similar patterns in the 2D plot. However, the advantage of predetermined quantile levels is their ability to focus on specific quantile regions. If there is particular interest in these regions, more quantile levels can be preselected specifically for those areas.

4 |. EMPIRICAL ILLUSTRATIONS

4.1 |. Time-varying and heteroscedastic effects of risk factors on BMI

Body mass index (BMI), which is a measure of body fat based on height and weight, has been shown to be associated with many health status indicators.37–39 While there has been much interest in developing statistical methods for measuring time-varying effects of risk factors on BMI, there have been few studies from the quantile perspective. Moreover, classical local quantile approaches are not well-suited for visualizing the relationship between BMI and covariates as age-quantile functions, unlike regional quantile approaches.

In this section, we utilized the RQF to investigate the relationship between risk factors and different sublevels of the BMI distribution among women, and to examine the potential variation of this relationship across age. Furthermore, we explored the health disparity between non-Hispanic Whites (NHW) and non-Hispanic Blacks (NHB) by examining the interaction between risk factors and racial groups. The data for our analysis were obtained from the 2011–2018 National Health and Nutrition Examination Survey (NHANES) dataset, and we considered 12 covariates identified in the literature as potential risk factors for BMI. These covariates include physical condition factors such as dietary fiber, sodium level, vitamin A, vitamin C, vitamin D, and zinc; socioeconomic factors such as education (1 if college, 0 otherwise), occupation (1 if yes and 0 if no), insurance (1 if yes and 0 if no), and poverty-income ratio (PIR) which ranges between 0 (lowest income level) and 1 (highest income level); and demographic information such as marital status (1 if married, 0 otherwise) and race (1 if NHB and 0 if NHW). After removing subjects with missing variables, a total of 4,119 women were available for our analysis, with 2,664 NHW and 1,455 NHB.

The following varying-coefficient (VC) model was used to fit the data:

$$y_i = \beta_0(T,\tau) + \sum_{j=1}^{12} x_{j,i}\, \beta_j(T,\tau) + \sum_{j=2}^{12} x_{1,i} x_{j,i}\, \eta_j(T,\tau) + \epsilon_i(\tau), \tag{7}$$

where $y_i$ represents the BMI, $T$ represents age, and $x_{1,i}$ represents the race variable, while the $\epsilon_i(\tau)$ are independent errors satisfying $P(\epsilon_i(\tau) \le 0 \mid x_i, T_i) = \tau$. Here, $\beta_j(T,\tau)$ represents a quantile coefficient function for the $j$-th explanatory variable, and $\eta_j(T,\tau)$ represents a quantile coefficient function for the interaction between the $j$-th explanatory variable and the race variable.

To re-scale the age variable, we set $0 \le T \le 1$, where 0 corresponds to 20 years old and 1 corresponds to 80 years old or more. We normalized each continuous variable such that its mean and standard deviation were 0 and 1, respectively. To choose $\lambda$, we used BIC as described in Section 2.
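The preprocessing described above can be sketched as follows. The helper names are hypothetical, a linear map from age 20 to age 80 is assumed for the rescaling, and the population standard deviation is assumed for the normalization, since the text does not state which convention was used.

```python
import numpy as np

def rescale_age(age, low=20.0, high=80.0):
    """Map age to [0, 1]: `low` maps to 0 and `high` or older maps to 1."""
    scaled = (np.asarray(age, dtype=float) - low) / (high - low)
    return np.clip(scaled, 0.0, 1.0)

def standardize(x):
    """Centre to mean 0 and scale to (population) standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

T_index = rescale_age([20, 50, 80, 85])            # -> 0, 0.5, 1, 1
fiber_std = standardize([10.0, 15.0, 20.0, 35.0])  # toy values for one covariate
```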

Figure 6 displays the estimated functional coefficients, where the x-axis and y-axis represent the age index $T \in (0,1)$ and the BMI quantile level $\tau \in (0.1, 0.9)$, respectively. The two-dimensional graphs describe how the coefficients vary with age and the quantile level. Overall, most variables, except sodium level, zinc, and race, showed a negative association with BMI. The effect of education on BMI was more pronounced in the younger age group compared to the older group. In contrast, having insurance had a more constant effect across different ages, while a heteroscedastic association was observed over different BMI quantiles. For example, having insurance tended to be associated with attenuated BMI for individuals with medium or lower BMI across all ages. A higher income (i.e., higher PIR) was associated with a lower BMI, and as an individual gets older, the effect seems to be stronger at the upper level of the BMI distribution. NHB women appeared to have a higher BMI than NHW women across all ages.

FIGURE 6. The estimated quantile coefficient functions using the BMI data. The x-axis and y-axis represent the age index $T \in (0,1)$ and the quantile level $\tau \in (0.1, 0.9)$, respectively. On top of the figures, ‘xr’ represents an interaction between a risk factor ‘x’ and the ‘race.’

The disparity in BMI between NHW and NHB women can be inferred by observing the interaction between race and risk factors. The positive effect sizes observed in the interaction plots suggest that the impact of risk factors on BMI levels is greater in NHB women than in NHW women. Although most interactions did not show a heteroscedastic effect, the interaction between sodium and race appeared to vary with age.

4.2 |. Time-varying and heteroscedastic effects of risk factors on LDL cholesterol

Low-density lipoprotein (LDL) cholesterol is associated with various health problems such as cardiovascular diseases and stroke in adults.40–42 In this study, we use the RQF framework to model the association between risk factors and LDL cholesterol levels (measured in mg/dL), where LDL is the health outcome of interest. Previous research has shown that LDL cholesterol levels tend to be associated with age. For instance, Ferrara et al43 observed that LDL tends to increase in younger and middle-aged adults and to decrease in people older than 65 years.

Covariates included in our model were race (NHW and NHB), PIR (0–5), sex (1 if female and 0 if male), age (18–80+), alcohol (average number of alcoholic drinks/day during the past 12 months), smoking status (1 if smoked at least 100 cigarettes in life, 0 otherwise), physical activity (continuous), BMI (continuous), energy intake (tkcal), total saturated fatty acids (tsfat, gm), alcohol intake (talco, gm), sodium intake (tsodi, mg), total monounsaturated fatty acids (tmfat, gm), and total polyunsaturated fatty acids (tpfat, gm). In addition, we included the interaction term between race and each of these covariates. The LDL data were also obtained from the 2011–2018 NHANES, and a total of 3,386 subjects were included in the analysis after excluding those with missing variables.

Similar to the BMI study, we considered the following VC model for LDL cholesterol:

$$y_i = \beta_0(T,\tau) + \sum_{j=1}^{14} x_{j,i}\,\beta_j(T,\tau) + \sum_{j=2}^{14} x_{1,i}\,x_{j,i}\,\eta_j(T,\tau) + \epsilon_i(\tau), \tag{8}$$

where $y_i$ is an LDL cholesterol level, $T$ is age, $x_{1,i}$ represents race, and the $\epsilon_i(\tau)$ are independent errors satisfying $P\{\epsilon_i(\tau) \le 0 \mid x_i, T_i\} = \tau$. Here, $\beta_j(T,\tau)$ represents a quantile varying-coefficient function for the $j$-th explanatory variable, whereas $\eta_j(T,\tau)$ represents a quantile coefficient function for the interaction between the $j$-th covariate and the race variable. We re-scaled the age variable to be $0 \le T \le 1$, where 0 corresponds to 18 years old and 1 to 80+ years old.
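To make the structure of the linear predictor concrete, the sketch below assembles a design matrix with an intercept, the main effects, and the race-by-covariate interactions. It assumes that the race indicator is stored in the first column; the function name `build_design` and the toy data are hypothetical and are not taken from the authors' software.

```python
import numpy as np

def build_design(X):
    """Return [1, x_1..x_p, x_1*x_2..x_1*x_p], where column 0 of X is race (assumption)."""
    X = np.asarray(X, dtype=float)
    n_obs = X.shape[0]
    intercept = np.ones((n_obs, 1))
    race = X[:, [0]]                 # keep as a column for broadcasting
    interactions = race * X[:, 1:]   # race times every other covariate
    return np.hstack([intercept, X, interactions])

# Toy data: two subjects, race plus two standardized covariates.
X = np.array([[1.0, 0.3, -1.2],
              [0.0, 1.1, 0.4]])
D = build_design(X)
print(D.shape)
```

With 14 covariates, this layout has 1 + 14 + 13 = 28 columns, each paired with its own coefficient function of $(T,\tau)$. For a subject whose race indicator is 0, every interaction column is zero, so the $\eta_j$ describe how the effects differ for the group coded 1.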

The results are presented in Figure 7. Most covariates, including race, smoking, insurance, talco, and tsodi, showed homogeneous associations with LDL levels. NHB had higher LDL cholesterol levels than NHW across all quantiles and ages. On the other hand, heteroscedastic effects were observed in BMI, sex, PA, and tsfat. For BMI, its impact on LDL varied with age, having a greater impact on younger individuals than older ones. The effect size of sex on LDL increased with higher LDL levels. Figure 7 also suggests that interactions between race and other risk factors, such as BMI, sex, PIR, smoking, alcohol, insurance, physical activity, energy intake (tkcal), tsfat, and tsodi, are prevalent. For example, higher BMI was associated with larger increases in LDL levels for NHB compared to NHW. Increased physical activity was associated with larger reductions in LDL levels for NHB than for NHW, suggesting that NHB may benefit more than NHW from increased physical activity in lowering LDL cholesterol levels.

FIGURE 7.


The estimated quantile coefficient functions using the LDL data. The x-axis and y-axis represent the age index $T \in (0,1)$ and the quantile level $\tau \in (0.1,0.9)$, respectively. On top of the figures, ‘xr’ represents an interaction between the risk factor ‘x’ and the ‘race.’

Our RQF model allows us to understand how risk factors influence health outcomes by incorporating their varying effects over age and quantiles. The findings on the impact of risk factors and their interaction with race on health outcomes of interest, namely BMI and LDL cholesterol levels, provide important insights for the field of public health and health disparity research.

We note that NHANES uses a complex, multistage probability sampling design, and therefore, to avoid biased estimation, the sampling weight should be used. However, our model was not specifically designed for population-based survey data, and thus, we did not incorporate the sampling weights in our analysis. As a result, caution is needed in interpreting our results.

5 |. CONCLUSION

We have shown that the proposed regional quantile regression approach in varying-coefficient models, known as RQF, can provide valuable insights into health outcome studies. The RQF method is demonstrated to yield a consistent and locally adaptive estimate in various settings. It provides consistent estimation when the number of nonconstant coefficient functions βjo is bounded by the rate of n/log(pn) and the number of different edge-connected coefficient values in the minimum spanning trees is bounded by the rate of nlog(pn), that is, s1nlog(pn). Additionally, our approach can capture underlying smoothly varying patterns as well as cluster structures in varying-coefficients of health risks. This is because RQF employs the fused Lasso penalty function to encourage similarity in coefficients between adjacent locations when both quantile levels and the index variable are used as distance metrics in the KNN graph.
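The role of the KNN graph in the fused Lasso penalty can be illustrated with a small sketch. It builds a symmetrized K-nearest-neighbor graph over points $(T,\tau)$ and sums absolute coefficient differences across its edges. The grid, the Euclidean distance without rescaling, the choice $k=2$, and the helper names `knn_edges` and `fused_penalty` are assumptions made for illustration; the graph construction used in the paper may differ.

```python
import numpy as np

def knn_edges(points, k):
    """Undirected edge list linking each point to its k nearest neighbours."""
    points = np.asarray(points, dtype=float)
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    # Stable sort so that ties on the regular grid are broken by index.
    neighbours = np.argsort(dist, axis=1, kind="stable")[:, :k]
    edges = set()
    for u in range(points.shape[0]):
        for v in neighbours[u]:
            edges.add((min(u, int(v)), max(u, int(v))))
    return sorted(edges)

def fused_penalty(theta, edges):
    """Sum of |theta_u - theta_v| over the graph edges."""
    theta = np.asarray(theta, dtype=float)
    return float(sum(abs(theta[u] - theta[v]) for u, v in edges))

# Nodes are (age index, quantile level) pairs on a small grid.
T_grid = np.linspace(0.0, 1.0, 5)
tau_grid = np.array([0.25, 0.5, 0.75])
nodes = np.array([(t, q) for t in T_grid for q in tau_grid])
edges = knn_edges(nodes, k=2)

constant = np.ones(len(nodes))                      # no variation at all
clustered = np.where(nodes[:, 0] < 0.5, 0.0, 1.0)   # one jump in age

print(fused_penalty(constant, edges), fused_penalty(clustered, edges))
```

A coefficient surface that is constant over the graph incurs no penalty, while a surface with a single jump is charged only for the edges that cross the jump. This is the mechanism that favors piecewise-constant, clustered estimates.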

By adapting the ADMM framework to the proposed RQF approach as shown in (1), RQF is easy to implement and each step of the algorithm can be parallelized, leading to a scalable computation model. Our simulation results demonstrate that RQF can detect underlying cluster structures and smoothly varying patterns better than standard nonparametric methods that use B-splines. Analysis of the BMI and cholesterol studies reveals heteroscedastic associations between the covariates and different quantile regions of the health outcome distribution, which cannot be captured by standard VC regression approaches. Based on our theoretical investigation, we conjecture that the RQF estimation obtained in our data analyses is consistent, and the detected cluster structures are close to the truth under some regularity conditions.
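For readers unfamiliar with ADMM, the sketch below shows the general splitting idea on a simpler problem. It replaces the quantile check loss with a squared-error loss and uses a chain graph, so it is not the algorithm in (1). The function names, the step size `rho`, the iteration count, and the toy data are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal map of kappa * |.|, applied elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_fused_lasso(y, edges, lam, rho=1.0, n_iter=2000):
    """Minimise 0.5 * ||y - theta||^2 + lam * sum_{(u,v)} |theta_u - theta_v|."""
    y = np.asarray(y, dtype=float)
    n, m = y.size, len(edges)
    D = np.zeros((m, n))  # edge-incidence matrix: row = theta_u - theta_v
    for row, (u, v) in enumerate(edges):
        D[row, u], D[row, v] = 1.0, -1.0
    A = np.eye(n) + rho * D.T @ D
    z = np.zeros(m)      # auxiliary copy of D @ theta
    dual = np.zeros(m)   # scaled dual variable
    theta = y.copy()
    for _ in range(n_iter):
        theta = np.linalg.solve(A, y + rho * D.T @ (z - dual))  # smooth step
        z = soft_threshold(D @ theta + dual, lam / rho)         # L1 step
        dual = dual + D @ theta - z                             # dual ascent
    return theta

# Toy signal with two level sets on a chain graph.
y = np.array([0.1, -0.1, 0.0, 1.1, 0.9, 1.0])
chain = [(i, i + 1) for i in range(len(y) - 1)]
theta_hat = admm_fused_lasso(y, chain, lam=0.3)
print(np.round(theta_hat, 3))
```

For this toy input, the optimality conditions give a two-level solution with values 0.1 and 0.9, and the iterates should approach it. The three updates are a linear solve, an elementwise threshold, and a dual update. In a quantile version, the linear solve would be replaced by a step suited to the check loss.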

There is room for exploring further variants of the RQF approach. For instance, if variable selection is a primary concern, one could consider adding an extra Lasso penalty for dimension reduction in RQF. However, our empirical studies suggest that RQF performs variable selection naturally to some extent, as it detects both underlying cluster structures and dynamic patterns in each coefficient. Another potential future direction for RQF is allowing cluster structures across different varying-coefficients simultaneously. For example, if prior knowledge suggests that some varying-coefficients share similar varying patterns or cluster structures, this information could be utilized in the estimation procedure by adding extra penalty functions that penalize the differences of those varying-coefficients. Furthermore, considering that KNN is a nonparametric estimation method, future studies could benefit from exploring the combination of other nonparametric methods, like B-splines, with a fused lasso penalty term. We plan to report on this work elsewhere.

Supplementary Material


Footnotes

SUPPORTING INFORMATION

Additional supporting information can be found online in the Supporting Information section at the end of this article.

DATA AVAILABILITY STATEMENT

The data utilized in this study were obtained from the National Health and Nutrition Examination Survey (NHANES), which is conducted by the National Center for Health Statistics (NCHS) of the Centers for Disease Control and Prevention (CDC). The NHANES data are publicly available and can be accessed through the CDC website (https://www.cdc.gov/nchs/nhanes/index.htm). The code used to generate the results in this study is available at the following GitHub repository: https://github.com/younghhk/software/tree/master/KNN.

REFERENCES

  • 1. Beydoun MA, Wang Y. Gender-ethnic disparity in BMI and waist circumference distribution shifts in US adults. Obesity. 2009;17(1):169–176.
  • 2. Grandner MA, Williams NJ, Knutson KL, Roberts D, Jean-Louis G. Sleep disparity, race/ethnicity, and socioeconomic position. Sleep Med. 2016;18:7–18.
  • 3. Stewart SH, Silverstein MD. Racial and ethnic disparity in blood pressure and cholesterol measurement. J Gen Intern Med. 2002;17(6):405–411.
  • 4. Rossner S. Obesity: the disease of the twenty-first century. Int J Obes (Lond). 2002;26:S2–S4.
  • 5. Nuttall FQ. Body mass index. Nutr Today. 2015;50(3):117–128.
  • 6. Huang J, Ma S, Xie H, Zhang CH. A group bridge approach for variable selection. Biometrika. 2009;96(2):339–355.
  • 7. Rehkopf DH, Laraia BA, Segal M, Braithwaite D, Epel I. The relative importance of predictors of body mass index change, overweight and obesity in adolescent girls. Int J Pediatr Obes. 2011;6(3):233–242.
  • 8. Gao J, Peng B, Ren Z, Zhang X. Variable selection for a categorical varying-coefficient model with identifications for determinants of body mass index. Ann Appl Stat. 2017;11(2):1117–1145.
  • 9. Fan J, Zhang W. Statistical estimation in varying coefficient models. Ann Stat. 1999;27(5):1491–1518.
  • 10. Hastie TJ, Tibshirani RJ. Varying-coefficient models. J R Stat Soc Ser B Methodol. 1993;55:757–796.
  • 11. Li Q, Ouyang D, Racine JS. Categorical semiparametric varying-coefficient models. J Appl Economet. 2013;28(4):551–579.
  • 12. Li Q, Racine JS. Smooth varying-coefficient estimation and inference for qualitative and quantitative data. Econ Theory. 2010;26(6):1607–1637.
  • 13. Wang H, Xia Y. Shrinkage estimation of the varying coefficient model. J Am Stat Assoc. 2009;104(486):747–757.
  • 14. Cleveland WS, Grosse E, Shyu WM. Local Regression Models, Statistical Models in S. Routledge, New York; 1992.
  • 15. Ma S, Song PXK. Varying index coefficient models. J Am Stat Assoc. 2015;110(509):341–356.
  • 16. Wang L, Li H, Huang JZ. Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. J Am Stat Assoc. 2008;103(484):1556–1569.
  • 17. Kim MO. Quantile regression with varying coefficients. Ann Stat. 2007;35(1):92–108.
  • 18. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B Methodol. 2006;68:49–67.
  • 19. Wei F, Huang J, Li H. Variable selection and estimation in high-dimensional varying-coefficient models. Stat Sin. 2011;21:1515–1540.
  • 20. Xue L, Qu A. Variable selection in high-dimensional varying-coefficient models with global optimality. J Mach Learn Res. 2012;13:1973–1998.
  • 21. Klopp O, Pensky M. Sparse high-dimensional varying coefficient model: non-asymptotic minimax study. Ann Stat. 2015;43:1273–1299.
  • 22. Honda T, Ing CK, Wu WY. Adaptively weighted group lasso for semiparametric quantile regression models. Bernoulli. 2019;25(4B):3311–3338.
  • 23. Park S, Lee E. Hypothesis testing of varying coefficients for regional quantiles. Comput Stat Data Anal. 2021;159:107204.
  • 24. Zheng Q, Peng L, He X. Globally adaptive quantile regression with ultra-high dimensional data. Ann Stat. 2015;43:2225–2258.
  • 25. Padilla OHM, Sharpnack J, Chen Y, Witten DM. Adaptive nonparametric regression with the k-nearest neighbour fused lasso. Biometrika. 2020;107(2):293–310.
  • 26. Ye SS, Padilla OHM. Non-parametric quantile regression via the K-NN fused lasso. J Mach Learn Res. 2021;22(111):1–38.
  • 27. Li F, Sang H. Spatial homogeneity pursuit of regression coefficients for large datasets. J Am Stat Assoc. 2019;114(527):1050–1062.
  • 28. Yang CC, Chen YH, Chang HY. Composite marginal quantile regression analysis for longitudinal adolescent body mass index data. Stat Med. 2017;36(21):3380–3397.
  • 29. Frumento P, Bottai M. Parametric modeling of quantile regression coefficient functions. Biometrics. 2016;72(1):74–84.
  • 30. Belloni A, Chernozhukov V. ℓ1-penalized quantile regression in high-dimensional sparse models. Ann Stat. 2011;39(1):82–130.
  • 31. Park S, He X. Hypothesis testing for regional quantiles. J Stat Plan Inference. 2017;191:13–24.
  • 32. Bickel PJ, Ritov Y, Tsybakov AB. Simultaneous analysis of lasso and Dantzig selector. Ann Stat. 2009;37:1705–1732.
  • 33. Candes E, Tao T. The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat. 2007;35:2313–2351.
  • 34. Meinshausen N, Bühlmann P. High dimensional graphs and variable selection with the lasso. Ann Stat. 2006;34(3):1436–1462.
  • 35. Zhao P, Yu B. On model selection consistency of lasso. J Mach Learn Res. 2006;7:2541–2567.
  • 36. Chambolle A, Darbon J. On total variation minimization and surface evolution using parametric maximum flows. Int J Comput Vis. 2009;84(3):288–307.
  • 37. Weisell RC. Body mass index as an indicator of obesity. Asia Pac J Clin Nutr. 2002;11:S681–S684.
  • 38. Dobbelsteyn C, Joffres M, MacLean DR, Flowerdew G. A comparative evaluation of waist circumference, waist-to-hip ratio and body mass index as indicators of cardiovascular risk factors. The Canadian Heart Health Surveys. Int J Obes (Lond). 2001;25(5):652–661.
  • 39. Vargas PA, Perry TT, Robles E, et al. Relationship of body mass index with asthma indicators in Head Start children. Ann Allergy Asthma Immunol. 2007;99(1):22–28.
  • 40. Amarenco P, Kim JS, Labreuche J, et al. A comparison of two LDL cholesterol targets after ischemic stroke. N Engl J Med. 2020;382(1):9–19.
  • 41. Zeljkovic A, Vekic J, Spasojevic-Kalimanovska V, et al. LDL and HDL subclasses in acute ischemic stroke: prediction of risk and short-term mortality. Atherosclerosis. 2010;210(2):548–554.
  • 42. Valdes-Marquez E, Parish S, Clarke R, et al. Relative effects of LDL-C on ischemic stroke and coronary disease: a Mendelian randomization study. Neurology. 2019;92(11):e1176–e1187.
  • 43. Ferrara A, Barrett-Connor E, Shan J. Total, LDL, and HDL cholesterol decrease with age in older men and women: the Rancho Bernardo study 1984–1994. Circulation. 1997;96(1):37–43.
