Abstract
Health outcomes, such as body mass index and cholesterol levels, are known to be dependent on age and exhibit varying effects with their associated risk factors. In this paper, we propose a novel framework for dynamic modeling of the associations between health outcomes and risk factors using varying-coefficients (VC) regional quantile regression via K-nearest neighbors (KNN) fused Lasso, which captures the time-varying effects of age. The proposed method has strong theoretical properties, including a tight estimation error bound and the ability to detect exact clustered patterns under certain regularity conditions. To efficiently solve the resulting optimization problem, we develop an alternating direction method of multipliers (ADMM) algorithm. Our empirical results demonstrate the efficacy of the proposed method in capturing the complex age-dependent associations between health outcomes and their risk factors.
Keywords: health outcome study, K-nearest neighbors, Lasso, regional quantile regression, varying-coefficients
1 |. INTRODUCTION
In recent years, health outcome research has played a crucial role in identifying disparities among different racial and ethnic groups, enabling policymakers and clinicians to make informed decisions for individuals from diverse socioeconomic backgrounds. Numerous studies have evaluated racial disparities in various health outcomes, such as body mass index (BMI), sleep duration, and cholesterol levels.1–3 BMI, in particular, has been widely used as a health risk indicator in clinical and public health research.4,5 To better understand the factors that drive BMI, researchers from various fields have explored statistical techniques to identify important predictors. For example, Huang et al6 proposed a group bridge approach for selecting risk factors of BMI, while Rehkopf et al7 used a random forest technique to rank risk factors according to their relative importance score. Gao et al8 proposed a variable selection method to identify relevant BMI risk factors, assuming that the impact of these determinants is an unknown function of categorical demographic variables.
The existing literature on BMI has primarily focused on modeling the relationship between risk factors and the mean BMI level, that is, the BMI of an average individual, while ignoring age or time dependency. This approach offers limited insight into the BMI distribution in the population. In this paper, we propose a tailored statistical approach for detecting dynamic and heterogeneous associations between health outcomes and risk factors. Specifically, we adopt a varying-coefficient quantile model framework to explain associations between health outcomes and risk factors in the presence of varying effects of age.
Varying-coefficient (VC) models have gained popularity in both theoretical and practical aspects since their inception.9–14 To deal with high-dimensional variables, VC approaches have incorporated variable selection procedures.8,13,15–17 For instance, Gao et al8 considered the variable selection problem for the categorical varying-coefficient model, based on a penalized approach using group Lasso.18 Nonparametric approaches, based on basis functions, have also been widely used for estimation and variable selection in VC models.16,19–22
We consider a VC regional quantile model in this paper, which is defined as follows:
where for is a quantile interval of interest, is an age index, are covariates, and is an outcome variable. This model framework naturally arises in many real-life applications.23 Analyzing the behavior of VC over a range of quantiles is important in the field of regression analysis. When various quantile levels are of interest, a typical approach is to individually fit a quantile regression and obtain inference at each quantile level, which may result in a loss of estimation efficiency because regressions at adjacent quantile levels are expected to share similar features. In such cases, a regional quantile regression approach23,24 can be a useful alternative and may lead to a more efficient estimation procedure.
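To make the per-quantile fitting mentioned above concrete, the following minimal sketch shows the standard check (pinball) loss that underlies quantile regression, and confirms numerically that minimizing it over a constant recovers a sample quantile. It is a generic illustration rather than the estimator proposed in this paper; the function names `check_loss`, `total_check_loss`, and `best_constant` and the simulated data are assumptions made for the example.

```python
import random


def check_loss(residual, tau):
    """Check (pinball) loss: rho_tau(r) = r * (tau - 1{r < 0})."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def total_check_loss(values, q, tau):
    """Total check loss when the constant q is used to predict every value."""
    return sum(check_loss(v - q, tau) for v in values)


def best_constant(values, tau):
    """Minimize the total check loss over the observed values.

    The objective is convex and piecewise linear in q, so a minimizer
    can always be found among the data points themselves.
    """
    return min(values, key=lambda q: total_check_loss(values, q, tau))


# Hypothetical data: 501 standard normal draws.
random.seed(0)
y = [random.gauss(0.0, 1.0) for _ in range(501)]

for tau in (0.25, 0.50, 0.75):
    # Each quantile level is fitted separately, as in the "local" approach.
    print(tau, round(best_constant(y, tau), 3))
```

Fitting each level separately, as in the loop, ignores the similarity between adjacent quantile levels; sharing information across levels is the motivation for the regional approach.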
Despite the importance of both theoretical and practical aspects, there is a lack of literature on the selection and estimation of clustered patterns in the coefficient function under VC regional quantile settings. Our main objective is to detect regional clustered patterns in the regional quantile regression coefficients, , using K-nearest neighbors (KNN) fused Lasso. The proposed method can identify de-noised clustered patterns between the risk factors and health outcomes, select important determinants of the regional health outcome quantiles (such as the upper level of the BMI distribution), and simultaneously estimate varying coefficients across both age and quantiles of BMI.
Our work is related to Padilla et al,25 who combined fused Lasso with the KNN procedure in a mean-based regression model, and more recently, Ye and Padilla,26 who proposed a nonparametric quantile regression using a KNN-based fused Lasso penalty. However, it is important to note that Ye and Padilla26 modeled the conditional quantile of the response variable as a nonparametric function of covariates at a fixed quantile level. Li and Sang27 considered the spatially clustered varying-coefficient model using the fused Lasso method, but it was based on a linear model and only allowed a simple tree structure in the graph. Yang et al28 adopted the parametric quantile regression method of Frumento and Bottai29 to analyze longitudinal BMI data. These approaches differ from our proposed method, which considers a structured nonparametric model for regional quantiles. The proposed model shares common structural information across adjacent quantile levels and may better detect patterns over both quantiles, , and index, . Our work is also more general since we do not require parametric specifications for the quantile coefficient functions; such specifications are not a trivial assumption.
We summarize the key properties of the proposed method as follows: (1) The proposed approach based on quantile regression yields robust estimates even when the normality assumption is violated; (2) The conditional quantile framework allows us to explore the heteroscedastic relationship between different sublevels of the dependent variable and covariates, which cannot be captured by the standard regression approach; (3) By adopting a nonparametric approach to modeling the covariate effects under the regional quantile VC framework, our method can handle a nonlinear relationship between the dependent variable and covariates; (4) Compared to its local counterpart, the regional quantile approach provides more stable and interpretable results; (5) By leveraging the insight from the quantile KNN fused Lasso,26 our algorithm can detect underlying clustered patterns in the VC functions; and (6) The proposed optimization via the alternating direction method of multipliers (ADMM) algorithm is computationally scalable, since each updating step has a closed-form solution and can be run in parallel.
The remainder of this paper is organized as follows. In Section 2, we present the proposed VC quantile model via KNN fused Lasso method, its theoretical properties, and the ADMM algorithm. In Section 3, we evaluate the finite sample performance of the proposed methods using simulation studies. In Section 4, we apply the proposed method to two health outcome studies. Finally, Section 5 concludes the paper and discusses potential future research questions. Technical proofs, additional simulation results, and figures are provided in the Supplementary material.
2 |. VARYING-COEFFICIENT QUANTILE MODEL VIA KNN FUSED LASSO
In this section, we propose VC regional quantile regression via KNN fused Lasso and study its theoretical properties and an ADMM algorithm for computing it.
In the VC regional quantile model, for units with a covariate vector , a response, , can be modeled as
where is a time index variable, for example, age in our applications, is the underlying coefficient function for the th covariate, is the quantile interval of interest, and the conditional -th quantile of a random error given is zero.
We first construct the KNN graph based on in the domain of coefficient functions ’s, where the quantile level ’s are randomly chosen from , that is, , which is to discretize . Specifically, each , for , corresponds to a node in the graph and its edge set contains the pair for , if and only if is among the K-nearest neighbors of . We propose the regional quantile KNN fused Lasso (RQF) method in varying-coefficients models to estimate the coefficient function as follows:
(1) |
where with for .
Here, is an oriented incidence matrix of the KNN graph , and thus each row of corresponds to an edge .25 Specifically, if the -th edge in connects the -th and -th nodes, then
and . In (1), we consider a single quantile level for each sample to reduce the computational cost. If we instead considered multiple fixed quantile levels , as in composite quantile regression, the computation would be nearly infeasible, and the large sample sizes in our real data examples would make it even more demanding.
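The graph construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the edge set is symmetrized (an edge is kept when either endpoint is among the K nearest neighbors of the other), Euclidean distance is used on the (index, quantile level) pairs, and the helper names `knn_edges` and `incidence_matrix` as well as the six example points are hypothetical.

```python
import math


def knn_edges(points, k):
    """Undirected KNN edge set.

    The pair (i, j) with i < j is included when j is among the k nearest
    neighbors of i, or i is among those of j (symmetrization is an assumption).
    """
    edges = set()
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


def incidence_matrix(edges, n_nodes):
    """Oriented incidence matrix: one row per edge, +1 and -1 at its endpoints."""
    rows = []
    for i, j in edges:
        row = [0] * n_nodes
        row[i], row[j] = 1, -1
        rows.append(row)
    return rows


# Hypothetical (index, quantile level) pairs forming two well-separated groups.
points = [(0.10, 0.20), (0.12, 0.25), (0.15, 0.22),
          (0.80, 0.70), (0.82, 0.75), (0.85, 0.72)]
edges = knn_edges(points, k=2)
D = incidence_matrix(edges, len(points))

# Each row of D, applied to a coefficient vector, gives one edge difference,
# which is the quantity that the fused Lasso penalty shrinks toward zero.
beta = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
differences = [sum(d * b for d, b in zip(row, beta)) for row in D]
print(edges)
print(differences)
```

In this toy example every edge joins two points of the same group, so a coefficient vector that is constant within each group has all edge differences equal to zero and incurs no penalty.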
In (1), the fused Lasso penalty enforces sparsity of the difference in two edge-connected coefficients. This allows the estimation of coefficients with clustered patterns if edge sets are selected appropriately. Using the obtained , we can estimate the value of coefficient corresponding to a new by the averaged estimated values of the KNN as follows:
(2) |
where is the set of KNN of in a training data . Thus, it leads to smooth and locally adaptive VC estimates.
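The averaging rule in (2) can be sketched as below. The sketch assumes Euclidean distance on the (index, quantile level) pairs and treats the fitted coefficient values at the training points as given; `knn_predict` and the example numbers are hypothetical.

```python
import math


def knn_predict(train_points, fitted_values, new_point, k):
    """Average the fitted coefficient values over the k nearest training points."""
    nearest = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], new_point),
    )[:k]
    return sum(fitted_values[i] for i in nearest) / k


# Hypothetical training points (index, quantile level) and fitted coefficients.
train_points = [(0.10, 0.20), (0.12, 0.25), (0.15, 0.22),
                (0.80, 0.70), (0.82, 0.75), (0.85, 0.72)]
fitted_values = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]

# The two nearest training points to (0.11, 0.22) are the first two,
# so the prediction is the average of their fitted values.
print(knn_predict(train_points, fitted_values, (0.11, 0.22), k=2))
```

Because the estimate at a new point only uses nearby training points, a new point deep inside a cluster inherits that cluster's value, while a point near a boundary averages across it.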
Note that the tuning parameter $\lambda$ in (1) controls the number of clusters in the regression coefficients. When $\lambda = 0$, it reduces to the ordinary regional quantile regression; when $\lambda \to \infty$, RQF yields a nearly constant regression coefficient in that for . With an appropriate choice of $\lambda$, RQF produces clustered regression coefficients. In practice, we propose the following BIC to choose $\lambda$:
where represents the number of nonzero values in .
2.1 |. Notations
Throughout the paper, represents an operator norm of a matrix , that is, the maximum singular value of , and is a maximum eigenvalue of a symmetric matrix . We write if for some positive constant if and , and and to denote and , respectively. For a vector , let be the index set of non-zero entries of the vector and is the cardinality of . For a vector and the index set , let be the subvector of with components in . For a matrix , let , and . For a random sample , let .
2.2 |. Theoretical properties
For easier presentation, we introduce a new parameter , which is a reparametrization of for and . Suppose that in the KNN graph , there exist connected components, say , where the subgraph has a node set and an edge set with , and for . Note that given and , the number of connected components and the graph are determined. For example, if is a square number and , are rectangular grid points in , then the KNN graph with is itself connected, that is, . By rearranging sample indices, let ’s be increasingly ordered sets, that is, . We can write as a block diagonal matrix consisting of ’s, where the rows of correspond to the edges of the -th connected group such that . Define , where is the -dimensional vector of all 1’s. We can see that , where represents the Laplacian matrix (also called the graph Laplacian) of the component . Thus, is invertible.
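The link between the incidence matrix and the graph Laplacian used above rests on a standard fact: for an oriented incidence matrix with one row per edge, the product of its transpose with itself equals the degree matrix minus the adjacency matrix. The sketch below checks this on a small hypothetical graph and counts connected components with a union-find; all names and the example edge list are assumptions made for illustration.

```python
def incidence_matrix(edges, n_nodes):
    """Oriented incidence matrix: one row per edge, +1 and -1 at its endpoints."""
    rows = []
    for i, j in edges:
        row = [0] * n_nodes
        row[i], row[j] = 1, -1
        rows.append(row)
    return rows


def laplacian_from_incidence(D):
    """Compute D^T D, which equals the graph Laplacian (degree minus adjacency)."""
    n = len(D[0])
    return [[sum(row[a] * row[b] for row in D) for b in range(n)]
            for a in range(n)]


def count_components(edges, n_nodes):
    """Count connected components with a simple union-find."""
    parent = list(range(n_nodes))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(a) for a in range(n_nodes)})


# Hypothetical graph on 5 nodes with two components: {0, 1, 2} and {3, 4}.
edges = [(0, 1), (1, 2), (3, 4)]
D = incidence_matrix(edges, 5)
L = laplacian_from_incidence(D)

for row in L:
    print(row)
print("components:", count_components(edges, 5))
```

Each row of the Laplacian sums to zero, so the all-ones vector on a component lies in its null space. This is why a Laplacian alone is singular and an additional all-ones row per component is needed to obtain an invertible matrix.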
Without loss of generality, we write
This can be achieved by rearranging the rows of such that the first rows correspond to the edges of the minimum spanning tree of and the vector . We can write the parameter with , where is a subvector of corresponding to the node (index) in the graph . Then, can be rewritten using a new parameter as . Let for , and .
Then, the problem (1) can be rewritten as
(3) |
where is defined in Section A of the Supplementary material, , and ’s are defined in the Supplementary material. From the estimate of (3), we can obtain the estimate , where .
Let , and be the underlying vectors similarly defined as for , and , respectively, which is the function of ’s. Let , and . The underlying clustered pattern for each is explained by the support set , indicating which edge-connected differences in are nonzero. For example, if , then for . Let , and , where is the index set corresponding to the vector . Let and . We estimate by and by . To facilitate theoretical analyses, we impose the following conditions.
Assumption 1.
For any and , the conditional density function of the random error at th quantile level, that is, has a continuous derivative , and satisfies and for for some positive constants , and . Moreover, for some positive constant .
Assumption 2.
Define the following restricted set:
It holds that the design matrix satisfies
Assumption 3.
The minimum nonzero signal difference in is greater than the order of , that is, .
Assumption 4.
Let and . We assume and , where and are defined in the proof of Theorem 2 in Section C of the Supplementary material.
Assumption 1 is a common assumption in the quantile regression literature,24,30,31 which imposes only mild conditions on the conditional density of the response variable given covariates and does not impose any normality or homoscedasticity assumptions. The first part of Assumption 2 is the restricted eigenvalue (RE) condition, which is analogous to the assumptions in the existing literature.30,32,33 The second part of Assumption 2 is the restricted nonlinear impact (RNI) condition,24,30 which controls the quality of minoration of the quantile regression objective function by a quadratic function over the restricted set. Assumption 3 is a beta-min type condition, which imposes a lower bound on the nonzero signal differences. Assumption 4 is an irrepresentable-type condition,24,34,35 which restricts correlations among covariates.
The mean squared error (MSE) of is defined as . The following theorem presents an upper bound on the MSE of . All the technical proofs are deferred to Section C of the Supplementary material.
Theorem 1.
Suppose that Assumptions 1 and 2 hold. If , then .
Theorem 1 implies that the MSE of decreases asymptotically to zero as assuming that grows with a rate as . Note that such a growth rate for is satisfied when , for example, many are constant functions. Because , where is defined in Section A of the Supplementary material, Theorem 1 also gives the MSE of ’s as follows:
If , the MSE rate is bounded by , which is within logarithmic factors of the oracle rate that can be obtained with known .
The following theorem shows that RQF detects the underlying true set .
Theorem 2.
Suppose that Assumptions 1, 2, 3, and 4 hold. If , then
Theorem 2 implies that RQF can detect the underlying subclusters in the graph constructed from the points as long as any nodes in the same underlying subcluster are connected in the graph .
2.3 |. ADMM algorithm
The optimization of (1) can be computed using the ADMM algorithm. Let , and . Then, we can rewrite the optimization as follows by introducing supplementary variables and ,
By the augmented Lagrangian method, we consider the following
(4) |
where are the dual variables and is a step-size parameter. In (4), we need to update , and ’s. Updating and requires some mathematical derivations, but dual variables will be simply updated according to the updated and . Let be the objective function of (4). We iteratively solve (4) as follows:
(5) |
(6) |
For each step, we omit the superscript notations and whenever it does not cause any confusion.
Update
For each , define and as -dimensional vectors. For each , solving (5) is equivalent to solving the following:
By the Karush-Kuhn-Tucker (KKT) conditions, minimizer must satisfy
where
Suppose that . Then, it must hold that
which implies that if , then , where . Similarly, we can consider the remaining cases and obtain the following updates:
This can be efficiently computed via a parallel implementation.
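A closed-form, case-by-case update of this kind is characteristic of the proximal operator of the check loss. The sketch below shows that generic operator, applied independently to each coordinate, and checks it against a brute-force grid search. It is not the paper's exact update, whose symbols are not reproduced here; the names `prox_check_loss` and `objective`, the argument `sigma` (standing for a step-size parameter), and the example inputs are assumptions.

```python
def prox_check_loss(v, tau, sigma):
    """Minimizer over r of rho_tau(r) + (sigma / 2) * (r - v) ** 2.

    Three cases follow from the optimality conditions:
      v >  tau / sigma        -> r = v - tau / sigma        (r > 0)
      v < -(1 - tau) / sigma  -> r = v + (1 - tau) / sigma  (r < 0)
      otherwise               -> r = 0
    """
    if v > tau / sigma:
        return v - tau / sigma
    if v < -(1.0 - tau) / sigma:
        return v + (1.0 - tau) / sigma
    return 0.0


def objective(r, v, tau, sigma):
    """Check loss plus quadratic proximity term, used only for verification."""
    check = r * (tau - (1.0 if r < 0 else 0.0))
    return check + 0.5 * sigma * (r - v) ** 2


# Each coordinate is updated independently, so the loop below could be
# distributed across workers without any communication.
tau, sigma = 0.25, 2.0
inputs = [-1.5, -0.2, 0.05, 0.6, 2.0]
updates = [prox_check_loss(v, tau, sigma) for v in inputs]
print(updates)

# Brute-force check on a fine grid (step 0.001).
grid = [k / 1000.0 for k in range(-4000, 4001)]
for v, r in zip(inputs, updates):
    best = min(grid, key=lambda g: objective(g, v, tau, sigma))
    print(v, r, best)
```

The asymmetric thresholds reflect the asymmetry of the check loss: positive and negative residuals are shrunk by different amounts depending on the quantile level.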
Update
For each , recall that . Then, for each , solving (6) is equivalent to solving the following:
which corresponds to the KNN-fused Lasso25 and can be solved by the parametric max-flow algorithm.25,26,36 This also allows parallel implementation.
3 |. SIMULATION STUDIES
In this section, we consider simulation studies to illustrate the performance of RQF. We set for sufficient information and efficient computation, as suggested in Padilla et al25 and Ye and Padilla.26 To examine the performance of RQF in capturing patterns under various scenarios, we design the case in which the underlying VC coefficients have different clustered or varying patterns. For comparisons, we consider the sieve estimation method using B-spline (Bspline), which is a common nonparametric approach. Specifically, Bspline approximates by using bivariate B-spline functions , where is the product of two normalized B-spline basis functions of order 2 with quasi-uniform knots over the region and , that is, , where is estimated via the composite quantile regression framework23 and is chosen using the Bayesian information criterion (BIC).23 We consider the following varying random coefficient model:
where is the number of covariates, , , are from the Bernoulli distribution with probability 1/2, and is introduced to consider a random coefficient model. Accordingly, the underlying quantile coefficient functions, given the index and the quantile level , are
We use and in the implementation, and the quantile levels are i.i.d. generated from [0.05, 0.95].
Figure 1 depicts the underlying coefficient functions for . We observe that , , and have smoothly varying patterns; and are clustered with respect to quantile and time, taking positive and negative constant values, respectively; and and are zero. Figures 2 and 3 show the estimated coefficient functions obtained by RQF in a particular simulation when and , respectively. Figures 4 and 5 show the estimated coefficient functions obtained by Bspline in a particular simulation when and , respectively. We observe that the overall patterns of RQF are highly consistent with the true regression coefficients shown in Figure 1. RQF successfully captures the underlying clustered patterns and smoothly varying patterns in the regression coefficients and also detects the abrupt changes across the boundaries of adjacent clusters. In contrast, the estimates from Bspline are quite noisy, with artificially abrupt changes in coefficient values in some parts of the domain.
FIGURE 1.
Underlying coefficient function for . The x-axis and y-axis represent the index and the quantile level , respectively.
FIGURE 2.
Estimated coefficient function for , produced from a particular simulation when .
FIGURE 3.
Estimated coefficient function for , produced from a particular simulation with .
FIGURE 4.
An example of the estimated coefficient function of the B-spline methods with .
FIGURE 5.
An example of the estimated coefficient function of the B-spline methods with .
We further examine the performances of RQF in terms of parameter estimation using . As a performance measure, we consider the mean-squared error of estimation (MSE), defined as
For an index , each of the underlying coefficient functions ’s has clustered structures. Specifically, have single values, i.e., has only one cluster, and have two subclusters. On the other hand, an index has smoothly varying structures. For an index involving clustered structures, we measure the structural consistency performance when . Let and . If , i.e., , we consider the Precision and Recall for each , defined as
On the other hand, if , i.e., , we consider true negative rate (TNR) and negative predictive value (NPV), defined as
As shown in Table 1, all the Precision, Recall, TNR, and NPV values for RQF are close to 1, which implies that RQF has a high structural consistency for each covariate . Even with the large , RQF presents robust results. See Section B of the Supplementary material for detailed results. On the other hand, Bspline demonstrates poor performance in TNR and Precision because Bspline tends to yield more false positives. This implies that Bspline does not capture the underlying structures of the model well. Regarding the inferior performance of the Bspline method compared to our proposed KNN-based Lasso method, we note that the differences in performance mainly stem from the fused Lasso term in our proposed method. This term shrinks differences between neighboring points and if they are close. This shrinkage effect cannot be easily implemented in the Bspline method and, to our knowledge, is not considered in the existing literature.
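The four structural-consistency measures can be computed from edge-level indicators as in the minimal sketch below. It uses the standard confusion-matrix definitions, where a "positive" is an edge whose coefficient difference is nonzero; the function name `structure_metrics` and the example indicator vectors are hypothetical, and a ratio with a zero denominator is reported as `None`.

```python
def structure_metrics(true_nonzero, est_nonzero):
    """Precision, recall, TNR, and NPV for detecting nonzero edge differences.

    Both arguments are sequences of booleans, one entry per edge:
    True means the coefficient difference across that edge is nonzero.
    """
    pairs = list(zip(true_nonzero, est_nonzero))
    tp = sum(1 for t, e in pairs if t and e)
    fp = sum(1 for t, e in pairs if not t and e)
    fn = sum(1 for t, e in pairs if t and not e)
    tn = sum(1 for t, e in pairs if not t and not e)

    def ratio(num, den):
        return num / den if den > 0 else None

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "tnr": ratio(tn, tn + fp),
        "npv": ratio(tn, tn + fn),
    }


# Coefficient with two subclusters: some edges truly cross the boundary.
clustered = structure_metrics(
    true_nonzero=[True, True, False, False, False],
    est_nonzero=[True, False, True, False, False],
)
print(clustered)

# Constant coefficient: no true nonzero differences, but a noisy estimate
# reports many, giving a low TNR while the NPV stays at 1.
constant = structure_metrics(
    true_nonzero=[False] * 5,
    est_nonzero=[True, True, True, True, False],
)
print(constant)
```

The second example mirrors the pattern reported for Bspline on the constant coefficients: with no true nonzero differences there can be no false negatives, so the NPV equals 1 even when nearly every edge is a false positive.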
TABLE 1.
Mean precision, recall, TNR, and NPV for RQF and Bspline over 100 simulations, under .
Method | Coefficient | Precision (d = 9) | Recall (d = 9) | Precision (d = 25) | Recall (d = 25) |
---|---|---|---|---|---|
RQF | β5 | 0.946 | 0.952 | 0.923 | 0.918 |
RQF | β6 | 0.952 | 0.953 | 0.921 | 0.913 |
Bspline | β5 | 0.504 | 0.991 | 0.521 | 0.971 |
Bspline | β6 | 0.510 | 0.994 | 0.531 | 0.981 |

Method | Coefficient | TNR (d = 9) | NPV (d = 9) | TNR (d = 25) | NPV (d = 25) |
---|---|---|---|---|---|
RQF | β1 | 0.996 | 0.995 | 0.945 | 0.952 |
RQF | β2 | 0.991 | 0.994 | 0.939 | 0.941 |
RQF | β8 | 0.985 | 0.983 | 0.941 | 0.938 |
RQF | β9 | 0.989 | 0.990 | 0.950 | 0.941 |
Bspline | β1 | 0.009 | 1.000 | 0.011 | 1.000 |
Bspline | β2 | 0.012 | 1.000 | 0.015 | 1.000 |
Bspline | β8 | 0.018 | 1.000 | 0.023 | 1.000 |
Bspline | β9 | 0.012 | 1.000 | 0.024 | 1.000 |
We also investigate the sensitivity of the proposed method with respect to the choice of quantile levels ’s. Using the same ’s as used in , we consider 500 different choices of ’s generated from uniform(0.05, 0.95) and obtain estimates ’s for to compare with when and , respectively. Then, we record the mean squared difference (MSD) between and at the 9,000 fixed points , where for and for , defined as
where ’s and ’s are computed by (2). Figure B2 in the Supplementary material shows the boxplots of MSD obtained from 500 different quantile choices. We observe that most MSD values are close to 0, which implies that the obtained coefficients are not highly sensitive to the choices of ’s.
Let be the proposed estimate using nearest neighbors in the estimation. To perform the sensitivity analyses of the proposed method for the choice of , we compute the MSD between and at the fixed points , where and . That is,
Figure B3 in the Supplementary material presents the heatmap of with when and . The proposed method does not seem to be heavily impacted by the choice of .
Another alternative is to consider multiple predetermined quantile levels for each instead of a single, randomly chosen quantile level. Specifically, to demonstrate it in our simulation, we used a setting with and selected a predetermined grid of quantile levels for each . This resulted in for each , where for . As a result, a total of points were used to generate the 2D plot depicted in Figure B4 of the Supplementary material. It is worth noting that using multiple predetermined quantile levels for each necessitates the estimation of more parameters, thereby increasing computational time compared to the proposed method that uses randomly selected quantile levels. This latter approach employs only a single, randomly chosen quantile level, denoted as , for each , but still manages to yield similar patterns in the 2D plot. However, the advantage of predetermined quantile levels is their ability to focus on specific quantile regions. If there is particular interest in these regions, more quantile levels can be preselected specifically for those areas.
4 |. EMPIRICAL ILLUSTRATIONS
4.1 |. Time-varying and heteroscedastic effects of risk factors on BMI
Body mass index (BMI), which is a measure of body fat based on height and weight, has been shown to be associated with many health status indicators.37–39 While there has been much interest in developing statistical methods for measuring time-varying effects of risk factors on BMI, there have been few studies from the quantile perspective. Moreover, classical local quantile approaches are not well-suited for visualizing the relationship between BMI and covariates as age-quantile functions, unlike regional quantile approaches.
In this section, we utilized the RQF to investigate the relationship between risk factors and different sublevels of the BMI distribution among women, and to examine the potential variation of this relationship across age. Furthermore, we explored the health disparity between non-Hispanic Whites (NHW) and non-Hispanic Blacks (NHB) by examining the interaction between risk factors and racial groups. The data for our analysis were obtained from the 2011–2018 National Health and Nutrition Examination Survey (NHANES) dataset, and we considered 12 covariates identified in the literature as potential risk factors for BMI. These covariates include physical condition factors such as dietary fiber, sodium level, vitamin A, vitamin C, vitamin D, and zinc; socioeconomic factors such as education (1 if college, 0 otherwise), occupation (1 if yes and 0 if no), insurance (1 if yes and 0 if no), and poverty-income ratio (PIR) which ranges between 0 (lowest income level) and 1 (highest income level); and demographic information such as marital status (1 if married, 0 otherwise) and race (1 if NHB and 0 if NHW). After removing subjects with missing variables, a total of 4,119 women were available for our analysis, with 2,664 NHW and 1,455 NHB.
The following varying-coefficient (VC) model was used to fit the data:
(7) |
where represents the BMI, represents age, and represents the race variable, while ’s are independent errors satisfying . Here, represents a quantile coefficient function for the -th explanatory variable, and represents a quantile coefficient function for the interaction between the -th explanatory variable and the race variable.
To re-scale the age variable, we set , where 0 corresponds to 20 years old and 1 corresponds to 80 years old or more. We normalized each continuous variable such that its mean and standard deviation were 0 and 1, respectively. To choose , we used BIC as described in Section 2.
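The preprocessing described above can be sketched as follows. The sketch assumes a linear map of age onto the unit interval with ages above 80 clipped to 1, and uses the sample standard deviation for normalization; both choices, along with the function names `rescale_age` and `standardize` and the example values, are assumptions made for illustration.

```python
import statistics


def rescale_age(age, low=20.0, high=80.0):
    """Map age linearly to [0, 1]; ages at or above `high` map to 1."""
    clipped = min(max(age, low), high)
    return (clipped - low) / (high - low)


def standardize(values):
    """Center to mean 0 and scale to standard deviation 1.

    The sample standard deviation is used here; the population version
    (statistics.pstdev) would be an equally reasonable choice.
    """
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical ages and a hypothetical continuous covariate.
ages = [20, 35, 50, 80, 85]
print([rescale_age(a) for a in ages])

covariate = [12.0, 18.5, 9.3, 22.1, 15.4]
z = standardize(covariate)
print([round(v, 3) for v in z])
```

Putting the index variable on the unit interval, like the quantile level, keeps the two coordinates on comparable scales when distances are computed for the KNN graph.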
Figure 6 displays the estimated functional coefficients, where the x-axis and y-axis represent the age index and the BMI quantile level , respectively. The two-dimensional graphs describe how the coefficients vary with age and the quantile level. Overall, most variables, except sodium level, zinc, and race, showed a negative association with BMI. The effect of education on BMI was more pronounced in the younger age group than in the older group. In contrast, having insurance had a more constant effect across different ages, while a heteroscedastic association was observed over different BMI quantiles. For example, having insurance tended to be associated with attenuated BMI for individuals with medium or lower BMI across all ages. A higher income (i.e., higher PIR) was associated with a lower BMI, and the effect appears to become stronger with age at the upper level of the BMI distribution. NHB women appeared to have a higher BMI than NHW women across all ages.
FIGURE 6.
The estimated quantile coefficient functions using the BMI data. The x-axis and y-axis represent the age index and the quantile level , respectively. On top of the figures, ‘’ represents an interaction between a risk factor ‘’ and the ‘race.’
The disparity in BMI between NHW and NHB women can be inferred by observing the interaction between race and risk factors. The positive effect sizes observed in the interaction plots suggest that the impact of risk factors on BMI levels is greater in NHB women than in NHW women. Although most interactions did not show a heteroscedastic effect, the interaction between sodium and race appeared to vary with age.
4.2 |. Time-varying and heteroscedastic effects of risk factors on LDL cholesterol
Low-density lipoprotein (LDL) cholesterol is associated with various health problems such as cardiovascular diseases and stroke in adults.40–42 In this study, we use the RQF framework to model the association between risk factors and LDL cholesterol levels (measured in mg/dL), where LDL is the health outcome of interest. Previous research has shown that LDL cholesterol levels tend to be associated with age. For instance, Ferrara et al43 observed that LDL tends to increase in younger and middle-aged adults and decrease in people older than 65 years.
Covariates included in our model were race (NHW and NHB), PIR (0–5), sex (1 if female and 0 if male), age (18–80+), alcohol (average number of alcoholic drinks/day during the past 12 months), smoking status (1 if smoked at least 100 cigarettes in life, 0 otherwise), physical activity (continuous), BMI (continuous), energy intake (tkcal), total saturated fatty acids (tsfat, gm), alcohol intake (talco, gm), sodium intake (tsodi, mg), total monounsaturated fatty acids (tmfat, gm), and total polyunsaturated fatty acids (tpfat, gm). In addition, we included the interaction term between race and each of these covariates. The LDL data were also obtained from the 2011–2018 NHANES, and a total of 3,386 subjects were included in the analysis after excluding those with missing variables.
Similar to the BMI study, we considered the following VC model for LDL cholesterol:
(8) |
where is an LDL cholesterol level, is age, represents race, and ’s are independent errors satisfying . Here, represents a quantile varying-coefficient function for the -th explanatory variable, whereas represents a quantile coefficient function for the interaction between the -th covariate and a race variable. We re-scaled the age variable to be , where 0 corresponds to 18 years old and 1 to 80+ years old.
The results are presented in Figure 7. Most covariates, including race, smoking, insurance, talco, and tsodi, showed homogeneous associations with LDL levels. NHB had higher LDL cholesterol levels than NHW across all quantiles and ages. On the other hand, heteroscedastic effects were observed in BMI, sex, PA, and tsfat. For BMI, its impact on LDL varied with age, having a greater impact on younger individuals than older ones. The effect size of sex on LDL increased with higher LDL levels. Figure 7 also suggests that interactions between race and other risk factors, such as BMI, sex, PIR, smoking, alcohol, insurance, physical activity, energy intake (tkcal), alcohol intake (talco), tsfat, and tsodi, are prevalent. For example, higher BMI was associated with larger increases in LDL levels for NHB compared to NHW. Increased physical activity reduced LDL levels more for NHB than NHW, suggesting that NHB may benefit more from increased physical activity in lowering LDL cholesterol levels than NHW.
FIGURE 7.
The estimated quantile coefficient functions using the LDL data. The x-axis and y-axis represent the age index and the quantile level , respectively. On top of the figures, ‘’ represents an interaction between the risk factor ‘’ and the ‘race.’
Our RQF model allows us to understand how risk factors influence health outcomes by incorporating their varying effects over age and quantiles. The findings on the impact of risk factors and their interaction with race on health outcomes of interest, namely BMI and LDL cholesterol levels, provide important insights for the field of public health and health disparity research.
We note that NHANES uses a complex, multistage probability sampling design, and therefore sampling weights should be used to avoid biased estimation. However, our model was not specifically designed for population-based survey data, and thus we did not incorporate the sampling weights in our analysis. As a result, caution is needed in interpreting our results.
5 |. CONCLUSION
We have shown that the proposed regional quantile regression approach in varying-coefficient models, known as RQF, can provide valuable insights into health outcome studies. The RQF method is demonstrated to yield a consistent and locally adaptive estimate in various settings. It provides consistent estimation when the number of nonconstant coefficient functions is bounded by the rate of and the number of different edge-connected coefficient values in the minimum spanning trees is bounded by the rate of , that is, . Additionally, our approach can capture underlying smoothly varying patterns as well as cluster structures in varying-coefficients of health risks. This is because RQF employs the fused Lasso penalty function to encourage similarity in coefficients between adjacent locations when both quantile levels and the index variable are used as distance metrics in the KNN graph.
By adapting the ADMM framework to the proposed RQF approach as shown in (1), RQF is easy to implement and each step of the algorithm can be parallelized, leading to a scalable computation model. Our simulation results demonstrate that RQF can detect underlying cluster structures and smoothly varying patterns better than standard nonparametric methods that use B-splines. Analysis of the BMI and cholesterol studies reveals heteroscedastic associations between the covariates and different quantile regions of the health outcome distribution, which cannot be captured by standard VC regression approaches. Based on our theoretical investigation, we conjecture that the RQF estimation obtained in our data analyses is consistent, and the detected cluster structures are close to the truth under some regularity conditions.
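As a hedged illustration of why ADMM steps of this kind parallelize well, the sketch below shows two standard elementwise proximal operators that commonly appear in ADMM schemes for quantile loss with L1-type penalties: soft-thresholding and the proximal map of the check loss. These are generic textbook operators written under our own naming (`soft_threshold`, `prox_check_loss`, and the step-size symbol `rho` are assumptions), not a transcription of the updates in our algorithm.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Elementwise argmin_u kappa*|u| + 0.5*(u - v)**2."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def prox_check_loss(v, tau, rho):
    """Elementwise argmin_u rho_tau(u) + (rho/2)*(u - v)**2,
    where rho_tau(u) = u * (tau - 1{u < 0}) is the quantile check loss."""
    # Shift positive values down by tau/rho, negative values up by
    # (1 - tau)/rho, and set everything in between to zero.
    return (np.maximum(v - tau / rho, 0.0)
            + np.minimum(v + (1.0 - tau) / rho, 0.0))

v = np.array([2.0, 0.1, -2.0])
print(soft_threshold(v, kappa=1.0))
print(prox_check_loss(v, tau=0.5, rho=1.0))
```

Both maps act on each coordinate independently, so they can be applied to all entries of a vector at once or distributed across workers.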
There is room for exploring further variants of the RQF approach. For instance, if variable selection is a primary concern, one could add an extra Lasso penalty to RQF for dimension reduction. However, our empirical studies suggest that RQF performs variable selection naturally to some extent, as it detects both underlying cluster structures and dynamic patterns in each coefficient. Another potential direction is to allow cluster structures across different varying coefficients simultaneously. For example, if prior knowledge suggests that some varying coefficients share similar patterns or cluster structures, this information could be used in the estimation procedure by adding penalty functions that penalize the differences between those varying coefficients. Furthermore, because KNN is a nonparametric estimation method, future studies could explore combining other nonparametric methods, such as B-splines, with a fused Lasso penalty term. We plan to report on this work elsewhere.
SUPPORTING INFORMATION
Additional supporting information can be found online in the Supporting Information section at the end of this article.
DATA AVAILABILITY STATEMENT
The data used in this study were obtained from the National Health and Nutrition Examination Survey (NHANES), which is conducted by the National Center for Health Statistics (NCHS) of the Centers for Disease Control and Prevention (CDC). The NHANES data are publicly available and can be accessed through the CDC website (https://www.cdc.gov/nchs/nhanes/index.htm). The code used to generate the results in this study is available on GitHub at https://github.com/younghhk/software/tree/master/KNN.
REFERENCES
1. Beydoun MA, Wang Y. Gender-ethnic disparity in BMI and waist circumference distribution shifts in US adults. Obesity. 2009;17(1):169–176.
2. Grandner MA, Williams NJ, Knutson KL, Roberts D, Jean-Louis G. Sleep disparity, race/ethnicity, and socioeconomic position. Sleep Med. 2016;18:7–18.
3. Stewart SH, Silverstein MD. Racial and ethnic disparity in blood pressure and cholesterol measurement. J Gen Intern Med. 2002;17(6):405–411.
4. Rossner S. Obesity: the disease of the twenty-first century. Int J Obes (Lond). 2002;26:S2–S4.
5. Nuttall FQ. Body mass index. Nutr Today. 2015;50(3):117–128.
6. Huang J, Ma S, Xie H, Zhang CH. A group bridge approach for variable selection. Biometrika. 2009;96(2):339–355.
7. Rehkopf DH, Laraia BA, Segal M, Braithwaite D, Epel I. The relative importance of predictors of body mass index change, overweight and obesity in adolescent girls. Int J Pediatr Obes. 2011;6(3):233–242.
8. Gao J, Peng B, Ren Z, Zhang X. Variable selection for a categorical varying-coefficient model with identifications for determinants of body mass index. Ann Appl Stat. 2017;11(2):1117–1145.
9. Fan J, Zhang W. Statistical estimation in varying coefficient models. Ann Stat. 1999;27(5):1491–1518.
10. Hastie TJ, Tibshirani RJ. Varying-coefficient models. J R Stat Soc Ser B Methodol. 1993;55:757–796.
11. Li Q, Ouyang D, Racine JS. Categorical semiparametric varying-coefficient models. J Appl Economet. 2013;28(4):551–579.
12. Li Q, Racine JS. Smooth varying-coefficient estimation and inference for qualitative and quantitative data. Econ Theory. 2010;26(6):1607–1637.
13. Wang H, Xia Y. Shrinkage estimation of the varying coefficient model. J Am Stat Assoc. 2009;104(486):747–757.
14. Cleveland WS, Grosse E, Shyu WM. Local Regression Models. In: Statistical Models in S. Routledge, New York; 1992.
15. Ma S, Song PXK. Varying index coefficient models. J Am Stat Assoc. 2015;110(509):341–356.
16. Wang L, Li H, Huang JZ. Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. J Am Stat Assoc. 2008;103(484):1556–1569.
17. Kim MO. Quantile regression with varying coefficients. Ann Stat. 2007;35(1):92–108.
18. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B Methodol. 2006;68:49–67.
19. Wei F, Huang J, Li H. Variable selection and estimation in high-dimensional varying-coefficient models. Stat Sin. 2011;21:1515–1540.
20. Xue L, Qu A. Variable selection in high-dimensional varying-coefficient models with global optimality. J Mach Learn Res. 2012;13:1973–1998.
21. Klopp O, Pensky M. Sparse high-dimensional varying coefficient model: non-asymptotic minimax study. Ann Stat. 2015;43:1273–1299.
22. Honda T, Ing CK, Wu WY. Adaptively weighted group lasso for semiparametric quantile regression models. Bernoulli. 2019;25(4B):3311–3338.
23. Park S, Lee E. Hypothesis testing of varying coefficients for regional quantiles. Comput Stat Data Anal. 2021;159:107204.
24. Zheng Q, Peng L, He X. Globally adaptive quantile regression with ultra-high dimensional data. Ann Stat. 2015;43:2225–2258.
25. Padilla OHM, Sharpnack J, Chen Y, Witten DM. Adaptive nonparametric regression with the k-nearest neighbour fused lasso. Biometrika. 2020;107(2):293–310.
26. Ye SS, Padilla OHM. Non-parametric quantile regression via the K-NN fused lasso. J Mach Learn Res. 2021;22(111):1–38.
27. Li F, Sang H. Spatial homogeneity pursuit of regression coefficients for large datasets. J Am Stat Assoc. 2019;114(527):1050–1062.
28. Yang CC, Chen YH, Chang HY. Composite marginal quantile regression analysis for longitudinal adolescent body mass index data. Stat Med. 2017;36(21):3380–3397.
29. Frumento P, Bottai M. Parametric modeling of quantile regression coefficient functions. Biometrics. 2016;72(1):74–84.
30. Belloni A, Chernozhukov V. ℓ1-penalized quantile regression in high-dimensional sparse models. Ann Stat. 2011;39(1):82–130.
31. Park S, He X. Hypothesis testing for regional quantiles. J Stat Plan Inference. 2017;191:13–24.
32. Bickel PJ, Ritov Y, Tsybakov AB. Simultaneous analysis of lasso and Dantzig selector. Ann Stat. 2009;37:1705–1732.
33. Candes E, Tao T. The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat. 2007;35:2313–2351.
34. Meinshausen N, Bühlmann P. High dimensional graphs and variable selection with the lasso. Ann Stat. 2006;34(3):1436–1462.
35. Zhao P, Yu B. On model selection consistency of lasso. J Mach Learn Res. 2006;7:2541–2567.
36. Chambolle A, Darbon J. On total variation minimization and surface evolution using parametric maximum flows. Int J Comput Vis. 2009;84(3):288–307.
37. Weisell RC. Body mass index as an indicator of obesity. Asia Pac J Clin Nutr. 2002;11:S681–S684.
38. Dobbelsteyn C, Joffres M, MacLean DR, Flowerdew G. A comparative evaluation of waist circumference, waist-to-hip ratio and body mass index as indicators of cardiovascular risk factors. The Canadian Heart Health Surveys. Int J Obes (Lond). 2001;25(5):652–661.
39. Vargas PA, Perry TT, Robles E, et al. Relationship of body mass index with asthma indicators in head start children. Ann Allergy Asthma Immunol. 2007;99(1):22–28.
40. Amarenco P, Kim JS, Labreuche J, et al. A comparison of two LDL cholesterol targets after ischemic stroke. N Engl J Med. 2020;382(1):9–19.
41. Zeljkovic A, Vekic J, Spasojevic-Kalimanovska V, et al. LDL and HDL subclasses in acute ischemic stroke: prediction of risk and short-term mortality. Atherosclerosis. 2010;210(2):548–554.
42. Valdes-Marquez E, Parish S, Clarke R, et al. Relative effects of LDL-C on ischemic stroke and coronary disease: a Mendelian randomization study. Neurology. 2019;92(11):e1176–e1187.
43. Ferrara A, Barrett-Connor E, Shan J. Total, LDL, and HDL cholesterol decrease with age in older men and women: the Rancho Bernardo Study 1984–1994. Circulation. 1997;96(1):37–43.