Abstract
Recently, there has been a growing interest in variable selection for causal inference within the context of high-dimensional data. However, when the outcome exhibits a skewed distribution, ensuring the accuracy of variable selection and causal effect estimation might be challenging. Here, we introduce the generalized median adaptive lasso (GMAL) for covariate selection to achieve an accurate estimation of causal effect even when the outcome follows skewed distributions. A distinctive feature of our proposed method is that we utilize a linear median regression model for constructing penalty weights, thereby maintaining the accuracy of variable selection and causal effect estimation even when the outcome presents extremely skewed distributions. Simulation results showed that our proposed method performs comparably to existing methods in variable selection when the outcome follows a symmetric distribution. Besides, the proposed method exhibited obvious superiority over the existing methods when the outcome follows a skewed distribution. Meanwhile, our proposed method consistently outperformed the existing methods in causal estimation, as indicated by smaller root-mean-square error. We also utilized the GMAL method on a deoxyribonucleic acid methylation dataset from the Alzheimer’s disease (AD) neuroimaging initiative database to investigate the association between cerebrospinal fluid tau protein levels and the severity of AD.
Keywords: causal inference, propensity score, observational studies, variable selection
BACKGROUND
Causal inference from observational studies plays an important role in the field of biomedical research. A crucial challenge of causal inference in observational studies is confounding bias due to lack of randomization. In bioinformatics, propensity score (PS) methods are commonly employed for reducing confounding bias and understanding causality [1]. Estimating treatment effects within the framework of PS models is highly sensitive to the covariates for adjustment. Insufficient adjustment for confounders in the PS model leads to biased causal effect estimates [2–4]. Incorporating all confounders is important for unbiased treatment effect estimates. Inclusion of instrumental covariates in addition to confounders may result in a decrease in efficiency, while incorporating prognostic covariates can improve estimation efficiency [2–5]. Striking a balance between bias and efficiency through variable selection is of paramount concern in research [3, 6–8].
With the rapidly expanding data sources, including omics data, it is often the case that there is limited prior knowledge regarding the precise set of confounders. As a result, there has been a growing interest in variable selection for causal inference within the context of high-dimensional data [3, 6–12]. The group LASSO and doubly robust estimation (GLiDeR) method modified the regularization penalty to select confounders and prognostic covariates [3]. The high-dimensional covariate balancing PS (hdCBPS) method estimated the initial PS by maximizing a penalized generalized quasi-likelihood and calibrated the initial PS by balancing covariates selected from the outcome model [10]. Shortreed et al. presented the outcome-adaptive lasso (OAL) method [6]. This approach involves estimating the PS model using adaptive lasso under binary treatment [6]. The tuning parameters were selected depending on the covariate balance between different treatment groups. The OAL method has the capability to select confounders and prognostic covariates, while effectively excluding the instrumental and spurious covariates. Most of these methods are applied to causal inference in the context of binary treatment in high-dimensional settings. In real-world scenarios, the treatments under investigation frequently involve continuous variables [13, 14]. Gao et al. introduced the generalized outcome-adaptive lasso (GOAL) method [15], which emphasized variable selection for causal effect estimation in omics data. The GOAL method applied the non-parametric covariate balancing generalized propensity score (npCBGPS) method to estimate the balance weight, making it suitable for handling continuous treatment [16].
Both the OAL and GOAL methods constructed penalty weights in objective function using a full linear outcome regression model, which requires that the outcome is normally distributed. In practical research, skewed distribution of data is frequently encountered [17, 18]. A skewed distribution, also known as an asymmetric distribution [19], refers to a probability distribution of a dataset where the data are not evenly or symmetrically distributed around the mean. In our real data application based on the Alzheimer’s disease (AD) neuroimaging initiative (ADNI) database, we aim to explore the association between cerebrospinal fluid tau protein levels (CSF-tau) and the severity of AD. The severity of AD exhibits a positively skewed distribution. In cases where the outcome demonstrates a skewed distribution or contains contaminated data, constructing penalty weights using a linear outcome regression may lead to inaccurate variable selection and may result in imprecise estimation of the causal effect. Median regression is to some extent robust to outliers and extreme values [20]. Therefore, employing a median regression model to construct penalty weights may better ensure the accuracy of variable selection and causal effect estimation, particularly when the outcome follows skewed distributions.
In this paper, to reduce the impact of skewed distributions of the outcome, we proposed the generalized median adaptive lasso (GMAL) method, which is motivated by OAL and GOAL for accurate variable selection and causal effect estimation. Our proposed method is applicable to both continuous and binary treatments. A distinctive feature of our proposed method is that we utilize a linear median regression model for constructing penalty weights, thereby enabling an accurate variable selection and causal estimation even when the outcome exhibits skewed distributions. Simulation results showed that our proposed method performs comparably to the existing methods in variable selection when the outcome follows a symmetric distribution. Besides, the proposed method exhibited obvious superiority over the existing methods when the outcome variable follows a skewed distribution. Meanwhile, the proposed method consistently exhibited superior causal effect estimation performance compared to the existing methods when considering the root-mean-square error (RMSE).
This article is structured as follows. First, we introduce the notation and assumptions and provide an overview of the inverse probability weighting (IPW) method with continuous treatment. Second, we present the GMAL method with continuous treatment in detail. The notation, assumptions and proposed method with binary treatment are shown in the Supplementary Appendix. Third, we carry out simulations to assess the performance of the proposed method and existing methods. Fourth, we apply our method to a deoxyribonucleic acid (DNA) methylation (DNAm) dataset from the ADNI database to explore the effects of CSF-tau on the severity of AD. Finally, we present a brief discussion of the results.
NOTATIONS, ASSUMPTIONS AND IPW ESTIMATOR
For causal inference with continuous treatment variables, we focus on the dose–response function (DRF). We formulate the DRF under the potential outcome framework [21, 22]. Suppose we have drawn $n$ units from the population of interest. The observed data $\{(T_i, Y_i, X_i)\}_{i=1}^{n}$ are independent and identically distributed copies of $(T, Y, X)$, where $T \in \mathcal{T}$, $Y$ and $X$ denote the treatment level ($\mathcal{T}$ denotes a continuous domain), the outcome and a $p$-dimensional vector of pre-treatment covariates, respectively. Let $Y(t)$ denote the potential outcome under treatment level $t$. Then, the DRF is defined as $\mu(t) = E[Y(t)]$.
For the observational data, we first assume that there is no interference between units, and then we make the following causal inference assumptions [23]:
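(A1) Consistency: $Y = Y(t)$ if $T = t$.
(A2) Unconfoundedness: $Y(t) \perp T \mid X$ for all treatment levels $t$, where $\perp$ denotes statistical independence.
(A3) Positivity: $f(t \mid x) > 0$ for all treatment levels $t$ and covariate values $x$, where $f(T \mid X)$ denotes the generalized propensity score (GPS).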
To obtain the IPW estimator, we first employ the npCBGPS method to calculate weights [16]. The npCBGPS method estimates weights utilizing an empirical likelihood approach while simultaneously satisfying balance conditions [16]. The npCBGPS method does not require the specification of a GPS model when estimating balance weights, thus demonstrating a degree of robustness against potential misspecification of the GPS model. The weight of the $i$th observation is defined as follows:

$$w_i = \frac{f(T_i)}{f(T_i \mid X_i)}, \tag{1}$$

where $f(T_i)$ is the marginal density of the treatment $T$, and $f(T_i \mid X_i)$ is the conditional density of the treatment given $X$, also called the GPS.
We then estimate the DRF through a marginal structural model (MSM) method [24]. This method involves constructing an MSM for the potential outcomes, which we assume to be linear:

$$E[Y(t)] = \theta_0 + \theta_1 t. \tag{2}$$

Under the above causal inference assumptions (A1–A3), we could obtain consistent estimates for the parameters in Equation (2) using the IPW method based on the observed data $\{(T_i, Y_i, X_i)\}_{i=1}^{n}$.
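As a rough illustration of how weights enter the estimation of a linear MSM, the sketch below fits a weighted least squares line of the outcome on the treatment. It does not implement npCBGPS: as a stand-in, it uses parametric stabilized weights (marginal density over conditional density) under normal working models, which is an assumption made only for this example. The simulated data, function names and coefficients are hypothetical.

```python
import numpy as np

def stabilized_weights(t, x):
    """Stabilized weights f(T)/f(T|X) under normal working models (an assumption)."""
    design = np.column_stack([np.ones_like(t), x])
    coef, *_ = np.linalg.lstsq(design, t, rcond=None)
    resid = t - design @ coef
    # Normal log-densities; the 1/sqrt(2*pi) constants cancel in the ratio.
    log_num = -0.5 * ((t - t.mean()) / t.std()) ** 2 - np.log(t.std())
    log_den = -0.5 * (resid / resid.std()) ** 2 - np.log(resid.std())
    return np.exp(log_num - log_den)

def weighted_msm(t, y, w):
    """Weighted least squares fit of E[Y(t)] = intercept + slope * t."""
    design = np.column_stack([np.ones_like(t), t])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return coef  # [intercept, slope]

# Hypothetical data: one confounder x drives both treatment and outcome.
rng = np.random.default_rng(0)
n = 10000
x = rng.normal(size=n)
t = 0.3 * x + rng.normal(size=n)
y = 2.0 * t + x + rng.normal(size=n)   # true slope of the DRF is 2

naive_slope = weighted_msm(t, y, np.ones(n))[1]                 # ignores confounding
ipw_slope = weighted_msm(t, y, stabilized_weights(t, x))[1]     # reweighted fit
print(naive_slope, ipw_slope)
```

In this toy setting the unweighted slope absorbs part of the confounder's effect, while the weighted fit should land closer to the true slope.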
THE PROPOSED METHOD
Variable selection based on GMAL
We now discuss the four categories of pre-treatment covariates: confounders ($X_C$) that affect both the treatment and the outcome; prognostic covariates ($X_P$) that affect the outcome but not the treatment; instrumental covariates ($X_I$) that affect the treatment but not the outcome; and spurious covariates ($X_S$) that are unrelated to both the treatment and the outcome. We should include all confounders ($X_C$) and exclude spurious covariates ($X_S$) in the adjustment set to avoid bias. In addition, we should include all prognostic covariates ($X_P$) and exclude instrumental covariates ($X_I$) to increase statistical efficiency. As a result, $\{X_C, X_P\}$ is our target adjustment set. Specifically, we propose the GMAL objective function for the case of a continuous treatment variable as follows:
![]() |
(3) |
where $\lambda$ is a nonnegative regularization parameter; $\hat{w}_j$ is the penalty weight, with $\hat{w}_j = |\tilde{\beta}_j|^{-\gamma}$; $|\cdot|$ denotes the absolute value and $\|\cdot\|_2^2$ denotes the squared sum; and $\tilde{\beta}_j$ is the unpenalized coefficient estimate for the $j$th covariate, conditional on the treatment, from the median regression model.
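To make the role of the median regression concrete, the following sketch computes adaptive-lasso style penalty weights from a median (least absolute deviations) fit of the outcome on the treatment and covariates. Several things here are assumptions for illustration: the weight form `|beta_j| ** (-gamma)` follows the usual adaptive lasso convention, the median regression is approximated by iteratively reweighted least squares rather than an exact solver, and the data, `gamma=2.0` and function names are hypothetical.

```python
import numpy as np

def median_regression(design, y, n_iter=200, eps=1e-6):
    """Approximate least absolute deviations fit via iteratively reweighted least squares."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)   # start from OLS
    for _ in range(n_iter):
        resid = np.abs(y - design @ coef)
        sw = 1.0 / np.sqrt(np.maximum(resid, eps))       # square root of IRLS weights
        coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return coef

def penalty_weights(t, x, y, gamma):
    """Weights |beta_j|^(-gamma) from a median regression of y on (1, t, x)."""
    design = np.column_stack([np.ones(len(y)), t, x])
    beta = median_regression(design, y)[2:]   # drop intercept and treatment
    return np.abs(beta) ** (-gamma)

# Hypothetical data with heavy-tailed, positively skewed errors.
rng = np.random.default_rng(1)
n, p = 2000, 4
x = rng.normal(size=(n, p))
t = 0.5 * x[:, 0] + rng.normal(size=n)
noise = rng.pareto(1.5, size=n) + 1.0                 # classical Pareto, location 1, shape 1.5
y = 2.0 * t + 1.0 * x[:, 0] + 1.0 * x[:, 1] + noise   # x[:, 2:] do not affect y
w_hat = penalty_weights(t, x, y, gamma=2.0)
print(w_hat)
```

Covariates with near-zero median regression coefficients receive very large penalty weights, so a weighted lasso penalty would tend to drop them, while outcome-related covariates are penalized lightly.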
Selecting $\lambda$
Selecting an appropriate tuning parameter $\lambda$ matters in practical applications. We assembled a set of candidate $\lambda$ values, following the approach used by Shortreed et al. [6], where each $\lambda$ corresponds to a specific covariate set. Here, we present how to select the optimal tuning parameter $\lambda$.
To begin, with
and a given
, we select a group of covariates
from the set
, as detailed in the ‘Variable selection based on GMAL’ subsection. Subsequently, we employ the adjustment set
to calculate the balance weight using the npCBGPS approach [16].
Next, the optimal
is selected by minimizing a dual-weight correlation (DWC) as proposed by Gao et al. [15].
![]() |
where
represents the weighted correlation coefficient between
th covariate and treatment;
represents the unpenalized estimation coefficient for the
th covariate conditioned on treatment.
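The selection criterion described above can be sketched in code. The exact DWC formula is not reproduced here; the version below assumes, by analogy with the weighted absolute mean difference of Shortreed and Ertefaie [6], that the DWC sums the absolute weighted correlation between each covariate and the treatment, scaled by the absolute unpenalized coefficient of that covariate. The function names and toy data are hypothetical.

```python
import numpy as np

def weighted_corr(a, b, w):
    """Pearson correlation of a and b under observation weights w."""
    w = w / w.sum()
    a_c = a - np.sum(w * a)
    b_c = b - np.sum(w * b)
    cov = np.sum(w * a_c * b_c)
    return cov / np.sqrt(np.sum(w * a_c**2) * np.sum(w * b_c**2))

def dual_weight_correlation(x, t, balance_w, beta):
    """Assumed form: sum over j of |beta_j| * |weighted corr(x_j, t)|."""
    corrs = np.array([weighted_corr(x[:, j], t, balance_w) for j in range(x.shape[1])])
    return float(np.sum(np.abs(beta) * np.abs(corrs)))

# Toy data: the first covariate is perfectly correlated with the treatment.
x = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
t = np.array([1.0, 2.0, 3.0, 4.0])
uniform_w = np.ones(4)
dwc_first_only = dual_weight_correlation(x, t, uniform_w, np.array([1.0, 0.0]))
print(dwc_first_only)
```

Under this assumed form, one would compute the balance weights for each candidate tuning parameter, evaluate the DWC, and keep the candidate with the smallest value.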
SIMULATIONS
In this section, we conducted a series of simulation studies to assess the performance of the proposed method in settings with continuous or binary treatment variables. Furthermore, we generated outcome variables with errors from the standard normal distribution, the Pareto distribution and Student’s $t$-distribution. Here, we assumed linear outcome regression models. The vector of potential covariates $X$ was obtained from a multivariate Gaussian distribution
. The covariance was set to
or
. We examined the impact of covariates with different correlation structures by considering
(independent covariates) and
(correlated covariates). The simulations were repeated 100 times for each data-generating process. The results are evaluated using relative bias, RMSE and the proportion of covariates selected.
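The three evaluation metrics can be computed as in the minimal sketch below. The definitions are the conventional ones and are assumed here: relative bias as the percentage deviation of the mean estimate from the true value, RMSE as the root of the mean squared deviation across runs, and selection proportion as the share of runs in which a covariate index was selected. Function names and the toy numbers are hypothetical.

```python
import math

def relative_bias_percent(estimates, truth):
    """Percentage deviation of the mean estimate from the true value."""
    mean_est = sum(estimates) / len(estimates)
    return 100.0 * (mean_est - truth) / truth

def rmse(estimates, truth):
    """Root-mean-square error of the estimates around the true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))

def selection_proportion(selected_sets, j):
    """Share of simulation runs in which covariate index j was selected."""
    return sum(j in s for s in selected_sets) / len(selected_sets)

# Toy example with four simulation runs and a true parameter value of 2.
estimates = [2.1, 1.9, 2.2, 2.0]
runs = [{0, 1}, {0}, {0, 2}, {0, 1}]
print(relative_bias_percent(estimates, 2.0), rmse(estimates, 2.0))
print(selection_proportion(runs, 0), selection_proportion(runs, 1))
```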
For $\lambda$, we considered the same eight possible values used by Shortreed et al. [6] for each dataset. In the settings with continuous treatment, the optimal $\lambda$ is selected by minimizing the DWC. In the settings with binary treatment, the optimal $\lambda$ is determined by minimizing the weighted absolute mean difference (wAMD). We set $\gamma$ such that $\lambda n^{\gamma/2 - 1} = n^{2}$ for each $\lambda$ [6, 8, 25].
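The link between the two tuning parameters can be worked out explicitly. Assuming the convention of Shortreed and Ertefaie [6], in which the exponent $\gamma$ of the penalty weight is tied to $\lambda$ through $\lambda n^{\gamma/2 - 1} = n^{2}$, taking logarithms gives $\gamma = 2\,(3 - \log\lambda / \log n)$. The sketch below evaluates this for a few illustrative values; the sample size and the exponents are hypothetical and are not the grid used in the simulations.

```python
import math

def gamma_for(lam, n, target_exponent=2.0):
    """Solve lam * n**(gamma/2 - 1) = n**target_exponent for gamma."""
    return 2.0 * (target_exponent + 1.0 - math.log(lam) / math.log(n))

n = 500  # hypothetical sample size
for exponent in (-2.0, -0.5, 0.49):  # illustrative choices, with lam = n**exponent
    lam = n ** exponent
    gamma = gamma_for(lam, n)
    # Check that the defining relation holds for the solved gamma.
    assert math.isclose(lam * n ** (gamma / 2 - 1), n ** 2, rel_tol=1e-9)
    print(exponent, round(gamma, 2))
```

Smaller values of $\lambda$ are paired with larger $\gamma$, so the overall penalty on covariates with small outcome coefficients stays strong across the grid.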
Continuous treatment
In continuous treatment setting, the continuous treatment
is generated from a standard normal distribution with a specified linear GPS model,
![]() |
where
In Scenarios 1–3, we considered the sample size (
) and covariates dimension (
) as
. We also considered
in the Supplementary Materials.
Scenario 1
In Scenario 1, we aim to evaluate the performance of GMAL when the outcome variable follows a symmetric distribution. Specifically, we focused on simulations with the random error term arising from the standard normal distribution,
![]() |
Scenario 2
In Scenario 2, we aim to evaluate the performance of our proposed method when the outcome variable follows a skewed distribution. Skewed distributions can take many forms, including the exponential, Weibull and Pareto distributions [19]. We focused on simulations with the random error term arising from the Pareto distribution, a typical positively skewed distribution [26]. We considered Pareto distributions with varying shape parameters (3, 2, 1.5 and 1.3) and a fixed location parameter (1). The shape parameter was varied to assess the influence of different degrees of tail heaviness on variable selection and causal estimation. The smaller the shape parameter, the heavier the tail of the Pareto distribution:
![]() |
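To give a feel for how the shape parameter controls tail heaviness, the sketch below draws Pareto errors with location 1 for the four shape values reported in the results tables and compares the sample mean with the sample median. It only illustrates the error distribution, not the full data-generating process; the function name, sample size and seed are hypothetical.

```python
import random
import statistics

def pareto_errors(shape, size, location=1.0, seed=0):
    """Draw classical Pareto errors with the given shape and location."""
    rng = random.Random(seed)
    # random.paretovariate draws from a Pareto distribution with minimum 1.
    return [location * rng.paretovariate(shape) for _ in range(size)]

for shape in (3, 2, 1.5, 1.3):
    draws = pareto_errors(shape, 20000)
    # The mean is driven by rare extreme draws; the median is far more stable.
    print(shape, round(statistics.mean(draws), 2), round(statistics.median(draws), 2))
```

As the shape parameter decreases, the sample mean becomes increasingly unstable (the variance is infinite for shape values of 2 or less), whereas the median stays close to its theoretical value $2^{1/\text{shape}}$. This is the intuition behind preferring a median regression for the penalty weights.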
Scenario 3
In Scenario 3, we focused on simulations with the random error term arising from Student’s $t$-distribution. We focused on the case where the degrees of freedom equal 5,
![]() |
(4) |
The true outcome variable generation process is defined as follows:
Under the linear data generative model,
and
are confounders;
and
are prognostic covariates;
and
are the instrumental covariates. Other covariates, except for
, are considered as spurious covariates. For simplicity, we are interested in the estimation of the parameter
rather than the linear DRF.
We compared GMAL with the GOAL method introduced by Gao et al. [15] and with four weighting methods based on fixed adjustment sets:
(i) GOAL: proposed by Gao et al. [15], which employed a ‘full’ linear outcome regression model to construct penalty weights.
(ii) Targ: the reference method, which uses the target adjustment set of confounders and prognostic covariates, $\{X_C, X_P\}$. Ideally, the variable selection of the proposed method is consistent with this method.
(iii) Conf: covariate adjustment set includes only the confounders ($X_C$).
(iv) PreT: covariate adjustment set includes the confounders ($X_C$) and the instrumental covariates ($X_I$).
(v) PotConf: covariate adjustment set includes the confounders ($X_C$), the prognostic covariates ($X_P$) and the instrumental covariates ($X_I$).
The npCBGPS method was also employed to estimate the balance weights for the Targ, Conf, PreT and PotConf methods, considering inclusion of different adjustment covariates.
Binary treatment
In a binary treatment setting, we focused on the estimation of the average treatment effect (ATE). The details of the proposed method with binary treatment are shown in the Supplementary Appendix. The binary treatment
is generated from a Bernoulli distribution with a specified logistic regression,
![]() |
where
such that
and
. In Scenarios 4–6, we considered the sample size (
) and covariates dimension (
) as
.
Scenario 4
In Scenario 4, we focused on simulations with the random error term arising from the standard normal distribution,
![]() |
Scenario 5
In Scenario 5, we focused on simulations with the random error term arising from the Pareto distribution. We considered Pareto distributions with varying shape parameters (3, 2, 1.5 and 1.3) and a fixed location parameter (1). The smaller the shape parameter, the heavier the tail of the Pareto distribution:
![]() |
Scenario 6
In Scenario 6, we focused on simulations with the random error term arising from Student’s $t$-distribution. We focused on the case where the degrees of freedom equal 5,
![]() |
(5) |
The true outcome variable generation process is defined as follows:
Under the linear data generative model,
is the confounder;
is the prognostic covariate;
and
are instrumental covariates. Other covariates, except for
, are considered as spurious covariates. We are interested in the estimation of the parameter
also called the ATE.
We compare our proposed GMAL method with the following methods for estimating ATE:
(i) OAL: in this study, the OAL was implemented with default settings from code provided in Shortreed and Ertefaie [6].
(ii) hdCBPS [10]: the hdCBPS method was implemented with default settings using R package CBPS.
(iii) GLiDeR: we use the R code provided in original paper to estimate the ATE [3].
(iv) Doubly robust semiparametric (DRS) method: we also use the R code provided in original paper to estimate the ATE [9].
SIMULATION RESULTS
Continuous treatment
The variable selection performance of our proposed method, as well as other methods, is shown in Figure 1. We present the proportions of the first 30 covariates selected in 100 simulation runs, as the selection proportion for each spurious covariate was similar. In the settings of Scenarios 1 and 3, both the GMAL method and the GOAL method excelled in the selection of confounders and prognostic covariates, with their selection proportions close to 1. Additionally, the rates for selecting instrumental and spurious covariates consistently remained <30%. In Scenario 2, when the shape parameter of the Pareto distribution decreased from 3 to 1.3, the variable selection performance of the GMAL method was almost unchanged, with the rates for selecting confounders and prognostic covariates consistently remaining close to 1. However, the GOAL method showed a marked reduction in the selection proportion of confounders and prognostic covariates, accompanied by an increase in the selection proportion of instrumental and spurious covariates.
Figure 1.
Proportion of the top 30 covariates being selected under Scenarios 1–3.
The bias distribution of parameter estimates and summary statistics are shown in Figure 3 and Table 1. In Scenarios 1 and 3, the performance of the proposed method GMAL and GOAL was similar. Targ exhibited the smallest variability when compared to the Conf, PreT and PotConf methods, whereas the PreT method consistently displayed the highest variability. In Scenario 2 with data generated from the Pareto distribution, we found that our proposed method always outperformed the GOAL method and performed similarly to the reference method, Targ. When the shape parameter of the Pareto distribution decreased from 3 to 1.3, our proposed method showed a modest increase in bias and RMSE, while those of the GOAL method increased substantially. Notably, when the shape parameter was 1.5 and 1.3, the GOAL method resulted in a largely biased estimation, with a relative bias >20%. The simulation results with correlated covariates, different sample sizes and covariate combinations are provided in Supplementary Table S1, Table S2, Figure S1 and Figure S2.
Figure 2.
Proportion of the top 30 covariates being selected under Scenarios 4–6.
Table 1.
Simulation results of each weighting method under Scenarios 1–3
| Distribution | Parameters | Method | Estimate | Bias (%) | RMSE |
|---|---|---|---|---|---|
| Normal | N (0, 1) | Proposed | 2.025 | 1.227 | 0.097 |
| | | GOAL | 2.018 | 0.888 | 0.098 |
| | | Targ | 1.988 | −0.580 | 0.071 |
| | | Conf | 1.987 | −0.669 | 0.139 |
| | | PreT | 2.059 | 2.926 | 0.302 |
| | | PotConf | 2.065 | 3.228 | 0.157 |
| Pareto | Pareto (1, 3) | Proposed | 2.003 | 0.167 | 0.047 |
| | | GOAL | 2.018 | 0.914 | 0.082 |
| | | Targ | 2.003 | 0.165 | 0.043 |
| | | Conf | 1.999 | −0.055 | 0.115 |
| | | PreT | 2.093 | 4.672 | 0.330 |
| | | PotConf | 2.097 | 4.833 | 0.188 |
| | Pareto (1, 2) | Proposed | 2.033 | 1.639 | 0.159 |
| | | GOAL | 2.073 | 3.667 | 0.277 |
| | | Targ | 2.013 | 0.681 | 0.115 |
| | | Conf | 2.004 | 0.221 | 0.141 |
| | | PreT | 2.158 | 7.898 | 0.580 |
| | | PotConf | 2.175 | 8.757 | 0.620 |
| | Pareto (1, 1.5) | Proposed | 2.123 | 6.127 | 0.773 |
| | | GOAL | 2.411 | 20.570 | 2.340 |
| | | Targ | 2.044 | 2.217 | 0.325 |
| | | Conf | 2.022 | 1.086 | 0.312 |
| | | PreT | 2.346 | 17.277 | 1.394 |
| | | PotConf | 2.463 | 23.17 | 2.516 |
| | Pareto (1, 1.3) | Proposed | 2.142 | 7.078 | 1.357 |
| | | GOAL | 3.745 | 87.232 | 12.560 |
| | | Targ | 2.090 | 4.505 | 0.652 |
| | | Conf | 2.048 | 2.380 | 0.620 |
| | | PreT | 2.618 | 30.894 | 2.584 |
| | | PotConf | 2.967 | 48.327 | 6.115 |
| t | df = 5 | Proposed | 2.030 | 1.498 | 0.126 |
| | | GOAL | 2.016 | 0.786 | 0.127 |
| | | Targ | 2.007 | 0.373 | 0.087 |
| | | Conf | 2.007 | 0.357 | 0.148 |
| | | PreT | 2.050 | 2.479 | 0.317 |
| | | PotConf | 2.085 | 4.274 | 0.256 |
Notations: RMSE, root-mean-square error of the estimates around the true parameter value across simulation runs; Normal, standard normal distribution; Pareto, Pareto distribution; t, Student’s $t$-distribution; df, degrees of freedom.
Binary treatment
The variable selection performances of our proposed method, the OAL method and the GLiDeR method are shown in Figure 2. We also present the proportions of the first 30 covariates selected in 100 simulation runs, as the selection proportion for each spurious covariate was similar. In the settings of Scenarios 4 and 6, both the proposed method and the OAL method outperformed the GLiDeR method in the selection of instrumental covariates. In Scenario 5, when the shape parameter of the Pareto distribution decreased from 3 to 1.3, the variable selection performance of the proposed method was almost unchanged, with confounders and prognostic covariates selected close to 100% of the time and instrumental variables selected <20% of the time. However, the OAL and GLiDeR methods showed a marked reduction in the selection proportion of confounders and prognostic covariates, accompanied by an increase in the selection proportion of instrumental covariates.
The bias distribution of ATE estimates and summary statistics are shown in Figure 4 and Table 2. In Scenarios 4 and 6, the performance of the proposed method and OAL method was similar in terms of relative bias and RMSE. The proposed method and OAL method outperformed the hdCBPS, GLiDeR and DRS methods in terms of RMSE. In Scenario 5 with outcome data generated from the Pareto distribution, our proposed method had the smallest RMSE among the compared methods in every setting. It also had the smallest relative bias except under Pareto (1, 3), where the OAL method was marginally lower (0.718% versus 0.731%).
Figure 4.
Box plot of the bias of parameter estimates of ATE under Scenarios 4–6. The bias was calculated by subtracting 2 from the estimates.
Table 2.
Simulation results of each weighting method under Scenarios 4–6
| Distribution | Parameters | Method | Estimate | Bias (%) | RMSE |
|---|---|---|---|---|---|
| Normal | N (0, 1) | Proposed | 1.991 | −0.475 | 0.064 |
| | | OAL | 1.992 | −0.417 | 0.065 |
| | | hdCBPS | 1.993 | −0.367 | 0.121 |
| | | GLiDeR | 1.994 | −0.324 | 0.077 |
| | | DRS | 2.005 | 0.237 | 0.118 |
| Pareto | Pareto (1, 3) | Proposed | 2.015 | 0.731 | 0.056 |
| | | OAL | 2.014 | 0.718 | 0.057 |
| | | hdCBPS | 2.030 | 1.515 | 0.101 |
| | | GLiDeR | 2.016 | 0.803 | 0.073 |
| | | DRS | 2.059 | 2.945 | 0.122 |
| | Pareto (1, 2) | Proposed | 2.054 | 2.706 | 0.205 |
| | | OAL | 2.063 | 3.150 | 0.335 |
| | | hdCBPS | 2.170 | 8.491 | 0.346 |
| | | GLiDeR | 2.069 | 3.469 | 0.322 |
| | | DRS | 2.176 | 8.788 | 0.404 |
| | Pareto (1, 1.5) | Proposed | 2.235 | 11.748 | 0.970 |
| | | OAL | 2.349 | 17.468 | 1.480 |
| | | hdCBPS | 2.464 | 23.195 | 1.366 |
| | | GLiDeR | 2.292 | 14.613 | 1.661 |
| | | DRS | 2.460 | 23.015 | 1.850 |
| | Pareto (1, 1.3) | Proposed | 2.646 | 32.280 | 2.853 |
| | | OAL | 2.864 | 43.194 | 3.953 |
| | | hdCBPS | 2.952 | 47.588 | 3.861 |
| | | GLiDeR | 2.722 | 36.106 | 5.077 |
| | | DRS | 3.017 | 50.867 | 5.511 |
| t | df = 5 | Proposed | 1.997 | −0.140 | 0.091 |
| | | OAL | 1.994 | −0.286 | 0.093 |
| | | hdCBPS | 1.981 | −0.943 | 0.158 |
| | | GLiDeR | 2.007 | 0.370 | 0.113 |
| | | DRS | 2.018 | 0.922 | 0.152 |
Notations: RMSE, root-mean-square error of the estimates around the true parameter value across simulation runs; Normal, standard normal distribution; Pareto, Pareto distribution; t, Student’s $t$-distribution; df, degrees of freedom.
Figure 3.
Box plot of the bias of the parameter estimates under Scenarios 1–3. The bias was calculated by subtracting 2 from the estimates.
REAL DATA APPLICATION
We applied the GMAL method to real-world data from the ADNI study. We examined the baseline characteristics and genetic covariates. Specifically, we focused on the CSF-tau. It is acknowledged that tau plays a crucial role in microtubule polymerization and stabilization [27]. Previous studies have indicated that abnormalities in tau protein can trigger the AD cascade, leading to dementia [28]. Pathogenic CSF-tau can activate the antiviral pathway in microglial cells and inhibit neuronal self-repair mechanisms, thereby promoting cognitive impairment [29], which further supports the potential therapeutic value of drugs targeting CSF-tau or the antiviral pathway for treating AD [29]. In this study, our goal is to explore the potential influence of CSF-tau on AD severity.
The CSF-tau measured at baseline is the exposure of interest. The outcome we are concerned with is the severity of AD measured at Month 24. We employed the widely accepted 11-item version of Alzheimer’s Disease Assessment Scale (ADAS-11) cognitive score to evaluate the extent of AD severity. ADAS-11 score ranges from 0 to 70, where a higher score signifies a greater severity.
The dataset had 364 participants with complete information. In the GMAL analysis, age, gender and education level were considered as known risk factors for AD [30]. Zhang et al. discovered significant DNAm patterns in CSF biomarkers among individuals with AD and those who were cognitively normal [31]. Their research elucidated that DNAm in blood is indicative of the biological processes linked to early brain impairment associated with AD [31]. Moreover, they found associations between blood DNAm at multiple CpG sites and tau pathology as well as DNAm in the brain [31]. Therefore, we also included the whole-genome CpG sites as candidate covariates. The whole-genome CpG sites may include confounders and prognostic covariates or surrogates for these two types of covariates [32]. We initially pre-processed the DNAm profiles by: (i) excluding probes with $P$-value $>$ 0.05; (ii) filtering out gender-related probes; (iii) removing probes with SNPs at CpG sites; (iv) eliminating cross-reactive probes and (v) averaging DNAm levels for samples measured multiple times [33]. After pre-processing, 865 859 CpG sites were retained in the analysis as candidate covariates. Because GMAL and GOAL are not applicable when $p > n$, we initially performed an epigenome-wide association study (EWAS) analysis. We selected the top 100 CpG sites according to the Bonferroni-adjusted $P$ values of each CpG site. After accounting for age, sex and education level, as well as the covariates chosen via the GMAL and GOAL methods, we estimated the parameter of interest and 95% confidence intervals (95% CIs). The CIs were determined through bootstrapping with 200 replications.
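A bootstrap interval of the kind described above can be sketched as follows. The text does not state which bootstrap interval was used, so the percentile interval below is an assumption. The estimator is a plain least squares slope on simulated pairs, standing in for the full selection and weighting pipeline, which would typically be re-run on every resample; the data and function names are hypothetical.

```python
import random

def ols_slope(pairs):
    """Least squares slope of y on t for a list of (t, y) pairs."""
    n = len(pairs)
    mt = sum(t for t, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((t - mt) * (y - my) for t, y in pairs)
    sxx = sum((t - mt) ** 2 for t, _ in pairs)
    return sxy / sxx

def percentile_bootstrap_ci(data, estimator, n_boot=200, level=0.95, seed=0):
    """Percentile bootstrap interval from n_boot resamples with replacement."""
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(
        estimator([data[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]

# Hypothetical exposure-outcome pairs.
data_rng = random.Random(1)
exposure = [data_rng.gauss(0, 10) for _ in range(300)]
pairs = [(t, 0.07 * t + data_rng.gauss(0, 1)) for t in exposure]

point = ols_slope(pairs)
ci_low, ci_high = percentile_bootstrap_ci(pairs, ols_slope)
print(round(point, 3), round(ci_low, 3), round(ci_high, 3))
```

With only 200 replications the interval endpoints are themselves somewhat noisy, which is worth keeping in mind when comparing interval widths between methods.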
The characteristics of the known risk factors according to different levels of ADAS-11 are summarized in Supplementary Table S4. A total of 64 CpG sites were utilized for the GMAL analysis, while the GOAL analysis involved 54 CpG sites. We listed the 64 CpG sites selected by GMAL in Supplementary Table S3. A subset of these selected variables, such as cg04874795, cg02674693 and cg01681367, have previously shown strong associations with AD [34, 35]. Assessing covariate balance is a critical aspect of ensuring accurate causal inference, as imbalances can introduce bias into estimates. We examined the balance of covariates by calculating the absolute Pearson’s correlation coefficient of CSF-tau with each covariate [16]. Figure 5 illustrates the performance of covariate balance in both the unweighted and the weighted samples. The weights generated by the GMAL method effectively achieved balance among covariates across various CSF-tau levels.
Figure 5.

Covariate balance achieved by the unweighted and GMAL methods. The Y-axis represents the absolute Pearson’s correlation between each selected covariate and the CSF-tau.
Results from Table 3 indicate that participants with high level of CSF-tau exhibited a higher severity of AD. The proposed method estimated a narrower CI (estimate, 0.069; 95% CI, 0.042–0.096) compared to the GOAL method (0.058; 0.031–0.101).
Table 3.
Causal estimator and corresponding 95% CIs using the proposed method and GOAL method
| Methods | Estimate | 95% CIs |
|---|---|---|
| Proposed | 0.069 | 0.042–0.096 |
| GOAL | 0.058 | 0.031–0.101 |
Notations: The 95% CIs were calculated using the bootstrap method with 200 replications.
DISCUSSION
In this article, we introduced the GMAL method, which is specifically designed for causal estimation from high-dimensional observational data. Our method improves upon the previous approaches, which used penalty weights derived from a linear ‘full’ outcome regression model. By utilizing the median regression model, we achieved a robust variable selection against heavy-tailed distributions. Moreover, we demonstrate the practical utility of GMAL through data analysis. Specifically, we illustrate its effectiveness in variable selection for causal inference from omics data.
Simulation studies showed that the GMAL method always outperformed the existing methods when the data were generated from Pareto distributions. Furthermore, as the shape parameter of the Pareto distribution decreases, indicating a higher degree of heavy tail, the variable selection performance of the OAL, GLiDeR and GOAL methods deteriorates significantly compared to the GMAL. One possible explanation for this could be that the penalty weights in the OAL, GLiDeR and GOAL methods are derived from the linear regression model that relates covariates to the outcome. The proposed method used a median regression model to construct weights that might be more robust for variable selection for an outcome with skewed distributions.
For biological data with more samples than features or variables, the GMAL method can be directly applied. However, when the number of covariates exceeds the number of participants ($p > n$), the GMAL method is not directly applicable. Ways to preprocess such biological data into an $n > p$ problem that GMAL can handle would be useful. In this article, we performed an EWAS analysis as dimensionality reduction preprocessing. The method of dimensionality reduction needs further research.
In our simulations, we did not take into account other types of outcome distributions, such as mixed distributions. In future research, it may be worthwhile to investigate the variable selection performance for outcome variables with mixed distribution.
Key Points
There has been a growing interest in variable selection for causal inference within the context of high-dimensional data. However, when the outcome exhibits a skewed distribution, ensuring the accuracy of variable selection might be challenging.
We proposed GMAL to select covariates that can achieve an accurate estimation of causal effects when the outcome follows skewed distributions.
The GMAL performs comparably to the existing methods in variable selection when the outcome follows a symmetric distribution. Besides, the GMAL exhibits an obvious superiority over the existing methods when the outcome variable follows a skewed distribution.
ETHICS APPROVAL AND CONSENT TO PARTICIPATE
Since the simulated datasets did not involve any human data, ethics approval was not applicable.
CONSENT FOR PUBLICATION
Not applicable.
AUTHORS’ CONTRIBUTIONS
T.W., G.Q. and Y.Y. conceived the study. Y.L., Q.G. and K.W. performed the analysis and prepared the manuscript, including figures and tables. All authors have provided critical comments on the draft and read and approved the final manuscript. Y.L., Q.G. and K.W. contributed equally to this work.
Supplementary Material
ACKNOWLEDGEMENTS
Not applicable.
Author Biographies
Yahang Liu is a PhD student at Fudan University. Her research focuses on variable selection for causal inference in high-dimensional settings and its application.
Qian Gao is an associate professor at Shanxi Medical University. Her research focuses on statistical methods for variable selection in high-dimensional settings and its application in omics data.
Kecheng Wei is a PhD student at Fudan University. His research focuses on statistical methods for causal inference.
Chen Huang is a PhD student at Fudan University. Her research focuses on causal inference research with multi-site survival data and distributed analysis.
Ce Wang is a PhD at Fudan University. His research focuses on statistical methods for causal inference in survival data.
Yongfu Yu is a Professor at Fudan University. His research focuses on causal inference and life course epidemiology (cardiometabolic disease). He also has interests in research design and statistical methods related to birth cohorts.
Guoyou Qin is a Professor at Fudan University. His research focuses on statistical methods for causal inference and complex data. He also concentrates on the application of statistical methods in medicine and public health, with a primary focus on the fields of oncology and chronic diseases.
Tong Wang is a Professor at Shanxi Medical University. His research focuses on developing statistical methods for complex data and causal inference. He also has research interests in determining risk and etiological factors of non-communicable diseases.
Contributor Information
Yahang Liu, Department of Biostatistics, School of Public Health, Fudan University, Shanghai, China.
Qian Gao, Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Key Laboratory of Coal Environmental Pathogenicity and Prevention (Shanxi Medical University), Ministry of Education, China.
Kecheng Wei, Department of Biostatistics, School of Public Health, Fudan University, Shanghai, China.
Chen Huang, Department of Biostatistics, School of Public Health, Fudan University, Shanghai, China.
Ce Wang, Department of Biostatistics, School of Public Health, Fudan University, Shanghai, China.
Yongfu Yu, Department of Biostatistics, School of Public Health, Fudan University, Shanghai, China; Shanghai Institute of Infectious Disease and Biosecurity, Shanghai, China; Key Laboratory of Public Health Safety of Ministry of Education, Key Laboratory for Health Technology Assessment, National Commission of Health, Fudan University, Shanghai, China.
Guoyou Qin, Department of Biostatistics, School of Public Health, Fudan University, Shanghai, China; Shanghai Institute of Infectious Disease and Biosecurity, Shanghai, China; Key Laboratory of Public Health Safety of Ministry of Education, Key Laboratory for Health Technology Assessment, National Commission of Health, Fudan University, Shanghai, China.
Tong Wang, Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Key Laboratory of Coal Environmental Pathogenicity and Prevention (Shanxi Medical University), Ministry of Education, China.
FUNDING
National Natural Science Foundation of China (Nos. 82173612 to G.Q., 82273730 to Y.Y., 82073674 to T.W. and 82204163 to Q.G.); Shanghai Rising-Star Program (21QA1401300 to Y.Y.); Shanghai Municipal Natural Science Foundation (22ZR1414900 to Y.Y.); Shanghai Municipal Science and Technology Major Project (ZD2021CY001 to G.Q.).
DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. These data can be found at: adni.loni.usc.edu.
References
1. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983;70(1):41–55.
2. Ertefaie A, Asgharian M, Stephens DA. Variable selection in causal inference using a simultaneous penalization method. J Causal Inference 2018;6(1).
3. Koch B, Vock DM, Wolfson J. Covariate selection with group lasso and doubly robust estimation of causal effects. Biometrics 2018;74(1):8–17.
4. Wilson A, Reich BJ. Confounder selection via penalized credible regions. Biometrics 2014;70(4):852–61.
5. Brookhart MA, Schneeweiss S, Rothman KJ, et al. Variable selection for propensity score models. Am J Epidemiol 2006;163(12):1149–56.
6. Shortreed SM, Ertefaie A. Outcome-adaptive lasso: variable selection for causal inference. Biometrics 2017;73(4):1111–22.
7. Antonelli J, Parmigiani G, Dominici F. High-dimensional confounding adjustment using continuous spike and slab priors. Bayesian Anal 2019;14(3):805–28.
8. Ye Z, Zhu Y, Coffman DL. Variable selection for causal mediation analysis using LASSO-based methods. Stat Methods Med Res 2021;30(6):1413–27.
9. Ghosh S, Tan Z. Doubly robust semiparametric inference using regularized calibrated estimation with high-dimensional data. Bernoulli 2022;28(3):1675–703.
10. Ning Y, Sida P, Imai K. Robust estimation of causal effects via a high-dimensional covariate balancing propensity score. Biometrika 2020;107(3):533–54.
11. Sun B, Tan Z. High-dimensional model-assisted inference for local average treatment effects with instrumental variables. J Bus Econ Stat 2022;40(4):1732–44.
12. Li Y, Li L. Propensity score analysis with local balance. Stat Med 2023;42:2637–60.
13. Mak K-K, Kim D-H, Leigh JP. Sociodemographic differences in the association between obesity and stress: a propensity score-matched analysis from the Korean National Health and Nutrition Examination Survey (KNHANES). Nutr Cancer 2015;67(5):804–10.
14. VanderWeele TJ, Hawkley LC, Thisted RA, Cacioppo JT. A marginal structural model analysis for loneliness: implications for intervention trials and clinical practice. J Consult Clin Psychol 2011;79(2):225–35.
15. Gao Q, Zhang Y, Liang J, et al. High-dimensional generalized propensity score with application to omics data. Brief Bioinform 2021;22(6). 10.1093/bib/bbab331.
16. Fong C, Hazlett C, Imai K. Covariate balancing propensity score for a continuous treatment: application to the efficacy of political advertisements. Ann Appl Stat 2018;12(1):156–77.
17. Zhang Z, Chen Z, Troendle JF, Zhang J. Causal inference on quantiles with an obstetric application. Biometrics 2012;68(3):697–706.
18. Zhang J, Troendle J, Reddy UM, et al. Contemporary cesarean delivery practice in the United States. Am J Obstet Gynecol 2010;203(4):326.e1–326.e10.
19. Chen L. Testing the mean of skewed distributions. J Am Stat Assoc 1995;90(430):767–72.
20. Yuan Y, MacKinnon DP. Robust mediation analysis based on median regression. Psychol Methods 2014;19(1):1–20.
21. Hirano K, Imbens GW. The propensity score with continuous treatments. Applied Bayesian modeling and causal inference from incomplete-data perspectives. 2004;226164:73–84.
22. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 1974;66(5):688–701.
23. Tang D, Kong D, Pan W, Wang L. Ultra-high dimensional variable selection for doubly robust causal inference. Biometrics 2022;79:903–14.
24. Robins JM. Association, causation, and marginal structural models. Synthese 1999;121(1/2):151–79.
25. Ju C, Benkeser D, Laan MJ. Robust inference on the average treatment effect using the outcome highly adaptive lasso. Biometrics 2020;76(1):109–18.
26. Sun S, Moodie EE, Nešlehová JG. Causal inference for quantile treatment effects. Environmetrics 2021;32(4):e2668.
27. Kametani F, Hasegawa M. Reconsideration of amyloid hypothesis and tau hypothesis in Alzheimer's disease. Front Neurosci 2018;12:25.
28. Iqbal K, Liu F, Gong CX, Grundke-Iqbal I. Tau in Alzheimer disease and related tauopathies. Curr Alzheimer Res 2010;7(8):656–64.
29. Mummery CJ, Börjesson-Hanson A, Blackburn DJ, et al. Tau-targeting antisense oligonucleotide MAPT(Rx) in mild Alzheimer's disease: a phase 1b, randomized, placebo-controlled trial. Nat Med 2023;29(6):1437–47.
30. Viña J, Lloret A. Why women have more Alzheimer's disease than men: gender and mitochondrial toxicity of amyloid-beta peptide. J Alzheimers Dis 2010;20(Suppl 2):S527–33.
31. Zhang W, Young JI, Gomez L, et al. Distinct CSF biomarker-associated DNA methylation in Alzheimer's disease and cognitively normal subjects. Res Sq 2023;15(1):78.
32. Higgins-Chen AT, Boks MP, Vinkers CH, et al. Schizophrenia and epigenetic aging biomarkers: increased mortality, reduced cancer risk, and unique clozapine effects. Biol Psychiatry 2020;88(3):224–35.
33. Shireby GL, Davies JP, Francis PT, et al. Recalibrating the epigenetic clock: implications for assessing biological age in the human cortex. Brain 2020;143(12):3763–75.
34. Li QS, Sun Y, Wang T. Epigenome-wide association study of Alzheimer's disease replicates 22 differentially methylated positions and 30 differentially methylated regions. Clin Epigenetics 2020;12(1):149.
35. Smith RG, Pishva E, Shireby G, et al. A meta-analysis of epigenome-wide association studies in Alzheimer's disease highlights novel differentially methylated loci across cortex. Nat Commun 2021;12(1):3517.