Briefings in Bioinformatics. 2024 Mar 2;25(2):bbae059. doi: 10.1093/bib/bbae059

High-dimensional generalized median adaptive lasso with application to omics data

Yahang Liu, Qian Gao, Kecheng Wei, Chen Huang, Ce Wang, Yongfu Yu, Guoyou Qin, Tong Wang
PMCID: PMC10939310  PMID: 38436558

Abstract

Recently, there has been a growing interest in variable selection for causal inference within the context of high-dimensional data. However, when the outcome exhibits a skewed distribution, ensuring the accuracy of variable selection and causal effect estimation might be challenging. Here, we introduce the generalized median adaptive lasso (GMAL) for covariate selection to achieve an accurate estimation of causal effect even when the outcome follows skewed distributions. A distinctive feature of our proposed method is that we utilize a linear median regression model for constructing penalty weights, thereby maintaining the accuracy of variable selection and causal effect estimation even when the outcome presents extremely skewed distributions. Simulation results showed that our proposed method performs comparably to existing methods in variable selection when the outcome follows a symmetric distribution. Besides, the proposed method exhibited obvious superiority over the existing methods when the outcome follows a skewed distribution. Meanwhile, our proposed method consistently outperformed the existing methods in causal estimation, as indicated by smaller root-mean-square error. We also utilized the GMAL method on a deoxyribonucleic acid methylation dataset from the Alzheimer’s disease (AD) neuroimaging initiative database to investigate the association between cerebrospinal fluid tau protein levels and the severity of AD.

Keywords: causal inference, propensity score, observational studies, variable selection

BACKGROUND

Causal inference from observational studies plays an important role in the field of biomedical research. A crucial challenge of causal inference in observational studies is confounding bias due to lack of randomization. In bioinformatics, propensity score (PS) methods are commonly employed for reducing confounding bias and understanding causality [1]. Estimating treatment effects within the framework of PS models is highly sensitive to the covariates for adjustment. Insufficient adjustment for confounders in the PS model leads to biased causal effect estimates [2–4]. Incorporating all confounders is important for unbiased treatment effect estimates. Inclusion of instrumental covariates in addition to confounders may result in a decrease in efficiency, while incorporating prognostic covariates can improve estimation efficiency [2–5]. Striking a balance between bias and efficiency through variable selection is of paramount concern in research [3, 6–8].

With the rapidly expanding data sources, including omics data, it is often the case that there is limited prior knowledge regarding the precise set of confounders. As a result, there has been a growing interest in variable selection for causal inference within the context of high-dimensional data [3, 6–12]. The group LASSO and doubly robust estimation (GLiDeR) method modified the regularization penalty to select confounders and prognostic covariates [3]. The high-dimensional covariate balancing PS (hdCBPS) method estimated the initial PS by maximizing a penalized generalized quasi-likelihood and calibrated the initial PS by balancing covariates selected from the outcome model [10]. Shortreed et al. presented the outcome-adaptive lasso (OAL) method [6]. This approach involves estimating the PS model using adaptive lasso under binary treatment [6]. The tuning parameters were selected depending on the covariate balance between different treatment groups. The OAL method has the capability to select confounders and prognostic covariates, while effectively excluding the instrumental and spurious covariates. Most of these methods are applied to causal inference in the context of binary treatment in high-dimensional settings. In real-world scenarios, the treatments under investigation frequently involve continuous variables [13, 14]. Gao et al. introduced the generalized outcome-adaptive lasso (GOAL) method [15], which emphasized variable selection for causal effect estimation in omics data. The GOAL method applied the non-parametric covariate balancing generalized propensity score (npCBGPS) method to estimate the balance weight, making it suitable for handling continuous treatment [16].

Both the OAL and GOAL methods constructed penalty weights in objective function using a full linear outcome regression model, which requires that the outcome is normally distributed. In practical research, skewed distribution of data is frequently encountered [17, 18]. A skewed distribution, also known as an asymmetric distribution [19], refers to a probability distribution of a dataset where the data are not evenly or symmetrically distributed around the mean. In our real data application based on the Alzheimer’s disease (AD) neuroimaging initiative (ADNI) database, we aim to explore the association between cerebrospinal fluid tau protein levels (CSF-tau) and the severity of AD. The severity of AD exhibits a positively skewed distribution. In cases where the outcome demonstrates a skewed distribution or contains contaminated data, constructing penalty weights using a linear outcome regression may lead to inaccurate variable selection and may result in imprecise estimation of the causal effect. Median regression is to some extent robust to outliers and extreme values [20]. Therefore, employing a median regression model to construct penalty weights may better ensure the accuracy of variable selection and causal effect estimation, particularly when the outcome follows skewed distributions.
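The robustness argument above can be illustrated with a small sketch. It compares an ordinary least-squares fit with a median (least absolute deviations) fit on toy data with a few extreme outcome values. The function names `ols_fit` and `median_fit` and the toy data are illustrative assumptions, and the brute-force search over pairs of points is chosen only for transparency with a single covariate; it is not the authors' implementation.

```python
from itertools import combinations


def ols_fit(x, y):
    """Ordinary least squares for one covariate; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def median_fit(x, y):
    """Median regression for one covariate; returns (intercept, slope).

    With an intercept and one covariate, some optimal line passes through
    two data points, so all such lines are enumerated and the one with the
    smallest sum of absolute residuals is kept.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = sum(abs(yk - intercept - slope * xk) for xk, yk in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


# Toy data: y = 1 + 2x, with three extreme values in the right tail.
x = list(range(21))
y = [1.0 + 2.0 * xi for xi in x]
for k in (18, 19, 20):
    y[k] += 100.0

print("OLS    (intercept, slope):", ols_fit(x, y))
print("Median (intercept, slope):", median_fit(x, y))
```

In this toy example the least-squares slope is pulled far from 2 by the three extreme outcomes, whereas the median fit is unaffected by them.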

In this paper, to reduce the impact of skewed distributions of the outcome, we proposed the generalized median adaptive lasso (GMAL) method, which is motivated by OAL and GOAL for accurate variable selection and causal effect estimation. Our proposed method is applicable to both continuous and binary treatments. A distinctive feature of our proposed method is that we utilize a linear median regression model for constructing penalty weights, thereby enabling an accurate variable selection and causal estimation even when the outcome exhibits skewed distributions. Simulation results showed that our proposed method performs comparably to the existing methods in variable selection when the outcome follows a symmetric distribution. Besides, the proposed method exhibited obvious superiority over the existing methods when the outcome variable follows a skewed distribution. Meanwhile, the proposed method consistently exhibited superior causal effect estimation performance compared to the existing methods when considering the root-mean-square error (RMSE).

This article is structured as follows. First, we introduce the notation and assumptions and provide an overview of the inverse probability weighting (IPW) method with continuous treatment. Second, we present the GMAL method with continuous treatment in detail. The notation, assumptions and the proposed method with binary treatment are shown in the Supplementary Appendix. Third, we carry out simulations to assess the performance of the proposed method and existing methods. Fourth, we apply our method to a deoxyribonucleic acid (DNA) methylation (DNAm) dataset from the ADNI database to explore the effects of CSF-tau on the severity of AD. Finally, we present a brief discussion of the results.

NOTATIONS, ASSUMPTIONS AND IPW ESTIMATOR

For causal inference with continuous treatment variables, we focus on the dose–response function (DRF). We formalize the DRF under the potential outcome framework [21, 22]. Suppose we have drawn Inline graphic units from the population of interest. The observed data Inline graphic are independent and identically distributed copies of Inline graphic, where Inline graphic denote the treatment level (Inline graphic denotes a continuous domain), the outcome and a Inline graphic-dimensional vector of pre-treatment covariates, respectively. Let Inline graphic denote the potential outcomes under treatment level Inline graphic. Then, the DRF is defined as Inline graphic.

For the observational data, we first assume that there is no interference between units, and then we make the following causal inference assumptions [23]:

(A1) Consistency: Inline graphic

(A2) Unconfoundedness: Inline graphic, where Inline graphic denotes statistical independence.

(A3) Positivity: Inline graphic where Inline graphic denotes the generalized propensity score (GPS).

To obtain the IPW estimator, we first employ the npCBGPS method to calculate weights [16]. The npCBGPS method estimates weights using an empirical likelihood approach while simultaneously satisfying balance conditions [16]. Because it does not require the specification of a GPS model when estimating balance weights, it is robust to some degree against misspecification of the GPS model. The weight of the Inline graphicth observation is defined as follows:

graphic file with name DmEquation2.gif (1)

where Inline graphic is the marginal density of the treatment Inline graphic, and Inline graphic is the conditional density of the treatment given Inline graphic, also called the GPS.
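To make the density-ratio form of the weight concrete, the sketch below computes the ratio of a marginal treatment density to a conditional treatment density under normal working models. This is a parametric stand-in chosen for illustration: the npCBGPS method used in the paper estimates the weights nonparametrically through empirical likelihood and does not fit these densities. The function name `stabilized_weights`, its arguments and the toy values are assumptions.

```python
from statistics import NormalDist, mean, pstdev


def stabilized_weights(treatment, cond_mean, cond_sd):
    """Ratio of a normal marginal density to a normal conditional density.

    treatment : observed treatment levels
    cond_mean : fitted mean of the treatment given covariates, per unit
                (assumed to come from some working model)
    cond_sd   : residual standard deviation of that working model
    """
    marginal = NormalDist(mean(treatment), pstdev(treatment))
    return [
        marginal.pdf(t) / NormalDist(m, cond_sd).pdf(t)
        for t, m in zip(treatment, cond_mean)
    ]


# Toy values: units whose treatment is poorly predicted by their covariates
# receive larger weights.
treatment = [0.2, 1.1, -0.4, 2.3, 0.9]
fitted = [0.5, 0.9, -0.1, 1.8, 0.7]
weights = stabilized_weights(treatment, fitted, cond_sd=0.5)
print([round(w, 3) for w in weights])
```

When the covariates carry no information about the treatment, the conditional and marginal densities coincide and every weight equals 1.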

We then estimate the DRF through a marginal structural model (MSM) method [24]. This method involves constructing an MSM to compute the potential outcomes, with our assumption being a linear model:

graphic file with name DmEquation3.gif (2)

Under the above causal inference assumptions (A1–A3), we could obtain consistent estimates for the parameters in Equation (2) using the IPW method based on the observed data Inline graphic.
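As a minimal sketch of the weighted estimation step, the code below fits a linear model of the outcome on the treatment by weighted least squares, which is one common way to estimate a linear MSM once balance weights are available. The function name `weighted_linear_fit` and the toy data are assumptions; the weights would in practice come from the npCBGPS step.

```python
def weighted_linear_fit(treatment, outcome, weights):
    """Weighted least squares of outcome on treatment.

    Returns (intercept, slope); the slope plays the role of the treatment
    coefficient in a linear dose-response model.
    """
    sw = sum(weights)
    mt = sum(w * t for w, t in zip(weights, treatment)) / sw
    my = sum(w * y for w, y in zip(weights, outcome)) / sw
    stt = sum(w * (t - mt) ** 2 for w, t in zip(weights, treatment))
    sty = sum(
        w * (t - mt) * (y - my)
        for w, t, y in zip(weights, treatment, outcome)
    )
    slope = sty / stt
    return my - slope * mt, slope


# Toy data lying exactly on y = 1 + 2t, with arbitrary positive weights.
treatment = [0.0, 1.0, 2.0, 3.0]
outcome = [1.0, 3.0, 5.0, 7.0]
weights = [0.5, 2.0, 1.0, 1.5]
print(weighted_linear_fit(treatment, outcome, weights))
```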

THE PROPOSED METHOD

Variable selection based on GMAL

We now discuss the four categories of pre-treatment covariates: confounders Inline graphic that impact both the treatment and the outcome; prognostic covariates Inline graphic that impact the outcome but not the treatment; instrumental covariates Inline graphic that impact the treatment but not the outcome; and spurious covariates Inline graphic that are unrelated to both the treatment and the outcome. We include all confounders (Inline graphic) and exclude spurious covariates (Inline graphic) from the adjustment set to avoid bias. In addition, we include all prognostic covariates (Inline graphic) and exclude instrumental covariates (Inline graphic) to increase statistical efficiency. As a result, Inline graphic is our target adjustment set. Specifically, we propose the following GMAL objective function for the case of a continuous treatment variable:

graphic file with name DmEquation4.gif (3)

where Inline graphic is a nonnegative regularization parameter; Inline graphic is the penalty weight and Inline graphic, Inline graphic represents the computation of absolute value and Inline graphic represents the computation of the squared sum; Inline graphic represents the unpenalized estimation coefficient of the Inline graphicth covariate conditioned on the treatment from the median regression model.
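To illustrate how coefficient-based penalty weights steer selection, the sketch below assumes the adaptive-lasso form used in OAL, where the weight of a covariate is the absolute value of its unpenalized outcome-model coefficient raised to a negative power. The function name `penalty_weights`, the exponent argument `gamma` and the guard `eps` are illustrative assumptions; in GMAL the coefficients would be taken from the median regression model.

```python
def penalty_weights(coefficients, gamma, eps=1e-12):
    """Adaptive-lasso style weights: |coefficient| ** (-gamma).

    A covariate with a coefficient near zero in the outcome model receives
    a very large penalty and is pushed out of the treatment model, while a
    covariate strongly related to the outcome is penalized lightly.
    eps guards against division by zero (an implementation assumption).
    """
    return [max(abs(b), eps) ** (-gamma) for b in coefficients]


# Toy coefficients: two outcome-related covariates and one near-zero one.
coefs = [2.0, 0.5, 0.001]
print(penalty_weights(coefs, gamma=2))
```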

Selecting Inline graphic

Selecting the appropriate tuning parameter Inline graphic holds significance in practical applications. We assembled a set of Inline graphic values, following the approach used by Shortreed et al. [6], where each Inline graphic corresponds to a specific covariate set. Here, we present how to select the optimal tuning parameter Inline graphic.

To begin, with Inline graphic and a given Inline graphic, we select a group of covariates Inline graphic from the set Inline graphic, as detailed in the ‘Variable selection based on GMAL’ subsection. Subsequently, we employ the adjustment set Inline graphic to calculate the balance weight using the npCBGPS approach [16].

Next, the optimal Inline graphic is selected by minimizing a dual-weight correlation (DWC) as proposed by Gao et al. [15].

graphic file with name DmEquation5.gif

where Inline graphic represents the weighted correlation coefficient between Inline graphicth covariate and treatment; Inline graphic represents the unpenalized estimation coefficient for the Inline graphicth covariate conditioned on treatment.
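A hedged sketch of this criterion follows. It assumes that the DWC sums, over covariates, the absolute outcome-model coefficient multiplied by the absolute weighted correlation between the covariate and the treatment, by analogy with the wAMD of Shortreed et al.; the exact expression is not reproduced in this text, so this form is an assumption. The function names `weighted_corr` and `dual_weight_correlation` are illustrative.

```python
import math


def weighted_corr(x, t, w):
    """Weighted Pearson correlation between covariate x and treatment t."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    mt = sum(wi * ti for wi, ti in zip(w, t)) / sw
    cov = sum(wi * (xi - mx) * (ti - mt) for wi, xi, ti in zip(w, x, t)) / sw
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / sw
    vt = sum(wi * (ti - mt) ** 2 for wi, ti in zip(w, t)) / sw
    return cov / math.sqrt(vx * vt)


def dual_weight_correlation(covariates, t, w, coefficients):
    """Assumed DWC form: sum_j |coefficient_j| * |weighted corr(X_j, T)|."""
    return sum(
        abs(b) * abs(weighted_corr(x, t, w))
        for x, b in zip(covariates, coefficients)
    )


# Toy data: the first covariate is perfectly correlated with the treatment,
# the second is uncorrelated with it.
t = [2.0, 4.0, 6.0, 8.0]
covariates = [[1.0, 2.0, 3.0, 4.0], [1.0, -1.0, -1.0, 1.0]]
uniform = [1.0, 1.0, 1.0, 1.0]
print(dual_weight_correlation(covariates, t, uniform, [3.0, 0.5]))
```

Under this form, residual imbalance in covariates that strongly predict the outcome is penalized most, so the tuning parameter with the smallest value is the one whose weights best balance the covariates that matter for the outcome.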

SIMULATIONS

In this section, we conducted a series of simulation studies to assess the performance of the proposed method in settings with continuous or binary treatment variables. Furthermore, we generated outcome variables with standard normal distribution, Pareto distribution and Student’s Inline graphic-distribution. Here, we assumed linear outcome regression models. The vector of potential covariates Inline graphic were obtained from a multivariate Gaussian distribution Inline graphic. The covariance was set to Inline graphic or Inline graphic. We examined the impact of covariates with different correlation structures by considering Inline graphic (independent covariates) and Inline graphic (correlated covariates). The simulations were repeated 100 times for each data-generating process. The results are evaluated using relative bias, RMSE and the proportion of covariates selected.
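The two accuracy metrics can be sketched as follows, assuming the usual definitions: relative bias as the difference between the mean estimate and the true value, expressed as a percentage of the true value, and RMSE as the square root of the mean squared deviation from the true value. The function names and the toy estimates are illustrative.

```python
import math


def relative_bias_pct(estimates, truth):
    """Relative bias in percent: 100 * (mean estimate - truth) / truth."""
    return 100.0 * (sum(estimates) / len(estimates) - truth) / truth


def rmse(estimates, truth):
    """Root-mean-square error of the estimates around the true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


# Toy estimates from three hypothetical simulation runs, true value 2.
estimates = [1.9, 2.1, 2.2]
print(relative_bias_pct(estimates, 2.0), rmse(estimates, 2.0))
```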

For Inline graphic, we considered the same eight possible values used by Shortreed et al. [6] Inline graphic for each dataset. In the settings with continuous treatment, the optimal Inline graphic is selected by minimizing a DWC. In the settings with binary treatment, the optimal Inline graphic is determined by minimizing a wAMD. We set Inline graphic such that Inline graphic for each Inline graphic [6, 8, 25].

Continuous treatment

In continuous treatment setting, the continuous treatment Inline graphic is generated from a standard normal distribution with a specified linear GPS model,

graphic file with name DmEquation6.gif

where Inline graphic. In Scenarios 1–3, we considered the sample size (Inline graphic) and covariate dimension (Inline graphic) as Inline graphic. We also considered Inline graphic in the Supplementary Materials.

Scenario 1

In Scenario 1, we aim to evaluate the performance of GMAL when the outcome variable follows symmetric distributions. Specifically, in Scenario 1, we focused on the simulations with the random error term Inline graphic arising from the standard normal distribution,

graphic file with name DmEquation7.gif

Scenario 2

In Scenario 2, we aim to evaluate the performance of our proposed method when the outcome variable follows a skewed distribution. Skewed distributions can take many forms, including the exponential, Weibull and Pareto distributions [19]. We focused on simulations with the random error term Inline graphic arising from the Pareto distribution, which is a typical positively skewed distribution [26]. We considered the Pareto distribution with varying shapes (Inline graphic) and fixed location (Inline graphic). The variation of Inline graphic was used to assess the influence of different degrees of tail heaviness on variable selection and causal estimation. The smaller Inline graphic is, the heavier the tail of the Pareto distribution:

graphic file with name DmEquation8.gif
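The effect of the shape parameter on skewness can be seen in a short sketch that draws Pareto errors with location 1 and compares the sample mean with the sample median. The function name `pareto_errors`, the seed and the sample size are illustrative assumptions, and the shape values are taken from the Pareto rows of Table 1.

```python
import random
import statistics


def pareto_errors(n, shape, location=1.0, seed=0):
    """Draw n Pareto(location, shape) variates.

    random.paretovariate(shape) has minimum value 1, so multiplying by the
    location gives a Pareto variable with that minimum.
    """
    rng = random.Random(seed)
    return [location * rng.paretovariate(shape) for _ in range(n)]


for shape in (3, 2, 1.5, 1.3):
    errors = pareto_errors(20000, shape)
    # The mean is pulled into the right tail as the shape decreases,
    # while the median stays close to 2 ** (1 / shape).
    print(shape, statistics.fmean(errors), statistics.median(errors))
```

The gap between mean and median widens as the shape decreases, which is the setting in which a mean-based outcome model and a median-based outcome model can be expected to behave differently.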

Scenario 3

In Scenario 3, we focused on simulations with Inline graphic arising from the Student’s Inline graphic-distribution with 5 degrees of freedom,

graphic file with name DmEquation9.gif (4)

The true outcome variable generation process is defined as follows: Inline graphic. Under the linear data-generating model, Inline graphic and Inline graphic are confounders; Inline graphic and Inline graphic are prognostic covariates; Inline graphic and Inline graphic are the instrumental covariates. Other covariates, except for Inline graphic, are considered spurious covariates. For simplicity, we are interested in the estimation of the parameter Inline graphic rather than the linear DRF.

We compared GMAL with the GOAL method introduced by Gao et al. [15] and with four reference weighting methods:

  • (i) GOAL: proposed by Gao et al. [15], which employed a ‘full’ linear outcome regression model to construct penalty weights.

  • (ii) Targ: the reference method, target adjustment set Inline graphic. Ideally, the variable selection of the proposed method is consistent with this method.

  • (iii) Conf: covariate adjustment set includes only Inline graphic.

  • (iv) PreT: covariate adjustment set includes Inline graphic and Inline graphic.

  • (v) PotConf: covariate adjustment set includes Inline graphic, Inline graphic and Inline graphic.

The npCBGPS method was also employed to estimate the balance weights for the Targ, Conf, PreT and PotConf methods, considering inclusion of different adjustment covariates.

Binary treatment

In the binary treatment setting, we focused on the estimation of the average treatment effect (ATE). The details of the proposed method with binary treatment are shown in the Supplementary Appendix. The binary treatment Inline graphic is generated from a Bernoulli distribution with a specified logistic regression,

graphic file with name DmEquation10.gif

where Inline graphic such that Inline graphic and Inline graphic. In Scenarios 4–6, we considered the sample size (Inline graphic) and covariate dimension (Inline graphic) as Inline graphic.

Scenario 4

In Scenario 4, we focused on the simulations with the random error term Inline graphic arising from the standard normal distribution,

graphic file with name DmEquation11.gif

Scenario 5

In Scenario 5, we focused on simulations with the random error term Inline graphic arising from the Pareto distribution. We considered the Pareto distribution with varying shapes (Inline graphic) and fixed location (Inline graphic). The smaller Inline graphic is, the heavier the tail of the Pareto distribution:

graphic file with name DmEquation12.gif

Scenario 6

In Scenario 6, we focused on simulations with Inline graphic arising from the Student’s Inline graphic-distribution with 5 degrees of freedom,

graphic file with name DmEquation13.gif (5)

The true outcome variable generation process is defined as follows: Inline graphic. Under the linear data-generating model, Inline graphic is the confounder; Inline graphic is the prognostic covariate; Inline graphic and Inline graphic are instrumental covariates. Other covariates, except for Inline graphic, are considered spurious covariates. We are interested in the estimation of the parameter Inline graphic, also called the ATE.

We compare our proposed GMAL method with the following methods for estimating ATE:

  • (i) OAL: in this study, the OAL was implemented with default settings from code provided in Shortreed and Ertefaie [6].

  • (ii) hdCBPS [10]: the hdCBPS method was implemented with default settings using R package CBPS.

  • (iii) GLiDeR: we used the R code provided in the original paper to estimate the ATE [3].

  • (iv) Doubly robust semiparametric (DRS) method: we also used the R code provided in the original paper to estimate the ATE [9].

SIMULATION RESULTS

Continuous treatment

The variable selection performance of our proposed method, as well as other methods, is shown in Figure 1. We present the proportions of the first 30 covariates selected in 100 simulation runs, as the selection proportion for each spurious covariate was similar. In the settings of Scenarios 1 and 3, both the GMAL method and the GOAL method excelled in the selection of confounders and prognostic covariates, with their selection proportions close to 1. Additionally, the rates for selecting instrumental and spurious covariates consistently remained <30%. In Scenario 2, when the parameter Inline graphic of Pareto distribution decreased from 3 to 1.3, the performance of variable selection of the GMAL method was almost unchanged with the rates for selecting confounders and prognostic covariates consistently remaining close to 1. However, the GOAL method resulted in a significant reduction in the proportion of confounders and prognostic covariates, which was accompanied by an increase in the proportion of instrumental and spurious covariates.

Figure 1.

Proportion of the top 30 covariates being selected under Scenarios 1–3 with Inline graphic.

The bias distribution of parameter estimates and summary statistics are shown in Figure 3 and Table 1. In Scenarios 1 and 3, the performance of the proposed method GMAL and GOAL was similar. Targ exhibited the smallest variability when compared to the Conf, PreT and PotConf methods, whereas the PreT method consistently displayed the highest variability. In Scenario 2 with data generated from the Pareto distribution, we found that our proposed method always outperformed the GOAL method and performed similarly to the reference method, Targ. When Inline graphic of the Pareto distribution decreased from 3 to 1.3, the bias and RMSE of our proposed method increased slightly, while those of the GOAL method increased substantially. Notably, when Inline graphic and Inline graphic, the GOAL method produced a heavily biased estimate, with a relative bias >20%. The simulation results with correlated covariates and with different sample sizes and covariate combinations are provided in Supplementary Table S1, Table S2, Figure S1 and Figure S2.

Figure 2.

Proportion of the top 30 covariates being selected under Scenarios 4–6 with Inline graphic.

Table 1.

Simulation results of each weighting method under Scenarios 1–3 with Inline graphic with Inline graphic

Distribution        Method     Estimate   Bias (%)   RMSE
Normal, N (0, 1)    Proposed   2.025      1.227      0.097
                    GOAL       2.018      0.888      0.098
                    Targ       1.988      −0.580     0.071
                    Conf       1.987      −0.669     0.139
                    PreT       2.059      2.926      0.302
                    PotConf    2.065      3.228      0.157
Pareto (1, 3)       Proposed   2.003      0.167      0.047
                    GOAL       2.018      0.914      0.082
                    Targ       2.003      0.165      0.043
                    Conf       1.999      −0.055     0.115
                    PreT       2.093      4.672      0.330
                    PotConf    2.097      4.833      0.188
Pareto (1, 2)       Proposed   2.033      1.639      0.159
                    GOAL       2.073      3.667      0.277
                    Targ       2.013      0.681      0.115
                    Conf       2.004      0.221      0.141
                    PreT       2.158      7.898      0.580
                    PotConf    2.175      8.757      0.620
Pareto (1, 1.5)     Proposed   2.123      6.127      0.773
                    GOAL       2.411      20.570     2.340
                    Targ       2.044      2.217      0.325
                    Conf       2.022      1.086      0.312
                    PreT       2.346      17.277     1.394
                    PotConf    2.463      23.17      2.516
Pareto (1, 1.3)     Proposed   2.142      7.078      1.357
                    GOAL       3.745      87.232     12.560
                    Targ       2.090      4.505      0.652
                    Conf       2.048      2.380      0.620
                    PreT       2.618      30.894     2.584
                    PotConf    2.967      48.327     6.115
t, df = 5           Proposed   2.030      1.498      0.126
                    GOAL       2.016      0.786      0.127
                    Targ       2.007      0.373      0.087
                    Conf       2.007      0.357      0.148
                    PreT       2.050      2.479      0.317
                    PotConf    2.085      4.274      0.256

Notations: RMSE is calculated as Inline graphic; Normal, standard normal distribution; Pareto, Pareto distribution; t, Student’s Inline graphic-distribution; df, degrees of freedom.

Binary treatment

The variable selection performances of our proposed method, OAL method and GLiDeR method are shown in Figure 2. We also present the proportions of the first 30 covariates selected in 100 simulation runs, as the selection proportion for each spurious covariate was similar. In the settings of Scenarios 4 and 6, both the proposed method and the OAL method outperformed the GLiDeR method in the selection of instrumental covariates. In Scenario 5, when the parameter Inline graphic of Pareto distribution decreased from 3 to 1.3, the performance of variable selection of the proposed method was almost unchanged, with confounders and prognostic covariates selected close to 100% and instrumental variables selected <20%. However, the OAL and GLiDeR methods resulted in a significant reduction in the proportion of confounders and prognostic covariates, accompanied by an increase in the proportion of instrumental covariates.

The bias distribution of ATE estimates and summary statistics are shown in Figure 4 and Table 2. In Scenarios 4 and 6, the performance of the proposed method and OAL method was similar in terms of relative bias and RMSE. The proposed method and OAL method outperformed the hdCBPS, GLiDeR and DRS methods. In Scenario 5 with outcome data generated from Pareto distribution, we found that our proposed method always outperformed the OAL, hdCBPS, GLiDeR and DRS methods in terms of relative bias and RMSE.

Figure 4.

Box plot of the bias of parameter estimates of ATE under Scenarios 4–6 with Inline graphic. The bias was calculated by subtracting 2 from the estimates.

Table 2.

Simulation results of each weighting method under Scenarios 4–6 with Inline graphic with Inline graphic

Distribution        Method     Estimate   Bias (%)   RMSE
Normal, N (0, 1)    Proposed   1.991      −0.475     0.064
                    OAL        1.992      −0.417     0.065
                    hdCBPS     1.993      −0.367     0.121
                    GLiDeR     1.994      −0.324     0.077
                    DRS        2.005      0.237      0.118
Pareto (1, 3)       Proposed   2.015      0.731      0.056
                    OAL        2.014      0.718      0.057
                    hdCBPS     2.030      1.515      0.101
                    GLiDeR     2.016      0.803      0.073
                    DRS        2.059      2.945      0.122
Pareto (1, 2)       Proposed   2.054      2.706      0.205
                    OAL        2.063      3.150      0.335
                    hdCBPS     2.170      8.491      0.346
                    GLiDeR     2.069      3.469      0.322
                    DRS        2.176      8.788      0.404
Pareto (1, 1.5)     Proposed   2.235      11.748     0.970
                    OAL        2.349      17.468     1.480
                    hdCBPS     2.464      23.195     1.366
                    GLiDeR     2.292      14.613     1.661
                    DRS        2.460      23.015     1.850
Pareto (1, 1.3)     Proposed   2.646      32.280     2.853
                    OAL        2.864      43.194     3.953
                    hdCBPS     2.952      47.588     3.861
                    GLiDeR     2.722      36.106     5.077
                    DRS        3.017      50.867     5.511
t, df = 5           Proposed   1.997      −0.140     0.091
                    OAL        1.994      −0.286     0.093
                    hdCBPS     1.981      −0.943     0.158
                    GLiDeR     2.007      0.370      0.113
                    DRS        2.018      0.922      0.152

Notations: RMSE is calculated as Inline graphic; Normal, standard normal distribution; Pareto, Pareto distribution; t, Student’s Inline graphic-distribution; df, degrees of freedom.

Figure 3.

Box plot of the bias of parameter estimates of Inline graphic under Scenarios 1–3 with Inline graphic. The bias was calculated by subtracting 2 from the estimates.

REAL DATA APPLICATION

We applied the GMAL method to real-world data from the ADNI study. We examined the baseline characteristics and genetic covariates. Specifically, we focused on CSF-tau. It is acknowledged that tau plays a crucial role in microtubule polymerization and stabilization [27]. Previous studies have indicated that abnormalities in tau protein can trigger the AD cascade, leading to dementia [28]. Pathogenic CSF-tau can activate the antiviral pathway in microglial cells and inhibit neuronal self-repair mechanisms, thereby promoting cognitive impairment [29], which further supports the potential therapeutic value of drugs targeting CSF-tau or the antiviral pathway for treating AD [29]. In this study, our goal is to explore the potential influence of CSF-tau on AD severity.

The CSF-tau measured at baseline is the exposure of interest. The outcome we are concerned with is the severity of AD measured at Month 24. We employed the widely accepted 11-item version of Alzheimer’s Disease Assessment Scale (ADAS-11) cognitive score to evaluate the extent of AD severity. ADAS-11 score ranges from 0 to 70, where a higher score signifies a greater severity.

The dataset had 364 participants with complete information. In the GMAL analysis, age, gender and education level were considered as known risk factors for AD [30]. Zhang et al. discovered significant DNAm patterns in CSF biomarkers among individuals with AD and those who were cognitively normal [31]. Their research elucidated that DNAm in blood is indicative of the biological processes linked to early brain impairment associated with AD [31]. Moreover, they found associations between blood DNAm at multiple CpG sites and tau pathology as well as DNAm in the brain [31]. Therefore, we also included the whole-genome CpG sites as candidate covariates. The whole-genome CpG sites may include confounders and prognostic covariates or surrogates for these two types of covariates [32]. We initially pre-processed the DNAm profiles by: (i) excluding probes with Inline graphic-value Inline graphic 0.05; (ii) filtering out gender-related probes; (iii) removing probes with SNPs at CpG sites; (iv) eliminating cross-reactive probes and (v) averaging DNAm levels for samples measured multiple times [33]. After pre-processing, 865 859 CpG sites were retained in the analysis as candidate covariates. Due to the inapplicability of GMAL and GOAL for Inline graphic, we initially performed an epigenome-wide association study (EWAS) analysis. We selected the top 100 CpG sites according to the Bonferroni-adjusted Inline graphic values of each CpG site. After accounting for age, sex and education level, as well as the covariates chosen via the GMAL and GOAL methods, we estimated the parameter of interest and 95% confidence intervals (95% CIs). The CIs were determined through bootstrapping with 200 replications.
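The bootstrap interval can be sketched as follows. The text does not state which bootstrap interval was used, so the percentile interval here is an assumption, as are the function name `bootstrap_percentile_ci` and the toy data. The sample mean stands in for the estimator; in the analysis the estimator would be the full weighting and estimation procedure re-run on each resampled dataset.

```python
import random


def bootstrap_percentile_ci(data, estimator, n_boot=200, level=0.95, seed=0):
    """Percentile bootstrap interval from n_boot resamples of the units."""
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(
        estimator([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    tail = (1.0 - level) / 2.0
    lower = stats[round(tail * n_boot)]
    upper = stats[round((1.0 - tail) * n_boot) - 1]
    return lower, upper


def sample_mean(values):
    return sum(values) / len(values)


# Toy data standing in for per-unit contributions to an estimate.
rng = random.Random(1)
data = [rng.gauss(0.07, 0.1) for _ in range(60)]
ci = bootstrap_percentile_ci(data, sample_mean)
print(ci)
```

With only 200 replications the interval endpoints are themselves somewhat noisy, which is worth keeping in mind when comparing interval widths between methods.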

The characteristics of the known risk factors according to different levels of ADAS-11 are summarized in Supplementary Table S4. A total of 64 CpG sites were utilized for the GMAL analysis, while the GOAL analysis involved 54 CpG sites. We listed the 64 CpG sites selected by GMAL in Supplementary Table S3. A subset of these selected variables, such as cg04874795, cg02674693 and cg01681367, have previously shown strong associations with AD [34, 35]. Assessing covariate balance is a critical aspect of ensuring accurate causal inference, as imbalances can introduce bias into estimates. We examined the balance of covariates by calculating the absolute Pearson’s correlation coefficient of CSF-tau with each covariate [16]. Figure 5 illustrates the performance of covariate balance in both the unweighted and the weighted samples. The weights generated by the GMAL method effectively achieved balance among covariates across various CSF-tau levels.

Figure 5.

Covariate balance achieved by the unweighted and GMAL methods. The Y-axis represents the absolute Pearson’s correlation between each selected covariate and the CSF-tau.

Results from Table 3 indicate that participants with higher levels of CSF-tau exhibited greater AD severity. The proposed method yielded a narrower CI (estimate, 0.069; 95% CI, 0.042–0.096) than the GOAL method (estimate, 0.058; 95% CI, 0.031–0.101).

Table 3.

Causal estimator and corresponding 95% CIs using the proposed method and GOAL method

Method     Estimate   95% CI
Proposed   0.069      0.042–0.096
GOAL       0.058      0.031–0.101

Notations: The 95% CIs were calculated using the bootstrap method with 200 replications.

DISCUSSION

In this article, we introduced the GMAL method, which is specifically designed for causal estimation from high-dimensional observational data. Our method improves upon the previous approaches, which used penalty weights derived from a linear ‘full’ outcome regression model. By utilizing the median regression model, we achieved a robust variable selection against heavy-tailed distributions. Moreover, we demonstrate the practical utility of GMAL through data analysis. Specifically, we illustrate its effectiveness in variable selection for causal inference from omics data.

Simulation studies showed that the GMAL method always outperformed the existing methods when the data were generated from Pareto distributions. Furthermore, as the shape parameter of the Pareto distribution decreases, indicating a higher degree of heavy tail, the variable selection performance of the OAL, GLiDeR and GOAL methods deteriorates significantly compared to the GMAL. One possible explanation for this could be that the penalty weights in the OAL, GLiDeR and GOAL methods are derived from the linear regression model that relates covariates to the outcome. The proposed method used a median regression model to construct weights that might be more robust for variable selection for an outcome with skewed distributions.

For biological data with more samples than features, the GMAL method can be applied directly. However, when the number of covariates exceeds the number of participants (Inline graphic), the GMAL method is not directly applicable. Ways to preprocess such data into an Inline graphic problem suitable for GMAL would be useful. In this article, we performed an EWAS analysis as a dimensionality-reduction preprocessing step. The choice of dimensionality-reduction method needs further research.

In our simulations, we did not consider other types of outcome distributions, such as mixed distributions. Future research could investigate variable selection performance for outcomes with mixed distributions.

Key Points

  • There has been a growing interest in variable selection for causal inference within the context of high-dimensional data. However, when the outcome exhibits a skewed distribution, ensuring the accuracy of variable selection might be challenging.

  • We proposed GMAL to select covariates that can achieve an accurate estimation of causal effects when the outcome follows skewed distributions.

  • The GMAL performs comparably to the existing methods in variable selection when the outcome follows a symmetric distribution and clearly outperforms them when the outcome follows a skewed distribution.

ETHICS APPROVAL AND CONSENT TO PARTICIPATE

Since the simulated datasets did not involve any human data, ethics approval was not applicable.

CONSENT FOR PUBLICATION

Not applicable.

AUTHORS’ CONTRIBUTIONS

T.W., G.Q. and Y.Y. conceived the study. Y.L., Q.G. and K.W. performed the analysis and prepared the manuscript, including figures and tables. All authors have provided critical comments on the draft and read and approved the final manuscript. Y.L., Q.G. and K.W. contributed equally to this work.

Supplementary Material

Supplementary_materials_BIB_First_Look_bbae059
Appendix_BIB_First_Look_bbae059

ACKNOWLEDGEMENTS

Not applicable.

Author Biographies

Yahang Liu is a PhD student at Fudan University. Her research focuses on variable selection for causal inference in high-dimensional settings and its application.

Qian Gao is an associate professor at Shanxi Medical University. Her research focuses on statistical methods for variable selection in high-dimensional settings and its application in omics data.

Kecheng Wei is a PhD student at Fudan University. His research focuses on statistical methods for causal inference.

Chen Huang is a PhD student at Fudan University. Her research focuses on causal inference research with multi-site survival data and distributed analysis.

Ce Wang is a PhD at Fudan University. His research focuses on statistical methods for causal inference in survival data.

Yongfu Yu is a Professor at Fudan University. His research focuses on causal inference and life course epidemiology (cardiometabolic disease). He also has interests in research design and statistical methods related to birth cohorts.

Guoyou Qin is a Professor at Fudan University. His research focuses on statistical methods for causal inference and complex data. He also concentrates on the application of statistical methods in medicine and public health, with a primary focus on the fields of oncology and chronic diseases.

Tong Wang is a Professor at Shanxi Medical University. His research focuses on developing statistical methods for complex data and causal inference. He also has research interests in determining risk and etiological factors of non-communicable diseases.

Contributor Information

Yahang Liu, Department of Biostatistics, School of Public Health, Fudan University, Shanghai, China.

Qian Gao, Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Key Laboratory of Coal Environmental Pathogenicity and Prevention (Shanxi Medical University), Ministry of Education, China.

Kecheng Wei, Department of Biostatistics, School of Public Health, Fudan University, Shanghai, China.

Chen Huang, Department of Biostatistics, School of Public Health, Fudan University, Shanghai, China.

Ce Wang, Department of Biostatistics, School of Public Health, Fudan University, Shanghai, China.

Yongfu Yu, Department of Biostatistics, School of Public Health, Fudan University, Shanghai, China; Shanghai Institute of Infectious Disease and Biosecurity, Shanghai, China; Key Laboratory of Public Health Safety of Ministry of Education, Key Laboratory for Health Technology Assessment, National Commission of Health, Fudan University, Shanghai, China.

Guoyou Qin, Department of Biostatistics, School of Public Health, Fudan University, Shanghai, China; Shanghai Institute of Infectious Disease and Biosecurity, Shanghai, China; Key Laboratory of Public Health Safety of Ministry of Education, Key Laboratory for Health Technology Assessment, National Commission of Health, Fudan University, Shanghai, China.

Tong Wang, Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Key Laboratory of Coal Environmental Pathogenicity and Prevention (Shanxi Medical University), Ministry of Education, China.

FUNDING

National Natural Science Foundation of China (Nos. 82173612 to G.Q., 82273730 to Y.Y., 82073674 to T.W. and 82204163 to Q.G.); Shanghai Rising-Star Program (21QA1401300 to Y.Y.); Shanghai Municipal Natural Science Foundation (22ZR1414900 to Y.Y.); Shanghai Municipal Science and Technology Major Project (ZD2021CY001 to G.Q.).

DATA AVAILABILITY STATEMENT

Publicly available datasets were analyzed in this study. These data can be found at: adni.loni.usc.edu.

References

  • 1. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983;70(1):41–55.
  • 2. Ertefaie A, Asgharian M, Stephens DA. Variable selection in causal inference using a simultaneous penalization method. J Causal Inference 2018;6(1).
  • 3. Koch B, Vock DM, Wolfson J. Covariate selection with group lasso and doubly robust estimation of causal effects. Biometrics 2018;74(1):8–17.
  • 4. Wilson A, Reich BJ. Confounder selection via penalized credible regions. Biometrics 2014;70(4):852–61.
  • 5. Brookhart MA, Schneeweiss S, Rothman KJ, et al. Variable selection for propensity score models. Am J Epidemiol 2006;163(12):1149–56.
  • 6. Shortreed SM, Ertefaie A. Outcome-adaptive lasso: variable selection for causal inference. Biometrics 2017;73(4):1111–22.
  • 7. Antonelli J, Parmigiani G, Dominici F. High-dimensional confounding adjustment using continuous spike and slab priors. Bayesian Anal 2019;14(3):805–28.
  • 8. Ye Z, Zhu Y, Coffman DL. Variable selection for causal mediation analysis using LASSO-based methods. Stat Methods Med Res 2021;30(6):1413–27.
  • 9. Ghosh S, Tan Z. Doubly robust semiparametric inference using regularized calibrated estimation with high-dimensional data. Bernoulli 2022;28(3):1675–703.
  • 10. Ning Y, Sida P, Imai K. Robust estimation of causal effects via a high-dimensional covariate balancing propensity score. Biometrika 2020;107(3):533–54.
  • 11. Sun B, Tan Z. High-dimensional model-assisted inference for local average treatment effects with instrumental variables. J Bus Econ Stat 2022;40(4):1732–44.
  • 12. Li Y, Li L. Propensity score analysis with local balance. Stat Med 2023;42:2637–60.
  • 13. Mak K-K, Kim D-H, Leigh JP. Sociodemographic differences in the association between obesity and stress: a propensity score-matched analysis from the Korean National Health and Nutrition Examination Survey (KNHANES). Nutr Cancer 2015;67(5):804–10.
  • 14. VanderWeele TJ, Hawkley LC, Thisted RA, Cacioppo JT. A marginal structural model analysis for loneliness: implications for intervention trials and clinical practice. J Consult Clin Psychol 2011;79(2):225–35.
  • 15. Gao Q, Zhang Y, Liang J, et al. High-dimensional generalized propensity score with application to omics data. Brief Bioinform 2021;22(6). 10.1093/bib/bbab331.
  • 16. Fong C, Hazlett C, Imai K. Covariate balancing propensity score for a continuous treatment: application to the efficacy of political advertisements. Ann Appl Stat 2018;12(1):156–77.
  • 17. Zhang Z, Chen Z, Troendle JF, Zhang J. Causal inference on quantiles with an obstetric application. Biometrics 2012;68(3):697–706.
  • 18. Zhang J, Troendle J, Reddy UM, et al. Contemporary cesarean delivery practice in the United States. Am J Obstet Gynecol 2010;203(4):326.e1–326.e10.
  • 19. Chen L. Testing the mean of skewed distributions. J Am Stat Assoc 1995;90(430):767–72.
  • 20. Yuan Y, MacKinnon DP. Robust mediation analysis based on median regression. Psychol Methods 2014;19(1):1–20.
  • 21. Hirano K, Imbens GW. The propensity score with continuous treatments. Applied Bayesian modeling and causal inference from incomplete-data perspectives. 2004;226164:73–84.
  • 22. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 1974;66(5):688–701.
  • 23. Tang D, Kong D, Pan W, Wang L. Ultra-high dimensional variable selection for doubly robust causal inference. Biometrics 2022;79:903–14.
  • 24. Robins JM. Association, causation, and marginal structural models. Synthese 1999;121(1/2):151–79.
  • 25. Ju C, Benkeser D, Laan MJ. Robust inference on the average treatment effect using the outcome highly adaptive lasso. Biometrics 2020;76(1):109–18.
  • 26. Sun S, Moodie EE, Nešlehová JG. Causal inference for quantile treatment effects. Environmetrics 2021;32(4):e2668.
  • 27. Kametani F, Hasegawa M. Reconsideration of amyloid hypothesis and tau hypothesis in Alzheimer's disease. Front Neurosci 2018;12:25.
  • 28. Iqbal K, Liu F, Gong CX, Grundke-Iqbal I. Tau in Alzheimer disease and related tauopathies. Curr Alzheimer Res 2010;7(8):656–64.
  • 29. Mummery CJ, Börjesson-Hanson A, Blackburn DJ, et al. Tau-targeting antisense oligonucleotide MAPT(Rx) in mild Alzheimer's disease: a phase 1b, randomized, placebo-controlled trial. Nat Med 2023;29(6):1437–47.
  • 30. Viña J, Lloret A. Why women have more Alzheimer's disease than men: gender and mitochondrial toxicity of amyloid-beta peptide. J Alzheimers Dis 2010;20(Suppl 2):S527–33.
  • 31. Zhang W, Young JI, Gomez L, et al. Distinct CSF biomarker-associated DNA methylation in Alzheimer's disease and cognitively normal subjects. Res Sq 2023;15(1):78.
  • 32. Higgins-Chen AT, Boks MP, Vinkers CH, et al. Schizophrenia and epigenetic aging biomarkers: increased mortality, reduced cancer risk, and unique clozapine effects. Biol Psychiatry 2020;88(3):224–35.
  • 33. Shireby GL, Davies JP, Francis PT, et al. Recalibrating the epigenetic clock: implications for assessing biological age in the human cortex. Brain 2020;143(12):3763–75.
  • 34. Li QS, Sun Y, Wang T. Epigenome-wide association study of Alzheimer's disease replicates 22 differentially methylated positions and 30 differentially methylated regions. Clin Epigenetics 2020;12(1):149.
  • 35. Smith RG, Pishva E, Shireby G, et al. A meta-analysis of epigenome-wide association studies in Alzheimer's disease highlights novel differentially methylated loci across cortex. Nat Commun 2021;12(1):3517.

Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
