BioData Mining
2025 Nov 5;18:77. doi: 10.1186/s13040-025-00492-3

CAUSALRLSTACK: adaptive balancing of deep representation and causal effect estimation with application to HIV-related health data

Dat Thanh Pham 1, Khai Quang Tran 2, Viet Anh Nguyen 3
PMCID: PMC12587697  PMID: 41194280

Abstract

Background and Objective

Estimating individualized causal effects plays a vital role in data-driven decision-making, especially in high-risk domains such as public health. However, current causal inference models often lack flexibility and generalizability due to the tight coupling between representation learning and effect estimation. This study aims to develop a modular and adaptive framework to enhance the analysis of individualized causal effects in complex health data.

Methods

We propose CAUSALRLSTACK, a modular framework designed to separate representation learning from causal effect estimation. In practice, the model uses a memory-augmented Transformer (TITAN) to capture complex, individualized representations, paired with a doubly robust estimator (DRLearner) to improve treatment effect estimation. A reinforcement learning agent adjusts how much each component contributes by assigning instance-specific weights. This adaptive weighting improves the model's ability to generalize across different populations. Input features are derived from causal graphs, chosen automatically between an expert-defined graph and one discovered from data. To evaluate performance, we applied the framework to two publicly available HIV datasets that reflect community-level testing behavior and post-intervention clinical outcomes.

Results

CAUSALRLSTACK outperforms six state-of-the-art causal inference models across both datasets, achieving the highest accuracy (0.861 and 0.855), F1-Score (0.845 and 0.839), and AUC-ROC (0.897 and 0.892). It also achieves the lowest predictive uncertainty (0.093 and 0.092), indicating robust performance in estimating treatment effects.

Conclusions

The proposed framework offers a flexible and effective solution for individualized causal inference. Its modular architecture and reinforcement learning-based weighting strategy enable adaptive, data-driven estimation across diverse populations. Strong experimental results demonstrate the potential of the framework to advance individualized causal inference in health data and provide a practical basis for designing personalized intervention strategies in HIV and broader public health domains.

Keywords: HIV causal estimation, Reinforcement learning, Temporal transformer, Doubly robust estimation, Ensemble method

Introduction

Causal inference plays a central role in evidence-based decision making in applied domains such as healthcare, economics, and public policy [1, 2]. In public health, particularly in the management of the HIV epidemic, causal inference plays a critical role in guiding both preventive and treatment strategies [3, 4]. It enables researchers and policy makers to understand how specific interventions impact individual outcomes [5–7]. Unlike traditional correlation analysis, which only describes statistical associations, causal inference seeks to uncover the underlying mechanisms that drive change [8, 9]. This distinction is critical in settings where actions must be chosen not solely on the basis of predicted outcomes, but on an understanding of the consequences of those actions [10].

Conventional statistical methods for estimating causal effects, such as propensity score adjustment or inverse probability weighting, offer strong theoretical foundations  [8]. Various causal models are developed to address confounding factors and estimate the effects of treatment in observational health data. The inverse probability weighting with the generalized g-formula model estimates the causal effect of randomized PrEP interventions on HIV incidence using electronic health records [11]. The doubly robust causal survival model incorporates randomly assigned treatment groups to better account for confounders in the initiation of PrEP [12]. Furthermore, the causalCmprsk model [13] extends these methodologies to competing risk settings by employing inverse probability weighting within non-parametric and Cox-based frameworks to estimate average treatment effects in time-to-event data. Traditional statistical methods for causal inference have limitations, particularly in their ability to model non-linear relationships and complex interactions. These techniques often struggle with high-dimensional data.

In response, recent work has introduced neural-based causal models that integrate deep representation learning with estimation processes [14, 15]. The study in [16] employs a hybrid model that combines propensity score matching, logistic regression, and neural networks to estimate the causal impact of clinical information and prior imaging on the content of radiology reports. A deep learning model using recurrent neural networks (RNNs) with disentangled representation learning is proposed in [17] to estimate treatment effects over time from observational data with time-varying confounders. Causal Forest [18] utilizes decision tree ensembles to estimate heterogeneous treatment effects. Double Machine Learning (DML) [19] merges machine learning models with orthogonal scores to reduce estimation bias. The Orthogonal Random Forest (ORF) [20] extends DML by integrating generalized random forests for stable inference. X-Learner [21, 22] is a meta-learning approach that is particularly effective when there is an imbalance between the treatment and control groups. The Causal Effect Variational Autoencoder (CEVAE) [23] uses a variational autoencoder to model latent confounders and estimate treatment effects. Recently, the Causal Attention Transformer (CAT) [24] incorporated a causal understanding module into the Transformer architecture, allowing the model to learn attention weights that align with underlying causal relationships.

Despite these advances, existing models still face several limitations. Many retain fixed architectures that tightly couple representation learning with causal-effects estimation, applying the same computation pipeline to all instances. This one-size-fits-all approach limits adaptability and generalizability, especially in real-world data environments characterized by heterogeneity, distributional drift, or missing data  [7, 25]. In addition, few models explicitly support individualized decision-making through adaptive weighting or context-sensitive estimation.

Complementing these neural-based approaches, the causal inference literature has also advanced along theoretical lines, developing strategies to strengthen identifiability and estimation in observational settings, ranging from structural assumptions to representation learning frameworks. Cheng et al. [26] conducted a wide-ranging survey of data-driven methods for estimating causal effects in the familiar single intervention-outcome setting. Their work draws attention to key identifiability assumptions and strategies for uncovering causal structure. The study [27] addresses the fundamental challenge of hidden variables by proposing conditions that ensure identifiability and unbiased estimation. Xu et al. [28] introduce a method that combines conditional front-door adjustment with an identifiable variational autoencoder to tackle the challenge of hidden confounding in observational data. This approach strengthens identifiability and, at the same time, improves the reliability of causal effect estimates in complex settings. Louizos et al. [23] propose a disentangled representation learning framework that leverages instrumental variables to address unobserved confounding. By separating the information carried by instruments from other latent factors, their method enhances identifiability and supports more reliable estimation of causal effects. The authors in [29] introduce a disentangled representation framework for causal mediation analysis, aiming to separate direct and indirect effects through structured representations. This approach extends causal inference beyond total effects, provides a more interpretable view of mediation pathways, and shows the potential of interpretable causal representations. However, neither this work nor the studies above considered optimizing representations for adaptability under distributional shift, which remains a possible direction for improvement.

To address these challenges, we introduce CAUSALRLSTACK, a new modular framework for causal inference that separates representation learning from treatment effect estimation. It constructs two candidate causal graphs, one based on expert knowledge and the other produced by the REX method [30], which leverages Shapley values to guide variable importance and employs a cycle-removal procedure to generate a valid causal DAG. This approach follows previous recommendations on DAG validation and evaluation in applied epidemiology [31–33]. A sensitivity analysis is then applied to select the most suitable graph as input for the modeling process.

CAUSALRLSTACK uses a memory-augmented Transformer, inspired by the TITAN architecture [34], as a representation component. With the addition of external memory, the model can store and recall contextual information across samples. This makes it more adaptable when data distributions change and more resilient to sparse or irregular patterns that are common in healthcare datasets. As a result, the framework produces more informative representations that can support causal estimation.

Individualized representations are fed into a doubly robust estimator, the DRLearner [35], which helps to ensure that the causal effect estimates remain consistent and less sensitive to bias. The DRLearner is particularly effective in this framework because it integrates smoothly with deep representations and maintains its validity even if either the outcome model or the propensity model is misspecified. This characteristic can be particularly useful in observational healthcare data, where confounders are common.

CAUSALRLSTACK incorporates a reinforcement learning (RL) agent [36] as an ensemble mechanism to combine the two components. The agent dynamically adjusts the weights for each instance. In practice, the RL-based ensemble shifts the emphasis between deep representations and statistical estimation. In some cases, it leans more on representation, while in others it gives greater weight to the estimator. This flexibility allows estimates to reflect individual differences while still accounting for bias, which may improve their relevance for causal analysis. Finally, the overall architecture incorporates predictive uncertainty estimation and distribution shift detection [7, 25], strengthening its practicality for real-world causal inference tasks.

In summary, this work makes three key contributions: (1) we propose a modular framework for causal estimation that separates representation learning from causal effect estimation; (2) we design a memory-augmented Transformer representation module, inspired by TITAN, that improves generalization under distributional shifts and handles sparse patterns in healthcare data; and (3) we introduce an RL-based ensemble mechanism that adaptively balances deep representations with statistical estimation, allowing for personalized and causally valid effect estimates.

To evaluate the performance of CAUSALRLSTACK, we used two publicly available HIV datasets covering both community-level surveys and clinical trial records. In this way, we were able to assess the robustness of the framework in diverse healthcare settings, where the experimental results indicated that CAUSALRLSTACK achieved improved performance compared to baseline methods in several evaluation metrics.

Materials and methods

Datasets

We used two publicly available HIV-related datasets from Kaggle for our evaluation. The first is the EDHS-HIV/AIDS dataset, provided on Kaggle [37] and initially compiled from the Ethiopian Demographic and Health Survey (EDHS) conducted by the Ethiopian Central Statistical Agency (CSA) in collaboration with ICF International. This dataset has been used in peer-reviewed machine learning studies [38, 39]. It includes 78,877 anonymized individual records and captures population-level HIV testing behavior and risk factors.

The second dataset is the AIDS Virus Infection Prediction dataset, provided on Kaggle [40], which is reported to be based on the AIDS Clinical Trials Group (ACTG) Study 175 [41]. The Kaggle version is a processed and expanded dataset of approximately 50,000 anonymized records, released under a CC0 license. It contains information related to clinical trials on HIV treatment. A more detailed description of the two datasets follows.

1. Dataset 1 (EDHS-HIV/AIDS dataset). Each record in Dataset 1 contains structured demographic, behavioral, and HIV-related knowledge attributes, making the dataset appropriate for causal inference and survival estimation analyses in the context of HIV testing and prevention. The intervention variable is S_Test (0 = No, 1 = Yes), indicating whether a sample test was taken, and the outcome variable is T_in_LAB (0 = No, 1 = Yes), representing whether confirmatory laboratory tests occurred. The features are grouped as follows.

  • Demographic and socio-economic variables: gender, age, region, residence type, religion, education level, marital status, employment status, and wealth index.

  • Sexual behavior indicators: number of sexual partners, use of condoms, and behavior changes aimed at reducing the risk of HIV.

  • HIV-related knowledge: beliefs about transmission, awareness of STIs, knowledge of HIV/AIDS, and access to testing services.

2. Dataset 2 (AIDS Virus Infection Prediction dataset). The second dataset focuses on clinical trials of HIV treatment. The outcome variable is infected, indicating whether a patient is infected with AIDS (1 = Yes, 0 = No). The intervention variable is trt, representing the treatment group, with four specific values: 0 = ZDV monotherapy (ZDV only), 1 = ZDV + ddI, 2 = ZDV + Zal, and 3 = ddI monotherapy (ddI only). The remaining input variables are classified into four main groups as follows.

  • Demographic and behavioral variables: age, gender, race, drug use, and homosexual status.

  • Clinical variables: body weight, hemoglobin, Karnofsky score, presence of symptoms, and history of opportunistic infections.

  • Immunological markers: CD4 count, CD8 count, and CD4/CD8 ratio.

  • Treatment-related variables: treatment type and regimen indicators.

Causal graphs constructed from these datasets, both expert-defined and data-driven, are described below.

Data preprocessing

Our two HIV datasets contained a mix of variable types, missing values, and imbalanced treatment groups, which made a clear preprocessing strategy essential. We converted categorical variables into numeric codes by label encoding and applied robust scaling to continuous measures so that extreme values did not dominate while the results remained clinically meaningful. Approximately 8.3% of the data were missing; we addressed this by imputing continuous measures with their median values and categorical attributes with their dominant category, reflecting the distinct missingness patterns in laboratory versus survey data.
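The encoding, scaling, and imputation steps described above can be sketched as follows; column handling is illustrative and not the study's exact pipeline.

```python
# Preprocessing sketch (hypothetical column layout): label-encode
# categoricals, impute missing values with the median (continuous) or
# dominant category (categorical), and robust-scale continuous measures
# so that extreme values do not dominate.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, RobustScaler

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    cat_cols = df.select_dtypes(include="object").columns
    num_cols = df.select_dtypes(exclude="object").columns

    # Impute: median for continuous features, mode for categoricals.
    for c in num_cols:
        df[c] = df[c].fillna(df[c].median())
    for c in cat_cols:
        df[c] = df[c].fillna(df[c].mode().iloc[0])
        df[c] = LabelEncoder().fit_transform(df[c])

    # Robust scaling (median / IQR) limits the influence of outliers.
    df[num_cols] = RobustScaler().fit_transform(df[num_cols])
    return df
```

Robust scaling is chosen over standardization here because clinical measures often contain extreme but valid values.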

We also developed a domain-informed approach to handle missing values and a population-based relabeling method to correct treatment imbalance. Our strategy addresses imbalance while preserving genuine covariate patterns, rather than relying on interpolation methods such as SMOTE [42], which may generate patient profiles that are not clinically realistic. In this way, the preprocessing pipeline remains computationally efficient and causally reliable, providing a strong foundation for the analyses that follow.

Problem setup and assumptions

Problem setup

In this study, we estimate individual causal effects using observational data from real-world health-related scenarios. Our goal is to evaluate the potential impact of a specific intervention or behavioral exposure, denoted as Intervention, on a binary outcome, denoted as Outcome.

We define the input matrix $X \in \mathbb{R}^{N \times d}$, where $N$ is the number of individuals and $d$ is the number of observed features. These features may include sociodemographic, behavioral, clinical, or knowledge-based variables. Among them, we focus on two main variables:

  1. Intervention ($T$): A categorical variable representing the exposure level or intervention status for each individual. The possible values of $T$ form the set $\mathcal{T}$, which may include two or more discrete categories (e.g., $\mathcal{T} = \{0, 1\}$) depending on the dataset.

  2. Outcome ($Y$): A binary variable indicating whether the individual experienced the event or outcome of interest (1 if yes, 0 if no).

Our model aims to estimate the potential outcome $Y_i(t)$ for each individual $i$, assuming they were assigned a specific intervention level $t \in \mathcal{T}$.

Specifically, the model estimates the following conditional expectation.

$$\mu_t(x) = \mathbb{E}\big[\, Y(t) \mid X = x \,\big] \qquad (1)$$

where $Y(t)$ denotes the potential outcome at intervention level $t$, and $x$ is the vector of observed characteristics for individual $i$. We use $\mu_t(x)$ to represent the true conditional expectation and $\hat{\mu}_t(x)$ to denote its estimator obtained by our model.

To quantify the causal effect between any two intervention levels $t_1, t_2 \in \mathcal{T}$, the Individual Causal Effect (ICE) is defined as follows.

$$\mathrm{ICE}(t_1, t_2; x_i) = \mu_{t_1}(x_i) - \mu_{t_2}(x_i) \qquad (2)$$
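The ICE in Eq. (2) can be sketched in a few lines, assuming an already-estimated conditional-outcome function; `mu_hat` below is a hypothetical stand-in, not the paper's model.

```python
# Sketch of Eq. (2): given an estimate mu_hat(t, x) of E[Y(t) | X = x],
# the individual causal effect between two intervention levels t1 and t2
# is the difference of the two potential-outcome estimates.
import numpy as np

def individual_causal_effect(mu_hat, x, t1, t2):
    """ICE(t1, t2; x) = mu_hat(t1, x) - mu_hat(t2, x)."""
    return mu_hat(t1, x) - mu_hat(t2, x)

# Toy outcome model: treatment shifts the outcome by 0.3 * x[0].
mu_toy = lambda t, x: 0.5 + 0.3 * t * x[0]
x_i = np.array([2.0, 1.0])
print(individual_causal_effect(mu_toy, x_i, 1, 0))  # ≈ 0.6
```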

Identification Assumptions

To identify causal effects from observational data, we rely on the standard assumptions of the potential outcome framework.

Consistency. If an individual $i$ actually receives treatment level $t$, then the observed outcome is exactly the potential outcome under that treatment. In other words, $Y_i = Y_i(t)$ when $T_i = t$.

Positivity (Overlap). For every covariate profile $x$, each treatment level has a non-zero probability of being observed. Formally, $P(T = t \mid X = x) > 0$ for all $t \in \mathcal{T}$.

Ignorability (Unconfoundedness). Conditional on the observed covariates $X$, treatment assignment is independent of the potential outcomes: $\{Y(t)\}_{t \in \mathcal{T}} \perp T \mid X$ for all $t \in \mathcal{T}$.

SUTVA (Stable Unit Treatment Value Assumption). There is no interference between individuals (the outcome of one person does not depend on the treatment of another person) and each treatment level is well defined.

For the first dataset (EDHS-HIV/AIDS), the treatment variable is S_Test (0 = No, 1 = Yes), and the outcome variable is T_in_LAB (0 = No, 1 = Yes). Under consistency, if a person actually goes for community screening, the observed laboratory confirmation is precisely the potential outcome under that condition. Ignorability assumes that once we condition on demographic, behavioral, and knowledge variables, the decision to be screened is independent of the potential laboratory confirmation. Positivity requires that every subgroup defined by these covariates contains both screened and unscreened individuals.

For the second dataset (AIDS Virus Infection Prediction), the treatment variable is trt (treatment group: 0 = ZDV only, 1 = ZDV+ddI, 2 = ZDV+Zal, 3 = ddI only), and the outcome is infected (0 = No, 1 = Yes). Consistency means that if a patient actually received ZDV+ddI, the observed infection status matches the potential outcome under that regimen. Ignorability assumes that, after conditioning on demographic, behavioral, clinical, and immunological variables, treatment assignment is independent of the potential infection outcomes. Positivity requires that patients appear in all four treatment groups within every stratum of patient characteristics.

In practice, we checked both datasets and did not find major signs of assumption violations, so we consider them reasonable in this setting.
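One simple way to probe the positivity assumption in practice is to fit a propensity model and check that the estimated treatment probabilities stay away from 0 and 1. The sketch below is illustrative; the cutoff and model choice are assumptions, not the paper's procedure.

```python
# Hedged overlap (positivity) check: fit a propensity model P(T | X) and
# verify estimated probabilities are bounded away from 0 and 1. The 0.01
# cutoff is an illustrative choice.
import numpy as np
from sklearn.linear_model import LogisticRegression

def check_overlap(X, T, eps=0.01):
    ps = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)
    # Flag a violation if some treatment level has near-zero probability
    # for some covariate profile.
    return bool(np.all(ps > eps)), float(ps.min())

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
T = rng.integers(0, 2, size=500)  # randomized assignment => good overlap
ok, min_ps = check_overlap(X, T)
```

For a randomized assignment like the toy data above, estimated propensities concentrate near 0.5, so the check passes.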

Proposed methods

Overview of proposed framework

To address the challenges of estimating individual causal effects using observational data, we propose a comprehensive framework called CAUSALRLSTACK. This framework combines causal graph discovery with a hybrid causal estimation architecture. It has three main stages: causal graph construction, dual path modeling, and reinforcement-based adaptive integration.

In the first stage, we create two candidate causal graphs representing the potential causal relationships between input variables. One graph is constructed from domain expert knowledge, while the other is generated directly from the data using a causal discovery pipeline such as REX. We then perform a sensitivity analysis to compare the candidates and select the optimal graph structure. This selection ensures that the model is informed by a reliable and contextually appropriate causal representation.

Based on the chosen causal graph, we extract a subset of causally relevant variables for the intervention and the outcome. This subset is represented as a feature matrix $X' \in \mathbb{R}^{N \times d'}$, where each row corresponds to an individual in the dataset and each column represents an input feature retained by causal reasoning. This process eliminates irrelevant or post-intervention variables, which helps reduce bias and enhances the accuracy of causal effect estimation.

The resulting matrix $X'$ is then processed by two main components as follows.

  1. TITAN-based component: a memory-augmented Transformer encoder that learns deep, context-sensitive representations from the input features. It models non-linear interactions among these features and creates semantically rich embeddings for each instance. Additionally, its global memory and surprise-state mechanisms allow effective adaptation to unusual or rare patterns in the data.

  2. DRLearner: a doubly robust causal estimator that combines an outcome regression model with an intervention assignment model. This approach allows accurate estimation of both average causal effects (ACE) and individual causal effects (ICE), while remaining resilient to potential model misspecifications arising from noise or data limitations.
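The combination of an outcome model and an assignment model can be sketched with the standard DR-Learner pseudo-outcome for a binary treatment; the paper's estimator may differ in model choices and cross-fitting, so treat this as illustrative.

```python
# Minimal sketch of the doubly robust (DR-Learner) pseudo-outcome for a
# binary treatment. Averaging the pseudo-outcomes estimates the ACE; a
# final regression of them on X yields an ICE estimator. Cross-fitting
# is omitted for brevity.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

def dr_pseudo_outcomes(X, T, Y):
    # Propensity model P(T = 1 | X), clipped to avoid extreme weights.
    e = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]
    e = np.clip(e, 0.01, 0.99)
    # Outcome models fit separately per treatment arm.
    mu1 = GradientBoostingRegressor().fit(X[T == 1], Y[T == 1]).predict(X)
    mu0 = GradientBoostingRegressor().fit(X[T == 0], Y[T == 0]).predict(X)
    mu_t = np.where(T == 1, mu1, mu0)
    # DR pseudo-outcome: consistent if either the outcome model or the
    # propensity model is correctly specified.
    return (T - e) / (e * (1 - e)) * (Y - mu_t) + mu1 - mu0
```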

To effectively combine the outputs of both components, we utilize a reinforcement learning (RL) agent that learns a dynamic, instance-specific weighting policy. This policy optimally integrates the outputs of TITAN and DRLearner. This adaptive ensemble strategy enables the model to customize its inference for individual cases, striking a balance between rich representation and robust causal estimation.
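The paper's RL agent is more elaborate than what fits here, but the idea of learning blending weights from an error-based reward can be sketched as a minimal bandit; the action set, epsilon, and reward are illustrative assumptions.

```python
# Minimal bandit sketch of adaptive weighting (illustrative only): the
# agent picks a blending weight w from a discrete action set, combines
# the two component estimates, and updates per-action value estimates
# from the observed squared-error reward.
import numpy as np

class WeightingAgent:
    def __init__(self, actions=(0.0, 0.25, 0.5, 0.75, 1.0), eps=0.1):
        self.actions = np.array(actions)  # weight on the TITAN estimate
        self.values = np.zeros(len(actions))
        self.counts = np.zeros(len(actions))
        self.eps = eps

    def act(self, rng):
        # Epsilon-greedy exploration over candidate weights.
        if rng.random() < self.eps:
            return int(rng.integers(len(self.actions)))
        return int(np.argmax(self.values))

    def combine(self, a, titan_pred, dr_pred):
        w = self.actions[a]
        return w * titan_pred + (1 - w) * dr_pred

    def update(self, a, reward):
        # Incremental mean update of the action-value estimate.
        self.counts[a] += 1
        self.values[a] += (reward - self.values[a]) / self.counts[a]
```

In use, the reward could be the negative squared error of the blended prediction against a held-out target, so the agent learns which component to trust.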

By decoupling representation learning from causal effect estimation and linking them through a learnable coordination mechanism, CAUSALRLSTACK offers a flexible, accurate, and interpretable solution for individualized causal inference. The overall model architecture is illustrated in Fig. 1.

Fig. 1 The overall architecture of our hybrid causal prediction model. Starting from the HIV/AIDS dataset, we first build a causal graph (DAG) to select important features for causal inference. These features are then used by two models, TITAN and DRLearner, which estimate causal effects using different learning strategies. An RL-based ensemble module combines their outputs using a learned weight for each sample. This module also incorporates a concept drift signal to adapt to changes in the data and produces an uncertainty estimate alongside the final causal prediction

Causal graph construction

1. Theoretical Formulation. Constructing a causal graph involves two main approaches: one that utilizes expert knowledge and one that relies on data-driven discovery. The final structure is selected through sensitivity analysis to ensure it is suitable for downstream modeling.

Expert-Defined Causal Graph Construction. We define an expert-driven Directed Acyclic Graph (DAG) $G_E = (V, E)$, where the nodes $V$ represent observed variables and the edges $E$ represent directed causal relations proposed by domain experts, based on prior knowledge and theoretical assumptions from domain-specific understanding.

Causal Graph via Data-Driven Discovery. We develop a simplified REX framework [30] to construct a causal graph from HIV surveillance data. The pipeline includes four main steps: identifying potential parent variables using SHAP values [43] and clustering, determining edge directions using the Additive Noise Model and the Hilbert-Schmidt Independence Criterion (ANM-HSIC) method [44], removing cycles using the SHAP discrepancy, and constructing the final DAG for downstream causal inference.

Identifying potential parent variables through SHAP and clustering. For each target variable $X_i$, a linear regression model $f$ is trained using the remaining features $X_{-i}$. Shapley values $\phi_j$ are then calculated to estimate the contribution of each feature $X_j$ to predicting $X_i$, defined as follows.

$$\phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \big[ f(S \cup \{j\}) - f(S) \big] \qquad (3)$$

where $F$ is the set of features excluding the target $X_i$, $S$ is a subset of $F$, and $f(S)$ is the output of the model trained on the subset $S$.

To improve robustness, we applied bootstrapping and used DBSCAN to cluster features in SHAP space. The cluster with the highest mean SHAP value is selected as a candidate parent. A frequency matrix $M$ is constructed and thresholded at $\tau$ to produce an undirected adjacency matrix as follows.

$$A_{uv} = \begin{cases} 1, & M_{uv} \ge \tau \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$
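The parent-selection step can be sketched as follows; for a linear model, the SHAP value of feature $j$ is exactly $\beta_j (x_j - \bar{x}_j)$, so no SHAP library is needed. The DBSCAN parameters are illustrative, and the bootstrap frequency counting of Eq. (4) is omitted for brevity.

```python
# Sketch of parent identification: compute exact SHAP values for a
# linear model, cluster mean |SHAP| magnitudes with DBSCAN, and keep
# the cluster with the highest mean |SHAP| as candidate parents.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.linear_model import LinearRegression

def candidate_parents(X, target_idx, eps=0.05):
    d = X.shape[1]
    others = [j for j in range(d) if j != target_idx]
    Z, y = X[:, others], X[:, target_idx]
    beta = LinearRegression().fit(Z, y).coef_
    # Exact SHAP for a linear model: phi_j = beta_j * (z_j - mean(z_j)).
    phi = beta * (Z - Z.mean(axis=0))
    strength = np.abs(phi).mean(axis=0)
    labels = DBSCAN(eps=eps, min_samples=1).fit_predict(strength.reshape(-1, 1))
    # Keep the cluster with the highest mean |SHAP|.
    best = max(set(labels), key=lambda c: strength[labels == c].mean())
    return [others[j] for j in range(len(others)) if labels[j] == best]
```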

Determining edge directions using simplified ANM-HSIC. We perform two linear regressions for each undirected edge between a pair of variables $(X_i, X_j)$ as follows.

$$X_j = f(X_i) + \varepsilon_{i \to j} \qquad (5)$$
$$X_i = g(X_j) + \varepsilon_{j \to i} \qquad (6)$$

We compare the independence of the residuals with the predictor using HSIC as follows.

$$i \to j \ \text{ is selected if } \ \mathrm{HSIC}(X_i, \varepsilon_{i \to j}) < \mathrm{HSIC}(X_j, \varepsilon_{j \to i}) \qquad (7)$$

The direction with the most independent residuals is selected as the causal direction.
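The direction test of Eqs. (5)-(7) can be sketched as below, using a biased RBF-kernel HSIC estimator with median-heuristic bandwidths; this simplification is illustrative, not the paper's exact implementation.

```python
# ANM-HSIC direction sketch: regress each way with linear models, then
# keep the direction whose residuals are more independent of the
# predictor under an RBF-kernel HSIC estimator.
import numpy as np
from sklearn.linear_model import LinearRegression

def _rbf_gram(v):
    d2 = (v[:, None] - v[None, :]) ** 2
    sigma2 = np.median(d2[d2 > 0])  # median-heuristic bandwidth
    return np.exp(-d2 / sigma2)

def hsic(a, b):
    # Biased HSIC estimate: trace(K H L H) / (n - 1)^2, H centering.
    n = len(a)
    H = np.eye(n) - 1.0 / n
    K, L = _rbf_gram(a), _rbf_gram(b)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def anm_direction(x, y):
    rx = y - LinearRegression().fit(x[:, None], y).predict(x[:, None])
    ry = x - LinearRegression().fit(y[:, None], x).predict(y[:, None])
    # Smaller HSIC(predictor, residual) => more plausible direction.
    return "x->y" if hsic(x, rx) < hsic(y, ry) else "y->x"
```

With non-Gaussian (e.g., uniform) noise, the residual in the anti-causal direction stays dependent on the predictor even when it is uncorrelated, which is what HSIC detects.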

Cycle removal via simplified SHAP discrepancy. After edge orientation, the graph may contain cycles. We remove the weakest edge in each cycle using SHAP discrepancy to ensure the final structure is a DAG. Instead of using the normalized squared error formula as in the original REX, we adopt a simpler version as follows.

$$\mathrm{Disc}(i \to j) = 1 - R^2\big(\phi_{i \to j},\, X_j\big) \qquad (8)$$

where $\phi_{i \to j}$ denotes the SHAP values of feature $X_i$ when predicting $X_j$, and $R^2$ is the coefficient of determination between those SHAP values and the true values of $X_j$. The edge with the highest discrepancy is removed to ensure that the DAG contains no cycles while preserving the strongest causal relationships.
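The cycle-removal step can be sketched with networkx; the discrepancy values of Eq. (8) are supplied here as a precomputed dict with illustrative numbers.

```python
# Cycle removal sketch: while the directed graph contains a cycle, drop
# the cycle edge with the highest discrepancy score, yielding a DAG.
import networkx as nx

def remove_cycles(edges, discrepancy):
    g = nx.DiGraph(edges)
    while True:
        try:
            cycle = nx.find_cycle(g)  # list of (u, v) edges in a cycle
        except nx.NetworkXNoCycle:
            break
        worst = max(cycle, key=lambda e: discrepancy[e])
        g.remove_edge(*worst)
    return g

# Toy 3-cycle: the C->A edge has the weakest causal support (highest
# discrepancy) and is removed first.
g = remove_cycles(
    [("A", "B"), ("B", "C"), ("C", "A")],
    {("A", "B"): 0.1, ("B", "C"): 0.2, ("C", "A"): 0.9},
)
```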

DAG construction and simplification rationale. Unlike the original REX framework, which trains two separate models (a deep feedforward network and gradient boosting trees) and merges their DAGs, we simplify the process by using only linear regression to construct a single DAG. This simplification reduces computational complexity and training costs while enhancing the model's interpretability. The linear model takes the following form.

$$X_i = \beta_0 + \sum_{j \ne i} \beta_j X_j + \varepsilon \qquad (9)$$

The learned coefficients $\beta_j$ reflect the influence of each predictor $X_j$ on the target $X_i$ and are used to compute the SHAP values and construct the edges of the graph.

The resulting DAG serves as a causal basis for subsequent tasks such as identifying confounders, selecting features, and controlling for post-treatment bias. In our context, the DAG is particularly useful for estimating the causal effect of HIV screening-test behavior on laboratory-test behavior. It also enables more reliable counterfactual prediction and estimation of intervention effects.

Final DAG Selection for Model Based on Sensitivity Evaluation. In this study, we select the final DAG used for causal inference based on sensitivity analysis. Rather than relying on model loss or quantitative performance metrics, we evaluate the stability of each candidate DAG by observing changes in causal components, such as confounders and mediators, after perturbing the graph structure. Specifically, we remove the direct edge from the treatment variable to the outcome variable and analyze how the identified causal variables change. The DAG that maintains more consistent causal components is considered more robust and is selected for downstream modeling. We formalize the DAG selection process as follows:

$$G^{*} = \arg\max_{G \in \{G_{\text{expert}},\, G_{\text{REX}}\}} S(G)$$

where $G^{*}$ denotes the final selected DAG for causal inference, $G_{\text{expert}}$ is the expert-defined DAG constructed using domain knowledge, $G_{\text{REX}}$ is the data-driven DAG learned from observational data using the REX-based framework, and $S(\cdot)$ is a sensitivity-based measure that reflects the robustness of the causal structure under edge perturbation, calculated from the consistency of the identified causal variables (e.g., confounders, mediators). This approach prioritizes interpretability and structural robustness over predictive accuracy, aligning with the principles of trustworthy causal modeling.

The final DAG was selected through a multi-criteria evaluation that considered stability, parsimony, and clinical relevance. In practice, the expert-defined DAG served as an epidemiological reference during the sensitivity analysis, ensuring that the final structure remained consistent with established HIV knowledge.
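The sensitivity criterion can be sketched with networkx as below. The confounder proxy (common ancestors of treatment and outcome) and the Jaccard stability score are illustrative simplifications of the paper's procedure.

```python
# Sketch of sensitivity-based DAG selection: identify confounders (here
# the simple proxy "common ancestors of T and Y"), perturb the graph by
# deleting the direct T->Y edge, and score each candidate DAG by the
# Jaccard overlap of confounder sets before and after the perturbation.
import networkx as nx

def confounders(g, t, y):
    return nx.ancestors(g, t) & nx.ancestors(g, y)

def stability(g, t, y):
    before = confounders(g, t, y)
    perturbed = g.copy()
    if perturbed.has_edge(t, y):
        perturbed.remove_edge(t, y)
    after = confounders(perturbed, t, y)
    union = before | after
    return 1.0 if not union else len(before & after) / len(union)

def select_dag(g_expert, g_rex, t, y):
    # Keep whichever candidate retains more consistent causal components.
    return g_expert if stability(g_expert, t, y) >= stability(g_rex, t, y) else g_rex
```

For example, a graph with a genuine backdoor path C -> T, C -> Y keeps C as a confounder after the perturbation (stability 1.0), whereas a chain C -> T -> Y loses it (stability 0.0).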

2. Practical Construction. Figure 2 illustrates the expert-defined DAG for Dataset 1. Each arrow in the graph represents an assumed causal relationship between two variables. The DAG structure reveals the following key relationships.

Fig. 2 Expert-defined DAG from Dataset 1

  • Sex and Age influence Reg (region) and C_Wor (current work status), which subsequently affect HIV testing behavior.

  • Res_P (residential place) and Edu_At (age at first education) influence Reg, indicating that living context plays an important role.

  • M_STI (marital status) and W_Ind (wealth index) affect both awareness and sexual behavior, such as R_SeA (availability of condoms) or Had_Sex (sexual activity).

  • Health-related factors like H_STI (symptoms of sexually transmitted infections) and H_AIDS (awareness of AIDS) are directly linked to decisions regarding HIV testing (P_T_HIV and E_T_HIV).

  • Sexual behavior factors such as Had_Sex, Con_Use (condom use), and R_Have_1SP (having one sexual partner) strongly affect whether someone has ever been tested for HIV (E_T_HIV).

  • E_T_HIV and P_T_HIV further influence T_HIV_LAB (HIV laboratory testing) and S_Test (successful testing).

Figure 3 presents the data-driven DAG for Dataset 1 generated using the REX causal discovery framework. Unlike the expert-defined DAG, this structure is learned directly from the data without prior domain assumptions. The graph captures a dense network of potential causal relationships between variables.

Fig. 3 REX-inferred DAG from Dataset 1

Key variables such as E_T_HIV, P_T_HIV, and S_Test are located at central positions, receiving multiple incoming and outgoing edges. Notably, several strong causal paths (in dark blue) are identified, including links from behavioral and knowledge-related factors (e.g., N_S_Part, R_SeA, Con_Use) to outcomes like HIV testing behavior (T_HIV_LAB and S_Test).

When we compared the expert-defined and data-driven DAGs for Dataset 1, we found that both graphs showed key relationships that match HIV epidemiology. Demographic factors such as sex and age influence education and work status, which then affect access to tests. The place of residence and education background also shape awareness and health-seeking behavior. Health-related factors, including a history of STIs and knowledge of AIDS, are directly related to HIV testing. The data-driven DAG suggested some additional links, such as condom use and the number of sexual partners, that make sense from a behavioral perspective. We used the expert-defined DAG as a reference in our sensitivity analysis to test the robustness of the data-driven structure. This process helped remove implausible edges and showed that the final DAG is consistent with established HIV knowledge while also adding new insights from the data.

Figure 4 illustrates the expert-defined directed acyclic graph (DAG) for Dataset 2. The graph represents hypothesized causal relationships based on knowledge of the clinical and epidemiological domain.

Fig. 4.

Fig. 4

Expert-defined DAG from Dataset 2

The variable infected (HIV infection status) is the central outcome, directly influenced by multiple factors including homo (homosexuality), cd420 and cd820 (immune cell markers), symtom (symptoms), z30 (reason for visit), and offtrt (treatment interruption).

Foundational characteristics such as age, gender, race, karnof (functional status), oprior (prior treatment), preanti (prior antiretroviral use), and hemo (anemia) contribute to HIV risk through intermediate clinical or immunological pathways.

Additional variables such as wtkg (weight), drugs (substance use), and trt (treatment type) are linked to background factors and can exert indirect effects on the risk of HIV infection.

Some variables such as strat, str2, time, and treat appear in the graph but do not exhibit direct causal links, likely representing stratification or temporal control variables not modeled as primary causes.

This DAG highlights the interplay between clinical characteristics, treatment history, and behavioral risk factors in shaping HIV infection outcomes. It serves as the expert-informed causal structure for Dataset 2 and is used as a candidate graph in subsequent causal analysis.

Figure 5 illustrates the data-driven DAG inferred from Dataset 2 using the REX framework. Unlike the expert-defined DAG, this graph is constructed entirely from data, allowing the discovery of potential hidden relationships between variables.

Fig. 5.

Fig. 5

REX-inferred DAG from Dataset 2

The structure shows a high degree of connectivity among clinical, demographic, and treatment-related variables. Several variables play central roles with complex incoming and outgoing connections, reflecting the multidimensional causal dependencies present in the dataset.

Compared to the expert-defined graph, this DAG reveals a broader and more informative dependency network, supporting causal analysis and hypothesis testing based on real-world observational data.

When comparing the expert-defined and data-driven DAGs for Dataset 2, we observed that both graphs captured key relationships consistent with HIV epidemiology. For example, demographic factors such as age, sex, and race influence both risk behaviors and the likelihood of testing, while medical history, such as a prior STI, is directly related to infection risk. The data-driven DAG also suggested additional associations, such as the role of treatment history (e.g., preanti, cd80), which are plausible from a clinical perspective. We used the expert-defined DAG as a reference in our sensitivity analysis to test the robustness of the data-driven structure. This procedure allowed us to filter out implausible links and demonstrate that the final DAG remains consistent with established HIV knowledge while highlighting new insights supported by the data.

Causal Graph construction. We construct the final DAG, denoted as $\mathcal{G}_1$ for Dataset 1 and $\mathcal{G}_2$ for Dataset 2, following the methodology described in the Proposed Method section. For each dataset, we generate two candidate DAGs: one based on expert knowledge and the other discovered using the REX causal discovery framework. We then conduct a sensitivity analysis to determine which DAG is more suitable for downstream causal modeling. The selected graph is utilized in the final model.

  1. The confounders, mediators, and instrumental variables identified in $\mathcal{G}_1$ are as follows.

Confounders. A total of 11 variables: Sex, Age, Edu_lvl, Edu_At, M_Sta, C_Wor, W_Ind, H_STI, H_O_STI, H_AIDS, and Reg. These variables influence both the treatment and the outcome and must be controlled to reduce bias in causal effect estimation.

Mediators. A total of 4 variables: R_SeA, Had_Sex, Con_Use, and R_Have_1SP. These variables serve as causal bridges from the upstream covariates to the HIV testing behavior.

Instrumental Variables. Two variables: Rel and Res_P. These variables affect treatment but are not directly related to the outcome, making them useful for causal inference under potential unobserved confounding.

  2. The confounders, mediators, and instrumental variables identified in $\mathcal{G}_2$ are as follows.

Confounders. A total of 11 variables: age, race, gender, karnof, oprior, preanti, cd40, cd80, wtkg, homo, and drugs. These variables influence both treatment and outcome and are controlled to reduce confounding bias in the estimation of causal effects.

Mediators. A total of 4 variables: cd420, cd820, symtom, and z30. These variables mediate the effect of upstream covariates on treatment behavior and are modeled as part of the causal pathway.

Instrumental Variables. Two variables: strat and str2. These variables are assumed to influence the assignment of treatment, but do not directly affect the outcome, supporting the identification under potential unobserved confounding.

TITAN for deep representation learning

Inspired by [34], we apply the TITAN (Transformer with Interpretable Temporal Attention and Neighborhood Memory) architecture to estimate the probability of an outcome in hypothetical interventions related to a binary treatment variable. TITAN is used to learn deep representations given individual covariates and treatment status. By jointly encoding demographic, behavioral, and domain-specific awareness features along with treatment assignment, TITAN captures treatment-dependent variations in a contextualized representation space.

For each individual, TITAN generates separate representations under two hypothetical treatment scenarios: $T = 1$ (treated) and $T = 0$ (not treated). TITAN can be used both independently for prediction and as a feature encoder for DRLearner, which estimates the corresponding potential outcomes $Y(1)$ and $Y(0)$.

The feature matrix $X$, which serves as the input to TITAN, is constructed based on a DAG discovered from the data. This DAG helps identify direct causes of the target outcome and confounding variables to adjust for, and it excludes post-treatment variables to avoid bias in causal inference. This ensures that the input features reflect potential causal relationships, supporting more reliable predictions.

Each input vector $x$, representing the characteristics of an individual, is passed through a one-dimensional convolutional layer (1D convolution) to capture local interactions between related feature groups as follows:

\[ h = \mathrm{Conv1D}(x) \quad (10) \]

The model then employs multi-head self-attention [45] layers to learn long-range dependencies between features, allowing it to focus on the most influential factors related to the outcome. The attention mechanism is defined as:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V \quad (11) \]

where $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from linear projections of the input representations, and $d_k$ is the dimension of the key vectors, used for normalization to stabilize gradient updates during training. This formulation enables the model to weigh the relevance of each feature, allowing for deeper contextual representations.
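For concreteness, the scaled dot-product attention described above can be sketched in NumPy; the single-head, projection-free form below is a simplification for illustration, not the full multi-head implementation used in TITAN:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise query-key relevance
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of `w` is a probability distribution over input positions, which is what allows the model to emphasize the most influential features.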

The attention mechanism operates in a way that mirrors how domain experts prioritize specific risk factors when evaluating the likelihood of an outcome. For example, when evaluating an individual with several contributing factors, the model can focus more on the most critical variables while giving less weight to those that provide less information. This dynamic weighting enables the model to create more nuanced contextual representations tailored to each individual’s unique risk profile. The feature representation is then updated through the following layers.

\[ h' = \mathrm{LayerNorm}\big( h + \mathrm{MultiHead}(h) \big) \quad (12) \]
\[ h'' = \mathrm{LayerNorm}\big( h' + \mathrm{FFN}(h') \big) \quad (13) \]

Next, to improve generalization and support learning from less frequent patterns, TITAN incorporates an external memory $M$ to store latent representations of previously observed individuals. The memory update is controlled by a sigmoid gate as follows.

\[ g_t = \sigma\big( W_g [h_t; M_{t-1}] + b_g \big) \quad (14) \]
\[ M_t = g_t \odot M_{t-1} + (1 - g_t) \odot h_t \quad (15) \]

where $g_t$ controls the level of inheritance from the previous memory and $h_t$ is the representation from the last attention layer.
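The linearly updated neighborhood memory can be illustrated with a minimal NumPy sketch; the gate parameterization (a learned linear map `W_g`, `b_g` over the concatenated memory and representation) is our assumption for illustration, not a detail stated in the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update_memory(M_prev, h, W_g, b_g):
    """Gated memory update: the sigmoid gate g decides how much of the
    previous memory is inherited versus overwritten by the new representation."""
    g = sigmoid(W_g @ np.concatenate([M_prev, h]) + b_g)  # gate values in (0, 1)
    return g * M_prev + (1.0 - g) * h                     # elementwise convex blend

rng = np.random.default_rng(1)
d = 6
M = rng.normal(size=d)
h = rng.normal(size=d)
W_g = rng.normal(size=(d, 2 * d)) * 0.1   # hypothetical gate weights
b_g = np.zeros(d)
M_new = update_memory(M, h, W_g, b_g)
```

Because the update is an elementwise convex combination, each memory entry stays between the old memory value and the new representation, which is what makes the update stable and cheap compared to gradient-based memory schemes.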

The final prediction is made by concatenating the representations based on attention and memory as follows:

\[ \hat{y} = \sigma\big( W_o [h_t; M_t] + b_o \big) \quad (16) \]

TITAN is trained using binary cross-entropy loss [46] to maximize the likelihood of accurately predicting the observed binary outcome as follows:

\[ \mathcal{L}_{\mathrm{BCE}} = -\big[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\big] \quad (17) \]

where $y$ is the true observed outcome and $\hat{y}$ is the predicted probability.

TITAN learns contextualized non-linear representations from data through the attention mechanism, which highlights the most influential features, and through the memory component, which leverages information from similar individuals. In this study, we extend the role of TITAN beyond mere representation learning by training the model directly to estimate probabilities via the cross-entropy loss. TITAN jointly learns the representations of covariates and outcome risks in hypothetical interventions. Furthermore, we simplify the original TITAN model to increase its feasibility in real-world datasets. Instead of using complex gradient-based and surprise-based memory updates, we use a linearly updated neighborhood memory. This design preserves the model’s ability to generalize from historical data while ensuring efficiency and ease of implementation in practical settings.

DRLearner for causal effect estimation

Our prediction architecture incorporates DRLearner [35] as a parallel branch. DRLearner is a causal inference model capable of estimating the effect of having been tested for HIV on the risk of HIV infection. It belongs to the class of doubly robust methods, which combine two complementary models: an outcome regression model and a propensity score model. This combination enables DRLearner to mitigate estimation bias even when only one of the two models is correctly specified.

As with the TITAN model, the input feature matrix $X$ is derived from a DAG to ensure causal validity. For each individual $i$, DRLearner first trains two separate outcome regression models to estimate the potential outcomes under the treatment and control conditions as follows.

\[ \mu_1(x_i) = \mathbb{E}\big[\, Y \mid T = 1, X = x_i \,\big], \qquad \mu_0(x_i) = \mathbb{E}\big[\, Y \mid T = 0, X = x_i \,\big] \quad (18) \]

where $y_i$ represents the observed HIV infection status, $t_i$ denotes the treatment variable indicating whether the individual has ever been tested for HIV, and $x_i$ is the feature vector of individual $i$. Here, $\mu_1(x_i)$ and $\mu_0(x_i)$ denote the true conditional expectations of the potential outcomes under treatment and control, respectively. Their estimators obtained by DRLearner are denoted by $\hat{\mu}_1(x_i)$ and $\hat{\mu}_0(x_i)$.

Next, a propensity score model $\hat{e}(x_i)$ is trained to estimate the probability that individual $i$ was tested for HIV. This model is used to adjust for sample selection bias and is typically implemented via logistic regression or probabilistic classifiers.

DRLearner then combines the outcome and propensity models using the doubly robust correction formula as follows.

\[ \hat{\tau}_i = \hat{\mu}_1(x_i) - \hat{\mu}_0(x_i) + \frac{t_i \big( y_i - \hat{\mu}_1(x_i) \big)}{\hat{e}(x_i)} - \frac{(1 - t_i) \big( y_i - \hat{\mu}_0(x_i) \big)}{1 - \hat{e}(x_i)} \quad (19) \]

where $\hat{\tau}_i$ denotes the individual treatment effect (ITE) [47] for sample $i$, reflecting the change in infection probability if the individual had been tested instead of not tested, after correcting for confounding through both models.

In practical terms, this formula provides a safeguard against model misspecification. Even if the outcome or propensity score model is incorrectly specified, the doubly robust property ensures that consistent estimates of causal effects can still be obtained. This is particularly valuable in real-world applications where a single modeling approach may not fully capture the complex relationships between interventions and results.
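The doubly robust correction can be sketched in a few lines; the clipping of propensity scores is a common practical safeguard we add for illustration, not a step stated in the text:

```python
import numpy as np

def dr_pseudo_outcomes(y, t, mu1_hat, mu0_hat, e_hat, clip=0.01):
    """Doubly robust ITE estimates: outcome-model difference plus
    inverse-propensity-weighted residual corrections."""
    e_hat = np.clip(e_hat, clip, 1 - clip)  # avoid exploding weights near 0/1
    correction = (t * (y - mu1_hat) / e_hat
                  - (1 - t) * (y - mu0_hat) / (1 - e_hat))
    return mu1_hat - mu0_hat + correction

# When the nuisance models fit the observed outcomes exactly, the residual
# corrections vanish and the estimate reduces to mu1 - mu0.
y = np.array([1.0, 0.0])
t = np.array([1, 0])
mu1 = np.array([1.0, 0.4])
mu0 = np.array([0.2, 0.0])
e = np.array([0.5, 0.5])
tau_hat = dr_pseudo_outcomes(y, t, mu1, mu0, e)
ate_hat = tau_hat.mean()   # population-level average effect
```

Averaging the pseudo-outcomes yields the ATE, matching the population-level quantity defined next.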

Finally, the average treatment effect (ATE) [1] is calculated as follows.

\[ \widehat{\mathrm{ATE}} = \frac{1}{n} \sum_{i=1}^{n} \hat{\tau}_i \quad (20) \]

ATE quantifies the average effect of the intervention on the outcome in the population. DRLearner functions as an independent causal estimation module that provides stable and interpretable output, making it effective in contexts that require clear causal structures and high interpretability.

Implementation details of DRLearner

We provide here the practical specifications of the doubly robust learner, covering outcome regressions, propensity score estimation, and the extension to multi-arm treatments.

Outcome Regression Models. We employ RandomForestRegressor with the following hyperparameters: (i) n_estimators = 100, validated through cross-validation; (ii) max_depth = 5, to prevent overfitting while capturing non-linear interactions; (iii) min_samples_leaf = 50, ensuring sufficient sample sizes in each terminal node; (iv) bootstrap sampling enabled, providing variance estimation and improved robustness.

Propensity Score Models. For treatment assignment modeling, we use LogisticRegression with L2 regularization. The regularization strength is set as Inline graphic, optimized via grid search. We allow up to 1000 iterations with a convergence tolerance of Inline graphic, and use the liblinear solver, which is robust for moderate sample sizes.
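The stated hyperparameters map directly onto scikit-learn estimators; the sketch below mirrors the reported settings, with the regularization strength `C` and tolerance used as placeholder values since the exact tuned numbers are not shown here:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression

# Outcome regressions: one such model is fit per treatment arm.
outcome_model = RandomForestRegressor(
    n_estimators=100,      # validated through cross-validation
    max_depth=5,           # limits overfitting while allowing non-linearities
    min_samples_leaf=50,   # guarantees adequately populated terminal nodes
    bootstrap=True,        # enables variance estimation and robustness
    random_state=0,
)

# Propensity score: L2-regularized logistic regression, liblinear solver.
# C=1.0 and tol=1e-4 are placeholders; the paper tunes these by grid search.
propensity_model = LogisticRegression(
    penalty="l2",
    C=1.0,
    max_iter=1000,
    tol=1e-4,
    solver="liblinear",
)
```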

Multi-arm Treatment Handling. For Dataset 2, which contains four treatment arms, we extend the doubly robust estimator to handle multiple treatment contrasts. The causal effect between treatments $j$ and $k$ for instance $i$ is given as follows.

\[ \hat{\tau}_{jk}(x_i) = \hat{\mu}_j(x_i) - \hat{\mu}_k(x_i) + \frac{\mathbb{1}[t_i = j] \big( y_i - \hat{\mu}_j(x_i) \big)}{\hat{e}_j(x_i)} - \frac{\mathbb{1}[t_i = k] \big( y_i - \hat{\mu}_k(x_i) \big)}{\hat{e}_k(x_i)} \quad (21) \]

where $\hat{e}_j(x_i)$ denotes the generalized propensity score for arm $j$ and $\hat{\mu}_j(x_i)$ is the corresponding outcome regression.

This extension allows the framework to accommodate multi-arm clinical treatments rather than being restricted to binary treatment settings.

To reduce overfitting and ensure out-of-sample validity of counterfactual predictions, we applied a 5-fold cross-validation scheme. Outcome and propensity models are trained in four folds and tested on the held-out fold. This procedure also allows us to monitor the balance of estimated propensity scores across folds.
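A minimal sketch of the 5-fold cross-fitting scheme, under the assumption that separate outcome models are fit per treatment arm (hyperparameters abbreviated for speed; they are illustrative, not the paper's exact settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def crossfit_nuisances(X, t, y, n_splits=5, seed=0):
    """Cross-fitting: nuisance models are trained on four folds and predict on
    the held-out fold, so every counterfactual prediction is out-of-sample."""
    mu1, mu0, e = np.zeros(len(y)), np.zeros(len(y)), np.zeros(len(y))
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        # Propensity model fit on the training folds only.
        ps = LogisticRegression(max_iter=1000).fit(X[train], t[train])
        e[test] = ps.predict_proba(X[test])[:, 1]
        # One outcome regression per treatment arm.
        for arm, out in ((1, mu1), (0, mu0)):
            idx = train[t[train] == arm]
            m = RandomForestRegressor(n_estimators=50, max_depth=5,
                                      random_state=seed).fit(X[idx], y[idx])
            out[test] = m.predict(X[test])
    return mu1, mu0, e

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
t = rng.integers(0, 2, size=200)
y = X[:, 0] + 0.5 * t + rng.normal(scale=0.1, size=200)
mu1, mu0, e = crossfit_nuisances(X, t, y)
```

Inspecting `e` fold by fold is also how the balance of estimated propensity scores can be monitored.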

RL-based ensemble strategy for individualized causal estimation

Selecting the most appropriate reasoning path for each individual is crucial to enhance the accuracy of the predictions and the interpretability of the model in a complex causal estimation framework. However, traditional ensemble methods often lack the flexibility to accommodate the diversity of input features and the varying confidence levels of base models. To address this limitation, we use reinforcement learning (RL) [48] as a dynamic coordination mechanism. This approach enables the model to learn specific blending strategies tailored to individual cases, ultimately improving counterfactual predictions.

RL is a sequential decision making method in which an agent learns from interactions to maximize cumulative rewards. In this study, we adopt the RL-based Ensemble method in [36] to learn an optimal blending mechanism between TITAN and DRLearner. Instead of relying on static ensemble strategies such as averaging or fixed-weight voting, we design a meta-controller using reinforcement learning to determine the best instance-specific blending weight. This mechanism enables the model to select the most suitable causal reasoning pathway based on the input features.

Specifically, we formulate the blending task as a Markov Decision Process (MDP) [49], defined by the 5-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$. At time $t$, the state $s_t$ includes the predictions of TITAN and DRLearner, along with auxiliary characteristics such as confidence scores and bias indicators. The action $a_t$ represents a blending weight $\alpha_t \in [0, 1]$, indicating the relative emphasis on each model. The reward $r_t$ reflects the accuracy of the final prediction. The transition function $P$ models the probability of moving to the next state $s_{t+1}$, and $\gamma$ is the discount factor for future rewards.

We use an actor-critic architecture [50] to learn the blending policy. The actor network receives the current state $s_t$ and outputs the action $a_t$, the blending weight between TITAN and DRLearner. The critic network estimates the action-value function $Q(s_t, a_t)$, which evaluates the quality of the selected action. The actor's parameters are updated using the policy gradient formula as follows.

\[ \theta \leftarrow \theta + \alpha\, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, Q(s_t, a_t) \quad (22) \]

Here, $\theta$ denotes the parameters of the actor network, $\alpha$ is the learning rate, $\pi_{\theta}(a_t \mid s_t)$ is the probability of selecting action $a_t$ in state $s_t$, and $Q(s_t, a_t)$ is the action value estimated by the critic.

The final prediction is computed as a convex combination of the two model outputs as follows.

\[ \hat{y} = \alpha_t\, \hat{y}_{\mathrm{TITAN}} + (1 - \alpha_t)\, \hat{y}_{\mathrm{DR}} \quad (23) \]

where $\hat{y}$ is the counterfactual probability of HIV infection under a hypothetical intervention, $\hat{y}_{\mathrm{TITAN}}$ and $\hat{y}_{\mathrm{DR}}$ are the predictions from the respective models, and $\alpha_t$ is the learned blending weight.

The reward $r_t$ is defined as the negative binary cross-entropy loss between the true label $y$ and the final prediction $\hat{y}$:

\[ r_t = -\,\mathcal{L}_{\mathrm{BCE}}(y, \hat{y}) \quad (24) \]

Here, $\mathcal{L}_{\mathrm{BCE}}$ denotes the binary cross-entropy loss. This reward formulation encourages the RL agent to learn a blending policy that maximizes predictive accuracy.
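The blending and reward computations amount to a convex combination followed by a negative log-loss; a minimal sketch (values illustrative):

```python
import numpy as np

def blended_prediction(p_titan, p_dr, alpha):
    """Convex combination of the two base predictions."""
    return alpha * p_titan + (1.0 - alpha) * p_dr

def reward(y_true, p_final, eps=1e-12):
    """Negative binary cross-entropy: higher reward means a better prediction."""
    p = np.clip(p_final, eps, 1.0 - eps)  # guard against log(0)
    return y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)

p = blended_prediction(0.9, 0.7, alpha=0.5)  # halfway between the two models
r_good = reward(1.0, p)   # true label matches a confident prediction
r_bad = reward(0.0, p)    # true label contradicts it
```

Because the reward rises as the blended probability approaches the true label, the agent is pushed toward weightings that favor whichever base model is more accurate for the given instance.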

To ensure stable and efficient training, we integrate three well-established reinforcement learning techniques as follows.

1. We apply experience replay [51]. The agent stores transitions $(s_t, a_t, r_t, s_{t+1})$ in a replay buffer $\mathcal{D}$. During training, mini-batches are sampled randomly from $\mathcal{D}$ to break temporal correlations and enhance convergence. These samples are used to update both the actor and critic via stochastic gradient descent.
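A minimal replay buffer with uniform sampling might look like the following sketch (capacity and batch size are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience replay: uniform random mini-batches break the
    temporal correlation of consecutive transitions."""
    def __init__(self, capacity=10000, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return self.rng.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=100)
for i in range(150):            # exceeding capacity evicts the oldest transitions
    buf.push(i, 0.5, -0.1, i + 1)
batch = buf.sample(32)
```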

2. We inject Ornstein-Uhlenbeck noise [52] to encourage exploration in continuous action spaces. The noise process is defined as follows.

\[ n_{t+1} = n_t + \theta\,(\mu - n_t) + \sigma\, \epsilon_t \quad (25) \]

where $n_t$ is the current noise value, $\theta$ is the reversion speed, $\mu$ is the long-term mean, $\sigma$ is the volatility, and $\epsilon_t$ is Gaussian noise. OU noise produces smoother exploration behavior than uncorrelated noise.
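The discretized OU process can be sketched as follows (the parameter values are conventional illustrations, not those used in the paper):

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise that reverts to
    a long-term mean mu at speed theta, with volatility sigma."""
    def __init__(self, theta=0.15, mu=0.0, sigma=0.2, seed=0):
        self.theta, self.mu, self.sigma = theta, mu, sigma
        self.state = mu
        self.rng = np.random.default_rng(seed)

    def sample(self):
        # n_{t+1} = n_t + theta * (mu - n_t) + sigma * gaussian
        self.state += (self.theta * (self.mu - self.state)
                       + self.sigma * self.rng.normal())
        return self.state

noise = OUNoise()
trace = [noise.sample() for _ in range(1000)]
```

Consecutive samples are correlated, so the exploration perturbation of the blending weight drifts smoothly instead of jittering like white noise.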

3. We employ soft target updates using Polyak averaging [53] to update the target critic network as follows.

\[ \theta' \leftarrow \tau\, \theta + (1 - \tau)\, \theta' \quad (26) \]

Here, $\theta$ and $\theta'$ are the current and target network parameters, and $\tau$ (typically a small constant) controls the smoothness of the update.
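The soft target update is a one-line operation; the sketch below uses tau = 0.005, a conventional choice rather than the paper's value:

```python
import numpy as np

def soft_update(target, current, tau=0.005):
    """Polyak averaging: target parameters drift slowly toward the current
    parameters, stabilizing the critic's bootstrap targets."""
    return tau * current + (1.0 - tau) * target

theta_target = np.zeros(4)   # target network starts at 0
theta_current = np.ones(4)   # current network fixed at 1 for illustration
for _ in range(10):
    theta_target = soft_update(theta_target, theta_current)
```

After ten updates the target has moved only a few percent of the way toward the current parameters, which is exactly the slow tracking behavior the technique is designed to give.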

These techniques allow the RL agent to learn a stable and instance-specific blending strategy. This enables accurate counterfactual reasoning across groups of individuals who differ in their features.

Causal-aware reward and state design

To ensure unbiased causal estimation, we extend the reward function and state-space design to directly incorporate causal signals.

Reward Function. The primary component is a doubly robust loss, which directly measures the accuracy of causal effect estimation as follows.

\[ r^{\mathrm{DR}}_{i} = -\left( \hat{\mu}_1(x_i) - \hat{\mu}_0(x_i) + \frac{t_i \big( y_i - \hat{\mu}_1(x_i) \big)}{\hat{e}(x_i)} - \frac{(1 - t_i) \big( y_i - \hat{\mu}_0(x_i) \big)}{1 - \hat{e}(x_i)} - \tau_i \right)^{2} \quad (27) \]

Here, $\hat{e}(x_i)$ denotes the estimated propensity score, $\hat{\mu}_t(x_i)$ the outcome regression under treatment $t$, and $\tau_i$ the ground-truth effect in simulations.

In addition to this core component, we introduce: (i) Uncertainty-weighted rewards, which encourage the agent to down-weight uncertain predictions using confidence intervals and bootstrap estimates; (ii) Balance-aware penalties, which discourage covariate imbalance between treated and control groups as follows.

\[ \mathrm{penalty} = \sum_{j} \big| \bar{x}_{j,1} - \bar{x}_{j,0} \big| \quad (28) \]

where $\bar{x}_{j,t}$ is the weighted mean of covariate $j$ under treatment $t$ according to the implied weights of the ensemble.

State Space. The state representation is extended to embed causal information as follows.

\[ s_t = \big[\, \hat{y}_{\mathrm{TITAN}},\; \hat{y}_{\mathrm{DR}},\; \hat{e}(x),\; u(x) \,\big] \quad (29) \]

where $u(x)$ denotes the uncertainty of the prediction. This ensures that the RL agent’s decisions are guided by causal validity indicators rather than predictive accuracy alone. Overall, this causal-aware design directs the ensemble toward weighting strategies that enhance counterfactual reasoning and improve causal estimation quality.

Uncertainty estimation

We incorporate an uncertainty estimation mechanism to assess the reliability of counterfactual predictions and estimated causal effects. This enables the model to distinguish between cases with clear causal links and those with higher ambiguity. As a result, the trustworthiness and interpretability of the model improve in evaluating the causal impact of HIV screening test behavior on laboratory test behavior. In causal inference tasks using observational data, evaluating the counterfactual outcome and the reliability of such estimates is crucial. To quantify uncertainty in counterfactual predictions and the estimation of causal effects, we apply the Monte Carlo Dropout (MC Dropout) [54] method during inference.

Specifically, the model performs $T$ stochastic forward passes with dropout enabled, producing a set of counterfactual predictions $\{\hat{y}_1, \ldots, \hat{y}_T\}$. Each $\hat{y}_t$ represents the estimated probability of HIV infection under counterfactual conditions. The predictive mean is calculated as follows.

\[ \bar{y} = \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t \quad (30) \]

The general predictive uncertainty is measured as the variance of the predictions as follows.

\[ \mathrm{Var}(\hat{y}) = \frac{1}{T} \sum_{t=1}^{T} \big( \hat{y}_t - \bar{y} \big)^{2} \quad (31) \]

If the model is also trained to output the aleatoric variance $\hat{\sigma}_t^{2}$ for each forward pass, the total predictive uncertainty can be decomposed into two components as follows.

  1. Aleatoric uncertainty [55] captures irreducible variability in potential outcomes due to inherent randomness or noise in the data, even under a well-specified causal model.
    \[ \sigma^{2}_{\mathrm{aleatoric}} = \frac{1}{T} \sum_{t=1}^{T} \hat{\sigma}_t^{2} \quad (32) \]
  2. Epistemic uncertainty [55] arises from limited knowledge about the true data-generating process and can be reduced by observing more data or improving the model.
    \[ \sigma^{2}_{\mathrm{epistemic}} = \frac{1}{T} \sum_{t=1}^{T} \big( \hat{\mu}_t - \bar{\mu} \big)^{2} \quad (33) \]

where $\hat{\mu}_t$ denotes the predicted mean in the $t$-th forward pass and $\bar{\mu}$ is the average of these means. Overall, the total predictive variance is represented as follows.

\[ \sigma^{2}_{\mathrm{total}} = \sigma^{2}_{\mathrm{aleatoric}} + \sigma^{2}_{\mathrm{epistemic}} \quad (34) \]

This decomposition enables the model to assess the confidence of each counterfactual prediction, thereby improving interpretability and trustworthiness in public health applications such as evaluating the causal impact of HIV screening behavior on laboratory testing behavior.
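The decomposition of the total predictive variance into aleatoric and epistemic parts can be sketched as follows, with the stochastic forward passes simulated for illustration:

```python
import numpy as np

def decompose_uncertainty(means, variances):
    """Total predictive variance = aleatoric (mean of the per-pass noise
    variances) + epistemic (variance of the per-pass mean predictions)."""
    aleatoric = np.mean(variances)
    epistemic = np.mean((means - np.mean(means)) ** 2)
    return aleatoric, epistemic, aleatoric + epistemic

# T = 50 stochastic (dropout-enabled) forward passes for one individual,
# simulated here instead of coming from a trained network.
rng = np.random.default_rng(0)
means = 0.3 + 0.05 * rng.normal(size=50)   # per-pass mean predictions
variances = np.full(50, 0.01)              # per-pass aleatoric variances
alea, epi, total = decompose_uncertainty(means, variances)
```

Collecting more data shrinks the spread of `means` (epistemic term), while `variances` reflects noise that no amount of data removes.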

Concept drift detection

The proposed model incorporates a mechanism to detect concept drift. This mechanism identifies shifts in the distribution of input features between the treatment and control groups. If these shifts are not addressed appropriately, they can reduce the predictive accuracy of the outcome and propensity models. When drift is detected, the system adjusts the ensemble weights and may also modify the causal inference method. This enhances the model’s flexibility and stability. Drift detection serves as a quality check at the input level. It enables the model to manage data inconsistencies between groups, which often occur in real-world scenarios.

We incorporate a concept drift detection mechanism [56] into the model to enhance its robustness. This component checks whether the input distribution has changed compared to training. It works by comparing the mean vector of the current input data $\bar{x}_{\mathrm{current}}$ with the mean of the training distribution $\mu_{\mathrm{train}}$. A drift is flagged when the difference exceeds a defined threshold as follows.

\[ \big\lVert \bar{x}_{\mathrm{current}} - \mu_{\mathrm{train}} \big\rVert > \delta \quad (35) \]

This simple condition helps the model recognize significant distribution shifts. Once drift is detected, the model may adjust its ensembling weights or trigger an update. This adaptation ensures robustness in real-world deployment. It is advantageous for causal effect estimation tasks, where behavioral trends can change over time.
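The mean-shift drift check reduces to a norm comparison; a sketch with an illustrative threshold:

```python
import numpy as np

def detect_drift(X_current, mu_train, threshold):
    """Flags drift when the mean of the current batch deviates from the
    training mean by more than `threshold` in Euclidean norm."""
    return np.linalg.norm(X_current.mean(axis=0) - mu_train) > threshold

rng = np.random.default_rng(0)
mu_train = np.zeros(3)                          # training-distribution mean
X_same = rng.normal(loc=0.0, size=(500, 3))     # drawn from the same distribution
X_shifted = rng.normal(loc=2.0, size=(500, 3))  # mean-shifted input batch
```

In deployment, `detect_drift` would be called on each incoming batch; a `True` result would trigger the ensemble-weight adjustment or model update described above.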

Experimental results

We first evaluate the framework on simulated data with known ground-truth causal effects, which enables us to quantify estimation accuracy under controlled conditions and validate causal correctness. We then apply the framework to real-world HIV datasets, where the goal is to assess its practical utility.

Evaluation metrics

We use a common set of metrics to evaluate the framework in both simulated and real-world HIV datasets. The goal is to measure not only the predictive accuracy but also the quality of causal estimation and the reliability of uncertainty estimates.

For causal validity, we report several metrics that quantify the accuracy of treatment effect estimation. The precision in the estimation of heterogeneous effect (PEHE) is defined as

\[ \mathrm{PEHE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \big( \hat{\tau}_i - \tau_i \big)^{2} } \]

which evaluates the root mean squared error between estimated and true individual treatment effects (ITEs). The error and bias of the Average Treatment Effect (ATE) are measured as

\[ \epsilon_{\mathrm{ATE}} = \big| \widehat{\mathrm{ATE}} - \mathrm{ATE} \big|, \qquad \mathrm{Bias}_{\mathrm{ATE}} = \widehat{\mathrm{ATE}} - \mathrm{ATE} \]

We also compute the coefficient of determination for ITEs,

\[ R^{2} = 1 - \frac{\sum_{i=1}^{n} \big( \tau_i - \hat{\tau}_i \big)^{2}}{\sum_{i=1}^{n} \big( \tau_i - \bar{\tau} \big)^{2}} \]

and the mean absolute error of ITEs,

\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \big| \hat{\tau}_i - \tau_i \big| \]

The estimated Average Treatment Effect (Estimated ATE) is defined as the sample mean of the predicted individual treatment effects:

\[ \widehat{\mathrm{ATE}} = \frac{1}{n} \sum_{i=1}^{n} \hat{\tau}_i \]

This provides a direct estimate of the population-level treatment effect from the model. These measures capture how close the model is to the true causal effects at both the individual and population levels.
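The causal estimation metrics above can be computed in a few lines; a sketch:

```python
import numpy as np

def causal_metrics(tau_hat, tau_true):
    """PEHE, ATE error/bias, R^2 and MAE over individual treatment effects."""
    pehe = np.sqrt(np.mean((tau_hat - tau_true) ** 2))     # RMSE of ITEs
    ate_hat, ate_true = tau_hat.mean(), tau_true.mean()
    ate_bias = ate_hat - ate_true
    ss_res = np.sum((tau_true - tau_hat) ** 2)
    ss_tot = np.sum((tau_true - tau_true.mean()) ** 2)
    return {
        "PEHE": pehe,
        "ATE_error": abs(ate_bias),
        "ATE_bias": ate_bias,
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": np.mean(np.abs(tau_hat - tau_true)),
        "Estimated_ATE": ate_hat,
    }

m = causal_metrics(np.array([0.2, 0.3, 0.1]), np.array([0.25, 0.25, 0.1]))
```

Note that these require the true effects `tau_true`, which is why they are reported only on simulated data.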

To examine distributional reliability, we report the standard deviation of estimated ITEs (ITE Std), the coverage rate of the true effect within the 95% confidence intervals, and the calibration error. Coverage is defined as follows.

\[ \mathrm{Coverage} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\big[ \tau_i \in \mathrm{CI}_{95\%}(\hat{\tau}_i) \big] \]

which measures the proportion of true effects that fall within the predicted confidence intervals. A well-calibrated model should achieve coverage close to the nominal 95%. Calibration error is quantified as the deviation between the empirical coverage and the nominal confidence level. These uncertainty metrics are only available in the simulation study, since the ground-truth causal effects are known there but not in real-world data.

For predictive performance, we use standard classification metrics: Accuracy, F1-Score, and AUC-ROC. These are consistently reported for both the simulation and HIV datasets to give a comparable view of how well the framework discriminates the outcomes. These metrics capture different aspects of model quality and are applicable regardless of whether the prediction target is testing behavior or clinical trial outcomes.

1. Accuracy measures the overall proportion of correct predictions, including positive and negative outcomes. It is calculated as follows.

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

where TP (true positives) and TN (true negatives) refer to correctly predicted outcomes, while FP (false positives) and FN (false negatives) represent incorrect predictions. This metric provides a general overview of the predictive reliability of the model to identify positive and negative outcomes.

2. F1-score balances precision and recall, offering a robust performance measure in the presence of class imbalance as follows.

\[ \mathrm{F1} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]

This metric is particularly important in settings where the consequences of misclassification can lead to substantial negative outcomes.

3. ROC-AUC (Receiver Operating Characteristic – Area Under the Curve) quantifies the ability of the model to distinguish between classes across all possible decision thresholds:

\[ \mathrm{ROC\text{-}AUC} = \int_{0}^{1} \mathrm{TPR}\; d(\mathrm{FPR}) \]

where TPR (true positive rate) and FPR (false positive rate) describe the sensitivity and fallout of the model at varying thresholds. ROC-AUC captures the model’s ability to distinguish between classes across various decision thresholds, making it especially useful in applications where classification criteria may vary.
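All three classification metrics are available in scikit-learn; a small worked example with illustrative labels and scores:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])
y_pred = (y_prob >= 0.5).astype(int)   # threshold the probabilities at 0.5

acc = accuracy_score(y_true, y_pred)   # proportion of correct predictions
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)    # threshold-free discrimination
```

Accuracy and F1 depend on the chosen decision threshold, whereas ROC-AUC is computed from the raw probabilities across all thresholds.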

Simulation study

We first evaluate the framework on simulated data with known ground-truth causal effects, which allows us to assess estimation accuracy under controlled conditions. This step provides a validation of causal correctness before applying the method to HIV datasets.

Data generating process

We simulate a semi-realistic HIV causal system with measured confounders, a policy-like instrumental variable, a mediator, and age-modulated heterogeneous treatment effects. Unless otherwise stated, we generate Inline graphic samples with random seed Inline graphic for reproducibility. For each unit, exogenous covariates Inline graphic are drawn as follows. Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic. A latent health factor Inline graphic is unobserved in the analysis dataset. An instrumental variable Inline graphic depends on the geography as follows.

\[ P(Z = 1 \mid \mathrm{geo}) = p_{\mathrm{geo}} \]

and has no direct path to the outcome.

The assignment of treatment follows a logistic model with measured confounding and the instrument as follows.

\[ e(x, z) = \sigma\big( \beta_0 + \beta^{\top} x + \gamma z \big) \]

which yields the propensity score used to draw the treatment $T \sim \mathrm{Bernoulli}\big( e(x, z) \big)$.

A post-treatment mediator $M$ (viral load-like) is generated as follows.

\[ M = \alpha_0 + \alpha^{\top} x - \delta\, T + \varepsilon_M, \qquad \delta > 0 \]

so treatment reduces $M$ on average. The individual treatment effect is modeled as a function of age, as specified below.

\[ \tau_i = g(\mathrm{age}_i) \]

The binary result is generated according to the following logistic model.

\[ P(Y = 1 \mid T, x, M) = \sigma\big( \eta_0 + \eta^{\top} x + \tau_i\, T + \lambda\, M \big) \]

We release two tables per simulation: (i) the observed dataset Inline graphic, where Inline graphic is omitted, and (ii) a table ground-truth with the true ITE Inline graphic, the propensity score and the Inline graphic-probabilities for diagnostics.

The design induces non-trivial confounding, mediation, and age-modulated heterogeneity while preserving overlap through a smooth propensity, enabling unbiased benchmarking of ATE/ITE metrics against known truth.

Causal graphs: ground Truth vs. Estimated

The ground-truth graph (Figure 6) represents the assumed causal structure in the simulation: age (Inline graphic), immune status (Inline graphic), and comorbidity (Inline graphic) act as confounders; geography (Inline graphic) influences treatment only through the instrumental variable Inline graphic; and the mediator Inline graphic lies on the path from treatment to outcome. Other covariates such as gender (Inline graphic), behavior (Inline graphic), and visits (Inline graphic) play a limited role and show weak connections to the outcome. For clarity, we treat the ground-truth graph in the simulation as the expert-defined graph in subsequent comparisons.

Fig. 6 Ground truth/expert causal graph for simulated data

The estimated causal graph (Fig. 7) is denser than the ground truth, adding edges from age and geography that the true structure does not contain, while some true relations appear less prominent. Overall, the estimated graph deviates from the predefined ground-truth graph, reflecting patterns learned purely from the observational data.

Fig. 7 Estimated causal graph for simulated data

Simulation study results

The simulation study offers a controlled environment where the true causal effects are known. This setting lets us check how well the framework recovers causal relationships and whether it can also make reliable predictions. We report four sets of results: overall causal estimation, ablation experiments, predictive performance, and the distribution of treatment effects.

Table 1 reports the performance of CAUSALRLSTACK and the baseline methods on the simulated dataset, where the ground-truth causal effects are known. Across all causal metrics, CAUSALRLSTACK achieves the best overall performance. Specifically, it attains the lowest error in heterogeneous effect estimation (PEHE = 0.0980) and the lowest average treatment effect error (ATE error = 0.0198), while also exhibiting minimal bias (ATE bias = −0.0089). For individual treatment effect estimation, CAUSALRLSTACK achieves the highest ITE R² value (0.5834), indicating superior explanatory power compared to competing methods. Its estimated ATE (0.2261) is also the closest to the true value used in the data-generating process.

Table 1.

Performance comparison of the proposed method with existing causal inference models on simulated data

Method PEHE ATE Error ATE Bias ITE R² Estimated ATE
CAUSALRLSTACK 0.0980 0.0198 −0.0089 0.5834 0.2261
Causality-Aware Transformer (CAT) Network 0.1089 0.0234 0.0187 0.5445 0.2537
Orthogonal Random Forests 0.1156 0.0267 −0.0145 0.5198 0.2205
Double/Debiased ML 0.1203 0.0289 0.0223 0.4987 0.2573
Causal Forest 0.1267 0.0321 −0.0198 0.4743 0.2152
X-Learner 0.1334 0.0367 0.0298 0.5456 0.2648
CEVAE 0.1423 0.0412 −0.0267 0.5123 0.2083

These results confirm that the proposed framework not only improves predictive accuracy but also produces unbiased causal estimates under controlled conditions, validating its methodological soundness prior to application to real-world HIV datasets.
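For reference, the causal validity metrics reported in Table 1 can be computed from true and estimated individual treatment effects roughly as follows (a standard formulation sketched here; the paper's exact definitions, e.g. sign conventions for bias, may differ):

```python
import numpy as np

def causal_metrics(tau_true, tau_hat):
    """PEHE, ATE error, ATE bias, and ITE R^2 from true vs. estimated
    individual treatment effects (a common formulation; a sketch)."""
    tau_true = np.asarray(tau_true, dtype=float)
    tau_hat = np.asarray(tau_hat, dtype=float)
    pehe = np.sqrt(np.mean((tau_hat - tau_true) ** 2))   # RMSE of the ITE estimates
    ate_bias = tau_hat.mean() - tau_true.mean()          # signed deviation of the ATE
    ate_error = abs(ate_bias)                            # absolute ATE error
    ss_res = np.sum((tau_true - tau_hat) ** 2)
    ss_tot = np.sum((tau_true - tau_true.mean()) ** 2)
    ite_r2 = 1.0 - ss_res / ss_tot                       # explanatory power of ITE estimates
    return dict(pehe=pehe, ate_error=ate_error, ate_bias=ate_bias, ite_r2=ite_r2)
```

A perfect estimator yields PEHE = 0, ATE error = 0, and ITE R² = 1, matching the directions of improvement reported in the table.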

Table 2 presents an ablation analysis assessing the contribution of each component of CAUSALRLSTACK. The complete model achieves the best performance across all causal validity metrics (PEHE = 0.0980, ATE error = 0.0198, ITE R² = 0.5834, and ITE MAE = 0.0756). When the RL ensemble is removed, PEHE increases to 0.1134 (+15.7%) and the ATE error rises to 0.0243, indicating that this module plays the most critical role in improving stability and accuracy. Removing TITAN yields PEHE = 0.1189 (+21.3%) and ATE error = 0.0278, reflecting its importance for representation quality. Similarly, excluding DRLearner leads to PEHE = 0.1076 and a drop in ITE R² from 0.5834 to 0.5734, demonstrating its contribution to refining effect estimation. Finally, removing the causal graph also degrades performance, with PEHE = 0.1156 and ITE R² reduced to 0.5389.

Table 2.

Component ablation study of CAUSALRLSTACK on simulated data

Configuration PEHE ATE Error ITE R² ITE MAE
Full CAUSALRLSTACK 0.0980 0.0198 0.5834 0.0756
Without RL Ensemble 0.1134 0.0243 0.5456 0.0789
Without TITAN 0.1189 0.0278 0.5223 0.0823
Without DRLearner 0.1076 0.0198 0.5734 0.0734
Without Causal Graph 0.1156 0.0267 0.5389 0.0667
TITAN only 0.1167 0.0256 0.5334 0.0812
DRLearner only 0.1245 0.0289 0.5089 0.0856

Overall, these results confirm that the synergy of all components is necessary to achieve optimal causal inference performance, with the RL ensemble providing the largest individual improvement.

Table 3 reports the predictive performance of CAUSALRLSTACK and the baseline models on the simulated dataset. CAUSALRLSTACK achieves the highest accuracy (0.8734), F1-score (0.862), and AUC-ROC (0.924), outperforming all competing methods on every metric. Compared to the strongest baseline, the Causality-Aware Transformer (CAT) network, our framework improves accuracy by more than 4.5 percentage points and yields higher discriminative ability, as reflected in the AUC-ROC. These results demonstrate that the proposed model not only provides unbiased causal estimates but also maintains superior predictive capacity, which is essential for reliable deployment in real-world settings.

Table 3.

Predictive performance comparison of the proposed method with baselines

Method Accuracy F1-Score AUC-ROC
CAUSALRLSTACK 0.8734 0.862 0.924
Causality-Aware Transformer (CAT) Network 0.8267 0.814 0.889
Orthogonal Random Forests 0.8189 0.807 0.883
Double/Debiased ML 0.8123 0.798 0.876
Causal Forest 0.8045 0.789 0.867
X-Learner 0.7967 0.776 0.854
CEVAE 0.7834 0.761 0.841

Table 4 evaluates the distributional properties of the treatment effect estimates. CAUSALRLSTACK achieves the lowest standard deviation of individual treatment effects (ITE Std = 0.1456), the highest coverage of the true effect within the 95% confidence interval (94.2%), and the smallest calibration error (0.0234). A coverage rate close to the nominal 95% level indicates that the estimated uncertainty intervals are well calibrated, providing reliable quantification of uncertainty. These results demonstrate that our framework not only improves accuracy but also yields trustworthy, interpretable causal effect estimates.

Table 4.

Treatment effect distribution analysis on simulated data

Method ITE Std Treatment Effect Coverage (95% CI) Calibration Error
CAUSALRLSTACK 0.1456 94.2% 0.0234
Causality-Aware Transformer (CAT) Network 0.1678 91.8% 0.0367
Orthogonal Random Forests 0.1723 90.6% 0.0389
Double/Debiased ML 0.1789 89.4% 0.0412
Causal Forest 0.1834 88.7% 0.0445
X-Learner 0.1923 87.3% 0.0478
CEVAE 0.2067 85.9% 0.0523
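Coverage and calibration error as in Table 4 can be computed along the following lines. This is a sketch: the paper does not give its exact calibration definition, so a simple gap between nominal and empirical coverage is used here.

```python
import numpy as np

def coverage_and_calibration(tau_true, tau_hat, tau_se):
    """95% CI coverage of the true ITE and a simple calibration error
    (sketch: absolute gap between nominal and empirical coverage)."""
    z = 1.96  # normal quantile for a two-sided 95% interval
    lo, hi = tau_hat - z * tau_se, tau_hat + z * tau_se
    covered = (tau_true >= lo) & (tau_true <= hi)   # does each interval contain the truth?
    coverage = covered.mean()
    calibration_error = abs(coverage - 0.95)
    return coverage, calibration_error
```

Well-calibrated intervals produce empirical coverage near 0.95 and a calibration error near zero, which is the pattern the table reports for CAUSALRLSTACK.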

Robustness testing

To further examine the stability and causal validity of CAUSALRLSTACK, we conduct robustness tests under different challenging conditions.

Sample size sensitivity. We vary the number of observations from 1,000 to 10,000. Performance remains stable across this range, with PEHE increasing only by 12% at the smallest sample size, indicating that the method is not overly dependent on large samples.

Confounding strength. We test scenarios with increasing levels of confounding strength. Across all settings, our method maintains a low PEHE, whereas the baseline methods degrade significantly under the strongest confounding, demonstrating robustness to strong confounding.

Model misspecification. We deliberately misspecify the outcome model by using incorrect functional forms. Even in this setting, the doubly robust properties of our estimator keep the ATE bias below 0.03, whereas single-robust methods exhibit biases greater than 0.08.
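The doubly robust behavior invoked here is the defining property of AIPW-style estimators such as DRLearner: the ATE estimate remains consistent if either the outcome models or the propensity model is correctly specified. A minimal sketch (not the paper's implementation):

```python
import numpy as np

def aipw_ate(y, t, e_hat, mu0_hat, mu1_hat):
    """Augmented inverse-propensity-weighted (doubly robust) ATE estimate.
    y: outcomes; t: binary treatment; e_hat: estimated propensity scores;
    mu0_hat / mu1_hat: estimated outcomes under control / treatment."""
    psi = (mu1_hat - mu0_hat
           + t * (y - mu1_hat) / e_hat              # correction for treated units
           - (1 - t) * (y - mu0_hat) / (1 - e_hat)) # correction for control units
    return psi.mean()
```

If the outcome models are exact, the correction terms vanish; if instead the propensity model is exact, the corrections reweight the residuals so the outcome-model errors cancel in expectation, which is why misspecifying one model keeps the bias small.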

Overall, these robustness experiments provide strong evidence that CAUSALRLSTACK achieves reliable causal estimation across diverse scenarios and remains valid under adverse conditions such as small sample sizes, strong confounding, and model misspecification.

Experiments on HIV datasets

Comparison with existing causal methods

To evaluate the effectiveness of our proposed hybrid causal effect estimation framework, we compare its performance with several widely used baseline models in causal inference. These baselines represent diverse methodological families, including statistical meta-learners, tree-based ensemble methods, deep generative models, and transformer-based architectures. The specific methods are Double/Debiased Machine Learning (DML) [19], Orthogonal Random Forests (ORF) [20], Causal Forests [18], X-Learner [21, 22], CEVAE (Causal Effect Variational Autoencoder) [57], and the Causality-Aware Transformer (CAT) [24].

Hyperparameter optimization

We utilized Optuna, a modern hyperparameter optimization framework, to automatically tune the key parameters of the model. The optimization process was guided by the performance of the validation set and integrated with an early stopping mechanism to prevent overfitting and reduce unnecessary training time. This approach enabled efficient model selection while ensuring good generalization across different datasets. The set of optimal hyperparameters selected through this procedure is summarized in Table  5.

Table 5.

Best hyperparameters selected by Optuna

Parameter Value
Batch size 1024
TITAN hidden size 512
TITAN number of layers 8
TITAN dropout 0.1567
TITAN learning rate 1.75e-4
TITAN number of heads 8
MLP hidden size 512
MLP dropout 0.1029
MLP learning rate 1.59e-4
MLP activation SiLU
RL Inline graphic 0.9193
RL Inline graphic 0.0139
RL actor learning rate 2.26e-5
RL critic learning rate 5.01e-4
RL hidden dimension 128
RL update frequency 2
TITAN epochs 31
MLP epochs 29
RL epochs 30
Patience 4

For baseline models, we applied the same optimization strategy to ensure a fair comparison.
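The tuning mechanism described above, validation-guided search with patience-based early stopping, can be illustrated with a simple random-search sketch. The paper uses Optuna; the loop below only demonstrates the mechanism Optuna automates, and the search space and objective are hypothetical.

```python
import random

def tune(train_and_eval, search_space, n_trials=20, patience=4, seed=0):
    """Random-search sketch of validation-guided hyperparameter tuning with
    patience-based early stopping (illustrative; not the paper's Optuna setup)."""
    rng = random.Random(seed)
    best_score, best_params, stale = -float("inf"), None, 0
    for _ in range(n_trials):
        # Sample one candidate configuration from the (hypothetical) search space
        params = {name: rng.choice(values) for name, values in search_space.items()}
        score = train_and_eval(params)          # validation metric to maximize
        if score > best_score:
            best_score, best_params, stale = score, params, 0
        else:
            stale += 1
            if stale >= patience:               # stop after `patience` stale trials
                break
    return best_params, best_score
```

In the actual framework, `train_and_eval` would train the model with the candidate hyperparameters and return the validation score, and the search space would cover the parameters listed in Table 5.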

Data splitting strategy

For both datasets, we split the data into 64% for training, 16% for validation, and 20% for testing. The validation set was used for hyperparameter tuning and model selection, while the final results were obtained from the held-out test set. In addition, we applied k-fold cross-validation within the training/validation split to ensure that the results were not sensitive to a particular partition. These procedures were applied consistently across all models compared in our experiments.
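The 64/16/20 split can be produced by first holding out 20% for testing and then splitting the remaining 80% as 80/20 into training and validation. A minimal sketch (the paper's exact seeding and any stratification are not specified):

```python
import numpy as np

def split_indices(n, seed=0):
    """64/16/20 train/validation/test split of n sample indices:
    hold out 20% for testing, then take 16% of the total for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)                    # shuffle before partitioning
    n_test = int(round(0.20 * n))
    n_val = int(round(0.16 * n))
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]                # remaining 64%
    return train, val, test
```

K-fold cross-validation, as mentioned above, would then be applied within the combined train/validation portion, leaving the test indices untouched.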

Computational efficiency and scalability

To assess the computational feasibility of the proposed framework, we report both training and inference times for all models. Table 6 summarizes the training times on Dataset 1 and Dataset 2, together with a complexity ratio normalized to CAUSALRLSTACK (set to 1.00). As shown, CAUSALRLSTACK requires longer training (67.3 minutes on Dataset 1 and 89.7 minutes on Dataset 2) due to the integration of TITAN, DRLearner, and RL-based ensembling. Simpler baselines such as DRLearner or MLP + Causal train significantly faster (complexity ratios of 0.07 to 0.14 relative to CAUSALRLSTACK) but do not achieve comparable causal estimation accuracy.

Table 6.

Training time comparison (minutes)

Method Dataset 1 Dataset 2 Ratio Components
CAUSALRLSTACK 67.3 89.7 1.00 TITAN + DRLearner + RL Agent + Optuna
TITAN 34.8 46.2 0.52 Memory-augmented Transformer
DRLearner 8.4 12.7 0.14 Random Forest + Logistic Regression
MLP + Causal 4.2 6.1 0.07 Simple MLP
Double ML 12.3 18.9 0.20 Meta-learners with cross-fitting

Beyond training cost, inference performance is critical for real-world deployment. Table 7 reports inference latency, throughput, model size, and deployment complexity. CAUSALRLSTACK achieves an inference latency of 4.7 ms/sample (213 samples/sec) with a model size of 156.3 MB. Although slower than simpler baselines (e.g., MLP + Causal at 0.3 ms/sample), this latency remains well within a feasible range for clinical decision support systems.

Table 7.

Inference performance

Method Latency (ms/sample) Throughput (samples/sec) Size (MB) Deployment Complexity
CAUSALRLSTACK 4.7 213 156.3 High
TITAN 2.9 345 89.7 Moderate
DRLearner 0.8 1250 12.4 Low
MLP + Causal 0.3 3333 2.8 Very Low
TITAN + MLP 2.1 476 67.2 Moderate
TITAN + MLP + RL 3.8 263 124.6 High
Double/Debiased ML 1.2 833 18.7 Low
Orthogonal RF 1.8 556 45.3 Low
CEVAE 2.4 417 78.9 Moderate
CAT Networks 2.6 385 82.1 Moderate
Causal Forest 1.5 667 6.8 Low
X-Learner 0.9 1111 15.2 Low

These results highlight a trade-off between computational cost and predictive performance. Although CAUSALRLSTACK demands higher training resources, its inference latency and throughput are within practical limits, supporting its feasibility for deployment in healthcare decision support scenarios.
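Latency and throughput figures like those in Table 7 can be measured with a minimal harness such as the following (illustrative; the paper's measurement protocol is not described):

```python
import time

def benchmark(predict_fn, samples, warmup=10):
    """Measure per-sample latency (ms) and throughput (samples/sec)
    for a single-sample prediction function (a sketch)."""
    for x in samples[:warmup]:                  # warm up caches before timing
        predict_fn(x)
    start = time.perf_counter()
    for x in samples:
        predict_fn(x)
    elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / len(samples)
    throughput = len(samples) / elapsed
    return latency_ms, throughput
```

Note that throughput is simply the reciprocal of the per-sample latency (e.g., 4.7 ms/sample corresponds to roughly 213 samples/sec), which is consistent with the table's columns.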

Main results

Table  8 displays the performance of CAUSALRLSTACK compared to various existing causal effect estimation methods. These include double/debiased machine learning (DML), orthogonal random forests, CEVAE, Causality-Aware Transformer (CAT) Networks, Causal Forest, and X-Learner.

Table 8.

Comparison of the proposed method with existing causal inference models

Method Dataset 1 Dataset 2
Accuracy F1-Score AUC-ROC Accuracy F1-Score AUC-ROC
CAUSALRLSTACK 0.861 0.845 0.897 0.855 0.839 0.892
Double/Debiased Machine Learning 0.774 0.749 0.830 0.768 0.744 0.827
Orthogonal Random Forests 0.781 0.758 0.834 0.774 0.753 0.831
CEVAE 0.743 0.720 0.798 0.740 0.753 0.795
Causality-Aware Transformer (CAT) Networks 0.784 0.763 0.838 0.778 0.758 0.835
Causal Forest 0.768 0.747 0.824 0.763 0.742 0.821
X-Learner 0.756 0.738 0.812 0.751 0.733 0.808

Among the baseline methods, CAT networks and orthogonal random forests achieved the best results on Dataset 1, with accuracy scores of 0.784 and 0.781 and AUC-ROC values of 0.838 and 0.834, respectively. Both models nevertheless fall short of our proposed method, which outperforms them on all three metrics: accuracy (0.861), F1-score (0.845), and AUC-ROC (0.897).

On Dataset 2, the trend remains consistent. CAT networks and orthogonal random forests again outperform other baselines. However, their AUC-ROC values (0.835 and 0.831) and F1 scores (0.758 and 0.753) are still lower than our model’s. The proposed method achieves an AUC-ROC of 0.892, F1 score of 0.839, and accuracy of 0.855.

2. Performance of Hybrid Causal Models. Table 9 compares six model variants built from different combinations of TITAN, DRLearner, MLP, and reinforcement learning (RL) on the two datasets. The proposed model, which combines TITAN, DRLearner, and RL, performs best on both datasets across all three evaluation metrics. On Dataset 1, it records the highest accuracy (0.861), F1-score (0.845), and AUC-ROC (0.897); on Dataset 2, it achieves similarly strong results, with an accuracy of 0.855, an F1-score of 0.839, and an AUC-ROC of 0.892.

Table 9.

Performance of ensemble models using TITAN, DRLearner, MLP, and RL on datasets 1 and 2

Model Dataset 1 Dataset 2
Accuracy F1-Score ROC-AUC Accuracy F1-Score ROC-AUC
CAUSALRLSTACK 0.861 0.845 0.897 0.855 0.839 0.892
TITAN 0.789 0.767 0.845 0.788 0.765 0.841
DRLearner 0.762 0.729 0.818 0.755 0.722 0.812
MLP + Causal 0.705 0.662 0.729 0.699 0.654 0.723
TITAN + MLP 0.749 0.722 0.799 0.740 0.713 0.793
TITAN + MLP + RL 0.796 0.778 0.843 0.790 0.772 0.837

3. Average Treatment Effect (ATE) Comparison. Table 10 presents the average treatment effect (ATE) estimated by our model and by several established causal inference methods. The ATE is the difference in expected outcomes between the treated and control groups, providing a basis for assessing how each method estimates causal effects.

Table 10.

Comparison of average treatment effect (ATE) across models

Method Dataset 1 (ATE) Dataset 2 (ATE)
CAUSALRLSTACK 0.247 0.243
TITAN 0.235 0.231
DRLearner 0.230 0.227
Causal Forest 0.231 0.227
Causality-Aware Transformer (CAT) Networks 0.236 0.233
Orthogonal Random Forests 0.232 0.229
Double/Debiased ML 0.233 0.231

4. Ablation Study on Model Components. Table 11 presents the results of the ablation study, which evaluates the contribution of each core component of the proposed system. The complete model, which includes TITAN, DRLearner, the RL ensemble, uncertainty estimation, and concept drift detection, achieves the highest performance on both datasets: an accuracy of 0.861, an F1-score of 0.845, and an average treatment effect (ATE) of 0.247 on Dataset 1, and an accuracy of 0.855, an F1-score of 0.839, and an ATE of 0.243 on Dataset 2. These results confirm the effectiveness and robustness of the full architecture.

Table 11.

Ablation study: performance impact of each system component across two datasets

Technique Dataset 1 Dataset 2
Accuracy F1 ATE Accuracy F1 ATE
CAUSALRLSTACK 0.861 0.845 0.247 0.855 0.839 0.243
Without RL Ensemble 0.789 0.767 0.235 0.788 0.765 0.231
Without TITAN 0.762 0.729 0.230 0.755 0.722 0.227
Without DRLearner 0.779 0.753 0.232 0.773 0.748 0.229
Without Causal graph 0.815 0.792 0.239 0.807 0.787 0.235
Without uncertainty estimation 0.850 0.831 0.244 0.843 0.825 0.241
Without concept drift detection 0.857 0.838 0.246 0.849 0.831 0.242

5. Uncertainty Analysis. Table 12 reports the predictive uncertainty of several model variants, estimated with Monte Carlo dropout. The proposed model shows the lowest uncertainty on both datasets, with scores of 0.093 on Dataset 1 and 0.092 on Dataset 2, indicating greater confidence in its predictions. These results underscore the effectiveness of integrating deep representation learning (TITAN), causal effect estimation (DRLearner), and ensembling through reinforcement learning; the low uncertainty suggests improved generalization and more reliable outputs.

Table 12.

Monte Carlo dropout: predictive uncertainty across two datasets

Method Dataset 1 Dataset 2
CAUSALRLSTACK 0.093 0.092
TITAN 0.126 0.127
DRLearner 0.133 0.134
TITAN + MLP + RL 0.102 0.103
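The Monte Carlo dropout procedure behind Table 12 keeps dropout active at inference time and uses the spread of repeated stochastic forward passes as the uncertainty. A minimal sketch with an illustrative two-layer network (not TITAN; weights and dropout rate are hypothetical):

```python
import numpy as np

def mc_dropout_uncertainty(x, w1, w2, p=0.1, n_samples=100, seed=0):
    """Monte Carlo dropout sketch: run several stochastic forward passes
    with dropout enabled and report the std of the predicted probability
    as the per-sample uncertainty (illustrative network)."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_samples):
        h = np.maximum(x @ w1, 0.0)                        # hidden layer with ReLU
        mask = rng.binomial(1, 1 - p, h.shape) / (1 - p)   # inverted dropout mask
        logit = (h * mask) @ w2                            # stochastic forward pass
        preds.append(1.0 / (1.0 + np.exp(-logit)))         # sigmoid output
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)           # prediction, uncertainty
```

Averaging the dataset-level standard deviations yields a single uncertainty score per model, comparable to the values reported in the table.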

Discussion

The results shown in Table  8 clearly indicate that the proposed model, CAUSALRLSTACK, significantly outperforms several well-known causal inference approaches. In Dataset 1, the leading baseline models, Causality-Aware Transformer (CAT) and Orthogonal Random Forests (ORF), achieved accuracy scores of 0.784 and 0.781, with corresponding AUC-ROC values of 0.838 and 0.834. In contrast, CAUSALRLSTACK achieved an impressive accuracy of 0.861, an F1-score of 0.845, and an AUC-ROC of 0.897, all of which are substantially higher than the baseline models.

The performance trend observed in Dataset 2 remains consistent, with CAUSALRLSTACK again leading the results, achieving an AUC-ROC score of 0.892 and an F1 score of 0.839. In comparison, both CAT and ORF recorded AUC-ROC scores of 0.835 and 0.831, respectively. As demonstrated in Table  8, the consistency of these results in both datasets emphasizes the robustness of the model and the strong capacity for generalization in varying data distributions.

The superior performance of CAUSALRLSTACK can be attributed to its ability to effectively capture complex relationships between covariates and treatment effects, surpassing conventional methods. Traditional statistical approaches such as Double Machine Learning (DML), Causal Forest, and X-Learner often rely on linear assumptions or rigid tree structures, which may not be suitable for complex or non-stationary datasets, such as those related to HIV outcomes. On the other hand, deep generative models like CEVAE offer representational flexibility but often struggle to maintain high predictive accuracy. This may be due to a lack of strong causal structure guidance or the underutilization of dependencies in the observed data.

By combining deep representation learning, doubly robust causal estimation, and reinforcement learning-based adaptive ensembling mechanisms, CAUSALRLSTACK enhances both accuracy and stability in real-world applications. These features are particularly important in high-stakes situations, as reliable predictions lead to better decision-making in areas such as intervention prioritization, allocation of healthcare resources, and public health planning.

After demonstrating that CAUSALRLSTACK outperforms leading causal inference methods on both datasets, we conducted a detailed analysis of the contribution of each core component of the proposed architecture. The results in Table 9 illustrate the specific roles of deep representation learning, doubly robust causal estimation, and the adaptive ensembling mechanism in improving the overall effectiveness of the model.

The experimental findings in Table 9 highlight the importance of combining the three core components: deep representation learning (TITAN), doubly robust causal effect estimation (DRLearner), and the reinforcement learning (RL)-based adaptive ensemble. Among the individual models, TITAN consistently outperforms DRLearner on both datasets, underscoring the value of counterfactual representations guided by causal structure. However, neither model alone matches the fully integrated framework. Combining TITAN with a multilayer perceptron (MLP) yields moderate improvements over the MLP alone, but substantial gains are realized only once reinforcement learning is incorporated (TITAN + MLP + RL), emphasizing the critical role of adaptive ensembling. The weakest results come from the MLP + Causal variant, which lacks both deep representation learning and reinforcement-based adaptation. These trends hold consistently across both datasets.

Although predictive metrics such as accuracy and F1-score give a general picture of model performance, they do not directly reflect a model's ability to estimate causal effects. To address this limitation, we further evaluated the models on their estimated average treatment effect (ATE), shown in Table 10. The proposed model produced ATE values of 0.247 and 0.243 on the two datasets, higher than alternatives such as DRLearner (0.230 and 0.227) and Causal Forest (0.231 and 0.227). These results highlight the model's enhanced ability to distinguish between treated and untreated groups (e.g., in contexts such as HIV testing or treatment), ultimately improving the accuracy of intervention-effect estimates. Such precision is crucial for designing effective public health policies.

In addition to estimating the effects of treatment, we conducted an ablation study to evaluate the individual contributions of each component in the CAUSALRLSTACK architecture. As presented in Table  11, the results show that every element is critical to the performance of the model. Removing any module, be it TITAN, DRLearner, the reinforcement learning ensemble, the causal graph, the uncertainty estimation module, or the concept drift detector, resulted in a noticeable decline or instability in performance across both datasets. This highlights the integrated nature of the architecture and emphasizes the importance of each module in ensuring the model’s robustness and adaptability in real-world scenarios with changing data distributions.

The results of the uncertainty analysis (see Table 12) indicate that the proposed model demonstrates the lowest level of predictive uncertainty, with values of 0.093 and 0.092. This reflects its superior generalizability and high reliability. In contrast, individual models, such as TITAN and DRLearner, exhibit significantly higher levels of uncertainty. Even the hybrid configuration (TITAN + MLP + RL) shows some improvement but still does not match the stability of the complete model. This underscores the crucial role of causal structure and doubly robust estimation in reducing predictive uncertainty. These findings emphasize the importance of integrating causal reasoning with uncertainty modeling, particularly in high-stakes prediction tasks such as HIV treatment planning.

In general, these findings underscore the flexibility and practical utility of the CAUSALRLSTACK framework in analyzing HIV surveillance data. By jointly modeling both observational and interventional information, the model demonstrates adaptability within HIV-related datasets. Through the integration of deep representation learning, causal inference, and uncertainty estimation, CAUSALRLSTACK provides a robust foundation for evidence-based decision-making in HIV prevention and treatment strategies.

Conclusion

This study presented CAUSALRLSTACK, a modular framework that integrates deep representation learning, doubly robust causal estimation, and reinforcement learning-based adaptive ensembling. Experiments on two HIV-related datasets showed that the proposed model outperformed baseline approaches: it achieved higher predictive accuracy, produced more reliable treatment effect estimates, and reduced uncertainty, underscoring its value for evidence-based decision-making in HIV care.

Despite these promising results, the framework has several limitations. First, the integration of multiple components, including TITAN, DRLearner, and the reinforcement learning–based ensemble, improves accuracy but also increases computational complexity and requires substantial training resources. Second, all evaluations have been limited to HIV datasets, so the generalizability of the model to other domains has not yet been tested. Third, the experiments relied on preprocessed Kaggle datasets, which may constrain the richness and variability of real-world clinical data. Moreover, the provenance of these datasets through Kaggle introduces an additional limitation, as secondary distribution may not fully preserve the completeness or consistency of the original data sources. Finally, while the framework demonstrates strong methodological contributions, the current work serves primarily as a technical validation, and further efforts are needed to translate the results into substantive public health insights, such as subgroup analyses or targeted intervention strategies.

Future work could aim to improve the efficiency of the model and broaden its applicability. One potential direction is the development of lightweight versions and model compression techniques that reduce complexity while preserving core mechanisms. Another is to validate the framework on datasets from diverse fields, such as other diseases or precision medicine, to assess its robustness and adaptability in real-world contexts. In addition, future research should move beyond preprocessed benchmark datasets by incorporating raw or real-world clinical data to better capture data heterogeneity and complexity. Finally, efforts are needed to translate the technical advances into substantive public health insights, for example through subgroup analyses or targeted intervention simulations that directly inform policy and practice.

Acknowledgements

The authors thank the Institute of Information Technology, Vietnam Academy of Science and Technology, for providing computational resources used in model training.

Author contributions

D.T.P. conceived and designed the study, performed the experiments, and drafted the initial manuscript. K.Q.T. implemented the CAUSALRLSTACK framework, developed the code, and contributed to data preprocessing and experimental validation. V.A.N. provided supervision, methodological guidance, and critical revisions of the manuscript. All authors read and approved the final manuscript.

Funding

No specific funding was received for this study.

Declaration

Ethical approval

This study does not require ethical approval, as all data used are publicly available and fully anonymized. Specifically, we used the following datasets from the Kaggle platform: 1. EDHS HIV/AIDS Dataset by Daniel Mesafint (https://www.kaggle.com/datasets/danielmesafint1985/edhs-hivaids-dataset). 2. AIDS Virus Infection Prediction Dataset by Aadarsh Velu (https://www.kaggle.com/datasets/aadarshvelu/aids-virus-infection-prediction). These datasets contain no personally identifiable information. The use of these publicly available and anonymized datasets is in accordance with the data use policies of the Kaggle platform.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Hernán MA, Robins JM. Causal inference: what if. Boca Raton, FL: Chapman & Hall/CRC; 2020. [Google Scholar]
  • 2.Pearl J. Causality: models, reasoning, and inference. 2nd edn. Cambridge, UK: Cambridge University Press; 2009. [Google Scholar]
  • 3.Hernán MA. Causal inference for personalized medicine. In: Pauly MV, McGuire TG, Barros PP, editors. Handbook of health economics. Vol. 2. Amsterdam: Elsevier; 2011. p. 181–206. 10.1016/B978-0-444-53592-4.00002-1. [Google Scholar]
  • 4.Dehejia RH, Lleras-Muney A. Causal inference for personalized medicine. In: Pauly MV, McGuire TG, Barros PP, editors. Handbook of health economics. Vol. 2. Amsterdam: Elsevier; 2011. p. 181–206. 10.1016/B978-0-444-53592-4.00002-1. [Google Scholar]
  • 5.HIV-CAUSAL Collaboration. The effect of combined antiretroviral therapy on the overall mortality of HIV-infected individuals. AIDS. 2010;24(1):123–37. 10.1097/QAD.0b013e3283324283. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Sewak A, Lodi S, Li X, Shu D, Wen L, Mayer KH, Krakower D, Young J, Marcus JL. Causal effects of stochastic PrEP interventions on HIV incidence among men who have sex with men. Am J Epidemiol. 2024;193(1):6–16. 10.1093/aje/kwad097. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Nosyk B, Min JE, Colley G, Lima VD. The causal effect of opioid substitution treatment on highly active antiretroviral treatment adherence. Drug Alcohol Depend. 2015;146:53–54. 10.1016/j.drugalcdep.2014.09.516. [Google Scholar]
  • 8.Imbens GW, Rubin DB. Causal inference for statistics, social, and biomedical sciences. Cambridge, UK: Cambridge University Press; 2015. 10.1017/CBO9781139025751. [Google Scholar]
  • 9.Schölkopf B, Locatello F, Bauer S, Ke NR, Kalchbrenner N, Goyal A, Bengio Y. Toward causal representation learning. Proc IEEE. 2021;109(5):612–34. 10.1109/JPROC.2021.3058954. [Google Scholar]
  • 10.Peters J, Janzing D, Schölkopf B. Elements of causal inference: foundations and learning algorithms. Cambridge, MA: MIT Press; 2017. [Google Scholar]
  • 11.Sewak A, Lodi S, Li X, Shu D, Wen L, Mayer KH, Krakower DS, Young JG, Marcus JL. Causal effects of stochastic PrEP interventions on HIV incidence among men who have sex with men. Am J Epidemiol. 2024;193(1):6–16. 10.1093/aje/kwad097. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Wen L, Marcus JL, Young JG. Intervention treatment distributions that depend on the observed treatment process and model double robustness in causal survival analysis. Stat Methods Med Res. 2023;32(3):509–23. 10.1177/09622802221146311. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Zídek M, Titman AC, Witte JS, Peto J, Keogh RH. causalcmprsk: an R package for nonparametric and Cox-based estimation of average treatment effects in competing risks data. Comput Methods Programs Biomed. 2022;226:107161. 10.1016/j.cmpb.2022.107161. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Yao L, Chu Z, Li S, Li Y, Gao J, Zhang A. A survey on causal inference. ACM Comput Surv. 2022;55(9):1–38. 10.1145/3528008. [Google Scholar]
  • 15.Jiao L, et al. Causal inference meets deep learning: a comprehensive survey. Research. 2024;2024(46):1–30. 10.34133/research.0467. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Jantscher M, Gunzer F, Reishofer G, Kern R. Causal insights from clinical information in radiology: enhancing future multimodal AI development. Comput Methods Programs Biomed. 2025;268:108810. 10.1016/j.cmpb.2025.108810. [DOI] [PubMed] [Google Scholar]
  • 17.Chu J, Zhang Y, Huang F, Si L, Huang S, Huang Z. Disentangled representation for sequential treatment effect estimation. Comput Methods Programs Biomed. 2022;226:107175. 10.1016/j.cmpb.2022.107175. [DOI] [PubMed] [Google Scholar]
  • 18.Susukida R, Amin-Esmaeili M, Badillo-Goicoechea E, Nguyen TQ, Stuart EA, Rosenblum M, Dunn KE, Mojtabai R. Application of causal forest model to examine treatment effect heterogeneity in substance use disorder psychosocial treatments. Int J Methods Psychiatr Res. 2025;34(1):70011. 10.1002/mpr.70011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, Robins J. Double/Debiased machine learning for treatment and causal parameters. Econom J. 2018;21(1):1–68. 10.1111/ectj.12097. [Google Scholar]
  • 20.Oprescu M, Syrgkanis V, Wu ZS. Amortized inference for heterogeneous treatment effects. In: Chaudhuri K, Salakhutdinov R, editors. Proceedings of the 36th International Conference on Machine Learning. Vol. 97 of Proceedings of Machine Learning Research. Long Beach, California, USA: PMLR; 2019. p. 4932–41.
  • 21.Künzel SR, Sekhon JS, Bickel PJ, Yu B. Meta-learners for estimating heterogeneous treatment effects using machine learning. Proc Natl Acad Sci U S A. 2019;116(10):4156–65. 10.1073/pnas.1804597116.
  • 22.Duan T, Rajpurkar P, Laird D, Ng AY, Basu S. Clinical value of predicting individual treatment effects for intensive blood pressure therapy: a machine learning experiment to estimate treatment effects from randomized trial data. Circ Cardiovasc Qual Outcomes. 2019;12(3):e005010. 10.1161/CIRCOUTCOMES.118.005010. [Retracted]
  • 23.Louizos C, Shalit U, Mooij JM, Sontag D, Zemel R, Welling M. Causal effect inference with deep latent-variable models. In: Advances in neural information processing systems (NeurIPS). Vol. 30. Long Beach, California, USA: Curran Associates, Inc; 2017. p. 6446–56. 10.5555/3295222.3295391.
  • 24.Wang R, Liu Y, Cao Y, Yao L. Causality-aware transformer networks for robotic navigation. arXiv preprint arXiv:2409.02669. 2024. 10.48550/arXiv.2409.02669.
  • 25.Haber N, Pillay D, Porter K, Bärnighausen T. Cascade of HIV diagnosis and care in sub-Saharan Africa: a systematic review. Curr Opin HIV AIDS. 2017;12(2):111–19. 10.1097/COH.0000000000000345.
  • 26.Cheng D, Li J, Liu L, Liu J, Le TD. Data-driven causal effect estimation based on graphical causal modelling: a survey. ACM Comput Surv. 2024;57. 10.1145/3636423.
  • 27.Cheng D, Li J, Liu L, Yu K, Le TD, Liu J. Toward unique and unbiased causal effect estimation from data with hidden variables. IEEE Trans Neural Netw Learn Syst. 2022:1–13. 10.1109/TNNLS.2021.3133337.
  • 28.Xu Z, Cheng D, Li J, Liu J, Liu L, Yu K. Causal inference with conditional front-door adjustment and identifiable variational autoencoder. In: International Conference on Learning Representations (ICLR); 2024.
  • 29.Xu Z, Cheng D, Li J, Liu J, Liu L, Wang K. Disentangled representation for causal mediation analysis. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2023;37:10666–74. 10.1609/aaai.v37i9.26266.
  • 30.Renero J, Ochoa I, Maestre R. Rex: causal discovery based on machine learning and explainability techniques. arXiv preprint arXiv:2501.12706. 2025. 10.48550/arXiv.2501.12706.
  • 31.Tennant PWG, Murray EJ, Arnold KF, Berrie L, Fox MP, Gadd SC, Kaplun IA, Keogh RH, Lawn RB, Morris TP, et al. Use of directed acyclic graphs (DAGs) to identify confounders in applied health research: review and recommendations. Int J Epidemiol. 2021;50(2):620–32. 10.1093/ije/dyaa213.
  • 32.Textor J, van der Zander B, Gilthorpe MS, Liskiewicz M, Ellison GTH. Robust causal inference using directed acyclic graphs: the R package 'dagitty'. Int J Epidemiol. 2016;45(6):1887–94. 10.1093/ije/dyw341.
  • 33.Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology. 1999;10(1):37–48. 10.1097/00001648-199901000-00008.
  • 34.Behrouz A, Zhong P, Mirrokni V. Titans: learning to memorize at test time. arXiv preprint arXiv:2501.00663. 2025. 10.48550/arXiv.2501.00663.
  • 35.Kennedy EH. Towards optimal doubly robust estimation of heterogeneous causal effects. Electron J Stat. 2023;17(2):3008–49. 10.1214/23-EJS2157.
  • 36.Song H, Chen Q, Jiang T, Li Y, Li X, Xi W, Huang S. Applying ensemble models based on graph neural network and reinforcement learning for wind power forecasting. arXiv preprint arXiv:2501.16591. 2025. 10.48550/arXiv.2501.16591.
  • 37.Mesafint D. EDHS HIV/AIDS dataset. 2022. https://www.kaggle.com/datasets/danielmesafint1985/edhs-HIVAIDS-dataset. Accessed: 2025-05-07.
  • 38.Belete DM, Huchaiah MD. Performance evaluation of classification models for HIV/AIDS dataset. In: Data management, analytics and innovation. Lecture notes on data engineering and communications technologies. Vol. 70. Pune, India: Springer; 2021. p. 109–25. 10.1007/978-981-16-2934-1_7.
  • 39.Belete DM, Huchaiah MD. Wrapper based feature selection techniques on EDHS-HIV/AIDS dataset. Int J Adv Comput Sci Appl (IJACSA). 2021;12(7):368–74. 10.14569/IJACSA.2021.0120745.
  • 40.Velu A. AIDS virus infection prediction. 2022. https://www.kaggle.com/datasets/aadarshvelu/AIDS-virus-infection-prediction. Accessed: 2025-05-07.
  • 41.Hammer SM, Katzenstein DA, Hughes MD, Gundacker H, Schooley RT, Haubrich R, Henry WK, Lederman MM, Phair JP, Niu M, Hirsch MS, Merigan TC. A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter. N Engl J Med. 1996;335(15):1081–90. 10.1056/NEJM199610103351501.
  • 42.Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57. 10.1613/jair.953.
  • 43.Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30. 10.5555/3295222.3295230.
  • 44.Mooij JM, Peters J, Janzing D, Zscheischler J, Schölkopf B. Distinguishing cause from effect using observational data: methods and benchmarks. J Mach Learn Res. 2016;17(32):1–102.
  • 45.Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. Attention is all you need. Adv Neural Inf Process Syst. 2017;30:5998–6008. 10.5555/3295222.3295349.
  • 46.Mao A, Mohri M, Zhong Y. Cross-entropy loss functions: theoretical analysis and applications. In: Proceedings of the 40th International Conference on Machine Learning (ICML). Vol. 202 of Proceedings of Machine Learning Research. Honolulu, Hawaii, USA: PMLR; 2023. p. 23803–28.
  • 47.Shalit U, Johansson FD, Sontag D. Estimating individual treatment effect: generalization bounds and algorithms. In: International Conference on Machine Learning. PMLR; 2017. p. 3076–85.
  • 48.Sutton RS, Barto AG. Reinforcement learning: an introduction. 2nd edn. Cambridge, Massachusetts, USA: MIT Press; 2018.
  • 49.Puterman ML. Markov decision processes: discrete stochastic dynamic programming. New York: John Wiley & Sons; 1994.
  • 50.Konda VR, Tsitsiklis JN. Actor-critic algorithms. Adv Neural Inf Process Syst. 1999;12:1008–14.
  • 51.Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, Petersen S, Beattie C, Sadik A, Antonoglou I, King H, Kumaran D, Wierstra D, Legg S, Hassabis D. Human-level control through deep reinforcement learning. Nature. 2015;518(7540):529–33. 10.1038/nature14236.
  • 52.Fujimoto S, van Hoof H, Meger D. Addressing function approximation error in actor-critic methods. In: Proceedings of the 35th International Conference on Machine Learning. PMLR; 2018. p. 1582–91.
  • 53.Zhang S, Yao H, Whiteson S. Breaking the deadly triad with a target network. In: International Conference on Machine Learning. PMLR; 2021. p. 12492–502.
  • 54.Gal Y, Ghahramani Z. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: International Conference on Machine Learning. PMLR; 2016. p. 1050–59. 10.48550/arXiv.1506.02142.
  • 55.Kendall A, Gal Y. What uncertainties do we need in Bayesian deep learning for computer vision? Adv Neural Inf Process Syst. 2017;30.
  • 56.Gama J, Žliobaitė I, Bifet A, Pechenizkiy M, Bouchachia A. A survey on concept drift adaptation. ACM Comput Surv (CSUR). 2014;46(4):1–37.
  • 57.Louizos C, Shalit U, Mooij JM, Sontag D, Zemel R, Welling M. Causal effect inference with deep latent-variable models. In: Advances in neural information processing systems. Vol. 30. Curran Associates, Inc; 2017. p. 6446–56. 10.5555/3295222.3295391.

Articles from BioData Mining are provided here courtesy of BMC