Abstract
Developing AI tools that preserve fairness is of critical importance, specifically in high-stakes applications such as those in healthcare. However, health AI models' overall prediction performance is often prioritized over the possible biases such models could have. In this study, we show one possible approach to mitigate bias concerns by having healthcare institutions collaborate through a federated learning paradigm (FL; a popular choice in healthcare settings). While FL methods with an emphasis on fairness have been previously proposed, their underlying model and local implementation techniques, as well as their possible applications to the healthcare domain, remain widely underinvestigated. Therefore, we propose a comprehensive FL approach with adversarial debiasing and a fair aggregation method, suitable for various fairness metrics, in the healthcare domain where electronic health records are used. Not only does our approach explicitly mitigate bias as part of the optimization process, but the FL paradigm also implicitly helps address data imbalance and increase data size, offering a practical solution for healthcare applications. We empirically demonstrate our method's superior performance in multiple experiments simulating large-scale real-world scenarios and compare it to several baselines. Our method achieves promising fairness performance with the lowest impact on overall predictive performance (accuracy). Our code is available at https://github.com/healthylaife/FairFedAvg.
Keywords: Federated Learning, Algorithmic Fairness, Adversarial Fairness
1. INTRODUCTION
The growing availability of electronic health records (EHRs, referring to the digitized collections of individuals' health information) offers great hope for developing effective tools to inform prevention and treatment interventions across healthcare. The challenges of working with EHRs (such as their large scale, temporality, and missingness [49]) present a ripe context for machine learning (ML) methods dedicated to this special type of dataset. Similar to other types of human-generated and observational data, EHRs can carry and be a source of various types of bias (especially historical and systematic ones) [6]. In current healthcare ecosystems already dealing with longstanding concerns about health disparities (especially in places such as the US [32]), AI models created from EHRs can act as double-edged swords: while having huge potential to benefit some individuals, if not designed carefully, they can also unfairly impact others, especially underrepresented populations [42]. Although justice (including fairness) in healthcare is a vast field and a growing body of studies has recently looked at the fairness implications of AI models, studies dedicated to AI models in healthcare are limited [3, 40]. The present study presents customized methods capable of addressing fairness in an important subset of AI health models.
As imbalanced training data is generally considered one of the primary roots of unfairness in AI models [30], having access to a larger dataset containing more samples from the smaller classes is an ideal way to achieve better learning opportunities for minority groups and thus improve fairness (broadly defined). In fact, existing studies targeting fairness mitigation in AI health models commonly assume access to large datasets [9, 14, 35]. However, such access remains very limited in the case of EHRs, and sharing data across different sites faces numerous concerns about data privacy, security, and interoperability [23]. Distributed learning paradigms, commonly referred to as federated learning (FL), are among the most popular choices in healthcare for alleviating such concerns. FL allows multiple healthcare sites to train a common large-scale model without sharing their local datasets. Investigating the fairness aspects of FL methods in EHR-based predictive models is the primary subset of AI health models that our study focuses on.
In a general FL setting, each site shares a common large-scale model without sharing its respective dataset. This design allows healthcare institutions to collaborate and jointly train a better model. While training on the larger data accessible through multiple sites can help improve fairness (by increasing the number of samples in the minority groups), the non-IID data (IID: independent and identically distributed) across sites can lead to new fairness issues. Some studies have addressed fairness in FL settings through local debiasing [16] and fairer aggregation strategies [13], but the presented methods are often limited in several ways. First, fairness mitigation methods for FL settings often solely focus on the aggregation method on the server side [13], leaving the local debiasing part to the discretion of each client (and unknown to the ML practitioners). Though this allows for more flexibility on the client side, a unified debiasing method, where clients can build upon knowledge acquired by others, allows for better removal of sensitive information through a more diverse pool of training instances. Second, these methods assume a binary sensitive attribute (such as gender) [13, 16]. For example, the FairFed method [13] extended FedAvg1 [29] to alleviate group fairness concerns but solely considered binary sensitive attributes. However, studies have shown that non-binary attributes (such as race and insurance type) are usually sources of greater disparities in AI health models [42]. It is therefore pivotal to study those biases and develop methods that are fair regardless of the source of bias.
In this paper, we aim to address the aforementioned gaps in the literature by coupling local adversarial debiasing [12] with an extension of FedAvg [29] and FairFed [13]. Specifically, we aim to implement techniques to improve fairness both at the local and aggregation level. In doing so, our study makes the following primary contributions:
We propose a comprehensive FL architecture with local adversarial debiasing of sensitive information in the context of healthcare where EHRs are used.
We introduce a new fairness-metric-agnostic aggregation method for FL settings that can achieve better global fairness with non-binary sensitive attributes.
We demonstrate our method’s performance by running an extensive series of experiments on multiple EHR datasets and comparing it to other baselines.
2. PROBLEM FORMULATION
FL allows multiple clients (healthcare sites) to collaboratively train a single model with a set of parameters, denoted by $\theta$ in our work. Amongst the many possible ways to formulate an FL problem, here, we describe it as a minimization problem of a weighted loss function across clients:

$$\min_{\theta} \sum_{k=1}^{K} w_k \mathcal{L}_k(\theta), \qquad (1)$$

where $\mathcal{L}_k$ and $w_k$ are the loss and the weight of the $k$-th client, respectively, such that $\sum_{k=1}^{K} w_k = 1$. As proposed by FedAvg [29], a common workflow for an FL algorithm is based on the following steps: for each round $t$ of FL, the server randomly samples a set of clients and sends the last updated model parameters $\theta^t$ to the clients. In parallel, each client $k$ locally trains the model starting from $\theta^t$ and sends its newly updated parameters $\theta_k^{t+1}$ back to the server. Lastly, the server aggregates the clients' parameters to form a new set of federated parameters such that $\theta^{t+1} = \sum_{k=1}^{K} w_k \theta_k^{t+1}$. We provide a visual description of the FL process in a healthcare environment in Fig. 1. We do not study peer-to-peer designs, as client-server designs are more practical in real healthcare settings involving multiple sites, where usually a mutually trusted site can act as the server, avoiding the complexities that peer-to-peer structures would add.
To each client, FedAvg assigns a weight proportional to the size of its respective dataset, such that the client with the largest dataset has the greatest impact on the overall optimization problem. Specifically, FedAvg defines each client's weight as $w_k = n_k / n$, where $n_k$ is the number of instances in client $k$'s dataset, and $n = \sum_{k=1}^{K} n_k$ is the total number of instances. This weighting mechanism ensures that the global model is representative of the data distribution across all the participating clients. Though FedAvg provides strong convergence guarantees [26], its obliviousness to sensitive attributes can be a source of bias when one of the clients with the largest assigned weights produces model parameters that carry strong biases [51].
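To make this workflow concrete, below is a minimal sketch (in Python/NumPy) of one FedAvg round under the weighting above; `fedavg_round` and the `local_train` callables are illustrative names, and the local optimization itself is a placeholder for the training described in Section 3.

```python
import numpy as np

def fedavg_round(global_params, clients):
    """One FedAvg round: broadcast parameters, train locally, average the updates.

    global_params: list of NumPy arrays (one per model layer).
    clients: list of (local_train_fn, n_k) pairs, where local_train_fn stands in
             for the local optimization and returns updated arrays.
    """
    n_total = sum(n_k for _, n_k in clients)
    updates, weights = [], []
    for local_train, n_k in clients:
        updates.append(local_train(global_params))  # theta_k^{t+1}
        weights.append(n_k / n_total)               # FedAvg weight w_k = n_k / n
    # theta^{t+1} = sum_k w_k * theta_k^{t+1}, applied layer by layer
    return [sum(w * u[i] for w, u in zip(weights, updates))
            for i in range(len(global_params))]

# Toy usage: two clients (n_1 = 100, n_2 = 50) that return the parameters unchanged
theta = fedavg_round([np.zeros(3)], [(lambda p: p, 100), (lambda p: p, 50)])
```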
Fairness Metrics:
Because of the sensitive nature of machine learning applications in healthcare and the possibly discriminatory predictions made by such techniques, it is pivotal to define and quantify fairness metrics when evaluating new models. While our presented method is metric-agnostic, supporting any desired fairness metric, we focus on metrics related to group fairness [47]. From the wide range of metrics used in the community, we focus on those that are most relevant to our study. Notably, we use parity-based metrics [47], which measure the difference between group outcomes, and minimax group fairness [11], which measures the performance of the worst group outcome.
We use the True Positive Rate (TPR) and Accuracy Parities, and the Worst-performing TPR. The TPR Parity requires that each group have the same opportunity of being correctly classified as positive, while the Accuracy Parity does not differentiate between positive and negative outcomes. In other words, each subgroup's TPR or accuracy should be independent of its sensitive attribute. Though it is clear how the absolute difference between two groups' TPRs or accuracies can quantify these parity-based metrics in the case of a binary sensitive attribute, that difference cannot be directly applied to non-binary sensitive attributes. A possible workaround would be to use the average difference across all possible subgroup pairs [22]; however, using the standard deviation is easier to interpret. Therefore, we propose to use the True positive rate Parity Standard Deviation (TPSD), defined as the standard deviation of the groups' TPRs, and the Accuracy Parity Standard Deviation (APSD), defined as the standard deviation of the subgroups' accuracies. Let us first note $A$ as a categorical sensitive attribute with $p$ possible values (race, for example), and $Y$ and $\hat{Y}$ as the ground truth and the prediction of a binary predictor, respectively. We can now formulate the TPSD and APSD as follows:
$$\mathrm{TPSD} = \sqrt{\frac{1}{p} \sum_{a=1}^{p} \left( \mathrm{TPR}_a - \overline{\mathrm{TPR}} \right)^2}, \quad \mathrm{TPR}_a = P(\hat{Y}=1 \mid Y=1, A=a), \qquad (2)$$

$$\mathrm{APSD} = \sqrt{\frac{1}{p} \sum_{a=1}^{p} \left( \mathrm{Acc}_a - \overline{\mathrm{Acc}} \right)^2}, \quad \mathrm{Acc}_a = P(\hat{Y}=Y \mid A=a), \qquad (3)$$

where $\overline{\mathrm{TPR}}$ (resp. $\overline{\mathrm{Acc}}$) is the average TPR (resp. accuracy) across all groups. For the case of minimax group fairness, we note the worst-performing TPR (the lowest) as Worst TPR and define it as follows:

$$\text{Worst TPR} = \min_{a \in \{1, \dots, p\}} \mathrm{TPR}_a. \qquad (4)$$
Note that, in the case of the parity-based metrics, a smaller value is preferable, while a higher value is sought after for the Worst TPR.
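As an illustration, the three metrics can be computed from arrays of predictions, labels, and group memberships as in the sketch below (function and variable names are ours).

```python
import numpy as np

def group_fairness_metrics(y_true, y_pred, groups):
    """Compute TPSD (Eq. 2), APSD (Eq. 3), and Worst TPR (Eq. 4) for a
    categorical sensitive attribute; groups with no positive instances are
    skipped when computing the TPR-based metrics."""
    tprs, accs = [], []
    for g in np.unique(groups):
        mask = groups == g
        accs.append(np.mean(y_true[mask] == y_pred[mask]))   # per-group accuracy
        positives = mask & (y_true == 1)
        if positives.any():
            tprs.append(np.mean(y_pred[positives] == 1))      # per-group TPR
    return float(np.std(tprs)), float(np.std(accs)), float(min(tprs))

# Example with a three-valued sensitive attribute
tpsd, apsd, worst_tpr = group_fairness_metrics(
    np.array([1, 1, 0, 1, 0, 1]), np.array([1, 0, 0, 1, 1, 1]),
    np.array(["a", "a", "b", "b", "c", "c"]))
```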
3. METHOD
Our study builds upon three separate studies and integrates them into our comprehensive FL design. The first is the work described by Elazar et al. [12], where an adversarial approach was developed to remove demographic information from text representations. We couple this method with a second work offering an attention-based bidirectional RNN model, called Dipole [27]. Dipole is one of the most studied EHR AI models and has been consistently used as a baseline in other studies. Using this model allows our FL design to capture the longitudinal intricacies of EHRs. The third method we adopt is FairFed [13], used to aggregate clients' trained parameters. This method promotes better fairness guarantees than FedAvg by allocating more weight to the clients with local fairness scores similar to the global one. Inspired by this idea, we propose to allocate higher weights to the clients producing consistently fairer model parameters. Specifically, each client locally trains a Dipole-based model with adversarial debiasing on its local dataset and reports its fairness metric to the server. The server then aggregates the models' parameters by allocating more weight to the fairer clients.
3.1. Dipole with Adversarial Debiasing
EHRs can contain a wide range of information about a patient's2 health, including demographic data, medical history, laboratory test results, and imaging reports. Without loss of generality, we present our method centered on a common subset that includes the condition, medication, and procedure codes in patients' health records. We note $\mathcal{C}$, $\mathcal{M}$, and $\mathcal{P}$, with sizes $|\mathcal{C}|$, $|\mathcal{M}|$, and $|\mathcal{P}|$, respectively, as the sets of the aforementioned codes. We also refer to $\mathcal{D} = \mathcal{C} \cup \mathcal{M} \cup \mathcal{P}$ as the set of all possible medical codes. Throughout this paper, we represent each patient as a sequence of visits $x_1, \dots, x_T$, where $T$ is the maximum number of visits and $x_t \in \{0, 1\}^{|\mathcal{D}|}$ is a one-hot encoded vector representing the medical codes assigned to the $t$-th visit for the patient.
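For clarity, the visit sequence of a single patient can be built as in the following sketch, assuming each condition, medication, and procedure code has already been mapped to an integer index in the combined vocabulary (the helper name is ours):

```python
import numpy as np

def encode_patient(visits, vocab_size, max_visits):
    """Turn a patient's visits (lists of code indices) into a
    (max_visits, vocab_size) binary matrix, padding shorter histories with
    all-zero visits."""
    x = np.zeros((max_visits, vocab_size), dtype=np.float32)
    for t, codes in enumerate(visits[:max_visits]):
        x[t, codes] = 1.0   # mark the codes recorded at visit t
    return x

# Example: a patient with two visits over a vocabulary of 10 medical codes
x = encode_patient([[0, 3], [2, 3, 7]], vocab_size=10, max_visits=4)
```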
We train a Dipole-based model with adversarial debiasing to learn sensitive-information-free temporal representations. Accordingly, our model is composed of the following three main components: visit embeddings, a bidirectional GRU, and an attention layer. We provide a visual representation of the entire architecture in Fig. 2. For each visit, the one-hot encoded vector $x_t$ is fed to a linear layer with a rectified linear unit activation function (ReLU) to create the visit embedding as $v_t = \mathrm{ReLU}(W_v x_t + b_v) \in \mathbb{R}^{m}$, where $m$ is the embedding size, $W_v$ is a weight matrix, and $b_v$ the bias vector. The visit embeddings are then fed to a bidirectional GRU, where the input data is processed in a forward and backward manner: $\overrightarrow{h_t} = \overrightarrow{\mathrm{GRU}}(v_t, \overrightarrow{h_{t-1}})$ and $\overleftarrow{h_t} = \overleftarrow{\mathrm{GRU}}(v_t, \overleftarrow{h_{t+1}})$, where $\overrightarrow{h_{t-1}}$ is the previous hidden state, $\overrightarrow{h_t} \in \mathbb{R}^{g}$ and $\overleftarrow{h_t} \in \mathbb{R}^{g}$ are the hidden states produced by the GRU layer for visit $t$ in the forward and backward direction, respectively, and $g$ is the size of the GRU's hidden states. We define the temporal visit representation as $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}] \in \mathbb{R}^{2g}$. We also note $h$ to be the vector containing all the temporal representations $h_1, \dots, h_T$.
Additionally, we adopt the general attention mechanism described in Dipole's original paper [27]. This attention layer is trained to create a context vector $c_t$ for the $t$-th visit from the temporal visit representations $h$. To depict an accurate representation of the patient at the $t$-th visit, the hidden state $h_t$ and the context vector $c_t$ are concatenated and fed into a linear layer with a $\tanh$ activation function: $r_t = \tanh(W_r [h_t; c_t] + b_r)$.
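The PyTorch sketch below outlines this encoder (embedding layer, bidirectional GRU, general attention, and the final tanh projection). It follows the spirit of Dipole [27] rather than the authors' released implementation; in particular, masking of padded visits and restricting attention to past visits are omitted for brevity, and all class and dimension names are ours.

```python
import torch
import torch.nn as nn

class DipoleStyleEncoder(nn.Module):
    """Visit embedding -> bidirectional GRU -> general attention -> tanh projection."""

    def __init__(self, vocab_size, emb_size, hidden_size, rep_size):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(vocab_size, emb_size), nn.ReLU())
        self.gru = nn.GRU(emb_size, hidden_size, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_size, 2 * hidden_size, bias=False)  # W_alpha for general attention
        self.proj = nn.Linear(4 * hidden_size, rep_size)                     # maps [h_t; c_t] to r_t

    def forward(self, x):                      # x: (batch, visits, vocab_size) binary visit vectors
        v = self.embed(x)                      # visit embeddings v_t
        h, _ = self.gru(v)                     # temporal representations h_t (forward/backward concatenated)
        scores = torch.bmm(self.attn(h), h.transpose(1, 2))   # general attention scores
        alpha = torch.softmax(scores, dim=-1)
        c = torch.bmm(alpha, h)                # context vectors c_t
        return torch.tanh(self.proj(torch.cat([h, c], dim=-1)))  # r_t for every visit

# Example: batch of 2 patients, 4 visits, vocabulary of 10 codes
encoder = DipoleStyleEncoder(vocab_size=10, emb_size=16, hidden_size=8, rep_size=32)
r = encoder(torch.zeros(2, 4, 10))   # r has shape (2, 4, 32)
```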
We use this representation to perform two tasks: a (binary) outcome prediction task, and an adversarial debiasing task following the ideas proposed in [12]. Specifically, we train two classifiers, where the first performs the binary prediction task and the second predicts the patients' respective sensitive attributes from $r_t$. We refer to the first classifier as the "predictor" and the second as the "adversary." The Dipole-based network introduced above is trained to produce representations that carry minimum sensitive information without important information loss for the downstream task. In other words, the network is trained to minimize the first classifier's loss while maximizing the second's loss.
Let us define the first linear classifier (the predictor), which uses the representation $r_t$ and is trained to perform the binary prediction task:

$$\hat{y} = \sigma(W_{pred} \, r_t + b_{pred}),$$

where $W_{pred} \in \mathbb{R}^{1 \times d}$, with $d$ being the size of the representation $r_t$. Similarly, we train the second linear classifier (the adversary) to predict the sensitive attribute of each patient:

$$\hat{a} = \mathrm{softmax}(W_{adv} \, r_t + b_{adv}),$$

where $W_{adv} \in \mathbb{R}^{p \times d}$, with $p$ being the number of possible values for the sensitive attribute. Let us now define the predictor's loss $\mathcal{L}_{pred}$ with the Binary Cross-Entropy loss (BCE), and the adversary's loss $\mathcal{L}_{adv}$ with the Cross-Entropy loss (CE):

$$\mathcal{L}_{pred} = \mathrm{BCE}(\hat{y}, y), \qquad \mathcal{L}_{adv} = \mathrm{CE}(\hat{a}, a).$$
To ensure that $r_t$ is free of sensitive information, we train our network following the adversarial approach presented by Elazar et al. [12]. That is, we simultaneously train the adversary to solely classify each patient based on their respective sensitive attribute, while the rest of the Dipole-based network is trained to produce representations from which the adversary cannot recover the sensitive attribute and that still support the outcome prediction task. Therefore, we define the network's loss as:

$$\mathcal{L} = \mathcal{L}_{pred} - \lambda \, \mathcal{L}_{adv},$$
where $\lambda$ is a hyperparameter that lets each client adjust the impact of the local debiasing, allowing it to opt out of the adversarial part of the training ($\lambda = 0$) at any time, similar to what was proposed in FADE [16]. This loss function forces the network to make it hard for the adversary to classify each patient into their respective sensitive attribute group (maximize $\mathcal{L}_{adv}$) while performing the prediction task.
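A sketch of one training step under this objective is shown below. We alternate between updating the adversary on its own loss and updating the encoder and predictor on $\mathcal{L}_{pred} - \lambda \mathcal{L}_{adv}$ (an equivalent effect can be obtained with a gradient-reversal layer). Using the representation of the last visit as the patient-level representation is our simplifying assumption, and all names are ours.

```python
import torch
import torch.nn.functional as F

def adversarial_step(encoder, predictor, adversary, opt_model, opt_adv, x, y, a, lam):
    """One alternating update.

    opt_adv optimizes only the adversary's parameters; opt_model optimizes the
    encoder and predictor. x: visit tensors, y: binary outcomes, a: sensitive
    attribute indices, lam: the debiasing weight (lambda)."""
    # (1) Train the adversary to recover the sensitive attribute from detached representations.
    r = encoder(x)[:, -1, :].detach()
    adv_loss = F.cross_entropy(adversary(r), a)
    opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()

    # (2) Train encoder + predictor: minimize the prediction loss while
    #     maximizing the adversary's loss (L = L_pred - lambda * L_adv).
    r = encoder(x)[:, -1, :]
    pred_loss = F.binary_cross_entropy_with_logits(predictor(r).squeeze(-1), y.float())
    adv_loss = F.cross_entropy(adversary(r), a)
    loss = pred_loss - lam * adv_loss
    opt_model.zero_grad(); loss.backward(); opt_model.step()
    return pred_loss.item(), adv_loss.item()
```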
3.2. Fair Aggregation with Non-Binary Sensitive Attributes
To ensure our model remains fair and accurate in an FL environment, we build upon the ideas proposed by FedAvg [29] and FairFed [13] and propose a new aggregation method for locally trained parameters that promotes global fairness. FedAvg solves the optimization problem described in Eq. 1 by assigning a larger weight to the clients with larger datasets. While this approach has been effective, it can lead to potentially severe biases if the clients with the largest datasets provide a biased set of parameters. Therefore, we propose to incorporate a fairness component in the aggregation process. Specifically, we assign the clients' weights based on a linear combination of their previous weight and their fairness metric, with the first round of weights assigned following the FedAvg method. Therefore, we assign greater weight to the consistently fairer and larger clients to push towards greater global fairness. To keep the proposed fairness mitigation method "metric-agnostic," we use a generic term $F$ to refer to the fairness metric to be optimized (with the convention that lower values of $F$ denote better fairness). Therefore, after initially assigning each client $k$'s weight with the original FedAvg formula,

$$w_k^0 = \frac{n_k}{n}, \qquad (5)$$
for each round $t$, we update the clients' weights as follows:

$$\bar{w}_k^t = w_k^{t-1} + \beta \left( \max_{j} F_j^t - F_k^t \right), \qquad (6)$$

where $F_k^t$ represents client $k$'s fairness score at round $t$, and $\beta$ is a hyperparameter to adjust the weight of the fairness component. When $F_k^t$ is undefined, we set it to the average score across all clients; otherwise, it is the score reported by client $k$. Note that when $\beta = 0$, our method reduces to the original FedAvg, as the weights would remain unchanged throughout the rounds. Following this formula, the clients with the larger $F$ scores (poorer fairness) receive a lower fairness reward, as the difference between $\max_j F_j^t$ and $F_k^t$ tends towards 0, and thus have a lower impact on the parameter aggregation. This ensures that the global model becomes incrementally fairer, as it is more impacted by the fairer sets of parameters at any given aggregation round. Lastly, we normalize the weights to ensure that $\sum_{k=1}^{K} w_k^t = 1$:

$$w_k^t = \frac{\bar{w}_k^t}{\sum_{j=1}^{K} \bar{w}_j^t}. \qquad (7)$$
We can now define the new set of the model's parameters as the weighted sum of the locally trained parameters:

$$\theta^{t+1} = \sum_{k=1}^{K} w_k^t \, \theta_k^{t+1}. \qquad (8)$$
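Putting Eqs. 5 to 8 together, the server-side aggregation can be sketched as follows (assuming, as above, that lower fairness scores $F$ are better, e.g., TPSD; the function names are ours):

```python
import numpy as np

def init_weights(n_k):
    """Eq. 5: initial FedAvg weights proportional to the local dataset sizes."""
    n_k = np.asarray(n_k, dtype=float)
    return n_k / n_k.sum()

def fair_weight_update(weights, scores, beta):
    """Eqs. 6-7: reward clients whose fairness score is far below the worst
    reported score, then re-normalize so the weights sum to one."""
    scores = np.asarray(scores, dtype=float)
    # Undefined scores (e.g., a client whose groups have no positives) default to the average.
    scores = np.where(np.isnan(scores), np.nanmean(scores), scores)
    w = np.asarray(weights) + beta * (scores.max() - scores)   # Eq. 6
    return w / w.sum()                                         # Eq. 7

def aggregate(client_params, weights):
    """Eq. 8: weighted average of the locally trained parameters, layer by layer."""
    return [sum(w * p[i] for w, p in zip(weights, client_params))
            for i in range(len(client_params[0]))]

# Example: three clients, beta = 1 (values are arbitrary)
w = init_weights([100, 50, 50])
w = fair_weight_update(w, scores=[0.05, 0.02, np.nan], beta=1.0)
```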
Following this method, each client solely has to communicate its fairness score $F_k$ alongside its respective number of training instances $n_k$, which can be done without privacy concerns through methods such as Secure Aggregation [43]. We provide a step-by-step description of the entire process in Algorithm 1.
Algorithm 1. FairFedAvg: fair aggregation with local adversarial debiasing.

Server executes:
  Initialize the model's parameters $\theta^0$
  Initialize the clients' weights $w_k^0 = n_k / n$ (Eq. 5)
  for each round $t$ do
    for each client $k$ in parallel do
      $\theta_k^{t+1}, F_k^t \leftarrow$ ClientUpdate($k$, $\theta^t$)
    end for
    Update the clients' weights $w_k^t$ using the reported $F_k^t$ (Eqs. 6 and 7)
    $\theta^{t+1} \leftarrow \sum_{k=1}^{K} w_k^t \theta_k^{t+1}$ (Eq. 8)
  end for

ClientUpdate($k$, $\theta$):
  $D_k$ refers to client $k$'s local dataset.
  $\hat{y}, y, a$: prediction, ground truth, and sensitive attribute, respectively.
  Train the local model on $D_k$ starting from $\theta$ with the adversarial loss $\mathcal{L} = \mathcal{L}_{pred} - \lambda \mathcal{L}_{adv}$
  Compute the local fairness score $F_k$
  return ($\theta_k$, $F_k$) to the server
4. EXPERIMENTS
To demonstrate the efficacy of the proposed method, we have conducted a series of experiments covering a wide range of scenarios in the healthcare domain. We have implemented our model using the Flower framework [4], a customizable FL framework supporting large-scale FL experiments on heterogeneous devices.
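As an illustration of how each client can return its updated parameters, dataset size $n_k$, and fairness score $F_k$ to the server, the sketch below uses Flower's NumPyClient interface; exact method signatures may differ across Flower versions, and the `train_fn`/`fairness_fn` callables are placeholders for the routines of Section 3.

```python
from collections import OrderedDict
import flwr as fl
import torch

class FairClient(fl.client.NumPyClient):
    """Minimal Flower client that trains locally and reports its fairness score."""

    def __init__(self, model, train_fn, fairness_fn, num_examples):
        self.model = model               # PyTorch model (encoder + predictor + adversary)
        self.train_fn = train_fn         # runs local training with adversarial debiasing (Sec. 3.1)
        self.fairness_fn = fairness_fn   # returns the local fairness score, e.g., TPSD (Sec. 2)
        self.num_examples = num_examples

    def get_parameters(self, config):
        return [v.cpu().numpy() for v in self.model.state_dict().values()]

    def set_parameters(self, parameters):
        keys = self.model.state_dict().keys()
        self.model.load_state_dict(
            OrderedDict({k: torch.tensor(v) for k, v in zip(keys, parameters)}))

    def fit(self, parameters, config):
        self.set_parameters(parameters)          # start from the broadcast federated parameters
        self.train_fn(self.model)                # local adversarial training
        fairness = float(self.fairness_fn(self.model))
        # Flower expects (parameters, num_examples, metrics); the fairness score travels
        # in the metrics dictionary and feeds the fair aggregation of Section 3.2.
        return self.get_parameters(config), self.num_examples, {"fairness": fairness}
```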
Throughout our experiments, we evaluated our method on four cohorts from two different datasets. Specifically, we have used 1) the Synthea dataset [46], a public synthetic EHR simulation program, which we used to generate two cohorts of patients from different US states, and 2) MIMIC-III [21], the most popular real-world EHR dataset, covering patients admitted to the Intensive Care Unit (ICU) of the Beth Israel Deaconess Medical Center in Boston, USA, from which we generated two cohorts, one IID (independent and identically distributed) and one non-IID.
Synthea data:
For each cohort from the Synthea dataset, we have generated a dataset where each state has a number of patients proportional to its respective real population (California has a larger dataset than Pennsylvania). We investigated fairness in a scenario (Most Populous States) where the five most populous US states (California, Texas, Florida, New York, and Pennsylvania) would collaborate (as the FL clients) and another one (Heterogeneous States) where we chose five US states with fairly different demographic distributions (Maine, Mississippi, Hawaii, New Mexico, and Alaska) to simulate a more diverse pool of clients. We focus on race as our sensitive attribute in our experiments since insurance type can change multiple times over the course of an individual's lifetime. Our model targets a mortality prediction task following an inpatient (hospital) visit. We present more details about our cohort extraction in Appendix A.
MIMIC data:
We generated two different cohorts from the MIMIC-III dataset, where we assign MIMIC-III patients to five different FL clients. For the non-IID cohort, we synthesize non-IID clients through a Dirichlet distribution as proposed in [13, 18], while the IID cohort was randomly generated following a uniform distribution. Similar to the Synthea scenario, we focus on race as our sensitive attribute, and the outcome prediction task is the standard task studied in the literature [15]: predicting the mortality of patients admitted to the ICU, during their stay. We present more details about these cohort extractions in Appendix B and provide more details and statistics about the four cohorts in Table 1.
Table 1: Statistics for all four cohorts: Most Populous States, Heterogeneous States, MIMIC-III IID, and MIMIC-III non-IID.

Most Populous States (first five columns) and Heterogeneous States (last five columns):

| Race/Ethnicity | California | Texas | Florida | New York | Pennsylvania | Maine | Mississippi | Hawaii | New Mexico | Alaska |
|---|---|---|---|---|---|---|---|---|---|---|
| White | 2330 | 1988 | 2188 | 2126 | 2016 | 2807 | 3323 | 776 | 2227 | 893 |
| Black | 223 | 368 | 446 | 451 | 258 | 39 | 2384 | 69 | 87 | 55 |
| Asian | 508 | 129 | 79 | 261 | 70 | 26 | 63 | 1039 | 48 | 76 |
| Hispanic | 2114 | 1849 | 945 | 713 | 175 | 29 | 195 | 298 | 2453 | 88 |
| Other | 184 | 81 | 67 | 98 | 55 | 77 | 96 | 949 | 389 | 305 |

MIMIC-III IID (first five columns) and MIMIC-III non-IID (last five columns):

| Race/Ethnicity | Client 1 | Client 2 | Client 3 | Client 4 | Client 5 | Client 1 | Client 2 | Client 3 | Client 4 | Client 5 |
|---|---|---|---|---|---|---|---|---|---|---|
| White | 1456 | 2201 | 1230 | 3016 | 2384 | 2188 | 2024 | 2050 | 1669 | 2356 |
| Black | 142 | 213 | 106 | 329 | 166 | 163 | 154 | 165 | 145 | 329 |
| Asian | 52 | 70 | 34 | 111 | 69 | 0 | 119 | 92 | 124 | 1 |
| Hispanic | 74 | 86 | 43 | 120 | 99 | 0 | 0 | 148 | 274 | 0 |
| Other | 429 | 631 | 334 | 833 | 629 | 650 | 791 | 523 | 462 | 430 |
Baselines:
For all scenarios, we evaluate the performance of the proposed method by comparing it to the following baselines, which are the most relevant to our work in the context of group fairness.
No-Fed:
Clients are trained individually on their local datasets, without any collaboration through FL.
FedAvg:
FL using FedAvg [29] to aggregate the model’s parameters. FedAvg has been one of the most popular FL algorithms and has been used as a baseline extensively. This method does not address group fairness.
FairFed:
FL with FairFed [13]. FairFed extends FedAvg to incorporate a fairness element in the aggregation algorithm. To provide a fair comparison, we optimize FairFed for the same fairness metrics as the proposed method to consider non-binary sensitive attributes.
We also experimented with the impact of different values of $\beta$, the weight of the fairness component in Eq. 6.
For each experiment, we report the average accuracy, TPSD, Worst TPR, and APSD (as defined in Eq. 2 to 4) and their standard deviation in a 5-fold cross-validation process. Note that for the No-Fed baseline, though the training and testing are done individually, we report the metrics collectively for all clients (as if the testing sets were centralized).
5. RESULTS
Table 2 shows the Accuracy, TPSD, Worst TPR, and APSD scores for all methods on all datasets when using adversarial debiasing. The proposed method is able to achieve competitive performance with an accuracy that remains close to the one achievable using FedAvg (i.e., without any fairness mitigation, other than the adversarial debiasing) while demonstrating better group fairness than all baselines tested for all but one metric (Worst TPR). Notably, our method achieves comparable accuracy to FairFed while reducing the parity-based metrics on all cohorts. Though FairFed seemed to outperform our proposed method for the minimax group fairness metric (Worst TPR), the results remain very close to each other. It is also worth noting that, as expected, using FL allows for both better predictive performance and fairer predictions, as shown by the poorer scores achieved with the local training.
Table 2: Accuracy, TPSD, Worst TPR, and APSD (mean ± standard deviation over 5-fold cross-validation) for all methods on all four cohorts.

| Cohort | Metric | No-Fed | FedAvg | FairFed | Proposed |
|---|---|---|---|---|---|
| Most Populous States | Accuracy ↑ | 87.45 ± 0.46 | 90.02 ± 0.41 | 89.49 ± 0.41 | 89.28 ± 0.45 |
| | TPSD ↓ | 0.184 ± 0.032 | 0.102 ± 0.037 | 0.066 ± 0.015 | 0.052 ± 0.014 |
| | Worst TPR ↑ | 0.368 ± 0.047 | 0.441 ± 0.092 | 0.509 ± 0.083 | 0.502 ± 0.080 |
| | APSD ↓ | 4.81 ± 1.19 | 2.88 ± 0.98 | 1.79 ± 0.66 | 1.51 ± 0.62 |
| Heterogeneous States | Accuracy ↑ | 87.72 ± 0.26 | 90.08 ± 0.30 | 89.39 ± 0.32 | 89.68 ± 0.34 |
| | TPSD ↓ | 0.225 ± 0.023 | 0.064 ± 0.016 | 0.042 ± 0.021 | 0.031 ± 0.022 |
| | Worst TPR ↑ | 0.258 ± 0.052 | 0.572 ± 0.066 | 0.062 ± 0.069 | 0.639 ± 0.064 |
| | APSD ↓ | 6.26 ± 0.99 | 2.42 ± 0.37 | 1.37 ± 0.38 | 1.16 ± 0.38 |
| MIMIC-III IID | Accuracy ↑ | 74.32 ± 0.96 | 76.59 ± 0.55 | 75.96 ± 0.64 | 76.38 ± 0.52 |
| | TPSD ↓ | 0.088 ± 0.019 | 0.051 ± 0.014 | 0.035 ± 0.015 | 0.034 ± 0.016 |
| | Worst TPR ↑ | 0.631 ± 0.057 | 0.689 ± 0.037 | 0.749 ± 0.043 | 0.734 ± 0.037 |
| | APSD ↓ | 7.29 ± 0.20 | 3.57 ± 0.92 | 3.15 ± 0.82 | 2.87 ± 0.85 |
| MIMIC-III non-IID | Accuracy ↑ | 74.38 ± 0.51 | 77.05 ± 0.29 | 76.49 ± 0.64 | 76.44 ± 0.54 |
| | TPSD ↓ | 0.110 ± 0.014 | 0.049 ± 0.013 | 0.034 ± 0.014 | 0.033 ± 0.014 |
| | Worst TPR ↑ | 0.644 ± 0.109 | 0.735 ± 0.081 | 0.763 ± 0.045 | 0.784 ± 0.129 |
| | APSD ↓ | 7.01 ± 1.45 | 3.35 ± 1.29 | 2.94 ± 0.68 | 2.16 ± 1.06 |
Impact of fairness reward:
Similar to FairFed, we have implemented a fairness budget $\beta$ in the calculation of the clients' weights during the aggregation process. We study the impact of $\beta$ on both the fairness of the resulting predictions and the overall accuracy (Fig. 3). We only report the results for the MIMIC-III IID cohort, though we found similar results on the other cohorts throughout our experiments. Specifically, we set $\beta$ to different values (0, 0.25, 0.5, 1, 2.5, 5); $\beta = 0$ corresponds to FedAvg, i.e., without a fair aggregation process.
As expected, increasing the fairness budget provides better fairness guarantees, as shown by the differences in the fairness metrics when $\beta$ ranges from 0 to 2.5. One can notice a decrease in TPSD and APSD from 0.051 and 3.75 to 0.030 and 2.78, respectively, while the Worst TPR increases from 0.689 to 0.738. Note that for the parity-based measures (TPSD/APSD), a lower value is better, while a higher value reflects better results for the accuracy and Worst TPR. Additionally, though it is hardly perceptible in the figure, the accuracy also decreases from 76.59 to 76.22. However, with the largest fairness budget ($\beta = 5$), the method shows poor performance, both for the fairness metrics and the accuracy. This is most likely because the fairness budget becomes too dominant in the aggregation process, making the clients' weights change drastically from one round to the next and making the optimization process less stable.
6. RELATED WORK
While there exists a wide range of studies related to our work, here, we discuss a non-exhaustive group of studies that are closely related to ours, including the applications of FL on EHRs, the studies investigating group fairness in predictive modeling with EHRs, and the works on predictive models under an FL setting focusing on improving group fairness.
Federated Learning with EHRs:
FL has been gaining significant attention in the healthcare sector, especially for its privacy-preserving nature and ability to work on decentralized data [10, 33, 50]. FL has been applied to EHRs in various applications, with one of the earliest being in-hospital mortality prediction for patients admitted to the ICU (i.e., the prediction task used in our experiments). Multiple studies [36, 44] investigated the use of different machine learning models such as logistic regression, multi-layer perceptrons (MLPs), and feed-forward neural networks on the MIMIC-III [21] and eICU [38] EHR datasets. These studies aimed to utilize FL to improve the accuracy of models while protecting patient privacy. To name a few such studies, consider the autoencoder-based architecture proposed to cluster patients for mortality prediction [20] using the eICU dataset. Another work investigated MLPs and LASSO logistic regression in an FL setting to predict 7-day mortality for hospitalized COVID-19 patients with EHR data [45]. While these studies have demonstrated the potential of FL in healthcare, the underlying AI models remain fairly simple and the possible fairness concerns are not investigated. To mitigate those concerns, recent studies proposed to use FL for improving fairness by increasing dataset diversity [8] (without developing the idea further).
Improving Group Fairness on EHRs:
Though group fairness remains to be investigated in FL with EHRs, bias mitigation techniques on EHRs have been studied for centralized models in a few studies [14]. Notably, Pfohl et al. [35] proposed to use adversarial learning to debias a model predicting the risk of cardiovascular diseases from EHRs. A similar idea has been applied to disease prediction using chest X-rays [9]. Intra-processing techniques such as fine-tuning and pruning of pre-trained networks have also been proposed for chest X-rays [28]. Additionally, regularization techniques have been used to achieve counterfactual fairness [37], where the model is required to output the same prediction for a patient and for the same patient when changing the value of their sensitive attribute. Some studies also argued that, in critical applications such as healthcare where the trade-off between accuracy and fairness could lead to dramatic outcomes, fairness should only be achieved through data collection rather than model constraints, which can be indirectly achieved through FL [7].
Group Fairness in FL:
While fairness in FL can refer to many different subfields, such as client-based fairness [24, 25, 31, 48] (enforcing that the global model performs fairly amongst the clients), here, we focus on the studies investigating group fairness, which has seen growing interest. To name a few, Zeng et al. proposed FedFB [51] to extend FairBatch [41] to an FL setting. FairFed [13] proposed a fairness-aware aggregation algorithm agnostic to the clients' local debiasing technique. Also, Abay et al. [1] investigated the effectiveness of a global reweighting strategy. As opposed to these studies, which mostly aim at reducing the difference between each group's outcomes, FedMinMax [34] proposed an optimization algorithm to tackle minimax group fairness [11] in an FL setting. Similarly, PFFL (Provably Fair Federated Learning) [19] claims to improve group fairness in FL with a variation of Bounded Group Loss [2] to ensure no group loss is below a given threshold. Similar to our work, Federated Adversarial DEbiasing (FADE) [16] proposed to leverage adversarial debiasing [17] to enhance group fairness in a federated setting; however, FADE's aggregation process does not consider any fairness metric. While these studies have demonstrated promising results in group fairness, most of them solely consider fairness as a binary problem and are often dependent on a single fairness metric. Additionally, these methods either study the federated algorithm or the model training process, leaving the combination of both underinvestigated. In this study, we investigated a comprehensive FL approach, from the model and training process to the weight aggregation algorithm, to enhance fairness (agnostic of the chosen metric) in a healthcare setting.
7. DISCUSSION
Though FL applications with EHRs have seen growing interest, fairness-related issues in this context remain widely under-investigated. In this study, we investigated the fairness benefits of FL methods and proposed a new FL algorithm to leverage EHRs with better fairness guarantees for non-binary sensitive attributes (like race or insurance type). Our proposed algorithm aims at giving greater weight to consistently fairer clients to push toward better global fairness. Our main results in Table 2 demonstrate that FL is a viable paradigm for healthcare sites to collaborate and train a common AI model to achieve better and fairer performance while alleviating privacy and security concerns. Additionally, the difference in fairness between the two cohorts with homogeneous data distributions (Most Populous States and MIMIC-III) and the Heterogeneous States cohort indicates that an FL environment with heterogeneous clients provides even greater fairness guarantees. On the contrary, when using local (No-Fed) training, the fairness performance gap is smaller for the homogeneous cohorts. Of particular note is the Heterogeneous States scenario, where the different demographics of each state allow the common model to have access to a much more diverse population, which is key to training accurate and fair AI models.
We have added a hyperparameter $\beta$ to control the weight of the fairness component in the aggregation process, ensuring that the clients' weights do not change too drastically between rounds while keeping the model as fair as possible. Additionally, we have demonstrated the importance of the fairness budget in our aggregation process and how it can impact the model's accuracy and fairness guarantees.
Limitations:
Our study remains limited in a few ways. We have limited our study to a binary classifier. Also, our proposed method only considers one sensitive attribute at a time. We expect, however, that the same design would extend to non-binary (multi-class) outcome scenarios or scenarios with multiple sensitive attributes as well. Moreover, we have only considered adversarial debiasing as our fairness mitigation technique, leaving other techniques to be studied in an FL context. Additionally, we have not studied the impact of the proposed method at the client level; it is possible that some clients' performance could be negatively affected to achieve better fairness. Lastly, though Dipole has been one of the most successful AI models in healthcare applications, newer transformer-based models may fit certain applications better. We note, however, that integrating large-scale transformer-based models into FL designs may be challenging in practice (due to the computational resources they would require).
As for future work, one can investigate the impact of the proposed method in scenarios with a greater number of clients, study additional fairness metrics beyond group and worst-case fairness, and further study the impact of both hyperparameters $\lambda$ and $\beta$. Additionally, as we have not directly reported the performance of the proposed method for each individual subgroup, one can investigate the effect of the approach on each group individually, in addition to aggregate metrics like TPSD.
CCS CONCEPTS.
• Computing methodologies → Machine learning; • Applied Computing → Life and medical sciences.
ACKNOWLEDGMENTS
Our study was partially supported by the NIH awards 3P20GM103446 and 5P20GM113125.
A. SYNTHEA DATASET COHORT EXTRACTION
In this experiment, we simulated a generic scenario where the five most populous US states collaborate to jointly train a federated model and another with a selection of five heterogeneous states. Here, each state represents a client in the FL scope (or a hospital, following Fig. 1). These selections allow us to investigate fairness in two realistic FL scenarios across the country, one centered on larger clients, and one to evaluate the proposed method’s robustness to heterogeneous demographic distributions.
We simulate a scenario where the states represent the clients of our FL architecture and thus collaborate to learn a federated model. Therefore, the states' data is never shared with the central server nor with the other participating states. We have performed two separate experiments, each with its own dataset generation process, to fit the needs of the desired scenario. Regarding our second cohort (Heterogeneous States), we selected those states because of their distinctive demographic distributions. For example, Maine, Mississippi, Hawaii, New Mexico, and Alaska have the highest White, Black, Asian, Hispanic, and Native American populations per capita, respectively, according to the 2020 US census [5].
Note that the two cohort generation processes are independent of each other; the states' populations are only proportional within their own scenario. From these two cohorts, we aim to develop an accurate mortality prediction model for patients who have been discharged following an inpatient visit (excluding those who passed away during the inpatient visit) [39] while preserving group fairness. The latest inpatient visit in each patient's medical history is defined as the anchor event, leading to an observation window spanning the entire medical record up until the latest inpatient visit, and a prediction window of one year following that visit. We extracted the condition, medication, and procedure codes during the observation window and used them as input to our model. As shown in [42], race and insurance type can be sources of bias in machine learning applications to healthcare and are thus most relevant to study in our work.
B. MIMIC DATASET COHORT EXTRACTION
We have created two separate cohorts from the MIMIC-III dataset, where the patients are assigned to five clients, each representing a healthcare institution (analogous to the states in the Synthea case). From the original MIMIC dataset, we extracted two cohorts: a non-IID and an IID one. The non-IID cohort was generated following the ideas described in [13, 18], where we simulated non-IID clients through a Dirichlet distribution with a fixed concentration parameter $\alpha$ (as $\alpha \to \infty$, the data distribution becomes IID). For the IID cohort, the number of patients in each of the five created clients was randomly generated from a uniform distribution between 1,500 and 5,000 patients. From these cohorts, we aim to predict the mortality of patients admitted to the ICU during their stay. The anchor event for this task is defined as an admission to the ICU. The observation window encompasses the first 24 hours of the patient's ICU stay, while the prediction window extends from the time of ICU admission to discharge. This approach leads to an observation window of 24 hours and a prediction window of varying duration, representing the length of stay in the ICU. Because MIMIC-III is composed of measurements taken at irregular time intervals, the input variables are aggregated hourly, resulting in a maximum of 24 input timestamps.
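A sketch of the Dirichlet-based partitioning, in the spirit of [13, 18], is shown below; in our setup the partition would be driven by the sensitive attribute values, and the function name, seed, and example values of $\alpha$ are ours.

```python
import numpy as np

def dirichlet_partition(attributes, num_clients, alpha, seed=0):
    """Assign sample indices to clients with per-group proportions drawn from a
    Dirichlet(alpha) distribution; smaller alpha yields more heterogeneous
    clients, while large alpha approaches an IID split."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for value in np.unique(attributes):
        idx = np.where(attributes == value)[0]
        rng.shuffle(idx)
        shares = rng.dirichlet(alpha * np.ones(num_clients))        # client shares for this group
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

# Example: 1,000 patients with three race groups split across 5 clients
races = np.random.default_rng(0).choice(["White", "Black", "Asian"], size=1000)
clients = dirichlet_partition(races, num_clients=5, alpha=0.5)
```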
Footnotes
FedAvg is one of the most popular FL aggregation algorithms.
The term patient is broadly used and also includes healthy individuals interacting with healthcare systems for purposes such as prevention.
Contributor Information
Raphael Poulain, University of Delaware USA.
Mirza Farhan Bin Tarek, University of Delaware USA.
Rahmatollah Beheshti, University of Delaware USA.
REFERENCES
- [1].Abay Annie, Zhou Yi, Baracaldo Nathalie, Rajamoni Shashank, Chuba Ebube, and Ludwig Heiko. 2020. Mitigating Bias in Federated Learning. 10.48550/ARXIV.2012.02447 [DOI]
- [2].Agarwal Alekh, Dudik Miroslav, and Wu Zhiwei Steven. 2019. Fair Regression: Quantitative Definitions and Reduction-Based Algorithms. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, 120–129. https://proceedings.mlr.press/v97/agarwal19d.html [Google Scholar]
- [3].Ahmad Muhammad Aurangzeb, Patel Arpit, Eckert Carly, Kumar Vikas, and Teredesai Ankur. 2020. Fairness in Machine Learning for Healthcare. 10.1145/3394486.3406461 [DOI]
- [4].Beutel Daniel J, Topal Taner, Mathur Akhil, Qiu Xinchi, Parcollet Titouan, and Lane Nicholas D. 2020. Flower: A Friendly Federated Learning Research Framework. arXiv preprint arXiv:2007.14390 (2020). [Google Scholar]
- [5].US Census. [n. d.]. Racial and Ethnic Diversity in the United States: 2010 Census and 2020 Census. https://www.census.gov/library/visualizations/interactive/racial-and-ethnic-diversity-in-the-united-states-2010-and-2020-census.html. Accessed: 2022–11-17.
- [6].Chang Trenton, Sjoding Michael W., and Wiens Jenna. 2022. Disparate Censorship & Undertesting: A Source of Label Bias in Clinical Machine Learning. 10.48550/ARXIV.2208.01127 [DOI] [PMC free article] [PubMed]
- [7].Chen Irene Y., Johansson Fredrik D., and Sontag David. 2018. Why is My Classifier Discriminatory?. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (Montréal, Canada) (NIPS’18). Curran Associates Inc., Red Hook, NY, USA, 3543–3554. [Google Scholar]
- [8].Chen Richard J., Chen Tiffany Y., Lipková Jana, Wang Judy J., Williamson Drew F. K., Lu Ming Y., Sahai Sharifa, and Mahmood Faisal. 2021. Algorithm Fairness in AI for Medicine and Healthcare. CoRR abs/2110.00603 (2021). arXiv:2110.00603 https://arxiv.org/abs/2110.00603 [Google Scholar]
- [9].Correa Ramon, Jeong Jiwoong Jason, Patel Bhavik, Trivedi Hari, Gichoya Judy W., and Banerjee Imon. 2021. Two-step adversarial debiasing with partial learning – medical image case-studies. 10.48550/ARXIV.2111.08711 [DOI]
- [10].Dang Trung Kien, Lan Xiang, Weng Jianshu, and Feng Mengling. 2022. Federated Learning for Electronic Health Records. ACM Transactions on Intelligent Systems and Technology (TIST) (2022). [Google Scholar]
- [11].Diana Emily, Gill Wesley, Kearns Michael, Kenthapadi Krishnaram, and Roth Aaron. 2020. Minimax Group Fairness: Algorithms and Experiments. 10.48550/ARXIV.2011.03108 [DOI]
- [12].Elazar Yanai and Goldberg Yoav. 2018. Adversarial Removal of Demographic Attributes from Text Data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 11–21. 10.18653/v1/D18-1002 [DOI] [Google Scholar]
- [13].Ezzeldin Yahya H, Yan Shen, He Chaoyang, Ferrara Emilio, and Avestimehr Salman. 2021. FairFed: Enabling group fairness in federated learning. arXiv preprint arXiv:2110.00857 (2021). [Google Scholar]
- [14].Feng Qizhang, Du Mengnan, Zou Na, and Hu Xia. 2022. Fair Machine Learning in Healthcare: A Review. 10.48550/ARXIV.2206.14397 [DOI]
- [15].Harutyunyan Hrayr, Khachatrian Hrant, Kale David C., Ver Steeg Greg, and Galstyan Aram. 2019. Multitask learning and benchmarking with clinical time series data. Scientific Data 6, 1 (2019). 10.1038/s41597-019-0103-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [16].Hong Junyuan, Zhu Zhuangdi, Yu Shuyang, Wang Zhangyang, Dodge Hiroko H., and Zhou Jiayu. 2021. Federated Adversarial Debiasing for Fair and Transferable Representations. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Virtual Event, Singapore) (KDD ‘21). Association for Computing Machinery, New York, NY, USA, 617–627. 10.1145/3447548.3467281 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [17].Hong Junyuan, Zhu Zhuangdi, Yu Shuyang, Wang Zhangyang, Dodge Hiroko H, and Zhou Jiayu. 2021. Federated adversarial debiasing for fair and transferable representations. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 617–627. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [18].Hsu Tzu-Ming Harry, Qi Hang, and Brown Matthew. 2019. Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification. arXiv:1909.06335 [cs.LG] [Google Scholar]
- [19].Hu Shengyuan, Wu Zhiwei Steven, and Smith Virginia. 2022. Fair Federated Learning via Bounded Group Loss. 10.48550/ARXIV.2203.10190 [DOI]
- [20].Huang Li and Liu Dianbo. 2019. Patient Clustering Improves Efficiency of Federated Machine Learning to predict mortality and hospital stay time using distributed Electronic Medical Records. 10.48550/ARXIV.1903.09296 [DOI] [PubMed]
- [21].Johnson Alistair E.W., Pollard Tom J., Shen Lu, Lehman Li-Wei H., Feng Mengling, Ghassemi Mohammad, Moody Benjamin, Szolovits Peter, Celi Leo Anthony, Mark Roger G., et al. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data 3, 1 (2016), 160035. 10.1038/sdata.2016.35 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [22].Kang Jian, Xie Tiankai, Wu Xintao, Maciejewski Ross, and Tong Hanghang. 2021. InfoFair: Information-Theoretic Intersectional Fairness. 10.48550/ARXIV.2105.11069 [DOI]
- [23].Keshta Ismail and Odeh Ammar. 2021. Security and privacy of electronic health records: Concerns and challenges. Egyptian Informatics Journal 22, 2 (2021), 177–183. 10.1016/j.eij.2020.07.003 [DOI] [Google Scholar]
- [24].Li Tian, Hu Shengyuan, Beirami Ahmad, and Smith Virginia. 2021. Ditto: Fair and Robust Federated Learning Through Personalization. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 6357–6368. https://proceedings.mlr.press/v139/li21h.html [Google Scholar]
- [25].Li Tian, Sanjabi Maziar, and Smith Virginia. 2019. Fair Resource Allocation in Federated Learning. CoRR abs/1905.10497 (2019). arXiv:1905.10497 http://arxiv.org/abs/1905.10497 [Google Scholar]
- [26].Li Xiang, Huang Kaixuan, Yang Wenhao, Wang Shusen, and Zhang Zhihua. 2020. On the Convergence of FedAvg on Non-IID Data. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=HJxNAnVtDS [Google Scholar]
- [27]. Ma Fenglong, Chitta Radha, Zhou Jing, You Quanzeng, Sun Tong, and Gao Jing. 2017. Dipole: Diagnosis Prediction in Healthcare via Attention-Based Bidirectional Recurrent Neural Networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Halifax, NS, Canada) (KDD ‘17). Association for Computing Machinery, New York, NY, USA, 1903–1911. [Google Scholar]
- [28].Marcinkevičs Ričards, Ozkan Ece, and Vogt Julia E. 2022. Debiasing Deep Chest X-Ray Classifiers using Intra- and Post-processing Methods. 10.48550/ARXIV.2208.00781 [DOI]
- [29].McMahan Brendan, Moore Eider, Ramage Daniel, Hampson Seth, and Aguera y Arcas Blaise. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics. PMLR, 1273–1282. [Google Scholar]
- [30].Mehrabi Ninareh, Morstatter Fred, Saxena Nripsuta, Lerman Kristina, and Galstyan Aram. 2019. A Survey on Bias and Fairness in Machine Learning. 10.48550/ARXIV.1908.09635 [DOI]
- [31].Mohri Mehryar, Sivek Gary, and Theertha Suresh Ananda. 2019. Agnostic Federated Learning. 10.48550/ARXIV.1902.00146 [DOI] [Google Scholar]
- [32].National Center for Health Statistics. 2016. Health, United States, 2015: With Special Feature on Racial and Ethnic Health Disparities. National Center for Health Statistics (US), Hyattsville (MD). [PubMed] [Google Scholar]
- [33].Nguyen Dinh C, Pham Quoc-Viet, Pathirana Pubudu N, Ding Ming, Seneviratne Aruna, Lin Zihuai, Dobre Octavia, and Hwang Won-Joo. 2022. Federated learning for smart healthcare: A survey. ACM Computing Surveys (CSUR) 55, 3 (2022), 1–37. [Google Scholar]
- [34].Papadaki Afroditi, Martinez Natalia, Bertran Martin, Sapiro Guillermo, and Rodrigues Miguel. 2022. Minimax Demographic Group Fairness in Federated Learning. 10.1145/3531146.3533081 [DOI] [Google Scholar]
- [35].Pfohl Stephen, Marafino Ben, Coulet Adrien, Rodriguez Fatima, Palaniappan Latha, and Shah Nigam H. 2019. Creating Fair Models of Atherosclerotic Cardiovascular Disease Risk. 10.1145/3306618.3314278 [DOI]
- [36].Pfohl Stephen R., Dai Andrew M., and Heller Katherine. 2019. Federated and Differentially Private Learning for Electronic Health Records. 10.48550/ARXIV.1911.05861 [DOI]
- [37].Pfohl Stephen R., Duan Tony, Ding Daisy Yi, and Shah Nigam H. 2019. Counterfactual Reasoning for Fair Clinical Risk Prediction. In Proceedings of the 4th Machine Learning for Healthcare Conference (Proceedings of Machine Learning Research, Vol. 106), Finale Doshi-Velez, Jim Fackler, Ken Jung, David Kale, Rajesh Ranganath, Byron Wallace, and Jenna Wiens (Eds.). PMLR, 325–358. https://proceedings.mlr.press/v106/pfohl19a.html [Google Scholar]
- [38].Pollard Tom J., Johnson Alistair E. W., Raffa Jesse D., Celi Leo A., Mark Roger G., and Badawi Omar. 2018. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Scientific Data 5, 1 (2018), 180178. 10.1038/sdata.2018.178 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [39].Poulain Raphael, Gupta Mehak, and Beheshti Rahmatollah. 2022. Few-Shot Learning with Semi-Supervised Transformers for Electronic Health Records. [PMC free article] [PubMed] [Google Scholar]
- [40].Rajkomar Alvin, Hardt Michaela, Howell Michael D., Corrado Greg, and Chin Marshall H. 2018. Ensuring Fairness in Machine Learning to Advance Health Equity. Annals of Internal Medicine 169, 12 (2018), 866. 10.7326/m18-1990 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [41].Roh Yuji, Lee Kangwook, Whang Steven Euijong, and Suh Changho. 2020. FairBatch: Batch Selection for Model Fairness. 10.48550/ARXIV.2012.01696 [DOI] [Google Scholar]
- [42].Röösli Eliane, Bozkurt Selen, and Hernandez-Boussard Tina. 2022. Peeking into a black box, the fairness and generalizability of a MIMIC-III benchmarking model. Scientific Data 9, 1 (2022). 10.1038/s41597-021-01110-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [43]. Segal Aaron, Marcedone Antonio, Kreuter Benjamin, Ramage Daniel, McMahan H. Brendan, Seth Karn, Bonawitz KA, Patel Sarvar, and Ivanov Vladimir. 2017. Practical Secure Aggregation for Privacy-Preserving Machine Learning. In CCS. https://eprint.iacr.org/2017/281.pdf [Google Scholar]
- [44].Sharma Pulkit, Shamout Farah E, and Clifton David A. 2019. Preserving patient privacy while training a predictive model of in-hospital mortality. arXiv preprint arXiv:1912.00354 (2019). [Google Scholar]
- [45].Vaid Akhil, Jaladanki Suraj K, Xu Jie, Teng Shelly, Kumar Arvind, Lee Samuel, Somani Sulaiman, Paranjpe Ishan, De Freitas Jessica K, Wanyan Tingyi, et al. 2020. Federated learning of electronic health records improves mortality prediction in patients hospitalized with covid-19. medRxiv (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [46].Walonoski Jason, Kramer Mark, Nichols Joseph, Quina Andre, Moesel Chris, Hall Dylan, Duffett Carlton, Dube Kudakwashe, Gallagher Thomas, and McLachlan Scott. 2017. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. Journal of the American Medical Informatics Association 25, 3 (August 2017), 230–238. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [47].Wan Mingyang, Zha Daochen, Liu Ninghao, and Zou Na. 2021. Modeling Techniques for Machine Learning Fairness: A Survey. 10.48550/ARXIV.2111.03015 [DOI]
- [48].Wang Zheng, Fan Xiaoliang, Qi Jianzhong, Wen Chenglu, Wang Cheng, and Yu Rongshan. 2021. Federated Learning with Fair Averaging. 10.48550/ARXIV.2104.14937 [DOI]
- [49].Xiao C, Choi E, and Sun J. 2018. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. J Am Med Inform Assoc 25, 10 (Oct 2018), 1419–1428. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [50].Xu Jie, Glicksberg Benjamin S., Su Chang, Walker Peter, Bian Jiang, and Wang Fei. 2019. Federated Learning for Healthcare Informatics. 10.48550/ARXIV.1911.06270 [DOI] [PMC free article] [PubMed]
- [51].Zeng Yuchen, Chen Hongxu, and Lee Kangwook. 2021. Improving Fairness via Federated Learning. 10.48550/ARXIV.2110.15545 [DOI]