Abstract
The use of machine learning to guide clinical decision making has the potential to worsen existing health disparities. Several recent works frame the problem as that of algorithmic fairness, a framework that has attracted considerable attention and criticism. However, the appropriateness of this framework is unclear due to both ethical and technical considerations; the latter include trade-offs between measures of fairness and model performance that are not well understood for predictive models of clinical outcomes. To inform the ongoing debate, we conduct an empirical study to characterize the impact of penalizing group fairness violations on an array of measures of model performance and group fairness. We repeat the analysis across multiple observational healthcare databases, clinical outcomes, and sensitive attributes. We find that procedures that penalize differences between the distributions of predictions across groups induce nearly universal degradation of multiple performance metrics within groups. On examining the secondary impact of these procedures, we observe heterogeneity in their effect on measures of fairness in calibration and ranking across experimental conditions. Beyond the reported trade-offs, we emphasize that analyses of algorithmic fairness in healthcare lack the contextual grounding and causal awareness necessary to reason about the mechanisms that lead to health disparities, as well as about the potential of algorithmic fairness methods to counteract those mechanisms. In light of these limitations, we encourage researchers building predictive models for clinical use to step outside the algorithmic fairness frame and engage critically with the broader sociotechnical context surrounding the use of machine learning in healthcare.
1. Introduction
The use of machine learning with observational health data to guide clinical decision making has the potential to introduce and exacerbate health disparities for disadvantaged and underrepresented populations [1–6]. This effect can derive from inequity in historical and current patterns of care access and delivery [1, 7–11], underrepresentation in clinical datasets [12], the use of biased or mis-specified proxy outcomes during model development [3, 13, 14], and differences in the accessibility, usability, and effectiveness of predictive models across groups [1, 15]. In response, considerable attention has been devoted to reasoning about the extent to which clinical predictive models may be designed to anticipate and proactively mitigate harms to advance health equity, while upholding ethical standards [1, 2, 4, 7, 16–19].
The role that algorithmic fairness techniques should have in the development of clinical predictive models is actively debated [1, 2, 16, 17, 20]. While these techniques have been extensively studied and scrutinized in domains such as criminal justice, hiring, and education [21–25], and have been made accessible by several open source software frameworks [26–28], the scope of their examination in the context of clinical predictive models has been relatively limited [29–34]. Algorithmic fairness methods specify a mathematical formalization of a fairness criterion representative of an ideal (such as equal error rates between male and female patients), and provide procedures for minimizing violations from the fairness criterion without unduly deteriorating model performance [22, 35–39]. Furthermore, these mathematical formalizations can be used as auditing mechanisms to identify problematic characteristics of a predictive model and to promote transparency in the model’s output [40–43]. In the context of this debate, it is essential to recognize that algorithmic fairness techniques enable monitoring and manipulating the output of predictive models, but are generally insufficient by themselves to mitigate the introduction or perpetuation of health disparities resulting from model-guided interventions [2, 3, 16, 17, 20, 24, 44–49].
Health disparities arise as a result of structural forms of racism and related inequities in areas such as housing, education, employment, and criminal justice that affect healthcare access, utilization, and quality [11, 50]. These effects are further compounded by under-representation of the elderly, women, and ethnic minorities in clinical trials and cohort studies [51, 52], as well as warped financial incentives in the healthcare system [53]. Typical formulations of algorithmic fairness, particularly group fairness criteria, are unaware of this context, because they are primarily defined in terms of a model’s predictions, observed outcomes, as well as membership in a pre-specified group of demographic attributes, and evaluated on cohorts derived retrospectively. As a result, algorithmic fairness criteria may be misleading in several situations, including when the observed outcome is a biased surrogate for the construct of interest [3, 49] and when predictive models are not appropriately contextualized in terms of the heterogeneous and dynamic impact of the complex interventions and policies that they enable, contributing to an erroneous conflation of model performance metrics with an accrual of benefit [44, 47, 54–56].
Some researchers argue that techniques from algorithmic fairness still have a role to play in promoting health equity [1], particularly if their use is informed by a deeply embedded understanding of the sociotechnical and clinical contexts that surround the use of a predictive model’s output. From this perspective, it is relevant to reason about the trade-offs between properties such as model performance and fairness criteria satisfaction, as long as those properties can be appropriately contextualized. However, it is generally impossible to satisfy conflicting notions of fairness simultaneously [44, 57–59] and methods that impose fairness constraints are known to do so at the cost of substantial trade-offs between various measures of model performance and fairness criteria satisfaction, in ways that are sensitive to the properties of the dataset and learning algorithm used [43, 44, 57, 58, 60–64]. As the evidence base that surrounds the use of algorithmic fairness methods in the context of clinical risk prediction remains limited, the extent to which these trade-offs manifest when adopting algorithmic fairness approaches for training clinical predictive models remains unclear.
To inform this discussion, we conduct a large-scale empirical study characterizing the trade-offs between various measures of model performance and fairness criteria satisfaction for clinical predictive models that are penalized to varying degrees against violations of group fairness criteria. We repeat the analysis across several databases, outcomes and sensitive attributes in an attempt to identify patterns that generalize. We make all code relevant for reproducing these experiments available at https://github.com/som-shahlab/fairness_benchmark.
The structure of the remainder of the paper is as follows. In Section 2, we formally define the prediction problem, group fairness criteria, metrics, and learning objectives that we evaluate. We then describe the experimental setup, including datasets used, definitions of cohorts, outcomes and sensitive attributes, as well as procedures for feature extraction, model training and evaluation. In Section 3, we report on the results. In Section 4, we provide recommendations for model developers and policy makers in light of both the trade-offs that we find as well as the limitations inherent to the algorithmic fairness framework. Furthermore, we discuss the appropriateness of typical abstractions used for evaluation of group fairness, including the use of discrete categories for sex, race, and ethnicity.
2. Methods
Across twenty-five combinations of datasets, clinical outcomes, and sensitive attributes, we train a series of predictive models that are penalized to varying degrees against violations of fairness criteria, and report on model performance and group fairness metrics. Figure 1 outlines the experimental procedure.
Figure 1:
An overview of the experimental procedure. We extract cohorts from a collection of databases and label patients on the basis of observed clinical outcomes and membership in groups of multiple demographic attributes. For twenty-five combinations of database, outcome, and sensitive attribute, we train a series of predictive models that are penalized to varying degrees against violations of conditional prediction parity. We report on the effect of such penalties on measures of model performance and fairness criteria satisfaction. Shown here is the process for the cohort drawn from the STARR database.
We first define the mathematical formulation of group fairness from which fairness metrics and training objectives are derived (Section 2.1). We present three classes of group fairness criteria – conditional prediction parity (Section 2.1.2), calibration (Section 2.1.3), and cross-group ranking (Section 2.1.4) – each of which is operationalized by one or more fairness metrics that quantify the extent to which the associated criterion is violated. We evaluate each of these metrics for models developed with six regularization strategies that penalize violation of conditional prediction parity (Section 2.1.5). The computational experiments that examine the behavior of these constructs are described in Section 2.2, which includes a description of the datasets (Section 2.2.1), definitions of the cohort inclusion criteria, outcomes, and sensitive attributes (Section 2.2.2), as well as details of feature extraction (Section 2.2.3), model training (Section 2.2.4), and evaluation (Section 2.2.5).
2.1. Formulations of Group Fairness Criteria
2.1.1. Notation and Problem Formulation
Let X be a variable designating a vector of covariates; Y ∈ {0, 1} be a binary indicator of an outcome; and A be a discrete indicator for a protected or sensitive attribute, such as race, ethnicity, gender, sex, or age, with K groups A_1, …, A_K. The objective of supervised learning with binary outcomes is to use data {(x_i, y_i, a_i)} to learn a function fθ, parameterized by θ, to estimate P(Y = 1 | X). The output of the predictor may be further thresholded at T to produce binarized predictions Ŷ = 1[fθ(X) ≥ T]. For notational convenience, we refer to an empirical mean defined over a dataset as an expectation involving the operator Ê. For example, Ê[fθ(X) | Y = 1] refers to the empirical mean of the predicted probability of the outcome over the set of patients for whom the outcome is observed.
Many of the group fairness criteria that we evaluate can be stated as a form of conditional independence involving a function of some or all of the variables X, Y, and fθ(X) with A. Given access to a dataset and model predictions, quantifying the extent to which a group fairness criterion is violated relies on a fairness metric that compares the empirical distribution of the relevant function across groups. The objective of supervised learning subject to group fairness is to learn a model fθ such that the violation of one or more fairness criteria is small. This framing does not preclude the use of the group indicator A as a component of the vector of covariates X.
We consider fairness criteria belonging to three general categories: conditional prediction parity, calibration, and cross-group ranking. The first, conditional prediction parity, is later incorporated into our training objective. Below, we describe the motivation for considering each class and define the particular metrics used in our evaluation.
2.1.2. Fairness Criteria Based on Conditional Prediction Parity
We refer to a class of group fairness criteria that assess conditional independence between model predictions (either the binarized predictions Ŷ or the probabilistic predictions fθ) and the categorical group indicator A as conditional prediction parity. When an instance of a fairness criterion of this class depends on a binarized prediction Ŷ, we say it is a threshold-based criterion, and when it depends on a probabilistic prediction fθ, we say it is a threshold-free criterion. This form captures demographic parity [21, 35, 65], equalized odds [22], and equal opportunity [22], among others [66].
These criteria have natural interpretations that can motivate their use in some contexts. For instance, violations of demographic parity can imply allocation discrepancies across groups. Equal opportunity and equalized odds can be understood either as a notion of demographic parity within strata defined by the outcome Y or as a notion of error rate balance. For binarized predictions Ŷ, the equal opportunity criterion implies equal true positive rates across groups, and the equalized odds criterion implies equal opportunity as well as equal false positive rates across groups. Satisfying the threshold-free variants of these criteria implies satisfaction of the threshold-based criteria at any threshold. An implication of this is that the threshold-free variant of the equalized odds criterion corresponds to the condition of identical ROC curves across groups [22].
We now present metrics that assess the extent to which a predictive model violates notions of threshold-free conditional prediction parity, starting with demographic parity (MDP) before extending it to equal opportunity (MEqOpp) and equalized odds (MEqOdds). Threshold-free demographic parity is satisfied if P(fθ | A = Ak) matches the marginal distribution P(fθ) for each group Ak. This motivates the use of a metric of the following form:

$$M_{\mathrm{DP}} = \sum_{k=1}^{K} D\big(P(f_\theta \mid A = A_k),\; P(f_\theta)\big) \tag{1}$$

where D is a function that measures a notion of distance between probability distributions. As equalized odds and equal opportunity may be interpreted as demographic parity within strata defined by the outcome, the associated metrics can be constructed analogously, with

$$M_{\mathrm{EqOpp}} = \sum_{k=1}^{K} D\big(P(f_\theta \mid A = A_k, Y = 1),\; P(f_\theta \mid Y = 1)\big) \tag{2}$$

for equal opportunity, and

$$M_{\mathrm{EqOdds}} = \sum_{y \in \{0,1\}} \sum_{k=1}^{K} D\big(P(f_\theta \mid A = A_k, Y = y),\; P(f_\theta \mid Y = y)\big) \tag{3}$$

for equalized odds.
With this construction, the form of the function D determines the properties of the metric used to assess fairness criteria violation. If D is chosen to be a divergence or integral probability metric [67], then the associated metric M is equal to zero when the fairness criterion is satisfied and is positive otherwise. In our experiments, we evaluate these metrics using the Earth Mover’s Distance (EMD) [68], an example of an integral probability metric.
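As a concrete illustration, the sketch below (not the released benchmark code) computes these one-vs-marginal metrics with the one-dimensional Earth Mover’s Distance from scipy.stats.wasserstein_distance; the function and variable names are ours.

```python
# Minimal sketch: EMD-based metrics for demographic parity, equal opportunity,
# and equalized odds, comparing each group's prediction distribution to the
# marginal distribution (a "one vs. marginal" comparison, as in Equations (1)-(3)).
import numpy as np
from scipy.stats import wasserstein_distance  # one-dimensional EMD

def emd_parity_metrics(preds, y, groups):
    """preds: predicted probabilities; y: binary outcomes; groups: group labels."""
    preds, y, groups = map(np.asarray, (preds, y, groups))
    m_dp, m_eq_opp, m_eq_odds = 0.0, 0.0, 0.0
    for k in np.unique(groups):
        in_k = groups == k
        # Demographic parity: P(f | A = k) vs. the marginal P(f)
        m_dp += wasserstein_distance(preds[in_k], preds)
        # Equal opportunity: the same comparison restricted to the outcome-positive stratum
        m_eq_opp += wasserstein_distance(preds[in_k & (y == 1)], preds[y == 1])
        # Equalized odds: sum of the comparisons over both outcome strata
        for outcome in (0, 1):
            m_eq_odds += wasserstein_distance(
                preds[in_k & (y == outcome)], preds[y == outcome]
            )
    return {"M_DP": m_dp, "M_EqOpp": m_eq_opp, "M_EqOdds": m_eq_odds}

# Example usage with synthetic data
rng = np.random.default_rng(0)
groups = rng.choice(["a", "b"], size=1000)
y = rng.binomial(1, 0.2, size=1000)
preds = np.clip(rng.beta(2, 8, size=1000) + 0.05 * (groups == "a"), 0, 1)
print(emd_parity_metrics(preds, y, groups))
```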
These metrics may be further decomposed into group- and outcome-specific components. For instance, if

$$M_{\mathrm{DP}}^{k} = D\big(P(f_\theta \mid A = A_k),\; P(f_\theta)\big), \tag{4}$$

then $M_{\mathrm{DP}} = \sum_{k=1}^{K} M_{\mathrm{DP}}^{k}$. Similarly, if

$$M_{\mathrm{EqOpp}}^{k} = D\big(P(f_\theta \mid A = A_k, Y = 1),\; P(f_\theta \mid Y = 1)\big) \tag{5}$$

and

$$M_{\mathrm{EqOdds}}^{k,y} = D\big(P(f_\theta \mid A = A_k, Y = y),\; P(f_\theta \mid Y = y)\big), \tag{6}$$

then $M_{\mathrm{EqOpp}} = \sum_{k=1}^{K} M_{\mathrm{EqOpp}}^{k}$ and $M_{\mathrm{EqOdds}} = \sum_{y \in \{0,1\}} \sum_{k=1}^{K} M_{\mathrm{EqOdds}}^{k,y}$.
As an alternative, we also consider metrics based on the difference in means between the distributions of interest. As before, these metrics decompose into group-specific components. For instance, if we define the difference in the mean predicted probability of the outcome for patients in group k versus the marginal distribution (denoted $\mu_{\mathrm{DP}}^{k}$) as

$$\mu_{\mathrm{DP}}^{k} = \hat{E}\big[f_\theta(X) \mid A = A_k\big] - \hat{E}\big[f_\theta(X)\big], \tag{7}$$

then an aggregate metric of demographic parity violation can be constructed as

$$M_{\mathrm{DP}}^{\mathrm{mean}} = \sum_{k=1}^{K} \big(\mu_{\mathrm{DP}}^{k}\big)^2. \tag{8}$$

Expressions for equal opportunity and equalized odds can be constructed analogously by nesting these comparisons within strata defined by the observed outcomes. If

$$\mu_{\mathrm{EqOpp}}^{k} = \hat{E}\big[f_\theta(X) \mid A = A_k, Y = 1\big] - \hat{E}\big[f_\theta(X) \mid Y = 1\big] \tag{9}$$

and

$$\mu_{\mathrm{EqOdds}}^{k,y} = \hat{E}\big[f_\theta(X) \mid A = A_k, Y = y\big] - \hat{E}\big[f_\theta(X) \mid Y = y\big], \tag{10}$$

then

$$M_{\mathrm{EqOpp}}^{\mathrm{mean}} = \sum_{k=1}^{K} \big(\mu_{\mathrm{EqOpp}}^{k}\big)^2 \tag{11}$$

and

$$M_{\mathrm{EqOdds}}^{\mathrm{mean}} = \sum_{y \in \{0,1\}} \sum_{k=1}^{K} \big(\mu_{\mathrm{EqOdds}}^{k,y}\big)^2. \tag{12}$$
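A corresponding sketch for the mean-difference variants of Equations (7)–(12), again with illustrative names and inputs assumed to be NumPy arrays in which each group has observations in both outcome strata:

```python
# Minimal sketch of the mean-difference metrics: group-level differences in
# mean predictions, aggregated as sums of squared differences.
import numpy as np

def mean_diff_metrics(preds, y, groups):
    preds, y, groups = map(np.asarray, (preds, y, groups))
    m_dp, m_eq_opp, m_eq_odds = 0.0, 0.0, 0.0
    for k in np.unique(groups):
        in_k = groups == k
        # Demographic parity component: group mean prediction vs. marginal mean
        m_dp += (preds[in_k].mean() - preds.mean()) ** 2
        # Equal opportunity component: same comparison in the outcome-positive stratum
        m_eq_opp += (preds[in_k & (y == 1)].mean() - preds[y == 1].mean()) ** 2
        # Equalized odds: squared differences summed over both outcome strata
        for outcome in (0, 1):
            m_eq_odds += (
                preds[in_k & (y == outcome)].mean() - preds[y == outcome].mean()
            ) ** 2
    return {"M_DP_mean": m_dp, "M_EqOpp_mean": m_eq_opp, "M_EqOdds_mean": m_eq_odds}
```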
2.1.3. Fairness Criteria Based on Calibration
We construct a measure of fairness in calibration to estimate the extent to which observed event rates conditioned on the predicted risk differ across groups. We do so by first presenting a measure of absolute calibration, which can be understood as a performance metric that assesses overall model calibration, before extending that measure to account for relative differences in calibration across groups.
Calibration in the context of clinical risk prediction typically refers to the extent to which probabilistic predictions are faithful estimates of the observed event rates, such that a model is well-calibrated if P(Y = 1 | fθ = p) = p for p ∈ [0, 1]. In other words, a model is well-calibrated if the outcome is observed p ∗ 100% of the time among patients whose prediction is p.
The absolute calibration error (ACE) assesses deviations from perfect calibration, relying on an auxiliary estimator gϕ : [0, 1] → [0, 1] that provides an approximation of P(Y = 1 | fθ):
$$\mathrm{ACE} = \hat{E}\Big[\big(g_\phi(f_\theta(X)) - f_\theta(X)\big)^2\Big] \tag{13}$$
This measure may be interpreted as an instance of the weighted mean squared calibration error proposed in Yadlowsky et al. [69], used here in the context of uncensored binary outcomes, and is also similar to the integrated calibration index proposed in Austin et al. [70]. In addition, we consider a signed variant of the metric:
$$\mathrm{ACE}_{\mathrm{signed}} = \hat{E}\Big[g_\phi(f_\theta(X)) - f_\theta(X)\Big] \tag{14}$$
The signed measure can assess the directionality of mis-calibration, but may be misleading when positive deviations offset negative ones. For clarity, a positive value for the signed measure corresponds to under-prediction of risk, as it corresponds to an excess number of observed outcomes given the risk estimates.
In the context of group fairness, it is relevant to reason about model calibration in a relative sense. The matching conditional frequencies [22, 58] (MCF) or sufficiency [71] condition requires that the probability of the outcome not differ across groups conditioned on the value of the risk score i.e. Y ⊥ A | fθ. An alternative (but not equivalent) formulation is to require the risk score to be well-calibrated within groups [57, 72]. We present a measure of relative calibration to estimate the extent to which observed event rates conditioned on the predicted risk differ across groups, thus assessing the violation of the MCF criterion. Given an auxiliary estimator gϕ that estimates the marginal density P(Y = 1 | fθ) for the entire population and an estimator gk that estimates P(Y = 1 | fθ, A = Ak), the relative calibration error (RCE) for group Ak is defined as
$$\mathrm{RCE}_{k} = \hat{E}\Big[\big(g_k(f_\theta(X)) - g_\phi(f_\theta(X))\big)^2 \,\Big|\, A = A_k\Big] \tag{15}$$
with the corresponding signed metric defined as
$$\mathrm{RCE}_{k,\mathrm{signed}} = \hat{E}\Big[g_k(f_\theta(X)) - g_\phi(f_\theta(X)) \,\Big|\, A = A_k\Big] \tag{16}$$
The measures that we present here depend on the form of the estimators gϕ and gk. In our experiments, we use logistic regression as an estimator. Alternatives include loess regression [70], kernel density estimates over P(fθ | Y ) combined with simple conditional probability rules, or binning estimators.
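The sketch below illustrates one way to estimate these quantities, assuming the squared-error forms reconstructed in Equations (13)–(16) and a logistic regression recalibration model fit on log-transformed predictions; names such as fit_calibration_curve are ours, and the handling of edge cases (e.g., a group whose observed outcomes all take one value) is omitted.

```python
# Minimal sketch of the absolute and relative calibration measures: an auxiliary
# logistic regression on log(f) approximates P(Y=1 | f) marginally and within
# each group, and the discrepancies are averaged over the evaluation data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_calibration_curve(preds, y):
    """Return a function g(f) ~= P(Y=1 | f), fit on log-transformed predictions."""
    eps = 1e-12
    x = np.log(np.clip(preds, eps, 1.0)).reshape(-1, 1)
    model = LogisticRegression(solver="lbfgs").fit(x, y)
    return lambda f: model.predict_proba(
        np.log(np.clip(f, eps, 1.0)).reshape(-1, 1)
    )[:, 1]

def calibration_metrics(preds, y, groups):
    preds, y, groups = map(np.asarray, (preds, y, groups))
    g_marginal = fit_calibration_curve(preds, y)
    out = {
        "ACE": np.mean((g_marginal(preds) - preds) ** 2),
        "signed_ACE": np.mean(g_marginal(preds) - preds),
    }
    for k in np.unique(groups):
        in_k = groups == k
        g_k = fit_calibration_curve(preds[in_k], y[in_k])
        diff = g_k(preds[in_k]) - g_marginal(preds[in_k])
        out[f"RCE_{k}"] = np.mean(diff ** 2)
        out[f"signed_RCE_{k}"] = np.mean(diff)
    return out
```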
2.1.4. Cross-Group Ranking Measures
We now consider notions of cross-group ranking performance [73, 74]. These measures can be understood as variations on the area under the receiver operating characteristic curve (AUROC). The AUROC is a typical metric for assessing the performance of a clinical predictive model. This metric has an equivalent interpretation as the probability with which predictions for members of the positive class are ranked above those for members of the negative class:
$$\mathrm{AUROC} = P\big(f_\theta(X_i) > f_\theta(X_j) \mid Y_i = 1, Y_j = 0\big) \tag{17}$$
A comparison of the AUROC across groups can be misleading as a comparison of relative model quality and as a fairness criterion, since that comparison does not account for the degree to which positive examples of a group are ranked above negative examples of other groups. In order to gain insight into this phenomenon, we consider multi-group extensions to measures defined in the xAUC framework [73, 74] and consider deviations in these measures across groups as a violation of a measure of group fairness. We define xAUC_k^1 as the probability with which positive instances of group Ak are ranked above negative instances of all other groups:

$$\mathrm{xAUC}_{k}^{1} = P\big(f_\theta(X_i) > f_\theta(X_j) \mid Y_i = 1, A_i = A_k, Y_j = 0, A_j \neq A_k\big) \tag{18}$$
Similarly, we define xAUC_k^0 as the probability with which negative instances of group Ak are ranked below positive instances of all other groups:

$$\mathrm{xAUC}_{k}^{0} = P\big(f_\theta(X_i) > f_\theta(X_j) \mid Y_i = 1, A_i \neq A_k, Y_j = 0, A_j = A_k\big) \tag{19}$$
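A direct, if inefficient, pairwise estimator of these cross-group ranking measures might look as follows (illustrative names; ties are counted as discordant for simplicity):

```python
# Minimal sketch of the cross-group ranking measures in Equations (18)-(19):
# positives of one group are compared against negatives of all other groups,
# and vice versa, using an explicit pairwise comparison.
import numpy as np

def xauc_measures(preds, y, groups):
    preds, y, groups = map(np.asarray, (preds, y, groups))
    out = {}
    for k in np.unique(groups):
        in_k = groups == k
        # Positives of group k vs. negatives of all other groups
        pos_k = preds[in_k & (y == 1)]
        neg_other = preds[~in_k & (y == 0)]
        out[f"xAUC1_{k}"] = np.mean(pos_k[:, None] > neg_other[None, :])
        # Positives of all other groups vs. negatives of group k
        pos_other = preds[~in_k & (y == 1)]
        neg_k = preds[in_k & (y == 0)]
        out[f"xAUC0_{k}"] = np.mean(pos_other[:, None] > neg_k[None, :])
    return out
```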
2.1.5. Regularized Objectives for Conditional Prediction Parity
Several techniques for constructing models that satisfy measures of group fairness have been proposed, including representation learning [35, 75–77], post-processing [22], and constrained or regularized learning objectives [36–38, 66, 78]. We focus our attention on regularized learning objectives that penalize violations of measures of threshold-free conditional prediction parity. Given a loss function ℒ(θ), such as the negative log-likelihood, the general form of this objective is as follows:

$$\min_{\theta}\; \mathcal{L}(\theta) + \lambda R(\theta) \tag{20}$$
where R is a non-negative regularizer indicative of the extent to which the fairness criterion is violated, and λ is a non-negative scalar. In general, tuning the value of λ allows for exploring the trade-offs between measures of model performance and fairness.
In our experiments, we evaluate six regularization strategies to penalize the violation of threshold-free conditional prediction parity measures. These strategies correspond to regularizers that penalize violation of demographic parity, equal opportunity, and equalized odds, either with regularizers of the form of Equations (1), (2), or (3), respectively, using an empirical maximum mean discrepancy (MMD) [79] with a Gaussian kernel to estimate D, or with regularizers in the form of Equations (8), (11), or (12), respectively, that penalize the sum of the squared difference in the mean prediction in the relevant outcome strata. During training, the value of the regularizer R is computed on the basis of comparisons between the distributions of log fθ, computed via a log-softmax transformation of the model logits, rather than on the basis of fθ directly. In some instances, we refer to regularizers that penalize violation of demographic parity, equalized odds, and equal opportunity as unconditional, conditional, and positive conditional, respectively, corresponding to the outcome strata used to evaluate the regularizer, and include a modifier for either MMD or mean to indicate how the relevant distributions are compared.
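The following PyTorch sketch illustrates an objective of this general form, assuming a Gaussian-kernel MMD penalty with a single fixed bandwidth, one-vs-marginal comparisons computed per minibatch, and the binary log-sigmoid of the logits in place of the log-softmax transformation described above; it is a simplified illustration rather than the exact training code.

```python
# Minimal sketch of a penalized objective in the form of Equation (20):
# cross entropy plus lambda times an MMD-based conditional prediction parity penalty.
import torch
import torch.nn.functional as F

def gaussian_mmd(x, y, sigma=1.0):
    """Biased empirical MMD^2 between 1-D samples x and y with a Gaussian kernel."""
    def kernel(a, b):
        return torch.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

def penalized_loss(logits, y, groups, lam, mode="demographic_parity"):
    """logits, y, groups: 1-D tensors for a minibatch; lam: penalty weight."""
    loss = F.binary_cross_entropy_with_logits(logits, y.float())
    log_p = F.logsigmoid(logits)  # log of the predicted probability of the outcome
    penalty = logits.new_zeros(())
    strata = [torch.ones_like(y, dtype=torch.bool)]   # unconditional (demographic parity)
    if mode == "equal_opportunity":
        strata = [y == 1]                             # positive conditional
    elif mode == "equalized_odds":
        strata = [y == 1, y == 0]                     # conditional on both outcome strata
    for stratum in strata:
        for k in torch.unique(groups):
            in_k = (groups == k) & stratum
            # Compare the group's distribution to the marginal within the stratum
            if in_k.sum() > 1 and stratum.sum() > 1:
                penalty = penalty + gaussian_mmd(log_p[in_k], log_p[stratum])
    return loss + lam * penalty
```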
2.2. Experiments
2.2.1. Datasets
STARR
The Stanford Medicine Research Data Repository (STARR) [80] is a clinical data warehouse containing records from approximately three million patients from Stanford Hospitals and Clinics and the Lucile Packard Children’s Hospital for inpatient and outpatient clinical encounters that occurred between 1990 and 2020. This database contains structured longitudinal data in the form of diagnoses, procedures, medications, and laboratory tests that have been mapped to standard concept identifiers in the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) version 5.3.1 [81–83]. De-identified clinical notes with clinical concepts extracted and annotated with negation and family-history detection are also made available [80].
Optum©’s de-identified Clinformatics® Data Mart Database
Optum©’s de-identified Clinformatics® Data Mart Database (Optum CDM) is a statistically de-identified large commercial and Medicare Advantage claims database. The database includes approximately 17–19 million annual covered lives, for a total of over 57 million unique lives over a 9-year period (1/2007 through 12/2017). We utilize a variant of the database that makes available the month and date of death, sourced from internal and external sources including the Death Master File maintained by the Social Security Administration, as well as records from the Centers for Medicare and Medicaid Services. However, this version of the data does not provide access to detailed socioeconomic, demographic, or geographic variables. We utilize version 7.1 of the data mapped to OMOP CDM v5.3.1.
MIMIC-III
The Medical Information Mart for Intensive Care-III (MIMIC-III) [84] is a widely studied database containing comprehensive physiologic information from approximately fifty thousand intensive care unit (ICU) patients admitted to the Beth Israel Deaconess Medical Center between 2001 and 2012. We utilize MIMIC-OMOP, a variant of the database that has been mapped to OMOP CDM v5.3.1.
2.2.2. Cohort, Outcome, and Attribute Definitions
For each database, we define analogous cohort, outcome, attribute, and group definitions that differ on the basis of the availability of data elements. For the STARR and Optum CDM databases, we consider the set of inpatient hospital admissions that span at least two distinct calendar dates and for which patients were 18 years of age or older at the time of admission. For admissions in the STARR database, the start of the admission is variably defined as the date and time of admission to the emergency department, if available, and admission to the hospital otherwise. For patients with more than one admission meeting these criteria, we randomly sample one admission. In each case, we consider the start of an admission as the index date and time. For each admission, we derive binary outcome labels for prolonged length of stay (defined as a hospital length of stay greater than or equal to seven days) and 30-day readmission (defined as a subsequent admission within thirty days of discharge of the considered admission). For admissions derived from the STARR database, we also derive a binary outcome label for in-hospital mortality. This procedure returns admissions from 198,644 patients in STARR (Table 1) and 8,074,571 patients in Optum CDM (Supplementary Table A.1).
Table 1:
Cohort characteristics for patients drawn from STARR. Data are grouped on the basis of age, sex, and the race and ethnicity category. Shown, for each group, is the number of patients extracted and the incidence of hospital mortality, prolonged length of stay, and 30-day readmission.

| Group | Count | Hospital Mortality | Prolonged Length of Stay | 30-Day Readmission |
|---|---|---|---|---|
| [18–30) | 23,042 | 0.00681 | 0.175 | 0.0461 |
| [30–45) | 43,432 | 0.00596 | 0.130 | 0.0396 |
| [45–55) | 27,394 | 0.0178 | 0.205 | 0.0527 |
| [55–65) | 35,703 | 0.0251 | 0.227 | 0.0558 |
| [65–75) | 36,084 | 0.0284 | 0.234 | 0.0548 |
| [75–90) | 32,989 | 0.0400 | 0.238 | 0.0545 |
| Female | 112,713 | 0.0161 | 0.166 | 0.0452 |
| Male | 85,923 | 0.0271 | 0.244 | 0.0571 |
| Asian | 29,460 | 0.0209 | 0.171 | 0.0536 |
| Black | 7,813 | 0.0198 | 0.240 | 0.0581 |
| Hispanic | 33,742 | 0.0180 | 0.193 | 0.0544 |
| Other | 20,270 | 0.0327 | 0.226 | 0.0442 |
| White | 107,359 | 0.0196 | 0.202 | 0.0488 |
For MIMIC-III, we replicate in MIMIC-OMOP the cohort and outcome definitions defined as benchmarks in the MIMIC-Extract project [85]. In particular, we consider hospital admissions associated with each patient’s first ICU stay, further restricting the set of allowable ICU stays to those lasting between twelve hours and ten days. We set the index date and time to be twenty-four hours after hospital admission, and further restrict the set of admissions to those for which the index time is at least six hours prior to ICU discharge and for which patients are between 15 and 89 years old at the index time. We define four binary outcomes, corresponding to ICU length of stay greater than three days, ICU length of stay greater than seven days, mortality over the course of the ICU stay, and mortality over the course of the hospital admission, following the approach of Wang et al. [85]. This procedure returns 26,170 patients (Supplementary Table A.2).
For the purposes of evaluating the extent to which a model satisfies group fairness criteria, we define discrete groups of the population on the basis of demographic attributes. We consider (1) a combined race and ethnicity variable based on self-reported racial and ethnic categories, (2) sex, and (3) age at the index date, discretized into categories at 18–29, 30–44, 45–54, 55–64, 65–74, 75–89 years for STARR and Optum cohorts and 15–29, 30–44, 45–54, 55–64, 65–74, 75–89 years for MIMIC-III.
The race and ethnicity attribute is constructed by assigning Hispanic if the ethnicity is recorded as Hispanic, and the value of the recorded racial category otherwise. In line with the categories provided by the upper level of the OMOP CDM vocabulary, we consider groups corresponding to “Asian”, “American Indian or Alaska Native”, “Black or African American”, “Native Hawaiian or Other Pacific Islander”, “Other”, and “White”. We additionally aggregate groups of the race and ethnicity attribute with the “Other” category when few observed outcomes are available within a group in a database specific manner. This reduces the categorization to “Asian”, “Black or African American”, “Hispanic”, “Other”, and “White” for STARR, and to “Other” and “White” for MIMIC-III. Race and ethnicity data is not available for the version of the Optum CDM database that we use. We further contextualize this operationalization of race and ethnicity in Section 4.
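A minimal pandas sketch of this attribute construction, assuming hypothetical column names (race, ethnicity, age_at_index) and the STARR category set:

```python
# Assign "Hispanic" when ethnicity is recorded as Hispanic, otherwise use the
# recorded race category, then collapse sparse categories into "Other" and
# discretize age into the left-closed bins used for STARR and Optum cohorts.
import pandas as pd

def build_attributes(df, keep=("Asian", "Black or African American", "Hispanic", "White")):
    race_ethnicity = df["race"].where(df["ethnicity"] != "Hispanic", "Hispanic")
    return df.assign(
        race_ethnicity=race_ethnicity.where(race_ethnicity.isin(keep), "Other"),
        age_group=pd.cut(
            df["age_at_index"],
            bins=[18, 30, 45, 55, 65, 75, 90],
            right=False,  # intervals such as [18, 30), ..., [75, 90)
        ),
    )
```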
2.2.3. Feature Extraction
We extract clinical features with a generic procedure that operates on OMOP CDM databases, similar to Reps et al. [83]. A graphical depiction of this procedure is provided in Supplementary Figure B.1. The process returns 539,823, 438,369, and 21,026 unique features in the STARR, Optum CDM, and MIMIC-III cohorts, respectively. In brief, we consider a set of binary features based on the occurrence of both unique OMOP CDM concepts and derived elements prior to the index date and time. We extract all diagnoses, medication orders, procedures, device exposures, encounters, and labs, assigning a unique feature identifier for the presence of each OMOP CDM concept, in time intervals defined relative to the index date and time. For lab test results, we construct additional features for lab results above and below the corresponding reference range, if available, and indicators for whether the result belongs to the empirical quintiles for the corresponding lab result observed in the time interval. For STARR data, we include additional features for the presence of clinical concepts derived from clinical notes, modified by an indicator for whether the concept corresponds to a present and positive mention [80]. For MIMIC-III, we do not use diagnoses or clinical notes to derive features. We include time-agnostic demographic features, including race, ethnicity, gender, and age group (discretized in five year intervals), using the OMOP CDM concept identifiers directly rather than the attribute definitions described previously.
We repeat the feature extraction procedure for a set of time-intervals defined relative to the index date and time in a database-specific manner. For the STARR and Optum CDM cohorts, we define intervals at 29 to 1 days prior to the index date, 89 to 30 days prior, 179 to 90 days prior, 364 to 180 days prior, and any time prior. For STARR and MIMIC-III data, we include additional time-intervals defined only over the subset of the data elements recorded with date and time resolution, repeating each extraction procedure described for which that set is not empty. For STARR, these intervals correspond to 4 hours prior to the index time, 12 hours to 4 hours prior to the index time, 24 to 12 hours prior, three days to 24 hours prior, and seven days to three days prior. For MIMIC-III, these intervals correspond to 4 hours prior to the index time, 12 hours to 4 hours prior to the index time, 24 to 12 hours prior, and any time prior.
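The sketch below illustrates the general shape of this procedure for a single OMOP domain, assuming hypothetical long-format tables with person_id, concept_id, and event_datetime columns; the real pipeline covers multiple domains, derived lab features, note-derived concepts, and finer time intervals.

```python
# Minimal sketch: one binary feature per (concept, time interval) pair observed
# prior to the index datetime, pivoted into a person x feature indicator matrix.
import pandas as pd

INTERVALS = {
    "1_29d": (1, 29),
    "30_89d": (30, 89),
    "90_179d": (90, 179),
    "180_364d": (180, 364),
    "anytime": (0, None),
}

def binary_interval_features(events, index_times):
    """events: (person_id, concept_id, event_datetime); index_times: (person_id, index_datetime)."""
    df = events.merge(index_times, on="person_id")
    df["days_prior"] = (df["index_datetime"] - df["event_datetime"]).dt.days
    df = df[df["days_prior"] >= 0]  # keep only history prior to the index date and time
    pieces = []
    for name, (lo, hi) in INTERVALS.items():
        mask = df["days_prior"] >= lo
        if hi is not None:
            mask &= df["days_prior"] <= hi
        sub = df.loc[mask, ["person_id", "concept_id"]].drop_duplicates()
        sub["feature"] = sub["concept_id"].astype(str) + "_" + name
        pieces.append(sub[["person_id", "feature"]])
    long = pd.concat(pieces, ignore_index=True).assign(value=1)
    return long.pivot_table(index="person_id", columns="feature", values="value", fill_value=0)
```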
2.2.4. Model Training
We conduct a modified cross-validation procedure designed to enable robust model selection and evaluation. For each cohort, a randomly sampled partition of 10% of the patients is set aside as a test set for final evaluation. For STARR and MIMIC-III, we further partition the remaining data into ten equally-sized folds, each of which can be considered as a validation set corresponding to a training set composed of the remainder of the folds. Due to the computational constraints imposed by the large size of the Optum CDM cohort, we do not perform cross-validation and instead randomly partition the data such that 81% of the patients are used for training, 9% of the patients are used for validation, and 10% used for testing.
In all cases, we leverage fully-connected feedforward neural networks for prediction, as they enable scalable and flexible learning for differentiable objectives in the form of Equation (20). Tuning of the unpenalized models begins with a random sample of fifty hyperparameter configurations from a grid of architectural and training hyperparameters (Supplementary Table C.1). For each combination of dataset, outcome, training-validation partition, and hyperparameter configuration, we train a model with the Adam [86] optimizer for up to 150 iterations of 100 batches, terminating early if the cross-entropy loss on the validation set does not improve for 10 iterations. For each combination of dataset and outcome, we select model hyperparameters on the basis of the mean validation log-loss across folds (the selected hyperparameters are provided in Supplementary Tables C.2, C.4, and C.3). For models trained on the Optum CDM cohort, we select model hyperparameters on the basis of the log-loss measured on the validation set. We then train models with regularization that penalizes fairness criteria violation using an objective in the form of Equation (20) for ten values of λ distributed log-uniformly on the interval 10⁻³ to 10, repeating the process separately for each dataset, task, attribute, and form of regularizer (of which there are six, as discussed in Section 2.1.5), holding the model hyperparameters fixed to those selected for the corresponding unpenalized model. As before, we train each model for up to 150 iterations of 100 batches, but perform early stopping and model selection on the basis of the penalized loss that incorporates the regularization term. PyTorch version 1.5.0 [87] is used to define all models and training procedures.
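A condensed sketch of this training loop, reusing the hypothetical penalized_loss from the sketch in Section 2.1.5; architecture construction, learning-rate and other hyperparameter choices, and data loading are omitted.

```python
# Minimal sketch: Adam, up to 150 iterations of 100 batches, early stopping on
# the penalized validation loss, restoring the best parameters at the end.
import torch

def train(model, train_loader, val_batch, lam, mode, patience=10, max_iters=150):
    optimizer = torch.optim.Adam(model.parameters())
    best_loss, best_state, since_best = float("inf"), None, 0
    for _ in range(max_iters):
        model.train()
        for i, (x, y, a) in enumerate(train_loader):
            if i >= 100:  # 100 batches per iteration
                break
            loss = penalized_loss(model(x).squeeze(-1), y, a, lam, mode)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            x, y, a = val_batch
            val_loss = penalized_loss(model(x).squeeze(-1), y, a, lam, mode).item()
        if val_loss < best_loss:
            best_loss, since_best = val_loss, 0
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
        else:
            since_best += 1
            if since_best >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```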
2.2.5. Model Evaluation
We report model performance and fairness metrics as the mean ± the standard deviation (SD) for metrics evaluated on the held-out test set for the set of ten models derived from the procedure described in Section 2.2.4. For models trained on the Optum CDM cohort, we report results on the test set for the selected model without an associated standard deviation. To assess within-group model performance, we report the AUROC, average precision, and cross entropy loss at baseline and each value of λ. To assess conditional prediction parity violation, we report on the decomposed group-specific components of the metrics presented in Section 2.1. In particular, we report the EMD and difference in means between the distribution of predictions for each group and the marginal distribution constructed via aggregation of predictions from all groups (Equations (4) and (7)), respectively, as metrics that assess violations of demographic parity. We repeat the process in the strata of the population for which the outcome is and is not observed, as metrics that assess violations of equalized odds and equal opportunity (Equations (6), (10), (5), and (9)). We report on cross-group ranking discrepancies in the form of Equations (18) and (19) for each group.
To assess absolute and relative model calibration, we estimate ACE and RCE post-hoc on the test set. To do so, for each predictive model fθ, we train an auxiliary logistic regression model to estimate P(Y | fθ) on the basis of log(fθ) for the aggregate test set and for each group. The resulting group-level estimates of ACE and RCE are constructed by plugging in the data to the relevant estimators using Equations (13), (14), (15), and (16). The logistic regression models are fit using the L-BFGS [88] algorithm implemented in Scikit-Learn [89]. We report ACE alongside model performance measures and RCE alongside other fairness metrics.
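A sketch of the group-level performance evaluation and of the aggregation across fold-specific models (illustrative names; inputs assumed to be NumPy arrays with both outcome classes present in each group):

```python
# Minimal sketch: per-group AUROC, average precision, and cross entropy loss on
# a test set, aggregated as mean and SD across the ten fold-specific models.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, log_loss

def group_performance(preds, y, groups):
    preds, y, groups = map(np.asarray, (preds, y, groups))
    results = {}
    for k in np.unique(groups):
        in_k = groups == k
        results[k] = {
            "auroc": roc_auc_score(y[in_k], preds[in_k]),
            "avg_precision": average_precision_score(y[in_k], preds[in_k]),
            "ce_loss": log_loss(y[in_k], preds[in_k], labels=[0, 1]),
        }
    return results

def mean_sd(per_model_results):
    """per_model_results: list of group_performance outputs, one per fold model."""
    return {
        k: {
            metric: (
                np.mean([r[k][metric] for r in per_model_results]),
                np.std([r[k][metric] for r in per_model_results]),
            )
            for metric in per_model_results[0][k]
        }
        for k in per_model_results[0]
    }
```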
3. Results
Given the breadth of experimentation conducted, and in the interest of brevity, we focus our reporting on general trends that replicate across experimental conditions and on notable exceptions to those trends. In the main text, we focus on the results for the models derived on STARR, and provide, as examples, figures for models that predict 30-day readmission in the STARR cohort using MMD-based penalties, and exclude results for decomposed fairness metrics computed on the strata of the population for which the relevant outcome is not observed. Analogous figures for the complete set of experimental conditions and evaluation metrics are provided in the Appendix.
We observe that, in the absence of fairness-promoting regularization, models exhibit substantial differences in group-level model performance measures (AUROC, average precision, and cross entropy loss), and show clear violation of measures of conditional prediction parity as well as differences in cross-group ranking performance (Figures 2, 3, 4, 5, 6, and 7). These baseline models tend to have low absolute calibration error for each group, with small differences in relative calibration error across groups. For example, the model that predicts 30-day readmission in the STARR cohort shows a large degree of variability in AUROC across groups of the race and ethnicity attribute (Mean AUROC: 0.78, 0.66, 0.77, 0.80, 0.71 for the Asian, Black, Hispanic, Other, and White groups, respectively; Figure 2) and further shows a difference in the mean predicted probability across groups that disproportionately affects the Black population (Figure 3). However, the signed absolute and relative calibration errors are small for each group (Figures 2 and 3).
Figure 2:
Group-level model performance measures as a function of the extent λ that violation of the fairness criterion is penalized when the race and ethnicity category is considered as the sensitive attribute for prediction of 30-day readmission in the STARR database. Results shown are the mean ± SD for the area under the ROC curve (AUROC), average precision (Avg. Prec), the cross entropy loss (CE Loss), the absolute calibration error (ACE), the signed absolute calibration error (Sign. ACE), and cross-group ranking performance (xAUC; xAUC_k^1 is indicated by (y=1) and xAUC_k^0 by (y=0)) for each group for objectives that penalize violation of threshold-free Demographic Parity, Equalized Odds, and Equal Opportunity with MMD-based penalties. Dashed lines correspond to the mean result for the unpenalized training procedure.
Figure 3:
Fairness metrics as a function of the extent λ that violation of the fairness criterion is penalized when the race and ethnicity category is considered as the sensitive attribute for prediction of 30-day readmission in the STARR database. Results shown are the mean ± SD for decomposed group-level metrics that assess conditional prediction parity (EMD and Mean Diff.) and relative calibration error (RCE and Sign. RCE) for objectives that penalize violation of threshold-free Demographic Parity, Equalized Odds, and Equal Opportunity on the basis of MMD-based penalties. Measures of conditional prediction parity are separately assessed in the whole population and in the strata for which the outcome is observed (suffixed with (y=1)). Dashed lines correspond to the mean result for the unpenalized training procedure.
Figure 4:
Group-level model performance measures as a function of the extent λ that violation of the fairness criterion is penalized when sex is considered as the sensitive attribute for prediction of 30-day readmission in the STARR database. Results shown are the mean ± SD for the area under the ROC curve (AUROC), average precision (Avg. Prec), the cross entropy loss (CE Loss), the absolute calibration error (ACE), the signed absolute calibration error (Sign. ACE), and cross-group ranking performance (xAUC; xAUC_k^1 is indicated by (y=1) and xAUC_k^0 by (y=0)) for each group for objectives that penalize violation of threshold-free Demographic Parity, Equalized Odds, and Equal Opportunity with MMD-based penalties. Dashed lines correspond to the mean result for the unpenalized training procedure.
Figure 5:
Fairness metrics as a function of the extent λ that violation of the fairness criterion is penalized when sex is considered as the sensitive attribute for prediction of 30-day readmission in the STARR database. Results shown are the mean ± SD for decomposed group-level metrics that assess conditional prediction parity (EMD and Mean Diff.) and relative calibration error (RCE and Sign. RCE) for objectives that penalize violation of threshold-free Demographic Parity, Equalized Odds, and Equal Opportunity on the basis of MMD-based penalties. Measures of conditional prediction parity are separately assessed in the whole population and in the strata for which the outcome is observed (suffixed with (y=1)). Dashed lines correspond to the mean result for the unpenalized training procedure.
Figure 6:
Group-level model performance measures as a function of the extent λ that violation of the fairness criterion is penalized when the age group is considered as the sensitive attribute for prediction of 30-day readmission in the STARR database. Results shown are the mean ± SD for the area under the ROC curve (AUROC), average precision (Avg. Prec), the cross entropy loss (CE Loss), the absolute calibration error (ACE), the signed absolute calibration error (Sign. ACE), and cross-group ranking performance (xAUC; xAUC_k^1 is indicated by (y=1) and xAUC_k^0 by (y=0)) for each group for objectives that penalize violation of threshold-free Demographic Parity, Equalized Odds, and Equal Opportunity with MMD-based penalties. Dashed lines correspond to the mean result for the unpenalized training procedure.
Figure 7:
Fairness metrics as a function of the extent λ that violation of the fairness criterion is penalized when the age group is considered as the sensitive attribute for prediction of 30-day readmission in the STARR database. Results shown are the mean ± SD for decomposed group-level metrics that assess conditional prediction parity (EMD and Mean Diff.) and relative calibration error (RCE and Sign. RCE) for objectives that penalize violation of threshold-free Demographic Parity, Equalized Odds, and Equal Opportunity on the basis of MMD-based penalties. Measures of conditional prediction parity are separately assessed in the whole population and in the strata for which the outcome is observed (suffixed with (y=1)). Dashed lines correspond to the mean result for the unpenalized training procedure.
As expected, training with an objective that penalizes violation of a measure of conditional prediction parity typically leads to better satisfaction of the fairness criterion that corresponds to the form of the regularizer used in the objective. For instance, training with an unconditional penalty to encourage demographic parity typically minimizes the EMD and difference in means between the distribution of predictions for each group and the corresponding marginal distribution constructed via aggregation of the data from all groups. In some cases, the regularization strategy is less successful at minimizing violation of the targeted fairness criteria, such as when using a conditional penalty in the outcome-positive stratum to encourage equal opportunity on the basis of sex for the 30-day readmission model in the STARR cohort (Figure 5I). In this case, the relevant fairness criterion is actually violated to a greater extent when λ = 10 (MEqOpp = 0.0069) than at baseline (MEqOpp = 0.0056).
We observe a tension between equal opportunity and demographic parity, but these trade-offs are not consistent across experimental conditions. For instance, conditional penalties in the stratum for which hospital mortality is observed in the STARR cohort lead to further violation of demographic parity, relative to baseline, across all three of the sensitive attributes that we test (Supplementary Figures D.2, D.4, and D.6). In other cases, this penalty actually leads to improved satisfaction of demographic parity, such as in the case of 30-day readmission prediction in the STARR cohort when age is the sensitive attribute (MDP: 0.041 for λ = 0 vs. MDP: 0.0010 for λ = 10; Figure 7C), but at the cost of a major reduction in AUROC and Average Precision for all groups (Figures 7C and 7F). A similar phenomenon is observed when considering the impact of unconditional penalties that target demographic parity on metrics that assess equal opportunity, in that the two criteria appear to coincide in some cases, such as for the model that predicts hospital mortality in the STARR cohort when race and ethnicity is considered the sensitive attribute (Supplementary Figure D.2), and conflict in others, such as when sex is considered to be the sensitive attribute for models that predict prolonged length of stay (Supplementary Figure D.20) or 30-day readmission (Supplementary Figure D.24) in the Optum CDM cohort.
With few exceptions, the effect of increasing the weight on the conditional regularization penalties that target equalized odds or equal opportunity is a monotonic reduction in group-level model performance measures for all groups. In contrast, the effects of unconditional penalties that encourage demographic parity are more heterogeneous. For instance, while the effects of unconditional penalties on model performance measures that we observe over the trajectory of λ are often similar to those that we observe for conditional penalties, we note that unconditional penalties can result in little change in model performance measures (Figures 4 and 6), and in some cases, actually result in improved model performance for one or more groups relative to baseline (Figure 2 and Supplementary Figures D.1 and D.3). For example, penalizing the violation of demographic parity increases the performance of the model that predicts 30-day readmission for the Black population when λ = 3.6 compared to baseline (AUROC: 0.69 vs. 0.66; Average Precision: 0.17 vs. 0.15) (Figures 2A and 2D).
In general, models become less well-calibrated, in the absolute sense, at the group level, as the weight on either conditional penalty increases, as measured by changes in the ACE or signed ACE relative to baseline. In many cases, unconditional penalties seem to have little impact on group-level model calibration relative to that observed for conditional penalties (Figure 4), whereas in other cases the effect of the unconditional penalty is similar to that of the conditional penalties (Figure 6). However, both unconditional and conditional penalties can, but do not always, introduce relative calibration error across groups, in a way that appears unrelated to the changes in absolute calibration. For example, for models that predict hospital mortality in the STARR cohort when age is the sensitive attribute, the magnitude of the effect on absolute calibration differs substantially on the basis of the type of regularization applied (Supplementary Figure D.5), while the effect on relative calibration is similar across all penalties (Supplementary Figure D.6). In cases where the effects on relative calibration are large, such as for this set of models, the effect concentrates in relatively few groups (Supplementary Figure D.6).
We observe heterogeneity in the manner in which training with fairness-promoting objectives impacts measures of cross-group ranking across the combinations of regularizer, dataset, outcome, and sensitive attribute. In many respects, the trajectories of xAUC measures are similar to those that we generally observe for the AUROC, in that the primary effect that we observe is a decline in cross-group ranking accuracy as a function of λ, regardless of the type of penalty selected (Figures 2Q, 2R, 2T, and 2U). However, in some cases, the trajectories of xAUC measures are convergent in a way that both improves fairness and allows for an improvement in the measure for at least one group at the expense of one or more other groups. In some cases, we observe this effect only for unconditional penalties (Supplementary Figures D.2, D.4, and D.6) and in others, we observe it for both unconditional and conditional penalties (Supplementary Figures D.10 and D.12).
4. Discussion
Our experiments aim to provide a comprehensive empirical evaluation of the effect of penalizing group fairness criteria violations on measures of model performance and group fairness for clinical predictive models. Our results reveal substantial heterogeneity in the effect of imposing measures of group fairness across datasets, outcomes, sensitive attribute and group definitions, and regularization strategies. These results quantify the extent to which the trade-offs among measures of model performance and group fairness described by “impossibility theorems” [57, 58, 72, 90] manifest when learning clinical predictive models from real-world databases.
We acknowledge technical limitations of our work that may limit the generalizability of our results. First, the regularizers and metrics used to quantify conditional prediction parity and relative calibration are the result of “one vs. marginal” comparisons where a measure computed for one group is compared to the measure computed for the aggregate population. This choice is one of several ways to construct fairness metrics, including “one vs. other” comparisons between one group and all other groups and pairwise comparisons across all pairs of groups. An effect of this choice is that metrics that assess violation of conditional prediction parity for an over-represented group are more likely to be small since the over-represented group comprises a larger fraction of the population than under-represented groups do. Furthermore, our use of penalized objectives could exaggerate the extent of the reported trade-offs, relative to the alternative of a Lagrangian formulation that directly encodes the fairness criteria as a constraint [36–38, 66, 78]. While the constrained approach typically only provides guarantees of constraint satisfaction in the case of a convex objective, recent work has demonstrated empirical success with a modified proxy-Lagrangian formulation that is effective for non-convex constrained optimization problems [36]. It remains to be seen whether reformulating the problem as constrained optimization allows for satisfaction of fairness constraints with less severe trade-offs than those reported here.
Our work inherits the fundamental limitations of the group fairness framework and of algorithmic fairness more broadly. The group fairness framework, which arose from legal notions of anti-discrimination, reinforces a perspective that groups based on categorical attributes are well-defined constructs that correspond to a set of homogeneous populations – a perspective that has several problematic implications. For example, the definitions of racial categories are entangled with historical and on-going patterns of structural racism, and their continued use reinforces the idea of race as an accurate way to describe human variability, rather than a socially constructed taxonomy [6, 11, 48, 50, 91–94]. This framework further marginalizes groups that are not well-represented by the attributes used to assess group fairness, including intersectional identities [48, 61, 95–97]. Furthermore, in addressing each attribute independently, the group fairness framework treats various sensitive attributes as abstract, interchangeable constructs, without awareness of meaningful contextual differences between them. For example, while observed differences on the basis of race should be primarily interpreted as deriving from systemic and structurally racist factors [6, 11, 48, 50, 91, 92], those observed for sex could potentially be attributed to clinically meaningful differences in human physiology as well as sociological factors [98].
Alternative forms of algorithmic fairness raise additional normative questions that require context-specific judgement and domain knowledge. Individual fairness measures require robustness over a metric space that encodes domain-specific norms, which concern how outputs of an algorithm may change over the space of observed covariates [21]. Counterfactual fairness provides a particular instantiation of individual fairness, defining closeness in the domain-specific metric in terms of counterfactuals with respect to a sensitive attribute [29, 99]. However, this requires specification of the causal pathways between the sensitive attribute, outcomes, and discrimination. It further assumes manipulability of sensitive attributes within the context of a well-defined structural equation model, which is particularly unrealistic for complex high-dimensional data and contestable whenever race is considered to be the sensitive attribute [48, 92].
4.1. Recommendations
Striving for health equity requires designing policies that directly counteract the systemic factors that contribute to health disparities, primarily structural forms of racism and economic inequality [11, 100]. By considering only changes to observable properties of a model, evaluations of models using the group fairness framework ignore the inequities in the data generating and measurement processes. They also miss the decision-theoretic and causal framing necessary to connect predictions to the interventions they trigger, as well as the resulting downstream effect on health disparities [2, 45, 101]. In the absence of this context, a requirement that a predictive model satisfy a notion of group fairness provides little more than a “veneer of neutrality” [17, 102]. Overall, constraining a model such that group fairness is achieved is insufficient for, and may actively work against, the goal of promoting health equity using machine learning guided interventions. To be sure, this lack of sufficiency does not mean that explicitly optimizing for fairness criteria satisfaction can not be useful. However, the value of achieving algorithmic fairness should be defined in terms of the impact of an algorithm-guided intervention on individuals, groups, and on status quo power structures that directly or indirectly perpetuate health disparities [103].
In light of these limitations, model developers in healthcare should engage in participatory design practices that explicitly incorporate perspectives from a diverse set of stakeholders, including patient advocacy groups and civil society organizations. Doing so is necessary to identify mechanisms through which measurement error, bias, and historical inequities affect data collection, measurement, and problem formulation, as well as to reason about the mechanisms by which the intervention triggered by the model’s prediction interacts with those factors [20, 104–106]. However, it is important to allow that the conclusion derived from this process may be to abstain from algorithm-aided decision making entirely if it is not practical to do so responsibly [107, 108].
5. Conclusion
The debate on the use of algorithmic fairness techniques in healthcare has largely proceeded without empirical characterization of the effects of these techniques on the properties of predictive models derived from large-scale clinical data. We explicitly measure and comprehensively report on the extent of the empirical trade-offs between measures of model performance and notions of group fairness such as conditional prediction parity, relative calibration, and cross-group ranking. These constructs are generally well-understood in theoretical contexts, but under-explored in the context of clinical predictive models. Given our results, and the known limitations of the algorithmic fairness framework, we recommend that the use of algorithmic fairness for proactive monitoring and auditing of a clinical predictive model proceed only if measures of model performance and fairness can be appropriately contextualized in terms of the impact of the complex intervention that the model enables.
Supplementary Material
6. Acknowledgements and Funding
We thank Conor Corbin, Jonathan Lu, Jennifer Wilson, Alison Callahan, Ethan Steinberg, Jason Fries, Erin Craig, Scott Fleming, Mila Hardt, and Steve Yadlowsky for insightful feedback and discussion. Data access for this project was provided by the Stanford Center for Population Health Sciences Data Core, the Stanford School of Medicine Research Office, and the Stanford Medicine Research IT team. We further thank the Stanford Medicine Research IT team for supporting STARR-OMOP, with special thanks to Jose Posada and Priya Desai, and the Stanford Research Computing Center for supporting the Nero computing platform. Approval for this non-human subjects research study is provided by the Stanford Institutional Review Board, protocol 24883. This work is supported by the National Science Foundation Graduate Research Fellowship Program DGE-1656518 and the Stanford Medicine Program for AI in Healthcare. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding bodies.
Footnotes
We note that “gender” is the term used in the OMOP CDM, but the stated definition of the underlying concept in each of the data sources refers to biological sex. This field is almost always recorded as either male or female. We observe eight occurrences that do not map to male or female in the derived STARR cohort and 1,176 occurrences in the Optum CDM cohort. We exclude these patients from model training and evaluation when sex is considered as the sensitive attribute, but include them otherwise.
References
- [1]. Rajkomar A, Hardt M, Howell MD, Corrado G, Chin MH, Ensuring Fairness in Machine Learning to Advance Health Equity, Annals of Internal Medicine (December 2018). doi:10.7326/M18-1990.
- [2]. Goodman SN, Goel S, Cullen MR, Machine Learning, Health Disparities, and Causal Reasoning, Annals of Internal Medicine 169 (12) (2018) 883. doi:10.7326/M18-3297.
- [3]. Obermeyer Z, Powers B, Vogeli C, Mullainathan S, Dissecting racial bias in an algorithm used to manage the health of populations, Science 366 (6464) (2019) 447–453. doi:10.1126/science.aax2342.
- [4]. Ferryman K, Pitcan M, Fairness in precision medicine, Data & Society (2018).
- [5]. Nordling L, A fairer way forward for AI in health care, Nature 573 (7775) (2019) S103.
- [6]. Vyas DA, Eisenstein LG, Jones DS, Hidden in Plain Sight — Reconsidering the Use of Race Correction in Clinical Algorithms, New England Journal of Medicine (2020). doi:10.1056/NEJMms2004740.
- [7]. Chen IY, Joshi S, Ghassemi M, Treating health disparities with artificial intelligence, Nature Medicine 26 (1) (2020) 16–17. doi:10.1038/s41591-019-0649-2.
- [8]. Gaskin DJ, Dinwiddie GY, Chan KS, McCleary R, Residential segregation and disparities in health care services utilization, Medical Care Research and Review 69 (2) (2012) 158–175. doi:10.1177/1077558711420263.
- [9]. Williams DR, Collins C, Racial residential segregation: a fundamental cause of racial disparities in health, Public Health Reports 116 (5) (2001) 404. doi:10.1093/PHR/116.5.404.
- [10]. Hall WJ, Chapman MV, Lee KM, Merino YM, Thomas TW, Payne BK, Eng E, Day SH, Coyne-Beasley T, Implicit Racial/Ethnic Bias Among Health Care Professionals and Its Influence on Health Care Outcomes: A Systematic Review, American Journal of Public Health 105 (12) (2015) e60–e76. doi:10.2105/AJPH.2015.302903.
- [11]. Bailey ZD, Krieger N, Agénor M, Graves J, Linos N, Bassett MT, Structural racism and health inequities in the USA: evidence and interventions, The Lancet 389 (10077) (2017) 1453–1463. doi:10.1016/S0140-6736(17)30569-X.
- [12]. Larrazabal AJ, Nieto N, Peterson V, Milone DH, Ferrante E, Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis, Proceedings of the National Academy of Sciences (May 2020). doi:10.1073/pnas.1919012117.
- [13]. Kallus N, Zhou A, Residual unfairness in fair machine learning from prejudiced data, in: 35th International Conference on Machine Learning, ICML 2018, Vol. 6, 2018, pp. 3821–3834. arXiv:1806.02887.
- [14]. Jiang H, Nachum O, Identifying and Correcting Label Bias in Machine Learning, in: International Conference on Artificial Intelligence and Statistics, 2020, pp. 702–712. arXiv:1901.04966.
- [15]. Veinot TC, Mitchell H, Ancker JS, Good intentions are not enough: how informatics interventions can worsen inequality, Journal of the American Medical Informatics Association 25 (8) (2018) 1080–1088. doi:10.1093/jamia/ocy052.
- [16]. McCradden M, Mazwi M, Joshi S, Anderson JA, When Your Only Tool Is A Hammer, in: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, ACM, New York, NY, USA, 2020, pp. 109–109. doi:10.1145/3375627.3375824.
- [16].McCradden M, Mazwi M, Joshi S, Anderson JA, When Your Only Tool Is A Hammer, in: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, ACM, New York, NY, USA, 2020, pp. 109–109. doi:10.1145/3375627.3375824. URL http://dl.acm.org/doi/10.1145/3375627.3375824 [Google Scholar]
- [17].McCradden MD, Joshi S, Anderson JA, Mazwi M, Goldenberg A, Zlotnik Shaul R, Patient safety and quality improvement: Ethical principles for a regulatory approach to bias in healthcare machine learning, Journal of the American Medical Informatics Association (2020). doi: 10.1093/jamia/ocaa085. URL 10.1093/jamia/ocaa085 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [18].Char DS, Shah NH, Magnus D, Implementing Machine Learning in Health Care — Addressing Ethical Challenges, New England Journal of Medicine 378 (11) (2018) 981–983. doi: 10.1056/NEJMp1714229. URL http://www.nejm.org/doi/10.1056/NEJMp1714229 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [19].Parikh RB, Teeple S, Navathe AS, Addressing Bias in Artificial Intelligence in Health Care, JAMA 170 (1) (2019) 51–58. doi: 10.1001/jama.2019.18058. URL https://jamanetwork.com/journals/jama/fullarticle/2756196 [DOI] [PubMed] [Google Scholar]
- [20].McCradden MD, Joshi S, Mazwi M, Anderson JA, Ethical limitations of algorithmic fairness solutions in health care machine learning, The Lancet Digital Health 2 (5) (2020) e221–e223. doi: 10.1016/S2589-7500(20)30065-0. [DOI] [PubMed] [Google Scholar]
- [21].Dwork C, Hardt M, Pitassi T, Reingold O, Zemel R, Fairness Through Awareness, Proceedings of the 3rd innovations in theoretical computer science conference (2011) 214–226arXiv:1104.3913. [Google Scholar]
- [22].Hardt M, Price E, Srebro N, Equality of Opportunity in Supervised Learning, Advances in Neural Information Processing Systems (2016) 3315–3323arXiv:1610.02413, doi: 10.1109/ICCV.2015.169. [DOI] [Google Scholar]
- [23].Chouldechova A, Roth A, The Frontiers of Fairness in Machine Learning, arXiv preprint arXiv:1810.08810 (October 2018). arXiv:1810.08810, doi: 10.17226/25021. [DOI] [Google Scholar]
- [24].Green B, The false promise of risk assessments: epistemic reform and the limits of fairness, in: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2020, pp. 594–606. [Google Scholar]
- [25].Hutchinson B, Mitchell M, 50 years of test (un) fairness: Lessons for machine learning, in: Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019, pp. 49–58. arXiv:1811.10104v2, doi:10.1145/3287560.3287600. URL 10.1145/3287560.3287600 [DOI] [Google Scholar]
- [26].Bellamy RKE, Dey K, Hind M, Hoffman SC, Houde S, Kannan K, Lohia P, Martino J, Mehta S, Mojsilovic A, Nagar S, Ramamurthy KN, Richards J, Saha D, Sattigeri P, Singh M, Varshney KR, Zhang Y, {AI Fairness} 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias (October 2018). [Google Scholar]
- [27].Dudik M, Edgar R, Horn B, Lutz R, Fairlearn (2020). URL https://github.com/fairlearn/fairlearn
- [28].Google AI Blog: Fairness Indicators: Scalable Infrastructure for Fair ML Systems. URL https://ai.googleblog.com/2019/12/fairness-indicators-scalable.html
- [29].Pfohl SR, Duan T, Ding DY, Shah NH, Counterfactual Reasoning for Fair Clinical Risk Prediction, in: Doshi-Velez F, Fackler J, Jung K, Kale D, Ranganath R, Wallace B, Wiens J (Eds.), Proceedings of the 4th Machine Learning for Healthcare Conference, Vol. 106 of Proceedings of Machine Learning Research, PMLR, Ann Arbor, Michigan, 2019, pp. 325–358. arXiv:1907.06260. URL http://proceedings.mlr.press/v106/pfohl19a.html [Google Scholar]
- [30].Pfohl S, Marafino B, Coulet A, Rodriguez F, Palaniappan L, Shah NH, Creating Fair Models of Atherosclerotic Cardiovascular Disease Risk, in: AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society, 2019. arXiv:1809.04663. [Google Scholar]
- [31].Zink A, Rose S, Fair regression for health care spending, Biometrics (2019). arXiv:1901.10566, doi: 10.1111/biom.13206. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [32].Zhang H, Lu AX, Abdalla M, McDermott M, Ghassemi M, Hurtful words: quantifying biases in clinical contextual word embeddings, in: ACM CHIL 2020 - Proceedings of the 2020 ACM Conference on Health, Inference, and Learning, Association for Computing Machinery (ACM), New York, NY, USA, 2020, pp. 110–120. doi:10.1145/3368555.3384448. URL https://dl.acm.org/doi/10.1145/3368555.3384448 [Google Scholar]
- [33].Singh M, Ramamurthy KN, Understanding racial bias in health using the Medical Expenditure Panel Survey data, arXiv preprint arXiv:1911.01509 (November 2019). arXiv:1911.01509. [Google Scholar]
- [34].Singh H, Singh R, Mhasawade V, Chunara R, Fair Predictors under Distribution Shift, arXiv preprint arXiv:1911.00677 (November 2019). arXiv:1911.00677. [Google Scholar]
- [35].Zemel RS, Wu Y, Swersky K, Pitassi T, Dwork C, Learning Fair Representations, Proceedings of the 30th International Conference on Machine Learning 28 (2013) 325–333. [Google Scholar]
- [36].Cotter A, Jiang H, Gupta MR, Wang S, Narayan T, You S, Sridharan K, Gupta MR, You S, Sridharan K, Optimization with Non-Differentiable Constraints with Applications to Fairness, Recall, Churn, and Other Goals., Journal of Machine Learning Research 20 (172) (2019) 1–59. arXiv:1809.04198. [Google Scholar]
- [37].Cotter A, Gupta M, Jiang H, Srebro N, Sridharan K, Wang S, Woodworth B, You S, Training Well-Generalizing Classifiers for Fairness Metrics and Other Data-Dependent Constraints, in: International Conference on Machine Learning, 2019, pp. 1397–1405. arXiv:1807.00028. [Google Scholar]
- [38].Agarwal A, Beygelzimer A, Dudik M, Langford J, Wallach H, A reductions approach to fair classification, in: International Conference on Machine Learning, 2018, pp. 60–69. [Google Scholar]
- [39].Song J, Kalluri P, Grover A, Zhao S, Ermon S, Learning Controllable Fair Representations, in: The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 2164–2173. arXiv:1812.04218. [Google Scholar]
- [40].Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, Spitzer E, Raji ID, Gebru T, Model cards for model reporting, in: Proceedings of the conference on fairness, accountability, and transparency, 2019, pp. 220–229. [Google Scholar]
- [41].Sun C, Asudeh A, Jagadish HV, Howe B, Stoyanovich J, Mithralabel: Flexible dataset nutritional labels for responsible data science, in: International Conference on Information and Knowledge Management, Proceedings, Association for Computing Machinery, New York, NY, USA, 2019, pp. 2893–2896. doi:10.1145/3357384.3357853. URL https://dl.acm.org/doi/10.1145/3357384.3357853 [Google Scholar]
- [42].Madaio MA, Stark L, Wortman Vaughan J, Wallach H, Co-Designing Checklists to Understand Organizational Challenges and Opportunities around Fairness in AI, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, ACM, New York, NY, USA, 2020, pp. 1–14. doi:10.1145/3313831.3376445. URL https://dl.acm.org/doi/10.1145/3313831.3376445 [Google Scholar]
- [43].Chen I, Johansson FD, Sontag D, Why Is My Classifier Discriminatory?, Proceedings of the 32nd International Conference on Neural Information Processing Systems (May 2018). arXiv:1805.12002. [Google Scholar]
- [44].Corbett-Davies S, Goel S, The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning, arXiv preprint arXiv:1808.00023 (2018). arXiv:1808.00023, doi: 10.1063/1.3627170. [DOI] [Google Scholar]
- [45].Fazelpour S, Lipton ZC, Algorithmic Fairness from a Non-ideal Perspective, in: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 2020, pp. 57–63. arXiv:2001.09773, doi:10.1145/3375627.3375828. [Google Scholar]
- [46].Herington J, Measuring fairness in an unfair world, in: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, AIES ‘20, Association for Computing Machinery, New York, NY, USA, 2020, p. 286–292. doi:10.1145/3375627.3375854. URL 10.1145/3375627.3375854 [DOI] [Google Scholar]
- [47].Liu LT, Dean S, Rolf E, Simchowitz M, Hardt M, Delayed Impact of Fair Machine Learning, in: Proceedings of the 35th International Conference on Machine Learning, 2018. arXiv:1803.04383. [Google Scholar]
- [48].Hanna A, Denton E, Smart A, Smith-Loud J, Towards a critical race methodology in algorithmic fairness, in: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2020, pp. 501–512. arXiv:1912.03593, doi:10.1145/3351095.3372826. [Google Scholar]
- [49].Jacobs AZ, Wallach H, Measurement and Fairness, arXiv preprint arXiv:1912.05511 (December 2019). arXiv:1912.05511. [Google Scholar]
- [50].Hicken MT, Kravitz-Wirtz N, Durkee M, Jackson JS, Racial inequalities in health: Framing future research, Social Science and Medicine 199 (2018) 11–18. doi: 10.1016/j.socscimed.2017.12.027. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [51].Vitale C, Fini M, Spoletini I, Lainscak M, Seferovic P, Rosano GM, Under-representation of elderly and women in clinical trials, International journal of cardiology 232 (2017) 216–221. [DOI] [PubMed] [Google Scholar]
- [52].Hussain-Gambles M, Atkin K, Leese B, Why ethnic minority groups are under-represented in clinical trials: a review of the literature, Health & social care in the community 12 (5) (2004) 382–388. [DOI] [PubMed] [Google Scholar]
- [53].Dickman SL, Himmelstein DU, Woolhandler S, Inequality and the health-care system in the usa, The Lancet 389 (10077) (2017) 1431–1441. [DOI] [PubMed] [Google Scholar]
- [54].Shah NH, Milstein A, Bagley SC, Making Machine Learning Models Clinically Useful, JAMA - Journal of the American Medical Association 322 (14) (2019) 1351–1352. doi: 10.1001/jama.2019.10306. [DOI] [PubMed] [Google Scholar]
- [55].Jung K, Kashyap S, Avati A, Harman S, Shaw H, Li R, Smith M, Shum K, Javitz J, Vetteth Y, Seto T, Bagley SC, Shah NH, A framework for making predictive models useful in practice, medRxiv (2020) 2020.07.10.20149419doi: 10.1101/2020.07.10.20149419. URL 10.1101/2020.07.10.20149419 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [56].Creager E, Madras D, Pitassi T, Zemel R, Causal Modeling for Fairness in Dynamical Systems, arXiv preprint arXiv:1909.09141 (September 2019). arXiv:1909.09141. [Google Scholar]
- [57].Kleinberg J, Mullainathan S, Raghavan M, Inherent Trade-Offs in the Fair Determination of Risk Scores, arXiv preprint arXiv:1609.05807 (September 2016). arXiv:1609.05807, doi: 10.1111/j.1740-9713.2017.01012.x. [DOI] [Google Scholar]
- [58].Chouldechova A, Fair prediction with disparate impact: A study of bias in recidivism prediction instruments, ArXiv e-prints (February 2017). arXiv:1703.00056, doi: 10.1089/big.2016.0047. [DOI] [PubMed] [Google Scholar]
- [59].Binns R, On the apparent conflict between individual and group fairness, Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (2020) 514–524arXiv:1912.06883, doi:10.1145/3351095.3372864. [Google Scholar]
- [60].Friedler SA, Scheidegger C, Venkatasubramanian S, On the (im)possibility of fairness *, arXiv preprint arXiv:1609.07236 (2016). arXiv:1609.07236v1. [Google Scholar]
- [61].Kearns M, Neel S, Roth A, Wu ZS, Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness, International Conference on Machine Learning (2018) 2564–2572arXiv:1711.05144. [Google Scholar]
- [62].Khani F, Liang P, Noise Induces Loss Discrepancy Across Groups for Linear Regression, arXiv preprint arXiv:1911.09876 (November 2019). arXiv:1911.09876. [Google Scholar]
- [63].Friedler SA, Choudhary S, Scheidegger C, Hamilton EP, Venkatasubramanian S, Roth D, A comparative study of fairness-enhancing interventions in machine learning, FAT* 2019 - Proceedings of the 2019 Conference on Fairness, Accountability, and Transparency (2019) 329–338arXiv:1802.04422, doi:10.1145/3287560.3287589. URL 10.1145/3287560.3287589 [DOI] [Google Scholar]
- [64].Lipton ZC, McAuley J, Chouldechova A, McAuley J, Does mitigating ML’s impact disparity require treatment disparity?, Advances in Neural Information Processing Systems 2018-Decem (2018) 8125–8135. arXiv:1711.07076. URL http://papers.nips.cc/paper/8035-does-mitigating-mls-impact-disparity-require-treatment-disparity.pdf [Google Scholar]
- [65].Calders T, Kamiran F, Pechenizkiy M, Building classifiers with independency constraints, in: 2009 IEEE International Conference on Data Mining Workshops, IEEE, 2009, pp. 13–18. doi:10.1109/ICDMW.2009.83. URL https://www.win.tue.nl/{~}mpechen/publications/pubs/CaldersICDM09.pdf [Google Scholar]
- [66].Celis LE, Huang L, Keswani V, Vishnoi NK, Classification with Fairness Constraints: A Meta-Algorithm with Provable Guarantees, Proceedings of the Conference on Fairness, Accountability, and Transparency (2018) 319–328arXiv:1806.06055. [Google Scholar]
- [67].Sriperumbudur BK, Fukumizu K, Gretton A, Schölkopf B, Lanckriet GR, On integral probability metrics,\phi-divergences and binary classification, arXiv preprint arXiv:0901.2698 (2009). [Google Scholar]
- [68].Ramdas A, Trillos NG, Cuturi M, On wasserstein two-sample testing and related families of nonparametric tests, Entropy 19 (2) (2017). arXiv:1509.02237, doi: 10.3390/e19020047. [DOI] [Google Scholar]
- [69].Yadlowsky S, Basu S, Tian L, A calibration metric for risk scores with survival data, in: Machine Learning for Healthcare Conference, 2019, pp. 424–450. [Google Scholar]
- [70].Austin PC, Steyerberg EW, The Integrated Calibration Index (ICI) and related metrics for quantifying the calibration of logistic regression models, Statistics in Medicine 38 (21) (2019) 4051–4065. doi: 10.1002/sim.8281. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.8281 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [71].Liu LT, Simchowitz M, Hardt M, The Implicit Fairness Criterion of Unconstrained Learning, in: Chaudhuri K, Salakhutdinov R (Eds.), Proceedings of the 36th International Conference on Machine Learning, Vol. 97 of Proceedings of Machine Learning Research, PMLR, Long Beach, California, USA, 2019, pp. 4051–4060. URL http://proceedings.mlr.press/v97/liu19f.html [Google Scholar]
- [72].Pleiss G, Raghavan M, Wu F, Kleinberg J, Weinberger KQ, On fairness and calibration, in: Advances in Neural Information Processing Systems, 2017, pp. 5680–5689. arXiv:1709.02012. [Google Scholar]
- [73].Kallus N, Zhou A, The fairness of risk scores beyond classification: Bipartite ranking and the xauc metric, in: Advances in Neural Information Processing Systems, 2019, pp. 3438–3448. arXiv:1902.05826. [Google Scholar]
- [74].Beutel A, Chen J, Doshi T, Qian H, Wei L, Wu Y, Heldt L, Zhao Z, Hong L, Chi EH, Goodrow C, Fairness in recommendation ranking through pairwise comparisons, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2019) 2212–2220arXiv:1903.00780, doi:10.1145/3292500.3330745. [Google Scholar]
- [75].Louizos C, Swersky K, Li Y, Welling M, Zemel R, The Variational Fair Autoencoder, arXiv preprint arXiv:1511.00830 (2015). arXiv:1511.00830. [Google Scholar]
- [76].Madras D, Creager E, Pitassi T, Zemel R, Learning Adversarially Fair and Transferable Representations, Proceedings of the 35th International Conference on Machine Learning 80 (2018) 3384–3393. arXiv:1802.06309. URL http://proceedings.mlr.press/v80/madras18a.htmlhttp://arxiv.org/abs/1802.06309 [Google Scholar]
- [77].Ilvento C, Metric learning for individual fairness, in: 1st Symposium on Foundations of Responsible Computing (FORC 2020), Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020. [Google Scholar]
- [78].Zafar MB, Valera I, Gomez Rodriguez M, Gummadi KP, Rogriguez MG, Gummadi KP, Fairness Constraints: Mechanisms for Fair Classification, in: Singh A, Zhu J (Eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Vol. 54 of Proceedings of Machine Learning Research, PMLR, Fort Lauderdale, FL, USA, 2017, pp. 962–970. [Google Scholar]
- [79].Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, Smola A, A kernel two-sample test, Journal of Machine Learning Research 13 (March) (2012) 723–773. [Google Scholar]
- [80].Datta S, Posada J, Olson G, Li W, Balraj D, Mesterhazy J, Pallas J, Desai P, Shah NH, A new paradigm for accelerating clinical data science at Stanford Medicine, arXiv preprint arXiv:2003.10534 (March 2020). arXiv:2003.10534. [Google Scholar]
- [81].Hripcsak G, Duke JD, Shah NH, Reich CG, Huser V, Schuemie MJ, Suchard MA, Park RW, Wong ICK, Rijnbeek PR, Van Der Lei J, Pratt N, Norén GN, Li Y-CC, Stang PE, Madigan D, Ryan PB, Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers, in: Studies in Health Technology and Informatics, Vol. 216, NIH Public Access, 2015, pp. 574–578. doi: 10.3233/978-1-61499-564-7-574. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [82].Marc Overhage J, Ryan PB, Reich CG, Hartzema AG, Stang PE, Validation of a common data model for active safety surveillance research, Journal of the American Medical Informatics Association 19 (1) (2012) 54–60. doi: 10.1136/amiajnl-2011-000376. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [83].Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek PR, Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data, Journal of the American Medical Informatics Association 25 (8) (2018) 969–975. doi: 10.1093/jamia/ocy032. URL https://academic.oup.com/jamia/advance-article/doi/10.1093/jamia/ocy032/4989437 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [84].Johnson AE, Pollard TJ, Shen L, Lehman L.-w. H., Feng M, Ghassemi M, Moody B, Szolovits P, Anthony Celi L, Mark RG, MIMIC-III, a freely accessible critical care database, Scientific Data 3 (2016) 160035. doi: 10.1038/sdata.2016.35. URL http://www.nature.com/articles/sdata201635 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [85].Wang S, McDermott MBA, Chauhan G, Ghassemi M, Hughes MC, Naumann T, Ghassemi M, Mimic-extract: A data extraction, preprocessing, and representation pipeline for mimic-iii, in: Proceedings of the ACM Conference on Health, Inference, and Learning, 2020, pp. 222–235. arXiv:1907.08322. [Google Scholar]
- [86].Kingma D, Ba J, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014). [Google Scholar]
- [87].Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S, Pytorch: An imperative style, high-performance deep learning library, in: Wallach H, Larochelle H, Beygelzimer A, d’ Alch’e-Buc F, Fox E, Garnett R (Eds.), Advances in Neural Information Processing Systems 32, Curran Associates, Inc, 2019, pp. 8024–8035. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf [Google Scholar]
- [88].Fletcher R, Practical methods of optimization, John Wiley & Sons, 2013. [Google Scholar]
- [89].Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E, Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830. arXiv:1201.0490, doi: 10.1007/s13398-014-0173-7.2. [DOI] [Google Scholar]
- [90].Corbett-Davies S, Pierson E, Feller A, Goel S, Huq A, Algorithmic Decision Making and the Cost of Fairness, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ‘17, ACM, New York, NY, USA, 2017, pp. 797–806. arXiv:1701.08230, doi:10.1145/3097983.3098095. [Google Scholar]
- [91].Sewell AA, The Racism-Race Reification Process, Sociology of Race and Ethnicity 2 (4) (2016) 402–432. doi: 10.1177/2332649215626936. URL http://journals.sagepub.com/doi/10.1177/2332649215626936 [DOI] [Google Scholar]
- [92].VanderWeele TJ, Robinson WR, On the Causal Interpretation of Race in Regressions Adjusting for Confounding and Mediating Variables, Epidemiology 25 (4) (2014) 473–484. doi: 10.1097/EDE.0000000000000105. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [93].Duster T, Race and Reification in Science, Science 307 (5712) (2005) 1050–1051. doi: 10.1126/science.1110303. URL https://science.sciencemag.org/content/307/5712/1050 [DOI] [PubMed] [Google Scholar]
- [94].Braun L, Fausto-Sterling A, Fullwiley D, Hammonds EM, Nelson A, Quivers W, Reverby SM, Shields AE, Racial Categories in Medical Practice: How Useful Are They?, PLoS Medicine 4 (9) (2007) e271. doi: 10.1371/journal.pmed.0040271. URL https://dx.plos.org/10.1371/journal.pmed.0040271 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [95].Crenshaw K, Demarginalizing the intersection of race and sex: A black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics, u. Chi. Legal f. (1989) 139. [Google Scholar]
- [96].Hoffmann AL, Where fairness fails: data, algorithms, and the limits of antidiscrimination discourse, Information Communication and Society 22 (7) (2019) 900–915. doi: 10.1080/1369118X.2019.1573912. [DOI] [Google Scholar]
- [97].Hébert-Johnson Ú, Kim MP, Reingold O, Rothblum GN, Calibration for the (Computationally-Identifiable) Masses, in: Dy J, Krause A (Eds.), Proceedings of the 35th International Conference on Machine Learning, Vol. 80 of Proceedings of Machine Learning Research, PMLR, Stockholmsmässan, Stockholm Sweden, 2017, pp. 1939–1948. arXiv:1711.08513. [Google Scholar]
- [98].Cirillo D, Catuara-Solarz S, Morey C, Guney E, Subirats L, Mellino S, Gigante A, Valencia A, Rementeria MJ, Chadha AS, Mavridis N, Sex and gender differences and biases in artificial intelligence for biomedicine and healthcare, npj Digital Medicine 3 (1) (2020) 1–11. doi: 10.1038/s41746-020-0288-5. URL https://www.nature.com/articles/s41746-020-0288-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [99].Kusner MJ, Loftus J, Russell C, Silva R, Counterfactual Fairness, in: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (Eds.), Advances in Neural Information Processing Systems 30, Curran Associates, Inc, 2017, pp. 4066–4076. [Google Scholar]
- [100].Ford CL, Airhihenbuwa CO, The public health critical race methodology: Praxis for antiracism research, Social Science and Medicine 71 (8) (2010) 1390–1398. doi: 10.1016/j.socscimed.2010.07.030. [DOI] [PubMed] [Google Scholar]
- [101].Kilbertus N, Carulla MR, Parascandolo G, Hardt M, Janzing D, Schölkopf B, Rojas-Carulla M, Parascandolo G, Hardt M, Janzing D, Schölkopf B, Avoiding discrimination through causal reasoning, in: Advances in Neural Information Processing Systems, 2017, pp. 656–666. arXiv:1706.02744. [Google Scholar]
- [102].Benjamin R, Assessing risk, automating racism, Science 366 (6464) (2019) 421–422. doi: 10.1126/science.aaz3873. URL https://science.sciencemag.org/content/366/6464/421 [DOI] [PubMed] [Google Scholar]
- [103].Kalluri P, Don’t ask if artificial intelligence is good or fair, ask how it shifts power, Nature 583 (7815) (2020) 169–169. doi: 10.1038/d41586-020-02003-2. URL http://www.nature.com/articles/d41586-020-02003-2 [DOI] [PubMed] [Google Scholar]
- [104].Sendak M, Elish MC, Gao M, Futoma J, Ratli W, Nichols M, Bedoya A, Brien CO, Ratliff W, Nichols M, Bedoya A, Balu S, O’Brien C, Ratli W, Nichols M, Bedoya A, Brien CO, “The Human Body is a Black Box”: Supporting Clinical Decision-Making with Deep Learning, in: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2020, pp. 99–109. arXiv:1911.08089. [Google Scholar]
- [105].Martin D, Prabhakaran V, Kuhlberg J, Smart A, Isaac WS, Participatory Problem Formulation for Fairer Machine Learning Through Community Based System Dynamics, arXiv preprint arXiv:2005.07572 (May 2020). arXiv:2005.07572. [Google Scholar]
- [106].Martin D, Prabhakaran V, Kuhlberg J, Smart A, Isaac WS, Extending the Machine Learning Abstraction Boundary: A Complex Systems Approach to Incorporate Societal Context, arXiv preprint arXiv:2006.09663 (June 2020). arXiv:2006.09663. [Google Scholar]
- [107].Selbst AD, Boyd D, Friedler SA, Venkatasubramanian S, Vertesi J, Fairness and abstraction in sociotechnical systems, in: Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019, pp. 59–68. [Google Scholar]
- [108].Baumer EP, Silberman MS, When the implication is not to design (technology), in: Proceedings of the 2011 annual conference on Human factors in computing systems - CHI ‘11, ACM Press, New York, New York, USA, 2011, p. 2271 doi:10.1145/1978942.1979275. URL http://dl.acm.org/citation.cfm?doid=1978942.1979275 [Google Scholar]