Author manuscript; available in PMC: 2023 Jan 3.
Published in final edited form as: Proc Mach Learn Res. 2022;193:12–34.

Imputation Strategies Under Clinical Presence: Impact on Algorithmic Fairness

Vincent Jeanselme 1, Maria De-Arteaga 2,#, Zhe Zhang 3,#, Jessica Barrett 4, Brian Tom 5
PMCID: PMC7614014  EMSID: EMS158739  PMID: 36601036

Abstract

Biases have marked medical history, leading to unequal care affecting marginalised groups. The patterns of missingness in observational data often reflect these group discrepancies, but the algorithmic fairness implications of group-specific missingness are not well understood. Despite its potential impact, imputation is too often an overlooked preprocessing step. When explicitly considered, attention is placed on overall performance, ignoring how this preprocessing can reinforce group-specific inequities. Our work questions this choice by studying how imputation affects downstream algorithmic fairness. First, we provide a structured view of the relationship between clinical presence mechanisms and group-specific missingness patterns. Then, through simulations and real-world experiments, we demonstrate that the imputation choice influences marginalised group performance and that no imputation strategy consistently reduces disparities. Importantly, our results show that current practices may endanger health equity as similarly performing imputation strategies at the population level can affect marginalised groups differently. Finally, we propose recommendations for mitigating inequities that may stem from a neglected step of the machine learning pipeline.

Keywords: Clinical Presence, Fairness, Imputation

1. Introduction

Machine learning models for healthcare often rely on observational data. At the core of observational data generation is a complex interaction between patients and the healthcare system, which we refer to as clinical presence (Jeanselme et al., 2022). Each observation, from orders of laboratory tests to treatment decisions, reflects access to medical care, patients’ medical states, and also practitioners’ expertise and potential biases. Historically, healthcare access, treatment and outcomes have been marked by inequalities (Chen et al., 2021; Freeman and Payne, 2000; Jeanselme et al., 2021; Kim et al., 2016; Norris and Nissenson, 2008). For instance, Price-Haywood et al. (2020) hypothesised that the disproportionate mortality rate from Covid-19 among Black patients can, in part, be explained by longer waiting times before accessing care.

Clinical presence patterns can, therefore, reflect disparities. Specifically, observation and missingness can vary across groups. Developing machine learning models on these data raises ethical concerns about automating and reinforcing injustices.

Current practices for handling missing data often rely on imputing data with overall performance in mind (Emmanuel et al., 2021), without consideration of the algorithmic fairness consequences associated with this choice. Despite the risk of aggravating inequities reflected in group-specific missingness patterns, the effect of this imputation step remains understudied. In this work, we explore the impact of imputation on data imprinted by group-specific missingness patterns emerging from medical practice and historical biases. First, we identify scenarios of clinical presence that could result in group-specific missingness patterns, grounded on historical evidence of these phenomena in medicine. Then, we explore the downstream impact on group performance of standard imputation strategies on simulated data affected by this clinical missingness. Finally, we study group performances of different imputation strategies in real-world data.

This work provides empirical evidence that machine learning pipelines differing solely in their handling of missingness may result in distinct performance gaps between groups, even when population performances present no difference. The choice of imputation strategy may therefore impact performance in a way that reinforces inequities against historically marginalised groups. Moreover, our experiments show that no imputation strategy consistently outperforms the others and current recommendations may harm marginalised groups. Finally, we emphasise the relevance of this analysis by providing real-world evidence of clinical missingness patterns and echo the previous results in the MIMIC III dataset.

2. Related work

This work explores the link between missingness and algorithmic fairness in machine learning for healthcare. In this section, we review related literature across domains.

2.1. Clinical missingness

Clinical missingness is a medical expression of the well-studied missingness patterns (Little and Rubin, 2019): Missing Completely At Random (MCAR) — random subsets of patients and/or covariates are missing, Missing At Random (MAR) — missing data patterns are a function of observed variables, and Missing Not At Random (MNAR) — missing patterns depend on unobserved variables or the missing values themselves.

Traditional statistical models are not adapted to handle missing covariates. Consequently, practitioners may rely on single imputation strategies such as mean, median, nearest neighbours (Batista et al., 2002; Bertsimas et al., 2021) or the preferred multiple imputation methods (Newgard and Lewis, 2015; Rubin, 2004; White et al., 2011). Typically, these imputation approaches assume MCAR and/or MAR patterns. They may be ill-adapted to handle informative missingness, particularly as MNAR and MAR are non-identifiable from observational data alone and require domain expertise for adequate modelling. The recommended strategy to tackle this non-identifiability issue is to control the imputation model on additional covariates to render the MAR assumption more plausible (Haukoos and Newgard, 2007). Our work shows the potential shortcomings of this covariate-adjusted imputation strategy under group-specific missingness patterns.

2.2. Algorithmic fairness in medicine

The risk of reinforcing historical biases is of critical concern in medicine, where inequalities can have life-threatening implications. Measuring and mitigating this risk is the aim of algorithmic fairness (Chouldechova and Roth, 2020). In this paper, we follow the ‘equal performance’ group definition of algorithmic fairness (Rajkomar et al., 2018), which evaluates if the model performs comparably across groups (Chouldechova et al., 2018; Flores et al., 2016; Noriega-Campero et al., 2019).

Definition 1 (Equal Performance)

A pipeline $p$ is fairer than another pipeline $q$ with regard to group $g$ if its performance gap is smaller in absolute value, i.e. $|\Delta_g(p)| < |\Delta_g(q)|$, with $\Delta_g(p) := d\big(p(\{X_i\}_{G_i = g})\big) - d\big(p(\{X_i\}_{G_i \neq g})\big)$ for some performance metric $d$, where $(X_i, G_i)$ are the covariates and associated group for patient $i$.
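As an illustration of Definition 1, the following minimal sketch computes the gap Δg for two hypothetical pipelines, taking AUC-ROC as the metric d. The function names (`auc`, `performance_gap`) and the toy labels and scores are assumptions made for illustration; they are not taken from the experiments' code.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the probability that a positive case outranks a negative one.

    Ties between a positive and a negative score count for one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def performance_gap(labels, scores, groups, g) -> float:
    """Delta_g: metric on group g minus metric on all other patients."""
    in_g = [(y, s) for y, s, k in zip(labels, scores, groups) if k == g]
    out_g = [(y, s) for y, s, k in zip(labels, scores, groups) if k != g]
    return auc(*zip(*in_g)) - auc(*zip(*out_g))


# Made-up predictions from two hypothetical pipelines p and q.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
groups = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = group of interest g
scores_p = [0.9, 0.2, 0.4, 0.6, 0.8, 0.1, 0.7, 0.3]
scores_q = [0.9, 0.2, 0.7, 0.6, 0.8, 0.1, 0.7, 0.3]

gap_p = performance_gap(labels, scores_p, groups, g=1)
gap_q = performance_gap(labels, scores_q, groups, g=1)
q_is_fairer = abs(gap_q) < abs(gap_p)  # Definition 1 applied to (q, p)
```

A negative gap means that the pipeline discriminates less well within group g than within the rest of the population.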

This metric has been leveraged to quantify models’ impact on algorithmic fairness in medicine (Chen et al., 2018, 2019; Pfohl et al., 2019; Seyyed-Kalantari et al., 2020; Zhang et al., 2020). For instance, Seyyed-Kalantari et al. (2020) demonstrate X-ray classifiers’ performance gaps between marginalised groups. However, the link between imputation and algorithmic fairness has received limited attention despite the risk of clinical missingness disparities. Our work aims to fill this gap.

2.3. Algorithmic fairness and missingness

As a community, we need to understand how to best handle clinical missingness when imprinted by biases. Martínez-Plumed et al. (2019) and Fricke et al. (2020) show that mean imputation presents better fairness properties compared to complete case analysis. These works focus on one imputation strategy and ignore the potential variability of the impact of different strategies. Closer to our work, Zhang and Long (2021) show that the choice of imputation may lead to different fairness gaps when enforcing synthetic missingness patterns. However, these works do not discuss how the different missingness patterns may arise in medicine, and how a specific group may be impacted differently by different imputation strategies. In our work, we study different missingness patterns that may arise as a result of the data-generating process in healthcare. Finally, Ahmad et al. (2019), Ghassemi et al. (2020) and Rajkomar et al. (2018) describe multiple challenges linked to medical data, among which they state that historical biases may lead to missingness patterns that could impact fairness, but they do not empirically study this. While informative missingness has recently received revived attention (Jeanselme et al., 2022; Getzen et al., 2022), no work has studied its potential association with fairness. Our work aims to address these gaps in the literature by demonstrating the existence of this problem, characterising different types of group-specific missingness patterns in medicine, and exploring the impact of different imputation strategies under different clinical presence scenarios. In addition to showing the impact of imputation choice on fairness gaps, we highlight that the same imputation strategy may benefit a group under one missingness pattern but hurt this same group in another.
Importantly, we also show that a given group may benefit under one imputation and suffer under another imputation in the same setting, even if the two strategies perform identically at the population level. These are novel findings that invite practitioners to perform careful sensitivity analysis of imputation choice on fairness gaps.

3. Clinical missingness scenarios

This section shows how group-specific missingness can result from clinical presence. Figure 1 introduces the following scenarios:

Figure 1. Examples of group-specific clinical presence mechanisms.

Limited access to quality care (S1)

When certain groups do not have access to the same health services, this results in more missing covariates for these groups.

Socioeconomic factors resulting from structural injustices (Barik and Thorat, 2015; Nelson, 2002; Szczepura, 2005; Yearby, 2018), such as insurance, work schedule flexibility, distance to hospitals (Barik and Thorat, 2015) or mobility, result in inconsistent medical history (Gianfrancesco et al., 2018), additional waiting time before looking for care (Weissman et al., 1991), avoidance of preventive care (Smith et al., 2018), and limited access to advanced diagnostic tools (Lin et al., 2019). This diminished access to care is potentially reflected as missing data. For instance, patients may have no annual checkup data if their insurance does not cover or encourage this service.

(Mis)-informed collection (S2)

Often, medical research has focused on a subset of the population. The resulting guidelines may be ill-adapted to other groups and relevant covariates may be missing due to standard recommendations.

Historically, researchers focused on (perceived) highest-risk groups: breast cancer predominantly studied in women (Arnould et al., 2006; Giordano, 2018), cardiovascular disease in men (Vogel et al., 2021), skin cancers in lighter skin tones (Gloster Jr and Neal, 2006), and autism in men (Gould and Ashton-Smith, 2011). Resultant medical practices and guidelines target these groups. However, substantial evidence shows the prevalence of these diseases among other groups. Stemming from biological differences, different groups may present different symptoms and expressions for the same condition. The difference in disease expression and the absence of adapted tests result in missing covariates necessary to identify the disease. For instance, screening may only be recommended when “standard” symptoms are observed. If the symptoms considered are not the expected disease expression for a marginalised subgroup, this will result in more missing screening procedures for this group.

Confirmation bias (S3)

Practitioners collect data based on expertise and informative proxies that are not recorded, e.g. patient feeling unwell.

For instance, practitioners may record the value of a test only if they suspect it will be abnormal. The literature presents evidence of this phenomenon where the presence of a specific medical test is more informative of the outcome than the test result itself (Agniel et al., 2018; Sisk et al., 2020). Wells et al. (2013) also suggest that missing laboratory tests correspond to healthy results, e.g. doctors do not collect or record data if they are irrelevant. Similarly, sicker patients present more complete data (Rusanov et al., 2014; Sharafoddini et al., 2019; Weiskopf et al., 2013).

Formalisation

Consider two covariates (X1, X2) influenced by the underlying condition Y and the group membership G. Note that the disease prevalence may also depend on G. One covariate X1 is observed for all patients, while X2 is potentially missing. Following the notations from Mohan and Pearl (2021), let O2 be the indicator of observation of X2 such that the observed value is defined as:

$$X_2^* = \begin{cases} \varnothing & \text{if } O_2 = 0 \\ X_2 & \text{otherwise} \end{cases}$$

In (S1), G informs O2 because of group socioeconomic differences. In (S2) and (S3), G impacts the observation process through group-specific disease expression. While the influence of medical covariates on the missingness patterns characterises both (S2) and (S3), (S2) describes how guidelines may depend on observed covariates, whereas (S3) reflects how the observation process may depend on X2 itself or unobservable covariates correlated with X2. For instance, (S2) may consist of a guideline recommending to measure X2 if X1 is within a given range. However, if a patient is a member of a group for which X1 is not informative—or for which the informative range is different—X2 might not be observed as X1 is not in the guideline test-triggering range. This may lead to more missing data for X2 in the group with different characteristics for X1. (S3) differs as practitioners would record the value of X2 only if this one is abnormal.
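The three observation processes can be sketched as follows. This is only an illustrative reading of the scenarios: the missingness probability, the guideline range on X1 and the abnormality threshold on X2 are hypothetical values, not the protocol of Appendix A.1, and the marginalised group is arbitrarily encoded as G = 1.

```python
import random
from typing import Optional


def observe_s1(group: int, rng: random.Random, p_missing: float = 0.5) -> int:
    """(S1) Limited access: O2 depends on G only.

    Patients of the marginalised group (G = 1) miss X2 with probability p_missing.
    """
    if group == 1 and rng.random() < p_missing:
        return 0
    return 1


def observe_s2(x1: float, low: float = 0.0, high: float = 2.0) -> int:
    """(S2) Guideline: X2 is measured only if the observed X1 is in [low, high]."""
    return 1 if low <= x1 <= high else 0


def observe_s3(x2: float, normal_max: float = 1.0) -> int:
    """(S3) Confirmation bias: X2 is recorded only if it is abnormal."""
    return 1 if x2 > normal_max else 0


def observed_value(x2: float, o2: int) -> Optional[float]:
    """X2* is missing (None) if O2 = 0, and equal to X2 otherwise."""
    return x2 if o2 == 1 else None


# Two made-up patients with the same X2 region but different X1 expression.
patients = [
    {"G": 0, "X1": 0.8, "X2": 1.4},
    {"G": 1, "X1": -0.3, "X2": 1.6},
]
for patient in patients:
    o2 = observe_s2(patient["X1"])
    patient["X2_star"] = observed_value(patient["X2"], o2)
```

Under this hypothetical (S2) guideline, the second patient's X2 is never measured because X1 falls outside the triggering range, although the value of X2 would have been informative.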

These dependencies result in three distinct patterns between missingness, group and covariates, summarised with directed acyclic graphs (DAGs) in Figure 2.

Figure 2. Directed Acyclic Graphs (DAGs) associated with the identified clinical missingness scenarios. Full circled covariates are observed, dotted ones unobserved. Y is the condition, G the group membership, X1 and X2 the two covariates. O2 is the observation process associated with X2. Red dependencies underline the differences between scenarios.

4. Experiments

In this section, we explore how the choice of imputation affects group-specific performance, and potentially reinforces disparities in data marked by clinical missingness. We first present simulation studies in which we enforce specific missingness patterns. This analysis allows us to control clinical missingness patterns and measure the potential impact of imputation on algorithmic fairness. We accompany these results with real-world evidence of group-specific missingness patterns and show the impact of different imputation strategies on marginalised group performance. For reproducibility, the code for all experiments is available on GitHub.

4.1. Datasets

Assume a population of N patients with associated covariates X, marginalised group membership G, and outcome of interest Y.

Simulation

We introduce a bidimensional (X ∈ ℝ²) synthetic population (N = 10,100) divided into two groups (G ∈ {0,1}), and assume the marginalised group is a minority in the population with ratio 1:100. These groups differ in disease expression, i.e. positive cases across groups differ in how they express the disease. Then clinical missingness patterns are enforced on the second dimension X2 following the scenarios introduced in Section 3. Figure 3 provides a graphical summary of how clinical missingness is enforced on the synthetic data. The associated predictive task is to classify between positives and negatives. (See Appendix A.1 for the full data generation protocol reflecting the enforcement of the previously introduced scenarios.)

Figure 3. Graphical summary of clinical missingness enforcement in the simulation experiments. Note that our simulations’ choices result in missingness in the marginalised group only in (S1) and (S2), but in the majority only in (S3).

MIMIC III

The real-world analysis relies on the laboratory tests from the Medical Information Mart for Intensive Care (MIMIC III) dataset (Johnson et al., 2016). Following data harmonisation (Wang et al., 2020), we select adults who survived 24 hours or more after admission to the intensive care unit, resulting in a set of 36,296 patients sharing 67 laboratory tests. The goal is to predict short-term survival (7 days after the observation period, denoted Y) using the most recent value of each laboratory test observed in the first 24 hours of observation (X). We select short-term survival as it is a standard task in the machine learning literature (Jeanselme et al., 2022; Nagpal et al., 2021; Tsiklidis et al., 2022; Xu et al., 2019) and the associated labels are less likely to suffer from group-specific misdiagnosis, which disentangles our analysis from potential biases in labelling (Chen et al., 2020). In practice, this model could be deployed for care prioritisation of patients with predicted elevated risk.

4.2. Handling missing data

The simulation and MIMIC III datasets present missing data that are traditionally imputed for analysis. We consider the following common imputation strategies:

Single median imputation (Median)

Missing data are replaced by the population median of each covariate. Due to its straightforward implementation, this methodology remains predominant in the literature despite known shortcomings (Rubin, 1976; Sinharay et al., 2001; Crawford et al., 1995).
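A minimal sketch of this strategy is given below, with missing entries encoded as `None`. Computing the medians on the training set and reusing them at test time is an assumption of this sketch, and the helper names are hypothetical.

```python
from statistics import median
from typing import List, Optional

Row = List[Optional[float]]


def fit_medians(rows: List[Row]) -> List[float]:
    """Population median of each covariate, computed on observed values only."""
    n_cols = len(rows[0])
    return [median(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]


def impute_median(rows: List[Row], medians: List[float]) -> List[List[float]]:
    """Replace every missing entry by the median of its covariate."""
    return [[medians[j] if v is None else v for j, v in enumerate(r)] for r in rows]


# Made-up data: two covariates, the second one partially missing.
train = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [4.0, 50.0]]
test = [[None, None]]

medians = fit_medians(train)
train_imputed = impute_median(train, medians)
test_imputed = impute_median(test, medians)
```

Because the median is computed over the whole population, imputed values are dominated by the majority group, whichever group the missing entry belongs to.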

Multiple Imputation using Chained Equations (MICE)

Missing data are iteratively drawn from a regression model built over all other available covariates after median initialisation. This approach is repeated I times with an associated predictive model for each imputed draw. At test time, the same imputation models generate I imputed points for which the models’ predictions are averaged. MICE is recommended in the literature (Janssen et al., 2010; Newgard and Haukoos, 2007; Wood et al., 2004; Zhou et al., 2001; White et al., 2011) as it quantifies the uncertainty associated with missingness. In the experiments, we used 10 iterations repeated 10 times, resulting in I = 10 datasets with associated predictive models.
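The idea of repeated stochastic draws can be sketched in a deliberately simplified setting with one fully observed covariate X1 and one incomplete covariate X2. With a single incomplete covariate the chain contains a single equation, so no iteration is needed; the sketch also draws only the residual noise and keeps the regression coefficients fixed, whereas a complete implementation would propagate parameter uncertainty as well. All names and data are hypothetical.

```python
import random
from statistics import mean
from typing import List, Optional, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float, float]:
    """Least-squares fit y = a + b * x; returns (a, b, residual std)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sd = (sum(r * r for r in resid) / max(len(x) - 2, 1)) ** 0.5
    return a, b, sd


def multiple_imputation(
    x1: List[float], x2: List[Optional[float]], n_draws: int = 10, seed: int = 0
) -> List[List[float]]:
    """Return n_draws completed copies of x2, each with its own random draws."""
    rng = random.Random(seed)
    observed = [(u, v) for u, v in zip(x1, x2) if v is not None]
    a, b, sd = fit_line([u for u, _ in observed], [v for _, v in observed])
    datasets = []
    for _ in range(n_draws):
        # Observed values are kept; missing ones are drawn around the regression line.
        datasets.append(
            [v if v is not None else a + b * u + rng.gauss(0.0, sd)
             for u, v in zip(x1, x2)]
        )
    return datasets


# Made-up data with two missing X2 values.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [0.1, 1.9, None, 6.2, 7.8, None]
datasets = multiple_imputation(x1, x2, n_draws=10)
```

One predictive model would then be trained on each of the I = 10 completed datasets and their predictions averaged, as described above.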

Group MICE

The previous MICE methodology assumes a MAR mechanism. To make this assumption more plausible, Haukoos and Newgard (2007) recommend the addition of potentially informative covariates. In our experiment, we therefore rely on both group membership and covariates for imputing the missing data ($\tilde{X} \sim X, G$, with $\tilde{X}$ representing the imputed covariates).

Group MICE Missing

Encoding missingness has been shown to improve performance when the patterns of missingness are informative (Groenwold, 2020; Lipton et al., 2016; Saar-Tsechansky and Provost, 2007; Sperrin et al., 2020). As clinical missingness can contain informative patterns (Jeanselme et al., 2022; Lipton et al., 2016), we concatenate missingness indicators to the imputed data from Group MICE (Appendix A explores the concatenation of missing indicators with the other strategies).
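Concatenating missingness indicators amounts to appending, for each covariate, a binary column that flags whether the value was originally missing. A minimal sketch, assuming missing entries are encoded as `None` in the raw data and that imputation has already been performed by any of the strategies above (the helper name and data are hypothetical):

```python
from typing import List, Optional


def add_missing_indicators(
    raw_rows: List[List[Optional[float]]], imputed_rows: List[List[float]]
) -> List[List[float]]:
    """Append one 0/1 indicator per covariate to each imputed row."""
    augmented = []
    for raw, imputed in zip(raw_rows, imputed_rows):
        indicators = [1.0 if value is None else 0.0 for value in raw]
        augmented.append(list(imputed) + indicators)
    return augmented


# Made-up example: the second covariate of the first patient was missing.
raw = [[1.0, None], [2.0, 3.0]]
imputed = [[1.0, 2.5], [2.0, 3.0]]
features = add_missing_indicators(raw, imputed)
```

The downstream model can then assign a weight to the fact that a test was not observed, in addition to the imputed value itself.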

4.3. Experimental setting

After imputation, each pipeline relies on a logistic regression model, a pillar in medicine (Nick and Campbell, 2007; Goldstein et al., 2017), to discriminate between positive and negative cases ($Y \sim \tilde{X}$).

Adopting the equal performance across groups definition of algorithmic fairness (Rajkomar et al., 2018), we measure each pipeline’s discriminative performance for the different groups. We use the Area Under the Receiver Operating Characteristic Curve (AUC-ROC, i.e. d in Section 2.2) as proposed in Röösli et al. (2022), Larrazabal et al. (2020) and Zhang et al. (2022).

This metric quantifies algorithmic fairness but does not quantify how deployment can hurt subgroups at a fixed threshold on the predicted risk. In the MIMIC III study, we measure the False Negative Rate (FNR) assuming the availability of priority care for 30% of the population (sensitivity to this threshold is presented in Appendix A.2). In the 30% highest-risk population, we measure the prioritisation (the group-specific proportion of patients who would receive care under this policy) and misclassification rates in the groups of interest. In this setting, FNR corresponds to the non-prioritisation of high-risk patients. The gap in FNR between groups answers the question: to what extent would marginalised groups be incorrectly deprioritised? Additional experimental design descriptions and results are provided in Appendix A.
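The prioritisation policy and the two group-specific rates can be sketched as follows. The sketch assumes that label 1 denotes a high-risk patient and that a higher score means a higher predicted risk; the function names and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def prioritise(scores: Sequence[float], capacity: float = 0.3) -> List[bool]:
    """Flag the `capacity` fraction of patients with the highest predicted risk."""
    k = int(round(capacity * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = [False] * len(scores)
    for i in order[:k]:
        flagged[i] = True
    return flagged


def group_rates(labels, flagged, groups, g) -> Tuple[float, float]:
    """Prioritisation rate and False Negative Rate within group g."""
    idx = [i for i, k in enumerate(groups) if k == g]
    prioritisation = sum(flagged[i] for i in idx) / len(idx)
    positives = [i for i in idx if labels[i] == 1]
    fnr = sum(not flagged[i] for i in positives) / len(positives)
    return prioritisation, fnr


# Made-up cohort of 10 patients; group 1 is the group of interest.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
groups = [0, 0, 0, 1, 0, 1, 1, 0, 1, 0]

flagged = prioritise(scores, capacity=0.3)
prio_1, fnr_1 = group_rates(labels, flagged, groups, g=1)
prio_0, fnr_0 = group_rates(labels, flagged, groups, g=0)
prioritisation_gap = prio_1 - prio_0  # > 0: group 1 is prioritised more often
fnr_gap = fnr_1 - fnr_0               # > 0: group 1 is wrongly deprioritised more often
```

In this toy cohort, no patient of group 1 is among the three prioritised patients, so both of its high-risk patients are false negatives.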

5. Results

This section presents the insights obtained through both simulations and real-world experiments.

5.1. Simulations

We conduct 100 simulations in which the three clinical presence scenarios are independently enforced. We apply the imputation strategies described in Section 4.2 and train a logistic regression with ℓ2 penalty (λ = 1). Results are computed on a 20% test set and averaged over the 100 simulations. Figure 4 presents the AUC gap (Δ defined in Section 2.2) between the majority and the minority, and group-specific AUCs.

Figure 4. AUC performance gaps Δ and group-specific AUCs across scenarios on 100 synthetic experiments. If Δ < 0, the marginalised group has worse AUC than the majority.

Insight 1: Equally-performing imputation strategies at the population level can result in different marginalised group performances

Consider (S1): all imputation methodologies result in similar population AUCs, as shown by the grey dots. However, note how the AUC evaluated on the marginalised group presents a gap of 0.1 between MICE and Group MICE. This phenomenon is explained by how imputation strategies result in different imputed covariate distributions. The logistic regressions built on these imputed data weigh covariates differently and therefore produce different predicted values.

Insight 2: No strategy consistently outperforms the others across clinical presence scenarios

Population-level performances remain stable between Group MICE and MICE over all scenarios, but these strategies have contrasting marginalised group AUCs. Importantly, Group MICE should be preferred in (S1) as it minimises the performance gap. For the same reason, MICE should be used in (S2), whereas both methodologies present inconclusive fairness differences in (S3). While this result is specific to this simulation, this exemplifies how no methodology consistently reduces the performance gap across groups.

Insight 3: Current recommendations of leveraging additional covariates to satisfy the MAR assumption, or of using missingness indicators, can harm marginalised groups’ performance

Note how Group MICE presents worse performance than MICE in (S2). The recommendation of including additional covariates to make the MAR assumption more plausible is not always suitable as it may add noise and lead to poorer performance. In another example, see how the model considering missingness provides an edge in (S3) compared to Group MICE but hurts performance in (S1). This observation reinforces the necessity of measuring the performance sensitivity to imputation. Additionally, it underlines how understanding the missingness process is essential to control for relevant covariates.

5.2. MIMIC III

In this real-world experiment, we consider groups defined by the following attributes: ethnicity (Black vs non-Black), sex (female vs male), and insurance (publicly vs privately insured). Table 1 shows the number of orders and the number of distinct laboratory tests (out of the 67 possible tests) performed during the first day post-admission for each subgroup. This last number reflects the missingness of the vector used for prediction.

Table 1. Mean (std) number of orders and distinct observed tests performed during the first day post-admission, stratified by marginalised group and outcome.

Group      Orders           Distinct tests
Alive+     5.68 (4.64) *    40.80 (6.73) *
Dead+      7.57 (5.44)      37.22 (7.50)
Black      5.24 (4.08) *    40.94 (6.94) *
Other      5.86 (4.77)      40.52 (6.84)
Female     5.54 (4.45) *    40.75 (6.89) *
Male       6.03 (4.91)      40.41 (6.80)
Public     5.67 (4.57) *    40.46 (6.76) *
Private    6.11 (5.01)      40.75 (7.01)

+ By the 8th day after admission.
* Significant t-test p-value (< 0.001).

For this experiment, patients are split into three sets: 80% for training, 10% for hyperparameter tuning and 10% for testing. We perform an ℓ2 penalty search for the logistic regression among λ ∈ {0.1, 1, 10, 100}. Table 2 presents predictive performances at the population level averaged on the bootstrapped test set over 100 iterations. Assuming capacity for additional care for the 30% highest risk, we explore care prioritisation. Figure 5 displays our main results: the gaps in prioritisation and the false negative rates stratified by groups of interest under the different imputation strategies.
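The bootstrap evaluation on the test set can be sketched as follows: the test set is resampled with replacement, the metric is recomputed on each resample, and the mean and standard deviation over resamples are reported. Redrawing resamples that contain a single class is a choice made for this sketch only, and the names and toy data are hypothetical.

```python
import random
from statistics import mean, stdev
from typing import Callable, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC via pairwise comparisons (ties count for one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_metric(
    labels: Sequence[int],
    scores: Sequence[float],
    metric: Callable[[Sequence[int], Sequence[float]], float],
    n_boot: int = 100,
    seed: int = 0,
) -> Tuple[float, float]:
    """Mean and std of `metric` over n_boot resamples of the test set."""
    if len(set(labels)) < 2:
        raise ValueError("both classes must be present in the test set")
    rng = random.Random(seed)
    n = len(labels)
    values = []
    while len(values) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # AUC is undefined on a single class: redraw
            continue
        values.append(metric(ys, [scores[i] for i in idx]))
    return mean(values), stdev(values)


# Made-up test-set predictions.
labels = [1, 0, 1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8, 0.2, 0.4, 0.5]
mean_auc, std_auc = bootstrap_metric(labels, scores, auc, n_boot=100)
```

The same routine can be applied to group-specific subsets of the test set to attach an uncertainty estimate to each performance gap.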

Table 2. Predictive performance under different imputation strategies. Mean (std) computed on the test set bootstrapped 100 times.

Strategy              AUC-ROC
Group MICE Missing    0.786 (0.009)
Group MICE            0.738 (0.012)
MICE                  0.742 (0.012)
Median                0.748 (0.011)

Figure 5. Prioritisation performance gaps Δ across marginalised groups in MIMIC III experiment. If Δ > 0, the marginalised group has a larger value of the given metric than the rest of the population.

Insight 4: Real-world data presents group-specific clinical presence patterns

While the causes of clinical missingness cannot be distinguished from observational data alone, one can observe evidence of non-random missingness patterns in the MIMIC III dataset, as shown in Table 1. Specifically, note the larger number of orders for patients who die during their stay compared with the ones who survive. This pattern is consistent with a possible confirmation bias scenario (S3), if doctors are monitoring sicker patients more closely. Another example of non-random missingness is that there are fewer test orders for female, Black, and publicly insured patients, but little difference in the diversity of tests prescribed. While this may be explained by the underlying conditions or other medically relevant factors, the combination of similar diversity of tests but less frequent observations results in a less up-to-date picture of the patient’s health status for modelling. Thus, even though the cause of testing differences is unclear, these observations show the connection between testing patterns, group membership, and outcomes. This real-world evidence of non-random missingness patterns among subgroups of patients raises concerns about increasing inequities if the fairness implications of imputation methods are not considered.

Insight 5: Marginalised groups can benefit or be harmed by equally performing imputation strategies at the population level

Note how MICE and Group MICE perform similarly at the population level in Table 2, but present different performances for marginalised groups (see Figure 5). Consider the ethnicity split: these methodologies have opposite consequences on Black patients. MICE would result in more care for Black patients and a smaller gap in FNR. By contrast, Group MICE would halve prioritisation and double the FNR gap in favour of non-Black patients. Crucially, this difference solely results from the imputation strategy adopted in these two pipelines.

Insight 6: Different marginalised groups may be impacted oppositely by the same imputation strategy

Female and publicly insured patients have higher prioritisation rates under all imputation methods. However, these groups show opposite gaps in their FNR compared to their counterparts (men and privately insured patients): women have more false negatives, while publicly insured patients have fewer.

In another case of opposite impacts of imputation, Group MICE presents the smallest FNR performance gap for sex, but the largest gaps for both ethnicity and insurance. Group MICE also results in better FNR performance for publicly insured but worse for Black patients. This observation underlines the importance of identifying marginalised groups in development and deployment populations. The optimal trade-off between group and population performances, and between marginalised groups, needs to be considered as different pipelines could have opposite impacts.

6. Discussion

This paper is motivated by how interactions between patients and the healthcare system can result in group-specific missingness patterns. We show that resultant inequities in clinical missingness can impact downstream algorithmic fairness under different imputation strategies. This analysis demonstrates that no imputation strategy consistently provides better performances for marginalised groups. In particular, a model providing an edge in one setting can underperform in another, or even harm a different group. Moreover, the experiments conducted using the MIMIC III dataset demonstrate the relevance of the identified problem as more than a merely theoretical concern, showing that it is present in a widely used electronic health record dataset.

Note that our work does not claim that the specific patterns we observe will necessarily be present in other datasets. As we have emphasised, different combinations of missingness processes may lead to different fairness gaps and interactions between imputation and group performance. They may even lead to equal fairness performance of all imputation strategies, but one cannot know this a priori.

Learning from medical data without sufficient attention to the potential entanglement of clinical missingness and historical biases could reinforce and automatise inequities, and further harm historically marginalised groups. This work calls for caution in the use of imputation to reach health equity. We invite practitioners to:

  • Record protected attributes and identify marginalised groups.

  • Explore the practitioner-patient interaction process to identify clinical missingness disparities.

  • Report the assumptions made at each stage of the pipeline.

  • Perform sensitivity analysis on imputation to understand its impact on algorithmic fairness.

Future work will theoretically define in which settings the presented results stand and how model choice could mitigate discrepancies in the missingness patterns. Moreover, clinical missingness is only one dimension of how clinical presence shapes the data-generating process. The temporality and irregularity of medical time series may convey group-specific disparities that machine learning methods may amplify.

Supplementary Material

Appendix A

Acknowledgments

The authors would like to thank Changjian Shui (McGill University) for constructive feedback on the manuscript. This work has been partially funded by UKRI Medical Research Council (MC_UU_00002/5 and MC_UU_00002/2) and the NIH through grant R01NS124642.

Footnotes

Contributor Information

Vincent Jeanselme, Email: vincent.jeanselme@mrc-bsu.cam.ac.uk, MRC Biostatistics Unit, University of Cambridge, Cambridge, UK.

Maria De-Arteaga, Email: dearteaga@mccombs.utexas.edu, McCombs School of Business, University of Texas at Austin, Austin, USA.

Zhe Zhang, Email: zhe@rady.ucsd.edu, Rady School of Management, University of California, San Diego, USA.

Jessica Barrett, Email: jessica.barrett@mrc-bsu.cam.ac.uk, MRC Biostatistics Unit, University of Cambridge, Cambridge, UK.

Brian Tom, Email: brian.tom@mrc-bsu.cam.ac.uk, MRC Biostatistics Unit, University of Cambridge, Cambridge, UK.

