Abstract
Fair calibration is a widely desirable fairness criterion in risk prediction contexts. One way to measure and achieve fair calibration is with multicalibration. Multicalibration constrains calibration error among flexibly-defined subpopulations while maintaining overall calibration. However, multicalibrated models can exhibit a higher percent calibration error among groups with lower base rates than among groups with higher base rates. As a result, a decision maker may learn to trust or distrust model predictions for specific groups. To alleviate this, we propose proportional multicalibration, a criterion that constrains the percent calibration error among groups and within prediction bins. We prove that satisfying proportional multicalibration bounds a model's multicalibration as well as its differential calibration, a fairness criterion that directly measures how closely a model approximates sufficiency. Therefore, proportionally calibrated models limit the ability of decision makers to distinguish between model performance on different patient groups, which may make the models more trustworthy in practice. We provide an efficient algorithm for post-processing risk prediction models for proportional multicalibration and evaluate it empirically. We conduct simulation studies and investigate a real-world application of PMC post-processing to prediction of emergency department patient admissions. We find that proportional multicalibration is a promising criterion for simultaneously controlling multiple measures of calibration fairness over intersectional groups, with virtually no cost in classification performance.
1. Introduction
Today, machine learning (ML) models have an impact on outcome disparities across sectors due to their widespread use in decision-making. When applied in clinical decision support (CDS), ML models help care providers decide whom to prioritize to receive finite and time-sensitive resources among a population of potentially very ill patients. These resources include hospital beds (Barak-Corren et al., 2021a; Dinh and Berendsen Russell, 2021), organ transplants (Schnellinger et al., 2021), specialty treatment programs (Henry et al., 2015; Obermeyer et al., 2019), and, recently, ventilator and other breathing support tools to manage the COVID-19 pandemic (Riviello et al., 2022).
In scenarios like these, decision makers typically rely on risk prediction models to be calibrated. Calibration measures the extent to which a model's risk scores, R, match the observed probability of the event, y (Brier et al., 1950). Perfect calibration implies that E[y | R = r] = r for all values of r. Calibration allows the risk scores to be used to rank patients in order of priority and informs care providers about the urgency of treatment. However, models that are not equally calibrated among subgroups defined by different sensitive attributes (race, ethnicity, gender, income, etc.) may lead to systematic denial of resources to marginalized groups (e.g. (Obermeyer et al., 2019; Ashana et al., 2021; Roberts, 2011; Zelnick et al., 2021; Ku et al., 2021)). For example, Obermeyer et al. (2019) analyzed a large health system algorithm used to enroll high-risk patients into care management programs and showed that, at a given risk score, Black patients exhibited significantly poorer health than white patients.
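Calibration in this sense can be estimated empirically by binning predictions and comparing each bin's mean prediction with its observed outcome rate. The sketch below is our own illustration (the function name and the fixed-width binning convention are assumptions, not the paper's code):

```python
import numpy as np

def calibration_curve(y, r, lam=0.1):
    """Estimate E[y | R = r] by averaging outcomes within lambda-width
    prediction bins. For a perfectly calibrated model, each bin's observed
    outcome rate matches its mean prediction."""
    y, r = np.asarray(y, dtype=float), np.asarray(r, dtype=float)
    bins = np.clip(np.floor(r / lam).astype(int), 0, int(np.ceil(1 / lam)) - 1)
    curve = {}
    for b in np.unique(bins):
        mask = bins == b
        # (mean prediction in bin, observed outcome rate in bin)
        curve[int(b)] = (r[mask].mean(), y[mask].mean())
    return curve
```

Comparing the two entries of each tuple gives the per-bin calibration gap used throughout the discussion that follows.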
When evidence of algorithmic bias is observed, for example in estimates of kidney function (Diao et al., 2021), it is unclear how care should be adjusted for affected patient groups. Variation in clinical judgements that is not evidence-based is itself a source of unfairness, referred to as unwarranted clinical variation (Harrison et al., 2019; Sutherland and Levesque, 2020; Newhouse and Garber, 2013). Ideally, patients with the same evidence, e.g. the same model risk prediction, would receive the same standard of care. In this work, we propose measures and methods designed to reduce unwarranted variation in care that may arise from biased prediction models.
To address equity in calibration, Hebert-Johnson et al. (2018) proposed a fairness measure called multicalibration (MC), which asks that calibration be satisfied simultaneously over many flexibly-defined subgroups. Remarkably, MC can be satisfied efficiently by post-processing risk scores without negatively impacting the generalization error of a model, unlike other fairness concepts like demographic parity (Foulds and Pan, 2020) and equalized odds (Hardt et al., 2016). This has motivated the use of MC in practical settings (e.g. Barda et al. (2021)) and has spurred several extensions (Kim et al., 2019; Jung et al., 2021; Gupta et al., 2021; Gopalan et al., 2022). If we bin our risk predictions, the MC criterion specifies that, for every group within each bin, the absolute difference between the mean observed outcome and the mean of the predictions should be small.
As Barocas et al. (2019) note, fair calibration reduces to a more general fairness notion dubbed sufficiency. Under sufficiency, the expected outcome should be independent of group membership, conditioned on the risk prediction. We see this criterion as highly desirable in CDS contexts. Sufficiency eliminates a source of uncertainty from decision-making: how to interpret a recommendation from a model that has been observed to perform better or worse on certain subpopulations. Given a risk score from a model satisfying sufficiency, a decision maker cannot distinguish between patient outcomes on the basis of their group membership. Thus, they cannot justify variations from the model's recommendations on the basis of patient identity alone.
In this work, we start by assessing the conditions under which MC satisfies sufficiency. To do so, we derive a fairness criterion directly from sufficiency, differential calibration (DC). DC is an extension of differential fairness (Foulds et al., 2019b); both are named for their relation to differential privacy (Dwork and Roth, 2013). DC constrains ratios of population risk between groups within risk prediction bins. We show that DC measures the extent to which a model satisfies sufficiency in an interpretable way. In short, among patients assigned the same risk score from a model satisfying ε-DC, the outcome is at most e^ε times more likely in one group than in any other. A low ε thereby constrains the amount by which a decision maker may learn to unequally trust the model for different groups.
By relating sufficiency to MC, we describe a shortcoming of MC that can occur when the outcome probabilities are strongly tied to group membership. Under this condition, the amount of calibration error relative to the expected outcome can be unequal between groups. This inequality hampers the ability of MC to (approximately) guarantee sufficiency except by setting extremely low error thresholds, resulting in the need for large numbers of updates.
We propose a simple variant of MC called proportional multicalibration (PMC) that instead requires the proportion of calibration error within each bin and group to be small. We prove that PMC bounds both multicalibration and differential calibration. We show that PMC can be satisfied with an efficient post-processing method, similarly to MC (Pfisterer et al., 2021). Proportionally multicalibrated models thereby obtain robust fairness guarantees that are less dependent on population risk categories. Our method also satisfies PMC in fewer steps than MC post-processing by prioritizing updates to groups with high proportional calibration error.
Finally, we investigate the application of these methods to predicting patient admissions in the emergency department, a real-world resource allocation task that is targeted by current CDS models (Barak-Corren et al., 2021a). We create a benchmark dataset for this task using the recently released MIMIC-IV emergency department dataset (Johnson et al., 2021), and benchmark PMC- and MC-based post-processing approaches. We show that post-processing for PMC results in models that are accurate, multicalibrated, and differentially calibrated.
2. Reconciling Multicalibration and Sufficiency
2.1. Preliminaries
We consider the task of training a risk prediction model for a population of individuals with outcomes, y ∈ {0, 1}, and features, x ∈ X. Let D be the joint distribution from which individual samples (x, y) are drawn. We assume the outcomes are random samples from underlying independent Bernoulli distributions, denoted y ~ Bernoulli(p). Individuals can be further grouped into collections of subsets, C, such that S ∈ C is a subset of individuals sharing an attribute, and x ∈ S indicates that individual x belongs to group S. We denote our risk prediction model as R : X → [0, 1].
In order to consider calibration in practice, the risk predictions are typically discretized into bins. We represent discretization by a parameter, λ, that specifies the width of the bins. As an example, λ = 0.1 corresponds to decile bins. For brevity, proofs and some formal definitions in the following sections are given in Appendices A.2 and A.3.2.
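The bin-width discretization might be implemented as follows (a minimal sketch under our own naming; a width of 0.1 yields decile bins):

```python
import numpy as np

def bin_index(r, lam=0.1):
    """Map risk scores in [0, 1] to the index of their lambda-width bin.
    Scores of exactly 1.0 are folded into the top bin."""
    r = np.asarray(r, dtype=float)
    n_bins = int(np.ceil(1.0 / lam))
    return np.clip(np.floor(r / lam).astype(int), 0, n_bins - 1)
```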
2.2. Multicalibration
Hebert-Johnson et al. (2018) define multicalibration by first defining calibration with respect to a subset (i.e., group) of individuals:
Definition 1 (α-calibration) Let S ⊆ X. For α ∈ [0, 1], a predictor R is α-calibrated with respect to S if there exists some S′ ⊆ S with |S′| ≥ (1 − α)|S| such that, for all r ∈ [0, 1], |E[y | x ∈ S′, R(x) = r] − r| ≤ α.
MC then guarantees that -calibration holds over every subset from a collection of subsets:
Definition 2 (α-Multicalibration) Let C be a collection of subsets of X. A predictor R is α-multicalibrated on C if, for all S ∈ C, R is α-calibrated with respect to S.
We note that, according to Definition 1, a model need only be calibrated over a sufficiently large subset of each group in order to satisfy the definition. This relaxation is used to maintain a satisfactory definition of MC when working with discretized predictions. For simplicity, we conduct most of our analysis using the continuous versions of fairness definitions like Definition 2 (see Appendix A.3.1 for an extended discussion).
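Empirically, the MC criterion corresponds to taking the worst absolute calibration gap over every (group, bin) category, as in the MC loss used in our experiments. The sketch below is under our own naming and omits the subset relaxation of Definition 1:

```python
import numpy as np

def mc_loss(y, r, groups, lam=0.1):
    """Worst absolute gap between mean outcome and mean prediction over
    all (group, lambda-width bin) categories with at least one sample."""
    y, r, groups = map(np.asarray, (y, r, groups))
    bins = np.clip(np.floor(r / lam).astype(int), 0, int(np.ceil(1 / lam)) - 1)
    worst = 0.0
    for g in np.unique(groups):
        for b in np.unique(bins[groups == g]):
            mask = (groups == g) & (bins == b)
            worst = max(worst, abs(y[mask].mean() - r[mask].mean()))
    return worst
```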
MC is one of few approaches to achieving fairness that does not require a significant trade-off to be made between a model's generalization error and the improvement in fairness it provides. As Hébert-Johnson et al. (2018) show, this is because achieving multicalibration is not at odds with achieving accuracy in expectation for the population as a whole. This separates calibration fairness from other fairness constraints like demographic parity and equalized odds (Hardt et al., 2016), both of which may degrade the performance of the model on specific groups (Chouldechova, 2017; Pleiss et al., 2017). In clinical settings, such trade-offs may be difficult or impossible to justify. In addition to its alignment with accuracy in expectation, Hébert-Johnson et al. (2018) propose an efficient post-processing algorithm for MC similar to boosting. We discuss additional extensions to MC in Appendix A.1.
2.3. Measuring Sufficiency via Differential Calibration
MC provides a sense of fairness by approximating calibration by group, which is perfectly satisfied when E[y | R = r, x ∈ S] = r for all r and all S ∈ C. Calibration by group is closely related to the sufficiency fairness criterion (Barocas et al., 2019). Sufficiency states that the outcome probability is independent of group membership, conditioned on the risk score. In the binary group setting, with a sensitive attribute A ∈ {0, 1}, we can express sufficiency as
P(y = 1 | R = r, A = 0) = P(y = 1 | R = r, A = 1) for all r,   (1)

or equivalently, y ⊥ A | R.
Unlike calibration by group, sufficiency does not stipulate that the risk scores be calibrated, yet from a fairness perspective, sufficiency and calibration by group are equivalent (Barocas et al., 2019). In both cases, the sense of fairness stems from the desire for to capture everything about group membership that is relevant to predicting .
Under sufficiency, the risk score is equally informative of the outcome, regardless of group membership. Because of this, given a risk score from a model satisfying sufficiency, a decision maker cannot distinguish between patient outcomes on the basis of their group membership. In this sense, a model satisfying sufficiency provides an added level of trust in deployment: we know that a decision maker cannot justify different decisions on the basis of a patient's identity for the same risk prediction. If risk prediction models satisfied sufficiency, it would eliminate the need for group-specific decision protocols given recommendations from the same model. Consider, for example, the decision-making uncertainty related to estimates of kidney function (Diao et al., 2021).
Below, we define an approximate measure of sufficiency that constrains pairwise differentials between groups and accommodates binned predictions:
Definition 3 (ε-Differential Calibration) Let C be a collection of subsets of X. A model R is ε-differentially calibrated with respect to C if, for all pairs S_i, S_j ∈ C for which P(x ∈ S_i | R = r) > 0 and P(x ∈ S_j | R = r) > 0, for any r,

e^{−ε} ≤ P(y = 1 | R = r, x ∈ S_i) / P(y = 1 | R = r, x ∈ S_j) ≤ e^{ε}.   (2)

By inspection we see that ε in ε-DC measures the extent to which R satisfies sufficiency. That is, sufficiency holds exactly when ε = 0 for all pairs S_i, S_j. ε-DC requires that, for any risk score, the outcome is at most e^ε times more likely among one group than another, and at least e^{−ε} times as likely.
Definition 3 fits the general definition of a differential fairness measure proposed by Foulds et al. (2019a) and previously used to study demographic parity criteria (Foulds and Pan, 2020). We describe the relation in more detail in Appendix A.1.1, including Eq. (2)’s connection to differential privacy (Dwork and Lei, 2009) and pufferfish privacy (Kifer and Machanavajjhala, 2014).
Taken alone, DC does not prevent a decision-maker from equally distrusting a model for all patients. That is because it does not guarantee the model is globally calibrated. Rather, it makes it harder to distinguish the calibration quality of the model between groups.
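Empirically, DC can be estimated as the largest log-ratio of observed outcome rates between any pair of groups sharing a prediction bin. The sketch below is our own; the flooring constant `rho`, which keeps the logarithm finite on small samples, is an assumption:

```python
import numpy as np

def dc_loss(y, r, groups, lam=0.1, rho=0.01):
    """Largest |log| ratio of observed outcome rates between any pair of
    groups that share a lambda-width prediction bin."""
    y, r, groups = map(np.asarray, (y, r, groups))
    bins = np.clip(np.floor(r / lam).astype(int), 0, int(np.ceil(1 / lam)) - 1)
    eps = 0.0
    for b in np.unique(bins):
        in_bin = bins == b
        rates = [max(y[in_bin & (groups == g)].mean(), rho)
                 for g in np.unique(groups[in_bin])]
        if len(rates) > 1:  # need at least two groups to compare
            eps = max(eps, float(np.log(max(rates) / min(rates))))
    return eps
```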
2.4. The differential calibration of multicalibrated models is limited by low-risk groups
At a basic level, the forms of MC and sufficiency differ: MC constrains absolute differences between groups across prediction bins, whereas sufficiency constrains pairwise differentials between groups. To reconcile MC and DC/sufficiency more formally, we pose the following question: if a model satisfies α-MC, what, if anything, does this imply about the ε-DC of the model? (In Appendix A.4, Theorem 22, we answer the converse question.) We now show that multicalibrated models have bounded DC, but that this bound is weak when expected risk predictions are small.
Theorem 4 Let R be a model satisfying α-MC on a collection of subsets C. Let r_min > α be the minimum expected risk prediction among categories (S, r), S ∈ C. Then R is ln((r_min + α)/(r_min − α))-differentially calibrated.
Theorem 4 illustrates the important point that, in terms of percentage error, MC does not provide equal protection to groups with different risk profiles. Imagine a model satisfying (0.05)-MC for groups S1 and S2. Consider individuals receiving model predictions R = 0.8. MC guarantees that, for either group, the expected outcome probability is at least 0.75 and at most 0.85. This bounds the percent error among groups with this prediction to about 6%. In contrast, consider individuals for whom R = 0.3; each group may have a true outcome probability as low as 0.25, which is an error of 20%, about 3.4x higher than the percent error in the higher-risk group.
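The arithmetic behind this illustration can be checked directly. This is our own worked example; we report the worst case at the low end of the interval allowed by MC, so the high-risk figure comes out slightly above the ≈6% quoted above:

```python
# Under alpha-MC, the true outcome rate p in a bin with prediction r
# satisfies r - alpha <= p <= r + alpha, so the relative calibration
# error |p - r| / p can be as large as alpha / (r - alpha).
def worst_relative_error(r_pred, alpha=0.05):
    """Largest possible calibration error as a fraction of the true rate."""
    return alpha / (r_pred - alpha)

for r_pred in (0.8, 0.3):
    print(f"R = {r_pred}: up to "
          f"{100 * worst_relative_error(r_pred):.1f}% relative error")
```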
In Appendix Theorem 22, we show that differentially calibrated models can bound multicalibration only if they are also α-calibrated (Definition 1).
3. Proportional Multicalibration
We are motivated to define a measure that is efficiently learnable like MC (Definition 2) but better aligned with the multiplicative interpretation of sufficiency, like DC (Definition 3). To do so, we define PMC, a variant of MC that constrains the proportional calibration error of a model among subgroups and risk strata. In this section, we show that bounding a model’s PMC is enough to meaningfully bound DC and MC. Furthermore, we provide an efficient algorithm for satisfying PMC based on a simple extension of MC/Multiaccuracy boosting (Kim et al., 2019). We begin by defining proportional calibration, which expresses calibration error as a percentage of the outcome probability among a group.
Definition 5 (α-Proportional Calibration) Let S ⊆ X. For α ∈ [0, 1], R is α-proportionally calibrated with respect to S if there exists some S′ ⊆ S with |S′| ≥ (1 − α)|S| such that, for all r ∈ [0, 1], |E[y | x ∈ S′, R(x) = r] − r| ≤ α E[y | x ∈ S′, R(x) = r].
Proportional multicalibration is then defined by requiring that Definition 5 be satisfied among a collection of groups:
Definition 6 (α-Proportional Multicalibration) Let C be a collection of subsets of X. A predictor R is α-proportionally multicalibrated on C if, for all S ∈ C, R is α-proportionally calibrated with respect to S.
We also define a discretized version of PMC, (α, λ)-PMC, in Appendix A.3.2 that is useful for implementing the measure in Algorithm 1 and for measuring PMC in our experiments. In Appendix A.3.1, we show that (α, λ)-PMC meaningfully bounds α-PMC under different discretizations, such that we can minimize (α, λ)-PMC to achieve low α-PMC.
In practice, we must ensure that the outcome probability for any (group, prediction bin) category is greater than zero for PMC to be meaningful. We later introduce a lower bound, ρ, to prevent the outcome probability from being too small. It is common in clinical settings to only show a risk prediction to a decision maker if it exceeds some threshold; in those settings, ρ can be set to match this threshold.
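Mirroring the MC loss, an empirical PMC loss divides each category's calibration error by its observed outcome rate, skipping categories below the lower bound. Again a sketch under our own naming:

```python
import numpy as np

def pmc_loss(y, r, groups, lam=0.1, rho=0.01):
    """Worst calibration error as a fraction of the observed outcome rate,
    over all (group, lambda-width bin) categories whose rate is >= rho."""
    y, r, groups = map(np.asarray, (y, r, groups))
    bins = np.clip(np.floor(r / lam).astype(int), 0, int(np.ceil(1 / lam)) - 1)
    worst = 0.0
    for g in np.unique(groups):
        for b in np.unique(bins[groups == g]):
            mask = (groups == g) & (bins == b)
            p = y[mask].mean()
            if p >= rho:  # ignore categories with near-zero outcome rate
                worst = max(worst, abs(p - r[mask].mean()) / p)
    return worst
```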
Comparison to Differential Calibration
Rather than constraining the differentials of prediction- and group-specific outcomes among all pairs of subgroups in C as in DC (Definition 3), PMC constrains the relative error of each group in C. In practical terms, this makes PMC cheaper to compute than DC by a factor of |C|, since DC compares all pairs of groups. In addition, PMC constrains each group's calibration with respect to ground truth, whereas DC only constrains the differentials between groups. We formalize the relationship between these two measures below.
Theorem 7 Let R be a model satisfying α-PMC on a collection C. Then R is ln((1 + α)/(1 − α))-differentially calibrated.
Theorem 7 demonstrates that α-proportionally multicalibrated models satisfy a straightforward notion of differential fairness that depends monotonically on α alone. Because PMC bounds DC, proportionally calibrated models inherit desirable properties of differentially calibrated models, especially the following:
Corollary 8 Let R be a model satisfying α-PMC on a collection C. Then R satisfies ln((1 + α)/(1 − α))-pufferfish privacy with respect to C.
Corollary 8 follows from the definitions of ε-DC and pufferfish privacy (Appendix Definition 13, Kifer and Machanavajjhala (2014)). This property is perhaps best understood as a guarantee of trust rather than privacy. Namely, it guarantees that the patient's group identity does not provide useful information to a decision-maker in deciding how much to trust model predictions. The patient's group identity is in this sense "private" from the model's calibration. Unlike many privacy-preserving algorithms, we do not need to add any noise to the model to satisfy DC; rather, it is achieved by making the model better calibrated for some groups.
Comparison to Multicalibration
Rather than constraining the absolute difference between risk predictions and the outcome as in MC, PMC requires that the calibration error be a small fraction of the expected risk in each category (S, r). In this sense, it provides a stronger protection than MC by requiring the calibration error to be a small fraction of the outcome rate regardless of the risk group. We would argue that this is also better aligned with the notion of fairness in risk prediction contexts. Under MC, the underlying probability of an outcome within a group affects the fairness protection that group receives (i.e., the percentage error that Definition 2 allows). Because the underlying probabilities of many clinically relevant outcomes vary significantly among subpopulations, multicalibrated models may systematically permit higher percentage error for specific groups. The difference in relative calibration error among populations with different risk profiles also translates into weaker sufficiency guarantees, as demonstrated in Theorem 4. In contrast, PMC provides a fairness guarantee that is less dependent on subpopulation risks.
In the following theorem, we show that MC is also constrained when a model satisfies PMC.
Theorem 9 Let R be a model satisfying α-PMC on a collection C. Then R is (α/(1 − α))-multicalibrated on C.
The proof of Theorem 9 is given in Appendix A.2. This theorem implies that a proportionally calibrated model with sufficiently low α is also multicalibrated at a similarly low level. We further discuss and illustrate the bounds given by Theorems 4, 7 and 9 in Appendix A.3.
Because PMC bounds a model's multicalibration, proportionally multicalibrated models also inherit a bound on generalization error guaranteeing that their predictions are close to the "best in class" predictions.
Corollary 10 Let R be a model satisfying α-PMC on a collection C. Let H be a set of predictors, D the outcome-generating distribution, and h* the minimizer of expected loss over H on D. Then the expected loss of R on D exceeds that of h* by at most an additive term that shrinks with α.
This guarantee follows directly from Theorem 5 in Hebert-Johnson et al. (2018). In a nutshell, it means that proportionally multicalibrated models are guaranteed to be close to the best predictor in their hypothesis class H. This contrasts fair calibration with other approaches to fairness, which typically trade off against a model's overall error. We further illustrate the relationship between PMC, DC, and MC in Appendix Fig. 6.
3.1. Learning proportionally multicalibrated predictors
In Algorithm 1, we propose an extension of MCBoost (Pfisterer et al., 2021) to efficiently update risk predictors to satisfy PMC. Algorithm 1 works by checking for calibration errors among groups and prediction intervals that violate the user threshold, and adjusting these predictions towards the target. PMCBoost differs from MCBoost in two main ways: first, it updates whenever the calibration error of a category is not within α times that category's expected outcome, as opposed to simply within α. Second, it ignores updates for categories with low outcome probability (less than ρ). Next, we prove that PMCBoost learns an (α, λ)-PMC model in a polynomial number of steps.
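A minimal sketch of this post-processing loop follows. It is our own simplification: full-batch statistics, a plain additive update toward each category's observed rate, and a fixed iteration cap stand in for the details of Algorithm 1:

```python
import numpy as np

def pmc_boost(r, y, groups, alpha=0.05, lam=0.1, rho=0.01, max_iter=1000):
    """Repeatedly find (group, bin) categories whose calibration error
    exceeds alpha times the observed outcome rate, and shift those
    predictions toward the observed rate. Categories with outcome rate
    below rho are skipped."""
    r = np.asarray(r, dtype=float).copy()
    y, groups = np.asarray(y), np.asarray(groups)
    n_bins = int(np.ceil(1 / lam))
    for _ in range(max_iter):
        updated = False
        bins = np.clip(np.floor(r / lam).astype(int), 0, n_bins - 1)
        for g in np.unique(groups):
            for b in range(n_bins):
                mask = (groups == g) & (bins == b)
                if not mask.any():
                    continue
                p = y[mask].mean()
                if p < rho:  # skip near-zero outcome categories
                    continue
                delta = p - r[mask].mean()
                if abs(delta) > alpha * p:  # proportional violation
                    r[mask] = np.clip(r[mask] + delta, 0.0, 1.0)
                    updated = True
        if not updated:  # all categories within tolerance
            break
    return r
```

Changing the violation test to `abs(delta) > alpha` recovers an MCBoost-style update in this sketch.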
Proposition 11 Let C be a collection of subsets of X such that, for all S ∈ C, the outcome probability within each (S, r) category is at least ρ. Let R be a risk prediction model to be post-processed. Then there exists an algorithm that satisfies (α, λ)-PMC with respect to C in a number of steps polynomial in 1/α, 1/λ, and 1/ρ.
We analyze Algorithm 1 and show that it satisfies Proposition 11 in Appendix A.2. The more stringent PMC threshold requires an additional factor of 1/ρ² steps compared to the algorithm for MC, where ρ is a lower bound on the expected outcome within a category (S, r).
Achieving proportional multicalibration with MCBoost
Another way to achieve PMC is to use the existing MCBoost algorithm, but with a much smaller tolerance. In other words, setting a very low value for α should also satisfy Definition 6 because, if α can be made small enough, the calibration error on all categories will be small compared to the outcome prevalence. However, achieving PMC guarantees with MCBoost requires a large number of unnecessary updates for high-risk groups, since the DC and PMC of multicalibrated models are limited by low-risk groups (Theorem 4). Furthermore, the number of steps in MCBoost (and PMCBoost) scales as an inverse high-order polynomial of α (cf. Thm. 2, Hebert-Johnson et al. (2018)).
We can compare the complexity of MCBoost and PMCBoost directly. Say we would like to satisfy α-PMC using MCBoost. We must reduce the MC tolerance by a factor of ρ, i.e. satisfy (αρ)-MC; by the step bound of Hebert-Johnson et al. (2018), this increases the number of MCBoost steps by a factor of (1/ρ)³. For a reasonable value of ρ = 0.1, this takes 1000x more steps, and corresponds to 10x more steps than it would take to satisfy α-PMC with PMCBoost.
Now consider the reverse: satisfying α-MC using PMCBoost. Because of Theorem 9, we need only reduce the PMC tolerance by a factor of 1/(1 + α), which increases the number of PMCBoost steps by a factor of (1 + α)³. For example, to achieve (0.1)-MC, PMCBoost takes about 1.3x more steps than MCBoost.

4. Experiments
In our first set of experiments, we study MC and PMC in simulated population data to understand and validate the analysis in the previous sections. In the second, we compare the performance of varied model treatments on a real-world hospital admission task, using an implementation of Algorithm 1. We make use of empirical versions of our fairness definitions, which we refer to as MC loss, PMC loss, and DC loss. In short, these measures calculate the maximum (proportional) calibration error or pairwise calibration differential among subgroups and risk categories in the data sample. Due to space constraints, the formal definitions are given in Appendix A.3.2 (Definitions 19, 20 and 21).
Simulation study
We simulate data from multicalibrated models. For simplicity, we specify a data structure with a one-to-one correspondence between subsets and model-estimated risk, such that every individual in a given subset receives the same risk prediction. Therefore all information for predicting the outcome based on the features is contained in the attributes that define subgroup membership. Outcome probabilities are specified to increase across the subsets, which are indexed in order of increasing risk. We randomly select the calibration error of one group to be exactly α and draw smaller errors for the remaining groups, so that the model is exactly α-multicalibrated. In all cases, the sign of each group's calibration error is determined by a random draw from a Bernoulli distribution. For these simulations we set α = 0.1. We generate 1000 simulated datasets and, for each group, calculate the ratio of the absolute mean error to the group's outcome probability, i.e. the PMC loss function for this data-generating mechanism.
We also simulate three specific scenarios where: 1) the calibration error is equivalent for all groups (Fixed); 2) the calibration error increases with increasing outcome probability; and 3) the calibration error decreases with increasing outcome probability, with the model α-multicalibrated in each case. These scenarios compare settings in which α is determined by all groups, the group with the lowest outcome probability, and the group with the highest outcome probability, respectively.
Hospital admission
Next, we test PMC alongside other methods in application to prediction of inpatient hospital admission for patients visiting the emergency department (ED). Overcrowding and long wait times in EDs have been shown to increase odds of inpatient death (5%, CI 2–8%), length of stay (0.8%, CI 0.5–1%), and costs per admission (1%, CI 0.7–2%) (Sun et al., 2013). The burden of overcrowding and long wait times in EDs is significantly higher among non-white, non-Hispanic patients and socio-economically marginalized patients (James et al., 2005; McDonald et al., 2020).
Recent work has demonstrated risk prediction models that can expedite patient visits by predicting patient admission at an early stage of a visit with a high degree of certainty (AUC ≥ 0.9 across three large care centers) (Barak-Corren et al., 2017b,a, 2021b,a). Our goal is to ensure that no group of patients will be over- or under-prioritized relative to another by these models, which could exacerbate the treatment and outcome disparities that currently exist.
We construct a prediction task similar to previous studies but using a new data resource: the MIMIC-IV-ED repository (Johnson et al., 2021). After data preparation (see Appendix A.5), the cohort consists of 173,561 visits with demographics and admission statistics given in Table 1. Table 2 shows the list of features used for prediction, modeled on prior work (Barak-Corren et al., 2021a). In Table 1 we observe stark differences in admission rates by ethnoracial group and gender, suggesting that a proportional measure of calibration could be appropriate for this task. We trained and evaluated penalized logistic regression (LR), random forest (RF), and deep neural network (DNN) models of patient admission, with and without post-processing by MCBoost (Pfisterer et al., 2021) or PMCBoost. We varied α, γ, ρ, and the group definitions to characterize parameter sensitivity among the methods (Table 3). For each parameter setting, we conducted 100 repeat experiments with different shuffles of the data. Comparisons are reported on a test set of 20% of the data for each trial. Additional experiment details are available in Appendix A.5.
Table 1:
Admission prevalence (Admissions/Total (%)) among patients in the MIMIC-IV-ED data repository, stratified by the intersection of ethnoracial group and gender.
| Gender | F | M | Overall |
|---|---|---|---|
| Ethnoracial Group | |||
| American Indian/Alaska Native | 70/257 (27%) | 82/170 (48%) | 152/427 (36%) |
| Asian | 1043/3595 (29%) | 1032/2384 (43%) | 2075/5979 (35%) |
| Black/African American | 3124/27486 (11%) | 2603/14458 (18%) | 5727/41944 (14%) |
| Hispanic/Latino | 1063/10262 (10%) | 1168/5795 (20%) | 2231/16057 (14%) |
| Other | 1232/5163 (24%) | 1479/3849 (38%) | 2711/9012 (30%) |
| Unknown/Unable to Obtain | 1521/2156 (71%) | 2074/2377 (87%) | 3595/4533 (79%) |
| White | 18147/50174 (36%) | 18951/45435 (42%) | 37098/95609 (39%) |
| Overall | 26200/99093 (26%) | 27389/74468 (37%) | 53589/173561 (31%) |
Table 2:
Features used in the hospital admission task.
| Description | Features |
|---|---|
| Vitals | temperature, heartrate, resprate, o2sat, systolic blood pressure, diastolic blood pressure |
| Triage Acuity | Emergency Severity Index (Tanabe et al., 2004) |
| Check-in Data | chief complaint, self-reported pain score |
| Health Record Data | no. previous visits, no. previous admissions |
| Demographic Data | ethnoracial group, gender, age, marital status, insurance, primary language |
Table 3:
Parameters for the hospital admission prediction experiment.
| Parameter | Values |
|---|---|
| tolerance (α) | (0.001, 0.01, 0.1, 0.2) |
| min group probability (γ) | (0.05, 0.1) |
| binning (λ) | 0.1 |
| min outcome probability (ρ) | (0.01, 0.1) |
| Base Model | LR, RF, DNN |
| Groups | (race/ethnicity, gender), (race/ethnicity, gender, insurance product) |
5. Results
Fig. 2 shows the results of our simulation study. The results indicate that, without the proportionality factor, α-multicalibrated models exhibit a dependence between group outcome prevalence and the amount of proportional calibration loss. The results demonstrate why α-MC alone does not guarantee sufficiency, particularly when outcome probabilities vary by group.
Figure 2:

The relationship between MC, PMC, and outcome prevalence as illustrated via a simulation study in which the rates of the outcome are associated with group membership. Gray points denote the PMC loss of a (0.1)-MC model on 1000 simulated datasets, and colored lines denote three scenarios in which each group’s calibration error varies. Although MC is identical in all scenarios, PMC loss is higher among groups with lower positivity rates in most scenarios unless the groupwise calibration error increases with positivity rate.
Results on the hospital admission prediction task are summarized in Figs. 3 to 5 and Table 4. As Table 4 shows, PMCBoost has a small effect on predictive performance while improving DC loss and PMC loss by 10–81% across LR, RF, and DNN models. Somewhat surprisingly, PMCBoost improves MC loss more effectively than MCBoost for both DNN (19% versus 11%) and RF models (18% versus 6%). We observe that PMCBoost and MCBoost do not improve MC loss of LR models, although PMCBoost improves other metrics of fairness for LR (PMC loss and DC loss) without changing AUROC. In Fig. 4, we illustrate the calibration improvement of RF models using PMCBoost for three patient subgroups defined by the intersection of race/ethnicity, gender, and insurance type.
Figure 3:

A comparison of LR, RF, and DNN models, with and without MCBoost and PMCBoost, on the hospital admission task. From left to right, trained models are compared in terms of test set AUROC, MC loss, PMC loss, and DC loss. Points represent the median performance over 100 shuffled train/test splits with bootstrapped 99% confidence intervals. We test for significant differences between post-processing methods using two-sided Wilcoxon rank-sum tests with Bonferroni correction.
Figure 5:

Number of iterations to convergence for PMCBoost and MCBoost when optimizing for the same value of α-PMC.
Table 4:
Median percent change in loss from the base model using MC- and PMCBoost.
| Postprocessing | ML | AUROC Δ% | MC Δ% | PMC Δ% | DC Δ% |
|---|---|---|---|---|---|
| MCBoost | DNN | −0.58 | −10.83 | −16.27 | −2.41 |
| MCBoost | LR | −0.01 | 10.95 | −9.50 | −6.46 |
| MCBoost | RF | −0.09 | −5.91 | −55.44 | −13.88 |
| PMCBoost | DNN | −0.56 | −19.49 | −28.83 | −9.91 |
| PMCBoost | LR | 0.01 | 14.18 | −37.90 | −14.90 |
| PMCBoost | RF | −0.27 | −17.53 | −80.82 | −23.52 |
Figure 4:

Calibration curves for RF models with and without PMCBoost postprocessing.
Computational Cost
As mentioned in Section 3.1, MCBoost may be configured to approximately satisfy α-PMC by setting its tolerance α to a much smaller value, although this may take an excessive number of steps. To better understand this trade-off, we empirically compared MC- and PMCBoost by the number of steps required for each to reach their best performance in Fig. 5. On average, MCBoost requires approximately 5x more updates than PMCBoost to achieve similar performance on PMC loss, due to its dependence on very small values of α. Wall clock times scale similarly, as detailed in Table 5.
Table 5:
For MCBoost and PMCBoost, we compare the average number of updates and wall clock time (s) taken to train for equivalent values of α-PMC.
| ML | α | Postprocessing | Iterations | Time (s) |
|---|---|---|---|---|
| LR | 0.001 | MC | 2094.0 | 678.0 |
| LR | 0.001 | PMC | 263.0 | 320.0 |
| LR | 0.010 | MC | 631.0 | 223.0 |
| LR | 0.010 | PMC | 153.0 | 202.0 |
| RF | 0.001 | MC | 1996.0 | 556.0 |
| RF | 0.001 | PMC | 323.0 | 234.0 |
| RF | 0.010 | MC | 713.0 | 197.0 |
| RF | 0.010 | PMC | 256.0 | 167.0 |
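To make the post-processing loop concrete, here is a minimal sketch of an MCBoost/PMCBoost-style update rule. It is an illustration under our own assumptions (uniform prediction bins, full-batch updates, parameter names `alpha`, `rho`, `n_bins`), not the authors' exact Algorithm 1:

```python
import numpy as np

def pmc_postprocess(r, y, groups, alpha=0.1, rho=0.1, n_bins=10, max_iter=1000):
    """Sketch of PMC-style post-processing.

    r      : initial risk predictions in [0, 1]
    y      : binary outcomes
    groups : array of group labels
    Shifts predictions in any (group, prediction-bin) category whose
    *proportional* calibration error exceeds alpha, until none remain.
    """
    r = r.copy().astype(float)
    for _ in range(max_iter):
        updated = False
        bins = np.clip((r * n_bins).astype(int), 0, n_bins - 1)
        for g in np.unique(groups):
            for b in range(n_bins):
                idx = (groups == g) & (bins == b)
                if idx.sum() == 0:
                    continue
                ybar = y[idx].mean()
                if ybar < rho:  # PMC constrains only categories with ybar >= rho
                    continue
                if abs(ybar - r[idx].mean()) / ybar > alpha:
                    # shift the category's predictions toward its outcome rate
                    r[idx] = np.clip(r[idx] + (ybar - r[idx].mean()), 0.0, 1.0)
                    updated = True
        if not updated:
            break
    return r
```

Because the violation test divides by the category's outcome rate, low-prevalence categories are held to the same relative tolerance as high-prevalence ones, which is the distinction between this loop and a plain MCBoost update.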
Sensitivity Analysis
In Appendix A.6, we look at detailed performance comparisons of MCBoost and PMCBoost over values of α and group definitions in Figs. 8 to 10. We observe that, while low values of α for MCBoost improve its PMC loss performance, PMCBoost typically performs as well or better, particularly for larger values of α, and does so in fewer iterations.
6. Discussion and Conclusion
In this paper we have analyzed multicalibration through the lens of sufficiency and differential calibration to reveal the sensitivity of this metric to correlations between outcome rates and group membership. We have proposed a measure, PMC, that alleviates this sensitivity and attempts to capture the “best of both worlds” of MC and sufficiency. PMC provides equivalent percentage calibration protections to groups regardless of their risk profiles, and in so doing, bounds a model’s differential calibration. We provide an efficient algorithm for learning PMC predictors by postprocessing a given risk prediction model. On a real-world and clinically relevant task (admission prediction), we have shown that post-processing three types of models with PMC leads to better performance across all three fairness metrics, with a small impact on predictive performance.
Our preliminary analysis suggests PMC can be a valuable metric for training fair algorithms in resource allocation contexts. Future work could extend this analysis on both the theoretical and practical sides. On the theoretical side, the generalization properties of the PMC measure should be established and its sample complexity quantified, as Rose (2018) did with MC. Additional extensions of PMC could establish a bound on the accuracy of PMC-postprocessed models in a similar vein to work by Kim et al. (2019) and Hebert-Johnson et al. (2018). On the empirical side, future work should benchmark PMC on a larger set of real-world problems and explore use cases in more depth. In addition, user studies could be employed to validate whether proportionally multicalibrated models do in fact instill more trust in decision-makers compared to baseline models.
Figure 1:

Visualization of the constraint differences between multicalibration and proportional multicalibration (PMC). The filled area represents the maximum amount by which the predicted risk can deviate from the fraction of positives in the population, for any group. Below ρ, the constraint is constant for PMC.
Acknowledgments
E.L. was partially supported by the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) grant 5 T32 HD40128-19. W.G.L. was partially supported by the National Institutes of Health (NIH) National Library of Medicine grant R00-LM012926.
Appendix A. Appendix
In this section, we include additional comparisons to related work, additional definitions, proofs to the theorems in the main text, and additional experimental details. The code to reproduce the figures and experiments is available here: https://github.com/cavalab/proportional-multicalibration.
A.1. Related Work
Definitions of Fairness
There are myriad ways to measure fairness that are covered in more detail in other works (Barocas et al., 2019; Chouldechova and Roth, 2018; Castelnovo et al., 2021). We briefly review three notions here. The first, demographic parity, requires the model’s predictions to be independent of patient demographics. Although a model satisfying demographic parity can be desirable when the outcome should be unrelated to sensitive attributes (Foulds and Pan, 2020), it can be unfair if important risk factors for the outcome are associated with those attributes (Hardt et al., 2016). For example, it may be more fair to admit socially marginalized patients to a hospital at a higher rate if they are assessed as less able to manage their care at home. Furthermore, if the underlying rates of illness vary demographically, requiring demographic parity can result in healthier patients from one group being admitted more often than patients who urgently need care.
When the base rates of admission are expected to differ demographically, we can instead ask that the model’s errors be balanced across groups. One such notion is equalized odds, which states that, conditional on the true outcome, the model’s predictions should be independent of group membership. Satisfying equalized odds is equivalent to having equal false positive and false negative rates for every group.
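As an illustration, these groupwise error rates can be computed directly from predictions; the following sketch (function name and signature are ours, not from the paper) returns the per-group FPR and FNR whose equality across groups characterizes equalized odds:

```python
import numpy as np

def group_error_rates(y_true, y_pred, groups):
    """False positive and false negative rates per group; equalized odds
    holds when these rates match across groups."""
    rates = {}
    for g in np.unique(groups):
        idx = groups == g
        yt, yp = y_true[idx], y_pred[idx]
        # guard against empty classes within a group
        fpr = ((yp == 1) & (yt == 0)).sum() / max((yt == 0).sum(), 1)
        fnr = ((yp == 0) & (yt == 1)).sum() / max((yt == 1).sum(), 1)
        rates[g] = (fpr, fnr)
    return rates
```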
When the model is used for patient risk stratification, as in the target use case in this paper, it is important to consider a model’s calibration for each demographic group in the data. Because risk prediction models influence who is prioritized for care, an unfairly calibrated model can systematically under-predict risk for certain demographic groups and result in under-allocation of patient care to those groups. Thus, guaranteeing group-wise calibration via an approach such as multicalibration also guarantees fair patient prioritization for health care provision. In some contexts, risk predictions are not directly interpreted but are only used to rank patients, which can be sufficient for resource allocation. Authors have proposed various ways of measuring the fairness of model rankings, for example by comparing AUROC between groups (Kallus et al., 2020).
Approaches to Fairness
Many approaches to achieving fairness guarantees according to demographic parity, equalized odds, and their relaxations have been proposed (Dwork et al., 2012; Hardt et al., 2016; Berk et al., 2017; Jiang and Nachum, 2019; Kearns et al., 2018). When choosing an approach, it is important to carefully weigh the relative impact of false positives, false negatives, and miscalibration on patient outcomes, which differs by use case. When group base rates (i.e., group-specific positivity rates) differ, equalized odds and calibration by group cannot both be satisfied (Kleinberg et al., 2016). Instead, one can often achieve multicalibration while satisfying relaxations of equalized odds such as equalized accuracy. However, doing so may require degrading the performance of the model on specific groups (Chouldechova, 2017; Pleiss et al., 2017), which is unethical in our context.
As mentioned in the introduction, we are also motivated to utilize approaches to fairness that 1) dovetail well with intersectionality theory, and 2) provide privacy guarantees. Most work in the computer science/machine learning space does not engage with the broader literature on socio-cultural concepts like intersectionality, which we see as a gap that makes adoption in real-world settings difficult (Hanna et al., 2020). One exception is differential fairness (Foulds et al., 2019a), a measure designed with intersectionality in mind. In addition to being a definition of fairness that provides equal protection to groups defined by intersections of protected attributes, models satisfying ε-differential fairness also satisfy ε-pufferfish privacy. This privacy guarantee is very desirable in risk prediction contexts, because it limits the extent to which the model reveals sensitive information that has the potential to influence a decision maker’s interpretation of the model’s recommendation. However, prior work on differential fairness has been limited to using it to control for demographic parity, which is not an appropriate fairness measure for our use case (Foulds and Pan, 2020).
Multicalibration has inspired several extensions, including relaxations such as multiaccuracy (Kim et al., 2019), low-degree multicalibration (Gopalan et al., 2022), and extensions to conformal prediction and online learning (Jung et al., 2021; Gupta et al., 2021). Noting that multicalibration is a guarantee over mean predictions on a collection of groups, Jung et al. (2021) propose to extend multicalibration to higher-order moments (e.g., variances), which allows one to estimate a confidence interval for the calibration error of each category. Gupta et al. (2021) extend this idea and generalize it to the online learning context, in which an adversary chooses a sequence of examples for which one wishes to quantify the uncertainty of different statistics of the predictions. Recent work has also utilized higher-order moments to “interpolate” between the guarantees provided by multiaccuracy, which only requires accuracy in expectation for groups in the collection, and multicalibration, which requires accuracy in expectation at each prediction interval (Kim et al., 2019). Like proportional multicalibration (Definition 6), definitions of multicalibration for higher-order moments provide additional criteria for quantifying model performance over many groups; in general, however, much of the focus in other work is on statistics for uncertainty estimation. Like these works, one may view our proposal for proportional multicalibration as an alternative definition of what it means to be multicalibrated. The key difference is that proportional multicalibration measures the degree to which multicalibration depends on differences in outcome prevalence between groups, and in doing so provides guarantees of pufferfish privacy and differential calibration.
Dwork et al. (2019) study the relation of fair rankings to multicalibration, and, in a similar vein to differential fairness measures, formulate a fairness measure for group rankings using the relations between pairs of groups. However, these definitions are specific to the ranking relation between the groups, whereas differential calibration cares only about the outcome differential (conditioned on model predictions) between pairs of groups.
A.1.1. Differential Fairness
DF was explicitly defined to be consistent with the social theoretical framework of intersectionality. This framework dates back as early as the social movements of the ‘60s and ‘70s (Collins and Bilge, 2020) and was brought into the academic mainstream by pioneering work from legal scholar Kimberlé Crenshaw (Crenshaw, 1989, 1991) and sociologist Patricia Hill Collins (Collins, 1990). Central to intersectionality is the view that hierarchies of power and oppression are structural elements fundamental to our society. Through an intersectional lens, these power structures are viewed as interacting and co-constituted, inextricably related to one another. To capture this viewpoint, DF (Foulds et al., 2019a) constrains the differential of a general data mechanism among all pairs of groups, where groups are explicitly defined as the intersections of protected attributes.
Definition 12 (ε-differential fairness) (Foulds et al., 2019a) Let Θ denote a set of distributions and let x ∼ θ for θ ∈ Θ. A mechanism M(x) is ε-differentially fair with respect to (A, Θ) if, for all θ ∈ Θ, all (s_i, s_j) ∈ A × A with P(s_i | θ) > 0 and P(s_j | θ) > 0, and all y ∈ Range(M),

e^{−ε} ≤ P(M(x) = y | s_i, θ) / P(M(x) = y | s_j, θ) ≤ e^{ε}.   (3)
Definition 13 (Pufferfish Privacy) Let the collection of subsets S represent sets of secrets. A mechanism M(x) is ε-pufferfish private (Kifer and Machanavajjhala, 2014) with respect to (S, Θ) if, for all θ ∈ Θ, all secret pairs (s_i, s_j) ∈ S × S, and all outputs y,

e^{−ε} ≤ P(M(x) = y | s_i, θ) / P(M(x) = y | s_j, θ) ≤ e^{ε}   (4)

when s_i and s_j are such that P(s_i | θ) > 0 and P(s_j | θ) > 0.
Note on pufferfish and differential privacy
Although Eq. (3) is notable in its similarity to differential privacy (Dwork and Lei, 2009), they differ in important ways. Differential privacy aims to limit the amount of information learned about any one individual in a database by computations performed on the data. Pufferfish privacy only limits information learned about the group membership of individuals as defined by the secret sets. Kifer and Machanavajjhala (2014) describe in detail the conditions under which these privacy frameworks are equivalent.
Efficiency Property
Foulds et al. (2019a) also define an interesting property of differential fairness that allows guarantees of higher order (i.e., marginal) groups to be met for free; the property is given in Appendix A.3.2.
Definition 14 (Efficiency Property) (Foulds et al., 2019a) Let M(x) be an ε-differentially fair mechanism with respect to (C, Θ), where the collection of subsets C groups individuals according to the Cartesian product of attributes A. Let C′ be any collection of subsets that groups individuals by the Cartesian product of attributes in A′, where A′ ⊂ A and |A′| < |A|. Then M(x) is ε-differentially fair in (C′, Θ).
The authors call this the “intersectionality property”, although in practice it guarantees the reverse: if a model satisfies ε-DF for the low-level (i.e., intersectional) groups, then it satisfies ε-DF for every higher-level (i.e., marginal) group. For example, if a model is ε-differentially fair for intersectional groupings of individuals by race and sex, then it is ε-DF for the higher-level race and sex groupings as well. Whereas the number of intersections grows exponentially as additional attributes are protected (Kearns et al., 2018), the number of total possible subgroupings grows at a larger combinatorial rate: for m protected attributes, we have ∏ᵢ(nᵢ + 1) − 1 groups, where nᵢ is the number of levels of attribute i.
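To illustrate these growth rates, the counts can be computed directly. A small sketch (function names are ours), using the convention that each attribute contributes its levels plus a wildcard, consistent with the four-intersections-versus-eight-total example for two binary attributes:

```python
from math import prod

def n_intersectional_groups(levels):
    """Number of lowest-level (fully intersectional) groups,
    given the number of levels of each protected attribute."""
    return prod(levels)

def n_total_subgroups(levels):
    """Total groups over all Cartesian products of attribute subsets:
    each attribute contributes its levels plus a wildcard, minus the
    all-wildcard (everyone) group."""
    return prod(n + 1 for n in levels) - 1
```

For two binary attributes this gives 4 intersectional groups but 8 total subgroupings; adding a third attribute with four levels already yields 59.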
Limitations
To date, analysis of DF for predictive modeling has been limited to defining the model’s prediction as the mechanism, which is akin to asking for demographic parity. Under demographic parity, one requires that model predictions be independent of group membership entirely, which limits its utility as a fairness notion. Although a model satisfying demographic parity can be desirable when the outcome should be unrelated to sensitive attributes (Foulds and Pan, 2020), it can be unfair if important risk factors for the outcome are associated with demographics (Hardt et al., 2016). For example, if the underlying rates of an illness vary demographically, requiring demographic parity can result in healthier patients from one group being admitted more often than patients who urgently need care.
A.2. Proofs for Theorems in the Main Text
Theorem 4 Let R be a model satisfying α-MC on a collection of subsets C. Let r_min > α be the minimum expected risk prediction among S ∈ C and r ∈ [0, 1]. Then R is ln((r_min + α)/(r_min − α))-differentially calibrated.
Proof Let r_S = P(y | R = r, S). α-MC guarantees that r − α ≤ r_S ≤ r + α for all groups S ∈ C and predictions r ∈ [0, 1]. Plugging these lower and upper bounds into Eq. (2), we observe that the upper bound on ε-DC for R is given by r_{S_i} / r_{S_j} ≤ (r + α)/(r − α). The maximum of the left-hand side for a fixed α occurs at the smallest value of r; therefore R satisfies e^{ε} = (r_min + α)/(r_min − α). By switching the numerator and denominator we obtain the minimum differential and the left-hand-side constraint from Definition 3, i.e. e^{−ε} = (r_min − α)/(r_min + α). Thus R is ln((r_min + α)/(r_min − α))-differentially calibrated. ■
Theorem 7 Let R be a model satisfying α-PMC on a collection C. Then R is ln((1 + α)/(1 − α))-differentially calibrated.
Proof Let r_i = P(y | R = r, S_i) and r_j = P(y | R = r, S_j). If R satisfies α-PMC (Definition 6), then r/(1 + α) ≤ r_S ≤ r/(1 − α) for every S ∈ C. Solving for the upper bound on ε-DC, we immediately have r_i / r_j ≤ (1 + α)/(1 − α). ■
Theorem 9 Let R be a model satisfying α-PMC on a collection C. Then R is α/(1 − α)-multicalibrated on C.
Proof To distinguish the parameters, let R be a model satisfying β-PMC. Let r̄ = P(y | R = r, S), so that |r̄ − r| ≤ β r̄. We solve for the upper bound on α-MC from Definition 2 for the case when r̄ ≥ r. This yields

r̄ − r ≤ β r̄  ⟹  r̄ − r ≤ βr / (1 − β).

We can also solve for the lower bound on α-MC from Definition 2 for the case when r̄ < r. This yields

r − r̄ ≤ β r̄  ⟹  r − r̄ ≤ βr / (1 + β).

For any β > 0, β/(1 − β) > β/(1 + β). Therefore the first case limits the multicalibration of R, and, since r ≤ 1, R is β/(1 − β)-multicalibrated. ■
Proposition 11 Define ρ ∈ (0, 1). Let C be a collection of subsets of X such that, for all S ∈ C and I ∈ Λ, P(R ∈ I, x ∈ S) ≥ γ. Let R₀ be a risk prediction model to be post-processed. For all S ∈ C and r, let P(y | R = r, S) ≥ ρ. There exists an algorithm that satisfies (α, λ)-PMC with respect to C in O(|C| / (λ γ α² ρ²)) steps.
Proof
We show that Algorithm 1 converges using a potential function argument (Bansal and Gupta, 2019), similar to the proof techniques for the MC boosting algorithms in Hebert-Johnson et al. (2018); Kim et al. (2019). Let y* be the underlying risk, R₀ be our initial model, and R_{t+1} be our updated prediction model for individual x, with predictions r₀(x) and r_t(x). We use ȳ and r̄_t without subscripts to denote the means of these values over a category c (a group–prediction-bin pair). We cannot easily construct a potential argument using progress towards α-PMC, since its derivative is undefined at zero. Instead, we analyze the decrease in the squared ℓ₂ norm at each step:

E_x[(y*(x) − r_t(x))²] − E_x[(y*(x) − r_{t+1}(x))²].   (5)

From Algorithm 1 we have

r_{t+1}(x) = r_t(x) + Δ_c for x ∈ c, and r_{t+1}(x) = r_t(x) otherwise, where Δ_c = ȳ − r̄_t.

Substituting into Eq. (5) gives

P(c) (2 Δ_c (ȳ − r̄_t) − Δ_c²) = P(c) Δ_c².

We know that an update is only triggered when |Δ_c| / ȳ > α with ȳ ≥ ρ, so the smallest update is |Δ_c| = αρ, and each category has probability mass at least γ. Thus, each update reduces the potential by at least γ(αρ)².

Since our initial loss, E_x[(y*(x) − r₀(x))²], is at most 1, Algorithm 1 makes at most 1/(γ α² ρ²) updates.

To understand the total number of steps, including those without updates, we consider the worst case, in which only a single category is updated in a cycle of the for loop (if no updates are made, the algorithm exits). Since each repeat consists of at most |C| ⌈1/λ⌉ loop iterations, this results in O(|C| / (λ γ α² ρ²)) total steps. ■
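The potential-decrease step in arguments of this kind relies on a simple algebraic fact: shifting every residual in a category by the category's mean residual Δ reduces the summed squared error by exactly |c| Δ². A quick numeric check (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
e = rng.normal(size=50)            # residuals y* - r within one category
delta = e.mean()                   # the mean-residual update applied by the algorithm
before = np.sum(e ** 2)
after = np.sum((e - delta) ** 2)   # residuals after shifting predictions by delta
# the potential drops by exactly |c| * delta^2:
# sum(e^2) - sum((e - d)^2) = 2*d*sum(e) - n*d^2 = n*d^2 when d = mean(e)
assert abs((before - after) - len(e) * delta ** 2) < 1e-9
```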
A.3. Extended Theoretical Analysis
Illustrating Relationships between Definitions
Fig. 6 shows how the definitions of MC, DC, and PMC are related. In each subplot, the x and y coordinates map the guarantee from one metric (x axis) to the implied guarantee in the other metric (y axis).
The right panel of Fig. 6 illustrates this relation in comparison to the DC-MC relationship described in Appendix A.4, Theorem 22. At small values of ε and α, and when the model is perfectly calibrated overall, α-PMC and ε-DC behave similarly. However, given imperfect overall calibration, ε-differentially calibrated models suffer from higher MC error than proportionally calibrated models when α-PMC < 0.3. The right graph also illustrates that the feasible range of α for α-PMC is α < 0.5, past which it does not provide a meaningful α-MC guarantee. The steeper relation between α-PMC and MC may have advantages or disadvantages, depending on context. It suggests that, by optimizing for α-PMC, small improvements to this measure can result in relatively large improvements to MC; conversely, ε-DC models that are well calibrated may satisfy a lower value of α-MC over a larger range of ε.
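To make the mapping between criteria concrete, the small sketch below converts a PMC parameter into the DC and MC guarantees it implies, assuming the bounds ε = ln((1 + α)/(1 − α)) and α/(1 − α) that follow from PMC's multiplicative constraint (these closed forms are our reading of Theorems 7 and 9):

```python
import math

def dc_from_pmc(alpha):
    """epsilon-DC implied by alpha-PMC: eps = ln((1 + a) / (1 - a))."""
    return math.log((1 + alpha) / (1 - alpha))

def mc_from_pmc(alpha):
    """alpha-MC implied by alpha-PMC: a / (1 - a)."""
    return alpha / (1 - alpha)
```

Note that `mc_from_pmc` reaches 1 at α = 0.5, matching the point past which the implied MC guarantee stops being meaningful.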
A.3.1. Discretization
To clarify and simplify our analysis, we work mainly with the continuous versions of multicalibration and proportional multicalibration, under the assumption that minimizing the discretized versions (i.e., binning predictions) will translate to low values of the continuous versions. In this section we provide detailed bounds on the continuous versions of PMC and DC that are implied by the discretized versions.
Figure 6:

A comparison of ε-DC, α-MC, and α-PMC in terms of their parameters ε and α. In both panes, the x value is a given value of one metric for a model, and the y axis is the implied value of the other metric, according to Theorems 4, 9, and 22. The left filled area denotes the dependence of the privacy/DC of α-multicalibrated models on the minimum risk interval, r_min. The right filled area denotes the dependence of the MC of ε-differentially calibrated models on their overall calibration; α-PMC does not have these sensitivities.
First, we will formally define two different discretization schemes. The first, λ-discretization, defines equally spaced bins of width λ on the interval [0, 1] (Definition 17).
For ensuring multiplicative closeness under PMC, it can be useful to instead discretize the prediction bins so that the bins are equally spaced on a log scale. We define such a discretization below.
Definition 15 (λ-geometric discretization.) Let λ ∈ (0, 1). The λ-geometric discretization of [0, 1] is denoted by a set of intervals, Λ = {[(1 + λ)^{−(j+1)}, (1 + λ)^{−j}) : j = 0, 1, 2, …}, whose endpoints are a multiplicative factor of (1 + λ) apart.
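One plausible construction of such a binning, assuming bin edges at successive powers of 1/(1 + λ) down to a smallest edge `r_min` (both assumptions of this sketch):

```python
def geometric_bins(lam, r_min=1e-3):
    """Bin edges for a lambda-geometric discretization of (0, 1]:
    consecutive edges differ by a factor of (1 + lam), so predictions
    within a bin agree up to that multiplicative factor."""
    edges = [1.0]
    while edges[-1] > r_min:
        edges.append(edges[-1] / (1.0 + lam))
    return edges[::-1]  # ascending order
```

Unlike equal-width bins, these bins become narrower near zero, which is what keeps the within-bin prediction shift multiplicative rather than additive.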
(Hebert-Johnson et al., 2018) define a discretized version of MC in which R is binned according to a discretization parameter, λ:
Definition 16 ((α, λ)-multicalibration) Let C be a collection of subsets of X. For a λ-discretization Λ, a predictor R is (α, λ)-multicalibrated on C if, for all S ∈ C and I ∈ Λ,

|E[y | R ∈ I, S] − E[R | R ∈ I, S]| ≤ α.
(Hebert-Johnson et al., 2018) establish that (α, λ)-multicalibrated models are at most (α + λ)-multicalibrated. In an analogous fashion, we show below that (α, λ)-PMC implies (α + λ/ρ)-PMC for bins defined by a λ-discretization. When using a λ-geometric discretization, (α, λ)-PMC implies (α + λ(1 + α))-PMC, which can be a tighter bound than the former.
Claim 1 Define ρ ∈ (0, 1) and let C be a collection of subsets of X. Let E[y | R ∈ I, S] ≥ ρ for all S ∈ C and I ∈ Λ. Let R be a model satisfying (α, λ)-proportional multicalibration. Then R is at most (α + λ/ρ)-proportionally multicalibrated.
Proof
By Definition 18, R satisfies the constraint of (α, λ)-PMC, i.e. |ȳ − r̄| / ȳ ≤ α, where ȳ = E[y | R ∈ I, S] and r̄ = E[R | R ∈ I, S], for categories satisfying ȳ ≥ ρ. Consider a prediction r = R(x) with r ∈ I. A λ-discretization shifts r from r̄ by at most λ. By the triangle inequality,

|ȳ − r| / ȳ ≤ |ȳ − r̄| / ȳ + |r̄ − r| / ȳ.

Substituting yields

|ȳ − r| / ȳ ≤ α + λ / ȳ.

Plugging in ρ as the minimum of ȳ, we complete the proof. ■
The term λ/ρ can be potentially large when ρ is small. One way to avoid this issue is to make the change in r between bins scale with r, using Definition 15. What makes Definition 15 different from λ-discretization is that the intervals are a multiplicative, rather than additive, distance apart. Hence, for a given r, a model satisfying (α, λ)-PMC can have its prediction shift by at most a factor of (1 + λ). This leads us to the following claim.
Claim 2 Define ρ ∈ (0, 1) and let C be a collection of subsets of X. Let E[y | R ∈ I, S] ≥ ρ for all S ∈ C and I ∈ Λ. Let R be a model satisfying (α, λ)-proportional multicalibration. Given a λ-geometric discretization, R is at most (α + λ(1 + α))-proportionally multicalibrated.
Proof
By Definition 18, R satisfies |ȳ − r̄| / ȳ ≤ α, where ȳ = E[y | R ∈ I, S] and r̄ = E[R | R ∈ I, S], for categories satisfying ȳ ≥ ρ; in particular, r̄ ≤ (1 + α) ȳ. Consider a prediction r = R(x) with r ∈ I. A λ-geometric discretization shifts r from r̄ by at most a factor of (1 + λ), which implies |r̄ − r| ≤ λ r̄. Substituting into the triangle inequality yields

|ȳ − r| / ȳ ≤ |ȳ − r̄| / ȳ + |r̄ − r| / ȳ ≤ α + λ r̄ / ȳ ≤ α + λ(1 + α).

■
Figure 7:

Relationship between (α, λ)-PMC and α-PMC given a geometric discretization, illustrated for various values of α and λ.
We illustrate the relationship between (α, λ)-PMC and α-PMC given a geometric discretization in Fig. 7, which quantifies the relationship for different settings of α and λ.
A.3.2. Additional Definitions
Definition 17 (λ-discretization.) Given λ ∈ (0, 1), the λ-discretization of [0, 1] is denoted by a set of intervals, Λ = {[jλ, (j + 1)λ) : j = 0, 1, …, ⌈1/λ⌉ − 1}.
Definition 18 ((α, λ)-PMC) A model R is (α, λ)-proportionally multicalibrated with respect to a collection of subsets C if, for all S ∈ C and I ∈ Λ satisfying E[y | R ∈ I, S] ≥ ρ,

|E[y | R ∈ I, S] − E[R | R ∈ I, S]| / E[y | R ∈ I, S] ≤ α.   (6)
The following loss functions are empirical analogs of the definitions of MC, PMC, and DC, and are used in the experiment section to measure performance.
Definition 19 (MC loss) Let Λ be a λ-discretization of [0, 1], and let γ ∈ (0, 1). Define a collection of subsets C such that P(S) > 0 for all S ∈ C. Let c = (S, I) for S ∈ C and I ∈ Λ. Define the collection C_λ containing all c satisfying P(R ∈ I, S) ≥ γ. The MC loss of a model R on C_λ is

max_{c ∈ C_λ} |E[y | c] − E[R | c]|.

Definition 20 (PMC loss) Let Λ be a λ-discretization of [0, 1], and let γ, ρ ∈ (0, 1). Define a collection of subsets C such that P(S) > 0 for all S ∈ C. Let c = (S, I) for S ∈ C and I ∈ Λ. Define the collection C_λ containing all c satisfying P(R ∈ I, S) ≥ γ. Let ȳ_c = E[y | c]. The PMC loss of a model R on C_λ is

max_{c ∈ C_λ : ȳ_c ≥ ρ} |ȳ_c − E[R | c]| / ȳ_c.

Definition 21 (DC loss) Let Λ be a λ-discretization of [0, 1], and let γ ∈ (0, 1). Define a collection of subsets C such that P(S) > 0 for all S ∈ C. Given a risk model R and prediction intervals I ∈ Λ, let c = (S, I). Define the collection C_λ containing all c satisfying P(R ∈ I, S) ≥ γ. The DC loss of a model R on C_λ is

max_{I ∈ Λ} max_{S_i, S_j : (S_i, I), (S_j, I) ∈ C_λ} ln( E[y | R ∈ I, S_i] / E[y | R ∈ I, S_j] ).
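As an illustration of how these losses might be computed empirically, the sketch below uses max-over-categories aggregation and a simple minimum-count threshold in place of the formal support condition; both choices are assumptions of this sketch rather than the paper's exact estimators:

```python
import numpy as np

def fairness_losses(r, y, groups, n_bins=10, min_size=10):
    """Empirical analogs of MC, PMC, and DC loss over (group, bin) categories.
    Categories with fewer than `min_size` members are skipped, standing in
    for the support condition in Definitions 19-21."""
    bins = np.clip((r * n_bins).astype(int), 0, n_bins - 1)
    mc, pmc, outcome_rates = [], [], {}
    for g in np.unique(groups):
        for b in range(n_bins):
            idx = (groups == g) & (bins == b)
            if idx.sum() < min_size:
                continue
            ybar, rbar = y[idx].mean(), r[idx].mean()
            mc.append(abs(ybar - rbar))
            if ybar > 0:
                pmc.append(abs(ybar - rbar) / ybar)
            outcome_rates.setdefault(b, []).append(ybar)
    # DC loss: largest log-ratio of outcome rates between groups in the same bin
    dc = max((np.log(max(v) / min(v)) for v in outcome_rates.values()
              if min(v) > 0 and len(v) > 1), default=0.0)
    return max(mc, default=0.0), max(pmc, default=0.0), dc
```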
A.4. Additional Theorems
A.4.1. Differentially Calibrated models with global calibration are multicalibrated
Here we show that, under the assumption that a model is globally calibrated (satisfies α-calibration), models satisfying ε-DC are also multicalibrated.
Theorem 22 Let R be a model satisfying ε-DC and α-calibration. Then R is (e^{ε}(1 + α) − 1)-multicalibrated.
Proof
From Eq. (2) we observe that ε is bounded by the two groups with the largest and smallest group- and prediction-specific probabilities of the outcome. Let r be the risk stratum maximizing ε-DC, and let S_max = argmax_{S ∈ C} P(y | r, S) and S_min = argmin_{S ∈ C} P(y | r, S). These groups determine the upper and lower bounds of ε-DC via r_max = P(y | r, S_max) and r_min = P(y | r, S_min).
We note that r_min ≤ P(y | r) ≤ r_max, since P(y | r) is a weighted average of the group-specific rates, and r_max and r_min are the extreme values of P(y | r, S) among S ∈ C. So, α-MC is bound by the group outcome that most deviates from the predicted value, which is either r_max or r_min. There are then two scenarios to consider:
r_max − r, when r_max − r ≥ r − r_min; and
r − r_min, when r_max − r < r − r_min.
We will look at the first case. Due to α-calibration, P(y | r) ≤ r + α. Then

r_max − r ≤ e^{ε} r_min − r ≤ e^{ε} P(y | r) − r ≤ e^{ε}(r + α) − r ≤ e^{ε}(1 + α) − 1.

Above we have used the facts that r_max ≤ e^{ε} r_min, r_min ≤ P(y | r), and r ≤ 1. The second scenario is complementary and produces a bound that is no larger. ■
Theorem 22 formally describes how α-calibration controls the baseline calibration error contribution to α-MC, while ε-DC limits the deviation around this value by constraining the (log) maximum and minimum risk within each category.
A.4.2. Multicalibrated Models Satisfy Intersectional Guarantees
In contrast to DF, MC (Hebert-Johnson et al., 2018) was not designed to explicitly incorporate the principles of intersectionality. However, we show that it provides an identical efficiency property to DF in the theorem below. Given an individual’s attributes, it will be useful to refer to subsets we wish to protect, e.g. demographic identifiers. To do so, we define the protected attributes A = (A_1, …, A_m), such that A_i is the set of values taken by attribute i.
Theorem 23 Let the collection of subsets C define groups of individuals according to the Cartesian product of attributes A. Let C′ be any collection of subsets that groups individuals by the Cartesian product of attributes in A′, where A′ ⊂ A and |A′| < |A|. If R satisfies α-MC on C, then R is α-multicalibrated on C′.
In proving Theorem 23, we will make use of the following lemma.
Lemma 24 The α-MC criteria can be rewritten as: for a collection of subsets C and all r ∈ [0, 1],

max_{S ∈ C} P(y | R = r, S) ≤ r + α

and

min_{S ∈ C} P(y | R = r, S) ≥ r − α.
Proof The lemma follows from Definition 2, and simply restates it as a constraint on the maximum and minimum expected risk among groups at each prediction level. ■
Proof [Proof of Theorem 23] We use the same argument as Foulds et al. (2019a) in proving this property for DF. Define C″ as the Cartesian product of the protected attributes included in A, but not A′. Then for any S′ ∈ C′,

P(y | R = r, S′) = Σ_{S″ ∈ C″} P(y | R = r, S′ ∩ S″) P(S″ | R = r, S′)   (7)

≤ max_{S″ ∈ C″} P(y | R = r, S′ ∩ S″) Σ_{S″ ∈ C″} P(S″ | R = r, S′)   (8)

= max_{S″ ∈ C″} P(y | R = r, S′ ∩ S″)   (9)

≤ max_{S ∈ C} P(y | R = r, S).   (10)

Moving from Eq. (7) to Eq. (8) follows from substituting the maximum value of P(y | R = r, S′ ∩ S″) for observations in the intersection of subsets in C′ and C″, which is the upper limit of the expression in Eq. (7). Moving from Eq. (8) to Eq. (9) follows from recognizing that the sum of P(S″ | R = r, S′) for all subsets in C″ is 1. Finally, moving from Eq. (9) to Eq. (10) follows from recognizing that the intersections of subsets in C′ and C″ that satisfy Eq. (9) must each define a subset in C. Applying the same argument, we can show that

P(y | R = r, S′) ≥ min_{S ∈ C} P(y | R = r, S).
Substituting into Lemma 24,

max_{S′ ∈ C′} P(y | R = r, S′) ≤ r + α

and

min_{S′ ∈ C′} P(y | R = r, S′) ≥ r − α,

or

|P(y | R = r, S′) − r| ≤ α

for all S′ ∈ C′ and r ∈ [0, 1]. Therefore R is α-multicalibrated with respect to C′. ■
As a concrete example, imagine we have two binary protected attributes, race ∈ {a, b} and gender ∈ {m, f}. According to Theorem 23, C would contain four sets: (a, m), (a, f), (b, m), and (b, f). In contrast, there are eight possible sets over all attribute subsets, e.g. (a, ∗), where the wildcard ∗ indicates a match to either attribute value. As noted in Appendix A.1.1, the efficiency property is useful because the number of possible subgroupings grows at a large combinatorial rate as additional attributes are added; meanwhile C grows at a slower, yet still exponential, rate. For an intuition for why this property holds, consider that the maximum calibration error among two subgroups is at least as large as the maximum calibration error of those groups combined; e.g., the maximum calibration error in a higher-order group such as (a, ∗) will be covered by the maximum calibration error in either (a, m) or (a, f).
A.5. Additional Experiment Details
Models
The deep neural network (DNN) was a five-layer feed-forward network with 100 units per layer and ReLU activations. We trained using the Adam optimizer with a learning rate of 0.001 and a batch size of 200, and used early stopping to terminate training when performance on a 10% validation set did not improve for 10 epochs.
Both LR and DNN models used median imputation and feature normalization as preprocessing steps. For the RF, we used the XGBoost implementation Chen and Guestrin (2016) which handles missing data natively.
Training
Models were trained on a heterogeneous computing cluster. Each training instance was limited to a single core and 4 GB of RAM. We conducted a full sweep of the parameters specified in Table 3. A single trial consisted of a method, a parameter setting from Table 3, and a random seed. Over 100 random seeds, the data were shuffled and split 75%/25% into train/test sets. Results in the manuscript are summarized over these test sets.
Code
Code for the experiments is available here: https://github.com/cavalab/proportional-multicalibration. Code is licensed under GNU Public License v3.0.
Data
We make use of data from the MIMIC-IV-ED repository, version 1.0, to train admission risk prediction models (Johnson et al., 2021). This resource contains more than 440,000 ED admissions to Beth Israel Deaconess Medical Center between 2011 and 2019. We preprocessed these data to construct an admission prediction task in which our model delivers a risk-of-admission estimate for each ED visitor after their first visit to triage, during which vitals are taken. Additional historical data for each patient were also included (e.g., number of previous visits and admissions). A list of features is given in Table 2.
A.6. Additional Experimental Results
Table 3 lists a few parameters that may affect the performance of post-processing for both MC and PMC. Of particular interest when comparing MC versus PMC post-processing are the parameter α, which controls how stringent the calibration error must be across categories for training to terminate, and the group definition C, which selects which features of the data will be used to assess and optimize fairness. We look at the performance of MC and PMC postprocessing over values of α and group definitions in Figs. 8 to 10. Finally, we empirically compare MC- and PMC-postprocessing by the number of steps required for each to reach their best performance in Fig. 11 and Table 5.
From Fig. 8, it is clear that post-processing has a minimal effect on AUROC in all cases; note that the differences disappear if we round to two decimal places. When post-processing with RF, we do note a relationship between lower values of α and a very slight decrease in performance, particularly for MC-postprocessing.
Figs. 9 and 10 show performance between methods on MC loss and PMC loss, respectively. In terms of MC loss, PMC-postprocessing tends to produce models with the lowest loss at values of α greater than 0.01. Lower values of α do not help MC-postprocessing in most cases, suggesting that these smaller updates may be overfitting to the post-processing data. In terms of PMC loss (Fig. 10), we observe that performance by MC-postprocessing is highly sensitive to the value of α. For smaller values of α, MC-postprocessing is able to achieve decent performance by these metrics, although in all cases, PMC-postprocessing generates a model with a better median loss value at some configuration of α.
We assess how many steps/updates MC and PMC take for different values of α in Fig. 11, and summarize empirical measures of running time in Table 5. In the figure, we annotate the point at which each post-processing algorithm achieves the lowest median value of PMC loss across trials. Fig. 11 validates that PMC-postprocessing is more efficient than MC-postprocessing at producing models with low PMC loss, on average requiring 4.0x fewer updates to achieve its lowest loss on test. From Table 5 we observe that PMC typically requires a larger number of updates to achieve its best performance on MC loss (about 2x the wall clock time and number of updates), whereas MC-postprocessing requires a larger number of updates to achieve its best performance on PMC loss and DC loss, due to its dependence on very small values of α. We accompany these results with the caveat that they are based on performance on one real-world task, and wall clock time measurements are influenced by the heterogeneous cluster environment; future work could focus on a larger empirical comparison.
Table 6: The number of times each post-processing method achieved the best score among all methods, out of 100 trials.
| Metric | Base Model | MC | PMC |
|---|---|---|---|
| AUROC | 5 | 88 | 6 |
| MC loss | 8 | 21 | 70 |
| PMC loss | 0 | 27 | 72 |
| DC loss | 0 | 36 | 63 |
We quantify how often each post-processing algorithm gives the best loss for each metric and trial in Table 6. PMCBoost (Algorithm 1) achieves the best fairness scores most often, winning on DC loss in 63%, MC loss in 70%, and PMC loss in 72% of trials, while MC-postprocessed models achieve the best AUROC in 88% of cases. This provides strong evidence that, over a large range of α values, PMCBoost is beneficial compared to MCBoost.
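For intuition about how the update counts in Fig. 11 arise, here is a minimal sketch of a PMCBoost-style post-processing loop: it repeatedly finds a (group, bin) cell whose percent calibration error exceeds α and shifts that cell's predictions toward the cell's observed outcome rate. The function name, thresholds, and cell-update rule are assumptions for illustration, not the paper's exact Algorithm 1:

```python
import numpy as np

def pmc_postprocess(y, r, groups, alpha=0.1, n_bins=10, gamma=0.05,
                    max_updates=1000):
    """Hypothetical sketch of a PMCBoost-style update loop."""
    r = r.copy()  # leave the input predictions untouched
    n_updates = 0
    while n_updates < max_updates:
        # re-bin each pass, since updates move examples between bins
        bins = np.minimum((r * n_bins).astype(int), n_bins - 1)
        updated = False
        for g in np.unique(groups):
            for b in range(n_bins):
                cell = (groups == g) & (bins == b)
                if cell.sum() < gamma * len(y):
                    continue  # skip near-empty cells
                base = np.mean(y[cell])  # observed outcome rate
                if base <= 0:
                    continue
                err = np.mean(y[cell] - r[cell])
                if abs(err) / base > alpha:  # percent calibration violation
                    r[cell] = np.clip(r[cell] + err, 0, 1)
                    n_updates += 1
                    updated = True
        if not updated:
            break  # no cell violates the alpha tolerance
    return r, n_updates
```

Smaller α tolerances admit more violating cells per pass, which is consistent with the observed growth in update counts as α shrinks.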
Figure 8: AUROC test performance versus α across experiment settings. Rows are different ML base models, and columns are different attributes used to define the groups. The color denotes the post-processing method.
Figure 9: MC loss test performance versus α across experiment settings. Rows are different ML base models, and columns are different attributes used to define the groups. The color denotes the post-processing method.
Figure 10: PMC loss test performance versus α across experiment settings. Rows are different ML base models, and columns are different attributes used to define the groups. The color denotes the post-processing method.
Figure 11: Number of post-processing updates made by MC- and PMC-postprocessing versus α across experiment settings. Rows are different ML base models, and columns are different attributes used to define the groups. The color denotes the post-processing method. Each result is annotated with the median PMC loss for that method and parameter combination.
Footnotes
Institutional Review Board (IRB): This research does not require IRB approval.
Data and Code Availability
This paper uses the MIMIC-IV-ED dataset (Johnson et al., 2021), which is available on the PhysioNet repository (Goldberger et al., 2000). Code for the experiments is available here: https://github.com/cavalab/proportional-multicalibration.
References
- Ashana Deepshikha Charan, Anesi George L, Liu Vincent X, Escobar Gabriel J, Chesley Christopher, Eneanya Nwamaka D, Weissman Gary E, Miller William Dwight, Harhay Michael O, and Halpern Scott D. Equitably allocating resources during crises: Racial differences in mortality prediction models. American Journal of Respiratory and Critical Care Medicine, 204(2):178–186, 2021.
- Bansal Nikhil and Gupta Anupam. Potential-function proofs for gradient methods. Theory of Computing, 15(1):1–32, 2019.
- Barak-Corren Yuval, Fine Andrew M., and Reis Ben Y. Early Prediction Model of Patient Hospitalization From the Pediatric Emergency Department. Pediatrics, 139(5), May 2017a. ISSN 1098-4275. doi: 10.1542/peds.2016-2785.
- Barak-Corren Yuval, Israelit Shlomo Hanan, and Reis Ben Y. Progressive prediction of hospitalisation in the emergency department: Uncovering hidden patterns to improve patient flow. Emergency Medicine Journal, 34(5):308–314, May 2017b. ISSN 1472-0205, 1472-0213. doi: 10.1136/emermed-2014-203819.
- Barak-Corren Yuval, Agarwal Isha, Michelson Kenneth A, Lyons Todd W, Neuman Mark I, Lipsett Susan C, Kimia Amir A, Eisenberg Matthew A, Capraro Andrew J, Levy Jason A, Hudgins Joel D, Reis Ben Y, and Fine Andrew M. Prediction of patient disposition: Comparison of computer and human approaches and a proposed synthesis. Journal of the American Medical Informatics Association, 28(8):1736–1745, July 2021a. ISSN 1527-974X. doi: 10.1093/jamia/ocab076.
- Barak-Corren Yuval, Chaudhari Pradip, Perniciaro Jessica, Waltzman Mark, Fine Andrew M., and Reis Ben Y. Prediction across healthcare settings: A case study in predicting emergency department disposition. npj Digital Medicine, 4(1):1–7, December 2021b. ISSN 2398-6352. doi: 10.1038/s41746-021-00537-x.
- Barda Noam, Yona Gal, Rothblum Guy N., Greenland Philip, Leibowitz Morton, Balicer Ran, Bachmat Eitan, and Dagan Noa. Addressing bias in prediction models by improving sub-population calibration. Journal of the American Medical Informatics Association: JAMIA, 28(3):549–558, March 2021. ISSN 1527-974X. doi: 10.1093/jamia/ocaa283.
- Barocas Solon, Hardt Moritz, and Narayanan Arvind. Fairness and Machine Learning. fairmlbook.org, 2019.
- Berk Richard, Heidari Hoda, Jabbari Shahin, Joseph Matthew, Kearns Michael, Morgenstern Jamie, Neel Seth, and Roth Aaron. A convex framework for fair regression. arXiv preprint arXiv:1706.02409, 2017.
- Brier Glenn W et al. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.
- Castelnovo Alessandro, Crupi Riccardo, Greco Greta, and Regoli Daniele. The zoo of Fairness metrics in Machine Learning. arXiv:2106.00467 [cs, stat], June 2021.
- Chen Tianqi and Guestrin Carlos. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2939785.
- Chouldechova Alexandra. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. arXiv:1703.00056 [cs, stat], February 2017.
- Chouldechova Alexandra and Roth Aaron. The Frontiers of Fairness in Machine Learning. arXiv:1810.08810 [cs, stat], October 2018.
- Collins Patricia Hill. Black Feminist Thought: Knowledge, Consciousness, and the Politics of Empowerment. Routledge, 1 edition, September 1990.
- Collins Patricia Hill and Bilge Sirma. Intersectionality. John Wiley & Sons, 2020. ISBN 1-5095-3969-7.
- Crenshaw Kimberle. Demarginalizing the Intersection of Race and Sex: A Black Feminist Critique of Antidiscrimination Doctrine, Feminist Theory and Antiracist Politics. University of Chicago Legal Forum, 1989(1):31, 1989.
- Crenshaw Kimberle. Mapping the Margins: Intersectionality, Identity Politics, and Violence against Women of Color. Stanford Law Review, 43(6):1241, July 1991. ISSN 00389765. doi: 10.2307/1229039. URL https://www.jstor.org/stable/1229039?origin=crossref.
- Diao James A, Wu Gloria J, Taylor Herman A, Tucker John K, Powe Neil R, Kohane Isaac S, and Manrai Arjun K. Clinical implications of removing race from estimates of kidney function. JAMA: The Journal of the American Medical Association, 325(2):184–186, 2021.
- Dinh Michael M and Russell Saartje Berendsen. Overcrowding kills: How COVID-19 could reshape emergency department patient flow in the new normal. Emergency Medicine Australasia, 33(1):175–177, 2021. ISSN 1742-6723. doi: 10.1111/1742-6723.13700.
- Dwork Cynthia and Lei Jing. Differential privacy and robust statistics. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pages 371–380, 2009.
- Dwork Cynthia and Roth Aaron. The Algorithmic Foundations of Differential Privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2013. ISSN 1551-305X, 1551-3068. doi: 10.1561/0400000042.
- Dwork Cynthia, Hardt Moritz, Pitassi Toniann, Reingold Omer, and Zemel Richard. Fairness Through Awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS '12, pages 214–226, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1115-1. doi: 10.1145/2090236.2090255.
- Dwork Cynthia, Kim Michael P., Reingold Omer, Rothblum Guy N., and Yona Gal. Learning from Outcomes: Evidence-Based Rankings. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 106–125, November 2019. doi: 10.1109/FOCS.2019.00016.
- Foulds James, Islam Rashidul, Keya Kamrun Naher, and Pan Shimei. An Intersectional Definition of Fairness. arXiv:1807.08362 [cs, stat], September 2019a.
- Foulds James, Islam Rashidul, Keya Kamrun Naher, and Pan Shimei. An Intersectional Definition of Fairness. arXiv:1807.08362 [cs, stat], September 2019b. URL http://arxiv.org/abs/1807.08362.
- Foulds James R. and Pan Shimei. Are Parity-Based Notions of AI Fairness Desirable? Data Engineering, page 51, 2020.
- Goldberger AL, Amaral LA, Glass L, Hausdorff JM, Ivanov PC, Mark RG, Mietus JE, Moody GB, Peng CK, and Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation, 101(23):E215–220, June 2000. ISSN 1524-4539. doi: 10.1161/01.cir.101.23.e215.
- Gopalan Parikshit, Kim Michael P., Singhal Mihir, and Zhao Shengjia. Low-Degree Multicalibration. arXiv:2203.01255 [cs], March 2022.
- Gupta Varun, Jung Christopher, Noarov Georgy, Pai Mallesh M., and Roth Aaron. Online Multivalid Learning: Means, Moments, and Prediction Intervals, January 2021.
- Hanna Alex, Denton Emily, Smart Andrew, and Smith-Loud Jamila. Towards a critical race methodology in algorithmic fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 501–512, Barcelona, Spain, January 2020. ACM. ISBN 978-1-4503-6936-7. doi: 10.1145/3351095.3372826.
- Hardt Moritz, Price Eric, and Srebro Nati. Equality of Opportunity in Supervised Learning. In Lee DD, Sugiyama M, Luxburg UV, Guyon I, and Garnett R, editors, Advances in Neural Information Processing Systems 29, pages 3315–3323. Curran Associates, Inc., 2016.
- Harrison Reema, Manias Elizabeth, Mears Stephen, Heslop David, Hinchcliff Reece, and Hay Liz. Addressing unwarranted clinical variation: A rapid review of current evidence. Journal of Evaluation in Clinical Practice, 25(1):53–65, 2019. ISSN 1365-2753. doi: 10.1111/jep.12930.
- Hébert-Johnson Úrsula, Kim Michael, Reingold Omer, and Rothblum Guy. Multicalibration: Calibration for the (Computationally-Identifiable) Masses. In Proceedings of the 35th International Conference on Machine Learning, pages 1939–1948. PMLR, July 2018.
- Hébert-Johnson Úrsula, Kim Michael P., Reingold Omer, and Rothblum Guy N. Calibration for the (Computationally-Identifiable) Masses. arXiv:1711.08513 [cs, stat], March 2018.
- Henry Katharine E., Hager David N., Pronovost Peter J., and Saria Suchi. A targeted real-time early warning score (TREWScore) for septic shock. Science Translational Medicine, 7(299):299ra122, August 2015. ISSN 1946-6242. doi: 10.1126/scitranslmed.aab3719.
- James Catherine A., Bourgeois Florence T., and Shannon Michael W. Association of Race/Ethnicity with Emergency Department Wait Times. Pediatrics, 115(3):e310–e315, March 2005. ISSN 0031-4005. doi: 10.1542/peds.2004-1541.
- Jiang Heinrich and Nachum Ofir. Identifying and Correcting Label Bias in Machine Learning, January 2019.
- Johnson Alistair, Bulgarelli Lucas, Pollard Tom, Celi Leo Anthony, Mark Roger, and Horng Steven. MIMIC-IV-ED, 2021.
- Jung Christopher, Lee Changhwa, Pai Mallesh, Roth Aaron, and Vohra Rakesh. Moment Multicalibration for Uncertainty Estimation. In Proceedings of Thirty Fourth Conference on Learning Theory, pages 2634–2678. PMLR, July 2021.
- Kallus Nathan, Mao Xiaojie, and Zhou Angela. Assessing algorithmic fairness with unobserved protected class using data combination. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* '20, page 110, Barcelona, Spain, January 2020. Association for Computing Machinery. ISBN 978-1-4503-6936-7. doi: 10.1145/3351095.3373154.
- Kearns Michael, Neel Seth, Roth Aaron, and Wu Zhiwei Steven. Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness. arXiv:1711.05144 [cs], December 2018.
- Kifer Daniel and Machanavajjhala Ashwin. Pufferfish: A framework for mathematical privacy definitions. ACM Transactions on Database Systems (TODS), 39(1):1–36, 2014.
- Kim Michael P., Ghorbani Amirata, and Zou James. Multiaccuracy: Black-box postprocessing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 247–254, 2019.
- Kleinberg Jon, Mullainathan Sendhil, and Raghavan Manish. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807, 2016.
- Ku Elaine, McCulloch Charles E, Adey Deborah B, Li Libo, and Johansen Kirsten L. Racial disparities in eligibility for preemptive wait-listing for kidney transplantation and modification of eGFR thresholds to equalize waitlist time. Journal of the American Society of Nephrology, 32(3):677–685, 2021.
- McDonald Erica J., Quick Matthew, and Oremus Mark. Examining the Association between Community-Level Marginalization and Emergency Room Wait Time in Ontario, Canada. Healthcare Policy, 15(4):64–76, May 2020. ISSN 1715-6572. doi: 10.12927/hcpol.2020.26223.
- Newhouse Joseph P. and Garber Alan M. Geographic Variation in Medicare Services. New England Journal of Medicine, 368(16):1465–1468, April 2013. ISSN 0028-4793. doi: 10.1056/NEJMp1302981.
- Obermeyer Ziad, Powers Brian, Vogeli Christine, and Mullainathan Sendhil. Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464):447–453, October 2019. ISSN 0036-8075, 1095-9203. doi: 10.1126/science.aax2342.
- Pfisterer Florian, Kern Christoph, Dandl Susanne, Sun Matthew, Kim Michael P., and Bischl Bernd. Mcboost: Multi-Calibration Boosting for R. Journal of Open Source Software, 6(64):3453, 2021.
- Pleiss Geoff, Raghavan Manish, Wu Felix, Kleinberg Jon, and Weinberger Kilian Q. On Fairness and Calibration. In Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, and Garnett R, editors, Advances in Neural Information Processing Systems 30, pages 5680–5689. Curran Associates, Inc., 2017.
- Riviello Elisabeth D., Dechen Tenzin, O'Donoghue Ashley L., Cocchi Michael N., Hayes Margaret M., Molina Rose L., Moraco Nicole H., Mosenthal Anne, Rosenblatt Michael, Talmor Noa, Walsh Daniel P., Sontag David N., and Stevens Jennifer P. Assessment of a Crisis Standards of Care Scoring System for Resource Prioritization and Estimated Excess Mortality by Race, Ethnicity, and Socially Vulnerable Area During a Regional Surge in COVID-19. JAMA Network Open, 5(3):e221744, March 2022. ISSN 2574-3805. doi: 10.1001/jamanetworkopen.2022.1744.
- Roberts Dorothy. Fatal invention: How science, politics, and big business re-create race in the twenty-first century. New Press/ORIM, 2011. ISBN 1-59558-691-1.
- Rose Sherri. Machine learning for prediction in electronic health data. JAMA Network Open, 1(4):e181404–e181404, 2018.
- Schnellinger Erin M, Cantu Edward, Harhay Michael O, Schaubel Douglas E, Kimmel Stephen E, and Stephens-Shields Alisa J. Mitigating selection bias in organ allocation models. BMC Medical Research Methodology, 21(1):1–9, 2021.
- Sun Benjamin C., Hsia Renee Y., Weiss Robert E., Zingmond David, Liang Li-Jung, Han Weijuan, McCreath Heather, and Asch Steven M. Effect of Emergency Department Crowding on Outcomes of Admitted Patients. Annals of Emergency Medicine, 61(6):605–611.e6, June 2013. ISSN 0196-0644. doi: 10.1016/j.annemergmed.2012.10.026.
- Sutherland Kim and Levesque Jean-Frederic. Unwarranted clinical variation in health care: Definitions and proposal of an analytic framework. Journal of Evaluation in Clinical Practice, 26(3):687–696, 2020. ISSN 1365-2753. doi: 10.1111/jep.13181.
- Tanabe Paula, Gimbel Rick, Yarnold Paul R., Kyriacou Demetrios N., and Adams James G. Reliability and validity of scores on The Emergency Severity Index version 3. Academic Emergency Medicine: Official Journal of the Society for Academic Emergency Medicine, 11(1):59–65, January 2004. ISSN 1069-6563. doi: 10.1197/j.aem.2003.06.013.
- Zelnick Leila R, Leca Nicolae, Young Bessie, and Bansal Nisha. Association of the estimated glomerular filtration rate with vs without a coefficient for race with time to eligibility for kidney transplant. JAMA Network Open, 4(1):e2034004–e2034004, 2021.