Abstract
This research paper investigates the impact of demographic biases, specifically race and gender, on machine learning-based glioma grading, using a data set compiled from The Cancer Genome Atlas (TCGA). The study applies three common classifiers (logistic regression, random forests, and extreme gradient boosting) and explores pre-processing (reweighting) and post-processing (equalized odds) strategies for bias mitigation. It evaluates prediction performance metrics (Matthews correlation coefficient, recall, specificity) and fairness metrics (disparate impact, equal opportunity difference, error rate difference), highlighting the trade-offs between fairness and accuracy across different demographic groups. For the most severe bias (race), the logistic regression model pre-processed with the reweighting algorithm shows some deterioration in prediction outcomes for the under-represented group and even an increase in unfairness, while the post-processing approach improves results for the under-represented group and provides significant improvements in fairness. These results are relevant because they could inform real-world clinical decision-making and help avoid worse outcomes for under-represented patient groups.
Keywords: Glioma, Fairness, Performance, Bias, Debiasing, Machine learning
Subject terms: Cancer, Computational biology and bioinformatics, Neuroscience
Introduction
Machine learning (ML) models have become common tools in many aspects of our lives, with practical applications in virtually any real-world setting. ML is particularly well-suited to support decision-making and prediction, as it can offer significant benefits, especially in the era of Big Data, where vast volumes of data must be processed. However, just as human decisions are often subjective and can therefore introduce bias, ML-based models are also susceptible to perpetuating potentially discriminatory biases present in the data they are trained on, resulting in “unfair” decisions, even when unintentional.
Fairness in ML refers to whether a model treats sub-populations equally; that is, a fair ML algorithm should not behave favorably or unfavorably toward any group of individuals based on their characteristics, such as gender, race, income, marital status, or political orientation 1. The existence of any bias significantly affects the outcomes of ML models, favoring or disadvantaging certain sociodemographic subgroups, which can lead to serious consequences. For example, in healthcare, if an ML algorithm is trained on a data set with a predominant group (e.g., men), it is more likely to misdiagnose a disease in the under-represented group (e.g., women).
Bias in ML can have two main sources 2: (i) data bias, which arises when the training data itself is biased or unrepresentative; and (ii) model bias, which occurs when the algorithms themselves exhibit unfair behavior due to design decisions, even if there are no biased attributes in the input data. On the other hand, fairness can refer to both individual fairness, which seeks to ensure some property of equity for similar individuals, and group fairness, which attempts to ensure that some statistic remains consistent across different groups 3.
Despite being a relatively new area of research, the interdisciplinary nature of the articles addressing fairness and bias, spanning multiple real-life domains, underscores the importance of fairness in ML-based decision-making and prediction. Researchers have proposed various methods to assess and mitigate demographic and/or socioeconomic biases in a wide range of applications that could negatively impact marginalized or under-represented populations, such as predicting school dropout rates 4, credit risk assessment 5, recidivism prediction 6, and job recruitment 7.
However, healthcare is probably one of the areas where assessing the fairness of ML models is most important because algorithms trained on biased data can lead to erroneous predictions, increasing disparities in care and biased clinical decisions towards under-represented populations, with potentially serious health consequences 8,9. Considering its importance in the field of health and building on previous research that designed a data-centric ML approach to predict glioma grades 10, the present work seeks to analyze the effect of demographic bias when using ML models for glioma grading. More specifically, we examine the prediction performance and fairness of three well-known classifiers, as well as the potential disparity between over- and under-represented groups for two biased attributes (race and gender). Furthermore, we explore whether a balance between performance and fairness can be achieved by applying pre- and post-processing bias mitigation strategies. To analyze this trade-off, we evaluate fairness metrics (disparate impact, equal opportunity difference, error rate difference) and prediction performance metrics (Matthews correlation coefficient, recall, specificity) before and after bias mitigation.
Related work
In the last years, numerous studies have focused on investigating and mitigating the impact of bias in clinical decision-making. Using five ML classifiers (naive Bayes, logistic regression, support vector machine with linear kernel, random forest, and gradient-boosted trees), Li et al. 11 concluded that female, Black, and socioeconomically disadvantaged patients were more likely to be under-diagnosed with heart failure. Sahin et al. 12 demonstrated that demographic characteristics led to disparities in the fairness of Alzheimer’s dementia prediction using ML models, while these models showed a fair prognosis when based solely on biomarkers.
Cabanillas et al. 13 investigated the equity of clinical risk prediction models across gender groups for three clinical scenarios (delirium, sepsis, and acute kidney injury) and showed that the under-diagnosis rate was consistently higher in the female population. Dang et al. 14 conducted an empirical study of various techniques to mitigate the biased behavior of logistic regression and extreme gradient boosting when used to predict depression. Yaseliani et al. 15 developed a framework consisting of a neural network and an equal-odds-based bias mitigation algorithm to reduce the bias of several attributes (sex, race, income, marital status, and working condition) in predicting opioid use disorder.
Wang et al. 16 assessed racial fairness in predicting mortality in patients with various chronic diseases (congestive heart failure, chronic kidney disease, chronic obstructive pulmonary disease, chronic liver disease, and dementia) using a suite of ML-based models (logistic regression, random forest, AdaBoost, and multilayer perceptron). After applying a bias mitigation algorithm, the experimental results revealed significant differences in F1-score and sensitivity. Park et al. 17 compared three techniques for reducing racial bias in predicting postpartum depression and postpartum mental health service utilization using three ML classifiers (logistic regression, random forest, and extreme gradient boosting) and concluded that debiasing through a reweighting algorithm performed better than removing the race attribute, as measured by disparate impact and equal opportunity difference.
Li et al. 18 analyzed the performance of bias mitigation algorithms when applied in conjunction with logistic regression, random forest, gradient-boosting trees, and long short-term memory for cardiovascular disease prediction. They found that the resampling approach was more effective for attributes with large bias (gender) than for those with low bias (race). Sivarajkumar et al. 19 developed a new method based on an autoencoder architecture and a weighted loss function to generate unbiased patient representations from electronic health records, demonstrating that the proposed method achieved higher fairness than other ML models.
Fairness in glioma grading
The most common primary tumors of the central nervous system in adults are gliomas, which arise from different types of glial cells and include astrocytomas, oligodendrogliomas, and ependymomas, and which have high rates of relapse and mortality. According to the World Health Organization (WHO) 20, gliomas are classified into four grades based on various histological criteria, developed in particular for the evaluation of astrocytomas. Grades I (pilocytic astrocytomas) and II (diffuse astrocytomas) are the most benign tumors with a favorable prognosis and are considered low-grade gliomas (LGG), while grades III (anaplastic astrocytomas) and IV (glioblastoma multiforme, GBM) are considered high-grade gliomas. Overall 5-year survival rates range from 94.7% for pilocytic astrocytoma to 6.8% for glioblastoma 21.
While histopathological grading, as is the case of gliomas, is based on morphology and is therefore less likely to be influenced by demographic bias 22, fairness analysis is appropriate in this context due to the high clinical risks involved in the diagnosis and grading of gliomas. Indeed, numerous clinical studies have reported significant disparities in incidence, prognosis, overall survival, and mortality rates among different demographic and socioeconomic groups, supporting the importance of accounting for bias in glioma grade prediction when using ML algorithms to ensure their fair and equitable application. Furthermore, other studies have demonstrated disparities in cancer clinical trial enrollment based on race, ethnicity, gender, age, body size, and socioeconomic status 23–26. Delgado-López et al. 27 pointed out some challenges that hinder the clinical integration of ML, including algorithmic bias and the need for regulatory frameworks to ensure equitable and ethical applications of ML in healthcare, particularly in the detection and treatment of astrocytomas.
Reihanian et al. 28 highlighted that women with GBM showed significantly higher 1-year survival and longer mean survival than men. Yin et al. 29 observed significant gender differences in the incidence, prognosis, and response to treatment of gliomas, with male patients having a higher incidence and worse prognosis than women; they also found that men diagnosed with GBM had a higher mortality rate than women. John et al. 30 noted significant racial disparities in the alteration frequencies of six GBM genes, concluding the need to consider race/ethnicity in GBM treatment. Zawad and Washington 31 designed a method based on an ensemble of feature selection algorithms to reduce disparities in glioma classification on a gender-biased data set.
Di Nunno et al. 32 found that glioblastoma patients from low-income census tracts had a lower survival rate than patients from high-income census tracts. Hu et al. 33 investigated the effects of marital status on overall survival in GBM patients, revealing that survival was higher for both married and single patients than for divorced, separated, and widowed patients. In an epidemiological study of adolescent and young adult patients diagnosed with glioma, Puthenpura et al. 34 concluded that patients belonging to racial/ethnic minorities, patients with lower socioeconomic status, and patients residing in rural areas had a higher risk of death.
Fairness analysis
In healthcare, an ML model is considered unfair if its decisions are biased toward a specific social or demographic group without clinical justification. Conversely, a model is defined as fair if its prediction errors are similar across all groups 18. Before discussing some bias mitigation (or debiasing) strategies, it is necessary to introduce an important element of fairness in ML upon which they are based: protected (or sensitive) attributes. A protected attribute divides a population into groups whose outcomes should have parity in terms of the benefits received; that is, it separates the population into a privileged group (i.e., the over-represented or favored group) and an unprivileged group (i.e., the disadvantaged or under-represented group). Therefore, protected attributes correspond to the input features for which we seek to avoid discrimination and ensure fairness.
Bias mitigation strategies
Several methods have been developed to alleviate bias and increase fairness, but at the same time, they must ensure that this does not significantly degrade the prediction performance of ML algorithms. One simple approach is to force the model to ignore protected attributes, but this is ineffective because there may be ways to predict the protected attributes from other features 35, or because those protected attributes might be important to the problem at hand.
Bias mitigation methods are typically classified into three categories based on the stage of the ML pipeline at which they are applied 3: pre-processing, in-processing, and post-processing. Pre-processing techniques attempt to alter the distributions of protected attributes or perform transformations on the data to remove underlying bias from the training data before it is fed into the ML model. In-processing methods modify the learning model itself during training, typically incorporating a fairness metric into the model optimization functions to converge toward a model parameterization that maximizes the balance between accuracy and fairness. Post-processing techniques adjust the predictions of the model after this has been fully trained to ensure fairness across different groups.
Pre-processing and post-processing methods offer several advantages over in-processing techniques, primarily in terms of flexibility, computational cost, model independence, and interpretability 36. For example, both pre-processing and post-processing approaches are completely model-independent, as they do not internally modify the ML model 1 and are therefore considered universal approaches. Unlike in-processing methods, which are tightly integrated with a specific loss function and model type, pre-processing and post-processing techniques are more flexible because they can adapt to different definitions of fairness and be applied to any ML model. On the other hand, since pre-processing and post-processing approaches do not require retraining the model, they are usually computationally simpler and easier to implement than complex in-processing approaches 37. Regarding interpretability, the changes made by the pre-processing algorithms are visible in the data itself; that is, the transformed data set can be inspected to understand how bias was mitigated. In the case of post-processing, since the adjustments are applied directly to the model outputs, it can be guaranteed that the results meet a predefined constraint. Furthermore, some authors have pointed out that retraining the model can be costly, time-consuming, and technically challenging, so in-processing methods may not be the most suitable in real-world clinical settings 38,39. Thus, considering that in-processing presents numerous practical limitations, this study focused on pre-processing and post-processing techniques, carrying out experiments with a reweighting algorithm and an equalized odds algorithm.
Reweighting mitigates the observed bias by iteratively assigning different weights to individual data points, forcing the model to prioritize learning from under-represented groups and de-emphasize over-represented groups 40. This often involves increasing the weights of under-represented groups and decreasing those of over-represented groups. The goal is to improve fairness with minimal impact on accuracy, which is achieved by formulating a constrained optimization problem. Given a protected attribute $A$ and the target class $Y$, the weight $w_{a,y}$ for a sample belonging to group $(A = a, Y = y)$ is calculated as:

$$w_{a,y} = \frac{P(A = a)\,P(Y = y)}{P(A = a, Y = y)} \qquad (1)$$

where $P(A = a)$ is the overall probability of being in the (e.g., unprivileged) group $a$, $P(Y = y)$ is the overall probability of having the class $y$, and $P(A = a, Y = y)$ is the actual probability of being in the specific group combination (e.g., the proportion of all non-White patients diagnosed with LGG). Since the actual joint probability $P(A = a, Y = y)$ for an unprivileged group will be much smaller than the product of its marginal probabilities $P(A = a)\,P(Y = y)$, this results in assigning a high weight ($w_{a,y} > 1$) to the samples in the unprivileged group and a low weight ($w_{a,y} < 1$) to the samples in the privileged group.
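As a minimal sketch (assuming a pandas data frame with binary race and grade columns; all names are illustrative, not those of our actual pipeline), the weights of Eq. (1) can be computed and passed to a classifier as sample weights:

```python
import numpy as np
import pandas as pd

def reweighing_weights(df, protected, label):
    """Per-sample weights following Eq. (1): w(a, y) = P(A=a) P(Y=y) / P(A=a, Y=y)."""
    n = len(df)
    p_a = df[protected].value_counts(normalize=True)      # marginal P(A=a)
    p_y = df[label].value_counts(normalize=True)          # marginal P(Y=y)
    p_ay = df.groupby([protected, label]).size() / n      # joint P(A=a, Y=y)
    return df.apply(
        lambda r: p_a[r[protected]] * p_y[r[label]] / p_ay[(r[protected], r[label])],
        axis=1,
    )

# Illustrative usage with synthetic data (1 = White / GBM, 0 = non-White / LGG)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"race": rng.binomial(1, 0.9, 500),
                    "grade": rng.binomial(1, 0.4, 500)})
toy["w"] = reweighing_weights(toy, "race", "grade")
# The weights can then be passed to most classifiers, e.g.
# LogisticRegression().fit(X, y, sample_weight=toy["w"])
```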
The equalized odds post-processing method proposed by Hardt et al. 41 consists of analyzing the predictions of an ML model to assess the equalized odds criterion (i.e., equal true positive and false positive rates for the two groups of the protected attribute), while keeping model error to a minimum. If the equalized odds criterion is not met, the algorithm adjusts the predictions, usually by modifying the classification thresholds for the different groups, with the aim of equalizing the true positive and false positive rates.
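As a rough illustration of the underlying idea (not the exact algorithm of Hardt et al., which solves a small linear program and may randomize predictions), the following sketch searches for group-specific decision thresholds that shrink the TPR and FPR gaps while penalizing the overall error; all variable names and the weight of the error term are illustrative assumptions:

```python
import numpy as np

def group_rates(y_true, y_pred):
    """Return (TPR, FPR) for binary numpy arrays."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return tp / (tp + fn), fp / (fp + tn)

def equalize_thresholds(scores, y_true, group, grid=np.linspace(0.05, 0.95, 19)):
    """Grid-search group-specific thresholds that minimize the TPR/FPR gaps
    between privileged (group == 1) and unprivileged (group == 0) samples."""
    best, best_cost = None, np.inf
    for t_priv in grid:
        for t_unpriv in grid:
            y_hat = np.where(group == 1, scores >= t_priv, scores >= t_unpriv).astype(int)
            tpr_p, fpr_p = group_rates(y_true[group == 1], y_hat[group == 1])
            tpr_u, fpr_u = group_rates(y_true[group == 0], y_hat[group == 0])
            gap = abs(tpr_p - tpr_u) + abs(fpr_p - fpr_u)
            err = np.mean(y_hat != y_true)
            cost = gap + 0.1 * err          # error weight chosen arbitrarily for the sketch
            if cost < best_cost:
                best, best_cost = (t_priv, t_unpriv), cost
    return best
```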
Reweighting was chosen because it offers advantages over other pre-processing algorithms. For example, unlike the resampling approaches, reweighting preserves the size and density of the original data distribution, thus avoiding the risk of data inflation (over-sampling) and loss of information (under-sampling). Regarding relabeling, while this approach introduces noise into the training data by intentionally flipping the label of certain samples in the sensitive group, reweighting allows the model to learn while preserving the true labels. On the other hand, attribute removal only prevents direct discrimination but fails to address indirect discrimination based on correlated features. In short, reweighting is mathematically superior because it preserves the full content of the information and the integrity of the original data. This makes it a more robust and attractive option than resampling (which discards data) or relabeling (which corrupts labels) when data quality and feature utility are paramount 38,42, as is the case in healthcare applications.
The choice of equalized odds over other post-processing methods is due to its ability to apply a more robust fairness metric that directly controls for errors, which is critical in high-risk decisions such as glioma grading. Simple threshold optimization aims for demographic balance, but can present significant disparities in which groups suffer high rates of false positive outcomes or low rates of true positive outcomes. While calibrated equalized odds might seem like a better option because it imposes a very strict fairness metric that requires both the predicted probabilities to be well calibrated within each sensitive group and to satisfy equalized odds, such rigid constraints on the decision boundary can significantly reduce the accuracy of the model 43. In short, equalized odds provides a solid and verifiable guarantee of fairness (equal error rates) with a simple implementation, making it the most common and robust choice when equalizing treatment is the goal and the model accuracy must be largely preserved 44.
Materials and methods
This section presents our case study and provides a descriptive analysis of the data. It also describes the prediction models tested in our experiments and the approaches considered to mitigate bias.
Data
The data used in this study were obtained from The Cancer Genome Atlas (TCGA, https://www.cancer.gov/ccg/research/). The data set was constructed based on the TCGA-LGG and TCGA-GBM projects and consists of three demographic attributes (gender, age at diagnosis, and race) and 20 frequently mutated molecular biomarkers from 839 adult patients diagnosed with LGG or GBM 45. Except for age, all other attributes are categorical. In the experiments performed, the age attribute was normalized using the z-score standardization technique so that all values have a mean of 0 and a standard deviation of 1. Molecular biomarkers are represented as 0 (not mutated) and 1 (mutated) according to the TCGA case number. There are no missing data for any of the attributes. The data set consists of two classes indicating the grade of glioma: 352 (41.95%) patients with GBM (considered here as the positive class) and 487 (58.05%) patients with LGG (the negative class), resulting in a class imbalance ratio of 1.38 (i.e., the ratio of the majority class to the minority class).
Descriptive statistics
The mean age of the patients was 50.94 years [95% CI 49.88, 52.05]. While the difference in mean age between men and women was negligible, significant differences were observed between the mean age of patients with GBM and that of patients with LGG. The data set included 488 (58.16%) men, of whom 217 (44.47%) were diagnosed with GBM and 271 (55.53%) with LGG, and 351 (41.84%) women, of whom 135 (38.46%) were diagnosed with GBM and 216 (61.54%) with LGG. The vast majority of patients were White (765, 91.18%), followed by Black or African American (59, 7.03%), with the remainder being Asian and American Indian (15, 1.79%). Among White patients, 308 (40.26%) were diagnosed with GBM and 457 (59.74%) with LGG, while among non-White patients, 44 (59.46%) were diagnosed with GBM and 30 (40.54%) with LGG.
For non-demographic attributes, Table 1 shows the counts (frequencies) and proportions (relative frequencies) of the 20 molecular biomarkers, ordered from highest to lowest according to the number of patients with a mutation. For each of the five biomarkers with the highest number of mutated cases, the distribution between male and female patients was as follows: IDH1 (225 vs. 179), TP53 (202 vs. 146), ATRX (116 vs. 101), PTEN (87 vs. 54), and EGFR (70 vs. 42), resulting in male to female ratios of 1.26, 1.38, 1.15, 1.61, and 1.66, respectively. Regarding race, the distribution between White and non-White patients was as follows: IDH1 (381 vs. 23), TP53 (327 vs. 21), ATRX (201 vs. 16), PTEN (125 vs. 16), and EGFR (98 vs. 14), resulting in White to non-White ratios of 16.57, 15.57, 12.56, 7.81, and 7.00, respectively.
Table 1.
Frequencies and proportions of patients with and without mutated molecular biomarkers.
| Biomarker | Non-mutated: Count | Non-mutated: Proportion (%) | Mutated: Count | Mutated: Proportion (%) |
|---|---|---|---|---|
| IDH1 | 435 | 51.85 | 404 | 48.15 |
| TP53 | 491 | 58.52 | 348 | 41.48 |
| ATRX | 622 | 74.14 | 217 | 25.86 |
| PTEN | 698 | 83.19 | 141 | 16.81 |
| EGFR | 727 | 86.65 | 112 | 13.35 |
| CIC | 728 | 86.77 | 111 | 13.23 |
| MUC16 | 741 | 88.32 | 98 | 11.68 |
| PIK3CA | 766 | 91.30 | 73 | 8.70 |
| NF1 | 772 | 92.01 | 67 | 7.99 |
| PIK3R1 | 785 | 93.56 | 54 | 6.44 |
| FUBP1 | 794 | 94.64 | 45 | 5.36 |
| RB1 | 799 | 95.23 | 40 | 4.77 |
| NOTCH1 | 801 | 95.47 | 38 | 4.53 |
| BCOR | 810 | 96.54 | 29 | 3.46 |
| CSMD3 | 812 | 96.78 | 27 | 3.22 |
| SMARCA4 | 812 | 96.78 | 27 | 3.22 |
| GRIN2A | 812 | 96.78 | 27 | 3.22 |
| IDH2 | 816 | 97.26 | 23 | 2.74 |
| FAT4 | 816 | 97.26 | 23 | 2.74 |
| PDGFRA | 817 | 97.38 | 22 | 2.62 |
More interesting are the values reported in Table 2, which shows the percentage of patients by gender and racial group for the molecular biomarkers with the highest number of mutated cases. For example, 46.11% of male patients had the IDH1 biomarker mutated, while the figure for female patients was 51.00%. As can be seen, the largest differences correspond to the values between White and non-White patients, especially in the case of IDH1 (49.80% vs. 31.08%) and TP53 (42.75% vs. 28.38%). It is also important to note that the PTEN and EGFR biomarkers showed a higher proportion of mutations among non-White patients than among White patients: 21.62% vs. 16.34% and 18.92% vs. 12.81%, respectively.
Table 2.
Percentage of patients by gender and racial group for the molecular biomarkers with the highest number of mutated cases.
| Biomarker | Male (%) | Female (%) | White (%) | non-White (%) |
|---|---|---|---|---|
| IDH1 | 46.11 | 51.00 | 49.80 | 31.08 |
| TP53 | 41.39 | 41.60 | 42.75 | 28.38 |
| ATRX | 23.77 | 28.77 | 26.27 | 21.62 |
| PTEN | 17.83 | 15.38 | 16.34 | 21.62 |
| EGFR | 14.34 | 11.97 | 12.81 | 18.92 |
Protected attributes
In the present study, we considered two protected attributes: race and gender. Regarding race, we split the data set into White and non-White patients (this included Black or African American, Asian, and American Indian individuals), with White being the privileged group. The ratio of White to non-White patients was 10.34, indicating a highly skewed ratio in favor of the White group. Regarding gender, we divided the data set into male (the privileged group) and female patients. In this case, the ratio of male to female patients was 1.80.
The favorable class (i.e., the value of the target variable that represents the most important outcome for the patient) was GBM, as this is the most malignant and aggressive type of primary brain tumor, with rapid growth and the lowest survival rate (less than 5% at 5 years) 46, and therefore the most critical class for a correct diagnosis. Figure 1 shows the proportions of patients with GBM and LGG for the privileged and unprivileged groups of the two protected attributes, to better contextualize the fairness analyses. The graphs suggest that the skewed distribution by gender and race could lead to biased predictions: (i) for the gender attribute, both the privileged and unprivileged groups contained a higher proportion of patients diagnosed with LGG than with GBM, resulting in LGG-to-GBM ratios of 1.25 and 1.60 for the privileged and unprivileged groups, respectively; (ii) for the race attribute, the White population (privileged group) had a higher proportion of patients with LGG than with GBM, while the unprivileged group was dominated by patients diagnosed with GBM. In this case, the LGG-to-GBM ratios for the privileged and unprivileged groups were 1.48 and 0.68, respectively.
Figure 1.
Distribution of patients with GBM and LGG for the privileged and unprivileged groups of the two protected attributes: race (left) and gender (right).
Prediction models and performance metrics
Three ML prediction models commonly used in healthcare applications were fitted: (i) logistic regression (LR), which is a linear classifier; (ii) random forest (RF), an ensemble learning method with bagging; and (iii) extreme gradient boosting (XGB), an ensemble learning method with boosting. While this work does not aim to determine the best-performing ML model, but rather to evaluate the trade-off between prediction performance and fairness, the hyper-parameters of each model were fine-tuned using a random search optimization algorithm 47.
Five-fold cross-validation was performed to assess performance and fairness. The data was randomly split into five stratified parts of equal (or approximately equal) size. For each fold, four parts were merged to construct a training set, and the remaining part was used as an independent test set. Stratification was used to preserve the class proportions of the entire data set within each part, thereby reducing the variance of the estimates. The prediction results on the test samples were averaged across the five runs, thus avoiding possible extreme results from a single experiment and making the conclusions more compelling.
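For illustration, a stratified five-fold protocol of this kind can be sketched with scikit-learn as follows (the actual feature matrix, tuned hyper-parameters, and additional metrics are omitted; the default logistic regression shown here is only a placeholder):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import StratifiedKFold

def cross_validated_mcc(X, y, model=None, k=5, seed=42):
    """Stratified k-fold CV: each fold preserves the LGG/GBM proportions of the
    full data set; the MCC is averaged over the k independent test folds."""
    model = model or LogisticRegression(max_iter=1000)
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):   # X, y assumed to be numpy arrays
        model.fit(X[train_idx], y[train_idx])
        scores.append(matthews_corrcoef(y[test_idx], model.predict(X[test_idx])))
    return np.mean(scores), np.std(scores)
```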
To evaluate the prediction performance of ML models, we considered the Matthews correlation coefficient (MCC), recall, and specificity (Spec). The MCC is a robust metric when positive and negative cases are equally important and is more informative than other metrics, such as the F1-score 48,49. The MCC ranges between −1 and +1: a value of +1 represents a perfect prediction, 0 indicates a prediction no better than random, and −1 denotes a completely erroneous prediction.
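These three metrics can be obtained with scikit-learn, using recall on the negative (LGG) class as specificity (a minimal sketch; labels are assumed to be coded 1 = GBM, 0 = LGG):

```python
from sklearn.metrics import matthews_corrcoef, recall_score

def performance_metrics(y_true, y_pred):
    """MCC, recall on the positive (GBM) class, and specificity,
    computed as recall on the negative (LGG) class."""
    return {
        "MCC": matthews_corrcoef(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred, pos_label=1),
        "Specificity": recall_score(y_true, y_pred, pos_label=0),
    }
```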
Fairness metrics
Fairness metrics allow us to assess how well a prediction model meets the definitions of fairness. In this study, rather than measuring individual fairness, we focus on group-level fairness metrics, which compare outcomes across groups and quantify the level of disparity or discrimination (or, conversely, the degree of fairness) 1,50. Similar to prediction performance metrics, the group fairness metrics are also based on a confusion matrix where the columns represent the classes predicted by the ML model, the rows represent the actual classes, and the main diagonal contains the number of correct predictions.
In binary prediction problems, the disparate impact (DI), also called demographic impact, is the ratio of the average positive outcomes for the unprivileged group to those of the privileged group. It allows us to determine how often the rate of positive outcomes for the unprivileged group coincides with that of the privileged group. It is defined as follows:
$$\mathrm{DI} = \frac{\mathrm{TPR}_u}{\mathrm{TPR}_p} \qquad (2)$$

where TPR is the proportion of positives correctly predicted by the model, and the subscripts $p$ and $u$ denote the privileged and unprivileged groups, respectively.
The equal opportunity difference (EOD) is calculated as the difference of true positive rates between unprivileged and privileged groups. The goal of this metric is to ensure that the model offers all groups the same probability of obtaining positive outcomes; that is, a model meets equal opportunity if the true positive rates are the same for both groups:
$$\mathrm{EOD} = \mathrm{TPR}_u - \mathrm{TPR}_p \qquad (3)$$
The error rate difference (ERD) corresponds to the difference in false positive and false negative rates between the privileged and unprivileged groups. This is an important metric when the cost of errors is the same regardless of the group and we need to ensure that it is distributed equitably between them:
$$\mathrm{ERD} = (\mathrm{FPR}_u + \mathrm{FNR}_u) - (\mathrm{FPR}_p + \mathrm{FNR}_p) \qquad (4)$$

where FPR is the proportion of negative cases that the model incorrectly predicted as positive, and FNR is the proportion of positive cases that the model incorrectly predicted as negative.
The ideal threshold (i.e., fair outcomes) is 1.0 for the disparate impact metric and 0.0 for the equal opportunity difference and error rate difference metrics. Values below the ideal threshold imply greater benefits for the privileged group, while values above indicate a bias in favor of the unprivileged group.
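As a minimal sketch, the three group-fairness metrics can be computed directly from group-wise confusion-matrix rates; the ERD term follows the combined FPR/FNR difference adopted in Eq. (4), and priv_mask is an assumed boolean indicator of the privileged group (e.g., White or male patients):

```python
import numpy as np

def rates(y_true, y_pred):
    """TPR, FPR and FNR from binary numpy arrays."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return tp / (tp + fn), fp / (fp + tn), fn / (fn + tp)

def fairness_metrics(y_true, y_pred, priv_mask):
    """DI, EOD and ERD as defined in Eqs. (2)-(4)."""
    tpr_p, fpr_p, fnr_p = rates(y_true[priv_mask], y_pred[priv_mask])
    tpr_u, fpr_u, fnr_u = rates(y_true[~priv_mask], y_pred[~priv_mask])
    return {
        "DI": tpr_u / tpr_p,
        "EOD": tpr_u - tpr_p,
        "ERD": (fpr_u + fnr_u) - (fpr_p + fnr_p),
    }
```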
Results
This section presents the results of the evaluation of the ML models before (referred to here as baseline models) and after applying bias mitigation strategies. First, the ML models were trained on the training data, and their performance on the test samples was evaluated using the prediction performance and fairness metrics described above. The bias mitigation algorithms presented in the "Bias mitigation strategies" section were then applied to the best-performing ML model, and its overall efficiency was determined by analyzing the trade-off between prediction performance and fairness, comparing it to the results of the baseline model. The ultimate goal of this analysis was to test whether the model's fairness could be improved while maintaining prediction performance.
Evaluation of the baseline models
Fig. 2 and Tables 3 and 4 compare the prediction performance of the three baseline models (without bias mitigation algorithms), as measured by recall, specificity, and MCC, across the entire test set (here referred to as Total) and also on each individual group of the two protected attributes. When evaluating the prediction performance of the baseline models across the entire test samples, the MCC revealed that LR was the best-performing algorithm (0.742 [95% CI: 0.740,0.745]), followed by RF (0.734 [95% CI: 0.732,0.736]) and XGB (0.688 [95% CI: 0.685,0.691]). Similarly, when evaluating the performance in the positive class (i.e., recall), the best classifier was again LR (0.906 [95% CI: 0.905,0.907]), while the recall obtained with RF was 0.892 [95% CI: 0.889,0.895] and that given by XGB was 0.836 [95% CI: 0.831,0.840]. However, when focusing on the negative class, although the differences in this case were not significant, XGB had the highest specificity value (0.854 [95% CI: 0.852,0.856]), followed by RF (0.848 [95% CI: 0.846,0.850]) and LR (0.844 [95% CI: 0.842,0.846]).
Fig. 2.
Comparison of prediction performance on privileged and unprivileged groups, and on the total population using the baseline models for the attributes race (left) and gender (right).
Table 3.
Average prediction performance (and standard deviation) using the baseline models (considering race bias).
|  | Recall: White | Recall: non-White | Recall: Total | Specificity: White | Specificity: non-White | Specificity: Total | MCC: White | MCC: non-White | MCC: Total |
|---|---|---|---|---|---|---|---|---|---|
| LR | 0.923 | 0.909 | 0.906 | 0.856 | 0.650 | 0.844 | 0.767 | 0.590 | 0.742 |
|  | (0.034) | (0.089) | (0.017) | (0.027) | (0.149) | (0.025) | (0.048) | (0.164) | (0.037) |
| RF | 0.902 | 0.867 | 0.892 | 0.865 | 0.583 | 0.848 | 0.757 | 0.494 | 0.734 |
|  | (0.031) | (0.183) | (0.044) | (0.031) | (0.118) | (0.033) | (0.034) | (0.219) | (0.028) |
| XGB | 0.867 | 0.660 | 0.836 | 0.865 | 0.633 | 0.854 | 0.726 | 0.297 | 0.688 |
|  | (0.051) | (0.161) | (0.068) | (0.031) | (0.126) | (0.030) | (0.053) | (0.149) | (0.048) |
Table 4.
Average prediction performance (and standard deviation) using the baseline models (considering gender bias).
|  | Recall: Male | Recall: Female | Recall: Total | Specificity: Male | Specificity: Female | Specificity: Total | MCC: Male | MCC: Female | MCC: Total |
|---|---|---|---|---|---|---|---|---|---|
| LR | 0.921 | 0.908 | 0.906 | 0.839 | 0.852 | 0.844 | 0.756 | 0.748 | 0.742 |
|  | (0.0037) | (0.075) | (0.017) | (0.036) | (0.049) | (0.025) | (0.039) | (0.061) | (0.037) |
| RF | 0.899 | 0.838 | 0.892 | 0.832 | 0.902 | 0.848 | 0.728 | 0.742 | 0.734 |
|  | (0.058) | (0.050) | (0.044) | (0.049) | (0.069) | (0.033) | (0.046) | (0.052) | (0.028) |
| XGB | 0.821 | 0.850 | 0.836 | 0.847 | 0.865 | 0.854 | 0.671 | 0.709 | 0.688 |
|  | (0.110) | (0.049) | (0.068) | (0.039) | (0.031) | (0.030) | (0.080) | (0.036) | (0.048) |
Regarding the results obtained for each of the two groups for the race attribute (left graphs in Fig. 2), while recall (i.e., the proportion of GBM patients correctly predicted) was similar for White and non-White populations (except when XGB was used, which produced worse results on the under-represented group than on the over-represented group), significant differences emerged when assessing specificity (i.e., the proportion of LGG patients correctly predicted) and MCC (which takes into account hits and misses on both classes), where in all cases the worst results were for the unprivileged (non-White) group. However, when comparing the results between men and women, the graphs in Fig. 2 (right) reveal that the prediction performance, as measured by any of the three metrics, was very similar for both groups. Unlike what was observed for the race attribute, specificity and MCC now yielded equal or even better results on the unprivileged group (female) than on the favored group (male).
To further analyze these results, we performed a pairwise comparison of models using a correlated Bayesian t-test 51 for each prediction performance evaluation metric to determine whether the difference in scores between each pair of models was significant. Unlike the frequentist correlated t-test, where the inference is a p-value, in the Bayesian t-test it is a posterior probability. Another interesting feature of the Bayesian t-test is that it considers the correlation and standard error of the results obtained through cross-validation. The outputs of the statistical test are summarized in Tables 5, 6 and 7, where the number in a cell indicates the probability that the model in the row scored significantly higher (posterior probability greater than 0.5) than the model in the column. The values in these tables indicate that LR performed significantly better than RF and XGB, regardless of the metric used. Furthermore, the posterior probabilities also reveal that RF outperformed XGB in all three metrics and the differences were statistically significant.
Table 5.
Pairwise comparison of the performance results obtained by the baseline models (without taking into account any bias).
|  | Recall: LR | Recall: RF | Recall: XGB | Specificity: LR | Specificity: RF | Specificity: XGB | MCC: LR | MCC: RF | MCC: XGB |
|---|---|---|---|---|---|---|---|---|---|
| LR | – | 0.852 | 0.913 | – | 0.892 | 0.906 | – | 0.883 | 0.911 |
| RF | 0.148 | – | 0.886 | 0.108 | – | 0.866 | 0.117 | – | 0.880 |
| XGB | 0.087 | 0.114 | – | 0.094 | 0.134 | – | 0.089 | 0.120 | – |
Table 6.
Pairwise comparison of the performance results obtained by the baseline models (taking into account race bias).
|  | Recall: LR | Recall: RF | Recall: XGB | Specificity: LR | Specificity: RF | Specificity: XGB | MCC: LR | MCC: RF | MCC: XGB |
|---|---|---|---|---|---|---|---|---|---|
| LR | – | 0.954 | 0.913 | – | 0.933 | 0.906 | – | 0.956 | 0.911 |
| RF | 0.046 | – | 0.754 | 0.067 | – | 0.782 | 0.117 | – | 0.880 |
| XGB | 0.086 | 0.246 | – | 0.092 | 0.218 | – | 0.087 | 0.234 | – |
Table 7.
Pairwise comparison of the performance results obtained by the baseline models (taking into account gender bias).
|  | Recall: LR | Recall: RF | Recall: XGB | Specificity: LR | Specificity: RF | Specificity: XGB | MCC: LR | MCC: RF | MCC: XGB |
|---|---|---|---|---|---|---|---|---|---|
| LR | – | 0.937 | 0.913 | – | 0.900 | 0.906 | – | 0.931 | 0.911 |
| RF | 0.063 | – | 0.698 | 0.100 | – | 0.758 | 0.069 | – | 0.724 |
| XGB | 0.084 | 0.302 | – | 0.090 | 0.242 | – | 0.085 | 0.276 | – |
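For reference, the posterior probabilities reported in Tables 5, 6 and 7 (and later in Tables 8 and 9) can be approximated with a correlated Bayesian t-test of the following form, a minimal sketch using the Nadeau–Bengio correlation correction for overlapping cross-validation folds; the original test additionally supports a region of practical equivalence, which is omitted here, and the per-fold scores shown are purely illustrative:

```python
import numpy as np
from scipy import stats

def correlated_bayesian_ttest(scores_a, scores_b, test_fraction=0.2):
    """Posterior probability that model A scores higher than model B, using a
    Student-t posterior with variance inflated by the fold-overlap correction."""
    diff = np.asarray(scores_a) - np.asarray(scores_b)   # per-fold score differences
    n = len(diff)
    rho = test_fraction                                  # 1/k for k-fold CV (0.2 for five folds)
    scale = np.sqrt((1.0 / n + rho / (1.0 - rho)) * diff.var(ddof=1))
    return 1.0 - stats.t.cdf(0.0, df=n - 1, loc=diff.mean(), scale=scale)

# Illustrative usage with made-up per-fold MCC scores of two models
p_a_better = correlated_bayesian_ttest([0.75, 0.73, 0.76, 0.74, 0.72],
                                        [0.70, 0.71, 0.69, 0.73, 0.68])
```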
Figure 3 evaluates the bias in the race and gender attributes for the three baseline models. Regarding race, the disparate impact values were 0.762 [95% CI 0.619, 0.893] (XGB), 0.961 [95% CI 0.860, 1.062] (RF), and 0.987 [95% CI 0.886, 1.088] (LR), with all models exhibiting values below the ideal threshold (DI = 1). This indicates that the privileged group (White) achieved a higher rate of correct positive predictions than the unprivileged group (non-White). The equal opportunity differences also revealed some bias in favor of the privileged group, as all of them were below the optimal threshold (EOD = 0): XGB [95% CI −0.316, −0.098], RF [95% CI −0.193, 0.123], and LR [95% CI −0.107, 0.079]. Similarly, the error rate differences were also negative, meaning that the prediction outcomes were biased toward the privileged group: XGB [95% CI −0.559, −0.317], RF [95% CI −0.494, −0.138], and LR [95% CI −0.383, −0.057].
Fig. 3.
Comparison of fairness using the baseline models for the attributes race (left) and gender (right).
Regarding gender, the DI value for XGB (1.034) [95% CI 0.908, 1.194] suggested a bias toward the unprivileged group (female), while the values for RF (0.932) [95% CI 0.880, 0.988] and LR (0.986) [95% CI 0.894, 1.078] indicated prediction results favoring the privileged group (male). The same was observed in the case of the EOD and ERD metrics, where XGB (EOD = 0.028 [95% CI −0.082, 0.138], ERD = 0.046 [95% CI −0.042, 0.134]) showed a bias toward the unprivileged group, while RF (EOD [95% CI −0.114, −0.008], ERD = 0.008 [95% CI −0.052, 0.038]) and LR (EOD [95% CI −0.097, 0.071], ERD [95% CI −0.145, −0.015]) favored the privileged group.
As with the prediction performance metrics, the results of the fairness metrics were also analyzed by pairwise comparing the models using a correlated Bayesian t-test to check for statistically significant differences in scores between each pair of models. Table 8 (race) and Table 9 (gender) show that the results of LR were significantly better than those of RF for both race and gender, and that the results of RF were higher than those of XGB. Therefore, it could be concluded that the performance of LR was fairer than that of the other two models.
Table 8.
Pairwise comparison of the fairness results obtained by the baseline models (taking into account race bias).
|  | DI: LR | DI: RF | DI: XGB | EOD: LR | EOD: RF | EOD: XGB | ERD: LR | ERD: RF | ERD: XGB |
|---|---|---|---|---|---|---|---|---|---|
| LR | – | 0.636 | 0.876 | – | 0.859 | 0.912 | – | 0.559 | 0.880 |
| RF | 0.364 | – | 0.810 | 0.141 | – | 0.844 | 0.441 | – | 0.811 |
| XGB | 0.124 | 0.190 | – | 0.088 | 0.156 | – | 0.120 | 0.189 | – |
Table 9.
Pairwise comparison of the fairness results obtained by the baseline models (taking into account gender bias).
|  | DI: LR | DI: RF | DI: XGB | EOD: LR | EOD: RF | EOD: XGB | ERD: LR | ERD: RF | ERD: XGB |
|---|---|---|---|---|---|---|---|---|---|
| LR | – | 0.204 | 0.885 | – | 0.589 | 0.787 | – | 0.750 | 0.833 |
| RF | 0.796 | – | 0.889 | 0.411 | – | 0.757 | 0.250 | – | 0.854 |
| XGB | 0.115 | 0.111 | – | 0.213 | 0.243 | – | 0.146 | 0.167 | – |
Evaluation of the debiased LR model
An improvement in fairness often comes at the cost of an inherent loss in prediction performance, and conversely, maximizing prediction outcomes often results in a decrease in fairness 35,52. However, it should be noted that in many real-world applications, and especially in the healthcare sector, a solution that reduces the bias of the outcomes at the expense of worsening the overall performance of the prediction model is unacceptable. Therefore, a balance between these two objectives must be found.
Consequently, this set of experiments focuses on analyzing the potential trade-off between fairness and prediction performance after bias mitigation, thus pursuing the two aforementioned objectives: increasing prediction performance and reducing discrimination (i.e., increasing fairness/equity between groups). This analysis was performed solely with LR because, as demonstrated by the correlated Bayesian t-test, it was the ML model that yielded the best results in both performance and fairness before applying any bias mitigation strategy.
Figure 4 shows the prediction performance metrics for LR before (baseline model) and after applying the two bias mitigation strategies (pre-processing and post-processing). When comparing the recall obtained on the privileged and unprivileged groups, for both race and gender, the differences were not significant, regardless of the bias mitigation algorithm used, nor when compared to the baseline model. In contrast, when measuring performance by specificity, the post-processing algorithm showed a significant loss in the performance of the privileged (White) group, dropping from a specificity of 0.856 [95% CI 0.854, 0.858] with the baseline model to 0.689 [95% CI 0.686, 0.691]. When using the MCC, both bias mitigation strategies showed a reduction in performance. However, while post-processing affected the privileged group, the pre-processing algorithm showed a reduction in the performance of the under-represented (non-White) group. The MCC for the White group decreased from 0.767 [95% CI 0.734, 0.770] with the baseline model to 0.580 [95% CI 0.576, 0.584] with post-processing, whereas with pre-processing the MCC for the non-White group decreased from 0.590 to 0.522 [95% CI 0.506, 0.539].
Fig. 4.
Comparison of performance metrics using the baseline LR model and the unbiased LR model: race (left) and gender (right).
Regarding gender, both bias mitigation methods yielded significantly better specificity and MCC outcomes than the baseline model (i.e., without applying any debiasing algorithm): from 0.650 [95% CI 0.616,0.684] (baseline model) to 0.847 [95% CI 0.835,0.859] (pre-processing) and 0.838 [95% CI 0.824,0.852] (post-processing) in specificity, and from 0.590 [95% CI 0.553,0.627] (baseline model) to 0.757 [95% CI 0.745,0.769] (pre-processing) and 0.734 [95% CI 0.719,0.749] (post-processing) in MCC.
Figure 5 compares the fairness metrics obtained with the baseline model with those resulting from applying the two bias mitigation algorithms. In all three approaches, no notable differences were observed between the privileged and unprivileged groups when measuring fairness with the disparate impact metric, either by race or gender. Therefore, disparate impact did not allow for conclusions to be drawn or for identifying which bias mitigation strategy was most effective in achieving fairer outcomes. Notably, based on this metric, the post-processing algorithm shifted the bias toward the privileged racial group (from 0.985 to 1.010 [95% CI 0.905,1.121]), while in the case of gender, it was the pre-processing method that generated bias toward the privileged group (from 0.985 to 1.002 [95% CI 0.920,1.084]), although not significantly for any of the protected attributes.
Fig. 5.
Comparison of fairness metrics using the baseline LR model and the unbiased LR model.
More interesting are the EOD and ERD graphs in Fig. 5. For race (with a very high ratio of White to non-White patients, i.e., a very high bias level of 10.34), the pre-processing algorithm increased the bias in favor of the privileged group, as both the EOD [95% CI after mitigation −0.178, 0.022] and the ERD [95% CI −0.379, −0.178] increased in magnitude. In contrast, for gender (with a low male-to-female ratio of 1.80), the pre-processing strategy virtually eliminated the bias or, at most, produced a very small bias toward the minority group: the EOD moved to 0.002 [95% CI −0.073, 0.077] and the ERD close to zero [95% CI −0.051, 0.071]. When evaluating the post-processing algorithm, the EOD and ERD values indicate that bias decreased significantly for both race and gender. Regarding race, the EOD moved to 0.009 [95% CI −0.087, 0.105] (i.e., a very small bias in favor of the unprivileged group), and the ERD also approached zero [95% CI −0.225, 0.165]. For gender, both the EOD [95% CI −0.097, 0.089] and the ERD [95% CI −0.084, 0.074] were close to zero after mitigation. These results suggest that post-processing was the most effective bias mitigation strategy in terms of balancing prediction performance and fairness, regardless of the level of demographic bias (i.e., the ratio of cases in the privileged group to the unprivileged group).
Stability analysis
Our evaluation reveals a critical tension in algorithmic fairness research: the reliance on point estimates for minority groups. While the equalized odds post-processing technique nominally achieved the highest fairness across all metrics, it is known that fairness metrics such as DI, EOD, and ERD are highly volatile when dealing with small sample sizes, and especially when there is a significant imbalance between the majority and minority groups 53. This volatility stems from the high sample variance inherent in proportions derived from small denominators, where minor variations in the classification of one or two minority patients can lead to disproportionate fluctuations in the resulting fairness estimate.
In the specific case of the race attribute, with 765 White patients and only 74 non-White patients, the law of large numbers does not apply, and individual data points exert a disproportionate influence on the final outcomes. Furthermore, since the four-fifths rule is statistically unreliable when the sample size is small 54, we performed a stability analysis. Because the minority group is much smaller, changing the prediction for a single non-White patient changes the fairness metrics by 1.35%, while a single White patient only changes the rates by 0.13%. This high sensitivity implies that the fairness metrics for the minority group are roughly 10 times more volatile than the metrics for the majority group: a single data point in the minority group has the same impact on the results as 10 data points in the majority group, leading to noisy estimates and suggesting that the metrics are driven by individual variance rather than by the systemic behavior of the algorithm.
Furthermore, since fairness metrics are essentially comparisons of proportions, they are mathematically unstable in small sample contexts 55. If we assume the model has the same performance for both groups, the small sample size of the non-White group (n = 74) yields a standard error (SE) approximately 3.2 times larger than that of the White group (n = 765), regardless of how fair the model actually is. When we estimate the standard error by breaking it down into subgroups, the noise ratio increases even further. For GBM, the White group is 7 (308/44) times larger than the non-White group, leading to a noise ratio of about 2.6. For LGG, since the White group is 15.2 (457/30) times larger than the non-White group, the estimate for non-White patients is almost 4 (3.9) times less stable than for White patients.
The instability follows an inverse square root relationship with sample size (SE proportional to 1/√n). While the aggregated non-White group (n = 74) is 3.2 times more volatile than the White group, this instability escalates to 3.9 times in the LGG subgroup (n = 30). This analysis suggests that the observed disparities within subgroups are dominated by sampling variance rather than by algorithmic behavior.
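The standard-error ratios quoted above can be reproduced with a few lines, using the group sizes from the descriptive statistics (a simple sketch assuming equal underlying rates in both groups):

```python
import numpy as np

def se_ratio(n_small, n_large):
    """Relative instability of a proportion estimated on the smaller group:
    SE is proportional to 1/sqrt(n), so the ratio is sqrt(n_large / n_small)."""
    return np.sqrt(n_large / n_small)

print(100 / 74, 100 / 765)    # per-patient impact on group rates: ~1.35% vs ~0.13%
print(se_ratio(74, 765))      # whole cohort (non-White vs White): ~3.2
print(se_ratio(44, 308))      # GBM subgroup: ~2.6
print(se_ratio(30, 457))      # LGG subgroup: ~3.9
```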
Conclusions
The experimental results illustrated the effect of bias mitigation algorithms on fairness and prediction performance for both race and gender groups, demonstrating that post-processing the outcomes of the model was the most effective strategy for balancing the goals of maximizing performance and minimizing bias. It is worth noting that the TCGA data set used in our experiments exhibits a much higher level of race-based bias than gender-based bias. In this context, differences were observed between the pre-processing and post-processing strategies: the reweighting pre-processing method only improved fairness in cases of low bias, while the equalized odds post-processing technique proved more effective at achieving a balance of fairness between privileged and unprivileged groups, regardless of the level of bias, while maintaining high prediction performance. In addition, post-processing does not require access to the training data and works with any already trained model, making it a practical and efficient option for systems deployed in real-world clinical settings.
In most healthcare applications, such as glioma classification, considering simplicity, flexibility, complexity, and interpretability, pre-processing and post-processing strategies are preferable to in-processing. Among other advantages, they are model-independent, can be adapted to different fairness metrics, the adjustments they make are transparent and easily understood by healthcare professionals, and they generally perform well in most experimental settings. In the case of the TCGA data set used in our experiments, this study demonstrated that differences in results depend on both the bias mitigation strategy and the chosen fairness metric, since current measures focus on specific objectives that can sometimes conflict. Therefore, it is crucial to consider the context of the problem.
A major limitation of this study is the disparity in sample size between the White (n = 765) and non-White (n = 74) groups, which introduces considerable instability into the fairness metrics. The sensitivity analysis indicates that a classification change for a single non-White patient results in a 1.35% variation in the group error rate, a degree of volatility more than 10 times greater than that observed in the White group (0.13%). Consequently, the 95% confidence intervals are extremely wide and frequently span the ideal fairness thresholds. Therefore, these metrics should be interpreted as exploratory rather than as definitive evidence of algorithmic bias.
It is important to note that this study only explored pre-processing and post-processing methods for bias mitigation applied to the best-performing ML model among the three classifiers initially analyzed. Therefore, the findings cannot be generalized to any glioma grading problem, but are limited to the data used in our experiments. However, we believe this study can serve as a starting point for other researchers to conduct further studies that delve into other relevant sources of bias (such as age, socioeconomic status, ethnicity, etc.), broaden the diversity of the data (e.g., by including radiomic features obtained from neuroimaging techniques), evaluate a wider range of bias mitigation algorithms and prediction models (e.g., support vector machines, decision trees, etc.), or use other fairness metrics beyond those based on a confusion matrix (such as calibration- and score-based metrics).
Author contributions
R. S.-M.: Writing – review & editing, Data curation, Methodology, Resources, Conceptualization, Supervision, Formal analysis. V. G.: Writing – review & editing, Resources, Software, Visualization, Validation, Formal analysis. J. S. S.: Writing – original draft, Investigation, Methodology, Formal analysis, Supervision, Funding acquisition. All authors have read and approved the final version of the manuscript.
Funding
This study was partially funded by the Conselleria d’Innovació, Universitats, Ciència i Societat Digital – Generalitat Valenciana (grant number CAICO/2023/032). Open access funding provided by the Institute of New Imaging Technologies and the Department of Computer Languages and Systems at Universitat Jaume I.
Data availability
All data supporting the findings of this study are available in the UCI Machine Learning Repository: Glioma Grading Clinical and Mutation Features [Dataset]. https://doi.org/10.24432/C5R62J.
Code availability
The custom code used in this study is available upon request from the corresponding author.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Caton, S. & Haas, C. Fairness in machine learning: A survey. ACM Comput. Surv. 56, 166. 10.1145/3616865 (2024).
- 2. Pessach, D. & Shmueli, E. A review on fairness in machine learning. ACM Comput. Surv. 10.1145/3494672 (2022).
- 3. Mahoney, T., Varshney, K. & Hind, M. AI Fairness (O'Reilly Media, 2020).
- 4. Wongvorachan, T., Bulut, O., Liu, J. X. & Mazzullo, E. A comparison of bias mitigation techniques for educational classification tasks using supervised machine learning. Information 15, 326. 10.3390/info15060326 (2024).
- 5. Kisten, M. & Khosa, M. Enhancing fairness in credit assessment: Mitigation strategies and implementation. IEEE Access 12, 177277–177284. 10.1109/ACCESS.2024.3505836 (2024).
- 6. Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data 5, 153–163. 10.1089/big.2016.0047 (2017).
- 7.Delecraz, S. et al. Responsible artificial intelligence in human resources technology: An innovative inclusive and fair by design matching algorithm for job recruitment purposes. J Responsible Technol.11, 100041. 10.1016/j.jrt.2022.100041 (2022). [Google Scholar]
- 8.Chen, R. J. et al. Algorithmic fairness in artificial intelligence for medicine and healthcare. Nat. Biomed. Eng.7, 719–742. 10.1038/s41551-023-01056-8 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Giovanola, B. & Tiribelli, S. Beyond bias and discrimination: redefining the AI ethics principle of fairness in healthcare machine-learning algorithms. AI Soc.38, 549–563. 10.1007/s00146-022-01455-6 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Sánchez-Marqués, R., García, V. & Sánchez, J. S. A data-centric machine learning approach to improve prediction of glioma grades using low-imbalance TCGA data. Sci. Rep.14, 17195. 10.1038/s41598-024-68291-0 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Li, Y., Wang, H. & Luo, Y. Improving fairness in the prediction of heart failure length of stay and mortality by integrating social determinants of health. Circ. Heart Fail.15, e009473. 10.1161/CIRCHEARTFAILURE.122.009473 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Sahin, D., Jessen, F. & Kambeitz, J. Algorithmic fairness in biomarker-based machine learning models to predict Alzheimer’s dementia in individuals with mild cognitive impairment. Alzheimer’s Dement.18, e062125. 10.1002/alz.062125 (2022). [Google Scholar]
- 13.Cabanillas Silva, P. et al. Evaluating gender bias in ML-based clinical risk prediction models: A study on multiple use cases at different hospitals. J Biomedl Inform.157, 104692. 10.1016/j.jbi.2024.104692 (2024). [DOI] [PubMed] [Google Scholar]
- 14. Dang, V. N. et al. Fairness and bias correction in machine learning for depression prediction across four study populations. Sci. Rep. 14, 7848. 10.1038/s41598-024-58427-7 (2024).
- 15. Yaseliani, M., Noor-E-Alam, M. & Hasan, M. M. Mitigating sociodemographic bias in opioid use disorder prediction: Fairness-aware machine learning framework. JMIR AI. 10.2196/55820 (2024).
- 16. Wang, Y. et al. Assessing fairness in machine learning models: A study of racial bias using matched counterparts in mortality prediction for patients with chronic diseases. J. Biomed. Inform. 156, 104677. 10.1016/j.jbi.2024.104677 (2024).
- 17. Park, Y. et al. Comparison of methods to reduce bias from clinical prediction models of postpartum depression. JAMA Netw. Open 4, e213909. 10.1001/jamanetworkopen.2021.3909 (2021).
- 18. Li, F. et al. Evaluating and mitigating bias in machine learning models for cardiovascular disease prediction. J. Biomed. Inform. 138, 104294. 10.1016/j.jbi.2023.104294 (2023).
- 19. Sivarajkumar, S., Huang, Y. & Wang, Y. Fair patient model: Mitigating bias in the patient representation learned from the electronic health records. J. Biomed. Inform. 148, 104544. 10.1016/j.jbi.2023.104544 (2023).
- 20. Louis, D. N. et al. The 2016 World Health Organization classification of tumors of the central nervous system: A summary. Acta Neuropathol. 131, 803–820. 10.1007/s00401-016-1545-1 (2016).
- 21. Pellerino, A., Caccese, M., Padovan, M., Cerretti, G. & Lombardi, G. Epidemiology, risk factors, and prognostic factors of gliomas. Clin. Transl. Imaging 10, 467–475. 10.1007/s40336-022-00489-6 (2022).
- 22. Wang, J., Wang, T., Han, R., Shi, D. & Chen, B. Artificial intelligence in cancer pathology: Applications, challenges, and future directions. CytoJournal 22, 45. 10.25259/Cytojournal_272_2024 (2025).
- 23. Pestine, E., Stokes, A. & Trinquart, L. Representation of obese participants in obesity-related cancer randomized trials. Ann. Oncol. 29, 1582–1587. 10.1093/annonc/mdy138 (2018).
- 24. Reihl, S. et al. A population study of clinical trial accrual for women and minorities in neuro-oncology following the NIH Revitalization Act. Neuro-Oncology 23, vi88. 10.1093/neuonc/noab196.344 (2021).
- 25. Rivera Perla, K. M. et al. Predicting access to postoperative treatment after glioblastoma resection: An analysis of neighborhood-level disadvantage using the Area Deprivation Index (ADI). J. Neurooncol. 158, 349–357. 10.1007/s11060-022-04020-9 (2022).
- 26. Alpert, A. B. et al. Addressing barriers to clinical trial participation for transgender people with cancer to improve access and generate data. J. Clin. Oncol. 41, 1825–1829. 10.1200/JCO.22.01174 (2023).
- 27. Delgado-López, P. D. et al. Artificial intelligence in neuro-oncology: Methodological bases, practical applications and ethical and regulatory issues. Clin. Transl. Oncol. 10.1007/s12094-025-03948-4 (2025).
- 28. Reihanian, Z. et al. Impact of age and gender on survival of glioblastoma multiforme patients: A multicenter retrospective study. Cancer Rep. 7, e70050. 10.1002/cnr2.70050 (2024).
- 29. Yin, J. et al. Gender differences in gliomas: From epidemiological trends to changes at the hormonal and molecular levels. Cancer Lett. 598, 217114. 10.1016/j.canlet.2024.217114 (2024).
- 30. John, D. et al. Racial disparities in glioblastoma genomic alterations: A comprehensive analysis of a multi-institution cohort of 2390 patients. World Neurosurg. 188, e625–e630. 10.1016/j.wneu.2024.05.183 (2024).
- 31. Zawad, M. R. S. & Washington, P. Evaluating fair feature selection in machine learning for healthcare. arXiv preprint. 10.48550/arXiv.2403.19165 (2024).
- 32. Di Nunno, V. et al. Economic income and survival in patients affected by glioblastoma: A systematic review and meta-analysis. Neurooncol. Pract. 11, 546–555. 10.1093/nop/npae045 (2024).
- 33. Hu, S., Sun, C., Chen, M. & Zhou, J. Marital status as an independent prognostic factor in patients with glioblastoma: A population-based study. World Neurosurg. 182, e559–e569. 10.1016/j.wneu.2023.11.145 (2024).
- 34. Puthenpura, V. et al. Racial/ethnic, socioeconomic, and geographic survival disparities in adolescents and young adults with primary central nervous system tumors. Pediatr. Blood Cancer 68, e28970. 10.1002/pbc.28970 (2021).
- 35. Corbett-Davies, S., Pierson, E., Feller, A., Goel, S. & Huq, A. Algorithmic decision making and the cost of fairness. In Proc. 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Halifax, NS, Canada, 2017). 10.1145/3097983.3098095.
- 36. Hort, M., Chen, Z., Zhang, J. M., Harman, M. & Sarro, F. Bias mitigation for machine learning classifiers: A comprehensive survey. ACM J. Responsib. Comput. 10.1145/3631326 (2024).
- 37. Dang, V., Campello, V., Hernández-González, J. & Lekadir, K. Empirical comparison of post-processing debiasing methods for machine learning classifiers in healthcare. J. Healthc. Inform. Res. 9, 465–493. 10.1007/s41666-025-00196-7 (2025).
- 38. Calmon, F. P., Wei, D., Vinzamuri, B., Ramamurthy, K. N. & Varshney, K. R. Optimized pre-processing for discrimination prevention. In Proc. 31st International Conference on Neural Information Processing Systems, 3995–4004 (2017). 10.5555/3294996.3295155.
- 39. Petersen, F., Mukherjee, D., Sun, Y. & Yurochkin, M. Post-processing for individual fairness. In Proc. 35th International Conference on Neural Information Processing Systems (2021). 10.5555/3540261.3542247.
- 40. Kamiran, F. & Calders, T. Data preprocessing techniques for classification without discrimination. Knowl. Inf. Syst. 33, 1–33. 10.1007/s10115-011-0463-8 (2012).
- 41. Hardt, M., Price, E. & Srebro, N. Equality of opportunity in supervised learning. In Proc. 30th International Conference on Neural Information Processing Systems, 3323–3331 (Barcelona, Spain, 2016). 10.5555/3157382.3157469.
- 42. An, J., Ying, L. & Zhu, Y. Why resampling outperforms reweighting for correcting sampling bias with stochastic gradients. arXiv preprint (2021). https://arxiv.org/abs/2009.13447.
- 43. Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J. & Weinberger, K. Q. On fairness and calibration. In Proc. 31st International Conference on Neural Information Processing Systems, 5684–5693 (2017). 10.5555/3295222.3295319.
- 44. Grant, D. G. Equalized odds is a requirement of algorithmic fairness. Synthese 201, 101. 10.1007/s11229-023-04054-0 (2023).
- 45. Tasci, E., Zhuge, Y., Kaur, H., Camphausen, K. & Krauze, A. V. Hierarchical voting-based feature selection and ensemble learning model scheme for glioma grading with clinical and molecular characteristics. Int. J. Mol. Sci. 23, 14155. 10.3390/ijms232214155 (2022).
- 46. Delgado-López, P. D. & Corrales-García, E. M. Survival in glioblastoma: A review on the impact of treatment modalities. Clin. Transl. Oncol. 18, 1062–1071. 10.1007/s12094-016-1497-x (2016).
- 47. Bergstra, J. & Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012).
- 48. Chicco, D. & Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21, 6. 10.1186/s12864-019-6413-7 (2020).
- 49. Chicco, D., Tötsch, N. & Jurman, G. The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData Min. 14, 13. 10.1186/s13040-021-00244-z (2021).
- 50. Zhou, N., Zhang, Z., Nair, V. N., Singhal, H. & Chen, J. Bias, fairness and accountability with artificial intelligence and machine learning algorithms. Int. Stat. Rev. 90, 468–480. 10.1111/insr.12492 (2022).
- 51. Corani, G. & Benavoli, A. A Bayesian approach for comparing cross-validated algorithms on multiple data sets. Mach. Learn. 100, 285–304. 10.1007/s10994-015-5486-z (2015).
- 52. Menon, A. K. & Williamson, R. C. The cost of fairness in binary classification. In Proc. 1st Conference on Fairness, Accountability and Transparency (PMLR 81), 107–118 (2018). 10.48550/arXiv.1705.09055.
- 53. Friedler, S. A. et al. A comparative study of fairness-enhancing interventions in machine learning. In Proc. Conference on Fairness, Accountability, and Transparency, 329–338 (2019). 10.1145/3287560.3287589.
- 54. Watkins, E. A. & Chen, J. The four-fifths rule is not disparate impact: A woeful tale of epistemic trespassing in algorithmic fairness. In Proc. 2024 ACM Conference on Fairness, Accountability, and Transparency, 764–775 (2024). 10.1145/3630106.3658938.
- 55. Roth, P. L., Bobko, P. & Switzer, F. S. III. Modeling the selection oracle: How selection rate and test properties affect the 4/5ths rule. J. Appl. Psychol. 91, 507–522. 10.1037/0021-9010.91.3.507 (2006).
Data Availability Statement
All data supporting the findings of this study are available in the UCI Machine Learning Repository: Glioma Grading Clinical and Mutation Features [Dataset]. https://doi.org/10.24432/C5R62J.
The custom code used in this study is available upon request from the corresponding author.
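For readers who want to explore the same data, the dataset can be retrieved programmatically. The following is a minimal sketch (not the authors' code) using the ucimlrepo package; the dataset identifier 759 is an assumption based on the UCI repository listing and should be verified against the DOI above.

```python
# Minimal sketch: load the UCI "Glioma Grading Clinical and Mutation Features"
# dataset. Assumes the ucimlrepo package is installed (pip install ucimlrepo)
# and that the dataset id is 759 (an assumption; confirm via
# https://doi.org/10.24432/C5R62J before relying on it).
from ucimlrepo import fetch_ucirepo

glioma = fetch_ucirepo(id=759)       # hypothetical id for this dataset
X = glioma.data.features             # clinical and molecular (mutation) features
y = glioma.data.targets              # glioma grade label (LGG vs. GBM)

print(X.shape)                       # number of patients x number of features
print(y.value_counts())              # class distribution of the grade label
```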