Abstract
Recent research has shown that artificial intelligence (AI) models can exhibit bias in performance when trained using data that are imbalanced by protected attribute(s). Most work to date has focused on deep learning models, but classical AI techniques that make use of hand-crafted features may also be susceptible to such bias. In this paper we investigate the potential for race bias in random forest (RF) models trained using radiomics features. Our application is prediction of tumour molecular subtype from dynamic contrast enhanced magnetic resonance imaging (DCE-MRI) of breast cancer patients. Our results show that radiomics features derived from DCE-MRI data do contain race-identifiable information, and that RF models can be trained to predict White and Black race from these data with 60–70% accuracy, depending on the subset of features used. Furthermore, RF models trained to predict tumour molecular subtype using race-imbalanced data seem to produce biased behaviour, exhibiting better performance on test data from the race on which they were trained.
Keywords: Bias, AI, Fairness, Radiomics, Breast, DCE-MRI
1. Introduction
The potential for artificial intelligence (AI) models to exhibit bias, or disparate performance for different protected groups, has been demonstrated in a range of computer vision and more recently medical imaging applications. For example, biased performance has been reported in AI models for diagnostic tasks from chest X-rays [9,21], cardiac magnetic resonance (MR) image segmentation [10,18,19], brain MR image analysis [7,16,22,24] and dermatology image analysis [1,6]. In response, the field of Fair AI has emerged to address the challenge of making AI more trustworthy and equitable in its performance for protected groups [14].
A common cause of bias in AI model performance is the combination of a distributional shift between the data of different protected groups and demographic imbalance in the training set. For example, in chest X-rays there is a distributional shift between sexes due to the presence of breast tissue lowering the signal-to-noise ratio of images acquired from female subjects [9]. However, more subtle distributional shifts can also exist which cannot be perceived by human experts, and recent work has shown that race-based distributional shifts are present in a range of medical imaging modalities, including breast mammography [5]. This raises the possibility of race bias in AI models trained using imbalanced data from these modalities.
Most work on AI bias to date has focused on deep learning techniques, in which the features used for the target task are optimised as part of the training process. In the presence of distributional shift and training set imbalance this learning process can lead to bias in the features and potentially in model performance. Classical AI approaches are trained using fixed hand-crafted features such as radiomics, and so might be considered to be less susceptible to bias. However, despite these approaches still being widely applied, little experimental work has been performed to assess the potential for, and presence of, bias in these features and the resulting models.
In this paper, we investigate the potential for bias in a classical AI model (Random Forest) based on radiomics features. Our chosen application is potential race bias in Random Forest models trained using radiomics features derived from dynamic contrast enhanced magnetic resonance imaging (DCE-MRI) of breast cancer patients. This application is of interest because differences between races have been reported in breast density and composition [12,15], as well as in tumour biology [11], indicating a possible distributional shift in (imaging) data acquired from different races, and hence the possibility of bias in AI models trained using these data. Our target task is the prediction of tumour molecular subtype from the radiomics features. This is a clinically useful task because different types of tumour are commonly treated in different ways (e.g. surgery, chemotherapy), and tumour molecular subtype is normally determined through an invasive biopsy. Therefore, development and validation of an AI model to perform this task from imaging data would obviate the need for such biopsies.
This paper makes two key contributions to the field of Fair AI. First, we present the first thorough investigation into possible bias in AI models based on radiomics features. Second, we perform the first investigation of bias in AI models based on features derived from breast DCE-MRI.
2. Materials
In our experiments we employ the dataset described in [20]¹, which features pre-operative DCE-MRI images acquired from 922 female patients with invasive breast cancer at Duke Hospital, USA, together with demographic, clinical, pathology, genomic, treatment, outcome and other data. From the DCE-MRI images, 529 radiomics features have been derived which are split into three (partially overlapping) categories: whole breast, fibroglandular tissue (FGT) only and tumour only. The full dataset consists of approximately 70% White subjects, 22% Black subjects and 8% other races. We refer the reader to [20] for a full summary of patient characteristics and the data provided.
3. Methods
For all experiments we employed a Random Forest (RF) classifier as our AI model, similar to the work described in [20]. For each model, we performed a grid search hyperparameter optimisation using 5-fold cross-validation on the training set. Following this, the final model was trained with the selected hyperparameter values using all training data and applied to the test set. The hyperparameters optimised were the number of trees (50, 100, 200, 250), the maximum depth of the trees (10, 15, 30, 45) and the splitting criterion (entropy, Gini); a code sketch of this training procedure is given after the list below. Our model training differed from that described in [20] in three important ways:
1. We used only Black and White subjects to enable us to analyse bias in a controlled environment. Data from all other races were excluded from both the training and test sets. This meant that our dataset comprised 854 subjects (651 White and 203 Black).
2. To simplify our analysis, we focused on just one of the binary classification problems reported in [20]: prediction of Luminal A vs non-Luminal A tumour molecular subtype. Based on these labels, the numbers of positive (Luminal A) and negative (non-Luminal A) subjects for each race are summarised in Table 1. As can be seen, there is a higher proportion of non-Luminal A tumours in the Black patients, which is consistent with prior studies on relative incidence of tumour subtypes by race [2,8].
3. We did not perform feature selection prior to training and evaluating the RF classifiers. We chose to omit this step because one of our objectives was to analyse which specific radiomics features (if any) could lead to bias in the trained models, so we did not want to exclude any features prior to this analysis.
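To make the training procedure described above concrete, the sketch below shows one way to implement the grid search and final refit using scikit-learn. The array names `X_train` and `y_train` (radiomics features and Luminal A labels) are illustrative assumptions of ours, not part of the original study.

```python
# Minimal sketch (scikit-learn) of the RF training procedure described above.
# X_train / y_train are assumed to hold the radiomics features and the
# Luminal A vs non-Luminal A labels; these names are illustrative only.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [50, 100, 200, 250],  # number of trees
    "max_depth": [10, 15, 30, 45],        # maximum tree depth
    "criterion": ["entropy", "gini"],     # splitting criterion
}

# Grid search with 5-fold cross-validation on the training set only.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Train the final model on all training data with the selected hyperparameters.
final_model = RandomForestClassifier(random_state=0, **search.best_params_)
final_model.fit(X_train, y_train)
```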
Table 1.
Summary of positive (Luminal A) and negative (non-Luminal A) labels in the dataset overall and broken down by race.
| Label | White | Black | All |
|---|---|---|---|
| Positive (Luminal A) | 442 | 107 | 549 |
| Negative (non-Luminal A) | 209 | 96 | 305 |
4. Experiments and Results
4.1. Race Classification
In the first experiment, our aim was to determine if the radiomics features contain race-identifiable information. The presence of such information is a known potential cause of bias in trained models as it would be indicative of a distributional shift in the data between races, not just in the imaging data but in the derived (hand-crafted) radiomics features. To investigate this, we trained RF classifiers to predict race (White or Black) from the entire radiomics feature set, and also for the whole breast, FGT and tumour features individually. For these experiments, to eliminate the effect of class (i.e. race) imbalance, we randomly sampled from the dataset to create race-balanced training and test sets, each consisting of 100/100 White/Black subjects.
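As an illustration of this balanced sampling and the race-classification experiment, the sketch below assumes the radiomics features and race labels are held in a pandas DataFrame `df` with a `race` column; all names are illustrative assumptions rather than the study's actual code.

```python
# Sketch of race-balanced sampling and race classification from radiomics
# features, assuming a DataFrame `df` containing feature columns plus a 'race'
# column with values 'White'/'Black' (illustrative names only).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def balanced_race_split(df, n_per_race=100, seed=0):
    """Sample n_per_race subjects per race for the train and test sets."""
    train_parts, test_parts = [], []
    for race in ["White", "Black"]:
        group = df[df["race"] == race].sample(frac=1, random_state=seed)
        train_parts.append(group.iloc[:n_per_race])
        test_parts.append(group.iloc[n_per_race:2 * n_per_race])
    return pd.concat(train_parts), pd.concat(test_parts)

train_df, test_df = balanced_race_split(df)
feature_cols = [c for c in df.columns if c != "race"]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(train_df[feature_cols], train_df["race"])

# Accuracy on the whole test set and broken down by race (cf. Table 2).
y_true = test_df["race"].to_numpy()
pred = clf.predict(test_df[feature_cols])
print(f"Overall: {accuracy_score(y_true, pred):.1%}")
for race in ["White", "Black"]:
    mask = y_true == race
    print(f"{race} subjects: {accuracy_score(y_true[mask], pred[mask]):.1%}")
```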
Results are reported as percentage classification accuracy in Table 2 for all subjects in the test set and also separately for each race. We can see that it is possible to predict race from radiomics features with around 60–70% accuracy. The results are similar for both White and Black subjects and do not differ markedly across the categories of radiomics features used. It should be noted that the whole breast, FGT and tumour categories are partially overlapping, which may partly explain the similar performance across categories. Specifically, a set of features related to breast and FGT volume is included in both the whole breast and FGT categories, and another set related to FGT and tumour enhancement is present in both the FGT and tumour categories [20].
Table 2.
Race classification accuracy from radiomics features derived from breast DCE-MRI. Results are presented as percentage classification accuracy and reported for the whole test set as well as broken down by race. Classification was performed using all radiomics features as well as the subsets derived from the whole breast, fibroglandular tissue (FGT) and tumour only.
| Radiomics features | Whole test set | White subjects only | Black subjects only |
|---|---|---|---|
| All | 63% | 64% | 66% |
| Whole breast only | 62% | 70% | 57% |
| FGT only | 61% | 65% | 60% |
| Tumour only | 62% | 62% | 66% |
4.2. Bias Analysis
Having established one of the key conditions for the presence of bias in AI models, i.e. a distributional shift between the data of different protected groups, we next investigated whether training with highly imbalanced training sets can lead to bias in performance.
For these experiments we split the dataset into a training set of 426 subjects and a test set of 428 subjects. The split was random under the constraints that the White and Black subjects and the Luminal A and non-Luminal A subjects were evenly distributed between train and test sets. The training set consisted of 325/101 White/Black subjects and 274/152 Luminal A/non-Luminal A subjects, and the test set consisted of 326/102 White/Black subjects and 275/153 Luminal A/non-Luminal A subjects.
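A split satisfying these constraints can be obtained by stratifying on the combination of race and subtype label; the sketch below again assumes a DataFrame `df` with illustrative `race` and `luminal_a` columns.

```python
# Sketch of an (approximately) even train/test split stratified jointly by
# race and Luminal A label; column names are illustrative assumptions.
from sklearn.model_selection import train_test_split

strata = df["race"].astype(str) + "_" + df["luminal_a"].astype(str)
train_df, test_df = train_test_split(
    df, test_size=0.5, stratify=strata, random_state=0
)

# Radiomics feature columns, excluding the demographic and label columns.
feature_cols = [c for c in df.columns if c not in ("race", "luminal_a")]
```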
In addition, we curated two additional training sets consisting of only the White subjects and only the Black subjects from the combined training set described above. Due to the racial imbalance in the dataset, these training sets consisted of 325 and 101 subjects, respectively. Using all three training sets (i.e. all, White-only and Black-only), we trained RF models for the task of classifying Luminal A vs non-Luminal A tumour molecular subtype and evaluated their performance for the entire test set as well as for the White subjects and the Black subjects in the test set individually. Class (i.e. molecular subtype) imbalance was addressed by applying a weighting to training samples that was inversely proportional to the class frequency.
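The three training configurations, the inverse-frequency sample weighting and the per-race evaluation could be implemented along the following lines. This is a sketch reusing the illustrative `train_df`, `test_df` and `feature_cols` names from the earlier sketches, not the authors' code; in the actual experiments the hyperparameters would come from the grid search described in Sect. 3.

```python
# Sketch of the three training configurations (all, White-only, Black-only),
# inverse-class-frequency sample weighting and per-race evaluation (cf. Table 3).
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.utils.class_weight import compute_sample_weight

training_sets = {
    "All": train_df,
    "White": train_df[train_df["race"] == "White"],
    "Black": train_df[train_df["race"] == "Black"],
}
test_sets = {
    "All": test_df,
    "White": test_df[test_df["race"] == "White"],
    "Black": test_df[test_df["race"] == "Black"],
}

for train_name, train_subset in training_sets.items():
    X, y = train_subset[feature_cols], train_subset["luminal_a"]
    # Weight each training sample inversely to its class frequency to counter
    # the Luminal A vs non-Luminal A imbalance.
    weights = compute_sample_weight(class_weight="balanced", y=y)
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X, y, sample_weight=weights)

    # Evaluate on the whole test set and on each race separately.
    for test_name, test_subset in test_sets.items():
        acc = accuracy_score(
            test_subset["luminal_a"], model.predict(test_subset[feature_cols])
        )
        print(f"train={train_name:5s} test={test_name:5s} accuracy={acc:.1%}")
```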
Results are presented in Table 3, in which performance is quantified using the percentage classification accuracy. We performed this experiment using all radiomics features, just the whole breast features, just the FGT features and just the tumour features. We can see that in terms of overall performance, the models trained using all data and the White-only data had higher accuracy than the models trained using Black-only data, reflecting the impact of different training set sizes. Regarding race-specific performance, the models trained using all training data (i.e. 325/101 White/Black subjects) performed slightly better on White subjects, likely reflecting the effect of training set imbalance. The difference in performance in favour of White subjects varied from 3% to 11% (mean 6.25%), depending on the subset of features used. The models trained using White-only data showed a larger performance disparity in favour of White subjects, varying from 6% to 11% (mean 9%). The models trained using Black-only data showed generally better performance on Black subjects (mean 3.5% difference), although the model trained using all radiomics features was 1% better for White subjects. In contrast, the model trained using whole breast radiomics features performed 10% better for Black subjects. With the exception of this last result, there was in general no noticeable difference in bias between the models trained using all radiomics features, just whole breast features, just FGT features and just tumour features, which is consistent with the similar race classification results reported in Sect. 4.1.
Table 3.
Tumour molecular subtype classification accuracy for Luminal A vs. non-Luminal A task. Results presented as percentage accuracy and reported for training/testing using all subjects, White subjects only and Black subjects only. From top-to-bottom: results computed using all radiomics features, just whole breast features, just fibroglandular tissue (FGT) features and just tumour features.
| All features | Train: All | Train: White | Train: Black |
|---|---|---|---|
| Test: All | 65% | 65% | 60% |
| Test: White | 68% | 67% | 60% |
| Test: Black | 57% | 58% | 59% |

| Whole breast | Train: All | Train: White | Train: Black |
|---|---|---|---|
| Test: All | 61% | 62% | 53% |
| Test: White | 62% | 63% | 51% |
| Test: Black | 57% | 57% | 61% |

| FGT | Train: All | Train: White | Train: Black |
|---|---|---|---|
| Test: All | 67% | 64% | 61% |
| Test: White | 68% | 67% | 60% |
| Test: Black | 62% | 56% | 62% |

| Tumour | Train: All | Train: White | Train: Black |
|---|---|---|---|
| Test: All | 67% | 65% | 59% |
| Test: White | 68% | 67% | 58% |
| Test: Black | 65% | 57% | 61% |
4.3. Covariate Analysis
Next we investigated a range of covariates to test for the presence of confounding variable(s) that could explain the observed bias. From the full set of patient data available within the dataset we selected those variables that could most plausibly have associations with both race and model performance. These variables are summarised in Table 4. For the continuous variable (i.e. age), the table shows the median and lower/upper quartiles for White and Black patients separately. For categorical variables (i.e. all other variables), counts and percentages are provided. The p-values were computed using a Mann-Whitney U test for age and Chi-square tests of independence for all other variables. We can see that three of the covariates showed significant differences (at the 0.05 significance level) in their distributions between White and Black subjects: age, estrogen receptor status and neoadjuvant chemotherapy.
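The covariate tests are standard; the sketch below shows how they might be run with SciPy, assuming a DataFrame `df` with illustrative `race`, `age` and `er_status` columns (the latter two names are our own, standing in for age at diagnosis and estrogen receptor status).

```python
# Sketch of the covariate tests: Mann-Whitney U for the continuous variable
# (age) and a chi-square test of independence for a categorical variable
# (estrogen receptor status); column names are illustrative assumptions.
import pandas as pd
from scipy.stats import chi2_contingency, mannwhitneyu

white = df[df["race"] == "White"]
black = df[df["race"] == "Black"]

# Continuous covariate: Mann-Whitney U test on age at diagnosis.
_, p_age = mannwhitneyu(white["age"].dropna(), black["age"].dropna())
print(f"Age p-value: {p_age:.3g}")

# Categorical covariate: chi-square test of independence on the
# race-by-status contingency table.
table = pd.crosstab(df["race"], df["er_status"])
chi2, p_er, dof, _ = chi2_contingency(table)
print(f"Estrogen receptor status p-value: {p_er:.3g}")
```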
Table 4.
Distributions of covariates in the dataset by race (White and Black subjects only). The continuous variable is reported as median (M) with lower (L) and upper (U) quartiles. Categorical variables are reported as count (N) and percentage (%). p-values calculated using Mann-Whitney U tests for continuous variables and Chi-square tests of independence for categorical variables.
| Covariate | Category | White | Black | p-value |
|---|---|---|---|---|
| Age at diagnosis (years, M(L,U)) | – | 53.3 (45.9, 61.8) | 50.5 (44.0, 58.5) | 0.012 |
| Scanner (N/%) | GE | 451/69.3% | 134/66.0% | 0.430 |
| | Siemens | 200/30.7% | 69/34.0% | |
| Field strength (N/%) | 1.5T | 315/48.4% | 111/54.7% | 0.258 |
| | 2.89T | 1/0.1% | 0/0.0% | |
| | 3T | 335/51.5% | 92/45.3% | |
| Menopause at diagnosis (N/%) | Pre | 276/42.4% | 94/46.3% | 0.574 |
| | Post | 364/55.9% | 105/51.7% | |
| | N/A | 11/1.7% | 4/2.0% | |
| Estrogen receptor status (N/%) | Positive | 510/78.3% | 123/60.6% | 7.430e-07 |
| | Negative | 141/21.7% | 80/39.4% | |
| Human epidermal growth factor 2 receptor status (N/%) | Positive | 111/17.1% | 36/17.7% | 0.906 |
| | Negative | 540/82.9% | 167/82.3% | |
| Adjuvant radiation therapy (N/%) | Yes | 434/67.7% | 144/71.0% | 0.341 |
| | No | 210/32.3% | 58/29.0% | |
| Neoadjuvant radiation therapy (N/%) | Yes | 13/2.0% | 7/3.4% | 0.358 |
| | No | 632/98.0% | 196/96.6% | |
| Adjuvant chemotherapy (N/%) | Yes | 391/63.1% | 108/57.1% | 0.167 |
| | No | 229/36.9% | 81/42.9% | |
| Neoadjuvant chemotherapy (N/%) | Yes | 178/28.1% | 91/46.9% | 1.593e-06 |
| | No | 455/71.9% | 103/53.1% | |
As stated earlier, non-luminal breast cancer, which is generally estrogen receptor negative, is more commonly seen in Black subjects than White subjects [2,8]. In addition, this type of cancer is more commonly treated with neoadjuvant chemotherapy, whereas luminal breast cancer is typically treated with surgery followed by endocrine therapy and chemotherapy [4,23]. This may contribute to the statistically significant differences seen in these covariates.
5. Discussion and Conclusions
The main contribution of this paper has been to present the first investigation focused on potential bias in AI models trained using radiomics features. The work described in [20] also reported the performance of their radiomics-based AI models broken down by race. However, in our work we have performed a more controlled analysis to investigate the potential for bias and its possible causes. As a second key contribution, our paper represents the first investigation into bias in AI models based on breast DCE-MRI.
Our key findings are that: (i) radiomics features derived from breast DCE-MRI data contain race-identifiable information, leading to the potential for bias in AI models trained using such data, and (ii) RF models trained to predict tumour molecular subtype seem to exhibit biased behaviour when trained using race-imbalanced training data.
These findings show that the process of producing hand-crafted features such as radiomics features does not remove the potential for bias from the imaging data, and so further investigation of the performances of other similar models is warranted. However, an unanswered question is whether the production of hand-crafted features reduces the potential for bias. To investigate this, in future work we will compare bias in radiomics-based AI models to similar image-based AI models.
Our analysis of covariates did highlight several possible confounders, so we emphasise that the cause of the bias we have observed remains to be established. In future work we will perform further analysis of these potential confounders, including of interactions between multiple variables, to help determine this cause.
Interestingly, the work described in [20], which addressed the same Luminal A vs. non-Luminal A classification task using the same dataset, did not find a statistically significant difference in performance between races. However, there are a number of differences between our work and [20]. First, [20] used all training data (half of the full dataset) when training their RF models, i.e. they did not create deliberately imbalanced training sets as we did. Therefore, their race distribution was presumably similar to that of the full dataset (i.e. 70% White, 22% Black, 8% other races). It may be that this was not a sufficient level of imbalance to result in biased performance, and/or that the presence of other races (apart from White and Black) in the training and test sets reduced the bias effect. Second, we also note that the comparison performed in [20] was between White and other races, whereas we compared White and Black races. Third, in [20] a feature selection step was employed to optimise the performance of their models. It is possible that this reduced the potential for bias by removing features that contained race-specific information, although our race classification results (see Sect. 4.1) suggest that this information is present across all categories of feature.
In this work we have focused on distributional shift in imaging data (and derived features) as a cause of bias, but bias can also arise from other sources, such as bias in data acquisition, annotations, and use of the models after deployment [3,13]. We emphasise that by focusing on this specific cause of bias we do not believe that others should be neglected, and we argue for the importance of considering possible bias in all parts of the healthcare AI pipeline.
Finally, this paper has focused on highlighting the presence of bias, and we have not addressed the important issue of what should be done about it. Bias mitigation techniques have been proposed and investigated in a range of medical imaging problems [19,25,26], and approaches such as these may have a role to play in addressing the bias we have uncovered in this work. However, when attempting to mitigate bias one should bear in mind that the classification tasks of different protected groups may have different levels of difficulty, making it challenging to eliminate bias completely. Furthermore, one should take care to ensure that the performances of the protected groups are ‘levelled up’ rather than ‘levelled down’ [17] to avoid causing harm to some protected groups.
Acknowledgements
This work was supported by the National Institute for Health Research (NIHR) Biomedical Research Centre at Guy’s and St Thomas’ NHS Foundation Trust and King’s College London, United Kingdom. Additionally this research was funded in whole, or in part, by the Wellcome Trust, United Kingdom WT203148/Z/16/Z. The views expressed in this paper are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.
Footnotes
1. The dataset is publicly available at: https://doi.org/10.7937/TCIA.e3sv-re93.
References
- 1. Abbasi-Sureshjani S, Raumanns R, Michels BEJ, Schouten G, Cheplygina V. Risk of training diagnostic algorithms on data with demographic bias. In: Cardoso J, et al. (eds.) IMIMIC/MIL3ID/LABELS 2020, LNCS. Cham: Springer; 2020. pp. 183–192.
- 2. Abd El-Rehim DM, et al. Expression of luminal and basal cytokeratins in human breast carcinoma. J Pathol. 2004;203(2):661–671. doi: 10.1002/path.1559.
- 3. Chen IY, Pierson E, Rose S, Joshi S, Ferryman K, Ghassemi M. Ethical machine learning in healthcare. Annu Rev Biomed Data Sci. 2021;4(1):123–144. doi: 10.1146/annurev-biodatasci-092820-114757.
- 4. Domergue C, et al. Impact of HER2 status on pathological response after neoadjuvant chemotherapy in early triple-negative breast cancer. Cancers. 2022;14(10):2509. doi: 10.3390/cancers14102509.
- 5. Gichoya JW, et al. AI recognition of patient race in medical imaging: a modelling study. Lancet Digit Health. 2022. doi: 10.1016/S2589-7500(22)00063-2.
- 6. Guo LN, Lee MS, Kassamali B, Mita C, Nambudiri VE. Bias in, bias out: underreporting and underrepresentation of diverse skin types in machine learning research for skin cancer detection - a scoping review. J Am Acad Dermatol. 2021;87(1):157–159. doi: 10.1016/j.jaad.2021.06.884.
- 7. Ioannou S, Chockler H, Hammers A, King AP. A study of demographic bias in CNN-based brain MR segmentation. In: Abdulkadir A, et al. (eds.) Machine Learning in Clinical Neuroimaging, LNCS. Cham: Springer; 2022.
- 8. Jones VC, Kruper L, Mortimer J, Ashing KT, Seewaldt VL. Understanding drivers of the Black:White breast cancer mortality gap: a call for more robust definitions. Cancer. 2022;128(14):2695–2697. doi: 10.1002/cncr.34243.
- 9. Larrazabal AJ, Nieto N, Peterson V, Milone D, Ferrante E. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proc Natl Acad Sci USA. 2020;117(23):12592–12594. doi: 10.1073/pnas.1919012117.
- 10. Lee T, Puyol-Antón E, Ruijsink B, Shi M, King AP. A systematic study of race and sex bias in CNN-based cardiac MR segmentation. In: Camara O, et al. (eds.) Statistical Atlases and Computational Models of the Heart (STACOM 2022), LNCS. Cham: Springer; 2022.
- 11. Martini R, et al. African ancestry-associated gene expression profiles in triple-negative breast cancer underlie altered tumor biology and clinical outcome in women of African descent. Cancer Discov. 2022;12(11):2530–2551. doi: 10.1158/2159-8290.CD-22-0138.
- 12. McCarthy AM, et al. Racial differences in quantitative measures of area and volumetric breast density. J Natl Cancer Inst. 2016;108(10). doi: 10.1093/jnci/djw104.
- 13. McCradden MD, Joshi S, Mazwi M, Anderson JA. Ethical limitations of algorithmic fairness solutions in health care machine learning. Lancet Digit Health. 2020;2(5):e221–e223. doi: 10.1016/S2589-7500(20)30065-0.
- 14. Mehrabi N, Morstatter F, Saxena N, Lerman K, Galstyan A. A survey on bias and fairness in machine learning. ACM Comput Surv. 2021;54(6):1–35.
- 15. Moore JX, Han Y, Appleton C, Colditz G, Toriola AT. Determinants of mammographic breast density by race among a large screening population. JNCI Cancer Spectr. 2020. doi: 10.1093/jncics/pkaa010.
- 16. Petersen E, et al. Feature robustness and sex differences in medical imaging: a case study in MRI-based Alzheimer's disease detection. In: Wang L, Dou Q, Fletcher PT, Speidel S, Li S (eds.) Medical Image Computing and Computer Assisted Intervention - MICCAI 2022, LNCS. Cham: Springer; 2022.
- 17. Petersen E, Holm S, Ganz M, Feragen A. The path toward equal performance in medical machine learning. Patterns. 2023;4(7):100790. doi: 10.1016/j.patter.2023.100790.
- 18. Puyol-Antón E, et al. Fairness in cardiac magnetic resonance imaging: assessing sex and racial bias in deep learning-based segmentation. Front Cardiovasc Med. 2022;9:859310. doi: 10.3389/fcvm.2022.859310.
- 19. Puyol-Antón E, et al. Fairness in cardiac MR image analysis: an investigation of bias due to data imbalance in deep learning based segmentation. In: de Bruijne M, et al. (eds.) MICCAI 2021, LNCS. Cham: Springer; 2021. pp. 413–423.
- 20. Saha A, et al. A machine learning approach to radiogenomics of breast cancer: a study of 922 subjects and 529 DCE-MRI features. Br J Cancer. 2018;119:508–516. doi: 10.1038/s41416-018-0185-8.
- 21. Seyyed-Kalantari L, Zhang H, McDermott MBA, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat Med. 2021;27(12):2176–2182. doi: 10.1038/s41591-021-01595-0.
- 22. Stanley EAM, Wilms M, Mouches P, Forkert ND. Fairness-related performance and explainability effects in deep learning models for brain image analysis. J Med Imaging. 2022;9(6):061102. doi: 10.1117/1.JMI.9.6.061102.
- 23. Uchida N, Suda T, Ishiguro K. Effect of chemotherapy for luminal A breast cancer. Yonago Acta Med. 2013;56(2):51–56.
- 24. Wang R, Chaudhari P, Davatzikos C. Bias in machine learning models can be significantly mitigated by careful training: evidence from neuroimaging studies. Proc Natl Acad Sci USA. 2023;120(6):e2211613120. doi: 10.1073/pnas.2211613120.
- 25. Zhang H, Dullerud N, Roth K, Oakden-Rayner L, Pfohl S, Ghassemi M. Improving the fairness of chest X-ray classifiers. In: Proceedings of the Conference on Health, Inference, and Learning; 2022. pp. 204–233.
- 26. Zong Y, Yang Y, Hospedales T. MEDFAIR: benchmarking fairness for medical imaging. In: Proceedings of the International Conference on Learning Representations (ICLR); 2023.
