Causal debiasing for unknown bias in histopathology—A colon cancer use case

Ramón L Correa-Medero; Rish Pai; Kingsley Ebare; Daniel D Buchanan; Mark A Jenkins; Amanda I Phipps; Polly A Newcomb; Steven Gallinger; Robert Grant; Loic Le marchand; Imon Banerjee

doi:10.1371/journal.pone.0303415

. 2024 Nov 22;19(11):e0303415. doi: 10.1371/journal.pone.0303415

Causal debiasing for unknown bias in histopathology—A colon cancer use case

Ramón L Correa-Medero ^1,², Rish Pai ³, Kingsley Ebare ³, Daniel D Buchanan ^4,^5,¹⁴, Mark A Jenkins ^5,¹⁵, Amanda I Phipps ^6,⁹, Polly A Newcomb ^6,⁹, Steven Gallinger ^7,^12,¹³, Robert Grant ^7,^10,¹¹, Loic Le marchand ⁸, Imon Banerjee ^1,^2,^*

Editor: Guanghui Liu¹⁶

¹School of Computing And Augmented Intelligence Arizona State University, Phoenix, Arziona, United States of America

²Mayo Clinic Arizona Department of Radiology, Phoenix, Arziona, United States of America

³Department Of Pathology Mayo Clinic Arizona, Phoenix, Arizona, United States of America

⁴Colorectal Oncogenomics Group, Department of Clinical Pathology, The University of Melbourne, Melbourne, Victoria, Australia

⁵University of Melbourne Centre for Cancer Research, Victorian Comprehensive Cancer Centre, Melbourne, Victoria, Australia

⁶Public Health Sciences Division, Fred Hutchinson Cancer Center, Seattle, Washington, United States of America

⁷Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, Ontario, Canada

⁸Department of Epidemiology, University of Hawaii, Honolulu, Hawaii, United States of America

⁹Department of Epidemiology, University of Washington, Seattle, Washington, United States of America

¹⁰Vector Institute, Toronto, Ontario, Canada

¹¹Division of Medical Oncology and Hematology, Princess Margaret Cancer Centre, Toronto, Ontario, Canada

¹²Ontario Institute for Cancer Research, Toronto, Ontario, Canada

¹³Hepatobiliary Pancreatic Surgical Oncology Program, University Health Network, Toronto, Ontario, Canada

¹⁴Genomic Medicine and Family Cancer Clinic, Royal Melbourne Hospital, Parkville, Victoria, Australia

¹⁵Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, Carlton, Victoria, Australia

¹⁶State University of New York at Oswego, UNITED STATES OF AMERICA

Competing Interests: The authors have declared that no competing interests exist.

^✉

* E-mail: Banerjee.Imon@mayo.edu

Roles

Ramón L Correa-Medero: Formal analysis, Methodology, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

Rish Pai: Conceptualization, Data curation, Resources, Supervision, Writing – review & editing

Kingsley Ebare: Data curation, Writing – review & editing

Daniel D Buchanan: Data curation, Writing – review & editing

Mark A Jenkins: Data curation, Writing – review & editing

Amanda I Phipps: Data curation, Writing – review & editing

Polly A Newcomb: Data curation, Writing – review & editing

Steven Gallinger: Data curation, Writing – review & editing

Robert Grant: Data curation, Writing – review & editing

Loic Le marchand: Data curation, Writing – review & editing

Imon Banerjee: Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Writing – original draft, Writing – review & editing

Guanghui Liu: Editor

PMCID: PMC11584097 PMID: 39576760

Abstract

Advancement of AI has opened new possibility for accurate diagnosis and prognosis using digital histopathology slides which not only saves hours of expert effort but also makes the estimation more standardized and accurate. However, preserving the AI model performance on the external sites is an extremely challenging problem in the histopathology domain which is primarily due to the difference in data acquisition and/or sampling bias. Although, AI models can also learn spurious correlation, they provide unequal performance across validation population. While it is crucial to detect and remove the bias from the AI model before the clinical application, the cause of the bias is often unknown. We proposed a Causal Survival model that can reduce the effect of unknown bias by leveraging the causal reasoning framework. We use the model to predict recurrence-free survival for the colorectal cancer patients using quantitative histopathology features from seven geographically distributed sites and achieve equalized performance compared to the baseline traditional Cox Proportional Hazards and DeepSurvival model. Through ablation study, we demonstrated benefit of novel addition of latent probability adjustment and auxiliary losses. Although detection of cause of unknown bias is unsolved, we proposed a causal debiasing solution to reduce the bias and improve the AI model generalizibility on the histopathology domain across sites. Open-source codebase for the model training can be accessed from https://github.com/ramon349/fair_survival.git

1 Introduction

Definitive diagnosis of tissue lesions often requires the use of histopathology analysis. Digitization of standard tissue glass slides into Whole Slide Images (WSIs) has opened a new possibility for the use of advanced computer vision techniques not only for diagnosis but also for prognosis of advanced diseases [1, 2]. Colorectal cancer is particularly challenging as highly complex image appearance often results in uncertain prognosis [3]. It thus serves as the ultimate use case for computer vision, where an artificial intelligence (AI) model can be trained to learn patterns indicative of poor prognosis. Pai et al. developed an AI-based model to reliably stratify colon cancer samples by their risk for recurrence better than using standard risk factors. [4]. Introducing potential for AI models to improve the workloads of human pathologists. Such models can even be applied to other populations where the availability of expert pathologists may be limited [2].

Deployment of AI models depends on the validation of model performance on external datasets [5]. However, the accuracy of the model is often significantly degraded on external datasets, as reported by many studies in the literature. Drops in performance are usually found to be caused by models being biased to spurious patterns in the training data [6]. As a result, many machine learning methods are unsuccessful when applied to data from unseen hospital sites despite achieving promising results on their development dataset. In addition, disparities in model predictions can reflect and thus propagate existing health disparities among underrepresented populations [7]. In a recent study, Howard et al. [8] reported that some models trained on TCGA datasets use detected source site information to predict prognosis or mutation states. Such performance could be based on the fact that the distribution of clinical information, such as survival and gene expression patterns, significantly differs among the samples provided by various laboratories. Many authors [9] considered differences in slide staining as a primary factor for the imaging bias and tried to solve it by color augmentation and stain-normalization [10–12]. In addition to acquisition-specific bias, such as scanner configuration and noise, stain variation, and artifacts, bias may exist in the targeted population. For example, Zech et.al. [13] showed that AI can diagnose pneumonia from radiology images, but it uses radiographic markers used in portable intensive care unit scanners as surrogates. Similarly, Rueckel et al. found a pneumothorax detection model that used shortcuts based on inserted chest tubes [14]. It has also been observed that imaging AI models learn spurious age, sex, and race correlations from images when trained for seemingly unrelated tasks [15]. More concerning direction, studies have shown AI imaging models demonstrate bias against historically underserved subgroups of age, sex, and race in disease diagnosis [16].

Given the complexity of data collection and the risk of associated bias, a significant amount of work has been done to solve bias using computational techniques [17]. Bias mitigation techniques focus on the explicit use of target biasing attributes to reduce disparities. However, this also serves as a limiting factor, given that the biasing variable must be known and made available for the development of a new model. Targeting one particular attribute leads to another challenge known as the “whack-a-mole” phenomenon, where other biasing attributes are then enhanced, restarting the cycle [18].

Computational approaches view generalization as two possible types of data shift. Domain shift sees the distribution of input features to models change across dataset sites. This could be due to differences in staining or different acquisition devices. Label shift, on the other hand, is a change in the labels with the same image appearance. We observed this event where different institutions/professionals disagreed on labeling a slide as either stage 2 or stage 3, subtle differences that impact model performance.

Alabdoulmohsin et al. proposed that the key challenge to generalization across sites is a mixture of domain and label shift, making the use of only one technique suboptimal. This work discusses how the dual domain and label shift is caused by a change in the latent population. This latent population is present across all data sites; however, its occurrence across domains is shifted such that the relationship between patients and diseases is changed. The main challenge is thus to identify this latent population in order to leverage existing adaptation techniques. Their group introduced a causal analysis framework where this hidden population captured by hidden variable U can be learned from the data should several additional proxy variables be available. The causal diagram in Fig 3 demonstrates the relationship between the hidden variable U, the task variable Y, and the image feature X. Two additional variables, proxy W and C, are introduced to better estimate the hidden variable U. The variable C is a concept variable following the work of Koh et al. [19]. Concepts are used to guide a model in learning feature representations directly related to the main task through auxiliary training tasks. The proxy variable W, identified by researchers, provides a potential glimpse into the hidden population in the data and is used to guide the learning of the latent population. The group proposed a methodology to learn the hidden population indicator and used domain adaptation techniques to adapt the model to new populations, lessening the effect of the shift on the model performance.

In this work, we propose a novel survival model by incorporating the concept of latent-shift to reduce the effect of unknown bias by treating the generalization as a domain adaptation problem. We followed Alabdulmohsin et al. [20] unsupervised domain adaptation technique to the latent distribution shifts, which generalize the standard settings of covariate and label shift of the domain. We leverage auxiliary data in the source domain in the form of tumor staging and hospital site information, a proxy for socioeconomic status, and apply it to derive an identification strategy for the optimal predictor under the target distribution.

2 Methods

2.1 Dataset

The study population consisted of patients with colorectal carcinoma from the Colon Cancer Family Registry (CCFR, participating sites: Seattle, Australia, Mayo Clinic, Ontario, and Hawaii) as well as three sites external to the CCFR: University of Pittsburgh Medical Center (UPMC), Mount Sinai Hospital Toronto, and Seattle-Puget Sound (Access) Cancer Registry [21]. The CCFR enrolled participants after colorectal carcinoma diagnosis with prospective follow-up. The UPMC and Mount Sinai cohorts consisted of consecutively resected colorectal carcinoma at these institutions between 2010 and 2015 and 2011 and 2016, respectively. The Seattle-Puget Sound Cancer Registry cohort has been previously described and consists of patients between 20 and 74 years of age diagnosed with CRC between 2016 and 2018. Recurrence was assessed by manual review of medical records and was available for 3,349 stage I–III CRCs with a median follow-up of 58 months. For the prognostic model training and external validation, the stage I–III CCFR CRCs with recurrence data (n = 2,411) formed the internal cohort, and the UPMC, Mount Sinai, and Seattle-Puget CRCs (n = 938) formed the external validation cohort. This study was approved by the Mayo Clinic institutional review board (IRB 806–96 and 18–11309). The data was available to the research team on Jan 14, 2019. No patients were involved in any part of the study, including concept and study design, data collection, analysis and interpretation, drafting of the manuscript, and critical revision.

Table 1 shows the center, disease, and demographic characteristics of train, validation, and test sets that were randomly selected. Fig 1 shows the survival data for both internal and external sites Difference in survival rates between populations was measured using the log-rank test, a metric which quantifies the difference in survival rates between two populations [22]. P-values from the log-rank test are measured by considering Ontario, the largest site, as a reference. We found survival rates from Mayo Clinic and UPMC are not significantly different from the Ontario data. However, the remaining sites, Seattle, Australia, Mt.Sinai, and Access, have significantly different survival rates (based on p-values).

Table 1. Description of the dataset—Shows the characteristics of both internal and external datasets.

	Grouped by Data Split
		Overall	external-test	internal-test	internal-train	internal-val
n	–	3411	946	766	1456	243
center, n (%)	ACCESS	128 (3.8)	128 (13.5)	–	–
	Mt. Sinai	361 (10.6)	361 (38.2)	–	–
	UPMC	457 (13.4)	457 (48.3)	–	–
	Australia	233 (6.8)	–	75 (9.8)	136 (9.3)	22 (9.1)
	Hawaii	37 (1.1)	–	8 (1.0)	25 (1.7)	4 (1.6)
	Mayo	561 (16.4)	–	182 (23.8)	326 (22.4)	53 (21.8)
	Ontario	1118 (32.8)	–	347 (45.3)	657 (45.1)	114 (46.9)
	Seattle	516 (15.1)	–	154 (20.1)	312 (21.4)	50 (20.6)
Stage, n (%)	1	689 (20.2)	209 (22.1)	151 (19.7)	278 (19.1)	51 (21.0)
	2	1364 (40.0)	376 (39.7)	292 (38.1)	601 (41.3)	95 (39.1)
	3	1358 (39.8)	361 (38.2)	323 (42.2)	577 (39.6)	97 (39.9)
Time to event, mean (SD)		42.2 (20.7)	38.3 (22.2)	43.9 (19.9)	43.8 (19.9)	42.5 (20.3)

Open in a new tab

Fig 1 — P-values are computed as a log-rank test using the internal site Ontario as a reference (ref).

2.2 Quantitative image feature extraction

We followed the standard histopathology image acquisition pipeline for image staining and scanning; however, given the geographical distance, the slides were scanned at different locations. From each of the surgically resected colorectal cancer studies, one representative H&E slide was digitized using Leica Aperio GT450 or AT2 at ×40 magnification. The images from the internal cohort were stained with H&E and scanned at Mayo Clinic. The Seattle-Puget cases were stained with H&E at Fred Hutchison Cancer Center and scanned at Mayo Clinic. The UPMC and Mount Sinai cases were stained with H&E and scanned at their respective institutions. After staining, all images were uploaded to the Aiforia Create deep learning cloud-based platform (Aiforia Technologies, Helsinki, Finland), a commercially available platform designed explicitly for histologic images. Each image was manually reviewed by an expert gastrointestinal pathologist (Dr. Pai), and the entire tumor bed was outlined (median 96.5 mm² analyzed per CRC).

Previously developed deep learning quantitative segmentation algorithm, QuantCRC [23], was applied to the tumor bed to segment colorectal carcinoma digitized images into 13 regions and one object (Fig 2). The QuantCRC algorithm uses convolutional neural networks (CNN) to segment the image in a stepwise manner and is trained using 24,157 annotations made on 559 images, which are not used in this study. First, the tumor bed is segmented into carcinoma, TB/PDC, stroma, mucin, necrosis, fat, and smooth muscle. The second layer segments stroma into immature (loose, often myxoid stroma with haphazardly arranged plump fibroblasts and collagen fibers), mature (densely collagenous areas with scattered fibroblasts, often with parallel collagen fibers), and inflammatory (dense clusters of chronic inflammatory cells obscuring stromal cells) subtypes. The third layer segments carcinoma into low-grade, high-grade, and signet ring cell carcinoma (SRCC). The fourth layer identifies TILs. To re-validate QuantCRC, 30 images (15 from each GT450 and AT2 scanners) were selected from the 3,349 CRCs with recurrence data. For layers 1 to 3, the algorithm output was compared with annotations by five gastrointestinal pathologists. The results from the segmentation algorithm compare favorably with annotations by gastrointestinal pathologists for all features. The most variation was seen between immature and mature stroma, where disagreement among the five pathologist raters was noted. This is likely related to the fact that stroma subtyping is a new concept and is not done routinely by pathologists for clinical practice.

Fig 2 — First, the image is segmented into carcinoma (green), stroma (light blue), mucin (dark blue), TB/PDC (red), necrosis (brown), smooth muscle (purple, and fat (yellow). Next, the stroma is segmented into immature (teal), mature (green), and inflammatory (gray). The carcinoma is segmented into low-grade (purple), high-grade (orange), and signet ring cells (light green). Finally, TILs are recognized as objects (blue dots) within the tumor. After this segmentation, 15 features are calculated from each image as shown. Abbreviations: B, tumor bed; ST, stromal region.

For each image, fat and smooth muscle were subtracted from the tissue area to generate the tumor bed area. From the tumor bed area, the following 15 quantitative parameters were measured: %tumor, %stroma, tumor:stroma ratio, %TB/PDC within the tumor, %mucin within the tumor, %necrosis within the tumor bed, %high-grade, %SRCC, TILs per mm2 of tumor, %immature stroma (tumor bed), %inflammatory stroma (tumor bed), % mature stroma (tumor bed), %immature (stromal region), % inflammatory (stromal region), and %mature (stromal region). These 15 quantitative parameters comprise the QuantCRC features used for downstream analysis.

2.3 Causal survival model

Fig 3 presents the proposed causal modeling framework for recurrence-free survival estimation across multiple sites. Our primary hypothesis is that by using the proxy (W) and auxiliary (C) variable in a causal framework, we will be able to adapt to the latent subgroup shift that appears between sites without knowing the cause of bias and able to predict the survival Y given the input X (Fig 3.i). Following [20], if we have two domains (P source and Q target), we frame the learning Q(Y|X) as identification problem where observations are drawn from P(X, C, Y, W) and Q(X). We assume that C and W are observed in the source distribution and thus can be used during learning when U is unobserved and U is a discrete variable. We also assume that conditional dependencies in the data exist iff they exist in the graph (Fig 3.i). Note that we only observe (X, C, Y, W) in the source P and X in the target Q.

We adopted the latent shift causal framework for predicting recurrence-free survival (Y) of colorectal carcinoma patients using the quantitative histopathology features (X). Given the issue of not knowing the unobserved bias variable, we model the treatment site as a proxy (W) and the cancer stage as a concept (C). We assume that input histopathology features (X) are affected by the site due to the bias in acquisition and population variations, and the cancer stage directly affects recurrence-free survival (Y). The learning uses three core components (Fig 3).

2.3.1 Auto encoder training (Module A)

First, in module A, we learn the discrete unobserved latent variable U by training an auto encoder to reconstruct the input, i.e., QuantCRC features. The auto-encoder consists of a single linear layer that produces a smaller ten-dimensional tensor. Followed by a single decoder layer that reproduced the original input. Mean squared error loss is used to guide the auto-encoder’s training. The latent space is constrained to make a discrete representation using the Gumbel-softmax trick where through annealing the temperature value, the output probabilities become discrete one hot representation [24]. Allowing us to ultimately use the latent space to represent the discrete latent variable U. In addition to the reconstruction loss for the features, the encoder component is also trained with two auxiliary classifiers—(i) classify tumor staging (classify concept vector C), (ii) predict hospital of origin (predict the proxy W). The classification task uses categorical cross-entropy losses. KL divergence (KLD) against uniform prior $\tilde{U}$ is added to reduce the potential for prior collapse, a known issue in autoencoders. Eq 1, demonstrates the final loss for training the autoencoder model where X is the input and reconstructed variables are represented with a hat.

\begin{matrix} L_{t o t a l} = β_{X} L_{M S E} (X, \hat{X}) + β_{C} L_{C E} (C, \hat{C}) + β_{W} L_{C E} (W, \hat{W}) + β_{k l} L_{K L D} (\hat{U}, \tilde{U}) \end{matrix}

(1)

2.3.2 Latent estimation (Module B)

To properly estimate the latent shift between source P and target site Q, we would need a reliable estimate of the latent variable probabilities. Fig 3 module B demonstrates the parallel latent estimation branch trained to predict U from the original QuantCRC features directly without referring to the proxy and concept loss. The auto-encoder is frozen and provides the ground truth latent variable labels. The estimation branch is thus trained using the binary cross entropy loss between predicted latent $\hat{U}$ and auto encoder latent U. The outputs of this module are used to modify the final risk prediction seen in Eq 2 by providing the terms P(U = i|X), Q(U = i), P(U = i).

2.4 Recurrence risk prediction (Module C)

In the final stage of the training process (module C), we train a final risk prediction model that leverages the QuantCRC features (X) and learned latent variable U (module A) to predict the risk of recurrence at different time points. The model is trained using the negative log-likelihood loss until convergence. Adaptation to the target domain is done using the QuantCRC features of the external data (domain Q), the latent estimation branch (module B) is used to predict latent variables, and a shift ratio is then estimated. The estimates from the training data (domain P) and external data (domain Q) are compared, and the difference is used to re-scale the model predictions (see Eq 2). The recurrence risk estimation head (module C) is then fine-tuned using the internal data (P) to account for the new latent shift estimations based on an estimate of U.

\begin{matrix} H R_{Q} (t, X) = H_{0} (t) \sum_{i = 0}^{k_{u}} H R_{P} (X, U = i) P (U = i | X) \frac{Q (U = i)}{P (U = i)} \end{matrix}

(2)

During inference, the model uses the predicted latent shift of the new data to re-scale predictions. For our use case, we will treat samples from the external dataset (Q) as a single domain for which we adjust our model predictions. All models were trained using the same parameters: batch size of 256, Adam optimizer with a learning rate of 0.01, and early stopping based on validation loss.

2.5 Model evaluation

Model performance was evaluated for two essential metrics. We first leveraged the concordance index (C-index) to evaluate the predictive power of our mode [25]. The C-index is similar to the area under the receiver operating curve metric with the extension of being aware of time and occurrence of events of interest. The metric functions by comparing concordant and discordant pairs based on the ordering of events and the individual’s risk factor. The second metric we used was to evaluate the AUC over time, where for each time point in the study, 2- 58 months, we evaluated the predictive power of the model’s risk score to predict recurrence. The main distinction is that the AUC is evaluated across each time point. At the same time, the C-index will provide a single value for the model’s predictions while accounting for the occurrence of censoring. Furthermore, to provide a robust statistical comparison, we ran a procedure known as auto-bootstrapping. Where the test set is subsampled randomly over 100 iterations, and performance is calculated on a randomly selected subset. A 95% confidence interval is calculated for our concordance metric to measure the robustness. Each bootstrap run utilized the same random seed for reproducibility and fair comparison, ensuring each model was evaluated on the same subset of samples.

3 Results

Feature correlation was measured using the Pearson correlation coefficient to find a link between center and survival as a potential source of bias. Fig 4 shows the pair-wise Pearson correlation coefficient values between QuantCRC features, stage, center, and recurrence. As we can observe, individual features do not have a high correlation (>0.5) with recurrence (Y); however, stroma and tumor bed-related features are highly correlated among themselves and also moderately correlated with center and stage. Interestingly, a combination of center and stage increase the correlation with the target recurrence Y, which follows our hypothesis that the unobserved variable (U) that affects the outcome (Y) can be estimated from the stage (C) and center (W).

As a baseline, we compared our proposed model performance against the traditional Cox Proportional Hazard (CPH) model trained using the lifelines package [26] and DeepSurv model [27]—a multi-layer deep neural network with hazard loss. Model evaluation was done by measuring the concordance index of the model’s risk score at 60 months, comparing the ranking of at-risk patients adequately. The 95% confidence interval was obtained for all the model performances using the autobootstraping. Overall and site-based performance is presented in Table 2.

Table 2. Performance of the models on the internal and external test sets in-terms of C-index.

95% confidence interval is calculated using auto-bootstraping. Bold front represents optimal performance.

Data	Center	Baseline(Cox Model)	DeepSurv	Causal Survival
Internal	Australia	0.644 (0.601,0.686)	0.641 (0.598,0.684)	0.673 (0.636,0.71)
	Mayo	0.703 (0.651,0.754)	0.734 (0.686,0.782)	0.736 (0.682,0.79)
	Ontario	0.658 (0.630,0.685)	0.609 (0.582,0.637)	0.640 (0.611,0.67)
	Seattle	0.727 (0.648,0.807)	0.775 (0.704,0.847)	0.737 (0.660,0.813)
External	ACCESS	0.643 (0.550,0.736)	0.632 (0.532,0.732)	0.660 (0.571,0.749)
	Mt. Sinai	0.698 (0.666,0.73)	0.661 (0.629,0.693)	0.684 (0.653,0.714)
	UPMC	0.711 (0.689,0.734)	0.675 (0.648,0.703)	0.716 (0.692,0.739)
Overall Internal		0.696 (0.677,0.714)	0.687 (0.665,0.708)	0.702 (0.683,0.72)
Overall External		0.698 (0.675,0.721)	0.651 (0.626,0.676)	0.699 (0.675,0.723)

Open in a new tab

Overall, the CPH baseline (CI internal: 0.696, external:0.698) and Causal Survival (CI internal:0.702, external:0.699) models achieve comparable performance on the overall internal and external test sets while DeepSurv performance is consistently lower with higher variability. However, the performance differences between the sites are significant in the baseline Cox model, e.g., the C-index for Seattle is 0.727, and for Australia is 0.644. While maintaining the overall performance, the Causal survival model reduces the disparities between the site performance, e.g., the C-index for Seattle is 0.737, and for Australia is 0.673. Similar results can be observed for the external dataset, such as that the causal survival model improved the performance of all the sites, including ACCESS.

In Fig 5, we represent comparative performance between the models in terms of the area under the operating characteristics curve (AUROC) at different time intervals. Causal survival model performance is consistently better for all the time points on the internal dataset, except two months, which has the lowest rate of recurrence. A similar consistent performance boost was observed for the external data for predicting short-term as well as long-term recurrence. On the external data, causal survival model has a huge performance boost over DeepSurv (≈ 5% AUC).

3.1 Ablation

To demonstrate the efficacy of our proposed adaptation of the causal domain adaptation framework, the modularized framework allows us to evaluate the effect of key modules on the final predictions. Table 3 highlights the ablation performance for different configurations.

Table 3. Configuration of ablation study.

Removed components of the causal survival model is marked as X. Performance measured as C-index with 95% confidence interval.

Latent Shift Probability Adjustment	Infer Using Latent Variable	Infer using Quant CRC Features	Internal-Test Concordance Index	External Test Concordance Index
X			0.712 (0.694,0.73)	0.691 (0.669,0.713)
	X		0.538 (0.513,0.562)	0.486 (0.462,0.51)
		X	0.702 (0.683,0.721)	0.686 (0.661,0.711)

Open in a new tab

i. Removal of Latent Shift Adjustment (dropped module C): First, we remove the proposed domain adjustment component from module C, i.e. P(U)/Q(U) and evaluate the performance on the internal and external test data. As expected, the removal of latent shift domain adjustment actually improved the performance on the internal data (P) but dropped the performance on the external test (Q). This shows the fact that the proposed latent shift adjustment helps to improve the generalizability of the model on the external data.
ii. Infer only using Latent Variable (dropped module B and C): As the second experiment, we only use the latent variable derived by module A and pass it through hazard layer to compute the survival risk. The performance of the model remains high in the internal test, which demonstrates the latent features from the encoder are able to capture task-relevant correlations. Given the latent estimation head and the adjustment parameter for the domain P are missing, the performance dropped on the external datasets.
iii. Infer only using QuantCRC features (dropped modules A, B, and C): In the final experiment, we only used the QuantCRC features. We observed that the model performance decreased to be in range with the Deepsurv model in the internal test set. Performance on the external set Drops to the lowest amongst all models. Suggesting the predictive power of the latent features adjusted based on proxy and concept is essential for proper risk prediction.

4 Discussion

Bias in AI models for the histopathology domain can arise from several sources—(i) Data bias: If the training data is not representative of the whole population, the model may not generalize well on new data. For example, if the training data only includes samples from one scanner, the model may not perform well on samples from other acquisition devices; (ii) Algorithmic bias: The algorithms used to train the model may introduce bias. For example, if the algorithm is designed to optimize for a specific metric (e.g. AUC), it may ignore other important factors. Such as disparities between the ethnic subgroups; (iii) Human bias: Human bias can be introduced during the data annotation process or when selecting the data used to train the model. It is important to identify and address these sources of bias to ensure that the AI model performs accurately and fairly. There existing supervised techniques in the field of domain adaptation that attempt to mitigate biases that are known. However, the true cause of bias is unknown or related to multiple factors.

In this work, we proposed a causal survival model that can reduce the effect of unknown bias via a causal reasoning framework incorporated within a deep learning paradigm. As our first use case, we evaluated the model for predicting recurrence for colorectal cancer patients across seven geographically distributed sites using the Colon Cancer Family Registry (CCFR) and showed improvement in performance across sites. Our primary contribution is to adapt the unsupervised domain adaptation technique to adjust the deep latent space of the out-of-domain samples and ultimately obtain analogous performance across in-domain and out-of-domain samples.

Interestingly, as shown in the correlation plots (Fig 4), cancer stage and recurrence-free survival across sites have limited to no correlation. Indicating that other factors beyond staging can mediate the risk of recurrence for colorectal cancer patients. Our causal modeling approach suggests that the models are able to learn to measure the hidden variations in data, in other words, ‘unknown bias’, and utilize it to improve prediction generalizability and efficiently handle distribution shifts between internal and external data. In particular, we observe that survival rates from Australia, Seattle, Mt. Sinai, and Access are significantly different (p < 0.1) from the reference internal site Ontario. Additionally, there is a positive correlation between the linear combination of center and stage and recurrence-free survival. The disparity is reflected in the models’ performance, where the baseline models (Cox and DeepSurv) underperformed or overperformed on these sets. Our proposed architecture demonstrates an overall improvement in both internal and external datasets. We observe a significant improvement in the risk prediction on the three external datasets without utilizing fine-tuning or data harmonization techniques. Our ablation study demonstrated that by learning the latent features with auxiliary losses based on potential bias factors, an improvement in model predictions can be obtained. Finally, incorporating the latent shift adjustment further guarantees the stability of model predictions in a new domain.

The proposed framework is an adaption of the unsupervised latent distribution shift method, where we introduce deep latent space features and a domain adjustment factor. We also extend the framework for the prognosis of recurrence-free survival for colorectal cancer patients using quantitative histopathology features. Though detection and interpretation of ‘unknown bias’ is still an unsolved challenge, our proposed solution estimates the bias via the proxy and concept variable and reduces the bias to improve AI model generalizability. This framework could be extended to other applications, such as breast cancer risk prediction and chest x-ray pathology detection, where it is possible to incur both domain and label shifts. The methodology could assist research in moving from a single source of the biasing attributes to tackle the true distribution shift by using the population latent shift.

4.1 Limitation

The study has several limitations. First, the causal survival framework requires the availability of two mediating concepts C and of a proxy variable W at training time. These measures might not be readily available for other studies, or they may not satisfy all the assumptions described in the framework. Another limitation is the quality of the auxiliary variables. In other studies, the variables may not be as easily learnable, producing an unreliable network. Furthermore, the causal assumptions are typically not testable as U is not observed. Second, we validated the model on the quantified histopathology features, but ideally, the paradigm can also be used on the raw image data. Third, we only observed a moderate improvement in performance due to the fact that the slides are scanned and digitized using a standard protocol, reducing variations between the scanned slides within CCFR. Furthermore, manual annotation variations were reduced in our data by having a single expert gastrointestinal pathologist drawing the tumor boundary. We believe that a scenario where multiple sites contribute to clinically scanned slides without a common protocol will result in site-specific variations amplifying biases. In such a case, our proposed method will provide better performance over baseline survival models.

Supporting information

S1 Data

(ZIP)

pone.0303415.s001.zip^{(1MB, zip)}

Data Availability

The data provided by Colon Cancer Family Registry (CCFR) are available by authorized request in dbGaP. Specific details regarding the data access can be found at https://coloncfr.org/for-researchers/db-gap/. Within the scope of this study, this data was used in accordance with the terms and conditions set forth by CCFR. Due to the CCFR data sharing restrictions, we are unable to include the sample data. Researchers interested in using these data should contact CCFR directly for information on how to obtain permission. Interested person need to submit an application or a data request form along with research proposal and need to sign a Data Use Agreement (DUA) that outlines the terms and conditions under which you can use the data. For any inquiries regarding data access or if you need assistance in obtaining third-party data from CCFR, please contact Mayo Clinic study site manager Aatikah Mouti at mouti.aatikah@mayo.edu.

Funding Statement

The author(s) received no specific funding for this work.

References

1. Ashman K, Zhuge H, Shanley E, Fox S, Halat S, Sholl A, et al. Whole slide image data utilization informed by digital diagnosis patterns. Journal of Pathology Informatics. 2022;13:100113. doi: 10.1016/j.jpi.2022.100113 [DOI] [PMC free article] [PubMed] [Google Scholar]
2. Madabhushi A, Lee G. Image analysis and machine learning in digital pathology: Challenges and opportunities. Medical image analysis. 2016;33:170–175. doi: 10.1016/j.media.2016.06.037 [DOI] [PMC free article] [PubMed] [Google Scholar]
3. Bousis D, Verras GI, Bouchagier K, Antzoulas A, Panagiotopoulos I, Katinioti A, et al. The role of deep learning in diagnosing colorectal cancer. Gastroenterology Review/Przegląd Gastroenterologiczny. 2023;18(1). doi: 10.5114/pg.2023.129494 [DOI] [PMC free article] [PubMed] [Google Scholar]
4. Pai RK, Banerjee I, Shivji S, Jain S, Hartman D, Buchanan DD, et al. Quantitative Pathologic Analysis of Digitized Images of Colorectal Carcinoma Improves Prediction of Recurrence-Free Survival. Gastroenterology. 2022;163(6):1531–1546.e8. doi: 10.1053/j.gastro.2022.08.025 [DOI] [PMC free article] [PubMed] [Google Scholar]
5. Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM. 2021;64(3):107–115. doi: 10.1145/3446776 [DOI] [Google Scholar]
6. Banerjee I, Bhattacharjee K, Burns JL, Trivedi H, Purkayastha S, Seyyed-Kalantari L, et al. “Shortcuts” causing bias in radiology artificial intelligence: causes, evaluation and mitigation. Journal of the American College of Radiology. 2023;. doi: 10.1016/j.jacr.2023.06.025 [DOI] [PMC free article] [PubMed] [Google Scholar]
7. Seyyed-Kalantari L, Zhang H, McDermott MB, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nature medicine. 2021;27(12):2176–2182. doi: 10.1038/s41591-021-01595-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
8. Howard FM, Dolezal J, Kochanny S, Schulte J, Chen H, Heij L, et al. t. Nature communications. 2021;12(1):4423. doi: 10.1038/s41467-021-24698-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
9. Komura D, Ishikawa S. Machine learning methods for histopathological image analysis. Computational and structural biotechnology journal. 2018;16:34–42. doi: 10.1016/j.csbj.2018.01.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
10. Reinhard E, Adhikhmin M, Gooch B, Shirley P. Color transfer between images. IEEE Computer graphics and applications. 2001;21(5):34–41. doi: 10.1109/38.946629 [DOI] [Google Scholar]
11.Macenko M, Niethammer M, Marron JS, Borland D, Woosley JT, Guan X, et al.; IEEE. A method for normalizing histology slides for quantitative analysis. 2009 IEEE international symposium on biomedical imaging: from nano to macro. 2009; p. 1107–1110.
12. Tellez D, Balkenhol M, Otte-Höller I, van de Loo R, Vogels R, Bult P, et al. Whole-slide mitosis detection in H&E breast histology using PHH3 as a reference to train distilled stain-invariant convolutional networks. IEEE transactions on medical imaging. 2018;37(9):2126–2136. doi: 10.1109/TMI.2018.2820199 [DOI] [PubMed] [Google Scholar]
13. Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS medicine. 2018;15(11):e1002683. doi: 10.1371/journal.pmed.1002683 [DOI] [PMC free article] [PubMed] [Google Scholar]
14. Rueckel J, Trappmann L, Schachtner B, Wesp P, Hoppe BF, Fink N, et al. Impact of confounding thoracic tubes and pleural dehiscence extent on artificial intelligence pneumothorax detection in chest radiographs. Investigative Radiology. 2020;55(12):792–798. doi: 10.1097/RLI.0000000000000707 [DOI] [PubMed] [Google Scholar]
15. Gichoya JW, Banerjee I, Bhimireddy AR, Burns JL, Celi LA, Chen LC, et al. AI recognition of patient race in medical imaging: a modelling study. The Lancet Digital Health. 2022;4(6):e406–e414. doi: 10.1016/S2589-7500(22)00063-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Seyyed-Kalantari L, Liu G, McDermott M, Chen IY, Ghassemi M. CheXclusion: Fairness gaps in deep chest X-ray classifiers. In: BIOCOMPUTING 2021: proceedings of the Pacific symposium. World Scientific; 2020. p. 232–243. [PubMed]
17. Correa R, Shaan M, Trivedi H, Patel B, Celi LAG, Gichoya JW, et al. A Systematic review of ‘Fair’AI model development for image classification and prediction. Journal of Medical and Biological Engineering. 2022;42(6):816–827. doi: 10.1007/s40846-022-00754-z [DOI] [Google Scholar]
18.Li Z, Evtimov I, Gordo A, Hazirbas C, Hassner T, Ferrer CC, et al. A Whac-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others; 2023. Available from: http://arxiv.org/abs/2212.04825.
19.Koh PW, Nguyen T, Tang YS, Mussmann S, Pierson E, Kim B, et al. Concept Bottleneck Models; 2020. Available from: http://arxiv.org/abs/2007.04612.
20.Alabdulmohsin I, Chiou N, D’Amour A, Gretton A, Koyejo S, Kusner MJ, et al. Adapting to latent subgroup shifts via concepts and proxies. In: International Conference on Artificial Intelligence and Statistics. PMLR; 2023. p. 9637–9661.
21. Newcomb PA, Baron J, Cotterchio M, Gallinger S, Grove J, Haile R, et al. Colon Cancer Family Registry: an international resource for studies of the genetic epidemiology of colon cancer. Cancer Epidemiology Biomarkers & Prevention. 2007;16(11):2331–2343. doi: 10.1158/1055-9965.EPI-07-0648 [DOI] [PubMed] [Google Scholar]
22. Johnson LL, Shih JH. Chapter 23—An Introduction to Survival Analysis. In: Gallin JI, Ognibene FP, editors. Principles and Practice of Clinical Research (Third Edition). third edition ed. Boston: Academic Press; 2012. p. 285–293. Available from: https://www.sciencedirect.com/science/article/pii/B9780123821676000230. [Google Scholar]
23. Pai RK, Hartman D, Schaeffer DF, Rosty C, Shivji S, Kirsch R, et al. Development and initial validation of a deep learning algorithm to quantify histological features in colorectal carcinoma including tumour budding/poorly differentiated clusters. Histopathology. 2021;79(3):391–405. doi: 10.1111/his.14353 [DOI] [PubMed] [Google Scholar]
24.Jang E, Gu S, Poole B. Categorical Reparameterization with Gumbel-Softmax; 2017. Available from: http://arxiv.org/abs/1611.01144.
25. Longato E, Vettoretti M, Camillo BD. A practical perspective on the concordance index for the evaluation and selection of prognostic time-to-event models. Journal of Biomedical Informatics. 2020;108:103496. doi: 10.1016/j.jbi.2020.103496 [DOI] [PubMed] [Google Scholar]
26. Davidson-Pilon C. lifelines: survival analysis in Python. Journal of Open Source Software. 2019;4(40):1317. doi: 10.21105/joss.01317 [DOI] [Google Scholar]
27. Katzman JL, Shaham U, Cloninger A, Bates J, Jiang T, Kluger Y. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC medical research methodology. 2018;18:1–12. doi: 10.1186/s12874-018-0482-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

PLoS One. doi: 10.1371/journal.pone.0303415.r001

Decision Letter 0

Guanghui Liu

27 May 2024

PONE-D-24-15861Causal Debiasing for Unknown Bias in Histopathology - A Colon Cancer Use CasePLOS ONE

Dear Dr. Banerjee,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jul 11 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Guanghui Liu

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. For studies involving third-party data, we encourage authors to share any data specific to their analyses that they can legally distribute. PLOS recognizes, however, that authors may be using third-party data they do not have the rights to share. When third-party data cannot be publicly shared, authors must provide all information necessary for interested researchers to apply to gain access to the data. (https://journals.plos.org/plosone/s/data-availability#loc-acceptable-data-access-restrictions)

For any third-party data that the authors cannot legally distribute, they should include the following information in their Data Availability Statement upon submission:

1) A description of the data set and the third-party source

2) If applicable, verification of permission to use the data set

3) Confirmation of whether the authors received any special privileges in accessing the data that other researchers would not have

4) All necessary contact information others would need to apply to gain access to the data

3. Please include your full ethics statement in the ‘Methods’ section of your manuscript file. In your statement, please include the full name of the IRB or ethics committee who approved or waived your study, as well as whether or not you obtained informed written or verbal consent. If consent was waived for your study, please include this information in your statement as well.

4. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript titled "Causal Debiasing for Unknown Bias in Histopathology - A Colon Cancer Use Case": provides a valuable contribution to the field of digital pathology and AI, addressing a critical issue of bias in AI models. Addressing the points below may enhance the manuscript's clarity, impact, and scientific rigor.

Major Concerns:

1. While the causal modeling approach is a strong point of the paper, the explanation of the model and its components could be improved for better clarity. Specifically, the description of how the causal model integrates with the existing AI frameworks could be elaborated.

Recommendation: Provide a more detailed step-by-step explanation of the causal model implementation, possibly complemented by pseudocode or a more detailed schematic diagram.

Recommendation: Enhance the statistical analysis section by detailing the types of tests performed, including any assumptions made and the justification for the choice of these tests.

Recommendation: Expand the limitations section to include a discussion on how these limitations could affect the study's generalizability and what future research could address these issues.

Recommendation: Include a section that explores the broader implications of your findings for the field of digital pathology and AI, considering both technological and ethical aspects.

Minor Points:

1. Figure and Table Quality: Some figures and tables could be clearer. Specifically, the graphical representations of the model's performance across different sites are somewhat difficult to interpret.

Recommendation: Improve the quality of these graphical elements to enhance clarity. Consider using color contrasts or different chart types if necessary.

2. Editing for Language and Grammar: Minor grammatical errors and typos should be corrected to maintain the manuscript's professionalism.

Recommendation: Perform a thorough proofread to correct these issues before final submission.

Here are some grammatical corrections and suggestions for improvement to enhance the clarity and precision of the writing:

1. Verb Tense Consistency:

- Original: "The study have shown..."

- Suggested Correction: "The study has shown..."

- Explanation: The verb "has" should be used for singular subjects like "study" to maintain subject-verb agreement.

2. Article Usage:

- Original: "We apply a novel method to address issue."

- Suggested Correction: "We apply a novel method to address the issue."

- Explanation: Use of the definite article "the" is necessary before "issue" to specify which issue is being discussed.

3. Prepositional Phrases:

- Original: "This is critical for ensuring that AI applications is robust."

- Suggested Correction: "This is critical for ensuring that AI applications are robust."

- Explanation: The verb "are" correctly agrees with the plural noun "applications."

4. Punctuation and Compound Sentences:

- Original: "The model performs well in testing environments, however, it requires further validation."

- Suggested Correction: "The model performs well in testing environments; however, it requires further validation."

- Explanation: A semicolon is more appropriate before "however" when it's used to connect two independent clauses in a compound sentence.

5. Redundancy and Wordiness:

- Original: "The data was gathered from a variety of different sources."

- Suggested Correction: "The data was gathered from a variety of sources."

- Explanation: The word "different" is redundant with "variety" and can be omitted for conciseness.

6. Consistency in Technical Terms:

- Original: "de-biasing techniques can help in reduction of errors"

- Suggested Correction: "debiasing techniques can help reduce errors"

- Explanation: Maintain consistency in hyphenation across the document ("debiasing" not "de-biasing"), and simplify the phrase "help in reduction of" to "help reduce" for clarity.

Reviewer #2: This study proposed the Causal Survival model leverages causal reasoning to mitigate unknown biases, achieving comparable performance to traditional models and enhancing generalizability in histopathology applications.The topic holds significant importance; however, there are several questions that need to be addressed.

1. Do you perform k-fold cross validation during the model training? Please show the results with k-fold cross validation.

2. Please show the demographic details of cohort in various participating sites in tables.

3. Please add the github link with the code for your model in this paper.

Reviewer #3: Introduction and Background: The introduction provides a solid context for the study, emphasizing the challenges of bias in AI models for histopathology. The references to prior work and the identification of gaps in existing methods are well-articulated.

Methodology: The methodology is thorough and includes detailed descriptions of the datasets, feature extraction processes, and the proposed model. The incorporation of latent shift adjustment and auxiliary losses is innovative and well-explained.

Results: The results are clearly presented, with appropriate statistical measures and visualizations. The comparative performance analysis with existing models is comprehensive and shows the advantages of the proposed model.

Discussion: The discussion is insightful, highlighting the implications of the findings and acknowledging the limitations of the study. The authors provide a balanced view of their contributions and the areas that need further research.

Data Availability: While the data privacy constraints are understandable, providing more detailed guidance on how researchers can access the data would be beneficial. Additionally, sharing synthetic datasets or detailed simulation setups could enhance reproducibility.

Technical Depth: The technical depth of the paper is commendable. However, a more detailed explanation of certain complex concepts, such as the latent shift causal framework, could be beneficial for readers less familiar with these methods.

Figures and Tables: The figures and tables are well-designed and effectively convey the key findings. Ensuring that all figures have descriptive captions would improve clarity.

Overall, this manuscript makes a significant contribution to the field of AI in histopathology by addressing the critical issue of unknown bias. The proposed model is innovative, and the results are promising. With minor revisions to improve accessibility and reproducibility, this paper would be a valuable addition to the literature.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Peng Wang

Reviewer #2: No

Reviewer #3: Yes: Yang Zhang

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Nov 22;19(11):e0303415. doi: 10.1371/journal.pone.0303415.r002

Author response to Decision Letter 0

14 Aug 2024

We are glad that the reviewer found this manuscript to add a valuable contribution to the field of digital pathology.

Major Concerns:

1. While the paper's causal modeling approach is a strong point, the explanation of the model and its components could be improved for better clarity. Specifically, the description of how the causal model integrates with the existing AI frameworks could be elaborated on.

Recommendation: Provide a more detailed step-by-step explanation of the causal model implementation, possibly complemented by pseudocode or a more detailed schematic diagram.

We thank the reviewer for the suggestion to make the description of the causal model easier for readers to understand. We have modified the manuscript in two ways. First, we have modified the introduction to establish how the proposed causal framework fits into existing AI learning methodologies. Second, we also updated the method section to include training details for each module composing our network to make it more understandable.

Instead of pseudocode, we added a GitHub link to the actual model training code used for the manuscript to improve the reproducibility for readers.

2. The manuscript could benefit from a more detailed statistical analysis section. Details about the statistical tests used to validate the model's performance and the rationale behind choosing these tests should be clearly stated. Recommendation: Enhance the statistical analysis section by detailing the types of tests performed, including any assumptions made and the justification for the choice of these tests.

We thank the reviewer for highlighting our limited discussion on statistics used in the manuscript. We added a ‘Model Evaluation’ section to include an explanation of the metrics for model evaluation. Furthermore, we have also included an explanation of the bootstrapping procedure used to derive the confidence intervals shown in the paper. We also included more descriptive sentences justifying using the log-rank test and Pearson correlation when discussing dataset statistics.

3. Discussion of Limitations: The manuscript briefly mentions the limitations; however, a more thorough exploration of these limitations would strengthen the paper. It is particularly important to discuss the potential impacts of these limitations on the study's findings and how they might be addressed in future work. Recommendation: Expand the limitations section to include a discussion on how these limitations could affect the study's generalizability and what future research could address these issues.

We thank the reviewer for discussing the limitations of our study to properly ground our findings. We have included an additional discussion on the fact that our model's success largely depends on being able to reliably learn C and W variables during training. However, these variables are subject to noisy labels or cannot be quickly discovered by a network. We refrain from discussing how to mitigate such a limitation as it falls under a limitation of all existing supervised learning methods, which would distract from main focus of the manuscript.

4. Broader Implications: The paper would benefit from a discussion on the broader implications of this research, particularly in how it might influence future developments in AI for medical imaging, policy-making, and clinical practices.Recommendation: Include a section that explores the broader implications of your findings for the field of digital pathology and AI, considering both technological and ethical aspects.

We thank the reviewer for their suggestion. We have expanded the discussion to include a statement on the implications of our model. In particular, we highlight that by using the latent variable as the target for adaptation, we move onto a more diverse space not limited by features that may not be available in all datasets. We further discuss how this work can be applied to other medical imaging AI applications, which are subject to domain and label shift, such as chest x-ray pathology screening and breast cancer prediction.

5. Figure and Table Quality: Some figures and tables could be clearer. Specifically, the graphical representations of the model's performance across different sites are somewhat difficult to interpret. Recommendation: Improve the quality of these graphical elements to enhance clarity. Consider using color contrasts or different chart types if necessary.

We thank the reviewer for their recommendation. We have improved the quality of our figure by introducing more contrasting colors, larger points, and reference lines for easier interpretation. The added discussion of the methods section will assist readers in interpreting the tables.

6. Editing for Language and Grammar: Minor grammatical errors and typos should be corrected to maintain the manuscript's professionalism. Recommendation: Perform a thorough proofread to correct these issues before final submission.

Here are some grammatical corrections and suggestions for improvement to enhance the clarity and precision of the writing:

1. Verb Tense Consistency:

- Original: "The study have shown..."

- Suggested Correction: "The study has shown..."

- Explanation: The verb "has" should be used for singular subjects like "study" to maintain subject-verb agreement.

2. Article Usage:

- Original: "We apply a novel method to address issue."

- Suggested Correction: "We apply a novel method to address the issue."

- Explanation: Use of the definite article "the" is necessary before "issue" to specify which issue is being discussed.

3. Prepositional Phrases:

- Original: "This is critical for ensuring that AI applications is robust."

- Suggested Correction: "This is critical for ensuring that AI applications are robust."

- Explanation: The verb "are" correctly agrees with the plural noun "applications."

4. Punctuation and Compound Sentences:

- Original: "The model performs well in testing environments, however, it requires further validation."

- Suggested Correction: "The model performs well in testing environments; however, it requires further validation."

- Explanation: A semicolon is more appropriate before "however" when it's used to connect two independent clauses in a compound sentence .

7. 5. Redundancy and Wordiness:

- Original: "The data was gathered from a variety of different sources."

- Suggested Correction: "The data was gathered from a variety of sources."

- Explanation: The word "different" is redundant with "variety" and can be omitted for conciseness.

6. Consistency in Technical Terms:

- Original: "de-biasing techniques can help in reduction of errors"

- Suggested Correction: "Debiasing techniques can help reduce errors"

- Explanation: Maintain consistency in hyphenation across the document ("debiasing" not "de-biasing"), and simplify the phrase "help in reduction of" to "help reduce" for clarity.

We would like to thank the reviewer for thoroughly checking the draft. We have updated the manuscript with the proper corrections.

We thank the reviewer for considering our manuscript to be important for the field. We replied to the questions below and addressed the changes in the main manuscript.

1. Do you perform k-fold cross-validation during the model training? Please show the results with k-fold cross-validation

We thank the reviewer for suggesting using k-fold cross-validation. We did not perform k-fold cross-validation during model training since the main focus of the manuscript is improving the model performance on external test data that is never seen during training. Instead, we provided an auto-bootstrap evaluation of our model on each test set, which allowed us to derive confidence intervals and comprehend the robustness of the model performance on each site.

2. Please show the demographic details of the cohort in various participating sites in tables.

Unfortunately, we don’t have access to the demographic details of the patients per site.

3. Please add the GitHub link with the code for your model in this paper.

We apologize to the reviewer for this oversight. We have updated the manuscript to include the GitHub link of the training code used for the manuscript.

1. Data Availability: While the data privacy constraints are understandable, providing more detailed guidance on how researchers can access the data would be beneficial. Additionally, sharing synthetic datasets or detailed simulation setups could enhance reproducibility.

We thank the reviewer for their interest in data availability. The data is available upon reasonable request from the colon cancer family registry. QuantCRC features can also be available upon data usage agreement.

We have updated the manuscript to include the GitHub link of the training code used for the manuscript.

2. Technical Depth: The technical depth of the paper is commendable. However, a more detailed explanation of certain complex concepts, such as the latent shift causal framework, could be beneficial for readers less familiar with these methods.

We thank the reviewer for commenting on improving the approachability of our manuscript. We have updated our manuscript and added a supplemental section with more in-depth material to further assist readers in understanding the concepts discussed in our work.

Attachment

Submitted filename: PlosOne_Manuscript_comments.docx

pone.0303415.s002.docx^{(11KB, docx)}

PLoS One. doi: 10.1371/journal.pone.0303415.r003

Decision Letter 1

Guanghui Liu

16 Sep 2024

Causal Debiasing for Unknown Bias in Histopathology - A Colon Cancer Use Case

PONE-D-24-15861R1

Dear Dr. Banerjee,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager at Editorial Manager® and clicking the ‘Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Guanghui Liu

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: The authors have thoroughly addressed all of my concerns. I have no further questions and recommend proceeding with the acceptance process according to the journal's guidelines.

Reviewer #2: I am satified with the improvement the authors made on the reearch paper. I have no further comments on the current version of paper draft.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: peng wang

Reviewer #2: No

**********

PLoS One. doi: 10.1371/journal.pone.0303415.r004

Acceptance letter

Guanghui Liu

8 Oct 2024

PONE-D-24-15861R1

PLOS ONE

Dear Dr. Banerjee,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission,

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Guanghui Liu

Academic Editor

PLOS ONE

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Data

(ZIP)

pone.0303415.s001.zip^{(1MB, zip)}

Attachment

Submitted filename: PlosOne_Manuscript_comments.docx

pone.0303415.s002.docx^{(11KB, docx)}

Data Availability Statement

[pone.0303415.ref001] 1. Ashman K, Zhuge H, Shanley E, Fox S, Halat S, Sholl A, et al. Whole slide image data utilization informed by digital diagnosis patterns. Journal of Pathology Informatics. 2022;13:100113. doi: 10.1016/j.jpi.2022.100113 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0303415.ref002] 2. Madabhushi A, Lee G. Image analysis and machine learning in digital pathology: Challenges and opportunities. Medical image analysis. 2016;33:170–175. doi: 10.1016/j.media.2016.06.037 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0303415.ref003] 3. Bousis D, Verras GI, Bouchagier K, Antzoulas A, Panagiotopoulos I, Katinioti A, et al. The role of deep learning in diagnosing colorectal cancer. Gastroenterology Review/Przegląd Gastroenterologiczny. 2023;18(1). doi: 10.5114/pg.2023.129494 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0303415.ref004] 4. Pai RK, Banerjee I, Shivji S, Jain S, Hartman D, Buchanan DD, et al. Quantitative Pathologic Analysis of Digitized Images of Colorectal Carcinoma Improves Prediction of Recurrence-Free Survival. Gastroenterology. 2022;163(6):1531–1546.e8. doi: 10.1053/j.gastro.2022.08.025 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0303415.ref005] 5. Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM. 2021;64(3):107–115. doi: 10.1145/3446776 [DOI] [Google Scholar]

[pone.0303415.ref006] 6. Banerjee I, Bhattacharjee K, Burns JL, Trivedi H, Purkayastha S, Seyyed-Kalantari L, et al. “Shortcuts” causing bias in radiology artificial intelligence: causes, evaluation and mitigation. Journal of the American College of Radiology. 2023;. doi: 10.1016/j.jacr.2023.06.025 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0303415.ref007] 7. Seyyed-Kalantari L, Zhang H, McDermott MB, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nature medicine. 2021;27(12):2176–2182. doi: 10.1038/s41591-021-01595-0 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0303415.ref008] 8. Howard FM, Dolezal J, Kochanny S, Schulte J, Chen H, Heij L, et al. t. Nature communications. 2021;12(1):4423. doi: 10.1038/s41467-021-24698-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0303415.ref009] 9. Komura D, Ishikawa S. Machine learning methods for histopathological image analysis. Computational and structural biotechnology journal. 2018;16:34–42. doi: 10.1016/j.csbj.2018.01.001 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0303415.ref010] 10. Reinhard E, Adhikhmin M, Gooch B, Shirley P. Color transfer between images. IEEE Computer graphics and applications. 2001;21(5):34–41. doi: 10.1109/38.946629 [DOI] [Google Scholar]

[pone.0303415.ref011] 11.Macenko M, Niethammer M, Marron JS, Borland D, Woosley JT, Guan X, et al.; IEEE. A method for normalizing histology slides for quantitative analysis. 2009 IEEE international symposium on biomedical imaging: from nano to macro. 2009; p. 1107–1110.

[pone.0303415.ref012] 12. Tellez D, Balkenhol M, Otte-Höller I, van de Loo R, Vogels R, Bult P, et al. Whole-slide mitosis detection in H&E breast histology using PHH3 as a reference to train distilled stain-invariant convolutional networks. IEEE transactions on medical imaging. 2018;37(9):2126–2136. doi: 10.1109/TMI.2018.2820199 [DOI] [PubMed] [Google Scholar]

[pone.0303415.ref013] 13. Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS medicine. 2018;15(11):e1002683. doi: 10.1371/journal.pmed.1002683 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0303415.ref014] 14. Rueckel J, Trappmann L, Schachtner B, Wesp P, Hoppe BF, Fink N, et al. Impact of confounding thoracic tubes and pleural dehiscence extent on artificial intelligence pneumothorax detection in chest radiographs. Investigative Radiology. 2020;55(12):792–798. doi: 10.1097/RLI.0000000000000707 [DOI] [PubMed] [Google Scholar]

[pone.0303415.ref015] 15. Gichoya JW, Banerjee I, Bhimireddy AR, Burns JL, Celi LA, Chen LC, et al. AI recognition of patient race in medical imaging: a modelling study. The Lancet Digital Health. 2022;4(6):e406–e414. doi: 10.1016/S2589-7500(22)00063-2 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0303415.ref016] 16.Seyyed-Kalantari L, Liu G, McDermott M, Chen IY, Ghassemi M. CheXclusion: Fairness gaps in deep chest X-ray classifiers. In: BIOCOMPUTING 2021: proceedings of the Pacific symposium. World Scientific; 2020. p. 232–243. [PubMed]

[pone.0303415.ref017] 17. Correa R, Shaan M, Trivedi H, Patel B, Celi LAG, Gichoya JW, et al. A Systematic review of ‘Fair’AI model development for image classification and prediction. Journal of Medical and Biological Engineering. 2022;42(6):816–827. doi: 10.1007/s40846-022-00754-z [DOI] [Google Scholar]

[pone.0303415.ref018] 18.Li Z, Evtimov I, Gordo A, Hazirbas C, Hassner T, Ferrer CC, et al. A Whac-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others; 2023. Available from: http://arxiv.org/abs/2212.04825.

[pone.0303415.ref019] 19.Koh PW, Nguyen T, Tang YS, Mussmann S, Pierson E, Kim B, et al. Concept Bottleneck Models; 2020. Available from: http://arxiv.org/abs/2007.04612.

[pone.0303415.ref020] 20.Alabdulmohsin I, Chiou N, D’Amour A, Gretton A, Koyejo S, Kusner MJ, et al. Adapting to latent subgroup shifts via concepts and proxies. In: International Conference on Artificial Intelligence and Statistics. PMLR; 2023. p. 9637–9661.

[pone.0303415.ref021] 21. Newcomb PA, Baron J, Cotterchio M, Gallinger S, Grove J, Haile R, et al. Colon Cancer Family Registry: an international resource for studies of the genetic epidemiology of colon cancer. Cancer Epidemiology Biomarkers & Prevention. 2007;16(11):2331–2343. doi: 10.1158/1055-9965.EPI-07-0648 [DOI] [PubMed] [Google Scholar]

[pone.0303415.ref022] 22. Johnson LL, Shih JH. Chapter 23—An Introduction to Survival Analysis. In: Gallin JI, Ognibene FP, editors. Principles and Practice of Clinical Research (Third Edition). third edition ed. Boston: Academic Press; 2012. p. 285–293. Available from: https://www.sciencedirect.com/science/article/pii/B9780123821676000230. [Google Scholar]

[pone.0303415.ref023] 23. Pai RK, Hartman D, Schaeffer DF, Rosty C, Shivji S, Kirsch R, et al. Development and initial validation of a deep learning algorithm to quantify histological features in colorectal carcinoma including tumour budding/poorly differentiated clusters. Histopathology. 2021;79(3):391–405. doi: 10.1111/his.14353 [DOI] [PubMed] [Google Scholar]

[pone.0303415.ref024] 24.Jang E, Gu S, Poole B. Categorical Reparameterization with Gumbel-Softmax; 2017. Available from: http://arxiv.org/abs/1611.01144.

[pone.0303415.ref025] 25. Longato E, Vettoretti M, Camillo BD. A practical perspective on the concordance index for the evaluation and selection of prognostic time-to-event models. Journal of Biomedical Informatics. 2020;108:103496. doi: 10.1016/j.jbi.2020.103496 [DOI] [PubMed] [Google Scholar]

[pone.0303415.ref026] 26. Davidson-Pilon C. lifelines: survival analysis in Python. Journal of Open Source Software. 2019;4(40):1317. doi: 10.21105/joss.01317 [DOI] [Google Scholar]

[pone.0303415.ref027] 27. Katzman JL, Shaham U, Cloninger A, Bates J, Jiang T, Kluger Y. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC medical research methodology. 2018;18:1–12. doi: 10.1186/s12874-018-0482-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Causal debiasing for unknown bias in histopathology—A colon cancer use case

Ramón L Correa-Medero

Rish Pai

Kingsley Ebare

Daniel D Buchanan

Mark A Jenkins

Amanda I Phipps

Polly A Newcomb

Steven Gallinger

Robert Grant

Loic Le marchand

Imon Banerjee

Roles

Abstract

1 Introduction

2 Methods

2.1 Dataset

Table 1. Description of the dataset—Shows the characteristics of both internal and external datasets.

Fig 1. Recurrence-free survival for the colorectal carcinoma patients—(a) internal and (b) external sites.

2.2 Quantitative image feature extraction

Fig 2. Quantitative feature extraction pipeline on 2 sample images, which segments the image in a stepwise manner.

2.3 Causal survival model

Fig 3. Causal modeling framework: (left) latent shift assumption and causal relations; (right) proposed model diagram demonstrates 3 main components of the model.

2.3.1 Auto encoder training (Module A)

2.3.2 Latent estimation (Module B)

2.4 Recurrence risk prediction (Module C)

2.5 Model evaluation

3 Results

Fig 4. Pearson correlation coefficient between Stage (C), Center (W), QuantCRC features (X) and target recurrence (Y)—(a) represents stage and center as separate variables, (b) represents stage and center as a combined variables, and (c) Correlation with recurrence.

Table 2. Performance of the models on the internal and external test sets in-terms of C-index.

Fig 5. Comparative area under the receiver operating characteristics curve (AUROC) calculated for every time step on the internal and external dataset—(left) internal and (right) external.

3.1 Ablation

Table 3. Configuration of ablation study.

4 Discussion

4.1 Limitation

Supporting information

Data Availability

Funding Statement

References

Decision Letter 0

Guanghui Liu

Roles

Author response to Decision Letter 0

Decision Letter 1

Guanghui Liu

Roles

Acceptance letter

Guanghui Liu

Roles

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases