A deep learning–based algorithm for tall cell detection in papillary thyroid carcinoma

Sebastian Stenman; Nina Linder; Mikael Lundin; Caj Haglund; Johanna Arola; Johan Lundin

PLoS One. 2022 Aug 9;17(8):e0272696. doi: 10.1371/journal.pone.0272696

Editor: Alvaro Galli

PMCID: PMC9362950 PMID: 35944056

Abstract

Introduction

According to the World Health Organization, the tall cell variant (TCV) is an aggressive subtype of papillary thyroid carcinoma (PTC) comprising at least 30% epithelial cells two to three times as tall as they are wide. In practice, applying this definition is difficult, causing substantial interobserver variability. We aimed to train a deep learning algorithm to detect and quantify the proportion of tall cells (TCs) in PTC.

Methods

We trained the deep learning algorithm using supervised learning, tested it on an independent dataset, and further validated it on an independent set of 90 PTC samples from patients treated at the Hospital District of Helsinki and Uusimaa between 2003 and 2013. We compared the algorithm-based TC percentage with the independent scoring by a human investigator and examined how each score was associated with disease outcomes. Additionally, we assessed the TC score in 71 local and distant tumor relapse samples from patients with aggressive disease.

Results

In the test set, the deep learning algorithm detected TCs with a sensitivity of 93.7% and a specificity of 94.5%, whereas the sensitivity fell to 90.9% and specificity to 94.1% for non-TC areas. In the validation set, the deep learning algorithm TC scores correlated with a diminished relapse-free survival using cutoff points of 10% (p = 0.044), 20% (p < 0.01), and 30% (p = 0.036). The visually assessed TC score did not statistically significantly predict survival at any of the analyzed cutoff points. We observed no statistically significant difference in the TC score between primary tumors and relapse tumors determined by the deep learning algorithm or visually.

Conclusions

We present a novel deep learning–based algorithm to detect tall cells, showing that a high deep learning–based TC score represents a statistically significant predictor of less favorable relapse-free survival in PTC.

1. Introduction

Papillary thyroid carcinoma (PTC) has the most favorable outcome among all thyroid carcinomas, especially in young patients [1–3]. However, the tall cell variant (TCV) of PTC correlates with a more aggressive disease and less favorable outcomes [4–6]. TCV is associated with a greater risk of recurrence and further extra-thyroidal extensions [6–8]. The World Health Organization (WHO) defines a TCV as a PTC containing at least 30% epithelial cells that are two to three times as tall as they are wide with an abundant eosinophilic cytoplasm [9]. This threshold percentage, however, remains debated. A TC score as low as 10% was previously proposed as correlating with an adverse outcome [5, 10]. Yet, in other studies higher thresholds for the TC score were reportedly needed to identify an adverse outcome, such as thresholds of 30% [6], 50% [7], and 70% [11]. The debate regarding the TCV threshold causes confusion among pathologists [12, 13].

The visual assessment of the TC composition through conventional microscopy remains a subjective and time-consuming task, leading to high inter- and intra-observer variability demonstrated through the reevaluation of tissue samples [3, 14]. Whole-slide imaging (WSI) and computational methods allow for the quantitative analysis of increasingly complex morphological patterns. Deep learning–based algorithms have been used for a wide range of tasks, from the detection of cell nuclei [15], mitoses [16], and tumor-infiltrating immune cells [17, 18], to more complex spatial pattern recognition within tumors [19, 20] and tumor grades [21]. These methods can help tackle inter- and intra-observer variability and subjectivity when analyzing tissue samples.

In this study, we evaluated the feasibility of a deep learning–based tissue segmentation method to assess TCs in thyroid carcinomas. The process includes the segmentation of tumor tissue as the first step, followed by the segmentation of the epithelium into TC and non-TC areas, from which a TC percentage, that is, a TC score, can be calculated. Specifically, we aimed to evaluate whether deep learning (DL) methods can be applied to the quantification of the TC composition and how the result correlates with TC scores assessed by human observers. Using this novel TCV algorithm, we also studied the association between the TC score and PTC outcomes at various cutoff points in a selected cohort of PTC patients. As a secondary aim of the study, we analyzed 71 PTC relapse samples to study how their TC composition compares with the morphology of the primary tumors.

2. Methods

2.1. Patient series

2.1.1. Training series

The DL algorithm was trained using 100 whole-slide images (WSIs) of hematoxylin and eosin (H&E) stained tissue samples originating from 100 separate patients with PTC. Among these, 70 WSIs originated from a PTC cohort treated from 1973 to 1996 at the Helsinki University Hospital [2, 11]. To broaden the training dataset and improve the generalizability of the trained algorithm, 30 additional PTC WSIs were added to the training series. These 30 WSIs were downloaded from The Cancer Genome Atlas (TCGA) [22] (Fig 1), a vast publicly available database of information including genomic data and histological WSIs of 33 different cancer types.

Fig 1 — Papillary thyroid carcinoma is abbreviated as PTC.

2.1.2. Validation series

We validated the trained DL algorithm on an independent case–control cohort comprising 90 PTC patients treated at the Hospital District of Helsinki and Uusimaa (HUS) between 01/01/2003 and 12/31/2013. The follow-up time ended on 12/31/2018 and therefore allowed all included patients to have a follow-up of at least 5 years. Patients with an adverse outcome (n = 34) were defined as PTC cases with at least two recurrences (histological confirmation or serum thyroglobulin elevation during follow-up), distant metastases, or patients who died from PTC. These adverse outcome patients were matched with 1 to 2 controls (n = 56) according to age (within 10 years), gender, and tumor stage (T stage). Microcarcinomas (<1 cm in diameter) were excluded from this cohort (Fig 1). Formalin-fixed paraffin-embedded tissue blocks for all patients treated at HUS were retrieved, and the corresponding slides were simultaneously assessed by two researchers (SS and JA) using a multiview microscope. Based on these slides, the most representative tissue block for each patient was selected. New tissue sections of these representative blocks were cut and stained with H&E following standard procedures. The H&E-stained slides were then digitized with a scanner (Pannoramic 250 FLASH, 3DHISTECH Ltd., Budapest, Hungary) equipped with a plan-apochromat 20x objective (NA 0.8), a CMOS camera (Adimec Q-12A-160Fc, Eindhoven, The Netherlands) with a pixel size of 0.2 μm/pixel, and a 1.6 adapter. Following digitization, the slides were compressed into a wavelet format (Enhanced Compressed Wavelet, ECW, ER Mapper, Intergraph, Atlanta, GA, USA) with a compression ratio of 1:9 and imported to an image management platform (Aiforia Create, Aiforia Technologies Oy, Helsinki, Finland). Patient follow-up time ranged from 2.1 to 15.8 years (median 10.1 years).
The median age at diagnosis was 41.0 years (standard deviation [SD] ± 16.2) in the adverse outcome group, 41.5 years (SD ± 14.1) in the control group, and 41 years (SD ± 14.9) for the entire cohort. During follow-up, all 34 patients in the adverse outcome group experienced disease relapse compared with 14 patients in the control group. The median relapse–free survival (RFS) was 0.8 years (SD ± 1.2) in the adverse outcome group and 8.6 years (SD ± 4.5) in the control group. In the adverse outcome group, 2 patients had distant metastases at the time of primary diagnosis, 34 patients had relapses, and 7 patients were diagnosed with distant metastases during follow-up, 2 patients died of PTC, and 6 patients died from other causes (Table 1).

Table 1. Characteristics of papillary thyroid carcinoma (PTC) patient cohort.

The cohort comprised 34 patients with an adverse outcome and 56 age-, sex-, and tumor stage–matched control PTC patients.

Patient characteristics	Adverse outcome (n = 34)	Control (n = 56)
Female	23 (68%)	41 (73%)
Male	11 (32%)	15 (27%)
Nodal metastases	25 (74%)	28 (50%)
Primary distant metastases	2 (6%)	0 (0%)
Relapse	34 (100%)	14 (25%)
Distant metastases during follow-up	7 (21%)	0 (0%)
Died during follow-up	6 (18%)	2 (4%)
Died of PTC	2 (6%)	0 (0%)
Primary RAI	33 (97%)	55 (98%)
Median age at diagnosis (in years)	41.0	41.5
Median follow-up time (in years)	10.4	9.7
Median relapse-free survival (in years)	0.8	8.6
Stage of tumor^*
• T1	5	10
• T2	10	19
• T3	17	25
• T4	2	2
RAI times (mean)	3.4	3.0
Algorithm TC score (median)	32%	23%
Visual TC score group (median)	30–39%	0–9%

Figures represent number of patients unless otherwise stated

RAI, radioiodine ablation

*T classification and TNM stage according to the TNM classification, seventh edition of the American Joint Committee on Cancer staging of papillary thyroid cancer.

2.1.3. Relapse series

All available histologically evaluated tissue samples (n = 71) from relapses in patients in the adverse outcome group (n = 34) were collected and visually assessed, and a representative formalin-fixed paraffin-embedded (FFPE) tissue sample was selected for each of the relapses. Fresh tissue sections were cut from these representative FFPE tissue samples, stained with H&E, and digitized according to the same protocol as described for the validation series.

2.2. Training of the deep learning algorithm

The DL-based model trained to assess the TC score consists of two algorithms run in sequence. The first algorithm was trained to segment the tumor tissue (Fig 2). The second algorithm was trained to segment the tumor tissue into TC and non-TC areas. Both algorithms were trained and tested on manual annotations of regions of interest within the 100 PTC WSIs originating from 100 separate patients. In total, 2,970 regions of interest were manually annotated by one of the researchers (SS). Among the manual annotations, 90% (n = 2,674) were used for training and the remaining 10% (n = 296) were used as a holdout test set to assess the performance of the algorithm (Fig 1). To improve the generalizability of the model, the training data were augmented by image perturbation. The augmentations used to train the tumor tissue segmentation algorithm consisted of rotation (0–360°) and variation of scale (±10%), shear distortion (±10%), aspect ratio (±10%), contrast (±10%), white balance (±10%), and luminance (±10%), with a predetermined feature size of 500 μm. When training the TC segmentation algorithm, the augmentations consisted of rotation (0–360°) and variation of scale (±5%), contrast (±20%), white balance (±20%), and luminance (±20%), with a predetermined feature size of 40 μm. Training of the model continued until no decrease in the area error was detected for 500 iterations (training epochs) or until a total of 10,000 iterations was reached. The final model was trained with 9,521 completed iterations.
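The stopping rule for training (no decrease in area error for 500 iterations, or 10,000 iterations in total) can be illustrated with a short sketch. The training loop of the platform is proprietary, so this is only an illustration under stated assumptions: `run_until_plateau` is a hypothetical helper name, and `toy_errors` is an invented error curve, not data from the study.

```python
def run_until_plateau(area_errors, patience=500, max_iterations=10_000):
    """Return (iterations_run, best_error) for a stream of per-iteration errors.

    Stops when the error has not decreased for `patience` consecutive
    iterations, or when `max_iterations` is reached, whichever comes first.
    """
    best_error = float("inf")
    since_improvement = 0
    iterations_run = 0
    for error in area_errors:
        iterations_run += 1
        if error < best_error:
            # A strict decrease resets the patience counter.
            best_error = error
            since_improvement = 0
        else:
            since_improvement += 1
        if since_improvement >= patience or iterations_run >= max_iterations:
            break
    return iterations_run, best_error


# Hypothetical error curve: improves for 100 iterations, then plateaus.
toy_errors = [1.0 / (i + 1) for i in range(100)] + [0.01] * 2_000
print(run_until_plateau(toy_errors, patience=500))
```

With the toy curve, the loop stops 500 iterations after the last improvement rather than consuming the whole stream. The reported 9,521 completed iterations is consistent with this kind of rule stopping before the 10,000-iteration cap.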

Fig 2 — a) A segmentation algorithm detects the tumor tissue (blue), which is fed as an input to a b) tall cell (TC) segmentation algorithm trained to detect TC epithelial regions (red) as well as non-TC epithelial regions (green). Finally, the TC score of the total epithelial area was calculated.
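As a minimal sketch of the final step, the TC score can be computed as the TC area divided by the total epithelial area (TC plus non-TC). The label encoding, the function name `tc_score_from_mask`, and the toy mask below are assumptions made for illustration; the output format of the actual platform is not described in this paper.

```python
from collections import Counter

# Hypothetical label encoding for a per-pixel segmentation mask.
BACKGROUND, NON_TC, TC = 0, 1, 2


def tc_score_from_mask(mask):
    """Return the TC score (%) = TC area / (TC area + non-TC area) * 100."""
    counts = Counter(label for row in mask for label in row)
    epithelium = counts[TC] + counts[NON_TC]
    if epithelium == 0:
        raise ValueError("mask contains no epithelial pixels")
    return 100.0 * counts[TC] / epithelium


# Toy 3 x 4 mask: 3 TC pixels, 5 non-TC pixels, 4 background pixels.
toy_mask = [
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [0, 2, 1, 1],
]
print(tc_score_from_mask(toy_mask))  # 3 / 8 of the epithelium is TC
```

Background and non-tumor pixels are excluded from the denominator, which mirrors the description of the score as a percentage of the total epithelial area.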

2.3. Tall cell analysis

The TC score for the representative WSIs of the primary tumors as well as the relapse tumors was visually assessed by one of the researchers (SS), who was blinded to the clinical characteristics of the patients. WSIs were grouped by the visual TC score into three groups of 0–9%, 10–29%, and ≥30% TCs (Table 2). The TC scores were also grouped into ten groups in 10% increments: 0–9%, 10–19%, 20–29%, 30–39%, 40–49%, 50–59%, 60–69%, 70–79%, 80–89%, and 90–100% (S1 Table in S1 File). The same WSIs were analyzed separately by the trained DL-based algorithm. The TC scores of the DL algorithm were reported on a continuous scale and were also grouped into the same 10% incremental groups and the same three groups of 0–9%, 10–29%, and ≥30% TC to allow comparison with the visually assessed TC scores. The images included in the study can be viewed via the following URL: https://tinyurl.com/9pbyuuxm.
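The two grouping schemes can be sketched as follows. How fractional scores at the group boundaries were handled is not stated, so the half-open intervals used here (for example, 9.9% falls in the lowest group) are an assumption, as are the function names.

```python
def three_group(score):
    """Map a TC score in percent to the three-group split."""
    if not 0 <= score <= 100:
        raise ValueError("TC score must be between 0 and 100")
    if score < 10:
        return "<10%"
    if score < 30:
        return "10-29%"
    return ">=30%"


def ten_percent_group(score):
    """Map a TC score in percent to one of ten 10% incremental groups."""
    if not 0 <= score <= 100:
        raise ValueError("TC score must be between 0 and 100")
    index = min(int(score // 10), 9)  # a score of 100 joins the top group
    low = 10 * index
    high = 100 if index == 9 else low + 9
    return f"{low}-{high}%"


# Median algorithm TC scores reported for the control and adverse groups.
for score in (22.8, 31.6):
    print(score, three_group(score), ten_percent_group(score))
```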

Table 2. The interrater agreement between the visually assessed tall cell (TC) percentage and the algorithm’s TC area–based percentage in the primary tumors.

An interrater agreement analysis was performed and yielded a weighted kappa value of 0.31 (SD ± 0.068).

		Algorithm TC score assessment
		<10%	10–29%	≥30%	Total
Visual TC score assessment	<10%	6	27	7	40
	10–29%	1	7	8	16
	≥30%	1	7	26	34
	Total	8	41	41	90
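The weighted kappa reported for Table 2 can be approximately reproduced from the cell counts. The sketch below uses linear weights, as named in the statistical analysis section; it is a plain-Python illustration rather than the Stata routine used in the study, and the function name is hypothetical.

```python
def linear_weighted_kappa(table):
    """Cohen's weighted kappa with linear weights for a square count table.

    Rows are one rater's categories, columns the other's, in the same
    ordinal order. Weight for cell (i, j) is 1 - |i - j| / (k - 1).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0
    expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = 1.0 - abs(i - j) / (k - 1)
            observed += weight * table[i][j] / n
            expected += weight * row_totals[i] * col_totals[j] / (n * n)
    return (observed - expected) / (1.0 - expected)


# Table 2 counts: rows = visual score group, columns = algorithm score group.
table_2 = [
    [6, 27, 7],
    [1, 7, 8],
    [1, 7, 26],
]
print(round(linear_weighted_kappa(table_2), 2))
```

The result should come out close to the reported value of 0.31. Most of the disagreement comes from the 27 samples scored as <10% visually but as 10–29% by the algorithm.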

2.4. Statistical analysis

All statistical analyses were performed using a general-purpose statistical software package (Stata 16.1 for Mac, Stata Corp., College Station, TX, USA). The performance of the DL algorithm was evaluated by calculating the sensitivity and specificity based on the independent test set. The F1 score of the independent test set was also calculated as the harmonic mean of the sensitivity (recall) and positive predictive value (precision). The distribution of the samples according to their TC score was analyzed using the Mann–Whitney U test. Fisher’s exact test was used to assess differences between groups for nominal variables, and the Cochran–Armitage test was used to assess trends between ordinal variables. Agreement between the researcher’s and the algorithm’s TC score was tested with weighted kappa statistics with linear weights. The Kaplan–Meier method with the log-rank test and the Cox proportional hazards regression model were used for the survival analyses. Relapse-free survival (RFS) was defined as the time from the primary operation to relapse or end of follow-up. Overall survival (OS) was defined as the time from the primary operation to death from any cause. We used the tall cell score as the primary exposure in survival analyses. Spearman’s rank correlation coefficient was used to evaluate correlations between variables. We considered p < 0.05 as statistically significant using two-tailed tests.
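For readers unfamiliar with the Kaplan–Meier method, the product-limit estimate of relapse-free survival can be sketched in a few lines. The analyses in this study were run in Stata; the function below and its toy follow-up data are hypothetical and only illustrate the estimator.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct time with an observed event.

    times  -- follow-up time for each patient (e.g. years)
    events -- True if relapse was observed at that time, False if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        relapses = 0
        leaving = 0
        # Pool every patient whose follow-up ends at this time point.
        while i < len(data) and data[i][0] == t:
            relapses += int(data[i][1])
            leaving += 1
            i += 1
        if relapses:
            survival *= 1.0 - relapses / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # censored patients leave the risk set as well
    return curve


# Toy data (not from the study): five patients, three observed relapses.
toy_times = [1, 2, 2, 4, 5]
toy_events = [True, False, True, True, False]
for t, s in kaplan_meier(toy_times, toy_events):
    print(t, round(s, 3))
```

Censored patients contribute to the risk set until their follow-up ends but never lower the curve, which is why end of follow-up can be combined with relapse in the RFS definition.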

2.5. Ethical statement

We used retrospective samples that were routinely collected. This study complies with the Declaration of Helsinki and was approved by the Surgical Ethics Committee of the Helsinki University Hospital (DNo. HUS 226/E6/06, extension TMK02 §66 17.04.2013). The National Supervisory Authority of Health and Welfare granted us permission to use the tissue samples without requiring individual informed consent in this retrospective study (Valvira DNo. 10041/06.01.03.01/2012).

3. Results

3.1 The deep learning algorithm

In the test set containing 296 manual annotations, the algorithm detecting tumor tissue reached a positive predictive value (PPV; precision) of 99%, a sensitivity of 99%, and an F1 score of 99%. The subsequent TC segmentation algorithm detected TC regions with a PPV of 95%, sensitivity of 94%, and F1 score of 94%. Non-TC regions were detected with a PPV of 94%, sensitivity of 91%, and F1 score of 92% (Fig 3). In the validation dataset, the DL algorithm yielded a median TC percentage area, that is, a TC score, of 22.8% (SD ± 13.0%) in the control group and 31.6% (SD ± 11.8%) in the group with an adverse outcome. The algorithm results can be visually assessed via the following link: https://tinyurl.com/bde78hby.
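The reported F1 scores can be cross-checked against the reported PPV and sensitivity, since F1 is their harmonic mean. The inputs below are the rounded percentages from the text, so agreement is only expected after rounding; the function name is illustrative.

```python
def f1_score(precision, sensitivity):
    """Harmonic mean of precision (PPV) and sensitivity (recall)."""
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)


# (PPV, sensitivity) as reported for the holdout test set.
reported = {
    "tumor tissue": (0.99, 0.99),
    "TC": (0.95, 0.94),
    "non-TC": (0.94, 0.91),
}
for name, (ppv, sens) in reported.items():
    print(name, round(100 * f1_score(ppv, sens)))
```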

Fig 3 — a) A zoomed-out view of a papillary thyroid carcinoma tissue sample in which the results of the deep learning (DL) algorithm’s first layer are shown (blue). From the registered carcinoma area, the DL algorithm then registers the carcinoma epithelium as either b) tall cell (TC) (red) or c) non-TC area (green). The DL algorithm determines the percentage of the epithelium covered by TCs, that is, the TC score.

3.2. Agreement between visual and algorithm-based tall cell scores

The interrater agreement between the human and algorithm TC scoring when analyzed according to three groups (<10%, 10–29%, and ≥30%) yielded a weighted kappa value of 0.31 (SD ± 0.068), which translates to a fair agreement (Table 2). The interrater agreement was also calculated using all ten TC score groups, yielding a weighted kappa value of 0.36 (SD ± 0.058; S1 Table in S1 File).

3.3. Algorithm-based tall cell score and survival

Overall, a higher algorithm-based TC score correlated with an adverse outcome (p = 0.005). When comparing the adverse outcome and control groups at TC score thresholds in 10% increments, we observed a significant difference at the 10% and 20% thresholds, where a TC score above the threshold was associated with a significantly less favorable outcome (p = 0.022 and p = 0.013, respectively). We observed no significant difference at the 30%, 40%, or 50% thresholds (p = 0.054, p = 0.15, and p = 0.38, respectively). The log-rank survival analysis showed a significant association between a TC score above the threshold and a reduction in RFS at thresholds of 10% (p = 0.044), 20% (p < 0.01), and 30% (p = 0.036; Fig 4). We observed no significant association with a shorter OS at any of the TC thresholds. When splitting the samples according to the TC score into three groups of <10%, 10–29%, and ≥30%, we found that a higher TC score was significantly associated with a less favorable RFS (log-rank p = 0.038; Fig 5). In the Cox univariate regression analysis, the TC score correlated with a diminished RFS at thresholds of 20% and 30% (HR = 2.46, p < 0.01 and HR = 1.84, p = 0.039, respectively) and for the ≥30% TC group in a three-group split using <10% TC as the reference (HR = 7.48, p = 0.049). In Cox multivariate regression analysis adjusted for age, RFS was significantly reduced for a 20% threshold (HR = 2.47, p = 0.009), a 30% threshold (HR = 1.83, p = 0.041), and for the ≥30% TC group in a three-group split using <10% TC as the reference group (HR = 7.45, p = 0.049) (Table 3). A higher age at diagnosis did not significantly correlate with a higher TC score (Spearman’s rho = -0.028, p = 0.80).

Fig 4 — In the figure, tall cell is abbreviated as TC.

Fig 5 — In the figure, tall cell is abbreviated as TC.

Table 3. Cox univariate and multivariate regression analysis for relapse-free survival (RFS) among patients with papillary thyroid carcinoma.

	Univariate analysis			Multivariate analysis
Parameters	Hazard ratio	p value	95% CI	Hazard ratio	p value	95% CI
Age >45 years	0.91	0.74	0.51–1.62
Algorithm TC thresholds^*
≥10%	6.00	0.076	0.83–43.6	5.93	0.079	0.81–43.2
≥20%	2.46	0.009	1.25–4.86	2.38	0.013	1.20–4.70
≥30%	1.84	0.039	1.03–3.27	1.83	0.041	1.02–3.26
Algorithm TC three group split^†
<10%	-	-	-	-	-	-
10–29%	4.80	0.13	0.64–35.7	4.81	0.13	0.64–35.8
≥30%	7.48	0.049	1.01–55.2	7.45	0.049	1.01–55.0
Multivariate adjusted for age^‡				0.96	0.88	0.53–1.71
Visual TC thresholds^*
≥10%	1.23	0.49	0.69–2.19	1.24	0.46	0.69–2.23
≥20%	1.29	0.38	0.73–2.28	1.30	0.37	0.74–2.30
≥30%	1.32	0.34	0.75–2.34	1.32	0.34	0.75–2.35
Visual TC three group split^†
<10%	-	-	-	-	-	-
10–29%	1.02	0.96	0.45–2.34	1.04	0.92	0.45–2.40
≥30%	1.33	0.37	0.71–2.48	1.34	0.36	0.72–2.40
Multivariate adjusted for age^‡				0.90	0.72	0.50–1.61

*The two group splits are analyzed for 10%, 20% and 30% TC thresholds and are studied as separate models as illustrated with dashed borders.

^†<10% TC group used as reference group when analyzing the three-group split of <10% TC, 10–29% TC, and ≥30% TC.

^‡Multivariate Cox regression analysis adjusted for age threshold of 45 years was analyzed according to the seventh edition of the American Joint Committee on Cancer staging of papillary thyroid cancer.

Abbreviation: TC, tall cell.
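The hazard ratios, confidence intervals, and p values in Table 3 are linked through the Cox model on the log scale. The sketch below recovers an approximate standard error and two-sided p value from a reported hazard ratio and its 95% CI. It assumes Wald-type intervals that are symmetric around log(HR), which the paper does not state explicitly, and the function name is hypothetical.

```python
import math


def wald_from_ci(hazard_ratio, ci_low, ci_high, z_crit=1.959964):
    """Return (se, z, p) implied by a hazard ratio and its 95% CI.

    Assumes CI = exp(log(HR) +/- z_crit * se), i.e. a Wald interval.
    """
    log_hr = math.log(hazard_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = log_hr / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p value
    return se, z, p


# Univariate rows for the algorithm TC thresholds of 20% and 30% (Table 3).
for label, row in {">=20%": (2.46, 1.25, 4.86), ">=30%": (1.84, 1.03, 3.27)}.items():
    se, z, p = wald_from_ci(*row)
    print(label, round(se, 3), round(z, 2), round(p, 3))
```

Under this assumption the implied p values land close to the tabulated 0.009 and 0.039. The very wide intervals for the ≥10% threshold and the three-group split reflect how few samples had an algorithm TC score below 10% (8 of 90 in Table 2).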

3.4. Visually assessed tall cell score and survival

When evaluated using 10% increments for the visual TC score, the median TC score group was 0–9% for the controls and 30–39% for the adverse outcome group. Overall, the visual TC score was significantly higher in the adverse outcome group compared with the controls (p = 0.008). Examining the visual TC score in 10% increments, we observed a significant difference in the distribution between the adverse outcome group and controls when using the 10%, 20%, 30%, 40%, and 50% thresholds (p = 0.030, p = 0.031, p = 0.026, p = 0.030, and p = 0.024, respectively; Fig 4). We observed no significant difference at the 60%, 70%, or 80% thresholds (p = 0.094, p = 0.19, and p = 0.14, respectively). The log-rank survival analysis revealed no significant association between the visually assessed TC score and RFS at any of the 10% increments in the TC score thresholds. Similarly, we observed no significant association between the visual TC score and overall survival. When split into three groups of <10%, 10–29%, and ≥30% TCs, we also detected no statistically significant association between the TC score and RFS (log-rank p = 0.63; Fig 5). In neither the univariate nor the multivariate Cox regression analysis did any of the TC thresholds correlate with a reduction in RFS. Nor did the three-group split of <10% TC, 10–29% TC, and ≥30% TC correlate with a reduction in RFS (Table 3).

3.5. Tumor relapse samples

When visually evaluating the 71 slides from tumor relapses, the median TC score group was 10–19%. The median algorithm-based TC score was 27.3% (SD ± 11.45%), which is comparable to the 31.6% reported for primary tumors among the adverse outcome patients (p = 0.36). The interrater agreement analysis yielded a weighted kappa value of 0.22 (SD ± 0.07) when samples were divided into TC score groups of <10%, 10–29%, and ≥30% (Table 4). When analyzing the TC score groups based on 10% increments, the agreement analysis yielded a weighted kappa value of 0.25 (SD ± 0.06; S2 Table in S1 File). The algorithm TC scoring of the relapse samples can be visually assessed via the following link: https://tinyurl.com/bde78hby.

Table 4. Interrater agreement between the visually assessed tall cell (TC) percentage and the algorithm-based TC score in tumor relapse samples.

		Algorithm TC score assessment
		<10%	10–29%	≥30%	Total
Visual TC score assessment	<10%	2	23	6	31
	10–29%	0	7	9	16
	≥30%	0	7	17	24
	Total	2	37	32	71

4. Discussion

TCV PTC results in more adverse outcomes compared with the classical variant of PTC and should, therefore, be treated more aggressively [23]. Here, we present a TCV algorithm that quantifies the percentage of tumor epithelial area containing TCs in whole-slide images from H&E-stained PTC samples with a high sensitivity and specificity (https://tinyurl.com/bde78hby). Survival analysis demonstrated that the algorithm-based TC score significantly predicts relapse-free survival (RFS), whereas we detected no statistically significant association between the visually assessed TC score and RFS.

Inter- and intra-observer variability represents a major challenge in TCV classification [3, 14], and an improved reproducibility of TC identification and quantification is needed. Although the WHO defines TC as a cell that is two to three times as tall as it is wide, it is quite difficult for humans to strictly follow this rule when visually evaluating a PTC slide [13]. We hypothesized that using a DL algorithm to evaluate the TC score could meet the demand for a more objective and more consistent means of evaluating the presence and number of TCs in a sample. As debate continues related to the optimal cutoff point for visually assessed TCV [12], automated methods similar to that proposed here could be used to more systematically analyze and establish TC score cutoff points that provide clinically meaningful subgroups according to the proportion of TCs in PTC tissue samples.

The highest TC score given by the algorithm was only 52%, indicating more conservative scoring than that of the human investigator. Interestingly, the TCV algorithm also more rarely identified a tumor sample as having a low TC score of <10% (Table 2). Thus, it is possible that the TC score is both over- and underestimated through visual assessment. When studying TC thresholds using 10% increments, no statistically significant association was observed between the visually assessed TC score and relapse-free survival at any of the thresholds. Using the algorithm-based TC score assessment, we observed a significant association between a higher TC score and a less favorable RFS using thresholds set to 10%, 20%, and 30% (Fig 4). This finding agrees with previous studies emphasizing the clinical impact of a low percentage of TCs and suggests that percentages as low as 10% should be reported by pathologists [5, 10]. In univariate Cox regression analysis, a significant reduction in RFS could be observed for both the 20% and 30% TC thresholds for the algorithm TC scoring. In multivariate analysis, we likewise observed a significant reduction with the 20% and 30% thresholds. In an age-adjusted multivariate Cox regression analysis of the three-group split of algorithm TC scoring using <10% TC as the reference, we observed a significant reduction in RFS for the ≥30% TC group but not for the 10–29% TC group (Table 3). These findings support the use of a 30% threshold for diagnosing TCV, which is in line with the WHO’s recommendations [9]. The findings also suggest that cases with 10–29% TCs (HR = 4.81) could have a worse prognosis compared with <10% TC, but the sample size of the current study was insufficient to demonstrate this in a multivariate Cox regression analysis (p = 0.13) (Table 3).

Patients in the validation case–control cohort were all diagnosed within the same hospital district. Thus, they were all offered similar initial treatment—that is, surgery in combination with radioiodine ablation therapy. This rather aggressive initial treatment protocol could explain why we found so few cases of PTC with an adverse outcome when collecting data retrospectively. Aggressive disease was classified as a tumor relapsing at least twice. Such cases could potentially have been missed or even misclassified as control cases due to the initial aggressive treatment protocol. This could be considered a limitation of the present study, and, in future studies, we recommend carrying out a multicenter study to limit the impact of treatment protocols adopted by specific hospital districts.

In the present study, we used a commercially available image management and machine learning platform (Aiforia Create, Aiforia Technologies Oy, Helsinki, Finland). The exact architecture of the algorithm is proprietary and thus could not be reported, which is a limitation. However, the same platform can be used in future studies to fully reproduce the experiment.

Previously, little research focused on the morphologies of PTC metastases. Therefore, we also assessed the TC score in 71 metastatic samples obtained from 32 patients with relapses, that is, aggressive disease, using both visual evaluation and the algorithm-based TC score. We hypothesized that a higher TC score could be seen in samples taken from metastatic tissue. This, however, seems not to be the case, since we observed no statistically significant difference between the median TC score in the primary tumors versus the relapse tumors (31.6% and 27.3%, respectively; p = 0.36). This result suggests that the TCV morphology is retained in metastatic tissue as well. Thus, TC score quantification could possibly also be completed for metastatic samples.

One strength of this study lies in the carefully matched adverse outcome case–control cohort we used as the validation dataset. Patients were matched by age at diagnosis (within 10 years), sex, and tumor stage. All patients in the validation dataset were treated within the same hospital district and offered a similar initial treatment. However, adhering to such stringent criteria also limited the cohort size to only 90 patients. Since death from disease is a rare outcome in PTC, we defined an adverse outcome as having at least two relapses, distant metastases at primary diagnosis, or death from disease. Including patients with two relapses in the group representing adverse PTC could mean that some more indolent cases were classified as aggressive. Using death from disease as the single criterion for aggressive disease would be preferable but leads to a low number of events in single-center series. Another strength lies in the training dataset used for the TC score algorithm. During training, we used a previous PTC series of 70 WSIs. Furthermore, we included 30 WSIs from the TCGA database in order to broaden the spectrum of stain variations in the training set and, thus, improve the generalizability of the TCV algorithm. The colors and contrasts in the selected H&E-stained TCGA WSIs visually differed from the corresponding H&E-stained slides in the Helsinki series used to train the algorithm. Furthermore, these WSIs contained morphologies of non-TCV PTC as well as those of TCV. To address the potential variability in the sample properties, we utilized both color and scale augmentations of the training samples to improve the generalizability. This resulted in a high sensitivity and specificity in the independent test set and also visually accurate segmentation of both TC and non-TC tumor epithelium in the validation set.
However, we expect that the performance will decline when testing algorithms on samples from different centers [24]. Thus, in future studies, the performance of the trained TC score algorithm requires validation on external datasets within a multicenter validation study.

In conclusion, we show that the DL-based algorithm was better than the human observer in identifying TCVs. The algorithm could prove useful as a clinical tool for pathologists when evaluating PTC samples and can potentially significantly improve the consistency of TCV case assessment. To our knowledge, no such algorithm has previously been described. The results indicate that a 30% threshold should be used in diagnosing TCV. However, all cases with more than 10% TCs should be included in the pathologist’s reports. Our results also suggest that a higher TC score in PTC assessed using the DL algorithm is associated with less favorable survival. Finally, we show that TC morphology is retained, although the proportion of the TC area does not increase, in the tumor tissue from relapsed patients, suggesting that the diagnosis of TCV can rely on metastatic tissue as well.

Supporting information

S1 File (DOCX, 19.5 KB)

Acknowledgments

We thank the FIMM Digital Microscopy and Molecular Pathology Unit supported by the Helsinki Institute of Life Science and Biocenter Finland for excellent assistance. The results reported here are in part based upon data generated by the TCGA Research Network (https://www.cancer.gov/tcga).

Data Availability

The images used to validate the deep learning based algorithm and the results of the trained algorithm can be accessed by the reader via the following link (https://cloud.aiforia.com/Public/LundinLab/eom7FCyANBZ8boHHo1hXJtsXIjcbW5M6Ujmd2bnUIx80). Clinical data of the cases included in the validation series have been collected from Helsinki and Uudenmaan sairaanhoitopiiri (HUS) and encrypted clinical data can be made available upon request. Data from the Cancer Genome Atlas (www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga) used in this study is readily available and can be accessed by the general public.

Funding Statement

This study received funding from the following sources: Syöpäsäätiö Cancer Foundation Finland funded J.A. and S.S. (www.syopasaatio.fi); Finska Läkaresällskapet funded S.S., C.H., and J.L. (www.fls.fi); K. Albin Johanssons Foundation funded S.S. (www.foundationweb.net/johansson/); Sigrid Juséliuksen Foundation funded S.S., C.H., and J.L. (www.sigridjuselius.fi); Medicinska understödsföreningen Liv och Hälsa funded N.L., C.H., J.A., and J.L. (www.livochhalsa.fi); iCAN Digital Precision Medicine Flagship funded N.L. and J.L. (www.ican.fi); and HiLIFE Helsinki Institute of Life Sciences funded N.L. and J.L. (www2.helsinki.fi/en/helsinki-institute-of-life-science). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Mazzaferri EL. Long-term outcome of patients with differentiated thyroid carcinoma: effect of therapy. Endocr Pract 2000 Nov-Dec;6(6):469–476. doi: 10.4158/EP.6.6.469
2. Siironen P, Louhimo J, Nordling S, Ristimaki A, Maenpaa H, Haapiainen R, et al. Prognostic factors in papillary thyroid cancer: an evaluation of 601 consecutive patients. Tumour Biol 2005 Mar-Apr;26(2):57–64. doi: 10.1159/000085586
3. Michels JJ, Jacques M, Henry-Amar M, Bardet S. Prevalence and prognostic significance of tall cell variant of papillary thyroid carcinoma. Hum Pathol 2007 Feb;38(2):212–219. doi: 10.1016/j.humpath.2006.08.001
4. DeLellis RA, Lloyd RV, Heitz PU, Eng C. Tumours of the thyroid and parathyroid. World Health Organization Classification of Tumours. Pathology & Genetics Tumours of Endocrine Organs Lyon: IARC Press; 2004. p. 49–66.
5. Dettmer MS, Schmitt A, Steinert H, Capper D, Moch H, Komminoth P, et al. Tall cell papillary thyroid carcinoma: new diagnostic criteria and mutations in BRAF and TERT. Endocr Relat Cancer 2015 Jun;22(3):419–429. doi: 10.1530/ERC-15-0057
6. Ganly I, Ibrahimpasic T, Rivera M, Nixon I, Palmer F, Patel SG, et al. Prognostic implications of papillary thyroid carcinoma with tall-cell features. Thyroid 2014 Apr;24(4):662–670. doi: 10.1089/thy.2013.0503
7. Ghossein R, Livolsi VA. Papillary thyroid carcinoma tall cell variant. Thyroid 2008 Nov;18(11):1179–1181. doi: 10.1089/thy.2008.0164
8. Okuyucu K, Alagoz E, Arslan N, Emer O, Ince S, Deveci S, et al. Clinicopathologic features and prognostic factors of tall cell variant of papillary thyroid carcinoma: comparison with classic variant of papillary thyroid carcinoma. Nucl Med Commun 2015 Oct;36(10):1021–1025. doi: 10.1097/MNM.0000000000000360
9. Lloyd RV, Osamura RY, Kloppel G. WHO Classification of Tumours: Pathology and Genetics of Tumours of Endocrine Organs. 4th Edition. Lyon, France: IARC; 2017.
10. Beninato T, Scognamiglio T, Kleiman DA, Uccelli A, Vaca D, Fahey TJ 3rd, et al. Ten percent tall cells confer the aggressive features of the tall cell variant of papillary thyroid carcinoma. Surgery 2013 Dec;154(6):1331–6; discussion 1336. doi: 10.1016/j.surg.2013.05.009
11. Stenman S, Siironen P, Mustonen H, Lundin J, Haglund C, Arola J. The prognostic significance of tall cells in papillary thyroid carcinoma: A case-control study. Tumour Biol 2018 Jul;40(7):1010428318787720. doi: 10.1177/1010428318787720
12. Baloch ZW, LiVolsi VA. Special types of thyroid carcinoma. Histopathology 2018 Jan;72(1):40–52. doi: 10.1111/his.13348
13. Hernandez-Prera JC, Machado RA, Asa SL, Baloch Z, Faquin WC, Ghossein R, et al. Pathologic Reporting of Tall-Cell Variant of Papillary Thyroid Cancer: Have We Reached a Consensus? Thyroid 2017 Dec;27(12):1498–1504. doi: 10.1089/thy.2017.0280
14. Ghossein RA, Leboeuf R, Patel KN, Rivera M, Katabi N, Carlson DL, et al. Tall cell variant of papillary thyroid carcinoma without extrathyroid extension: biologic behavior and clinical implications. Thyroid 2007 Jul;17(7):655–661. doi: 10.1089/thy.2007.0061
15. Xing F, Yang L. Robust Nucleus/Cell Detection and Segmentation in Digital Pathology and Microscopy Images: A Comprehensive Review. IEEE Rev Biomed Eng 2016;9:234–263. doi: 10.1109/RBME.2016.2515127
16. Dif N, Elberrichi Z. Deep Learning Methods for Mitosis Detection in Breast Cancer Histopathological Images: A Comprehensive Review. 2020. p. 279–306.
17. Stenman S, Bychkov D, Kucukel H, Linder N, Haglund C, Arola J, et al. Antibody Supervised Training of a Deep Learning Based Algorithm for Leukocyte Segmentation in Papillary Thyroid Carcinoma. IEEE J Biomed Health Inform 2021 Feb;25(2):422–428. doi: 10.1109/JBHI.2020.2994970
18. Linder N, Taylor JC, Colling R, Pell R, Alveyn E, Joseph J, et al. Deep learning for detecting tumour-infiltrating lymphocytes in testicular germ cell tumours. J Clin Pathol 2019 Feb;72(2):157–164. doi: 10.1136/jclinpath-2018-205328
19. Corredor G, Wang X, Zhou Y, Lu C, Fu P, Syrigos K, et al. Spatial Architecture and Arrangement of Tumor-Infiltrating Lymphocytes for Predicting Likelihood of Recurrence in Early-Stage Non-Small Cell Lung Cancer. Clin Cancer Res 2019 Mar 1;25(5):1526–1534. doi: 10.1158/1078-0432.CCR-18-2013
20. Bychkov D, Linder N, Turkki R, Nordling S, Kovanen PE, Verrill C, et al. Deep learning based tissue analysis predicts outcome in colorectal cancer. Sci Rep 2018 Feb 21;8(1):3395-018-21758-3. doi: 10.1038/s41598-018-21758-3
21. Couture HD, Williams LA, Geradts J, Nyante SJ, Butler EN, Marron JS, et al. Image analysis with deep learning to predict breast cancer grade, ER status, histologic subtype, and intrinsic subtype. NPJ Breast Cancer 2018 Sep 3;4:30-018-0079-1. eCollection 2018.
22. Cooper LA, Demicco EG, Saltz JH, Powell RT, Rao A, Lazar AJ. PanCancer insights from The Cancer Genome Atlas: the pathologist’s perspective. J Pathol 2018 Apr;244(5):512–524. doi: 10.1002/path.5028
23. Coca-Pelaz A, Shah JP, Hernandez-Prera JC, Ghossein RA, Rodrigo JP, Hartl DM, et al. Papillary Thyroid Cancer-Aggressive Variants and Impact on Management: A Narrative Review. Adv Ther 2020 Jul;37(7):3112–3128. doi: 10.1007/s12325-020-01391-1
24. Wang F, Casalino LP, Khullar D. Deep Learning in Medicine—Promise, Progress, and Challenges. JAMA Intern Med 2019;179(3):293–294.

PLoS One. doi: 10.1371/journal.pone.0272696.r001

Decision Letter 0

Jason Chia-Hsun Hsieh

24 Aug 2021

PONE-D-21-21384

A deep learning–based algorithm for tall cell detection in papillary thyroid carcinoma

PLOS ONE

Dear Dr. Stenman,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

==============================

ACADEMIC EDITOR: There are still some minor issues that need to be addressed. Please kindly respond to the reviewers' comments.

==============================

Please submit your revised manuscript by Oct 08 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Jason Chia-Hsun Hsieh, M.D. Ph.D

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests

3. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

4. Please include a copy of Table 2 which you refer to in your text on page 8.

5. We note you have included a table to which you do not refer in the text of your manuscript. Please ensure that you refer to Table 5 in your text; if accepted, production will need this reference to link the reader to the Table.

6. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

7. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments (if provided):

There are still some minor issues that need to be addressed. Please kindly respond to the reviewers' comments.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: No

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: General Comments: The authors aimed to train a deep learning algorithm to detect the proportion of tall cell variants in papillary thyroid carcinomas. Comparisons were made to scoring by a human investigator, 71 samples were used for training and testing, with an additional 90 samples used for validation.

Specific Comments:

1. Methods, Patient Series, Training Series: The algorithm is reported to have been trained on 100 whole slide images (70 from HUH; 30 from TCGA), however it is not explicitly stated whether these are from separate individuals. Please indicate if these are from separate subjects and, if not, how many unique individuals provided samples.

2. Methods, Patient Series, Testing Series: The authors include a separate validation set of 90 patients and were careful to include both cases (with adverse outcomes) and controls. This is sound practice. However, the authors mention that 71 tissue samples were obtained only from those in the adverse outcome group. It is not clear whether the validation sample size is 90 or 71, and if the latter, what the point of including the controls was.

3. Methods, Training of Deep Learning Algorithm: The authors mention 2,674 manual annotations were used for training and 296 were used for testing. The large numbers involved help justify the somewhat extreme 90/10 training/testing set ratio (something like 70/30 is more standard), but it is not clear where these large numbers come from, given that the previously reported sample size was 90.

4. Methods, Training of Deep Learning Algorithm: The authors performed their training/testing split once and missed an opportunity to determine the stability of their algorithm and the validity of their results by performing different splits of the data into training/testing and reproducing the model. Please justify why only a single split was made.

5. Methods, Statistical Analysis: The methods appear sound, though they are slightly under-reported. For instance, the log-rank test and proportional hazard regression model are used for survival analysis, but for what purpose? Please specify the outcomes and fixed effects, as well as the goals for these analyses. The reporting for exact tests and kappa statistics in the same paragraph are excellent examples of the detail needed.

6. Methods, Statistical Analysis: There is shockingly little detail provided about the neural network / deep learning approach. Input and output nodes, the number of hidden layers, the number of neurons, are not reported or discussed. Please provide these details – needed for reproducibility – along with justification for those choices. A Figure would also be helpful.

7. Results: In general, the findings are thoroughly and adequately reported. It is not precisely clear what role the validation sample played in generating them. Please state which results are from testing and which are from validation.

8. Results: The authors do not appear to have calculated sensitivity and specificity for the validation set. This seems an important omission. Please report or state how and why the data preclude such calculations.

9. Discussion: The authors state several strengths of their work without explicitly stating any limitations. Surely there must be some!

Reviewer #2: Interesting premise, but I have the following questions about study design:

In the validation set all samples were from patients with at least 5 years follow up. This means that any patients with a cancer with a death or other event that led to loss of follow up by 5 years were not included, thus the population was a healthier population than the expected population with this cancer. How was this bias addressed in the study?

Patients with an adverse outcome - what was the time frame for this observation?

How was the matching accounted for in the analyses?

How was median relapse free survival assessed? What events were included/ censoring events?

Table 1: include percentages to show if the rates were similar in the two groups for categorical variables.

Definition of overall survival?

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS One. 2022 Aug 9;17(8):e0272696. doi: 10.1371/journal.pone.0272696.r002

Author response to Decision Letter 0

10 Nov 2021

Rebuttal letter

Please find the comments of the reviewers and the rebuttal below:

Reviewer #1:

General Comments: The authors aimed to train a deep learning algorithm to detect the proportion of tall cell variants in papillary thyroid carcinomas. Comparisons were made to scoring by a human investigator, 71 samples were used for training and testing, with an additional 90 samples used for validation.

The primary aim of the study was to train a deep learning-based algorithm to detect tall cells in papillary thyroid carcinoma (PTC) and to compare the algorithm's results with those of a human investigator as well as with patient outcomes. A total of 100 whole-slide images (WSIs) from 100 separate PTC patients were used for training and testing the algorithm: 70 patients and corresponding WSIs from the Hospital District of Helsinki and Uusimaa, and the remaining 30 patients and WSIs from the Cancer Genome Atlas Program (TCGA). The validation series, consisting of 90 PTC WSIs from 90 separate age-, sex-, and T stage-matched PTC patients, was used to validate the trained algorithm and to assess how it performed compared to a human investigator.

Furthermore, 71 tissue slides from 71 separate relapses in the 34 patients with aggressive disease in the validation series (figure 1) were analysed. As a secondary aim of the study, these tissue slides representing cases with relapse were also assessed by the algorithm as well as by the human investigator to study whether the TC morphology of the primary tumor could also be seen in the relapse morphology.

Comment: 1. Methods, Patient Series, Training Series: The algorithm is reported to have been trained on 100 whole slide images (70 from HUH; 30 from TCGA), however it is not explicitly stated whether these are from separate individuals. Please indicate if these are from separate subjects and, if not, how many unique individuals provided samples.

Rebuttal: The 100 total samples in the training set all originated from separate patients.

All formalin-fixed paraffin-embedded (FFPE) tissue samples from all 70 patients originating from the Hospital District of Helsinki and Uusimaa (HUS) were collected and reviewed by two researchers (S.S., J.A.). One representative FFPE tissue block was selected for each of the patients. New tissue slides were cut, stained with hematoxylin and eosin (HE), digitally scanned, and imported to the image management platform used to train the algorithm (Aiforia, Aiforia Technologies Oy, Helsinki, Finland). To improve the generalizability of the trained algorithm, 30 additional tissue slides originating from the Cancer Genome Atlas Program (TCGA) were included. These 30 slides came from separate patients and were chosen based on visual assessment. Please see the CONSORT diagram of the different datasets used in the study (figure 1).

Adjustments made: Manuscript edited, and figure added to clarify that the training samples originated from separate patients (page 3, Methods section, 2.1.1. Training series, row 2 and the newly added figure (figure 1) CONSORT flow diagram on page 5, Methods section, 2.3. Training of the deep learning algorithm, rows 3-5, figure 1, page 4).

Comment: 2. Methods, Patient Series, Testing Series: The authors include a separate validation set of 90 patients and were careful to include both cases (with adverse outcomes) and controls. This is sound practice. However, the authors mention that 71 tissue samples were obtained only from those in the adverse outcome group. It is not clear whether the validation sample size is 90 or 71, and if the latter, what the point of including the controls was.

Rebuttal: The validation set included 90 patients; 34 patients with an adverse outcome and 56 age-, gender-, and T stage matched control cases.

The main objective of the study was to validate the trained algorithm using a case-control approach and to study how the TC score of the algorithm correlated with disease outcome. A secondary objective of the study was to test the hypothesis that the tall cell composition is increased in relapse samples compared to the corresponding primary tumor sample. To analyse this, we collected all relapse/metastasis FFPE tumor samples (e.g. local tumor recurrence or lymph node metastases) from the 34 patients with an adverse outcome. In the same manner as with the primary tumor samples, all FFPE samples were reviewed by two researchers (S.S., J.A.) and one representative FFPE tissue block was selected for each of the 71 relapses. The FFPE blocks were then cut, and the tissue samples were stained with HE, digitized, and imported to an image management platform. After the algorithm was validated on the 90 primary tumor samples in the validation set, it was used to analyse the 71 relapse tumor samples. In the study, we concluded that the tall cell morphology is retained but not increased in relapse samples compared to the primary tumor morphology.

Adjustments made: Text edited to outline the aims of the study more clearly (Page 3, Introduction section, rows 22-24). A new subheader “Relapse series” was inserted to highlight that these WSIs are not included in the validation series (page 5, Methods section, 2.1.3. Relapse series). Also, the “digitization of slides” subheader was deleted and its content embedded into the “Validation series” text (Page 4, Methods section, 2.1.2 Validation series, rows 7-16). A CONSORT diagram was added to illustrate the different datasets more clearly (figure 1).

Comment: 3. Methods, Training of Deep Learning Algorithm: The authors mention 2,674 manual annotations were used for training and 296 were used for testing. The large numbers involved help justify the somewhat extreme 90/10 training/testing set ratio (something like 70/30 is more standard), but it is not clear where these large numbers come from, given that the previously reported sample size was 90.

Rebuttal: In total, the training set consisted of 100 tumor samples; 70 samples originating from Helsinki University Hospital and the remaining 30 from the TCGA database. However, the algorithm was trained on 2,674 manual annotations of regions of interest within these 100 patient samples. The manual annotations were done by one researcher (S.S.).

The reviewer brings up a good point that a training/testing set ratio of 80/20 or even 70/30 is more common and perhaps more standard. However, a 90/10 split was justified by the quantity of manual annotations: the training set contained as many as 2,674 manual annotations, and the test set consisted of 296 manual annotations of regions of interest. A power calculation is reported below to illustrate that, for analysis of the sensitivity of the algorithm on the annotation level, this number can be considered sufficient.

Changes made: Added a new figure of a CONSORT flow diagram outlining the datasets used in the study (figure 1, page 4).

Comment: 4. Methods, Training of Deep Learning Algorithm: The authors performed their training/testing split once and missed an opportunity to determine the stability of their algorithm and the validity of their results by performing different splits of the data into training/testing and reproducing the model. Please justify why only a single split was made.

Rebuttal: If we assume the prevalence of TCs to be 30% (estimated based on visual assessment) in the manual annotations used for training and testing, our calculations indicate that 244 manual annotations are needed in the test set for the 95% confidence interval of the estimated sensitivity to stay within an acceptable ±5%, given an expected sensitivity of 95%. The 296 manual annotations exceed this number, and we therefore deemed that one split sufficed. Given the sensitivity level for detection of TCs in the test set, we expect a similar sensitivity level, i.e. 95%, in the validation set. However, the reviewer is correct in that multiple splits could perhaps have determined the stability of the model in more detail.
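The rebuttal does not state which formula was used for this calculation. As a minimal sketch, a common normal-approximation formula for estimating sensitivity to a given precision (often attributed to Buderer) gives the same figure of 244 with the stated inputs. The choice of formula and the function name `annotations_needed` are assumptions made for illustration, not the authors' documented method.

```python
import math
from statistics import NormalDist


def annotations_needed(expected_sensitivity, half_width, prevalence, confidence=0.95):
    """Total test-set annotations needed to estimate sensitivity.

    Assumed formula (normal approximation):
        positives = z^2 * Se * (1 - Se) / d^2
        total     = positives / prevalence
    where d is the half-width of the confidence interval.
    """
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    positives = z ** 2 * expected_sensitivity * (1 - expected_sensitivity) / half_width ** 2
    # Only a fraction `prevalence` of annotations are TC-positive.
    return math.ceil(positives / prevalence)


# Inputs quoted in the rebuttal: Se = 95%, +/-5% precision, 30% TC prevalence.
required = annotations_needed(0.95, 0.05, 0.30)
print(required)         # 244
print(296 >= required)  # True: the test set exceeds the requirement
```

Note that this calculation addresses only the precision of the sensitivity estimate from a single split; it does not measure how stable the model would be across repeated splits, which is the point the reviewer raised.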

Comment: 5. Methods, Statistical Analysis: The methods appear sound, though they are slightly under-reported. For instance, the log-rank test and proportional hazard regression model are used for survival analysis, but for what purpose? Please specify the outcomes and fixed effects, as well as the goals for these analyses. The reporting for exact tests and kappa statistics in the same paragraph are excellent examples of the detail needed.

Rebuttal: As death from disease is a rather rare event in PTC, relapse was used as the primary outcome in the survival analysis of the data. Relapse-free survival was defined as the time from diagnosis to relapse (increase in serum thyroglobulin levels or histological confirmation). Patients who died of other causes or were otherwise lost to follow-up (e.g. treatment moved to another hospital district) were censored.

Changes made: Clarification of parameters used in survival analysis added (page 7, Methods section, 2.4. Statistical analysis, rows 9-12)
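To illustrate how censoring enters a relapse-free survival estimate under the definition above, the following is a minimal Kaplan-Meier sketch in plain Python. The cohort values are hypothetical toy data, not study data, and the function `kaplan_meier` is illustrative only; the statistical software used by the authors is not described in this letter.

```python
def kaplan_meier(times, relapsed):
    """Kaplan-Meier estimate of relapse-free survival.

    times    -- follow-up in years from diagnosis
    relapsed -- True if follow-up ended in relapse (event),
                False if the patient was censored
    Returns a list of (time, survival probability) at each event time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        events = sum(1 for ti, e in zip(times, relapsed) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)
        if events:
            # Only relapses lower the curve.
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        # Relapsed and censored patients both leave the risk set.
        at_risk -= leaving
    return curve


# Hypothetical toy cohort of five patients (not study data).
times = [1.0, 2.0, 3.0, 4.0, 5.0]
relapsed = [True, False, True, False, False]

curve = kaplan_meier(times, relapsed)
for t, s in curve:
    print(f"{t:.1f} years: {s:.3f}")
```

In this toy cohort, the patient censored at 2 years does not lower the curve but does shrink the risk set, so the relapse at 3 years is weighted as 1 of 3 patients still at risk rather than 1 of 4.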

Comment: 6. Methods, Statistical Analysis: There is shockingly little detail provided about the neural network / deep learning approach. Input and output nodes, the number of hidden layers, the number of neurons, are not reported or discussed. Please provide these details – needed for reproducibility – along with justification for those choices. A Figure would also be helpful.

Rebuttal: The exact architecture of the artificial neural network is not available and remains proprietary to the provider of the software (Aiforia Technologies Oy, Helsinki, Finland). This could be considered a limitation and has been added as such in the manuscript. However, as the platform is commercially available, it is possible for other researchers to access and use the platform for training of deep learning-based algorithms. Also, the image augmentation details are provided in the manuscript which makes the method fully reproducible.

Comment: 7. Results: In general, the findings are thoroughly and adequately reported. It is not precisely clear what role the validation sample played in generating them. Please state which results are from testing and which are from validation.

Rebuttal: The validation set in the study was used for survival statistics and to compare the algorithm’s scores with those of a human investigator. The validation set was a held-out set and thus not included in training or testing of the algorithm.

Changes made: Added figure 1, page 4, and edited the text according to earlier comments to clarify this.

Comment: 8. Results: The authors do not appear to have calculated sensitivity and specificity for the validation set. This seems an important omission. Please report or state how and why the data preclude such calculations.

Rebuttal: The sensitivity and specificity were calculated based on the manual annotations in the test set. The validation set had no ground truth TC labels, so sensitivity and specificity could not be calculated for the validation dataset.

Currently, pathologists assess the TC score visually, mostly using traditional microscopes. The assessment is done at the slide level, as it would be nearly impossible for a human to assess the TC score at the cell level throughout the whole slide. The TC score is then usually reported in 10% increments. The human TC scoring in the study was performed in the same way: the representative slides were evaluated and a TC score was reported for each representative slide. The algorithm in the study consisted of two layers: the first detected tumor tissue, and the second registered both TC area and non-TC area within the tumor tissue area. The second layer analyzed the tissue at a much higher magnification, resulting in cell-level analysis. Because the human scoring was done at the slide level, no cell-level ground truth annotations existed in the validation set, and therefore neither specificity nor sensitivity could be calculated.
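As a rough sketch of how a slide-level TC score could be derived from the two area classes described above, the snippet below divides the TC area by the total classified tumor area and checks the result against the 10%, 20%, and 30% cutoffs analysed in the study. The exact computation inside the commercial platform is not described, so the formula, the use of pixel counts, the "greater than or equal to" comparison, and the function names are all assumptions.

```python
def tc_score(tc_area, non_tc_area):
    """Percentage of classified tumor area labelled as tall cell.

    Assumed definition: TC area / (TC area + non-TC area) * 100.
    Areas can be in any consistent unit, e.g. pixel counts.
    """
    total = tc_area + non_tc_area
    if total <= 0:
        raise ValueError("no tumor area detected")
    return 100.0 * tc_area / total


def meets_cutoffs(score, cutoffs=(10, 20, 30)):
    """Map each cutoff (in percent) to whether the score reaches it."""
    return {cutoff: score >= cutoff for cutoff in cutoffs}


# Hypothetical pixel counts for one slide (not study data).
score = tc_score(tc_area=24, non_tc_area=96)
print(score)                 # 20.0
print(meets_cutoffs(score))  # {10: True, 20: True, 30: False}
```

A score computed this way is continuous, whereas the visual score is typically reported in 10% increments, which is one reason the two assessments may differ near a cutoff.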

Comment: 9. Discussion: The authors state several strengths of their work without explicitly stating any limitations. Surely there must be some!

Rebuttal: The validation series of the study was collected entirely from within the same hospital district. During the time of sample collection, the hospital district offered a rather aggressive initial treatment protocol to adequately treat all aggressive cases, but may have overtreated the more indolent PTCs. This, in turn, could lead to misclassification of the cases included in the study. As mentioned in the discussion section, we highly recommend multicenter studies in the future to limit the impact of treatment protocols within a specific hospital district.

Adjustments made: Explicitly mentioning the limitation of the study cohort (Page 12, discussion section, rows 7-8)

Reviewer #2:

Interesting premise, but I have the following questions about study design:

Comment: In the validation set all samples were from patients with at least 5 years follow up. This means that any patients with a cancer with a death or other event that led to loss of follow up by 5 years were not included, thus the population was a healthier population than the expected population with this cancer. How was this bias addressed in the study?

Rebuttal: The validation set included patients who were diagnosed with papillary thyroid carcinoma (PTC) between 2003 and 2013. By design, the follow-up time extended to 2018, which allowed all patients included in the study to have a 5-year follow-up time. Patients who died of causes other than PTC, or who were censored for other reasons within 5 years of the diagnosis date, were also included in the study. This is why, even though the design allowed for a minimum 5-year follow-up, some patients had a shorter follow-up, and the follow-up for the study ranged from 2.1 years to 15.8 years (median 10.1 years), as stated in the manuscript (page 5, Methods, Validation series, rows 6-7).

Adjustments made: Text edited to clarify the follow-up time (page 4, Methods section, 2.1.2. Validation series, rows 2-3)

Comment: Patients with an adverse outcome - what was the time frame for this observation?

Rebuttal: In the present study, an adverse outcome was defined as a PTC case with at least two recurrences (histological confirmation or serum thyroglobulin elevation during follow-up), distant metastases, or death from PTC. Distant metastases at primary diagnosis were also classified as an adverse outcome; the other criteria were assessed during follow-up.

Comment: How was the matching accounted for in the analyses?

Rebuttal: Both the human investigator and the deep learning-based algorithm were blinded to the clinical outcome and thus also to the matching when giving TC scores to the cases.

Comment: How was median relapse free survival assessed? What events were included/ censoring events?

Rebuttal: Relapse free survival was defined as time from diagnosis to first relapse. A relapse was defined as serum thyroglobulin elevation during follow-up or histological confirmation. Patients for whom follow-up ended before relapse for different reasons (e.g. moving to another hospital district or death from other disease) were censored in the survival analysis.
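The event and censoring rules described above can be sketched with a small Kaplan-Meier estimator. This is a minimal illustration in plain Python, not the statistical software used in the study; the function name and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each time with at least one relapse.

    times  -- follow-up time per patient (e.g. years from diagnosis)
    events -- 1 if a relapse was observed at that time, 0 if the patient
              was censored (end of follow-up, moved away, death from
              another disease)
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_total = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_total += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        # Censored patients leave the risk set without lowering survival.
        at_risk -= n_total
    return curve


# Hypothetical follow-up data, for illustration only (not study data).
times = [2.0, 3.5, 3.5, 5.0, 7.0, 10.0]
events = [1, 0, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t={t:.1f}  relapse-free survival={s:.3f}")
```

Censored patients contribute to the number at risk up to their last follow-up but never count as events, which is how the definition above treats patients lost to follow-up.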

Adjustments made: Clarification of text (page 7, Methods section, 2.4. Statistical analysis, rows 9-12)

Comment: Table 1: include percentages to show if the rates were similar in the two groups for categorical variables.

Changes made: Percentages added to appropriate variables in table 1.

Comment: Definition of overall survival?

Rebuttal: Overall survival was defined as time from diagnosis to death from any cause.

Adjustments made: Clarification of text (page 7, Methods section, 2.4. Statistical analysis, rows 9-12)

Attachment

Submitted filename: Rebuttal_letter_final.docx


PLoS One. doi: 10.1371/journal.pone.0272696.r003

Decision Letter 1

Jason Chia-Hsun Hsieh

17 Jan 2022

PONE-D-21-21384R1

A deep learning–based algorithm for tall cell detection in papillary thyroid carcinoma

PLOS ONE

Dear Dr. Stenman,

==============================

ACADEMIC EDITOR: We are sorry for the delayed evaluation. Some questions require further clarification.

==============================

Please submit your revised manuscript by Mar 03 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Jason Chia-Hsun Hsieh, M.D. Ph.D

Academic Editor

PLOS ONE

Additional Editor Comments (if provided):

Some questions require further clarification.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: While the authors responded to my comment (#6, reproduced below), they did not in fact mention the proprietary nature of the neural network approach in the limitations section.

Comment: 6. Methods, Statistical Analysis: There is shockingly little detail provided about the neural network / deep learning approach. Input and output nodes, the number of hidden layers, the number of neurons, are not reported or discussed. Please provide these details – needed for reproducibility – along with justification for those choices. A Figure would also be helpful.

Rebuttal: The exact architecture of the artificial neural network is not available and remains proprietary to the provider of the software (Aiforia Technologies Oy, Helsinki, Finland). This could be considered a limitation and has been added as such in the manuscript. However, as the platform is commercially available, it is possible for other researchers to access and use the platform for training of deep learning-based algorithms. Also, the image augmentation details are provided in the manuscript which makes the method fully reproducible.

Reviewer #2: Regarding the comment of:

"How was the matching accounted for in the analyses?" - this requires a statistical analyses response. For example - In the Cox proportional hazards regression, adjustment was made for the matching variables of age, gender, tumour stage...

Table 3 could be updated to include multiple variable analyses adjusted for age group, gender, tumour stage.

Is it possible that some controls would be cases if observed for a longer period of time? How did 14 patients experience relapse, but were considered controls (if followed for longer, could a second relapse have been observed, making them cases)? It seems that the two groups were linked with the relapse free survival time - thus any analysis of the effect of group on relapse free survival would be biased. It would not be possible to define which group a patient would fall into at baseline.

In table 1 median age at diagnosis for cases is 43.4 years, but in text above it says 41.0 years - which is correct? Similar 41.7 and 41.5 are both quoted for the controls - which is right?

Figure 5: the third line in the number at risk - the label should be >=30% rather than 10% to match with the legend.

Include in table 3 the three group split as shown in figure 5, including univariate and multiple variable analyses with this grouping.

The question is - is a three group split better than a two group split - and this is difficult to tell with these analyses.

In the definition of relapse free survival I assume that relapse and death are events, end of follow up is a censoring variable? It is not clear in this sentence.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS One. 2022 Aug 9;17(8):e0272696. doi: 10.1371/journal.pone.0272696.r004

Author response to Decision Letter 1

3 Mar 2022

Rebuttal letter

Reviewer #1:

While the authors responded to my comment (#6, reproduced below), they did not in fact mention the proprietary nature of the neural network approach in the limitations section.

Rebuttal: The architecture of the algorithm is not available in exact detail, so it would indeed not be possible to recreate the model without knowing the algorithm structure. With that said, the image management and machine learning platform used in the study is commercially available for anyone to use. As the parameter settings that can be changed in the platform (e.g., feature size, iterations, image augmentations etc.) have been described in detail in the manuscript, the method is considered fully reproducible.

Changes made: It is now mentioned that the exact architecture of the algorithm is proprietary and that it could be considered a limitation. (Page 12, Discussion section, rows 26-29)

Reviewer #2:

1. "How was the matching accounted for in the analyses?" - this requires a statistical analyses response. For example - In the Cox proportional hazards regression, adjustment was made for the matching variables of age, gender, tumour stage...

Table 3 could be updated to include multiple variable analyses adjusted for age group, gender, tumour stage.

Rebuttal: The cases were matched when the samples were collected for the study. The matching therefore affected the entire study, including the statistical analyses, and was accounted for by the study design. In the Cox multivariate regression analysis, the model could not be fitted with all four variables the reviewer suggested because of diminishing power and overfitting. We performed a Cox multivariate regression analysis adjusted for age (using a threshold of 45 years) and updated the manuscript and table 3 accordingly.

Changes made: Included the recent changes in the manuscript and updated table 3 according to the findings. (Page 9, results section, 3.3. Algorithm-based tall cell score and survival subsection, rows 8-11 and 3.4 Visually assessed tall cell score and survival, rows 10-11, page 12, Discussion section, rows 11-18. Table 3, page 10, reworked and corrected)

2. Is it possible that some controls would be cases if observed for a longer period of time? How did 14 patients experience relapse, but were considered controls (if followed for longer, could a second relapse have been observed, making them cases)? It seems that the two groups were linked with the relapse free survival time - thus any analysis of the effect of group on relapse free survival would be biased. It would not be possible to define which group a patient would fall into at baseline.

Rebuttal: Papillary thyroid carcinoma (PTC) often has a very good prognosis, even with local recurrence. Thus, we designed the present study such that an adverse outcome was defined as two or more relapses (thyroglobulin elevation or histological confirmation), distant metastasis (at primary diagnosis or during follow-up), or death from PTC. Each case with an adverse outcome was matched with one to two control cases by age at diagnosis (within 10 years), gender, and tumor stage. The cases were collected between 2003 and 2013, and follow-up ended in 2018. This allowed all included cases at least a 5-year follow-up. By design, this cohort includes a concentrated number of cases with aggressive disease and as such does not represent the general population. As tall cell variants are rather rare, the concentration of aggressive cases allowed us to study the prognostic impact of tall cells with a rather small sample size, which was the aim of the study design.

In the survival analysis the cohort was analyzed as a whole and not grouped by aggressive or control. For example, analyzing a 30% tall cell threshold, the only grouping was whether the patient had less or more than 30% tall cells. Thus, the aggressive vs. control grouping would not affect the survival analysis of tall cell thresholds.

We hypothesized that patients with a single local recurrence do not have an aggressive disease and that a single recurrence could also be a result of inadequate resection of the tumor in the primary surgery. It is possible that some of the patients in the control group could have been diagnosed with additional recurrences if observed for a longer period, and longer follow-up periods are thus generally recommended. However, we felt that a 5-year follow-up was adequate for the scope of this study.

The cohort was retrospectively collected and most of the criteria for an aggressive disease indeed occurred during follow-up and would have been unknown at time of diagnosis. The reviewer correctly points out that it would not have been possible to define which group a patient would fall into at the time of diagnosis based on the aggressive disease criteria used in the study. However, our aim was to examine tall cell thresholds and how the tall cell scores correlated with outcome. As we concluded in the study, a 30% tall cell threshold did indeed correlate with a reduced relapse free survival. This could have been observed at the time of diagnosis as we studied the same histological tissue blocks that were used in the initial diagnosis. We argue that cases with 30% TC or higher should be treated more aggressively as they correlate with a more aggressive disease.

In the future, the algorithm presented here could be used by pathologists in the diagnosis of tall cell variants of papillary thyroid carcinoma and further studied using a prospective cohort.

Changes made: Discussion and conclusion sections edited to highlight the findings in the survival analysis (page 12, Discussion section, rows 11-18 and page 13-14, Discussion section, last paragraph of the manuscript)

3. In table 1 median age at diagnosis for cases is 43.4 years, but in text above it says 41.0 years - which is correct? Similar 41.7 and 41.5 are both quoted for the controls - which is right?

Rebuttal: Both numbers are correct because different statistics are reported: in the text, the median age is reported, whereas in the table the mean age is reported. This understandably causes confusion, and we have edited the manuscript to be more consistent.
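As a brief illustration of how the mean and the median of the same age distribution can differ, the following sketch uses Python's standard `statistics` module. The ages are hypothetical and are not patient data from the study.

```python
import statistics

# Hypothetical ages at diagnosis, for illustration only (not study data).
ages = [25, 38, 40, 47, 70]

mean_age = statistics.mean(ages)      # pulled upward by the oldest patient
median_age = statistics.median(ages)  # middle value, robust to outliers

print(f"mean={mean_age:.1f}, median={median_age:.1f}")
```

With a right-skewed distribution such as this one, the mean exceeds the median, so reporting one statistic in the text and the other in a table yields two different but individually correct numbers.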

Changes made: corrected the table so the median age is reported instead of the mean to avoid confusion (page 5, table 1, row 11)

4. Figure 5: the third line in the number at risk - the label should be >=30% rather than 10% to match with the legend.

Rebuttal: The reviewer correctly points out an error in the figure.

Changes made: Figure 5 corrected to match the figure legend.

5. Include in table 3 the three group split as shown in figure 5, including univariate and multiple variable analyses with this grouping.

Changes made: Table 3 changed according to the reviewer’s suggestion. (page 10)

6. The question is - is a three group split better than a two group split - and this is difficult to tell with these analyses.

Rebuttal: The WHO suggests using a 30% TC threshold, thus basically recommending a two-group split in clinical practice. This is probably preferable in clinical practice, as it is very difficult for human observers to adhere to the strict three-group split used in the study (under 10%, 10-29%, and 30% or more).

The idea of the three-group split in the statistical analysis was to separate the cases with a very small number of TCs (under 10%), which we assumed would have the best prognosis. This seems to be a valid assumption based on figure 5. However, only 8 cases had less than 10% TCs and only one relapse was observed within this group. Having separated out the cases with very few TCs, we could then study the difference between cases with some TCs (10-29%) and cases with many TCs (30% or more). Again, based on figure 5, there seems to be a difference in relapse between these two groups as well.

The reviewer asks a crucial question. To summarize: based on our results, when using the algorithm, a 30% TC threshold seems to be the correct threshold to use when diagnosing TCV, which is in line with the WHO’s current recommendations. However, all cases with TCs (≥10%) should be included in the pathologist’s report, as these cases also seem to correlate with a reduction in relapse free survival.
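The two-group and three-group splits discussed above can be sketched as simple threshold rules. The function names are hypothetical, and treating fractional scores between 29% and 30% as part of the middle group is an assumption made for this illustration rather than a rule stated in the manuscript.

```python
def tc_group(tc_percent):
    """Assign a TC score (percent of tall cells) to the three-group split."""
    if not 0 <= tc_percent <= 100:
        raise ValueError("TC score must be a percentage between 0 and 100")
    if tc_percent < 10:
        return "<10%"
    if tc_percent < 30:  # assumption: 29.5% still belongs to the middle group
        return "10-29%"
    return ">=30%"


def meets_tcv_threshold(tc_percent, threshold=30):
    """Two-group split: True if the score reaches the 30% TC threshold."""
    return tc_percent >= threshold


# Hypothetical TC scores, for illustration only (not study data).
for score in (5, 10, 29.5, 30, 80):
    print(score, tc_group(score), meets_tcv_threshold(score))
```

The three-group rule refines only the cases below the 30% threshold, so the two splits always agree on which cases reach that threshold.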

Changes made: Conclusion edited to highlight these findings (Page 13-14, Discussion section, last paragraph of the manuscript).

7. In the definition of relapse free survival I assume that relapse and death are events, end of follow up is a censoring variable? It is not clear in this sentence.

Rebuttal: Relapse free survival is the time from diagnosis to relapse, as described on page 7, Methods section, 2.4 Statistical analysis subsection, rows 9-10. The event is thus relapse, and censoring events are, e.g., end of follow-up or death.

Attachment

Submitted filename: Revised2_Rebuttal_letter.docx


PLoS One. doi: 10.1371/journal.pone.0272696.r005

Decision Letter 2

Alvaro Galli

8 Jun 2022

PONE-D-21-21384R2

A deep learning–based algorithm for tall cell detection in papillary thyroid carcinoma

PLOS ONE

Dear Dr. Stenman,

Please submit your revised manuscript by Jul 23 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.


We look forward to receiving your revised manuscript.

Kind regards,

Alvaro Galli

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #1: All comments have been addressed

Reviewer #2: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: (No Response)

Reviewer #2: Well done for the clarifications.

There is one remaining - point 7.

Typically if the endpoint name is relapse-free survival, it implies the events are relapse and death (the survival element of relapse free survival)

If relapse is the only event, it would typically be termed free-from-relapse. For clarity I would prefer this term to be used if this was the definition used.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

PLoS One. 2022 Aug 9;17(8):e0272696. doi: 10.1371/journal.pone.0272696.r006

Author response to Decision Letter 2

7 Jul 2022

Rebuttal letter

Journal Requirements:

Rebuttal: Two duplicated references (22, 23) were removed. The final reference list includes 24 references (down from 26). No retracted articles could be found among the references.

Reviewer #2:

Well done for the clarifications.

There is one remaining - point 7.

Typically if the endpoint name is relapse-free survival, it implies the events are relapse and death (the survival element of relapse free survival)

If relapse is the only event, it would typically be termed free-from-relapse. For clarity I would prefer this term to be used if this was the definition used.

Rebuttal: In the manuscript, relapse free survival (RFS) was defined as time from primary operation (i.e., initial treatment) until relapse (i.e., event) or end of follow-up (i.e., censor). Relapses were considered events, defined as an increase in thyroglobulin level or histological confirmation during follow-up. Deaths from causes other than PTC were considered censoring events in RFS. In the patient material, we saw two deaths from PTC.

For the definition we refer to National Cancer Institute’s dictionary that defines RFS as “in cancer, the length of time after primary treatment for a cancer ends that the patient survives without any signs or symptoms of that cancer.” Therefore, we thank the reviewer for the comment but respectfully wish to keep the definition in the manuscript as is.

PLoS One. doi: 10.1371/journal.pone.0272696.r007

Decision Letter 3

Alvaro Galli

26 Jul 2022

A deep learning–based algorithm for tall cell detection in papillary thyroid carcinoma

PONE-D-21-21384R3

Dear Dr. Stenman,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Alvaro Galli

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: (No Response)

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: (No Response)

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: (No Response)

**********

6. Review Comments to the Author

Reviewer #1: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

**********

PLoS One. doi: 10.1371/journal.pone.0272696.r008

Acceptance letter

Alvaro Galli

1 Aug 2022

PONE-D-21-21384R3

A deep learning–based algorithm for tall cell detection in papillary thyroid carcinoma

Dear Dr. Stenman:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Alvaro Galli

Academic Editor

PLOS ONE

Associated Data


Supplementary Materials

S1 File

(DOCX)


Data Availability Statement

1. Mazzaferri EL. Long-term outcome of patients with differentiated thyroid carcinoma: effect of therapy. Endocr Pract 2000 Nov-Dec;6(6):469–476. doi: 10.4158/EP.6.6.469

2. Siironen P, Louhimo J, Nordling S, Ristimaki A, Maenpaa H, Haapiainen R, et al. Prognostic factors in papillary thyroid cancer: an evaluation of 601 consecutive patients. Tumour Biol 2005 Mar-Apr;26(2):57–64. doi: 10.1159/000085586

3. Michels JJ, Jacques M, Henry-Amar M, Bardet S. Prevalence and prognostic significance of tall cell variant of papillary thyroid carcinoma. Hum Pathol 2007 Feb;38(2):212–219. doi: 10.1016/j.humpath.2006.08.001

4. DeLellis RA, Lloyd RV, Heitz PU, Eng C. Tumours of the thyroid and parathyroid. World Health Organization Classification of Tumours. Pathology & Genetics Tumours of Endocrine Organs. Lyon: IARC Press; 2004. p. 49–66.

5. Dettmer MS, Schmitt A, Steinert H, Capper D, Moch H, Komminoth P, et al. Tall cell papillary thyroid carcinoma: new diagnostic criteria and mutations in BRAF and TERT. Endocr Relat Cancer 2015 Jun;22(3):419–429. doi: 10.1530/ERC-15-0057

6. Ganly I, Ibrahimpasic T, Rivera M, Nixon I, Palmer F, Patel SG, et al. Prognostic implications of papillary thyroid carcinoma with tall-cell features. Thyroid 2014 Apr;24(4):662–670. doi: 10.1089/thy.2013.0503

7. Ghossein R, Livolsi VA. Papillary thyroid carcinoma tall cell variant. Thyroid 2008 Nov;18(11):1179–1181. doi: 10.1089/thy.2008.0164

8. Okuyucu K, Alagoz E, Arslan N, Emer O, Ince S, Deveci S, et al. Clinicopathologic features and prognostic factors of tall cell variant of papillary thyroid carcinoma: comparison with classic variant of papillary thyroid carcinoma. Nucl Med Commun 2015 Oct;36(10):1021–1025. doi: 10.1097/MNM.0000000000000360

9. Lloyd RV, Osamura RY, Kloppel G. WHO Classification of Tumours: Pathology and Genetics of Tumours of Endocrine Organs. 4th Edition. Lyon, France: IARC; 2017.

10. Beninato T, Scognamiglio T, Kleiman DA, Uccelli A, Vaca D, Fahey TJ 3rd, et al. Ten percent tall cells confer the aggressive features of the tall cell variant of papillary thyroid carcinoma. Surgery 2013 Dec;154(6):1331–6; discussion 1336. doi: 10.1016/j.surg.2013.05.009

11. Stenman S, Siironen P, Mustonen H, Lundin J, Haglund C, Arola J. The prognostic significance of tall cells in papillary thyroid carcinoma: A case-control study. Tumour Biol 2018 Jul;40(7):1010428318787720. doi: 10.1177/1010428318787720

12. Baloch ZW, LiVolsi VA. Special types of thyroid carcinoma. Histopathology 2018 Jan;72(1):40–52. doi: 10.1111/his.13348

13. Hernandez-Prera JC, Machado RA, Asa SL, Baloch Z, Faquin WC, Ghossein R, et al. Pathologic Reporting of Tall-Cell Variant of Papillary Thyroid Cancer: Have We Reached a Consensus? Thyroid 2017 Dec;27(12):1498–1504. doi: 10.1089/thy.2017.0280

14. Ghossein RA, Leboeuf R, Patel KN, Rivera M, Katabi N, Carlson DL, et al. Tall cell variant of papillary thyroid carcinoma without extrathyroid extension: biologic behavior and clinical implications. Thyroid 2007 Jul;17(7):655–661. doi: 10.1089/thy.2007.0061

15. Xing F, Yang L. Robust Nucleus/Cell Detection and Segmentation in Digital Pathology and Microscopy Images: A Comprehensive Review. IEEE Rev Biomed Eng 2016;9:234–263. doi: 10.1109/RBME.2016.2515127

16. Dif N, Elberrichi Z. Deep Learning Methods for Mitosis Detection in Breast Cancer Histopathological Images: A Comprehensive Review. 2020. p. 279–306.

17. Stenman S, Bychkov D, Kucukel H, Linder N, Haglund C, Arola J, et al. Antibody Supervised Training of a Deep Learning Based Algorithm for Leukocyte Segmentation in Papillary Thyroid Carcinoma. IEEE J Biomed Health Inform 2021 Feb;25(2):422–428. doi: 10.1109/JBHI.2020.2994970

[pone.0272696.ref018] 18.Linder N, Taylor JC, Colling R, Pell R, Alveyn E, Joseph J, et al. Deep learning for detecting tumour-infiltrating lymphocytes in testicular germ cell tumours. J Clin Pathol 2019. Feb;72(2):157–164. doi: 10.1136/jclinpath-2018-205328 [DOI] [PubMed] [Google Scholar]

[pone.0272696.ref019] 19.Corredor G, Wang X, Zhou Y, Lu C, Fu P, Syrigos K, et al. Spatial Architecture and Arrangement of Tumor-Infiltrating Lymphocytes for Predicting Likelihood of Recurrence in Early-Stage Non-Small Cell Lung Cancer. Clin Cancer Res 2019. Mar 1;25(5):1526–1534. doi: 10.1158/1078-0432.CCR-18-2013 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0272696.ref020] 20.Bychkov D, Linder N, Turkki R, Nordling S, Kovanen PE, Verrill C, et al. Deep learning based tissue analysis predicts outcome in colorectal cancer. Sci Rep 2018. Feb 21;8(1):3395-018-21758-3. doi: 10.1038/s41598-018-21758-3 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0272696.ref021] 21.Couture HD, Williams LA, Geradts J, Nyante SJ, Butler EN, Marron JS, et al. Image analysis with deep learning to predict breast cancer grade, ER status, histologic subtype, and intrinsic subtype. NPJ Breast Cancer 2018. Sep 3;4:30-018-0079-1. eCollection 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0272696.ref022] 22.Cooper LA, Demicco EG, Saltz JH, Powell RT, Rao A, Lazar AJ. PanCancer insights from The Cancer Genome Atlas: the pathologist’s perspective. J Pathol 2018. Apr;244(5):512–524. doi: 10.1002/path.5028 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0272696.ref023] 23.Coca-Pelaz A, Shah JP, Hernandez-Prera JC, Ghossein RA, Rodrigo JP, Hartl DM, et al. Papillary Thyroid Cancer-Aggressive Variants and Impact on Management: A Narrative Review. Adv Ther 2020. Jul;37(7):3112–3128. doi: 10.1007/s12325-020-01391-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0272696.ref024] 24.Wang F, Casalino LP, Khullar D. Deep Learning in Medicine—Promise, Progress, and Challenges INTEMED 2019;179(3):293–294. [DOI] [PubMed] [Google Scholar]
