Scientific Reports. 2024 Jul 30;14:17591. doi: 10.1038/s41598-024-67879-w

Training high-performance deep learning classifier for diagnosis in oral cytology using diverse annotations

Shintaro Sukegawa 1,2, Futa Tanaka 3, Keisuke Nakano 2, Takeshi Hara 3,4, Takanaga Ochiai 5, Katsumitsu Shimada 6, Yuta Inoue 3, Yoshihiro Taki 3, Fumi Nakai 1, Yasuhiro Nakai 1, Takanori Ishihama 1, Ryo Miyazaki 7, Satoshi Murakami 6, Hitoshi Nagatsuka 2, Minoru Miyake 1
PMCID: PMC11289412  PMID: 39080384

Abstract

The uncertainty of true labels in medical images hinders diagnosis owing to the variability across professionals when applying deep learning models. We used deep learning to obtain an optimal convolutional neural network (CNN) by adequately annotating data for oral exfoliative cytology considering labels from multiple oral pathologists. Six whole-slide images were processed using QuPath for segmenting them into tiles. The images were labeled by three oral pathologists, resulting in 14,535 images with the corresponding pathologists’ annotations. Data from three pathologists who provided the same diagnosis were labeled as ground truth (GT) and used for testing. We investigated six models trained using the annotations of (1) pathologist A, (2) pathologist B, (3) pathologist C, (4) GT, (5) majority voting, and (6) a probabilistic model. We divided the test by cross-validation per slide dataset and examined the classification performance of the CNN with a ResNet50 baseline. Statistical evaluation was performed repeatedly and independently using every slide 10 times as test data. For the area under the curve, three cases showed the highest values (0.861, 0.955, and 0.991) for the probabilistic model. Regarding accuracy, two cases showed the highest values (0.988 and 0.967). For the models using the pathologists and GT annotations, many slides showed very low accuracy and large variations across tests. Hence, the classifier trained with probabilistic labels provided the optimal CNN for oral exfoliative cytology considering diagnoses from multiple pathologists. These results may lead to trusted medical artificial intelligence solutions that reflect diverse diagnoses of various professionals.

Keywords: Deep learning, Oral cytology, Classification, Convolutional neural network, Probabilistic labeling

Subject terms: Oral cancer detection, Oral cancer

Introduction

Early detection and treatment are important strategies for oral cancer1. Remarkably, oral exfoliative cytology is extremely easy and minimally invasive because it allows testing by swabbing. It is also useful as a primary screening tool to identify unrecognized diseases in individuals without signs or symptoms as well as for simple differential diagnosis of oral dysplasia and malignant tumors before histological examination2. Cytology is mainly performed by direct smearing on a glass slide. In recent years, liquid-based cytology has become popular. In this method, the collection device is placed into a vial containing a fixative preservation solution immediately after cell collection to prevent cell drying. The cell specimen is prepared using a specialized device to perform a standardized procedure. This improves the diagnostic accuracy of oral cytology by revealing complex cellular patterns3. In recent years, the introduction of whole slide imaging (WSI) has revolutionized pathological diagnosis. WSI captures glass slides as high-quality digital images for easy access and sharing. Additionally, the ability to process images has enabled learning and development using artificial intelligence4.

In clinical and research environments, we previously studied a cytology convolutional neural network (CNN) for diagnosis based on deep learning5. The diagnostic accuracy for the oral cytology patch images using the CNN reached 0.9016, indicating reliable diagnosis using the CNN. However, the annotations were merged into a single result after consultation with multiple pathologists. Rather than making judgments based on the entire histopathological image, a pathologist's diagnosis often differs depending on segmented patch images. This can be observed during oral cytology, which involves a complex mixture of cells. To implement deep learning for diagnosis trusted by pathologists, measures for annotating data considering diverse diagnoses by multiple pathologists should be developed. Annotation diversity is a problem common to image-based medical diagnosis and has the following characteristics that differ from classification for other types of images. Uncertainty about the true label is a problem because different medical professionals often assign different labels to the same image6. A classification model for unevenly labeled data remains largely unexplored. Using probabilistic labels may be a viable alternative to train deep learning classification models for integrating and processing diversely labeled data7.

The method of generating accurate answers presented in this study has been explored under the concept of collective knowledge in the field of cognitive science. Collective knowledge stems from gathering many opinions to find a solution to a problem that cannot be solved by one person. While some of these methods are beginning to be utilized in the medical field, applying them to fields such as image diagnosis presents challenges due to varying opinions among specialists. For example, certain specialists may specifically identify imaging findings that elude general physicians. Therefore, AI trained to generate accurate diagnoses reflecting specialists' unique abilities, based on methods selected by their respective specialties, could foster trust among specialists. While collective knowledge has been reported as a method to reflect such expert opinions, the reports are limited to the detection of lesions that are perhaps easily detectable8. Moreover, discussions on this method and its integration with neural network results are scarce9. Therefore, our study proposes a method centered on comparing diagnostic opinions among pathologists.

Materials and methods

Study design

This study aimed to use a deep learning model based on a CNN to analyze the classification performance of diagnosis using oral exfoliative cytology when labels from multiple pathologists are available and to obtain an optimized model. This study was based on fully supervised CNN learning.

Ethics statement

The WSI data for oral exfoliative cytology were acquired at Kagawa University Hospital from April 2015 to October 2022 and used retrospectively. This study was approved by the Ethics Committee of Kagawa University Faculty of Medicine (registration no. 2022-155). The study had a non-interventional retrospective research design and was analytical with anonymized data. Thus, the need for informed consent was waived, and no written or verbal informed consent was required from each patient individually. This study was conducted in accordance with both the ethical guidelines for life science and medical research involving human subjects and the rules established by the ethics committee.

Image data preparation

In this study, six WSI oral scraping cytology specimens were prepared using the liquid-based method. The breakdown of the six slides comprised two slides for tongue cancer, two slides for buccal mucosa cancer, one slide for tongue leukoplakia, and one slide for glossitis. The glass slides were scanned using a Ventana DP 200 slide scanner (Roche, Basel, Switzerland), and WSI scans were created. Each WSI scan was tiled using the Quantitative Pathology (QuPath) software (version 0.4.1).

Considering that WSI is compatible with various magnifications, cell pathologies should be evaluated at the optimal magnification. Therefore, we extracted sections at various magnifications and checked with a pathologist. In QuPath, the magnification can be set using a specific function. We determined a magnification for the cropped image of 1.0 as the optimal setting for diagnosis by an oral pathologist. The image was a square of 768 pixels per side and was cut into segments overlapping by 192 pixels, obtaining images of 256 × 256 pixels in the PNG (Portable Network Graphics) format.
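The tiling geometry can be illustrated with a short sketch. It assumes one reading of the description above: 768-pixel square tiles whose neighbors overlap by 192 pixels (a stride of 576 pixels), each later downsampled to 256 × 256 pixels. The function name `tile_origins` and the region size are hypothetical; the actual export was performed in QuPath, and its settings are not reproduced here.

```python
def tile_origins(width, height, tile=768, overlap=192):
    """Return top-left (x, y) corners of square tiles that fit fully inside the region.

    Assumption: adjacent tiles share `overlap` pixels, so the stride is tile - overlap.
    """
    stride = tile - overlap
    xs = range(0, width - tile + 1, stride)
    ys = range(0, height - tile + 1, stride)
    return [(x, y) for y in ys for x in xs]


# Hypothetical 2,000 x 1,400 pixel region, not a real slide size.
origins = tile_origins(2000, 1400)
# Factor needed to shrink a 768-pixel tile to 256 pixels.
downsample = 768 // 256
```

Under these assumptions, each tile shares a quarter of its width with its horizontal neighbor, so a cell cut by one tile border is likely to appear whole in the adjacent tile.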

Image annotation and selection

Pathologists

The three oral pathologists who performed the annotations were certified by the Japanese Society of Pathology and had over 15 years of experience.

Image diagnosis

The oral cytology diagnoses from each segmented image were independently labeled. The oral cytology images were classified into five categories based on the Papanicolaou classification. Classes 1 and 2 were assigned as negative labels, while classes 3–5 were assigned as positive labels. Additionally, each pathologist labeled inconclusive images that could not be diagnosed owing to factors such as no cells present, out-of-focus imaging, and strongly overlapping cells. The pathologists were not allowed to consult with each other during annotation.
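The binarization rule above can be written down directly. This is a minimal sketch; the function name `binarize_papanicolaou` and the use of `None` for inconclusive images are illustrative assumptions rather than the authors' code.

```python
def binarize_papanicolaou(pap_class):
    """Map a Papanicolaou class (1-5) to a binary label.

    Classes 1-2 -> 0 (negative), classes 3-5 -> 1 (positive).
    None stands for an image the pathologist marked as inconclusive.
    """
    if pap_class is None:
        return None
    if pap_class not in (1, 2, 3, 4, 5):
        raise ValueError(f"unexpected Papanicolaou class: {pap_class!r}")
    return 0 if pap_class <= 2 else 1


labels = [binarize_papanicolaou(c) for c in (1, 2, 3, 5, None)]
```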

Classification of annotations

The labels of each patch image diagnosed by a pathologist were classified. The diagnostic results from the three pathologists are shown in Fig. 1. Based on these results, we defined the three types of labels (Fig. 2):

  1. Ground truth (GT) label: Image label assigned when the same diagnosis is provided by the three pathologists.

  2. Majority label: Image label assigned when two pathologists provide the same diagnosis, that is, the label with the highest number of diagnoses is considered correct.

  3. Probabilistic label: Image label that encodes the diagnoses of the three pathologists into a single prediction.
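The three label types can be sketched as follows. The exact encoding of the probabilistic label is not spelled out in the text, so this example assumes it is the share of positive votes among the three pathologists; it also assumes three conclusive binary votes per image, because the handling of inconclusive annotations is not described. The function name `combine_labels` is hypothetical.

```python
from collections import Counter


def combine_labels(votes):
    """Derive the three label types from binary votes (0 = negative, 1 = positive).

    Assumes exactly three conclusive votes per image.
    """
    if len(votes) != 3 or any(v not in (0, 1) for v in votes):
        raise ValueError("expected three binary votes")
    counts = Counter(votes)
    # With three binary votes a tie is impossible.
    majority = counts.most_common(1)[0][0]
    # A GT label exists only when all three pathologists agree.
    gt = majority if counts[majority] == 3 else None
    # Assumed soft label: share of positive votes.
    p_positive = sum(votes) / 3
    return {
        "gt": gt,
        "majority": majority,
        "probabilistic": (1 - p_positive, p_positive),
    }


unanimous = combine_labels([1, 1, 1])
split = combine_labels([1, 0, 1])
```

In this sketch, a 2-to-1 split yields no GT label, a majority label of 1, and a probabilistic label of roughly (0.33, 0.67), which keeps the dissenting opinion visible to the classifier.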

Figure 1. Distribution of data labels.

Figure 2. Labeling that reflects annotations from multiple pathologists.

Deep learning model

We selected ResNet50, a representative CNN model, for diagnosis. ResNet is an artificial neural network with high accuracy owing to a residual block that adds an input value to the output value of the convolutional layer to prevent the vanishing gradient problem even when the network is deep10.
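In the notation of He et al.10, a residual block computes the following, where $\mathbf{x}$ is the block input, $\mathcal{F}$ is the stack of convolutional layers with weights $\{W_i\}$, and $\mathbf{y}$ is the block output:

```latex
\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}
```

Because the identity term contributes $I$ to $\partial \mathbf{y} / \partial \mathbf{x}$, the gradient reaching earlier layers does not depend solely on the convolutional path, which is why deep stacks of such blocks remain trainable.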

The ResNet50 model was initialized with weights pretrained on the ImageNet dataset and then fine-tuned. Deep learning classification was implemented using PyTorch (version 1.13.1 + cu116) and Python (version 3.8.10) (Fig. 3).

Figure 3. A diagram of the deep learning model for each learning label.

The following six deep-learning models were established:

  A. Pathologist A model: ResNet50 trained with labels from Pathologist A

  B. Pathologist B model: ResNet50 trained with labels from Pathologist B

  C. Pathologist C model: ResNet50 trained with labels from Pathologist C

  D. GT model: ResNet50 trained with GT labels

  E. Majority model: ResNet50 trained with majority labels

  F. Probabilistic model: ResNet50 trained with probabilistic labels

Dataset and model training

To assess generalization, the CNNs were evaluated using cross-validation. Validation was performed for all cases in a leave-one-out manner, with each slide serving once as the test data, to avoid case-by-case overfitting and bias (Fig. 4).

Figure 4. Diagram of leave-one-out cross-validation adopted in this study.
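The slide-wise split can be sketched in a few lines. The slide identifiers and the function name `leave_one_slide_out` are hypothetical; the point is only that all tiles from one slide are held out together, so that tiles from the same specimen never appear in both training and test data.

```python
def leave_one_slide_out(slide_ids):
    """Yield (train_slides, test_slide) pairs, holding out each slide once."""
    unique = sorted(set(slide_ids))
    for held_out in unique:
        train = [s for s in unique if s != held_out]
        yield train, held_out


# Hypothetical identifiers for the six slides.
slides = ["case1", "case2", "case3", "case4", "case5", "case6"]
folds = list(leave_one_slide_out(slides))
```

With six slides this produces six folds, each training on five slides and testing on the remaining one.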

Cross-entropy was used as the loss function, while stochastic gradient descent with momentum was employed for optimization. This method is an evolved variant of traditional stochastic gradient descent that mitigates variability by incorporating moving averages11. Additionally, we utilized the cosine annealing learning rate scheduler to ensure stable deep learning. Further details regarding the deep learning parameters are provided in Supplementary Material S1.
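To make the loss and the scheduler concrete, the sketch below implements cross-entropy against a probability vector (which covers both hard and probabilistic labels) and a single-cycle cosine annealing schedule in plain Python. It is not the authors' PyTorch code; the function names and the learning rate of 0.01 are placeholders, and the actual hyperparameters are those in Supplementary Material S1.

```python
import math


def soft_cross_entropy(logits, target_probs):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_sum for z in logits]
    return -sum(q * lp for q, lp in zip(target_probs, log_probs))


def cosine_annealing_lr(epoch, total_epochs, lr_max, lr_min=0.0):
    """Learning rate at `epoch` under one cosine decay from lr_max to lr_min."""
    cos_term = 1 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


# Hard positive label versus an assumed soft label from a 2-of-3 positive vote.
hard_loss = soft_cross_entropy([2.0, 0.0], [0.0, 1.0])
soft_loss = soft_cross_entropy([2.0, 0.0], [1 / 3, 2 / 3])

# lr_max = 0.01 is a placeholder value, not taken from the paper.
lr_start = cosine_annealing_lr(0, 300, lr_max=0.01)
lr_mid = cosine_annealing_lr(150, 300, lr_max=0.01)
lr_end = cosine_annealing_lr(300, 300, lr_max=0.01)
```

In this example the prediction leans toward the negative class, and the soft target penalizes it less than the hard positive target does, which illustrates how disagreement among annotators tempers the training signal.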

All the models were trained over 300 epochs and had a minibatch size of 32. Training was repeated 10 times for the deep models using different random seeds.

Performance indicators and statistical analysis

We used the GT labels to construct the test set for all the deep learning models and calculated the accuracy, precision, recall, specificity, F1-score, and area under the receiver operating characteristic curve (AUC) as performance indicators. Further details on the calculation of these indicators are available in Supplementary Material S2. The statistical evaluation of the classification performance of each deep learning model was repeated 10 times on the GT test data. The performance indicators were analyzed using JMP Pro Statistical Software Package Version 16.1.0 (SAS Institute Inc., Cary, NC, USA). We set p < 0.05 to indicate statistical significance. The normality of continuous variables was assessed using the Shapiro–Wilk test. The differences in classification performance between the deep learning models were evaluated for all performance indicators using the Wilcoxon test.
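For reference, the listed indicators can be computed as in the sketch below, which uses the standard confusion-matrix definitions and the rank-based form of the AUC. The definitions actually used are those in Supplementary Material S2; the function names, the 0.5 decision threshold, and the toy data are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, specificity, and F1 from hard 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, not taken from the study.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```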

Visualization of model diagnosis using activation maps

Visualizing the basis of predictions made by deep learning models is crucial for analysis. Gradient-weighted class activation mapping (Grad-CAM) enables visualization of relevant image areas per class based on the feature output from the last convolutional layer of a CNN12. We used a heat map and the gradient of the classification score to identify image areas that contributed to classification.
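Following Selvaraju et al.12, the Grad-CAM map for class $c$ is obtained from the feature maps $A^k$ of the last convolutional layer and the class score $y^c$ (taken before the softmax), where $Z$ is the number of spatial positions in each feature map:

```latex
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}},
\qquad
L^c_{\text{Grad-CAM}} = \operatorname{ReLU}\!\left( \sum_k \alpha_k^c A^k \right)
```

The weights $\alpha_k^c$ average the gradient over each feature map, and the ReLU keeps only regions that raise the score of class $c$; the resulting coarse map is upsampled to the input size and displayed as a heat map.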

Results

Comparison of model performance

Supplementary Material S3 shows the performance indicators of AUC and accuracy for each model. Table 1 lists the statistical test results. In terms of AUC, the probabilistic model showed the highest value in cases 3, 5, and 6. In terms of accuracy, the probabilistic model showed the highest values in cases 5 and 6. In contrast, the pathologist A–C and GT models yielded extremely low accuracy on some slides and large variations across trials on many slides. The values of all the performance indicators are listed in Supplementary Material S3. The results for the other performance indicators were similar to those of the accuracy and AUC, and the probabilistic model consistently showed high performance. For cases 5 and 6, the probabilistic model showed statistically significantly higher performance for all indicators.

Table 1. Statistical evaluation results (p-values) comparing probabilistic, majority, and GT models. Each column compares the probabilistic model with the named model.

        AUC                   Accuracy              Precision             Recall                F1-score
Case    vs majority  vs GT    vs majority  vs GT    vs majority  vs GT    vs majority  vs GT    vs majority  vs GT
Case 2  0.007        0.002    0.082        0.039    0.332        0.131    0.056        0.407    0.049        0.157
Case 3  < 0.001      0.240    0.079        0.397    0.316        0.468    0.009        0.500    0.311        0.452
Case 4  < 0.001      < 0.001  0.039        0.359    0.044        0.404    0.030        0.001    0.001        0.082
Case 5  < 0.001      < 0.001  < 0.001      < 0.001  0.015        0.027    < 0.001      < 0.001  < 0.001      < 0.001
Case 6  < 0.001      < 0.001  0.001        < 0.001  < 0.001      < 0.001  0.031        0.006    < 0.001      < 0.001

p < 0.05 indicates a statistically significant difference.

Visualization of classification using Grad-CAM

Figure 5 shows a heatmap of the regions of interest for classification using each evaluated deep-learning model. For ResNet50, Grad-CAM highlighted the feature regions from the final convolutional layer as a heat map.

Positive labeling focused on atypical cells with a high nuclear-to-cytoplasmic ratio and an increased amount of chromatin in the cell nucleus as a characteristic region. In the top row of Fig. 5, the models of pathologists A and B focused on the area at the bottom of the image, whereas the model of pathologist C also focused on the area at the top of the image. Therefore, the majority and probabilistic models focused on both the upper and lower regions of the image for classification. In the bottom row of Fig. 5, each model yielded different predictions. The models of pathologists A–C identified different areas as characteristic regions. Therefore, the prediction and GT labels differed in most models. In contrast, the probabilistic model showed classification predictions that matched the GT labels by incorporating each feature region. It is believed that the characteristics of each model caused the differences in the classification results.

Figure 5. Visualization of regions of interest for classification in each evaluated model.

Discussion

In this study, we evaluated CNNs on datasets with diverse annotations. A deep learning model using probabilistic labels was effective for classification using oral exfoliative cytology annotations provided by multiple oral pathologists. We then investigated the classification performance for diagnosis using each patch image annotated by the pathologists. There was a large variation in AUC from 0.799 to 0.692 for pathologist A, from 0.735 to 0.979 for pathologist B, and from 0.803 to 0.972 for pathologist C. For slides that performed well, the performance was close to that of our previous deep learning classifier using patch images (AUC of 0.93)5. However, some slides provided an extremely low accuracy, and large fluctuations occurred across trials. This was because pathologists generally adhere to a diagnosis (label) and may fail to reach a consensus. Therefore, constructing a highly generalizable model using a deep learning classification system derived from labels provided by a single professional posed challenges. Such labeling may be considered inadequate for clinical applications. Additionally, the significant variation in accuracy among pathologists A–C and GT models suggested that learning based on labeling from a single person is insufficient. This finding should be considered in pathological diagnosis based on deep learning.

The simplest method for learning different annotations is majority voting. We applied this method to images labeled differently by only one of the three pathologists. Most consistent labels showed a relatively stable performance in terms of the AUC, and one case showed a high accuracy. However, the recall was often poor, possibly because of annotations for images that were difficult to diagnose. The majority model may contribute to improving performance on data that show similar feature regions13. However, this model has various disadvantages, especially if the task requires highly specialized knowledge and the majority of labelers lack sufficient competence14. Therefore, care must be taken when selecting the majority model.

The probabilistic model had no slides with extremely low performance, and three cases showed the highest AUC among all the models. This model exhibited high robustness without being affected by the test cases. The probabilistic label considered the diagnoses of all pathologists, thus including all the available data. It has been suggested that, unlike hard labels, ambiguity may be advantageous for classifying pathological images. However, a liver echo dataset presented an interesting case in the study by Vega et al.6. A model trained using hard labels achieved higher performance than one based only on probabilistic labels. Compared with the highly reliable case of hard labels, probabilistic labels may converge to incorrect probabilities and increase prediction errors.

In this study, we investigated learning for pathological image classification using deep learning with labels that differed according to the diagnoses of various pathologists. To our knowledge, such an attempt has not been reported previously, and it holds substantial academic significance. Probabilistic annotations do not rely on the labels of a single pathologist but rather combine the labels of multiple pathologists. Learning involving the differing opinions of multiple pathologists can establish a reliable method for pathological image classification. In addition, our approach can be applied to different annotations in other specialized medical fields, such as radiology and echo imaging diagnosis. In medical imaging, analysis is primarily influenced by the expertise of specialists, leading to different annotation results. In oral cytology and other imaging diagnoses such as CT and MR, diagnosis is often difficult even for specialists, and discrepancies in diagnostic test results among specialists are common in clinical practice. Therefore, we anticipate that our approach to generating accurate answers can extend to AI systems for general imaging diagnosis. Furthermore, we posit that our system can be applied across various medical situations, including psychiatric diagnosis and medical interviews. Even in verbal examinations, different physicians may have different diagnostic results. Additionally, the utilization of multiple pathology diagnoses during training can affect the acceptance of deep learning results from pathologists with different opinions or dependency adjustment cues for the inferred results13. Therefore, further research on these issues is required. In the future, as an additional study on CNNs, we will consider implementing multiple models, such as in ensemble learning16 and model fusion17, for which the final prediction combines the outcomes of multiple models to increase the confidence level.

Our study had various limitations. First, it was impossible to statistically evaluate the performance of each model by combining all data owing to slide-by-slide cross-validation. Obtaining test data from independent slides is important for evaluating generalizability. Accordingly, we prioritized evaluating the model generalizability in this study. Second, predictions from three pathologists were considered, but data should be gathered from more pathologists. However, annotating data requires considerable effort. In future work, more pathologists should be enrolled to examine and annotate data. Despite the limitations, we believe that our findings will serve as a foundation for future research.

Conclusion

We found that a classifier using probabilistic labels is effective for training an optimal CNN classifier of oral exfoliative cytology when using annotations provided by multiple pathologists. The resulting model may serve as a trusted medical artificial intelligence solution that reflects diverse diagnoses provided by various pathologists.

Acknowledgements

This work was indirectly supported by JST CREST (JPMJCR21D4), Japan.

Author contributions

The study was conceived by S.S. and T.H., who also conducted the experiments. F.T., Y.I., and Y.T. conducted the experiments. K.N., T.O., and S.K. generated the data. All authors analyzed and interpreted the data. S.S. and T.H. wrote the manuscript. F.M., Y.N., T.I., R.M., S.M., H.N. and M.M. revised the manuscript. All authors read and approved the final version of the manuscript.

Data availability

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-024-67879-w.

References

  1. Warnakulasuriya, S. et al. An appraisal of oral cancer and pre-cancer screening programmes in Europe: A systematic review. J. Oral Pathol. Med. 44, 559–570 (2015). doi: 10.1111/jop.12267
  2. Sekine, J., Nakatani, E., Hideshima, K., Iwahashi, T. & Sasaki, H. Diagnostic accuracy of oral cancer cytology in a pilot study. Diagn. Pathol. 12 (2017).
  3. Sukegawa, S. et al. Clinical study on primary screening of oral cancer and precancerous lesions by oral cytology. Diagn. Pathol. 15 (2020).
  4. Sukegawa, S. et al. Effectiveness of deep learning classifiers in histopathological diagnosis of oral squamous cell carcinoma by pathologists. Sci. Rep. 13 (2023).
  5. Sukegawa, S. et al. Effective deep learning for oral exfoliative cytology classification. Sci. Rep. 12 (2022).
  6. Vega, R. et al. Sample efficient learning of image-based diagnostic classifiers using probabilistic labels. Proc. Mach. Learn. Res. 130, 739–747 (2021).
  7. Iglesias, J. E. et al. An algorithm for optimal fusion of atlases with different labeling protocols. Neuroimage 106, 451–463 (2015). doi: 10.1016/j.neuroimage.2014.11.031
  8. Bhardwaj, P., Gupta, P., Guhan, T. & Srinivasan, K. Early diagnosis of retinal blood vessel damage via deep learning-powered collective intelligence models. Comput. Math. Methods Med. 2022, 3571364 (2022). doi: 10.1155/2022/3571364
  9. Popescu, D., El-Khatib, M. & Ichim, L. Skin lesion classification using collective intelligence of multiple neural networks. Sensors 22, 4399 (2022). doi: 10.3390/s22124399
  10. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 2016-December, 770–778 (2015).
  11. Gitman, I., Lang, H., Zhang, P. & Xiao, L. Understanding the role of momentum in stochastic gradient methods. Adv. Neural Inf. Process. Syst. 32 (2019).
  12. Selvaraju, R. R. et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis. 128, 336–359 (2016). doi: 10.1007/s11263-019-01228-7
  13. Hoang Duc, A. K. et al. Using manifold learning for atlas selection in multi-atlas segmentation. PLoS One 8 (2013).
  14. Li, J., Baba, Y. & Kashima, H. Hyper questions: Unsupervised targeting of a few experts in crowdsourcing. Int. Conf. Inf. Knowl. Manag. Proc. Part F131841, 1069–1078 (2017).
  15. Zhang, Y., Liao, Q. V. & Bellamy, R. K. E. Effect of Confidence and Explanation on Accuracy and Trust Calibration in AI-Assisted Decision Making. In FAT* 2020 - Proc. 2020 Conf. Fairness, Accountability, Transpar. 295–305 (2020). doi: 10.1145/3351095.3372852
  16. Sukegawa, S. et al. Identification of osteoporosis using ensemble deep learning model with panoramic radiographs and clinical covariates. Sci. Rep. 12 (2022).
  17. Huang, S. C., Pareek, A., Seyyedi, S., Banerjee, I. & Lungren, M. P. Fusion of medical imaging and electronic health records using deep learning: A systematic review and implementation guidelines. NPJ Digit. Med. 3, 1–9 (2020). doi: 10.1038/s41746-020-00341-z
