Abstract
We examine the performance of deep learning models on classification of thyroid fine needle aspiration biopsies (FNABs) using microscope images captured in two ways: with a high-resolution scanner, and with a mobile phone camera.
Our training set consists of image regions extracted from 964 whole slide images (WSIs) captured with a high-resolution scanner. Our test set consists of 100 slides; 20 manually selected regions of interest (ROIs) from each slide were captured in two ways: with a smartphone camera, and with a high-resolution scanner.
Applying a baseline machine learning algorithm trained on scanner ROIs resulted in performance deterioration when applied to the smartphone ROIs (97.8% Area Under the Receiver Operating Characteristic [AUC], Confidence Interval [CI]=[95.4%, 100.0%] for scanner images vs. 89.5% AUC, CI=[82.3%, 96.6%] for mobile images, p=0.019). Preliminary analysis via histogram matching shows that the baseline model is overly sensitive to slight color variations in the images (specifically, to color differences between mobile and scanner images). Adding color augmentation during training reduces this sensitivity and narrows the performance gap between mobile and scanner images (97.6% AUC, CI=[95.0%, 100.0%] for scanner images vs. 96.0% AUC, CI=[91.8%, 100.0%] for mobile images, p=0.309), with both modalities on par with human pathologist performance (95.6% AUC, CI=[91.6%, 99.5%]) for malignancy prediction (p=0.398 for pathologist vs. scanner and p=0.875 for pathologist vs. mobile). For indeterminate cases (pathologist-assigned Bethesda category of 3, 4, or 5), color augmentations confer some improvement (88.3% AUC, CI=[73.7%, 100.0%] for the baseline model vs. 96.2% AUC, CI=[90.9%, 100.0%] with color augmentations, p=0.158). In addition, we found that our model’s performance levels off at 15 ROIs, a promising indication that ROI data collection would not be time-consuming for our diagnostic system. Finally, we show that the model makes sensible Bethesda System (TBS) category predictions: the observed risk of malignancy increases with the predicted TBS category, from 0% for predicted TBS 2 to 100% for predicted TBS 6.
Introduction
There is a severe pathologist shortage in low- and middle-income countries (LMICs): the per-capita number of anatomic pathologists in LMICs is roughly 50 times lower than in high-income countries.1 Some authors have proposed using machine learning algorithms in lieu of human pathologists to diagnose diseases from digital whole slide images (WSIs).2–4 However, digitizing slides requires expensive scanners costing up to $250 000,5 a great financial burden for clinical centers in LMICs. Mobile phones have the potential to democratize slide digitization because they are ubiquitous and up to 1 000 times less expensive than state-of-the-art slide scanners. Together with machine learning algorithms for cancer diagnosis, WSI capture with a mobile phone could be a vital innovation for LMICs with neither a high-resolution scanner nor an expert pathologist.
In this paper, we propose a semi-automatic system to predict thyroid cancer from fine needle aspiration biopsy (FNAB) slides using mobile phone images. In the workflow we propose, regions of interest (ROIs) on the slide are manually selected and photographed (e.g., by a cytotechnologist) using a smartphone attached to a microscope with an adapter. A classification neural network then predicts the final surgical pathology of each slide (benign or malignant) and the cytopathology diagnosis based on the Bethesda System for the Reporting of Thyroid Cytopathology (TBS).6 The Bethesda System has 6 diagnostic categories (1=non-diagnostic, 2=benign, 3=atypical, 4=neoplastic, 5=suspicious for malignancy, and 6=malignant), each with an associated risk of malignancy (5–10%, 0–3%, 6–18%, 10–40%, 45–60%, and 94–96%, respectively).
Establishing such a workflow raises several questions. First, is the quality of mobile images sufficient for accurate thyroid cancer classification? Second, what is the impact of training our neural network classifier using WSIs, but deploying it on mobile phone images? Finally, since the smartphone camera field-of-view (FOV) is small, is it practical and feasible to capture enough diagnostic material from each slide? How many ROIs are required to make reliable diagnoses?
We address the above challenges to demonstrate the successful application of machine learning for diagnostic cytopathology using smartphone images.
Materials and Methods
Dataset
The training set consists of 964 WSIs of alcohol-fixed Papanicolaou-stained thyroid fine-needle aspiration biopsy (FNAB) slides, each with 1 000 ROIs (each ROI is a 128×128 pixel RGB image). The WSIs consist of all FNABs (with final pathology from surgical follow-up) from our institution’s medical center from 2008 to 2016. The slides were scanned using a Leica AT-2 scanner (Leica Biosystems, Deer Park, IL) at 40x magnification, and the resolution was then down-sampled by a factor of 4. As in our previous work,3 the training-set ROIs were selected using an ROI detection network (based on the VGG-11 neural network architecture7) trained to detect follicular groups.
The test set consists of 100 FNAB slides, each with 20 ROIs captured in two modalities: high-resolution scanner and mobile phone. Example test set ROIs are shown in Figure 1 (right). Every slide is assigned two labels: the (binary) post-surgery malignancy diagnosis (“final pathology”), and the pathologist-assigned ordinal risk assessment (between 2 and 6) according to The Bethesda System (TBS),3 extracted from the Electronic Medical Record (EMR). We limit the number of ROIs to 20 to expedite data collection, and we later show that our model performance saturates at 15 ROIs (regardless of the ROI capture order).
Figure 1:
Proposed procedures for training (left) and evaluation (right). During training, we apply color augmentations to the scanner images, make malignancy and Bethesda System (TBS) category predictions using the neural network classifier, compare the predictions with the malignancy and TBS labels, and finally update the model.
The ROI pairs per test slide were selected as follows: for each slide, we used the ROI detection network to select 1 000 ROIs containing diagnostic regions of follicular cells from the WSI (the detailed procedure is described in our previous work3). The ROI bounding boxes were then overlaid on the WSI using the Aperio ImageScope software (Leica Biosystems, Deer Park, IL). A pathology resident (R.D.) reviewed the ROIs using the software and selected 20 ROIs. Specifically, R.D. was instructed to select regions that are in focus and contain groups of follicular cells. R.D. selected the ROIs only based on the WSI, without access to any other patient information. Moreover, R.D. selected ROIs only based on image quality and clarity of follicular groups, with no regard to the cancer characteristics of the follicular groups. R.D. then located the 20 ROIs in the tissue sample using a conventional microscope and captured them with a Redmi Note 10S mobile phone camera (Xiaomi Inc., Beijing, China) attached to an Olympus BX46 microscope (Olympus Corporation, Tokyo, Japan) via a GoSky microscope lens adapter (GoSky Optics Inc., Atlanta, GA). A picture of the mobile ROI capture setup is shown in Figure 2. This process created a paired dataset of scanner and mobile images to enable a one-to-one comparison. A limitation of this study is that, in favor of comparability, the process to capture the dataset of ROI pairs somewhat differs from the way our system would be used in practice, namely, a cytotechnologist would select 20 ROIs directly using the microscope, without having access to the WSI. Using the detection network to first find 1 000 ROIs and then displaying their bounding boxes on the ImageScope software was done purely to facilitate collection of the paired ROI dataset.
Figure 2:
Mobile regions of interest (ROI) capture setup. For each slide, we captured 20 ROIs using a Redmi Note 10S camera attached to an Olympus BX46 microscope via a GoSky microscope lens adapter.
Classification model architecture
To classify malignancy, we use a MobileNetV2 model,8 followed by global average pooling9 and a linear layer. We base our classification model on MobileNetV2 (instead of VGG-11 as in our previous work3) since it is fast, lightweight, and explicitly designed to operate on mobile phones.
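For illustration, a classifier with this shape could be assembled in PyTorch roughly as sketched below. This is a minimal sketch rather than the authors' exact implementation; in particular, the number of ordinal TBS outputs (four threshold logits covering categories 2–6) is our assumption.

```python
# Minimal sketch (not the authors' exact code) of a MobileNetV2-based ROI classifier.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class ROIClassifier(nn.Module):
    def __init__(self, num_tbs_thresholds: int = 4):
        super().__init__()
        # Feature extractor: MobileNetV2 backbone with its classification head removed.
        self.backbone = mobilenet_v2().features
        self.pool = nn.AdaptiveAvgPool2d(1)           # global average pooling
        # Single linear layer producing one malignancy logit plus ordinal TBS logits.
        self.head = nn.Linear(1280, 1 + num_tbs_thresholds)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 3, 128, 128) RGB ROIs
        feats = self.pool(self.backbone(x)).flatten(1)   # (batch, 1280)
        return self.head(feats)                          # (batch, 1 + num_tbs_thresholds)
```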
We train the model to predict both the binary final pathology (using a cross-entropy loss) and the pathologist-assigned TBS (using an ordinal regression loss). In our previous work,3 we found that this “two-endpoint” approach performs better than training the network using only the final pathology. We note that the slide-level malignancy and TBS labels are assigned to each individual ROI, an assumption that works well in practice if the ROIs contain follicular groups.3
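A possible form of this two-endpoint objective is sketched below: a binary cross-entropy term for the final-pathology label and a simple cumulative-link ordinal term for the TBS label. The exact ordinal formulation used in the paper may differ; this version assumes the five-output head sketched above.

```python
# Illustrative two-endpoint loss (assumed formulation, not necessarily the authors' exact one).
import torch
import torch.nn.functional as F

def two_endpoint_loss(logits, malignant, tbs):
    # logits: (batch, 5); malignant: (batch,) in {0, 1}; tbs: (batch,) in {2, ..., 6}
    malignancy_logit, tbs_logits = logits[:, 0], logits[:, 1:]
    loss_malignancy = F.binary_cross_entropy_with_logits(malignancy_logit, malignant.float())
    # Ordinal targets: threshold k (k = 0..3) is "on" if the TBS category is at least k + 3.
    thresholds = torch.arange(3, 7, device=tbs.device)          # 3, 4, 5, 6
    ordinal_targets = (tbs.unsqueeze(1) >= thresholds).float()  # (batch, 4)
    loss_tbs = F.binary_cross_entropy_with_logits(tbs_logits, ordinal_targets)
    return loss_malignancy + loss_tbs
```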
Model training
We train the model for 1 000 epochs (i.e., 1 000 passes over the training dataset) using the AdamW optimizer10 with a learning rate of 0.001 and weight decay parameter of 0.01. We use a batch size of 24 slides, with 12 ROIs per slide (randomly sampled at each epoch from the 1 000 ROIs available per slide).
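Continuing the sketches above, one optimization step with the stated hyperparameters (AdamW, learning rate 0.001, weight decay 0.01, 24 slides × 12 ROIs per batch) could look as follows; the random tensors stand in for a real data loader and are purely for illustration.

```python
# One illustrative training step; dummy tensors replace the actual ROI loader.
import torch

model = ROIClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# 24 slides per batch, 12 randomly sampled ROIs each (288 ROIs total);
# slide-level labels are copied to every ROI from that slide.
rois = torch.randn(24 * 12, 3, 128, 128)
malignant = torch.randint(0, 2, (24,)).repeat_interleave(12)
tbs = torch.randint(2, 7, (24,)).repeat_interleave(12)

loss = two_endpoint_loss(model(rois), malignant, tbs)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```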
Our neural network classifier is trained entirely on high-resolution scanner images, since collecting a large enough dataset and performing joint training or transfer learning on mobile images would be prohibitively slow.
Model evaluation
During evaluation, we first make ROI-level malignancy predictions using the classification model (for each of the 20 ROIs available per test slide). Then, we average the ROI-level predictions to get slide-level predictions. Finally, the slide-level predictions are compared with the binary final pathology to compute the Area Under the Receiver Operating Characteristic (AUC).
To improve performance, we obtain 5 different models using 5-fold cross-validation, and we use early stopping on the validation set to select the best model for each fold. During evaluation, we then average the 5 models’ predictions to get an “ensemble” prediction.11
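The evaluation procedure described above can be summarized in a short sketch: ROI-level malignancy probabilities are averaged into a slide-level score, scores from the five cross-validation models are ensembled, and the AUC is computed against final pathology. Variable names below are illustrative.

```python
# Sketch of slide-level scoring and AUC computation (illustrative names).
import torch
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def slide_score(models, rois):
    """rois: (20, 3, 128, 128) tensor of ROIs from one test slide."""
    per_model = []
    for m in models:                              # 5 models from 5-fold cross-validation
        m.eval()
        probs = torch.sigmoid(m(rois)[:, 0])      # ROI-level malignancy probabilities
        per_model.append(probs.mean())            # average over ROIs -> slide-level score
    return torch.stack(per_model).mean().item()   # ensemble: average over the 5 models

# scores = [slide_score(models, slide_rois) for slide_rois in test_slides]
# auc = roc_auc_score(final_pathology_labels, scores)   # 0 = benign, 1 = malignant
```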
As additional evaluation steps, we examine our model’s performance on “indeterminate cases” (slides with an EMR diagnosis of TBS 3, 4, or 5), as well as observed risk of malignancy within the Bethesda categories predicted by the model.
For all reported AUC values, we also estimate 95% confidence intervals (CIs) based on DeLong’s method.12 Further, for all AUC comparisons, we report two-sided p-values using DeLong’s test and use a significance level of α=0.05.
We examine our model’s performance vs. the number of ROIs used. We average the model’s performance across 1 000 random permutations of the 20 ROIs to account for randomness in the ROI ordering. To assess convergence, we use a two-sided Wilcoxon signed rank test comparing the model’s performance using 1 through 19 ROIs with the model’s performance using 20 ROIs. We determine convergence based on p<0.05/19 (i.e., a significance level of α=0.05 with a Bonferroni correction for N=19 comparisons).
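The convergence criterion can be implemented with SciPy as sketched below, assuming a precomputed array `auc_by_count` of shape (20, 1 000) holding the AUC for each ROI count (1 through 20) and each of the 1 000 random permutations; the array name and layout are our assumptions.

```python
# Sketch of the ROI-count convergence test (Wilcoxon signed-rank with Bonferroni correction).
import numpy as np
from scipy.stats import wilcoxon

def convergence_point(auc_by_count: np.ndarray, alpha: float = 0.05) -> int:
    n_comparisons = auc_by_count.shape[0] - 1      # 19 comparisons against 20 ROIs
    threshold = alpha / n_comparisons              # Bonferroni-corrected significance level
    reference = auc_by_count[-1]                   # per-permutation AUCs using all 20 ROIs
    for k in range(n_comparisons):
        _, p = wilcoxon(auc_by_count[k], reference, alternative="two-sided")
        if p > threshold:                          # first count indistinguishable from 20 ROIs
            return k + 1
    return auc_by_count.shape[0]
```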
All our models are implemented in Python 3.9.7 with PyTorch 1.11.0. Each model is trained on a single Tesla V100 PCIE graphics processing unit with 16 GB of memory.
Data augmentation
Data augmentation is a standard way to improve the training of computer vision algorithms. It consists of a set of image manipulations (e.g., changing the contrast/brightness, rotating) which do not fundamentally change the content of the image (the manipulations are said to be “label-preserving”13). In our problem, the characteristics of malignancy should be preserved after augmentation: slightly changing the brightness/contrast of an image does not change our decision about whether it should be classified as malignant (see Figure 3 for examples of data augmentations). The result of this augmentation process can be thought of in two ways. First, it effectively increases the size of the training set: each image/label pair appears multiple times in the training process, each time with a different image augmentation. Having a larger training set means the classification model is less likely to overfit the training data. Second, augmentations make the algorithm more robust to small variations in image characteristics, which in turn improves the performance of the algorithm on mobile images, as we show in this paper.
Figure 3:
Data augmentation examples, shown on the same regions of interest (ROIs) for ease of comparison. Each row is a different type of augmentation, and each column is an augmentation “strength.” The left column shows the original ROI, and the middle and right columns show the augmented ROI (for two different strengths).
We base our data augmentation strategy on “TrivialAugment”14, which recently achieved state-of-the-art performance on several computer vision benchmarks. TrivialAugment works as follows: for each example (i.e., each ROI), a data augmentation operation is randomly sampled from the following list: brightness, saturation, contrast, sharpness, posterize, equalize, solarize, autocontrast, shear, rotation, and translation. Then, an augmentation strength (e.g., brightness level, contrast level, rotation angle) is randomly sampled, and finally, the augmentation is applied to the ROI. We make two changes to the TrivialAugment algorithm. First, we remove shears from the list of augmentations, since image shears are not characteristic-preserving for cytology images. Second, to facilitate analysis, we split the list of augmentations into two distinct types: color augmentations (brightness, saturation, contrast, sharpness, posterize, equalize, solarize, autocontrast – see Figure 3 for examples) and motion augmentations (rotations and translations). This split is useful because we would like to specifically study the effect of color augmentations on model performance since we hypothesize that the main difference between the training and test sets is in their color compositions (see Figure 4). Note that the data augmentation procedure is applied only to the training set of scanner ROIs. During evaluation on the paired mobile/scanner ROIs, we use the images as-is, without any augmentation (Figure 1, right).
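One possible torchvision implementation of this modified policy is sketched below: for each ROI, one operation is sampled from either the color list or the motion list. The operation lists follow the paper, but the strength ranges and the way the two groups are combined in a single training pipeline are our assumptions.

```python
# Sketch of the modified TrivialAugment-style policy (illustrative strength ranges).
from torchvision import transforms

color_augmentations = transforms.RandomChoice([
    transforms.ColorJitter(brightness=0.5),                       # brightness
    transforms.ColorJitter(saturation=0.5),                       # saturation
    transforms.ColorJitter(contrast=0.5),                         # contrast
    transforms.RandomAdjustSharpness(sharpness_factor=2.0, p=1.0),
    transforms.RandomPosterize(bits=4, p=1.0),
    transforms.RandomEqualize(p=1.0),
    transforms.RandomSolarize(threshold=128, p=1.0),
    transforms.RandomAutocontrast(p=1.0),
])

motion_augmentations = transforms.RandomChoice([
    transforms.RandomRotation(degrees=30),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
])

# Training-time pipeline for scanner ROIs; no augmentation is applied at test time.
train_transform = transforms.Compose([
    color_augmentations,
    motion_augmentations,
    transforms.ToTensor(),
])
```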
Figure 4:
Example regions of interest (ROIs) from mobile test set (left) and scanner test set (center), with their corresponding RGB histograms. “Matched” (right) is identical to the mobile image, but approximately matches the color histogram of the scanner image. Histograms are scaled to be between 0 and 1 for ease of visualization.
Results
We trained a baseline model without augmentation, similar to a model from our previous work which achieves human-level performance for the prediction of malignancy using WSIs of thyroid FNABs.3 While this baseline model provides high AUC (97.8%, CI=[95.4%, 100.0%]) for scanner ROIs, its performance is statistically worse for mobile images (89.5% AUC, CI=[82.3%, 96.6%], p=0.019). Paired mobile and scanner ROIs show the same follicular group(s), but they have different color characteristics. Therefore, we hypothesized that the degradation in performance is likely due to color differences. To show this, we modified the mobile ROIs by matching their color histograms with those of their corresponding ROIs from the scanner set (Figure 4). Histogram matching improved performance on the mobile test set (94.8% AUC, CI=[90.7%, 98.8%] vs. 89.5% AUC, CI=[82.3%, 96.6%] for the baseline model, p=0.115) (Figure 5). For reference, Figure 5 also shows the performance of the EMR pathologist (95.6% AUC, CI=[91.6%, 99.5%]) by comparing their assigned TBS category to the ground-truth final pathology. Of course, histogram matching is not viable in practice since we will not have a corresponding scanner image against which to match histograms. However, the performance improvement using histogram matching alone suggests that the baseline model is overly sensitive to slight color variations and performs better on images whose color distributions are similar to those of the training data. This informed our next iteration of the model, in which we used data augmentation during training.
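The per-channel histogram-matching step of this analysis can be reproduced with scikit-image, as sketched below; the file names are hypothetical and the clipping/casting details are our assumptions.

```python
# Sketch of histogram matching a mobile ROI to its paired scanner ROI.
import numpy as np
from skimage import io
from skimage.exposure import match_histograms

mobile_roi = io.imread("mobile_roi.png")      # hypothetical paired ROI images
scanner_roi = io.imread("scanner_roi.png")

# Per-channel matching; channel_axis=-1 treats the last axis as RGB (scikit-image >= 0.19).
matched_roi = match_histograms(mobile_roi, scanner_roi, channel_axis=-1)
matched_roi = np.clip(matched_roi, 0, 255).astype(np.uint8)
```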
Figure 5:
Naïve model performance on mobile, scanner, and “matched” images. The naïve model is significantly worse on mobile images (89.5% AUC, CI=[82.3%, 96.6%]) compared with scanner images (97.8% AUC, CI=[95.4%, 100.0%], p=0.019). Applying histogram matching (“Matched”) improves performance (94.8% AUC, CI=[90.7%, 98.8%], p=0.115) on mobile images. For reference, we also show the performance of the EMR pathologist’s assigned TBS category (95.6% AUC, CI=[91.6%, 99.5%]). The error bars show the 95% confidence interval obtained from DeLong’s method. (Area Under the Receiver Operating Characteristic [AUC], Confidence Interval [CI])
Figure 6 summarizes model performance for different types of data augmentation. We found that data augmentation during training is effective in closing the performance gap between the mobile and scanner test sets. Adding color augmentations improved the performance on the mobile test set (96.0% AUC, CI=[91.8%, 100.0%] vs. 89.5% AUC, CI=[82.3%, 96.6%] for the baseline model, p=0.048) and did not significantly decrease performance on the scanner test set (97.6% AUC, CI=[95.0%, 100.0%] vs. 97.8% AUC, CI=[95.4%, 100.0%] for the baseline model, p=0.546). Further, we found that motion augmentations improved performance on the mobile test set (92.7% AUC, CI=[87.7%, 97.7%] vs. 89.5% AUC, CI=[82.3%, 96.6%] for the baseline model, p=0.205), but not to the same level as color augmentations. Finally, we note that our proposed model (i.e., with color augmentations) has an AUC that is statistically indistinguishable from the EMR pathologist’s AUC (96.0%, CI=[91.8%, 100.0%] for the model vs. 95.6% AUC, CI=[91.6%, 99.5%] for the EMR pathologist, p=0.875).
Figure 6:
Effect of different training data augmentations on model performance. The proposed color augmentations (right) yield the most improvement in mobile test set AUC (from 89.5% AUC, CI=[82.3%, 96.6%] to 96.0% AUC, CI=[91.8%, 100.0%], p=0.048). The hatching patterns show the modality used for diagnosis. The error bars show the 95% confidence interval obtained from DeLong’s method. (Area Under the Receiver Operating Characteristic [AUC], Confidence Interval [CI])
Figure 7 shows how performance of the model evolves as we use an increasing number of ROIs per test slide (starting at 1 ROI, and going up to 20 ROIs). To control for randomness in the ROI ordering, we report the average performance and interquartile range for 1 000 random permutations of the ROI ordering. For our proposed model (trained with color augmentations), we find that the AUC levels off at 15 ROIs for the mobile test set: using the Wilcoxon two-sided signed rank test comparing with model performance at 20 ROIs, we obtain p < 0.05/19 for 1–14 ROIs and p > 0.05/19 for 15–19 ROIs. In the supplementary material, we plot all p-values obtained from this test for 1–19 ROIs.
Figure 7:
Mobile and scanner AUC vs. number of ROIs at test time for the naïve model (no augmentation) and the proposed model (color augmentations). The error bars are the interquartile range over 1 000 random permutations of ROI orderings. The proposed model converges at 15 ROIs (determined using a two-sided Wilcoxon signed-rank test comparing with performance at 20 ROIs with significance level α=0.05 and a Bonferroni correction for N=19 comparisons). (Area Under the Receiver Operating Characteristic [AUC], Confidence Interval [CI])
Figure 8 shows the performance of our model on “indeterminate cases” (cases classified as TBS 3, 4, or 5 by the EMR pathologist). For the mobile test set, our proposed model (trained with color augmentations) achieves 96.2% AUC (vs. 88.3% AUC for the baseline model, p=0.158). We also note that the performance of our model on this “indeterminate” subset is better for mobile images than for scanner images, but not significantly so (94.5% AUC, CI=[87.1%, 100.0%] for scanner images vs. 96.2% AUC, CI=[90.9%, 100.0%] for mobile images, p=0.564).
Figure 8:
AUC of the naïve model (no augmentation) and the proposed model (color augmentations) on indeterminate cases (cases assigned TBS 3, 4, or 5 by the EMR pathologist). The hatching patterns show the modality used for diagnosis. The error bars show the 95% confidence interval obtained from DeLong’s method. (Area Under the Receiver Operating Characteristic [AUC], Confidence Interval [CI])
Finally, we measured the risk of malignancy (ROM) in each of the model’s predicted TBS categories. The ROMs for predicted TBS 2, 3, 4, 5, and 6 were 0.0, 16.7, 90.0, 100.0, and 100.0%, respectively (vs. 0–3, 6–18, 10–40, 45–60, and 94–96% reported in Cibas & Ali6).
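The ROM for a predicted category is simply the fraction of slides assigned to that category whose final surgical pathology is malignant; a minimal pandas sketch follows, with toy values in place of the study data.

```python
# Sketch of the ROM-per-predicted-category computation (toy values, not study data).
import pandas as pd

predicted_tbs = [2, 2, 3, 3, 3, 4, 5, 6]   # model's predicted TBS category per slide
malignant     = [0, 0, 0, 1, 0, 1, 1, 1]   # 1 if malignant on surgical follow-up

df = pd.DataFrame({"predicted_tbs": predicted_tbs, "malignant": malignant})
rom_percent = df.groupby("predicted_tbs")["malignant"].mean() * 100   # ROM (%) per category
print(rom_percent)
```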
Discussion
In this study, we evaluated whether a machine learning algorithm trained on WSIs of thyroid FNABs could be effectively applied to smartphone images. While a few recent works3,15–17 have also used similar approaches for thyroid cytopathology WSIs (see Kezlarian & Lin18 for a comprehensive review), this paper is (to our knowledge) the first to use machine learning to classify malignancy from mobile images of thyroid FNABs. Our approach is particularly relevant to the practice of cytopathology in low- to middle-income countries (LMICs). On a dataset of 100 slides (with 20 ROIs per slide) selected from the WSI and captured with a smartphone, we examine the performance gap between mobile phone-based and scanner-based imaging for the diagnosis of cancer. Our study shows that mobile phone image capture, coupled with machine learning, can be used to classify thyroid FNABs and aid in the prediction of malignancy. Our approach raises several important questions which we addressed in this work.
First, what is the difference in diagnostic quality of mobile phone images compared with WSIs obtained from a high-resolution scanner? Skandarajah et al.19 analyzed the optical quality (e.g., resolution, image distortion, illumination variation) of mobile phone cameras for cell imaging. They showed that 5-megapixel mobile phone cameras have enough resolution to operate near the diffraction limit in magnification ranges relevant for cell imaging. They also noted that the typical convenient automations of mobile phone cameras (e.g., auto-focus, exposure) can hinder accurate color capture. Rivenson et al.20 used deep learning to reliably correct differences in image characteristics (as measured by reconstruction error) between mobile and scanner images of lung tissue, Pap smears, and blood smears. However, we believe that diagnostic quality is best assessed by measuring the accuracy of the deep learning algorithm in predicting malignancy, which directly quantifies the gap in diagnostic quality between scanner and mobile images. Specifically, we compare our model’s malignancy predictions against the final surgical pathology. Our initial assessment showed an 8.3% absolute decrease in AUC (p=0.019) between the scanner and mobile images using the baseline neural network. By noting that color histogram matching improved performance (Figure 5), we showed that our baseline algorithm is overly sensitive to slight variations in color. However, matching color histograms between scanner and mobile images is not feasible in low-resource settings where scanner images would not be available.
This brings us to the second challenge in implementing our workflow: how can we practically address color and image quality differences between mobile and scanner images? De Haan et al.21 built an end-to-end system for automatic diagnosis of sickle cell disease from mobile phone-based microscope images. They used an “enhancement network” to correct differences between the mobile phone images and those of a benchtop microscope. This network explicitly models the differences between a specific scanner and a specific mobile phone. Such a process, in our case, would require a large set of paired mobile/scanner images to train the “enhancement network.” Our approach simplifies the process: rather than changing the smartphone images with an additional enhancement network, we used data augmentation (only during training) to make our classification model less sensitive to differences in image characteristics. This is also an important consideration from a clinical perspective where direct manipulation of digital pathology images for diagnostic purposes may lead to regulatory issues. In addition, this data augmentation approach does not require the time-consuming collection of paired ROIs for enhancement network training. The approach results in a robust, more platform-agnostic model (i.e., a model which is not tied to specific phone or scanner models), as evidenced by the model’s performance on a test set of mobile images (having been trained only using WSIs).
Applying color augmentations during training improved the algorithm’s performance on the mobile test set to be statistically indistinguishable from its performance on the scanner test set across all TBS diagnoses (97.6% AUC, CI=[95.0%, 100.0%] on the scanner test set vs. 96.0% AUC, CI=[91.8%, 100.0%] on the mobile test set, p=0.309) and among the indeterminate cases (96.1% AUC, CI=[90.5%, 100.0%] for mobile vs. 94.3% AUC, CI=[86.7%, 100.0%] for scanner, p=0.718). Additionally, color augmentations improved performance on indeterminate (TBS 3, 4, or 5) mobile images (88.3% AUC, CI=[73.7%, 100.0%] for the baseline model vs. 96.2% AUC, CI=[90.9%, 100.0%] with color augmentations, p=0.158). This is a promising indication that mobile phones could serve as a viable slide digitization mechanism for diagnostic use.
Another practical consideration for this workflow is the amount of time it takes to collect the ROIs from the slide. In the proposed workflow, a trained lab technologist would review FNAB slides under the microscope and use a smartphone mounted to the microscope to capture 20 regions of interest. Our machine learning algorithm would then be applied to those ROIs and provide a prediction of malignancy and TBS diagnosis. In our previous work,3 we used hundreds of ROIs from the WSI to make a malignancy prediction for each slide. This is not practical in a real-world setting because capturing hundreds of ROIs with a mobile phone would be time-consuming. We analyzed the performance of our proposed model as a function of the number of ROIs used at test time to determine how many ROIs need to be manually collected. We show that our model’s performance across all TBS categories converges at 15 ROIs for the mobile test set (Figure 7). In terms of time spent, capturing 20 ROIs with the mobile phone and microscope setup was comparable to manually highlighting 20 ROIs on the Aperio ImageScope platform (between 5 and 10 minutes per slide). By attaching the camera phone to the microscope, the user (R.D.) integrated ROI capture with standard microscopic techniques (i.e., coarse focus, fine focus, switching objectives, and use of a slide stage) familiar from routine pathology practice. In this respect, the camera phone system is more like routine practice than the use of WSIs.
While the primary goal of the model is to predict final pathology, the model’s predicted TBS category also provides valuable insight into its mechanism. First, the observed risk of malignancy (ROM) increases with the predicted TBS category. Moreover, the ROMs (for all categories except TBS 4) are close to observed ROMs in clinical practice.22 We note that the ROMs in our model’s predicted TBS categories are skewed higher for TBS 4 and 5 than our institutional ROMs.23 The TBS 4 ROM of the model is 90.0% vs. the institutional ROM of 35%. This skew is likely because our model was trained to predict final pathology, so it attempts to separate benign and malignant cases distinctly. In doing so, the model assigns most malignant cases into TBS categories 4 or higher. The high ROM for the model’s predicted TBS 4 category is clinically acceptable since surgery is indicated in most cases assigned TBS 4 and above.24 A similar effect is noted for TBS 5, though the difference with institutional ROM is not as stark (100% for the model vs. 79.5% institutional ROM).
The overall performance of our method (96.0% AUC on mobile images), its performance on indeterminate cases (96.2% AUC on mobile images), its data efficiency (AUC convergence at 15 ROIs per slide), and its sensible TBS predictions support our hypothesis that the proposed workflow can be effective in thyroid cancer diagnosis. To our knowledge, this work is the first application of machine learning to thyroid cytopathology with a smartphone. We believe this proof-of-concept is promising for low-resource settings, where neither high-resolution scanners nor expert pathologists are always available. It is a first step towards deploying an automated system needing only a microscope and a smartphone equipped with a machine learning algorithm. In particular, the malignancy prediction algorithm we used, MobileNetV2, was specifically designed to operate on mobile phones given its light memory footprint and fast runtime.
Several limitations of our work and directions for future work should be addressed. First, there is a difference between the way the model was evaluated and its intended deployment. During evaluation, the pathology resident selected 20 ROIs from a pool of 1 000 scanner ROIs identified by a neural network. Then, they found the corresponding ROIs on a microscope and captured them using a mobile phone – this was done only to speed up the creation of the paired test dataset of scanner and mobile ROIs. In practice, we envision that a human cytotechnologist would select the 20 ROIs using a microscope and capture them with a mobile phone.
All the models presented in this study were trained exclusively on scanner images, not mobile images. This is purely due to data collection constraints: we did not have enough mobile samples to construct a reasonably sized training set, given that collecting thousands of mobile ROIs for training would be time-consuming. Although this is a limitation, it shows that our model is robust to differences in image modalities. Given more mobile samples, we suspect that incorporating mobile images into the training set would further improve the algorithm’s performance and robustness.
The approach we propose could also potentially be fully automated. For example, the diagnosis could be done entirely using a smartphone equipped with neural networks for ROI detection and classification, but this would require development of hardware to control ROI capture. Finally, the scale of the evaluation could be larger, including more slides (currently there are 100), more mobile devices (currently we use a Redmi Note 10S phone), and more clinical centers to collect data from (currently all the cases are from a single institution). Also, it is worth noting that our algorithm uses only a single z-stack depth from the WSI (we default to the “middle” depth out of 9 z-stack depths), which may result in some ROIs being blurry and slightly decrease our algorithm’s performance. Some of the results presented are not statistically significant (e.g., effect of color augmentations on model performance among indeterminate cases with p=0.158, effect of histogram matching on naïve model performance with p=0.118), but we conjecture that differences are likely to become significant with a larger sample. Finally, since our dataset consists of FNABs with surgical follow-up, it is likely biased toward clinically concerning nodules requiring resection.
Supplementary Material
Acknowledgments
This work is supported in part by the National Cancer Institute, National Institutes of Health, Grant/Award Number: 1R21CA268428-01. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Footnotes
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Conflict of Interest
The authors declare no competing interests.
Ethics Approval/Consent to Participate
After Institutional Review Board approval was obtained, we searched institutional files for all thyroidectomy specimens with a preceding fine-needle aspiration biopsy. We used them to construct the training set and test set used in the paper.
Supplementary information is available on Modern Pathology’s website.
Data Availability Statement
The data that support the findings of this study are available from the Duke Office of Translation and Commercialization (OTC) but restrictions apply to the availability of these data, which were used under a pending license agreement for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Duke OTC.
References
- 1. Wilson ML, Fleming KA, Kuti MA, Looi LM, Lago N, Ru K. Access to pathology and laboratory medicine services: a crucial gap. Lancet 2018;391(10133):1927–1938.
- 2. Campanella G, Hanna MG, Geneslaw L, et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat Med 2019;25(8):1301–1309.
- 3. Dov D, Kovalsky SZ, Assaad S, et al. Weakly supervised instance learning for thyroid malignancy prediction from whole slide cytopathology images. Med Image Anal 2021;67:101814.
- 4. Nagpal K, Foote D, Liu Y, et al. Development and validation of a deep learning algorithm for improving Gleason scoring of prostate cancer. NPJ Digit Med 2019;2(1):1–10.
- 5. Isse K, Lesniak A, Grama K, Roysam B, Minervini MI, Demetris AJ. Digital transplantation pathology: combining whole slide imaging, multiplex staining and automated image analysis. Am J Transplant 2012;12(1):27–37.
- 6. Cibas ES, Ali SZ. The 2017 Bethesda System for Reporting Thyroid Cytopathology. Thyroid 2017;27(11):1341–1346.
- 7. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556; 2014. https://arxiv.org/abs/1409.1556
- 8. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C. MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2018:4510–4520.
- 9. Lin M, Chen Q, Yan S. Network in network. arXiv preprint arXiv:1312.4400; 2013.
- 10. Loshchilov I, Hutter F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101; 2017.
- 11. Dietterich TG. Ensemble methods in machine learning. In: Multiple Classifier Systems. Springer; 2000:1–15.
- 12. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988;44(3):837–845.
- 13. Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. J Big Data 2019;6(1):1–48.
- 14. Müller SG, Hutter F. TrivialAugment: tuning-free yet state-of-the-art data augmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2021:774–782.
- 15. Sanyal P, Mukherjee T, Barui S, Das A, Gangopadhyay P. Artificial intelligence in cytopathology: a neural network to identify papillary carcinoma on thyroid fine-needle aspiration cytology smears. J Pathol Inform 2018;9(1):43.
- 16. Savala R, Dey P, Gupta N. Artificial neural network model to distinguish follicular adenoma from follicular carcinoma on fine needle aspiration of thyroid. Diagn Cytopathol 2018;46(3):244–249.
- 17. Guan Q, Wang Y, Ping B, et al. Deep convolutional neural network VGG-16 model for differential diagnosing of papillary thyroid carcinomas in cytological images: a pilot study. J Cancer 2019;10(20):4876.
- 18. Kezlarian B, Lin O. Artificial intelligence in thyroid fine needle aspiration biopsies. Acta Cytol 2021;65(4):324–329.
- 19. Skandarajah A, Reber CD, Switz NA, Fletcher DA. Quantitative imaging with a mobile phone microscope. PLoS One 2014;9(5):e96906.
- 20. Rivenson Y, Ceylan Koydemir H, Wang H, et al. Deep learning enhanced mobile-phone microscopy. ACS Photonics 2018;5(6):2354–2364.
- 21. de Haan K, Ceylan Koydemir H, Rivenson Y, et al. Automated screening of sickle cells using a smartphone-based microscope and deep learning. NPJ Digit Med 2020;3(1):1–9.
- 22. Faquin WC, Wong LQ, Afrogheh AH, et al. Impact of reclassifying noninvasive follicular variant of papillary thyroid carcinoma on the risk of malignancy in The Bethesda System for Reporting Thyroid Cytopathology. Cancer Cytopathol 2016;124(3):181–187.
- 23. Elliott Range DD, Dov D, Kovalsky SZ, Henao R, Carin L, Cohen J. Application of a machine learning algorithm to predict malignancy in thyroid cytopathology. Cancer Cytopathol 2020;128(4):287–295.
- 24. Haugen BR, Alexander EK, Bible KC, et al. 2015 American Thyroid Association management guidelines for adult patients with thyroid nodules and differentiated thyroid cancer: the American Thyroid Association guidelines task force on thyroid nodules and differentiated thyroid cancer. Thyroid 2016;26(1):1–133.