PLOS One. 2024 Dec 30;19(12):e0315862. doi: 10.1371/journal.pone.0315862

Problems of magnetic resonance diagnosis for gastric-type mucin-positive cervical lesions of the uterus and its solutions using artificial intelligence

Ayumi Ohya 1, Tsutomu Miyamoto 2,*, Fumihito Ichinohe 1, Hisanori Kobara 2, Yasunari Fujinaga 1, Tanri Shiozawa 2
Editor: Kazunori Nagasaka
PMCID: PMC11684648  PMID: 39775578

Abstract

Purpose

To reveal the problems of magnetic resonance imaging (MRI) in differentiating gastric-type mucin-positive lesions (GMPLs) from gastric-type mucin-negative lesions (GMNLs) of the uterine cervix.

Methods

We selected 172 patients suspected to have lobular endocervical glandular hyperplasia; their pelvic MR images were divided into training (n = 132) and validation (n = 40) groups. The images of the validation group were read twice by three pairs of readers (six readers in total) to determine the accuracy, area under the curve (AUC), and intraclass correlation coefficient (ICC). The readers evaluated three images (sagittal T2-weighted image [T2WI], axial T2WI, and axial T1-weighted image [T1WI]) for every patient. A pre-trained convolutional neural network (pCNN) was fine-tuned to differentiate between GMPLs and GMNLs, with four-fold cross-validation performed using cases in the training group. The accuracy and AUC were then obtained using the MR images of the validation group. For each case, the same three images (sagittal T2WI and axial T2WI/T1WI) were entered into the CNN. The calculations were performed twice independently, the ICC (2,1) between the first- and second-time CNN results was evaluated, and these results were compared with those of the readers.

Results

The highest accuracy of the readers was 77.50%. The highest ICC (1,1) within a pair of readers was 0.750, and all ICC (2,1) values between pairs were <0.7, indicating poor agreement. The highest accuracy of the CNN was 82.50%, and the AUC did not differ significantly between the CNN and the readers. The ICC (2,1) of the CNN was 0.965.

Conclusions

Variation in inter-reader and intra-reader accuracy limits the differentiation between GMPLs and GMNLs by MRI. A CNN is nearly as accurate as human readers and markedly improves the reproducibility of diagnosis.

Introduction

Common benign lesions in the uterine cervix include Nabothian cysts, tunnel clusters, lobular endocervical glandular hyperplasia (LEGH), endometriosis, and cervical polyps [1]. LEGH is a benign lesion first proposed by Nucci et al. [2], but it may be a precursor lesion for gastric-type mucinous carcinoma (GAS) [3–8]. Therefore, frequent and long-term follow-up or surgical treatment is selected once LEGH is diagnosed. In contrast, other benign cystic lesions of the uterine cervix do not require follow-up as frequently as LEGH. Clinically, therefore, LEGH must be distinguished from other benign cystic lesions.

On magnetic resonance imaging (MRI), LEGH shows a characteristic finding called the ‘cosmos pattern’, whereas Nabothian cysts show a coarse cyst pattern [9]. However, some Nabothian cysts exhibit MRI findings similar to those of LEGH [10]. The decisive difference from other benign lesions is that LEGH and GAS secrete gastric-type mucin, which has O-linked oligosaccharides with a terminal α1,4-linked N-acetylglucosamine (αGlcNAc) residue [11]. Because gastric-type mucin is a neutral mucin, LEGH and GAS exhibit a ‘two-color pattern’ on Pap smears [12]. Furthermore, αGlcNAc can be detected in cervical mucus by a latex agglutination assay using the monoclonal antibody HIK1083 [13]. This method has extremely high sensitivity and specificity [13]; however, the number of facilities that can perform it is limited. Therefore, differentiation by MRI findings is extremely important.

An attempt was recently made to classify cervical lesions into gastric-type mucin-positive lesions (GMPLs) and gastric-type mucin-negative lesions (GMNLs) based on MRI findings [10]. The specificity was 95.5% when the cosmos pattern was observed as a hypointense area compared with the cervical stroma on T1-weighted images (T1WIs) [10]. However, the accuracy of differentiating GMPLs from GMNLs by MRI findings remains unclear. In addition, no study has reported differences among physicians in diagnostic performance for GMPLs.

Meanwhile, the field of machine learning has developed remarkably. A convolutional neural network (CNN) is a machine learning algorithm of great interest in diagnostic imaging [14]. It can perform equivalently to or better than humans in some image classification tasks [14]. In particular, transfer learning using pre-trained CNNs (pCNNs) can achieve high classification performance with a relatively small dataset [14]. Although GMPLs, typified by LEGH, are relatively infrequent, diagnosing them with a pCNN should therefore be feasible. In addition, because a trained model always returns the same diagnosis for the same input, artificial intelligence is a promising solution when the reproducibility of physicians’ diagnoses is low. However, transfer learning involves randomness, and the reproducibility of the learning results has not been fully investigated.

Therefore, this study aimed to clarify the accuracy of MRI diagnosis of GMPLs and the differences in diagnostic ability among physicians, to identify the problems in current practice, and to explore the possibility of diagnosing GMPLs with artificial intelligence.

Materials and methods

Patient population

We reviewed the medical records of our hospital and selected 172 consecutive patients with clinical suspicion of LEGH or GAS (based on ultrasonographic findings, such as multiple cysts of the cervix, and symptoms, such as watery vaginal discharge) who underwent pelvic MRI between January 2000 and October 2020. Patients ranged in age from 26 to 82 years (mean, 48.7 years). The process of patient selection and grouping is summarized in Fig 1. In 171 patients, a cervical Pap smear or a latex agglutination assay using monoclonal antibody HIK1083 (HIK test; Cica HIK gastric-type mucin; Kanto Kagaku, Tokyo, Japan) [13] had been performed to determine whether the cervical mucus included gastric-type mucin. Among the 172 patients, 35 underwent surgery or biopsy and were pathologically diagnosed with benign cystic lesions other than LEGH (BCL), LEGH, LEGH with atypia or adenocarcinoma in situ (aLEGH), or GAS. One patient who had not undergone a Pap smear or latex agglutination assay before surgery was pathologically diagnosed with GAS after surgical resection. Patients with gastric-type mucin confirmed by a cervical Pap smear or latex agglutination assay and those with LEGH or GAS pathologically diagnosed after surgical resection were included in the GMPL group. Patients without gastric-type mucin and those with pathologically diagnosed BCLs were included in the GMNL group. There were 76 and 96 patients in the GMPL and GMNL groups, respectively. Thirty-one of the 76 patients in the GMPL group underwent surgical resection or biopsy; of these, 15 exhibited LEGH, 13 exhibited aLEGH, and three exhibited GAS. In contrast, in the GMNL group, surgical resection was performed in only four of 96 patients, in each case for a disease other than a cervical lesion. Two of these patients were pathologically diagnosed with Nabothian cysts, and the other two with tunnel clusters.
The classification of histopathologic lesions is summarized in Table 1.

Fig 1. Patient selection, inclusion criteria, and grouping.


Table 1. Definition of lesions.

Lesion Definition n Mean age
GMPLs
 LEGH Histopathologically diagnosed LEGH, LEGH with atypia, or LEGH with AIS 28 44 years
 GAS Histopathologically diagnosed GAS 3 42 years
 Other Positive for gastric-type mucin by cervical Pap smear or latex agglutination assay 45 51 years
GMNLs
 Nabothian cyst Histopathologically diagnosed Nabothian cyst 2 49.5 years
 Tunnel cluster Histopathologically diagnosed tunnel cluster 2 51 years
 Other Negative for gastric-type mucin by cervical Pap smear or latex agglutination assay 92 48 years

GMPLs, gastric-type mucin-positive lesions; GMNLs, gastric-type mucin-negative lesions; LEGH, lobular endocervical glandular hyperplasia; AIS, adenocarcinoma in situ; GAS, gastric-type mucinous adenocarcinoma

This study was approved by the ethics committee of our institution (approval no.: 4423). The ethics committee waived the requirement for informed consent for the use of the patients’ information and MR images because diagnostic use of the samples had been completed before the study, and there was no risk to the involved patients. The patients’ information and MR images were also coded to protect patient anonymity.

Lesions in patients determined to have gastric-type mucin secretion by surgery, biopsy, cervical Pap smear, or latex agglutination assay were designated gastric-type mucin-positive lesions (GMPLs). There were 76 GMPLs, of which 31 were confirmed by surgery or biopsy and the remainder by cervical Pap smear or latex agglutination assay (Table 1). In contrast, lesions in patients determined not to have gastric-type mucin secretion by these methods were designated gastric-type mucin-negative lesions (GMNLs). There were 96 GMNLs, of which four were confirmed by surgery or biopsy and the remainder by cervical Pap smear or latex agglutination assay (Table 1). The total number of cases was therefore 172. Of these, 132 cases were randomly selected for training the convolutional neural network (CNN) without changing the overall ratio of patients with GMPL to those with GMNL. The remaining 40 cases were used for CNN validation and the physician-pair reading experiments. The training data outnumber the validation data because more data are needed to train the CNN.
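The stratified 132/40 split described above can be sketched in Python. The authors performed this step in MATLAB; `stratified_split` and the numeric case IDs below are hypothetical illustrations, not the study's code.

```python
import random

def stratified_split(gmpl_ids, gmnl_ids, n_train=132, n_total=172, seed=0):
    """Randomly split cases into training/validation sets while keeping
    the overall GMPL:GMNL ratio (76:96 in this study) unchanged."""
    rng = random.Random(seed)
    frac = n_train / n_total
    n_gmpl_train = round(len(gmpl_ids) * frac)   # 76 * 132/172 -> 58
    n_gmnl_train = n_train - n_gmpl_train        # 132 - 58 -> 74
    gmpl = rng.sample(gmpl_ids, len(gmpl_ids))   # shuffled copies
    gmnl = rng.sample(gmnl_ids, len(gmnl_ids))
    train = gmpl[:n_gmpl_train] + gmnl[:n_gmnl_train]
    valid = gmpl[n_gmpl_train:] + gmnl[n_gmnl_train:]
    return train, valid

train, valid = stratified_split(list(range(76)), list(range(76, 172)))
```

With 76 GMPL and 96 GMNL cases, this yields roughly 58 GMPL and 74 GMNL training cases, leaving 18 GMPL and 22 GMNL cases for validation.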

MR images

The most recent pelvic MR images of each patient stored on the image server of our institution were used for analysis. If a hysterectomy or conisation had been performed, the MR images acquired immediately before treatment were selected. Of these MR images, the sagittal T2-weighted images (T2WIs) with or without fat suppression, axial T2WIs with or without fat suppression, and axial T1WIs with or without fat suppression were used for analysis. The types and magnetic field strengths of the MRI units and the parameters of each sequence varied because the study was retrospective and the MR images were acquired over a long period. All patients underwent MRI using 1.5 or 3.0 tesla scanners. A total of 134 patients were imaged with Siemens scanners (Siemens Healthcare Diagnostics, Erlangen, Germany): 28 patients with a 3T Trio, 27 with a 3T Prisma, 22 with a 3T Vida, two with a 3T Skyra, 39 with a 1.5T Avanto, 11 with a 1.5T Symphony, three with a 1.5T Aera, and two with a 1.5T Essenza. A total of 29 patients were imaged with GE scanners (GE HealthCare, Chicago, Illinois, USA): two patients with a 3T DISCOVERY MR 750w, one with a 3T SIGNA Pioneer, one with a 1.5T OPTIMA MR360w, 15 with a 1.5T OPTIMA MR450w, 10 with a 1.5T SIGNA HDxt, and one with a 1.5T SIGNA Excite HD. The remaining nine patients were imaged with two types of 3T scanners and three types of 1.5T scanners (Philips Electronics N.V., Amsterdam, the Netherlands; Canon Medical Systems Corp., Tochigi, Japan). All images had a slice thickness of 2 to 7.5 mm and were captured as two-dimensional images. For each patient, one image showing the maximum cross-section of the lesion was selected from each of these three sequences, and the three images were used for analysis.

Diagnostic accuracy and reproducibility by the readers

The patients were randomly divided into the training (132 patients) and validation (40 patients) groups without changing the overall ratio of patients with GMPL to those with GMNL (Fig 1). Three pairs of readers (six readers in total: two experienced gynecologic radiologists with 31 and 13 years of experience, pair A; two young radiologists, each with 8 years of experience, pair B; and two gynecologists with 14 and 10 years of experience, pair C) diagnosed GMPLs or GMNLs on the MR images of the validation group without access to the patients’ clinical information. Each pair of readers determined the confidence level of the diagnosis by consensus, stated as a percentage based on the physicians’ experience. Each pair was presented with a total of three images per patient (sagittal T2WI, axial T2WI, and axial T1WI, each showing the maximum cross-section of the lesion). The diagnostic accuracy and area under the curve (AUC) were determined for each of the three pairs. This evaluation was performed twice, at least 1 month apart. Images of the same patients were presented to each pair during the second evaluation, but in a different order from the first. As in the first evaluation, the diagnostic accuracy and AUC of each pair were determined. In addition, intraclass correlation coefficient (ICC) (1,1) and (2,1) values were calculated to assess reproducibility; an ICC value of <0.7 was considered to indicate poor agreement.
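For reference, the single-rater, two-way random-effects ICC (2,1) of Shrout and Fleiss used for reproducibility can be computed from an n-subjects × k-raters score table. The following is a minimal pure-Python sketch for illustration, not the BellCurve for Excel implementation used in the study.

```python
def icc_2_1(data):
    """ICC(2,1): two-way random effects, absolute agreement, single rater
    (Shrout & Fleiss). `data` is an n_subjects x k_raters table of scores."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(data[i][j] for i in range(n)) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between raters
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
```

Identical ratings across raters give an ICC of 1.0, while a constant offset between raters (a systematic disagreement) lowers the value, because ICC (2,1) measures absolute agreement rather than mere consistency.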

Diagnostic accuracy evaluation by pCNNs

Among the publicly available pCNNs, fine-tuning was performed using Xception [15], which is pre-trained on the ImageNet database. Xception was run on MATLAB software (MATLAB 2020a; MathWorks, Natick, MA, USA). We trained Xception using the images of the training group (132 patients) and validated it using the images of the validation group (Fig 1). Before input into Xception, the MR images were processed so that only the lesion and cervical stroma were included in the image range (Fig 2), and they were resized to 299 × 299 pixels, the input size of Xception. As the MR images were captured under various conditions, the display conditions in the image viewer could not be held constant; therefore, the display conditions were adjusted so that the images had visually similar contrast. The fine-tuning hyperparameters were as follows: the Adam optimiser, a learning rate of 0.001 for the fully connected layers, and a learning rate of 0.00005 for the other layers. Fine-tuning was performed with a mini-batch size of 32 over 16 epochs, and the learning rate was multiplied by 0.9 every four epochs. For image augmentation, resizing, rotation, translation, and reflection were applied at random. All three images of each training case were input into Xception. The training data were randomly divided into four parts without changing the ratio of GMPL to GMNL, and four-fold cross-validation was performed (Fig 3). The accuracy of each verification in the four-fold cross-validation was calculated and averaged to obtain the tentative diagnostic accuracy, which indicates the degree of training. Finally, the images of the independent validation data were input into the model trained in each fold, and the true diagnostic accuracy and AUC for the validation data were determined by averaging the diagnostic probabilities of the four models (Fig 3). This series of calculations was performed twice independently.
The ICC (2,1) value was calculated to evaluate the reproducibility between the two runs.
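The stepped learning-rate schedule described above (multiplied by 0.9 every four epochs over 16 epochs) can be sketched as follows; 0-based epoch indexing is an assumption of this illustration.

```python
def lr_at_epoch(epoch, base_lr=0.001, drop=0.9, step=4):
    """Learning rate at a given (0-indexed) epoch: multiplied by `drop`
    every `step` epochs, matching the schedule described in the text."""
    return base_lr * drop ** (epoch // step)

# Over 16 epochs the rate takes four values: 0.001, 0.0009, 0.00081, 0.000729
schedule = [lr_at_epoch(e) for e in range(16)]
```

The same function with `base_lr=0.00005` gives the schedule for the non-fully-connected layers.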

Fig 2. Three images input into the pre-trained convolutional neural networks.


A sagittal T2-weighted image was input into the red channel, an axial T2-weighted image into the green channel, and an axial T1-weighted image into the blue channel. The converted images were processed to include only the lesion and cervical stroma of the uterus.
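The channel packing described in this caption can be sketched as follows. This is a simplified illustration using nearest-neighbour resizing on plain nested lists; the study's MATLAB preprocessing is not available, and the function names are ours.

```python
def resize_nn(img, size=299):
    """Nearest-neighbour resize of a 2-D list `img` to size x size."""
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]

def pack_rgb(sag_t2, ax_t2, ax_t1, size=299):
    """Pack three grayscale MR slices into one size x size x 3 'RGB' image:
    sagittal T2WI -> red, axial T2WI -> green, axial T1WI -> blue."""
    planes = [resize_nn(im, size) for im in (sag_t2, ax_t2, ax_t1)]
    return [[[planes[ch][r][c] for ch in range(3)] for c in range(size)]
            for r in range(size)]
```

The result is a three-channel image of the network's input size in which each colour channel carries one MR sequence, so the pretrained RGB backbone can consume all three views at once.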

Fig 3. Training data were divided into four data sets so that the ratio of gastric mucin-positive and gastric mucin-negative lesions did not change, and four-fold cross-validation was performed.


The accuracy of each verification in the four folds was calculated and averaged to obtain the tentative diagnostic accuracy. Using the independent validation data, the diagnostic probability for each case was then calculated in each of the four folds. Finally, the true diagnostic accuracy was obtained by averaging the four diagnostic probabilities. T, training; V, validation.
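The fold-averaging step in this caption amounts to taking the mean GMPL probability over the four trained models and thresholding it; a minimal sketch follows, in which the 0.5 threshold is an assumption of this illustration.

```python
def ensemble_predict(fold_probs, threshold=0.5):
    """Average per-case GMPL probabilities across cross-validation folds
    and threshold the mean to obtain the final diagnosis.
    `fold_probs` is an n_folds x n_cases table of probabilities."""
    n_folds = len(fold_probs)
    n_cases = len(fold_probs[0])
    means = [sum(fold_probs[f][i] for f in range(n_folds)) / n_folds
             for i in range(n_cases)]
    return means, ['GMPL' if m >= threshold else 'GMNL' for m in means]
```

Averaging the four fold models acts as a small ensemble, which tends to smooth out the variance of any single trained model.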

Statistical analysis

Creation of receiver operating characteristic (ROC) curves, calculation and comparison of AUCs, and calculation of ICC (1,1) and (2,1) values were performed using BellCurve for Excel (Social Survey Information Co., Ltd., Tokyo, Japan). DeLong’s test for two correlated ROC curves was used to compare AUCs. Values of p < 0.05 were considered statistically significant. If the 95% confidence intervals (95% CIs) did not overlap, the difference was also considered statistically significant.

Results

Diagnostic ability of the readers

The diagnostic accuracy, AUC, and ICC (1,1) values for the three pairs of readers are summarised in Table 2. The highest diagnostic accuracy, 0.775, was obtained in the first evaluation of pair B, and the lowest, 0.625, in the first evaluation of pair C. For the evaluation with the highest diagnostic accuracy (the first evaluation of pair B), with GMPL taken as positive, the precision, recall, specificity, and F-measure were 0.737, 0.778, 0.773, and 0.757, respectively.
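With GMPL as the positive class, the reported values are consistent with a confusion matrix of 14 true positives, 5 false positives, 4 false negatives, and 17 true negatives over the 40 validation cases; these counts are our reconstruction for illustration, not stated in the text.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, specificity, and F-measure for a
    binary classification with GMPL as the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                     # sensitivity
    specificity = tn / (tn + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f_measure

# Counts inferred to be consistent with the reported values (hypothetical):
metrics = binary_metrics(14, 5, 4, 17)
```

Rounded to three decimals, these counts reproduce the reported 0.775, 0.737, 0.778, 0.773, and 0.757.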

Table 2. Diagnostic accuracy, area under the curve, and intraclass correlation coefficient (1,1) value for each pair of readers.

Pairs Test Diagnostic accuracy AUC (95% CI) ICC (1,1) (95% CI)
Pair A (Experienced gynecologic radiologists) First 0.650 0.720 (0.555–0.885) 0.666 (0.452–0.808)
Second 0.725 0.814 (0.680–0.949)
Pair B (Young radiologists) First 0.775 0.777 (0.626–0.927) 0.750 (0.577–0.859)
Second 0.700 0.765 (0.608–0.923)
Pair C (Gynecologists) First 0.625 0.812 (0.671–0.953) 0.680 (0.472–0.816)
Second 0.700 0.759 (0.604–0.914)

AUC, area under the curve; CI, confidence interval; ICC, intraclass correlation coefficients

The highest AUC among the readers, 0.814, was obtained in the second evaluation of pair A, and the lowest, 0.720, in the first evaluation of pair A.

The highest ICC (1,1) value within a pair of readers was 0.750, for pair B (Table 2). The ICC (2,1) values between the pairs of readers are summarised in Table 3. All ICC (2,1) values were <0.7, indicating poor agreement.

Table 3. Intraclass correlation coefficient (2, 1) (95% confidence interval) between the physicians’ pairs.

Pair Pair B First Pair B Second Pair C First Pair C Second
Pair A First 0.631 (0.401–0.786) 0.568 (0.321–0.745) 0.453 (0.083–0.694) 0.537 (0.066–0.775)
Pair A Second 0.522 (0.255–0.715) 0.527 (0.267–0.717) 0.402 (0.068–0.646) 0.540 (0.082–0.774)
Pair B First 0.566 (0.102–0.791) 0.564 (0.051–0.799)
Pair B Second 0.389 (−0.068 to 0.685) 0.430 (−0.099 to 0.745)

Diagnostic ability of Xception and comparison with the ability of the readers

The diagnostic accuracy, AUC, and ICC (2,1) obtained by Xception are summarised in Table 4. The true diagnostic accuracy of Xception was almost identical to or higher than the tentative diagnostic accuracy in most procedures. The highest diagnostic accuracy, 0.825, was obtained in the second run and exceeded the highest diagnostic accuracy of the readers (0.775; Table 2). The highest AUC, 0.854, was also obtained in the second run. Comparison of the ROC curves of Xception with those of each pair of readers showed no statistically significant differences (Fig 4). For the procedure with the highest diagnostic accuracy, with GMPL taken as positive, the precision, recall, specificity, and F-measure were 0.867, 0.722, 0.909, and 0.788, respectively; all of these values except recall were higher than those of the readers. The ICC (2,1) value (95% CI) was 0.965 (0.934–0.981), which was higher than all ICC (2,1) values of the readers. Because this 95% CI did not overlap with those of any of the readers, the agreement of Xception was statistically significantly higher than that of all readers.

Table 4. Diagnostic accuracy, area under the curve, and intraclass correlation coefficient (2, 1) value of each convolutional neural network.

Test Tentative diagnostic accuracy True diagnostic accuracy AUC (95% CI) ICC (2,1) value
First 0.588–0.824 0.800 0.841 (0.711–0.971) 0.965
Second 0.677–0.750 0.825 0.854 (0.731–0.976)

AUC, area under the curve; CNN, convolutional neural network; CI, confidence interval; ICC, intraclass correlation coefficient

Fig 4.


(A) Comparison of the receiver operating characteristic (ROC) curve of the second independent Xception procedure with that of the first interpretation experiment of pair A, which had the lowest area under the curve (AUC) among the readers. The AUC of the second independent Xception procedure was 0.854, higher than the AUC of the first interpretation experiment of pair A (0.720); however, the difference was not statistically significant (p = 0.094). (B) Comparison of the ROC curve of the first independent Xception procedure with that of the second interpretation experiment of pair A, which had the highest AUC among the readers. The AUC of the first independent Xception procedure was 0.841, higher than the AUC of the second interpretation experiment of pair A (0.814); however, the difference was not statistically significant (p = 0.681). Both the first- and second-run AUCs of Xception exceeded those of all readers, but no statistically significant difference was observed in any pairwise comparison of ROC curves.

Discussion

In 2010, Takatsu et al. [9] proposed the ‘cosmos pattern’ as a characteristic MRI finding of LEGH. In this study, the readers diagnosed LEGH using MR images alone, and their highest diagnostic accuracy was 77.5%. The precision of the readers was approximately 0.7, indicating that approximately 30% of the lesions diagnosed as GMPLs were actually GMNLs; conversely, some GMPLs were judged to be GMNLs. The first reason for such misdiagnoses is the presence of GMPLs with atypical imaging findings. In a recent study, the cosmos pattern was found in approximately 60% of GMPLs and was the most frequently observed pattern [10]; in addition, approximately 30% of GMNLs exhibit a cosmos pattern [10]. The second reason is that the findings suggestive of LEGH vary from study to study; that is, the definitions vary. Takatsu et al. defined the cosmos pattern as ‘a pattern of relatively large cysts arranged in the cervical stroma with small cysts or solid components in the centre of the lesion’ and reported a sensitivity of 87.5% [9]. On the other hand, Ohya et al. defined it as ‘a pattern of small cysts and a solid area in the central area with large outer cysts’ and reported a higher specificity than sensitivity for GMPLs [10]. Omori et al. also proposed two types of MRI findings of LEGH: flower and raspberry types [16]. To overcome these problems, a common understanding of the imaging definition of LEGH is required.

No previous reports have described differences in diagnostic performance among physicians for LEGH or GMPLs. This study demonstrated that the concordance rate among different pairs of physicians in diagnosing GMPLs was unsatisfactory (Table 3). Diagnostic concordance was particularly poor between radiologists and gynecologists, suggesting that they use different diagnostic approaches. Whereas radiologists are experts in diagnostic imaging, gynecologists make a comprehensive diagnosis based on clinical symptoms and other clinical information. Because clinical information was not available in this experiment, the accuracy of the gynecologists was considered lower than that of the radiologists. Even between the pairs of radiologists, the concordance rates were generally not high. We attribute this to two reasons. First, GMPLs (including LEGH) are difficult to differentiate because of the rarity of the disease and the resulting lack of experience. Second, the presence of cases with atypical findings may have caused judgements to vary from day to day, even for the same case. In daily practice, whether a patient is diagnosed with a GMPL on MRI findings is therefore highly likely to vary among physicians. This situation poses a risk of incorrect MRI diagnosis, potentially resulting in some patients with GMPLs not receiving treatment and some patients with GMNLs undergoing unnecessary surgery.

The first possible solution to this problem would be to add diagnosis based on cervical mucus; however, this method is not covered by insurance in Japan and therefore cannot be performed at many facilities. Hence, MRI diagnosis is highly important, and patients in all locations need equal access to it. A possible way to achieve this is through the use of artificial intelligence.

Our results showed that the accuracy of Xception was slightly higher than that of the readers, and its precision, specificity, and F-measure were all higher. However, there was no statistically significant difference in AUC between Xception and the readers, indicating that Xception and the readers have comparable diagnostic performance. The first reason for these results is that the readers did not study the cases in advance and made judgements based only on their limited clinical experience, whereas the CNN was trained on 132 cases in advance. Second, the CNN’s diagnoses are based on an average of four diagnostic probabilities, which may contribute to its high diagnostic performance. In addition, Xception showed an extremely high ICC (2,1) value between the two independent validations, which yielded almost identical results. These findings indicate that GMPL diagnosis by the pCNN is as accurate as that by physicians but with higher reproducibility. This could be one way to solve the problems described above, as diagnosis with Xception could potentially provide the same diagnosis anywhere, at the same level as a physician, as long as the same training data are used.

The first limitation of this study is that we included both non-surgical and surgical cases. The HIK1083 latex agglutination assay (HIK test), used as the reference standard, has very high sensitivity and specificity for detecting gastric-type mucin [13]. However, no test is free of false positives and false negatives, and the results of this study may be influenced by those of the HIK test. We could not avoid this problem for the following reasons. First, our study included cases of GMNLs; because our institution uses the HIK test, GMNLs are essentially never operated on, and a patient with suspected LEGH who is negative on the HIK test is basically followed up. Second, at least 100 cases are necessary to analyze data using a pCNN [14], and it was not possible to examine only surgical cases because LEGH is a rare disease. The second limitation is that it was not possible to analyze which MRI features Xception recognised in making its diagnoses. Physicians are likely to base their decisions on whether the lesions show the cosmos pattern. However, because we averaged the diagnostic probabilities of four-fold cross-validation to avoid overfitting, we could not analyze where Xception focused on the MR images. Finally, the readers interpreted only three images per case. In clinical practice, a diagnosis is not established using only three images; therefore, the diagnostic abilities of the readers might have been underestimated in this study.

In conclusion, our study revealed heterogeneity among readers in the MRI diagnosis distinguishing GMPLs from GMNLs. Homogeneity in the diagnosis of GMPLs and GMNLs is important for patients. Xception, used for learning and diagnosis, is as powerful as physicians in distinguishing between GMPLs and GMNLs, and if the same training data are used, a highly reproducible diagnosis can be established regardless of facility. The current problems in the MRI diagnosis of GMPLs and GMNLs may thus be solved by pCNNs.

Acknowledgments

We thank Mai Komatsu, Takanori Aonuma, Keisuke Todoroki, Manaka Shinagawa, and Takeuchi Hodaka for conducting image interpretation experiments. We would like to thank Editage (www.editage.com) for English language editing.

Data Availability

The data that support the findings of this study are available in Shinshu University Institutional Repository at [http://hdl.handle.net/10091/0002000343].

Funding Statement

T.M., A.O., H.K., and T.S. received funding from the Japan Society for the Promotion of Science (JSPS) KAKENHI, Grant Number 22K09593 (https://kaken.nii.ac.jp/ja/grant/KAKENHI-PROJECT-22K09593/). JSPS did not play any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Kurman RJ, Ellenson LH, Ronnett BM. Blaustein’s pathology of the female Ge nital Tract. 6th ed. Berlin: Springer Science+Business Media; 2011. [Google Scholar]
  • 2.Nucci MR, Clement PB, Young RH. Lobular endocervical glandular hyperplasia, not otherwise specified: a clinicopathologic analysis of thirteen cases of a distinctive pseudoneoplastic lesion and comparison with fourteen cases of adenoma malignum. Am J Surg Pathol. 1999;23: 886–891. doi: 10.1097/00000478-199908000-00005 [DOI] [PubMed] [Google Scholar]
  • 3.Nara M, Hashi A, Murata SI, Kondo T, Yuminamochi T, Nakazawa K, et al. Lobular endocervical glandular hyperplasia as a presumed precursor of cervical adenocarcinoma independent of human papillomavirus infection. Gynecol Oncol. 2007;106: 289–298. doi: 10.1016/j.ygyno.2007.03.044 [DOI] [PubMed] [Google Scholar]
  • 4.Takatsu A, Miyamoto T, Fuseya C, Suzuki A, Kashima H, Horiuchi A, et al. Clonality analysis suggests that STK11 gene mutations are involved in progression of lobular endocervical glandular hyperplasia (LEGH) to minimal deviation adenocarcinoma (MDA). Virchows Arch. 2013;462: 645–651. doi: 10.1007/s00428-013-1417-1 [DOI] [PubMed] [Google Scholar]
  • 5.Takeuchi K, Tsujino T, Sugimoto M, Yoshida S, Kitazawa S. Endocervical adenocarcinoma associated with lobular endocervical glandular hyperplasia showing rapid reaccumulation of hydrometra. Int J Gynecol Cancer. 2008;18: 1285–1288. doi: 10.1111/j.1525-1438.2007.01174.x [DOI] [PubMed] [Google Scholar]
  • 6.Tsuboyama T, Yamamoto K, Nakai G, Yamada T, Fujiwara S, Terai Y, et al. A case of gastric-type adenocarcinoma of the uterine cervix associated with lobular endocervical glandular hyperplasia: radiologic-pathologic correlation. Abdom Imaging. 2015;40: 459–465. doi: 10.1007/s00261-014-0323-6 [DOI] [PubMed] [Google Scholar]
  • 7.Nishio S, Tsuda H, Fujiyoshi N, Ota SI, Ushijima K, Sasajima Y, et al. Clinicopathological significance of cervical adenocarcinoma associated with lobular endocervical glandular hyperplasia. Pathol Res Pract. 2009;205: 331–337. doi: 10.1016/j.prp.2008.12.002 [DOI] [PubMed] [Google Scholar]
  • 8.Ohya A, Asaka S, Fujinaga Y, Kadoya M. Uterine cervical adenocarcinoma associated with lobular endocervical glandular hyperplasia: radiologic-pathologic correlation. J Obstet Gynaecol Res. 2018;44: 312–322. doi: 10.1111/jog.13528 [DOI] [PubMed] [Google Scholar]
  • 9.Takatsu A, Shiozawa T, Miyamoto T, Kurosawa K, Kashima H, Yamada T, et al. Preoperative differential diagnosis of minimal deviation adenocarcinoma and lobular endocervical glandular hyperplasia of the uterine cervix: a multicenter study of clinicopathology and magnetic resonance imaging findings. Int J Gynecol Cancer. 2011;21: 1287–1296. doi: 10.1097/IGC.0b013e31821f746c [DOI] [PubMed] [Google Scholar]
  • 10.Ohya A, Kobara H, Miyamoto T, Komatsu M, Shiozawa T, Fujinaga Y. Usefulness of the “cosmos pattern” for differentiating between cervical gastric-type mucin-positive lesions and other benign cervical cystic lesions in magnetic resonance images. J Obstet Gynaecol Res. 2021;47: 745–756. doi: 10.1111/jog.14602 [DOI] [PubMed] [Google Scholar]
  • 11.Yamanoi K, Ishii K, Tsukamoto M, Asaka S, Nakayama J. Gastric gland mucin-specific O-glycan expression decreases as tumor cells progress from lobular endocervical gland hyperplasia to cervical mucinous carcinoma, gastric type. Virchows Arch. 2018;473: 305–311. doi: 10.1007/s00428-018-2381-6 [DOI] [PubMed] [Google Scholar]
  • 12.Ishii K, Katsuyama T, Ota H, Watanabe T, Matsuyama I, Tsuchiya S, et al. Cytologic and cytochemical features of adenoma malignum of the uterine cervix. Cancer Cytopathol. 1999;87: 245–253. doi: [DOI] [PubMed] [Google Scholar]
  • 13.Ishii K, Kumagai T, Tozuka M, Ota H, Katsuyama T, Kurihara M, et al. A new diagnostic method for adenoma malignum and related lesions: latex agglutination test with a new monoclonal antibody, HIK1083. Clin Chim Acta. 2001;312: 231–233. doi: 10.1016/s0009-8981(01)00611-8 [DOI] [PubMed] [Google Scholar]
  • 14.Yasaka K, Akai H, Kunimatsu A, Kiryu S, Abe O. Deep learning with convolutional neural network in radiology. Jpn J Radiol. 2018;36: 257–272. doi: 10.1007/s11604-018-0726-3 [DOI] [PubMed] [Google Scholar]
  • 15.Chollet F. Xception: deep learning with depthwise separable convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017; Honolulu, HI, USA. Piscataway, NJ: IEEE; 2017. pp. 1800–1807. doi: 10.1109/CVPR.2017.195 [DOI]
  • 16.Omori M, Kondo T, Tagaya H, Watanabe Y, Fukasawa H, Kawai M, et al. Utility of imaging modalities for predicting carcinogenesis in lobular endocervical glandular hyperplasia. PLOS ONE. 2019;14: e0221088. doi: 10.1371/journal.pone.0221088 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Kazunori Nagasaka

4 Jul 2024

PONE-D-24-02686
Problems of magnetic resonance diagnosis for gastric-type mucin-positive cervical lesions of the uterus and its solutions using artificial intelligence
PLOS ONE

Dear Dr. Miyamoto,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Aug 18 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Kazunori Nagasaka

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please provide a complete Data Availability Statement in the submission form, ensuring you include all necessary access information or a reason for why you are unable to make your data freely accessible. If your research concerns only data provided within your submission, please write "All data are in the manuscript and/or supporting information files" as your Data Availability Statement.

Additional Editor Comments:

Dear Authors,

Thank you so much for submitting your manuscript to PLOS ONE.

It is a very intriguing study.

Please consider the reviewer's comments and revise the manuscript accordingly.

If you have any inquiries, please do not hesitate to contact us.

We look forward to receiving your revised manuscript.

Sincerely,

PLOS ONE

Kazunori Nagasaka

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Comments to the Authors

This study used MRI images to differentiate gastric mucin-positive (GMPL) from gastric mucin-negative (GMNL) to validate diagnostic performance and concordance rates by three sets of physicians and a CNN.

The methodology is valid, but there are significant parts that would benefit from substantial revision.

Overall

・In academic or formal writing, especially in scientific papers, "physician" is often preferred over "doctor" when referring to medical doctors.

・The values for metrics such as AUC and ICC are inconsistent; it would be better to standardize them to three decimal places.

・It is confusing because 'validation' and 'test' are used interchangeably; it would be better to use 'test' consistently.

・Independent evaluation of six readers and additional external testing is recommended.

Abstract

・The purpose and conclusion are not aligned. The solution is not clear.

・(axial T2-weighted image [T2WI], axial T1-weighted image [T1WI], and sagittal T2WI)→Please use the same order of notation as in the text.

・I think it is common to express diagnostic ability using AUC.

・Since the same model and the same test data are used, it is natural that the CNN has high reproducibility.

Introduction

・P.5 Line 56, Some reports have described adenocarcinomas in association with LEGH [5–8], does it mean adenocarcinoma other than GAS?

・P.6 Line 1-71, It is redundant and should be concise.

・P.7 Line 87-90, In addition, since artificial intelligence always makes the same diagnosis once it learns, it can be a promising solution when the reproducibility of doctors’ diagnoses is low.→As for CNN, doesn't it make sense to evaluate reproducibility on different test sets?

M&M

MR images

・As Figure 1, patient selection, diagnostic rationale, and assignment to training and testing should be presented in a clear manner.

・P.8 Line 104, ...other than LEGH, LEGH, LEGH with→Isn't there a need for a LEGH in the middle?

・P.8 Line 110, First appearance of BCL is to be spelled out.

・Lesion’s classification should also be tabulated.

・P.10 Line 131, minimum imaging conditions should be described, such as slice thickness, whether 2D or 3D, and the cross-section of imaging (e.g., orthogonal to the cervix).

Diagnostic accuracy and reproducibility by the readers

・Why did you choose to use consensus by pairs? It would be more objective to do each independently and give data for the six people separately.

・What is the definition of diagnostic confidence?

・Is it correct in understanding that they evaluated the same test data with trimming that you put into the CNN?

・Did they evaluate the images without checking for image features that distinguish GMPL from GMNL?

Diagnostic accuracy evaluation by pCNNs

・P.11 Line 139, please list the years of experience of the readers.

・P.11 Line 155, Xception requires a citation

・P.12, Line167, if you are using 3 combined images, I don't think rotation makes sense

Fig1.

Figure 1 is not well explained, but did you use the 3 images as a combined image and each sequence was always trained and tested in the same position and in different colors?

Fig 2.

①②③④ means layers?

I'm not sure, so why don't you change it so we can see which is the same training set, e.g. T①, and leave the current ① out?

Statistical analysis

・P.14 Line 202, Since kappa value comes out of nowhere, it should be explained in the statistics section if you want to bring it out.

Results

・Please describe the characteristics of the patient group (age, etc.) in the first section.

・First, second, ... in Tables 1, 2 and 3 should be unified.

Fig 3.

(A) Comparison of receiver operating characteristic (ROC) curves at the time of the second independent procedure was entered in Xception and at the time of the first interpretation experiment of pair A with the lowest area under the curve (AUC) among the readers.→What made this one so big out of so many?

The former AUC was 0.8535, which was higher than the latter AUC (0.7197).→Please specify what former and latter refer to.

Comparison of ROC curves at the time of the first independent procedure of Xception and at the time of the second interpretation experiment of pair A with the highest AUC value. →What made this one so big out of so many?

The former AUC was 0.8409, which was lower than the latter AUC (0.8144). →Please specify what former and latter refer to.

Discussion

・P.22, Line 8, How are the diagnostic methods different?

・Please explain why one AUC for an experienced radiologist is lower than for a non-experienced radiologist.

・The value of ICC among experienced radiologists is of interest and should have been evaluated independently for each of the six readers.

P.25, Line 7-8. If the same training data are used, a highly reproducible diagnosis can be established regardless of facilities. →Although a variety of models are included in the training, reproducibility with models not included in the training has not been tested, so this mention cannot be made.

P.25, Line 8-9. Current problems in the MR diagnosis of GMPLs and GMNLs may be solved by pCNNs.→Since no significant difference was found, this reference is also questionable.

Reviewer #2: The focus of your research is excellent. As noted, LEGH and other benign cystic lesions must be distinguished.

Since this is 20 years of data, there are considerable differences in reading due to the imaging accuracy of MRI equipment and other factors.

In addition, it would be helpful to consider at what point the reading differed for cases in which there was a discrepancy in diagnosis in both the Human and CNN groups, to improve the agreement rate.

Although it is unclear how widespread the use of convolutional neural networks will become in the future, the significant difference in the positive diagnosis rate compared to human reading showed that it is a useful tool.

CNN is still not very popular in Japan, but if it becomes popular, it will be applied to preoperative diagnosis of gynecological tumors as well as LEGH, and this is a very interesting paper.

Materials and Methods:

{clinical suspicion of LEGH or GAS (based on ultrasonographic findings and symptoms) }

Please describe in detail what the ultrasound findings and symptoms are specifically.

MR images:

{The types and magnetic field strength of the MR unit used for imaging and the parameters of each sequence varied because the study was retrospective and MR images were acquired over a long period. }

Since the observation period is as long as 20 years and the MRI equipment must have changed many times during that time, I think that the bias by equipment is quite large.

At the very least, please consider how many times the MRI equipment has changed, as well as the details of that equipment and the statistics by equipment.

Diagnostic accuracy and reproducibility by the readers:

{divided into the training (132 patients) and validation (40 patients) test groups}

Please describe the reasons for any differences in the number of patients between groups.

Discussion:

{The second limitation is that it was not possible to analyse what MR features Xception recognised and diagnosed.}

In both Human and CNN groups, please describe the basis for reading LEGH from the MRI images.

Conclusion:

{In conclusion, our study revealed MR diagnostic heterogeneity among the readers in distinguishing between GMPLs and GMNLs.}

I think it is impossible to discuss the concordance rate without considering at what point the differences in reading were observed in the cases with discrepancies in diagnosis in both the Human and CNN groups.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: Comments to the Author.docx

pone.0315862.s001.docx (17.9KB, docx)
PLoS One. 2024 Dec 30;19(12):e0315862. doi: 10.1371/journal.pone.0315862.r002

Author response to Decision Letter 0


22 Sep 2024

Response to the reviewers’ comments

We deeply appreciate the reviewers’ educational suggestions. According to the suggestions, we have revised our manuscript as follows. The revised sentences are displayed as red letters in the revised manuscript.

Reviewer #1: Comments to the Authors

This study used MRI images to differentiate gastric mucin-positive (GMPL) from gastric mucin-negative (GMNL) to validate diagnostic performance and concordance rates by three sets of physicians and a CNN.

The methodology is valid, but there are significant parts that would benefit from substantial revision.

Overall

・In academic or formal writing, especially in scientific papers, "physician" is often preferred over "doctor" when referring to medical doctors.

Response: Thank you for pointing this out. We have revised the term “doctor(s)” to “physician(s)” as per your suggestion. (P6L73, P7L83, L86, Table 3 L250, P24L306, L307, P26L337, L338, L340, P27L362 in the revised manuscript)

“vary from doctor to doctor” was revised to “vary among physicians” (P25L319 in the revised manuscript)

・The values for metrics such as AUC and ICC are inconsistent; it would be better to standardize them to three decimal places.

Response: We completely agree with your comment. We have unified the AUC and ICC values to three decimal places. (P4L42, P17L239, L240, L242, Table 2, Table 3, P20L259, L264, Table 4, P21L277, P22L279, L282, L284 in the revised manuscript)

・It is confusing because 'validation' and 'test' are used interchangeably; it would be better to use 'test' consistently.

Response: To avoid confusion among readers, the term “validation” is now used consistently in both the text and the figure legends.

・“validation test” to “validation” (P12L166, P13L171 in the revised manuscript)

・“test” to “validation” (P15L203, L205, P16L218 in the revised manuscript)

・Independent evaluation of six readers and additional external testing is recommended.

Response: As per your suggestion, we conducted additional experiments. The results demonstrated a diagnostic accuracy of up to 80% and an AUC of up to 0.859. The diagnostic accuracy of the readers was slightly lower than that of the CNN, and the AUC of the readers was slightly higher than that of the CNN; however, the 95% confidence intervals overlapped and the differences were not significant. The results of the additional experiments are as follows.

Reader   Diagnostic accuracy   AUC (95% CI)
A        0.750                 0.859 (0.746–0.972)
B        0.700                 0.797 (0.658–0.936)
C        0.675                 0.705 (0.540–0.869)
D        0.800                 0.816 (0.678–0.953)
E        0.775                 0.793 (0.646–0.940)
F        0.700                 0.729 (0.566–0.891)

Overall, both the diagnostic accuracy and the AUC improved slightly when the reading experiment was conducted individually rather than in pairs. However, the diagnostic accuracy of each individual was slightly lower than that of the CNN, and the individual AUCs did not differ significantly from the CNN AUC. The same holds for the pairs.
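For reference, an AUC such as those listed above is equivalent to the Mann–Whitney statistic: the probability that a randomly chosen GMPL case receives a higher score than a randomly chosen GMNL case, with ties counted as one half. A minimal sketch (the reader scores below are hypothetical, not data from the study):

```python
import numpy as np

def auc_mann_whitney(scores_pos, scores_neg):
    """AUC as the probability that a randomly chosen positive (GMPL)
    case is scored higher than a randomly chosen negative (GMNL) case,
    counting ties as 1/2."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return wins / (pos.size * neg.size)

# Hypothetical confidence scores for three GMPL and three GMNL cases
auc = auc_mann_whitney([0.9, 0.8, 0.6], [0.1, 0.4, 0.6])
```

This pairwise formulation gives the same value as the area under the empirical ROC curve computed from the ranked scores.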

Abstract

・The purpose and conclusion are not aligned. The solution is not clear.

Response: Based on your valuable comment, we have aligned our objectives and conclusions. The first sentence of the conclusion has been changed as follows: Variation in the inter-reader or intra-reader accuracy in MRI diagnosis limits differentiation between GMPL and GMNL. (P4L43-44 in the revised manuscript)

・(axial T2-weighted image [T2WI], axial T1-weighted image [T1WI], and sagittal T2WI)→Please use the same order of notation as in the text.

Response: As per your comment, we have ensured the same order of notation as in the text. (P3L31-32 in the revised manuscript)

・I think it is common to express diagnostic ability using AUC.

Response: We completely agree with your comment. Therefore, the first sentence of the conclusion has been changed as follows: Variation in the inter-reader or intra-reader accuracy in MRI diagnosis limits differentiation between GMPL and GMNL. (P4L43-44 in the revised manuscript)

・Since the same model and the same test data are used, it is natural that the CNN has high reproducibility.

Response: With a CNN, using the same model and the same test data does not guarantee reproducibility, because differently trained instances of the same model may produce different test results. Many papers do not confirm the reproducibility of CNN results. Thus, we have demonstrated that the CNN is highly reproducible even when it is trained on the same data in different ways.
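The reproducibility measure used in the study, ICC (2,1) (two-way random effects, absolute agreement, single measurement, after Shrout and Fleiss), can be computed as in the minimal sketch below; the rating matrix here is hypothetical, not the study's data:

```python
import numpy as np

def icc_2_1(Y):
    """Shrout-Fleiss ICC(2,1): two-way random effects, absolute
    agreement, single measurement. Y is an (n targets x k raters)
    array of scores."""
    Y = np.asarray(Y, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    row_means = Y.mean(axis=1)   # per-target (case) means
    col_means = Y.mean(axis=0)   # per-rater means
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)  # between targets
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)  # between raters
    sse = np.sum((Y - row_means[:, None] - col_means[None, :] + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))                       # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Because ICC (2,1) measures absolute agreement, a constant offset between two otherwise identical sets of ratings already lowers the coefficient below 1.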

Introduction

・P.5 Line 56, Some reports have described adenocarcinomas in association with LEGH [5–8], does it mean adenocarcinoma other than GAS?

Response: We greatly appreciate your point of view. We have deleted that sentence. (P5L53 in the revised manuscript)

・P.6 Line 1-71, It is redundant and should be concise.

Response: Thank you for your suggestion.

“Various benign lesions with cystic components occur in the uterine cervix. Common benign lesions include Nabothian cysts, tunnel clusters, lobular endocervical glandular hyperplasia (LEGH), endometriosis, and cervical polyps [1]. LEGH is a hyperplastic benign lesion first proposed by Nucci et al. [2], but it may be a precursor lesion for gastric-type mucinous carcinomas (GAS) with a very poor prognosis [3,4]. Some reports have described adenocarcinomas in association with LEGH [5–8]. The most recent study demonstrated that malignant change of LEGH occurs at a frequency of 1.4% [9], but frequent and long-term follow-up or surgical treatment is selected once LEGH is diagnosed. In contrast, other benign cystic lesions of the uterine cervix do not require follow-up as frequently as LEGH. Therefore, clinically, LEGH and other benign cystic lesions must be distinguished.

LEGH has been reported to exhibit magnetic resonance imaging (MRI) findings called ‘cosmos pattern’, while Nabothian cysts show coarse cysts pattern [10]. However, some Nabothian cysts exhibit MR findings similar to those of LEGH [11]. The decisive difference from other benign lesions is that LEGH and GAS exhibit a distinctive pyloric gland metaplasia and secrete gastric-type mucin, which has O-linked oligosaccharides with a terminal α1,4-linked N-acetylglucosamine (αGlcNAc) residue [12]. Because gastric-type mucin is a neutral mucin, LEGH and GAS exhibit a ‘two-colour pattern’ on Pap smears [13]. Further, αGlcNAc expression exhibits positive immunohistochemical findings for monoclonal antibody HIK1083. αGlcNAc has been detected in cervical mucus by latex agglutination assay using monoclonal antibody HIK1083 [14].” was revised to

“Common benign lesions in the uterine cervix include Nabothian cysts, tunnel clusters, lobular endocervical glandular hyperplasia (LEGH), endometriosis, and cervical polyps [1]. LEGH is a benign lesion first proposed by Nucci et al. [2], but it may be a precursor lesion for gastric-type mucinous carcinomas (GAS) [3–8]. Therefore, frequent and long-term follow-up or surgical treatment is selected once LEGH is diagnosed. In contrast, other benign cystic lesions of the uterine cervix do not require follow-up as frequently as LEGH. Therefore, clinically, LEGH and other benign cystic lesions must be distinguished.

LEGH shows magnetic resonance imaging (MRI) findings called ‘cosmos pattern’, while Nabothian cysts show coarse cysts pattern [9]. However, some Nabothian cysts exhibit MR findings similar to those of LEGH [10]. The decisive difference from other benign lesions is that LEGH and GAS secrete gastric-type mucin, which has O-linked oligosaccharides with a terminal α1,4-linked N-acetylglucosamine (αGlcNAc) residue [11]. Because gastric-type mucin is a neutral mucin, LEGH and GAS exhibit a ‘two-color pattern’ on Pap smears [12]. Further, αGlcNAc has been detected in cervical mucus by latex agglutination assay using monoclonal antibody HIK1083 [13].” (P5L50-P6L65 in the revised manuscript)

・P.7 Line 87-90, In addition, since artificial intelligence always makes the same diagnosis once it learns, it can be a promising solution when the reproducibility of doctors’ diagnoses is low.→As for CNN, doesn't it make sense to evaluate reproducibility on different test sets?

Response: Thank you for your comments. The physicians used the same validation data to determine reproducibility, so the CNN had to be evaluated under the same conditions to be comparable. In addition, because GMPLs are relatively rare diseases, it was not possible to prepare a separate validation dataset.

M&M

MR images

・As Figure 1, patient selection, diagnostic rationale, and assignment to training and testing should be presented in a clear manner.

Response: We completely agree with your comment. We have added a new Figure 1 and its caption to make patient selection, diagnostic rationale, and assignment to training and testing easier to understand. Therefore, Figure numbers were rearranged. “The process of patient selection, and grouping is summarized in Fig 1.” was added. (P7L95-96 in the revised manuscript)

・P.8 Line 104, ...other than LEGH, LEGH, LEGH with→Isn't there a need for a LEGH in the middle?

Response: Yes, there is. For clarity, “(BCL)” was added after “other than LEGH”. (P8L101 in the revised manuscript)

・P.8 Line 110, First appearance of BCL is to be spelled out.

Response: As noted above, “BCL” is already spelled out at P8L101.

・Lesion’s classification should also be tabulated.

Response: As per your suggestion, we have summarized the pathological features of the lesion in new Table 1. Therefore, we rearranged Table numbers.

“The classification of histopathologic lesions is summarized in Table 1.” was added at P9L113-114 in the revised manuscript.

・P.10 Line 131, minimum imaging conditions should be described, such as slice thickness, whether 2D or 3D, and the cross-section of imaging (e.g., orthogonal to the cervix).

Response: As you have accurately pointed out, we have added basic information about the image as follows: All images have a slice thickness of 2 to 7.5 mm and are captured as two-dimensional images. (P12L161-162 in the revised manuscript)

Diagnostic accuracy and reproducibility by the readers

・Why did you choose to use consensus by pairs? It would be more objective to do each independently and give data for the six people separately.

Response: The CNN performs validation using four differently trained folds. Therefore, we considered that the human readers should also be examined as more than one group for a fair comparison, so we examined them in pairs. Additional independent evaluations were conducted for each of the six readers.

・What is the definition of diagnostic confidence?

Response: The confidence level of the diagnosis was stated as a percentage based on each physician's experience. We have appended this statement as follows: The confidence level of the diagnosis was stated as a percentage based on each physician’s experience. (P13L173-174 in the revised manuscript)

・Is it correct in understanding that they evaluated the same test data with trimming that you put into the CNN?

Response: Yes, it is.

・Did they evaluate the images without checking for image features that distinguish GMPL from GMNL?

Response: Yes, they did. They did not have information on image features useful for distinguishing GMPL from GMNL when they evaluated the images.

Diagnostic accuracy evaluation by pCNNs

・P.11 Line 139, please list the years of experience of the readers.

Response: As per your comment, we have added the years of experience of each reader. (P12L169-P13L170 in the revised manuscript)

・P.11 Line 155, Xception requires a citation

Response: As per your comment, we have added the paper of Xception to the references (ref# 15).

・P.12, Line167, if you are using 3 combined images, I don't think rotation makes sense

Response: When the three images are superimposed, the rotation angle is limited to a random value within 10°. Rotation is not unnatural because the cervix is tilted differently in each patient. This method is common and has been used in previous reports (La Grace Saint-Esteven A, et al. Comput Biol Med. 2022;142: 105215).
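A minimal sketch of the rotation augmentation described above, assuming the combined image is an array with the in-plane axes first; the helper name and interpolation settings are illustrative, not taken from the paper:

```python
import numpy as np
from scipy.ndimage import rotate

def random_rotation(img, rng, max_deg=10.0):
    """Rotate the (stacked) image by a random angle within +/-max_deg,
    keeping the original array shape (reshape=False)."""
    angle = rng.uniform(-max_deg, max_deg)
    return rotate(img, angle, axes=(0, 1), reshape=False,
                  order=1, mode="nearest")
```

With `reshape=False`, the rotated image keeps the input dimensions, so the same crop can be fed to the network after every augmentation.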

Fig1.

Figure 1 is not well explained, but did you use the 3 images as a combined image and each sequence was always trained and tested in the same position and in different colors?

Response: CNN recognizes three images by inputting them into three color channels. We input a T2-weighted sagittal section in the red channel, a T2-weighted transverse section in the green channel, and a T1-weighted transverse section in the blue channel. We ensured that the images were always entered in the same way during training and during validation.
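The channel encoding described in this response can be sketched as follows; the helper function and its intensity normalisation are illustrative assumptions, not the authors' actual preprocessing code:

```python
import numpy as np

def stack_sequences(t2_sag, t2_ax, t1_ax):
    """Combine three grayscale MR slices (already cropped/resized to a
    common shape) into one RGB input: T2-weighted sagittal -> red,
    T2-weighted axial -> green, T1-weighted axial -> blue."""
    def norm(x):
        x = x.astype(np.float32)
        return (x - x.min()) / (x.max() - x.min() + 1e-8)  # scale to [0, 1]
    return np.stack([norm(t2_sag), norm(t2_ax), norm(t1_ax)], axis=-1)
```

Keeping each sequence in a fixed channel during both training and validation, as the authors state, is what lets the network associate a colour channel with a specific MR contrast.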

Fig 2.

①②③④ means layers?

I'm not sure, so why don't you change it so we can see which is the same training set, e.g. T①, and leave the current ① out?

Response: ①②③④ are the training data randomly divided into four parts; in each fold, the three coloured parts are used for training and the uncoloured part is used for provisional validation. Rotating which part is held out yields four differently trained models.
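The four-part split described above can be sketched as follows, with hypothetical case indices standing in for the 132 training-group patients:

```python
import numpy as np

rng = np.random.default_rng(0)
indices = rng.permutation(132)      # hypothetical IDs of the 132 training cases
parts = np.array_split(indices, 4)  # the four random parts (1)-(4)

folds = []
for k in range(4):  # hold out one part per fold, train on the other three
    val = parts[k]
    train = np.concatenate([parts[j] for j in range(4) if j != k])
    folds.append((train, val))
```

Each of the four folds trains on 99 cases and holds out 33 for provisional validation, which matches the four differently trained models used in the study's four-fold cross-validation.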

Statistical analysis

・P.14 Line 202, Since kappa value comes out of nowhere, it should be explained in the statistics section if you want to bring it out.

Response: Thank you for pointing this out. This is an error. We have corrected “kappa value” to “ICC (1) or (2)”. (P16L225 in the revised manuscript)

Results

・Please describe the characteristics of the patient group (age, etc.) in the first section.

Response: We did not include the patient characteristics in the Results section because they were not an outcome of this study. The age of the patients is described in the “Patient population” section as follows: Patients ranged in age from 26 to 82 years, with an average age of 48.7 years. (P7L94-95 in the revised manuscript)

・First, second, ... in Tables 1, 2 and 3 should be unified.

Response: As per your comments, we have changed Table 3 to unify them.

Fig 3.

・Comparison of receiver operating characteristic (ROC) curves at the time of the second independent procedure was entered in Xception and at the time of the first interpretation experiment of pair A with the lowest area under the curve (AUC) among the readers.→What made this one so big out of so many?

Response: We selected and enlarged this graph because we compared the highest AUC in Xception to the lowest AUC for the physician pair.

・The former AUC was 0.8535, which was higher than the latter AUC (0.7197).→Please specify what former and latter refer to.

Response: Thank you for pointing this out. It was difficult to understand, so we have corrected the relevant part as follows: The AUC at the time of the second independent procedure of Xception was 0.854, which was higher than the AUC at the time of the first interpretation experiment of pair A (0.720). (P21L277-P22L279 in the revised manuscript)

・Comparison of ROC curves at the time of the first independent procedure of Xception and at the time of the second interpretation experiment of pair A with the highest AUC value. →What made this one so big out of so many?

Response: We selected and enlarged this graph because we compared the lowest AUC in Xception to the highest AUC for the physician pair.

・The former AUC was 0.8409, which was lower than the latter AUC (0.8144). →Please specify what former and latter refer to.

Response: Thank you for pointing this out. To make the sentence easier to understand, we have corrected the relevant part as follows: The AUC at the time of the first independent procedure of Xception was 0.841, which was higher than the AUC at the time of the second interpretation experiment of pair A (0.814). (P22L282-284 in the revised manuscript)

Discussion

・P.22, Line 8, How are the diagnostic methods different?

Response: We have added the following sentences. “Although radiologists are experts in diagnostic imaging, gynecologists make a comprehensive diagnosis based on the clinical symptoms and other clinical information. Since clinical information could not be gathered in this experiment, the accuracy of the gynecologists was considered lower than that of the radiologists.” (P24L310-314 in the revised manuscript)

・Please explain why one AUC for an experienced radiologist is l

Attachment

Submitted filename: point_by_point_response ver.3-2.docx

pone.0315862.s002.docx (38.5KB, docx)

Decision Letter 1

Kazunori Nagasaka

4 Oct 2024

PONE-D-24-02686R1

Problems of magnetic resonance diagnosis for gastric-type mucin-positive cervical lesions of the uterus and its solutions using artificial intelligence

PLOS ONE

Dear Dr. Miyamoto,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Nov 18 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Kazunori Nagasaka

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

Dear Authors,

Thank you so much for your submission to Plos One.

As the reviewers pointed out in their comments, please revise the manuscript accordingly.

Their primary focus would be to increase your impact on your claim in the manuscript.

We look forward to your revised manuscript soon.

Sincerely,

Kazunori Nagasaka


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Thank you for the revision according to the comments. I understand that there is no significant difference in the results whether in pairs or individually.

Please consider the following minor revisions.

Abstract

P.3, line 36, (sagittal T2WI and axial T2WI/T1WI) may be better.

Overall: Since "magnetic resonance imaging (MRI) findings" is mentioned in Introduction p.5, line 58, I think it is better to use "MRI findings", "MRI diagnosis", and "MR unit" instead of "MR findings", "MR diagnosis", and "MRI unit" thereafter. However, "MR images" can remain as is.

Table 1 should include other BCLs with histologically unconfirmed diagnosis, as well as information on the number of patients and their ages. The title should be changed accordingly.

MR images

P.12, L.168, For the CNN, it is assumed that one image per patient was used, so it would be better to add "per patient" as in: "One image showing the maximum cross-section of the lesion was selected from each of these three sequences per patient, and the three images were used for analysis."

Diagnostic accuracy and reproducibility by the readers

P.12, L168, GMNL. (Fig 1)→Move the period to the end.

Diagnostic accuracy evaluation by pCNNs

P.14, L189, the validation group. (Fig 1)→Move the period to the end.

P.14, L200, I think "(Fig 2)" can be omitted from "Using the training data, all three images were input into Xception."

Fig2. "Input" is used for converting Fig into color, and it may cause confusion with "Input" for CNN in the main text. Therefore, I think it would be better to change the text in the figure to something like "Conversion."

Fig3. The "test" in the figure and "the validation data" in the figure description may cause confusion with cross-validation, so I think it would be better to unify them as "independent validation".

The first letters in the abbreviation explanations of Fig 3 should be consistently either capitalized or in lowercase.

Diagnostic accuracy evaluation by pCNNs

P.15, L203-205

“Finally, images of the validation data were input into the model trained in each fold, and the true diagnostic accuracy and AUC for the validation data were determined by averaging the diagnostic probabilities by the four models (Fig 3).”

I think using "independent validation" would make the above sentences clearer.

Table 3. The notation in the "Second" column of Pair B is incorrect.

In the Figure legends of Fig 4, please standardize the decimal places for the p-values to three digits, same as the others. “However, there was no statistically significant difference between the two (p = 0.0939).” “However, there was no statistically significant difference between the two (p = 0.6813).”

Table 4. It would be better to include the meaning of "Tentative diagnostic accuracy" in the main text.

Please change the lower limit value of the Tentative diagnostic accuracy for "second" to three decimal places.

Discussion

P.25. L.327-331. Since "readers" appears consecutively, I think the second "readers" can be omitted.

Reviewer #2: Since the CNN is almost as accurate as the reader, it has not shown usefulness in reading at this time. It has little IMPACT as a new FINDING. In addition, the CNN may have the potential to improve diagnostic reproducibility, but in actual clinical practice, when a treatment plan is decided based on MRI readings, “diagnostic reproducibility” is not as useful as “diagnostic accuracy.” I don't feel it is useful.

What would you estimate the number of LEGH clinical experiences of the readers to be for a CNN that has trained 132 LEGH cases? For example, I think it is much more experienced than the group of young radiologists. What do you consider the difference?

As Reviewer 1 also pointed out, I don't understand the significance of examining the pairs. I think a comparison between individuals would be better; you have also disclosed the results between individuals in Response, and this should be noted and discussed.

The group of readers did not study the LEGH cases in advance and made decisions based on their limited clinical experience, which may be a disadvantage compared to the CNN, which is supposed to be studied in advance.

Isn't it a leap to conclude that a CNN's finding of a significant difference in diagnostic reproducibility can provide the same level of diagnosis everywhere as a reading physician, when no significant difference in diagnostic accuracy has been found?

Without analysis of what MRI features CNNs recognized and diagnosed, I don't think this will lead to the widespread use and improvement of CNNs in the future.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Dec 30;19(12):e0315862. doi: 10.1371/journal.pone.0315862.r004

Author response to Decision Letter 1


18 Nov 2024

Response to the reviewers’ comments

We deeply appreciate the reviewers’ educational suggestions. According to the suggestions, we have revised our manuscript as follows. The revised sentences are displayed as red letters in the revised manuscript.

Reviewer #1:

Thank you for the revision according to the comments. I understand that there is no significant difference in the results whether in pairs or individually.

Please consider the following minor revisions.

Abstract

P.3, line 36, (sagittal T2WI and axial T2WI/T1WI) may be better.

Response: Thank you for pointing this out. We have corrected it, as you suggested.

Overall: Since "magnetic resonance imaging (MRI) findings" is mentioned in Introduction p.5, line 58, I think it is better to use "MRI findings", "MRI diagnosis", and "MR unit" instead of "MR findings", "MR diagnosis", and "MRI unit" thereafter. However, "MR images" can remain as is.

Response: Thank you for pointing this out. We have corrected it, as you suggested.

Table 1 should include other BCLs with histologically unconfirmed diagnosis, as well as information on the number of patients and their ages. The title should be changed accordingly.

Response: Thank you for your feedback. We have revised Table 1 and its title, as you suggested.

MR images

P.12, L.168, For the CNN, it is assumed that one image per patient was used, so it would be better to add "per patient" as in: "One image showing the maximum cross-section of the lesion was selected from each of these three sequences per patient, and the three images were used for analysis."

Response: Thank you for pointing this out. We have added “per patient”, as you suggested.

Diagnostic accuracy and reproducibility by the readers

P.12, L168, GMNL. (Fig 1)→Move the period to the end.

Response: We apologize for our careless mistake. We have corrected it, as you suggested.

Diagnostic accuracy evaluation by pCNNs

P.14, L189, the validation group. (Fig 1)→Move the period to the end.

Response: We apologize for our careless mistake. We have corrected it, as you suggested.

P.14, L200, I think "(Fig 2)" can be omitted from "Using the training data, all three images were input into Xception."

Response: Thank you for pointing this out. We have corrected it, as you suggested.

Fig2. "Input" is used for converting Fig into color, and it may cause confusion with "Input" for CNN in the main text. Therefore, I think it would be better to change the text in the figure to something like "Conversion."

Response: Thank you for your suggestion. As you said, Fig 2 shows that black-and-white images are input and consequently converted into color images. We have revised Fig 2, as you suggested. In addition, we have revised the Figure Legends of Fig 2 as follows.

“A sagittal T2-weighted image was input into the red channel, an axial T2-weighted image was input into the green channel, and a T1-weighted image was input into the blue channel. These converted images were processed to include only the lesion and cervical stroma of the uterus.” (P16 L214-217 in the revised manuscript)
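For readers unfamiliar with this conversion, the channel-stacking idea in the revised legend can be sketched in a few lines. This is a minimal illustration only, assuming the three sequences are available as equally sized 2-D grids of pixel intensities; the function name, the per-sequence normalisation, and the list-based representation are our assumptions, not details taken from the manuscript:

```python
def stack_sequences_to_rgb(sag_t2, ax_t2, ax_t1):
    """Combine three equally sized grayscale MR slices (2-D lists of
    pixel intensities) into one 3-channel image.

    sag_t2 -> red, ax_t2 -> green, ax_t1 -> blue, mirroring the channel
    assignment described in the Fig 2 legend. Each sequence is normalised
    independently to [0, 1] so that no single channel dominates.
    """
    def normalise(img):
        flat = [v for row in img for v in row]
        lo, hi = min(flat), max(flat)
        scale = (hi - lo) or 1  # avoid division by zero for flat images
        return [[(v - lo) / scale for v in row] for row in img]

    red, green, blue = (normalise(img) for img in (sag_t2, ax_t2, ax_t1))
    # Result: an H x W grid of (r, g, b) tuples.
    return [
        [(red[i][j], green[i][j], blue[i][j]) for j in range(len(red[0]))]
        for i in range(len(red))
    ]
```

In practice the same composition is usually done with array libraries or image toolkits; the sketch only makes explicit that each anatomical sequence occupies one colour channel of the CNN input.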

Fig3. The "test" in the figure and "the validation data" in the figure description may cause confusion with cross-validation, so I think it would be better to unify them as "independent validation".

The first letters in the abbreviation explanations of Fig 3 should be consistently either capitalized or in lowercase.

Response: According to your suggestion, we have revised Fig 3, and its Figure legends.

Diagnostic accuracy evaluation by pCNNs

P.15, L203-205

“Finally, images of the validation data were input into the model trained in each fold, and the true diagnostic accuracy and AUC for the validation data were determined by averaging the diagnostic probabilities by the four models (Fig 3).”

I think using "independent validation" would make the above sentences clearer.

Response: Thank you for your feedback. Your point is certainly valid. According to your suggestion, we have added “independent” at P15L207 and L208 in the revised manuscript.

Table 3. The notation in the "Second" column of Pair B is incorrect.

Response: We apologize for our careless mistake. We have corrected it.

In the Figure legends of Fig 4, please standardize the decimal places for the p-values to three digits, same as the others. “However, there was no statistically significant difference between the two (p = 0.0939).” “However, there was no statistically significant difference between the two (p = 0.6813).”

Response: According to your suggestion, we have revised 0.0939 to 0.094, and 0.6813 to 0.681. (L285 and L290 in the revised manuscript)

Table 4. It would be better to include the meaning of "Tentative diagnostic accuracy" in the main text.

Response: Thank you for your suggestion. We have corrected the sentence on P14L204-206 in the revised manuscript as follows: The accuracy of each verification by the four-fold cross-validation was calculated and averaged to obtain the tentative diagnostic accuracy, which indicates the degree of training.
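The averaging step described here, and the later averaging of the four fold models' predicted probabilities on the independent validation data, can be sketched as follows. This is an illustrative sketch only: the function name, the example probabilities, and the 0.5 decision threshold are our assumptions and are not stated in this exchange; only the four-fold averaging logic follows the text:

```python
def ensemble_probability(fold_probabilities):
    """Average per-fold predicted probabilities of GMPL for one case.

    Each cross-validation fold yields one trained model; the final
    diagnosis for a validation case is derived from the mean of the
    models' predicted probabilities (hypothetical threshold of 0.5).
    """
    mean_p = sum(fold_probabilities) / len(fold_probabilities)
    return mean_p, ("GMPL" if mean_p >= 0.5 else "GMNL")
```

For example, four fold models outputting probabilities 0.9, 0.7, 0.4, and 0.6 for a case would average to 0.65, yielding a GMPL call even though one fold disagreed; the averaging is what makes the ensemble's output deterministic and reproducible.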

Please change the lower limit value of the Tentative diagnostic accuracy for "second" to three decimal places.

Response: Thank you for pointing this out. We have corrected the relevant area.

Discussion

P.25. L.327-331. Since "readers" appears consecutively, I think the second "readers" can be omitted.

Response: According to your suggestion, “Additionally, the precision, specificity, and F-measure were all higher than those of the readers” was revised to “Additionally, the precision, specificity, and F-measure were all higher.” (P26L333 in the revised manuscript)

Reviewer #2:

Since the CNN is almost as accurate as the reader, it has not shown usefulness in reading at this time. It has little IMPACT as a new FINDING. In addition, the CNN may have the potential to improve diagnostic reproducibility, but in actual clinical practice, when a treatment plan is decided based on MRI readings, “diagnostic reproducibility” is not as useful as “diagnostic accuracy.” I don't feel it is useful.

Response: We understand your point of view. However, this study aims to explore the issues in the diagnosis of LEGH, demonstrating that current human MRI diagnoses of LEGH are inconsistent and that the interpretation results can vary significantly depending on the reader. Diagnosis by the CNN was only used as a comparison to verify the variability of human diagnosis, and it is not the purpose of this study to verify the superiority of CNN. We agree that the diagnostic accuracy is important. However, we believe that the reproducibility is equally important, as low reproducibility includes the possibility that accuracy may be reduced if a different doctor diagnoses the patient.

What would you estimate the number of LEGH clinical experiences of the readers to be for a CNN that has trained 132 LEGH cases? For example, I think it is much more experienced than the group of young radiologists. What do you consider the difference?

Response: There is a misunderstanding in your interpretation. A CNN cannot provide an adequate diagnosis unless it is trained on around 100 cases. On the other hand, the 'young radiologist' in this study, though less experienced with LEGH, is a specialist in diagnostic imaging. The fact that the pairs of young radiologists achieved the highest diagnostic accuracy cannot be ignored. Since there was no significant difference in accuracy between the CNN and the radiologists, it can be concluded that even though the young radiologists did not have prior training, they were not inferior to the CNN. 

As Reviewer 1 also pointed out, I don't understand the significance of examining the pairs. I think a comparison between individuals would be better; you have also disclosed the results between individuals in Response, and this should be noted and discussed.

Response: In response to Reviewer 1’s comments, we disclosed the individual physicians’ reading results, but there was no difference compared to when the readings were done in pairs. Therefore, we did not include them in the manuscript. Additionally, the CNN provides its final results by averaging the outcomes of the four folds. Since the CNN answers using an average of four folds, we thought the physicians should also read in groups.

The group of readers did not study the LEGH cases in advance and made decisions based on their limited clinical experience, which may be a disadvantage compared to the CNN, which is supposed to be studied in advance.

Response: We agree with your comment. If the readers had been trained beforehand, they might have had a better accuracy. However, as mentioned above, diagnosis by the CNN was only used as a comparison to verify the variability of human diagnosis, and it is not the purpose of this study to verify the superiority of CNN.

Isn't it a leap to conclude that a CNN's finding of a significant difference in diagnostic reproducibility can provide the same level of diagnosis everywhere as a reading physician, when no significant difference in diagnostic accuracy has been found?

Response: What we would like to say is that when a certain finding is present, humans may vary in their diagnosis, but CNNs do not, making them useful as a criterion for diagnosis. It goes without saying that high accuracy and reproducibility are ideal.

Without analysis of what MRI features CNNs recognized and diagnosed, I don't think this will lead to the widespread use and improvement of CNNs in the future.

Response: Thank you for your important comment. We agree with your comment. We also believe it is important to clarify what features of MRI images the CNN captures and understands for its development. However, the specific details of how the CNN analyzes images are not clearly established, so it is even unclear whether it operates in the same way as human interpretation. Additionally, in this study, the CNN uses the average results of the four folds as the final outcome, and due to the low agreement rate among physicians across the three groups, the analysis and comparative evaluation are very complex and challenging. We think that exploring how the CNN recognizes MRI images needs to be addressed through a different approach and should be a focus for future studies.

Attachment

Submitted filename: point by point response R2.3-3.docx

pone.0315862.s003.docx (24.5KB, docx)

Decision Letter 2

Kazunori Nagasaka

3 Dec 2024

Problems of magnetic resonance diagnosis for gastric-type mucin-positive cervical lesions of the uterus and its solutions using artificial intelligence

PONE-D-24-02686R2

Dear Dr. Miyamoto,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager at Editorial Manager® and clicking the ‘Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Kazunori Nagasaka

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Dear Authors,

Thank you for submitting your manuscript to Plos One.

Our reviewers suggested that the manuscript is ready to be accepted for publication.

Congratulations on your work, and we look forward to receiving your future studies!

Sincerely,

Kazunori Nagasaka

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: (No Response)

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: (No Response)

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: (No Response)

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: (No Response)

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #2: Thank you for the revision according to the comments. I have understood that you used the CNN as a comparator to validate the variability of human diagnoses, and that it is not the purpose of this study to verify the superiority of the CNN.

I now understand why you considered pairs instead of individuals.

I have accepted your manuscript for publication.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

Acceptance letter

Kazunori Nagasaka

16 Dec 2024

PONE-D-24-02686R2

PLOS ONE

Dear Dr. Miyamoto,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission,

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Professor Kazunori Nagasaka

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: Comments to the Author.docx

    pone.0315862.s001.docx (17.9KB, docx)

    Data Availability Statement

    The data that support the findings of this study are available in Shinshu University Institutional Repository at [http://hdl.handle.net/10091/0002000343].

