Visual Turing test is not sufficient to evaluate the performance of medical generative models

Shoichiro Yamamoto; Akinori Higaki

doi:10.1186/s41747-023-00347-8

letter

2023 Jul 10;7:31. doi: 10.1186/s41747-023-00347-8

PMCID: PMC10329967 PMID: 37423911

To the Editor,

We read with great interest the article by Wang et al. [1], reporting that generative adversarial networks (GANs) could generate synthetic ground glass opacities (GGOs) in computed tomography. While we appreciate their ambitious research to advance clinical radiology, we feel that the performance evaluation of the GANs is insufficient for their aim.

In their study, the authors stated that model performance was evaluated by both subjective and objective approaches, namely the visual Turing test (VTT) and the distribution of radiomic features. We agree that the VTT is a suitable approach to assess the realism of synthesized medical images [2], but a low VTT score does not guarantee the diversity of the generated data; it tells us only that the images look real. As the authors acknowledged as a limitation in the “Discussion” section, the distributions of about 40% of the radiomic features (e.g., neighbouring gray tone difference matrix [NGTDM] coarseness) differed significantly between generated and original images. Therefore, we suspect that their generative model may be able to produce only biased images due to the so-called mode collapse phenomenon [3], in which a generator covers only part of the variation present in the training data. If this were the case, it would diminish the usefulness of the generated images for data augmentation in classification tasks.
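The gap between realism and diversity can be illustrated with a minimal sketch that compares the distribution of a single radiomic feature in real and generated images. The two-sample Kolmogorov-Smirnov statistic is chosen here only for illustration; it is not stated which test Wang et al. applied. The function name and the feature values are hypothetical and are not study data.

```python
from bisect import bisect_right


def ks_statistic(real, generated):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical cumulative distribution functions of the two samples."""
    real_sorted = sorted(real)
    gen_sorted = sorted(generated)
    max_gap = 0.0
    # Both empirical CDFs are step functions, so the gap only changes
    # at values that were actually observed in one of the samples.
    for x in real_sorted + gen_sorted:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_gen = bisect_right(gen_sorted, x) / len(gen_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_gen))
    return max_gap


# Hypothetical values of one radiomic feature (illustrative only).
real_feature = [0.10, 0.35, 0.50, 0.65, 0.90]
# A collapsed generator: every value is plausible on its own,
# but all of them cluster in a narrow part of the real range.
collapsed_feature = [0.48, 0.49, 0.50, 0.51, 0.52]

d_same = ks_statistic(real_feature, real_feature)
d_collapsed = ks_statistic(real_feature, collapsed_feature)
print(f"real vs real: {d_same:.2f}, real vs collapsed: {d_collapsed:.2f}")
```

Each collapsed value lies inside the real range and could pass a reader-based test individually, yet the statistic is clearly above zero because the sample as a whole does not reproduce the spread of the real data.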

It is true that there is no single universal metric to assess model performance and the quality of generated data; therefore, several indicators need to be combined, such as the inception score, Fréchet inception distance, and geometry score [4, 5]. In addition, image quality can be evaluated quantitatively with no-reference metrics such as the NIQE, PIQE, and BRISQUE scores, as Oyelade and colleagues have demonstrated for mammography images [6]. As a practical matter, the images presented in the article are so small and of such low resolution that readers cannot fully appreciate what kind of images the GAN model has produced.
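As one example of such an indicator, the inception score can be sketched in a few lines. In practice the class probabilities come from a pretrained Inception classifier applied to each generated image; in this sketch they are passed in directly, and the function name and probability vectors are hypothetical.

```python
import math


def inception_score(class_probs):
    """Exponential of the mean KL divergence between each image's predicted
    class distribution p(y|x) and the marginal distribution p(y)."""
    n_images = len(class_probs)
    n_classes = len(class_probs[0])
    # Marginal class distribution over the whole set of generated images.
    marginal = [sum(p[k] for p in class_probs) / n_images
                for k in range(n_classes)]
    total_kl = 0.0
    for p in class_probs:
        # Terms with p[k] == 0 contribute nothing to the KL divergence.
        total_kl += sum(p[k] * math.log(p[k] / marginal[k])
                        for k in range(n_classes) if p[k] > 0)
    return math.exp(total_kl / n_images)


# Hypothetical classifier outputs for two generated image sets (illustrative only).
diverse = [[1.0, 0.0], [0.0, 1.0]]    # confident predictions spread over both classes
collapsed = [[1.0, 0.0], [1.0, 0.0]]  # every image falls into the same class

print(inception_score(diverse), inception_score(collapsed))
```

The collapsed set receives the minimum possible score of 1 even though each of its predictions is confident, which is the kind of diversity failure that a VTT alone would not reveal.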

In summary, we believe that the authors need to provide more example images of the generated GGOs and evaluate their GAN in several other ways to ensure the quality of data synthesis.

Authors’ contributions

AH conceptualized and drafted the manuscript. SY reviewed and revised the manuscript. The authors read and approved the final manuscript.

Funding

The authors declare that they received no external funding concerning this article.

Availability of data and materials

Not applicable.

Declarations

Ethics approval and consent to participate

This article is based on previously conducted studies and does not contain any studies with human participants or animals performed by the authors.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Wang Z, Zhang Z, Feng Y, et al. Generation of synthetic ground glass nodules using generative adversarial networks (GANs). Eur Radiol Exp. 2022;6:59. doi: 10.1186/s41747-022-00311-y.
2. Higaki A, Kawada Y, Hiasa G, Yamada T, Okayama H. Using a visual Turing test to evaluate the realism of generative adversarial network (GAN)-based synthesized myocardial perfusion images. Cureus. 2022;14:e30646. doi: 10.7759/cureus.30646.
3. Bau D, Zhu J-Y, Wulff J, et al. Seeing what a GAN cannot generate. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE; 2019. p. 4501–4510. doi: 10.1109/ICCV.2019.00460.
4. Shmelkov K, Schmid C, Alahari K. How good is my GAN? In: Computer Vision – ECCV 2018. 2018:218–234. doi: 10.1007/978-3-030-01216-8.
5. Borji A. Pros and cons of GAN evaluation measures. Comput Vis Image Underst. 2019;179:41–65. doi: 10.1016/j.cviu.2018.10.009.
6. Oyelade ON, Ezugwu AE, Almutairi MS, Saha AK, Abualigah L, Chiroma H. A generative adversarial network for synthetization of regions of interest based on digital mammograms. Sci Rep. 2022;12:1–30. doi: 10.1038/s41598-022-09929-9.
