PLOS ONE. 2024 Mar 6;19(3):e0299265. doi: 10.1371/journal.pone.0299265

Vision transformer with masked autoencoders for referable diabetic retinopathy classification based on large-size retina image

Yaoming Yang 1, Zhili Cai 1, Shuxia Qiu 1,2, Peng Xu 1,2,*
Editor: Yawen Lu
PMCID: PMC10917269  PMID: 38446810

Abstract

Computer-aided diagnosis systems based on deep learning algorithms have shown potential for the rapid diagnosis of diabetic retinopathy (DR). Due to the superior performance of Transformers over convolutional neural networks (CNN) on natural images, we attempted to develop a new model to classify referable DR based on a limited number of large-size retinal images using the Transformer. Vision Transformer (ViT) with Masked Autoencoders (MAE) was applied in this study to improve the classification performance for referable DR. We collected over 100,000 publicly available fundus retinal images larger than 224×224, and then pre-trained ViT on these retinal images using MAE. The pre-trained ViT was applied to classify referable DR, and its performance was compared with that of ViT pre-trained on ImageNet. The improvement in classification performance from pre-training with over 100,000 retinal images using MAE is superior to that from pre-training with ImageNet. The accuracy, area under curve (AUC), highest sensitivity and highest specificity of the present model are 93.42%, 0.9853, 0.973 and 0.9539, respectively. This study shows that MAE can provide more flexibility in the input image size and substantially reduce the number of images required. Meanwhile, the pretraining dataset in this study is much smaller than ImageNet, and pre-trained weights from ImageNet are not required.

Introduction

According to the WHO report, there are more than 400 million people with diabetes in the world [1]. The number of people living with diabetes is projected to reach 552 million by 2035 [2]. DR caused by diabetes is one of the leading causes of blindness, although it is largely avoidable [3]. However, DR screening involves many features and is time-consuming for clinicians [4]. In addition, the diagnosis is strongly affected by the doctor's personal work experience, professional standards, psychological state, and other factors. Thus, computer-aided techniques have been proposed to improve the accuracy and efficiency of DR diagnosis [5].

Since the theory of deep learning was proposed in 2006 [6], it has developed alongside the enhancement of computational power [7]. Due to their excellent feature extraction and classification abilities, deep neural networks have been introduced into the medical field [8–11] and successfully applied in DR diagnosis [12–14]. Compared with state-of-the-art CNN architectures, ViT shows better performance on computer vision (CV) classification tasks [15–17] and has therefore been applied in various downstream CV tasks [18].

Recently, Kumar et al. [19] tested the classification performance of several major CNNs, Transformers and MLPs on the APTOS dataset. They found that Transformers perform better than CNNs and MLPs overall. However, Dosovitskiy et al. [17] pointed out that ViT requires training on a large number of images due to its lack of inductive bias. Touvron et al. [20] introduced a new token-based distillation strategy based on ViT, named DeiT (Data-efficient image Transformers), to reduce the data requirements of ViT. Matsoukas et al. [21] tested the classification performance of DeiT-S on APTOS and found that it only approaches that of ResNet50 when the pre-training data is limited.

He et al. [22] proposed MAE to overcome the issue that training Transformers requires a lot of data; they achieved 87.8% accuracy on ImageNet using MAE and ViT-Huge. The cost of acquisition and annotation of medical images is much higher than that of natural images. Moreover, medical images are generally larger than 224×224, the typical size of natural images. Although the self-attention mechanism in ViT can handle sequences of any length, the pre-trained position embeddings cannot be directly applied to images larger than 224×224. Furthermore, Shamshad et al. [18] argued that pre-trained weights from ImageNet are not optimal for medical images. Srinivasan et al. [23] further pre-trained their model on the EyePACS dataset in a self-supervised manner on top of the ImageNet weights. They observed that pre-training with retinal images could further enhance the model's classification performance and mitigate overfitting. This finding suggests that, to some extent, ImageNet pre-trained weights may not be the optimal choice for diabetic retinopathy classification.

The rest of this paper is organized as follows. The second part briefly describes the related work. The third part introduces the datasets and methods used in this study. The fourth part illustrates the experiment details and results. The fifth part contains the conclusion and prospect.

Related work

The Transformer was first proposed in 2017 and successfully applied to natural language processing (NLP) [24]. Radford et al. [25] then proposed GPT based on the Transformer in 2018, and Devlin et al. [26] presented BERT in 2019, achieving state-of-the-art results on 11 NLP tasks. With the success of the Transformer in NLP, ViT was proposed and applied to CV by Dosovitskiy et al. in 2020 [17].

Transformer

Yang et al. [27] used a sliding window to extract patches in order to avoid key lesion areas being divided across different patches. They employed a CNN to reduce the dimension of the patches, and fed the reduced patches to ViT. They also selected the patches with the largest weight as the effective area by accumulating the attention weights. Using the OIA-ODIR dataset with an image resolution of 224×224 as input, they achieved an accuracy of 84.1%. Jha et al. [28] attempted to classify multiple diseases, including DR, on OCT B-scan data. With images of resolution 256×256 as input, the accuracies of ViT and VGG-16 were 88% and 83%, respectively. They also proposed a hybrid structure of ViT and SVM, which also uses 256×256 images as input and raises the accuracy to 94%. In addition, combinations of ViT with CNN have been proposed and applied [29, 30]. Sadeghzadeh et al. [29] fused EfficientNet-B0 with a Transformer and achieved state-of-the-art classification results using 224×224 images as input on the EyePACS, APTOS, DDR, Messidor-1, and Messidor-2 datasets. Ma et al. [30] proposed a fusion network based on Transformer and CNN for DR grading, treating it as a joint ordinal regression and multi-classification problem. They used images with a slightly higher resolution of 384×384 and demonstrated superior performance on the DeepDR and IDRiD datasets. Adak et al. [31] integrated four Transformer models, ViT, BEiT (Bidirectional Encoder representation for image Transformer), CaiT (Class-Attention in Image Transformers) and DeiT, for DR detection. On the APTOS dataset with an image resolution of 256×256, they showed that integrating multiple Transformer models further enhances classification performance for DR.

MAE

Recently, Masked Image Modeling (MIM) in the self-supervised learning field has been further developed [32–34]. Based on the ImageNet dataset, He et al. [22] applied MAE to ViT-Huge and achieved an accuracy of 87.8%. Encouraged by this result, researchers began to explore the application of MAE to medical images. Zhou et al. [35] took ViT-Base as the backbone and used MAE in the pretraining phase, testing on the ChestX-ray14, BTCV and BraTS datasets, whose image resolutions are 224×224, 96×96×96, and 128×128×128, respectively. They reported that the AUC increased by 9.4% on classification tasks, and the average DSC improved from 77.4% to 78.9% on the tumor segmentation task. Cai et al. [36] generated a multi-modal and multi-dimensional dataset with 95,978 samples, named mmOptht-v1. They also designed a general architecture (UnionEye), which shows good performance on both 2D (resolution of 224×224) and 3D (resolution of 112×224×112) image-related tasks. In addition, they found that a masking ratio of 50% outperformed 75% in terms of AUC, Recall, Kappa, and other metrics.

Our contributions

In order to improve the performance of Transformers on large-size medical images, over 100,000 available large-size fundus retinal images were used to pre-train ViT with MAE in the present work; the model was then fine-tuned and tested on the labeled APTOS dataset. The structure of the proposed VMLRI is shown in Fig 1. It is divided into two parts: a pre-training part and a fine-tuning part. In the pre-training phase, input retinal images are uniformly split into non-overlapping image blocks (patches). Subsequently, a portion of these image blocks is masked to create mask tokens according to a predetermined masking ratio (e.g., 75% or 50%). The remaining visible image blocks are then encoded by the ViT. Upon completion of encoding, mask tokens are introduced into the encoded results, and together they are fed into a simple decoder that attempts to reconstruct the original image. After pre-training, the decoder component is discarded and the ViT can be used for the classification of referable DR. To the best of our knowledge, this is the first study that employs MAE to pre-train on large-size fundus retinal images for referable DR (rDR) classification.

Fig 1. The architecture of ViT with MAE based on large-size retina image (VMLRI).


The objective of this study is to reduce the number of images required for pre-training ViT in the DR domain through MAE and to achieve superior DR classification performance compared to ViT pre-trained with over one million natural images. Thus, the proposed model pre-trained on different sizes of retinal images is also compared with ViT pre-trained on ImageNet. Since the inter-class gap of fundus retinal images is smaller than that of natural images, a masking ratio of 50% is also tested and compared with a 75% masking ratio.

Materials and methods

Datasets

The APTOS [37] public dataset contains 5,590 color fundus retinal images, of which only 3,662 have corresponding DR grading labels. The kaggle-EyePACS public dataset [38] contains 88,702 color fundus retinal images, of which 35,126 have corresponding DR grading labels; DR in this dataset was graded in the same way as in APTOS. Messidor-2 [39] is an extension of the original Messidor dataset. It contains 1,748 color fundus retinal images, but the fundus images do not appear in pairs and the official DR grade for each image is not provided. OIA-DDR [40] is a subset of the OIA series of datasets and contains 13,673 fundus images with DR classification and segmentation labels. All the data in kaggle-EyePACS, Messidor-2, and OIA-DDR, together with the unlabeled data in APTOS, were used as pretraining data, while the labeled data in APTOS were used for fine-tuning and testing. The total number of images used for pretraining is 106,051, and that for fine-tuning and testing is 3,662.

Meanwhile, in order to study the performance gap of MAE on datasets of different scales, three datasets were constructed. Dataset1 contains 17,349 fundus retinal images, consisting of the unlabeled data in APTOS, Messidor-2 and OIA-DDR. Dataset2 contains 106,051 fundus retinal images, composed of Dataset1 and the EyePACS dataset. Dataset3 contains 3,662 fundus retinal images, formed from the labeled data in APTOS. The information for these three datasets is shown in Table 1.

Table 1. The dataset used in this study.

Dataset Name Composition Number of images
DataSet1 APTOS without labels, Messidor-2, OIA-DDR 17,349
DataSet2 APTOS without labels, Messidor-2, OIA-DDR, EyePACS 106,051
DataSet3 APTOS with labels 3,662
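The three splits in Table 1 could be assembled along the following lines; this is a minimal sketch in which the directory names and the `.png` extension are hypothetical placeholders, not the authors' actual paths or file formats.

```python
from pathlib import Path

def list_images(*dirs):
    """Collect all .png files under the given directories (sorted for determinism)."""
    return [p for d in dirs for p in sorted(Path(d).glob("*.png"))]

# Hypothetical directory names, one per source dataset
dataset1 = list_images("aptos_unlabeled", "messidor2", "oia_ddr")  # 17,349 images
dataset2 = dataset1 + list_images("eyepacs")                       # 106,051 images
dataset3 = list_images("aptos_labeled")                            # 3,662 images
print(len(dataset1), len(dataset2), len(dataset3))
```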

Vision transformer and masked autoencoders

A central part of the Transformer architecture is the self-attention mechanism, which was first proposed for CV in 2014 [41] and later applied in NLP [42]. Self-attention, also known as scaled dot-product attention, can be expressed as:

Attention(Q, K, V) = softmax(QK^T / √d_K) V (1)

where Q, K and V represent the query, key and value, respectively; these are all obtained by different mappings of the input. The original Transformer consists of an encoder and a decoder, each composed of several Transformer blocks. A Transformer block is mainly composed of residual connections, a multi-head attention mechanism, a fully connected layer and layer normalization. The input to ViT consists of 16×16×3 patches obtained by partitioning the original 224×224×3 image.
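Eq. (1) can be sketched directly in NumPy; this is a minimal illustration with toy tensor shapes, not the actual ViT implementation used in the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Eq. (1): softmax(Q K^T / sqrt(d_K)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Example: batch of 2 images, 4 patch tokens, embedding dimension 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4, 8))
out = scaled_dot_product_attention(Q, Q, Q)
print(out.shape)  # (2, 4, 8)
```

Because each output row is a convex combination of the value rows, the output stays within the range of V.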

The structure of ViT is shown in Fig 2, where ViT-Large and ViT-Base are employed. The input image is first uniformly divided into non-overlapping patches, which are mapped to vectors of a predetermined dimension through a simple linear transformation. These vectors are then added to learnable position vectors of the same dimension. The resulting vectors are fed into Transformer blocks for encoding, and the final classification result for the image is obtained through a simple MLP. The details of the ViT-Large and ViT-Base models are listed in Table 2. However, when the input image size grows, the number of patches increases and the pre-trained weights lack position embeddings for the additional patches. Thus, bicubic interpolation was used here to obtain the missing position embeddings.
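The bicubic interpolation of position embeddings can be sketched as follows; this is an illustrative PyTorch snippet assuming ViT-style embeddings on a square grid with a leading class token, and the function name is ours, not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_size=224, new_size=320, patch=16):
    # pos_embed: (1, 1 + old_grid**2, dim); token 0 is the class token
    dim = pos_embed.shape[-1]
    old_grid, new_grid = old_size // patch, new_size // patch
    cls_tok, grid_tok = pos_embed[:, :1], pos_embed[:, 1:]
    # lay the patch embeddings back on their 2-D grid and interpolate bicubically
    grid_tok = grid_tok.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid_tok = F.interpolate(grid_tok, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    grid_tok = grid_tok.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_tok, grid_tok], dim=1)

pos = torch.randn(1, 1 + 14 * 14, 768)   # 224/16 = 14 -> 197 tokens
new_pos = resize_pos_embed(pos)          # 320/16 = 20 -> 401 tokens
print(new_pos.shape)                     # torch.Size([1, 401, 768])
```

The class token is passed through unchanged; only the grid of patch embeddings is resampled.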

Fig 2. The architecture of ViT.


Table 2. The details of ViT-Large and ViT-Base.

Model Layer Hidden size D MLP size
ViT-Base 12 768 3072
ViT-Large 24 1024 4096

MAE is a simple and scalable self-supervised learning method. It was used to pre-train the ViT-Large and ViT-Base models. In the pre-training stage, the input images were divided into non-overlapping patches of the same size. Then, a high proportion of image patches was randomly masked, and the remaining unmasked patches were encoded by the encoder. A simple decoder was used to restore the original images based on the output of the encoder and the mask tokens. When pretraining was complete, the decoder was discarded and the rest of the network could be used for a specific task via transfer learning.
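The random-masking step above can be sketched as follows; this mirrors the per-sample random shuffle of the original MAE recipe under toy shapes, and is an assumption-laden illustration rather than the authors' training script.

```python
import numpy as np

def random_masking(tokens, mask_ratio=0.75, rng=None):
    """Split patch tokens (batch, num_patches, dim) into visible tokens and a mask."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, L, d = tokens.shape
    keep = int(L * (1 - mask_ratio))
    noise = rng.random((n, L))
    ids_shuffle = np.argsort(noise, axis=1)   # random permutation per sample
    ids_keep = ids_shuffle[:, :keep]          # indices of the visible patches
    visible = np.take_along_axis(tokens, ids_keep[:, :, None], axis=1)
    mask = np.ones((n, L))                    # 1 = masked, 0 = visible
    np.put_along_axis(mask, ids_keep, 0, axis=1)
    return visible, mask

tokens = np.random.randn(2, 196, 768)         # 14x14 patches, ViT-Base dim
visible, mask = random_masking(tokens, 0.75)
print(visible.shape, mask.sum(axis=1))        # only 49 of 196 patches stay visible
```

Only the visible tokens enter the encoder, which is what keeps pre-training affordable at high masking ratios.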

Experiment and results

The detail of experiments

To make the comparison of experimental results as fair and objective as possible, the hyperparameters were fixed for all experiments. The main hyperparameters are summarized in Table 3. Dataset3 was used to fine-tune and test the pre-trained model; the test goal is the classification performance of the model for rDR. Four evaluation metrics, accuracy, AUC, sensitivity and specificity, were recorded in each experiment to evaluate the final performance of the different pre-trained models. First, 10% of the images in Dataset3 were randomly selected as the test set. The rest of the data was then randomly split 80%/20% into a training set and a validation set. All results in this study were obtained from a single evaluation of the model on the test set. After cropping, all images were resized to the same size as the pretraining images. Random horizontal flip and random rotation in the range (−180°, +180°) were applied for image augmentation. The optimizer was stochastic gradient descent with momentum; the initial learning rate was 1.5625×10^−3 and the momentum was 0.9. The learning rate was multiplied by 0.8 at epochs 10, 25 and 50. The WeightedRandomSampler in PyTorch was used for class balancing. The binary cross-entropy loss function was used and the maximum number of epochs was 150.
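The fine-tuning setup described above can be sketched in PyTorch; `model` and the per-sample `sample_weights` are placeholders we introduce for illustration, and the augmentation transforms are noted in comments rather than instantiated.

```python
import torch
from torch import nn, optim
from torch.utils.data import WeightedRandomSampler

# Placeholder for the ViT classifier; keeps the sketch self-contained.
model = nn.Linear(768, 1)

# SGD with momentum 0.9 and initial learning rate 1.5625e-3.
optimizer = optim.SGD(model.parameters(), lr=1.5625e-3, momentum=0.9)

# Multiply the learning rate by 0.8 at epochs 10, 25 and 50.
scheduler = optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10, 25, 50], gamma=0.8)

# Binary cross-entropy (on logits) for the referable/non-referable decision.
criterion = nn.BCEWithLogitsLoss()

# Class balancing: one weight per training sample, inversely proportional to
# its class frequency (the values here are illustrative, not the real ones).
sample_weights = torch.tensor([0.3, 0.7, 0.3, 0.7])
sampler = WeightedRandomSampler(sample_weights, num_samples=4, replacement=True)

# (Random horizontal flip and +/-180 degree rotation would be applied via
# torchvision transforms in the data pipeline.)

for epoch in range(11):       # after epoch 10 the first milestone fires
    scheduler.step()
print(optimizer.param_groups[0]["lr"])  # ~1.25e-3 after the epoch-10 milestone
```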

Table 3. Main hyperparameters configured in the experimental setup of this study.

Pre-training (224×224 and 320×320) Pre-training (448×448) Finetune
Optimizer Adam AdamW SGD
Optimizer momentum β1 = 0.9, β2 = 0.999 β1 = 0.9, β2 = 0.95 0.9
Learning rate 10^−4 10^−4 1.5625×10^−3
Learning rate schedule - - milestones
Augmentation random horizontal flip and random rotation random horizontal flip and random rotation random horizontal flip and random rotation

In order to compare with the results of pretraining on fundus images, ViT was also fine-tuned on Dataset3 with randomly initialized weights and with ImageNet pre-trained weights, using the same hyperparameters. Bicubic interpolation was likewise carried out to obtain the missing position embeddings when the ImageNet pre-trained weights were loaded and then fine-tuned on images larger than 224×224. In the pretraining stage, the Adam optimizer was used for the 224×224 and 320×320 images with the learning rate fixed at 10^−4. The AdamW optimizer (β1 = 0.9, β2 = 0.95) was used to pretrain on 448×448 images, also with the learning rate fixed at 10^−4. The image processing operations were the same as on Dataset3.

For ViT-Large, a maximum of 1,000 pretraining epochs was used on Dataset1 and Dataset2, and the pre-trained model was saved every 100 epochs. Masking ratios of 75% and 50% were used on Dataset1 and Dataset2, with all images resized to 224×224. When the images in Dataset1 and Dataset2 were resized to 320×320, the masking ratio was set to 75%. The pre-training information for ViT-Large is summarized in Table 4.

Table 4. The pretraining information for ViT-Large.

Dataset Image size Epoch Masking Ratio
DataSet1 224×224 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 0.5, 0.75
320×320 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 0.75
DataSet2 224×224 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 0.5, 0.75
320×320 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 0.75

For ViT-Base, a masking ratio of 75% was used on Dataset1 and Dataset2. A maximum of 800 pretraining epochs was used for 224×224 images, while 1,000 epochs were used for images larger than 224×224 (320×320 and 448×448). The pre-trained model was saved every 100 epochs. The pre-training information for ViT-Base is summarized in Table 5.

Table 5. The pre-training information for ViT-Base.

Dataset Image size Epoch Masking Ratio
DataSet1 224×224 100, 200, 300, 400, 500, 600, 700, 800 0.75
320×320 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 0.75
448×448 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 0.75
DataSet2 224×224 100, 200, 300, 400, 500, 600, 700, 800 0.75
320×320 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 0.75
448×448 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 0.75

Results and analysis

The results of the ViT-Large experiments with a masking ratio of 0.75 are shown in Table 6. The model pre-trained on ImageNet clearly shows better results than the one with randomly initialized weights. The accuracy of the present model based on 320×320 images of Dataset2 exceeds that pre-trained on ImageNet with bicubic interpolation, and the ViT-Large pre-trained on 320×320 images of Dataset2 outperforms the ImageNet-pretrained model in AUC and sensitivity. Based on the 17,349 images of Dataset1, the accuracy increased from 89.59% to 90.41%, the AUC from 0.9624 to 0.9719, and the sensitivity from 0.9324 to 0.9527 when 224×224 images were replaced by 320×320 images for pretraining. Similar results were found on Dataset2, where the accuracy, AUC, sensitivity and specificity were enhanced up to 92.60%, 0.9803, 0.973 and 0.9263, respectively. In this study, 0.9803 and 0.973 are the highest values achieved by ViT-Large in terms of AUC and sensitivity, respectively. However, increasing the image size from 224×224 to 320×320 does not significantly improve the performance of the model with randomly initialized weights. The same holds for models pre-trained on ImageNet, which may be ascribed to the larger interpolation error.

Table 6. The results of ViT-Large with masking ratio of 0.75.

Pre-training Dataset Image size Masking Ratio Accuracy AUC Sensitivity Specificity
- 224×224 - 84.38% 0.9038 0.8581 0.8341
320×320 - 82.74% 0.9126 0.8581 0.8065
ImageNet1k 224×224 0.75 93.15% 0.9801 0.8919 0.9585
320×320 0.75 91.23% 0.9745 0.8581 0.9493
DataSet1 224×224 0.75 89.59% 0.9624 0.9324 0.9124
320×320 0.75 90.41% 0.9719 0.9527 0.9078
DataSet2 224×224 0.75 90.68% 0.9738 0.9121 0.9171
320×320 0.75 92.60% 0.9803 0.9730 0.9263

Moreover, pre-training the model with different types of images (natural vs. retinal) leads to variations in classification behavior. As shown in Table 6, ViT-Large pre-trained on natural images exhibits lower sensitivity than that pre-trained on retinal images, and its sensitivity is much lower than its specificity. In contrast, ViT-Large pre-trained on retinal images yields the opposite result: its sensitivity is higher than its specificity.

The influence of masking ratio on the performance of ViT-Large is shown in Table 7. MAE with a 50% masking ratio slightly improves the performance of the model compared to a 75% masking ratio, similar to the conclusion reported by Cai et al. [36]. However, it should be pointed out that the enhancement from increasing the image size (224×224 to 320×320) is larger than that from reducing the masking ratio.

Table 7. The results of ViT-Large with different masking ratio.

Pre-training Dataset Image size Masking Ratio Accuracy AUC Sensitivity Specificity
DataSet1 224×224 0.75 89.59% 0.9624 0.9324 0.9124
224×224 0.5 90.96% 0.9705 0.9257 0.9217
DataSet2 224×224 0.75 90.68% 0.9738 0.9121 0.9171
224×224 0.5 91.78% 0.9768 0.9324 0.9217

The results of the ViT-Base experiments are shown in Table 8, where the performance of the different models on images larger than 224×224 (320×320 and 448×448) was examined and compared. The performance of ViT-Base is found to be similar to that of ViT-Large.

Table 8. The results of ViT-Base with masking ratio of 0.75.

Pre-training Dataset Image size Masking Ratio Accuracy AUC Sensitivity Specificity
- 224×224 - 84.38% 0.9086 0.9122 0.7972
320×320 - 85.21% 0.9145 0.8784 0.8341
448×448 - 83.29% 0.9202 0.9122 0.7788
ImageNet1k 224×224 0.75 90.96% 0.978 0.9257 0.8986
320×320 0.75 92.05% 0.9806 0.9257 0.9171
448×448 0.75 90.41% 0.9718 0.9460 0.8756
DataSet1 224×224 0.75 89.59% 0.9669 0.9054 0.8940
320×320 0.75 91.51% 0.9720 0.9324 0.9355
448×448 0.75 92.05% 0.9761 0.9595 0.9355
DataSet2 224×224 0.75 92.05% 0.9805 0.9460 0.9217
320×320 0.75 93.42% 0.9825 0.9662 0.9539
448×448 0.75 93.15% 0.9853 0.9662 0.9447

Compared with the model pre-trained on 224×224 images of Dataset1, the accuracy with 320×320 and 448×448 images increased by 1.92% and 2.46%, respectively. The model accuracy based on the 17,349 images at 448×448 is 92.05%, which exceeds all ViT-Base results using ImageNet pre-trained weights in this study. Using the more than 100,000 images of Dataset2 at 320×320 for pretraining, ViT-Base achieves the highest accuracy of 93.42% in this study. Further expanding the image size to 448×448 does not significantly improve the accuracy, although the AUC increases to 0.9853, exceeding the results obtained by all other models in this study. The sensitivity of ViT-Base pre-trained on both natural and retinal images is higher than the specificity, which differs from ViT-Large.

It can be found that ViT-Base using the ImageNet pre-trained weights improved when the input image size was increased from 224×224 to 320×320. This is because large-size images provide more details, which has a major impact on model performance. However, when the input image size was increased from 320×320 to 448×448, both accuracy and AUC decreased. This may be attributed to the interpolation of the pre-trained weights having a stronger effect on model performance than the image size. It should be pointed out that pretraining leads to a significant improvement in model performance; however, too many pretraining epochs lead to overfitting, which is consistent with the findings of El-Nouby et al. [33] and Zhou et al. [35].

Comparison to state-of-the-art methods

We compared the classification performance of VMLRI for rDR on the APTOS dataset with other state-of-the-art models, including CNNs and Bayesian Neural Networks (BNN). The comparative results are presented in Table 9. Zhang et al. [43] proposed a multi-model domain adaptation (MMDA) method, trained on the source domains DDR, IDRiD, Messidor, and Messidor-2 and tested on the target domain APTOS. Their method achieved high sensitivity but an accuracy of 90.6%, lower than that of VMLRI (93.42%). Zhang et al. [44] introduced the Source-Free Transfer Learning (SFTL) method, based on a similar domain-adaptation concept: the model was trained on EyePACS as the source domain and then tested on APTOS. Compared to MMDA, SFTL exhibited higher accuracy but slightly lower sensitivity. In this study, VMLRI outperforms SFTL in terms of accuracy, sensitivity, and specificity.

Table 9. Performance comparison with state-of-the-art diabetic retinopathy classification methods on the APTOS dataset.

Method Accuracy AUC Sensitivity Specificity
MMDA [43] 90.6% - 0.985 -
SFTL [44] 91.2% - 0.951 0.858
VGG16 [45] 92.91% - 0.94 -
InceptionV3 [45] 93.59% - 0.93 -
AmInceptionV3 [45] 94.46% - 0.9 -
BNN [46] - 0.961 - -
VMLRI 93.42% 0.9825 0.9662 0.9539

Vives-Boix et al. [45] tested the classification performance of VGG16 and InceptionV3 on the APTOS dataset. In terms of sensitivity, VMLRI surpasses both VGG16 and InceptionV3; in accuracy, VMLRI outperforms VGG16 but is slightly inferior to InceptionV3. Inspired by metaplasticity in neuroscience, they proposed AmInceptionV3 by improving InceptionV3, which achieved an accuracy of 94.46%; however, its sensitivity dropped to 0.9. Apart from CNNs, Jaskari et al. [46] attempted to classify rDR using a BNN, obtaining an AUC of 0.961 on the APTOS dataset. This is lower than VMLRI's AUC (0.9825), and it is noteworthy that they used retinal images with a higher resolution of 512×512. In conclusion, the proposed VMLRI shows high sensitivity and specificity with a competitive AUC of 0.9825, demonstrating balanced performance compared with state-of-the-art models on the APTOS dataset.

Discussion

If DR can be detected and treated at an early stage, further damage to vision caused by DR can be effectively prevented. Thus, a novel VMLRI has been proposed for the classification of rDR. The classification performance of the present model pre-trained with MAE on more than 100,000 large-size (320×320 and 448×448) fundus retinal images at a 75% masking ratio is better than that obtained with ImageNet weights. Although pretraining with a masking ratio of 50% provides a slight improvement in model performance, it consumes more computing resources. Furthermore, ViT-Large and ViT-Base yield similar accuracy and AUC, with ViT-Large showing an advantage in sensitivity and ViT-Base in specificity. Therefore, ViT-Base pre-trained with a masking ratio of 75% is recommended, and the masking ratio and network architecture can be adjusted according to the results and circumstances.

This study shows that MAE provides more flexibility in the input image size and substantially reduces the number of images required for pretraining for downstream tasks. However, the required computing resources increase rapidly with the input image size. For the self-attention mechanism in the Transformer, the computational complexity is quadratic in the input sequence length, so the pre-training time is significantly extended as image resolution increases. This issue can be mitigated by setting a higher masking ratio to reduce the sequence length. It should also be noted that the proposed method exhibits a degree of overfitting, as also reported by El-Nouby et al. [33] and Zhou et al. [35].

The present VMLRI approach makes it convenient to incorporate domain knowledge of downstream tasks and to modify the network architecture, rather than constraining the network structure in order to reuse pre-trained weights. To further enhance the pre-trained model's classification performance, the Transformer network structure will be modified in future work to incorporate relevant prior knowledge from the DR domain. In addition, introducing DR-related regression or detection tasks on top of the MAE image reconstruction task can also be considered.

Acknowledgments

We thank the reviewers for their insightful comments that helped improve our manuscript’s overall quality.

Data Availability

All data are available from:
EyePACS: https://www.kaggle.com/competitions/diabetic-retinopathy-detection/data
APTOS: https://www.kaggle.com/competitions/aptos2019-blindness-detection/data
Messidor-2: https://www.adcis.net/en/third-party/messidor2/
OIA: https://github.com/nkicsl/OIA
The code is available at: https://github.com/CNMaxYang/VMLRI.

Funding Statement

This work was supported by the National Natural Science Foundation of China through grant number 52376079. The URL for NSFC is https://www.nsfc.gov.cn/. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. World Health Organization. Global report on diabetes. Geneva, Switzerland: World Health Organization; 2016.
2. Guariguata L, Whiting DR, Hambleton I, Beagley J, Linnenkamp U, Shaw JE. Global estimates of diabetes prevalence for 2013 and projections for 2035. Diabetes Research and Clinical Practice. 2014;103(2):137–49. doi: 10.1016/j.diabres.2013.11.002
3. Kocur I, Resnikoff S. Visual impairment and blindness in Europe and their prevention. British Journal of Ophthalmology. 2002;86(7):716–22. doi: 10.1136/bjo.86.7.716
4. Early Treatment Diabetic Retinopathy Study Research Group. Grading diabetic retinopathy from stereoscopic color fundus photographs—an extension of the modified Airlie House classification: ETDRS report number 10. Ophthalmology. 1991;98(5):786–806.
5. Fujita H, Uchiyama Y, Nakagawa T, Fukuoka D, Hatanaka Y, Hara T, et al. Computer-aided diagnosis: The emerging of three CAD systems induced by Japanese health care needs. Computer Methods and Programs in Biomedicine. 2008;92(3):238–48. doi: 10.1016/j.cmpb.2008.04.003
6. Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science. 2006;313(5786):504–7. doi: 10.1126/science.1127647
7. Gu J, Wang Z, Kuen J, Ma L, Shahroudy A, Shuai B, et al. Recent advances in convolutional neural networks. Pattern Recognition. 2018;77:354–77.
8. Vasilakos AV, Tang Y, Yao Y. Neural networks for computer-aided diagnosis in medicine: a review. Neurocomputing. 2016;216:700–8.
9. Shamshirband S, Fathi M, Dehzangi A, Chronopoulos AT, Alinejad-Rokny H. A review on deep learning approaches in healthcare systems: Taxonomies, challenges, and open issues. Journal of Biomedical Informatics. 2021;113:103627. doi: 10.1016/j.jbi.2020.103627
10. Asiri N, Hussain M, Al Adel F, Alzaidi N. Deep learning based computer-aided diagnosis systems for diabetic retinopathy: A survey. Artificial Intelligence in Medicine. 2019;99:101701. doi: 10.1016/j.artmed.2019.07.009
11. Trokielewicz M, Czajka A, Maciejewicz P. Post-mortem iris recognition with deep-learning-based image segmentation. Image and Vision Computing. 2020;94:103866.
12. Nielsen KB, Lautrup ML, Andersen JK, Savarimuthu TR, Grauslund J. Deep learning–based algorithms in screening of diabetic retinopathy: A systematic review of diagnostic performance. Ophthalmology Retina. 2019;3(4):294–304. doi: 10.1016/j.oret.2018.10.014
13. Sarki R, Ahmed K, Wang H, Zhang Y. Automatic detection of diabetic eye disease through deep learning using fundus images: a survey. IEEE Access. 2020;8:151133–49.
14. Voets M, Møllersen K, Bongo LA. Reproduction study using public data of: Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. PLoS ONE. 2019;14(6):e0217541. doi: 10.1371/journal.pone.0217541
15. Parmar N, Vaswani A, Uszkoreit J, Kaiser L, Shazeer N, Ku A, et al. Image Transformer. In: Proceedings of the 35th International Conference on Machine Learning. PMLR; 2018. p. 4055–64.
16. Ho J, Kalchbrenner N, Weissenborn D, Salimans T. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180. 2019.
17. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. 2020.
18. Shamshad F, Khan S, Zamir SW, Khan MH, Hayat M, Khan FS, et al. Transformers in medical imaging: A survey. arXiv preprint arXiv:2201.09873. 2022.
19. Kumar NS, Karthikeyan BR. Diabetic Retinopathy Detection using CNN, Transformer and MLP based Architectures. In: 2021 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS). IEEE; 2021. p. 1–2.
  • 19.Kumar NS, Karthikeyan BR, editors. Diabetic Retinopathy Detection using CNN, Transformer and MLP based Architectures. 2021 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS); 2021: IEEE. p. 1–2. [Google Scholar]
  • 20.Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H, editors. Td>raining data-efficient image transformers & distillation through attention. International Conference on Machine Learning; 2021: PMLR. p. 10347–57. [Google Scholar]
  • 21.Matsoukas C, Haslum JF, Söderberg M, Smith K. Is it time to replace cnns with transformers for medical images? arXiv preprint arXiv:210809038. 2021. [Google Scholar]
  • 22.He K, Chen X, Xie S, Li Y, Dollár P, Girshick R, editors. Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022. p. 16000–9.
  • 23.Srinivasan V, Strodthoff N, Ma J, Binder A, Müller K-R, Samek W. To pretrain or not? A systematic analysis of the benefits of pretraining in diabetic retinopathy. Plos one. 2022;17(10):e0274291. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017;30. [Google Scholar]
  • 25.Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training. 2018. [Google Scholar]
  • 26.Devlin J, Chang M-W, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805. 2018. [Google Scholar]
  • 27.Yang H, Chen J, Xu M<, editors. Fundus disease image classification based on improved transformer. 2021 International Conference on Neuromorphic Computing (ICNC); 2021: IEEE. p. 207–14.
  • 28.Jha S, Luhach V, Poddar R, editors. Retinal Malady Classification Using AI: A novel ViT-SVM combination architecture. 2022 6th International Conference on Computing Methodologies and Communication (ICCMC); 2022: IEEE. p. 1659–64. [Google Scholar]
  • 29.Sadeghzadeh A, Junayed MS, Aydin T, Islam MB, editors. Hybrid CNN+ Transformer for Diabetic Retinopathy Recognition and Grading. 2023 Innovations in Intelligent Systems and Applications Conference (ASYU); 2023: IEEE. p. 1–6. [Google Scholar]
  • 30.Ma L, Xu Q, Hong H, Shi Y, Zhu Y, Wang L. Joint ordinal regression and multiclass classification for diabetic retinopathy grading with transformers and CNNs fusion network. Applied Intelligence. 2023:1–14. [Google Scholar]
  • 31.Adak C, Karkera T, Chattopadhyay S, Saqib M. Detecting Severity of Diabetic Retinopathy from Fundus Images using Ensembled Transformers. arXiv preprint arXiv:230100973. 2023. [Google Scholar]
  • 32.Bao H, Dong L, Piao S, Wei F. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:210608254. 2021. [Google Scholar]
  • 33.El-Nouby A, Izacard G, Touvron H, Laptev I, Jegou H, Grave E. Are large-scale datasets necessary for self-supervised pre-training? arXiv preprint arXiv:211210740. 2021. [Google Scholar]
  • 34.Xie Z, Zhang Z, Cao Y, Lin Y, Bao J, Yao Z, et al., editors. Simmim: A simple framework for masked image modeling. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022. p. 9653–63.
  • 35.Zhou L, Liu H, Bae J, He J, Samaras D, Prasanna P. Self pre-training with masked autoencoders for medical image analysis. arXiv preprint arXiv:220305573. 2022. [Google Scholar]
  • 36.Cai Z, Lin L, He H, Tang X, editors. Uni4Eye: Unified 2D and 3D Self-supervised Pre-training via Masked Image Modeling Transformer for Ophthalmic Image Classification. Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VIII; 2022: Springer. p. 88–98. [Google Scholar]
  • 37.APTOS 2019 Blindness Detection [Internet]. Kaggle. 2019. Available from: https://kaggle.com/competitions/aptos2019-blindness-detection. [Google Scholar]
  • 38.Cuadros J, Bresnick G. EyePACS: an adaptable telemedicine system for diabetic retinopathy screening. Journal of Diabetes Science and Technology. 2009;3(3):509–16. doi: 10.1177/193229680900300315 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Decencière E, Zhang X, Cazuguel G, Lay B, Cochener B, Trone C, et al. Feedback on a publicly distributed image database: the Messidor database. Image Analysis & Stereology. 2014;33(3):231–4. [Google Scholar]
  • 40.Li T, Gao Y, Wang K, Guo S, Liu H, Kang H. Diagnostic assessment of deep learning algorithms for diabetic retinopathy screening. Information Sciences. 2019;501:511–22. [Google Scholar]
  • 41.Mnih V, Heess N, Graves A. Recurrent models of visual attention. Advances in Neural Information Processing Systems. 2014;27. [Google Scholar]
  • 42.Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:14090473. 2014. [Google Scholar]
  • 43.Zhang G, Sun B, Zhang Z, Pan J, Yang W, Liu Y. Multi-model domain adaptation for diabetic retinopathy classification. Frontiers in Physiology. 2022;13:918929. doi: 10.3389/fphys.2022.918929 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Zhang C, Lei T, Chen P. Diabetic retinopathy grading by a source-free transfer learning approach. Biomedical Signal Processing and Control. 2022;73:103423. [Google Scholar]
  • 45.Vives-Boix V, Ruiz-Fernández D. Diabetic retinopathy detection through convolutional neural networks with synaptic metaplasticity. Computer Methods and Programs in Biomedicine. 2021;206:106094. doi: 10.1016/j.cmpb.2021.106094 [DOI] [PubMed] [Google Scholar]
  • 46.Jaskari J, Sahlsten J, Damoulas T, Knoblauch J, Särkkä S, Kärkkäinen L, et al. Uncertainty-aware deep learning methods for robust diabetic retinopathy classification. IEEE Access. 2022;10:76669–81. [Google Scholar]

Decision Letter 0

Yawen Lu

2 Nov 2023

PONE-D-23-29847

Vision transformer with masked autoencoders for referable diabetic retinopathy classification based on large-size retina image

PLOS ONE

Dear Dr. Xu,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Dec 17 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Yawen Lu, Ph.D

Academic Editor

PLOS ONE

Journal requirements:

1. When submitting your revision, we need you to address these additional requirements.

Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. 

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

4. We note that Figure 1 in your submission contains copyrighted images. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

a. You may seek permission from the original copyright holder of Figure 1 to publish the content specifically under the CC BY 4.0 license. 

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

b. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

Additional Editor Comments:

Comments to the Author:

In conclusion, after careful review and consideration of the paper titled "Vision transformer with masked autoencoders for referable diabetic retinopathy classification based on large-size retina image", the initial decision is major revision. This decision is based on the consensus of the reviewers and the following concerns and contributions of the paper:

From R1, the authors should compare the method with other state-of-the-art CNN models, state the novelty and contribution of the proposed work, the limitations of the method, and other details in implementation and result analysis.

From R2, the authors should make the architecture clear. R2 also mentions some issues in the writing and in the layout of the tables, contributions, and figures.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors implemented the article entitled "Vision transformer with masked autoencoders for referable diabetic retinopathy classification based on large-size retina image". There are several queries that need to be addressed.

1. Did the authors compare the Vision Transformer's performance with any other state-of-the-art CNNs?

2. Many researchers have already applied Vision Transformers to diabetic retinopathy databases, so what is the novelty of the proposed work?

3. The literature review given in the related work is not sufficient; some more recent papers need to be added.

4. What is the training and testing data split?

5. Was 5-fold or 10-fold validation carried out?

6. Provide information about the hyperparameters used.

7. Have you performed error analysis?

8. The methods need to be elaborated.

9. The discussion is incomplete. The authors should compare the results of the proposed work with existing related works in the literature.

10. Include the limitations and future scope of the work.

11. The result analysis in the Results section needs to be improved.

Reviewer #2: 1. While talking about datasets: start with the APTOS dataset, as this was the one mentioned rigorously within the introduction, then move on to the other datasets.

2. In the introduction, explain what DeiT-S and the other techniques stand for, as you added such details for CNN and ViT.

3. There is confusion between the closing paragraphs of the introduction and the related work. Why don't you move this part to be combined with the related work? Divide the related work into work conducted on standard-size retinal images and on larger retina images, in order to present the drawbacks mentioned within the introduction based on solid facts.

4. The figures need more labeling between arrows to make the sequence easier to understand.

5. Move the contributions to the conclusions, and replace them with objectives instead to enhance readability.

6. For the results tables, highlight the best-performing techniques across all datasets and all evaluation criteria.

7. The architecture is very unclear; every component needs to be explained separately, with the motivation behind it and what it adds to the overall technique.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Mar 6;19(3):e0299265. doi: 10.1371/journal.pone.0299265.r002

Author response to Decision Letter 0


27 Nov 2023

We have carefully revised our manuscript according to the journal requirements and the comments from the editor and reviewers. We have highlighted the revised text in blue type in our revised manuscript, and we have provided a response to each comment in the separately uploaded "Response to Reviews" file.

Attachment

Submitted filename: Response to Reviews.docx

pone.0299265.s001.docx (38.6KB, docx)

Decision Letter 1

Yawen Lu

29 Jan 2024

PONE-D-23-29847R1

Vision transformer with masked autoencoders for referable diabetic retinopathy classification based on large-size retina image

PLOS ONE

Dear Dr. Xu,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Mar 14 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Yawen Lu

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

Please address the minor comments on the diagrams, logic flow, and literature review. After these modifications, the manuscript should be ready for publication.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have addressed the reviewer comments. Hence the article can be accepted in its current form.

Reviewer #2: Most of the comments have been addressed (many thanks for your effort); however, I will reiterate two comments that will bring the paper up to standard:

1. The diagrams should be clearer (especially the ViT diagram); the logic flow is very unclear without reading the text.

2. Divide the literature review into subsections based on the size of the dataset worked on.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Mar 6;19(3):e0299265. doi: 10.1371/journal.pone.0299265.r004

Author response to Decision Letter 1


1 Feb 2024

We appreciate the comments from the reviewers. We have revised Figures 1 and 2 by adding annotations to make them clearer. Additionally, we have revised the related work section and added subsections to improve readability.

Attachment

Submitted filename: Response to Reviewers.docx

pone.0299265.s002.docx (23.4KB, docx)

Decision Letter 2

Yawen Lu

8 Feb 2024

Vision transformer with masked autoencoders for referable diabetic retinopathy classification based on large-size retina image

PONE-D-23-29847R2

Dear Dr. Xu,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Yawen Lu, Ph.D

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Dear authors:

Regarding your submission:

PONE-D-23-29847R2

Vision transformer with masked autoencoders for referable diabetic retinopathy classification based on large-size retina image

We have received feedback from the previous reviewers and are pleased to announce that your work has been accepted for publication in PLOS ONE.

Please follow the steps below and provide a camera-ready version of your manuscript. Congratulations again!

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: Comments have been addressed, no further comments are required to be detailed.

The language was fixed, and the organizational structure and diagrams were revised.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

**********

Acceptance letter

Yawen Lu

26 Feb 2024

PONE-D-23-29847R2

PLOS ONE

Dear Dr. Xu,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Yawen Lu

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: Response to Reviews.docx

    pone.0299265.s001.docx (38.6KB, docx)
    Attachment

    Submitted filename: Response to Reviewers.docx

    pone.0299265.s002.docx (23.4KB, docx)

    Data Availability Statement

    All data are available from:

    EyePACS: https://www.kaggle.com/competitions/diabetic-retinopathy-detection/data

    APTOS: https://www.kaggle.com/competitions/aptos2019-blindness-detection/data

    Messidor2: https://www.adcis.net/en/third-party/messidor2/

    OIA: https://github.com/nkicsl/OIA

    The code is available at: https://github.com/CNMaxYang/VMLRI.


    Articles from PLOS ONE are provided here courtesy of PLOS

    RESOURCES