Journal of Imaging Informatics in Medicine
2025 Apr 29;39(1):655–668. doi: 10.1007/s10278-025-01521-7

Multimodal Masked Autoencoder Based on Adaptive Masking for Vitiligo Stage Classification

Fan Xiang 1, Zhiming Li 1, Shuying Jiang 1, Chunying Li 2, Shuli Li 2, Tianwen Gao 2, Kaiqiao He 2, Jianru Chen 2, Junpeng Zhang 1, Junran Zhang 1,
PMCID: PMC12920979  PMID: 40301294

Abstract

Vitiligo, a prevalent skin condition characterized by depigmentation, presents challenges in staging due to its inherent complexity. Multimodal skin images can provide complementary information, and in this study, the integration of clinical images of vitiligo and those obtained under Wood’s lamp is conducive to the classification of vitiligo stages. However, difficulties in annotating multimodal data and the scarcity of multimodal data limit the performance of deep learning models in related classification tasks. To address these issues, a Multimodal Masked Autoencoder (Multi-MAE) based on adaptive masking is proposed, which alleviates the difficulty of annotating multimodal data and the problem of multimodal data scarcity, and enhances the model’s ability to extract characteristics from multimodal data. Specifically, an image reconstruction task is constructed to diminish reliance on annotated multimodal data, and a pre-training strategy is employed to alleviate the scarcity of multimodal data. Experimental results demonstrate that the proposed model, pre-trained on a dataset of unlabeled dermatological images, achieves a vitiligo stage classification accuracy of 95.48%, an improvement of 5.16%, 4.51%, 3.87%, 2.58%, 4.51%, 4.51%, 3.87%, and 2.58% over that of MobileNet, DenseNet, VGG, ResNet-50, BEIT, MaskFeat, SimMIM, and MAE, respectively. These results verify the effectiveness of the proposed Multi-MAE model in assessing the stable and active vitiligo stages, making it a suitable clinical aid for evaluating the severity of vitiligo lesions.

Keywords: Multimodal, Masked Autoencoder, Adaptive Masking, Vitiligo Stage

Introduction

Vitiligo is a prevalent skin depigmentation disorder estimated to affect 2% of the global population [1, 2]. The primary pathological mechanism involves the dysfunction and selective loss of melanocytes, leading to progressive skin, hair, and mucous membrane depigmentation with characteristic white patches [3, 4]. Vitiligo impacts patients’ physical health, psychological well-being, and social functioning [5–7]. In addition, the staging of vitiligo is characterized by remarkable variability and unpredictability. In our study, the staging of vitiligo refers to the active stage and the stable stage. The stable stage here refers to a period (usually 3 to 6 months) during which the area of the vitiligo does not expand, no new vitiligo patches appear, and the boundaries of the vitiligo are clear [8]. The active stage, on the other hand, is manifested as a gradual increase in the area of the vitiligo, the emergence of new vitiligo patches, or blurred boundaries of the vitiligo [9]. The difficulty of distinguishing between the stable and active stages complicates disease management and clinical treatment strategies [10]. Therefore, accurate staging of vitiligo is of great significance for formulating personalized treatment plans, especially when considering treatment modalities such as surgical intervention during the active stage.

Multimodal image-based classification of vitiligo stages is common in the auxiliary evaluation of dermatological conditions because different modalities provide complementary information [11–13]. In this study, clinical images of vitiligo and images obtained under Wood's lamp were used as multimodal images for the stage classification of vitiligo. The distinct features shown by each imaging technique were cross-integrated to improve the accuracy of vitiligo stage classification. Clinical images offer direct observation of skin lesions, including the size, shape, margin, and pigment change characteristic of vitiligo [14]. Wood’s lamp images highlight areas of depigmentation with distinct fluorescence, contrasting with the surrounding normal skin, which aids in identifying and delineating affected regions [15, 16]. Recently, deep learning techniques leveraging multiple modalities have enhanced the ability to classify the stages of vitiligo by capturing the nuanced characteristics of skin lesions [17–19]. In summary, a growing number of dermatological studies have employed multimodal data to address clinical challenges of increasing complexity.

Despite the substantial advancements offered by current multimodal technologies in dermatological image research, there are still deficiencies. First, annotating dermatological multimodal data is challenging and expensive [20, 21]. Second, there is typically a scarcity of multimodal dermatological images due to the prevalence of clinical images contrasted with the relative scarcity of Wood’s lamp images. In recent years, Masked Autoencoders (MAE) [22] have proven effective in addressing annotation difficulties and limited data scales in single-modal contexts [23]. However, whether these methods can be generalized to multimodal domains remains uncertain. Generally, the random masking strategy may cause the model to evenly disperse its attention across the image rather than focus on key semantic regions [23]. The critical features when classifying vitiligo stages are the depigmented patches [24], whereas non-relevant information for vitiligo stage classification, such as the normal skin, hair, shadows, and non-lesion moles or freckles, is typically included in the background. Additionally, conventional multimodal fusion methods tend to ignore information from the other modality or fail to leverage the complementary information fully [25]. In brief, the primary challenges are the complexities of annotating multimodal data and the issue of data scarcity.

To address the aforementioned challenges, this study designs a Multimodal MAE (Multi-MAE) model based on adaptive masking for accurate vitiligo stage classification. Specifically, the Multi-MAE framework performs image reconstruction tasks to reduce reliance on annotated multimodal data and adopts a pre-training–fine-tuning strategy to alleviate the issue posed by data scarcity. Additionally, to counter the problem of dispersed attention caused by random masking, we design an adaptive masking module that selectively masks important regions and thus improves the accuracy of vitiligo stage classification. Furthermore, an efficient multimodal fusion module is devised to integrate data from diverse dermatological modalities. In summary, the designed Multi-MAE model addresses the challenges of multimodal data annotation and scarcity, offering a reliable approach to vitiligo stage evaluation.

Methodology

The model’s framework integrates an adaptive masking module, encoder, cross-attention fusion, and decoder to process vitiligo images efficiently. It also leverages a two-stage training approach, starting with feature learning on unlabeled data followed by fine-tuning on labeled samples for accurate vitiligo stage classification (active and stable stages).

Multi-MAE: Framework and Two-Stage Training

The proposed Multi-MAE system consists of two main stages. The first stage is the pre-training phase, where the model learns through an image reconstruction task on a large-scale unlabeled multimodal medical image dataset. The second stage is the fine-tuning phase, where the model utilizes the pre-trained parameters from the first stage to further train on small labeled samples, optimizing the encoder for accurate vitiligo stage classification. By employing this pre-training strategy, the model effectively mitigates the scarcity of annotated data by initially performing reconstruction tasks on a vast array of unlabeled images and subsequently conducts fine-tuning for vitiligo stage classification in downstream tasks.

The overall model framework consists of four key components: the adaptive masking module, the encoder, the cross-attention–based multimodal fusion module, and the decoder (Fig. 1). First, the system receives a clinical vitiligo image and a corresponding Wood’s lamp image. By employing the adaptive masking module, the model selectively masks regions with depigmented patches or significant texture changes related to vitiligo in the images. This selective masking enhances the model’s learning of information related to the classification of vitiligo stages, such as the characteristics of lesion areas, while suppressing irrelevant or redundant background information, thereby reducing noise interference. Subsequently, the encoder extracts features from both modalities and feeds these features into the cross-attention–based multimodal fusion module to integrate complementary information between modalities. Finally, the decoder reconstructs the original image pair based on the fused features. The pre-trained encoder and the cross-attention–based multimodal fusion module can be applied to downstream tasks, such as vitiligo stage classification. This phase, particularly the initial image reconstruction task, alleviates the difficulties associated with annotating multimodal datasets by leveraging a large pool of unlabeled images during pre-training. The subsequent fine-tuning of a modest collection of labeled data refines the model’s classification capabilities, enhancing its applicability and precision.

Fig. 1.

Fig. 1

Training process of the two-stage Multi-MAE model. The first stage is the pre-training stage, where the model learns through the image reconstruction task in a large-scale unlabeled multimodal medical image dataset. The second stage is the fine-tuning stage, where the model uses the pre-trained parameters obtained from the first stage for further training on a small labeled sample dataset

Adaptive Masking Module

Conventional random masking methods [22] do not fully leverage the semantic information crucial for visual representation learning. Additionally, due to the reliance on large backbone networks, these methods often require significant time and sample resources during the pre-training phase. The adaptive masking module is proposed to address these issues (Fig. 2). The core of this module lies in utilizing self-attention mechanisms to automatically extract image features related to the classification task, such as the shape, color, and texture characteristics of vitiligo lesions, during training and without supervision. Self-attention, originating from the Transformer architecture, computes relationships among queries, keys, and values, normalizes them with Softmax to obtain attention weights, and performs a weighted sum to capture important information. In the adaptive masking module, these attention weights guide the masking process, so that regions crucial for vitiligo stage classification, such as depigmented patches or areas with distinct changes in skin texture, are selectively masked, which promotes effective feature representation learning. Moreover, an efficient redundant patch dropout strategy is introduced to further enhance the efficiency of the learning process.

Fig. 2.

Fig. 2

Schematic diagram of the adaptive masking module. The grey pixel blocks represent the masked pixel areas, while the black pixel blocks indicate the discarded patches

The top of Fig. 2 depicts the simplified process of the common Masked Image Modeling (MIM) [26] method, while the bottom provides an overview of the adaptive masking module. In the adaptive masking module, gray pixel blocks represent masked pixel regions, while black pixel blocks represent discarded patches not fed into the model, thus saving computational resources. In contrast to conventional MIM methods, this adaptive module performs masking and dropout operations using normalized attention maps, which consist of the following steps:

Semantic Information Extraction

In the semantic information extraction phase, the image is embedded into a token sequence $z \in \mathbb{R}^{(N+1) \times D}$, where $N$ is the number of patch tokens. These tokens are subsequently fed into Transformer blocks for processing. Within each Transformer block, the Multi-head Self-Attention (MSA) layer is responsible for mapping and partitioning $z$ into $N_h$ subspaces. Each subspace consists of the corresponding query $q_i$, key $k_i$, and value $v_i$, where $i = 1, 2, \dots, N_h$ and $q_i, k_i, v_i \in \mathbb{R}^{(N+1) \times D_h}$. The output of the MSA layer is obtained by applying the Softmax function:

$$A_i = \mathrm{softmax}\left(\frac{q_i k_i^{\top}}{\sqrt{D_h}}\right), \tag{1}$$

where $D_h = D / N_h$ and $A_i$ is the $(N+1) \times (N+1)$ attention matrix. Then, averaging the first row (excluding its first element) over the $N_h$ heads, we obtain

$$a_w = \frac{1}{N_h} \sum_{i=1}^{N_h} a_i^{1}. \tag{2}$$

Here, $a_i^{1} \in \mathbb{R}^{N}$ represents the first row of $A_i$ excluding the first element. The mask weight $a_w$ is subsequently reshaped to $H/P \times W/P$ and mapped back to the original image size through interpolation.
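The attention-map extraction of Eqs. (1) and (2), averaging the CLS-to-patch attention row over heads and reshaping it to the patch grid, can be sketched as follows. The array shapes, the min-max normalization, and the toy inputs are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def attention_mask_weights(attn, grid_h, grid_w):
    """attn: (N_h, N+1, N+1) softmax attention matrices A_i.
    Returns a_w reshaped to (grid_h, grid_w), scaled to [0, 1]."""
    # First row of each head, excluding the CLS-to-CLS element: a_i^1 in R^N
    cls_to_patch = attn[:, 0, 1:]             # (N_h, N)
    a_w = cls_to_patch.mean(axis=0)           # average over N_h heads, Eq. (2)
    a_w = a_w.reshape(grid_h, grid_w)         # reshape to (H/P, W/P)
    a_w = (a_w - a_w.min()) / (a_w.max() - a_w.min() + 1e-8)
    return a_w

# Toy example: 2 heads, a 2x2 patch grid (N = 4 patches + 1 CLS token)
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 5, 5))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
w = attention_mask_weights(attn, 2, 2)
print(w.shape)  # (2, 2)
```

In the full model, this map would additionally be interpolated back to the original image resolution before cropping.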

Masking Strategy

Conventional MIM methods generally employ random masking, providing equal probabilities for each image region to be masked, which results in the model’s attention being evenly distributed across the entire image. As shown in Fig. 2, the conventional random masking approach pays too much attention to the background, which affects the effective learning of feature representations. On the other hand, due to the large scales of the backbone network and data, pre-training usually requires a significant amount of time.

To overcome these challenges, a special masking strategy is proposed in this study to guide the masking and discarding processes. Specifically, random masking is used in the early pre-training stages to collect global semantic information in the image. After a few training cycles, the attention map is obtained and normalized to the [0, 1] range, where each element represents the weight of the corresponding pixel in the image. When the image is cropped and enhanced before being fed to the model, the corresponding area in the attention map is also cropped accordingly. The cropped attention map and image are both marked as patches. To retain sufficient semantic information required for MIM, the patch indices are sampled N times:

$$M = S(P_A), \tag{3}$$

where $S$ is a function for weighted sampling and $P_A$ is the patch-wise sum of $a_w$. Each element of $P_A \in \mathbb{R}^{N}$ represents the sampling weight of the corresponding token in the tokenized image $z$, excluding the $x_{cls}$ token. $M \in \mathbb{R}^{N}$ is the resulting vector of the sampling, whose elements represent the indices of the patches. The model selectively masks areas of higher importance based on the weights in the attention map, allowing the model to focus more on key image features.

To further improve learning efficiency, a discarding strategy is also introduced, throwing away image blocks that do not participate in subsequent modeling and thus reducing the consumption of computational resources. The discarded areas are obtained by sampling in the attention map, ensuring that the model focuses on the most important information for the current task. Since the location of objects can be roughly estimated from the attention map, some parts of the background are redundant for training.

Combining the masking and discarding strategies, we generate the final input that includes the masked and visible parts. This step is represented by

$$z_I = C(M, r, t), \tag{4}$$

where $C$ is the attention-driven masking and discarding strategy, $t$ is the token discard ratio, $r$ is the masking ratio, and $z_I$ represents the image patches, comprising masked and visible parts, that will be used for model training. This means that although the model discards some image blocks, it retains sufficient information for learning. These retained parts include the most important features for the current task and the information that helps the model understand the image context.
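The weighted sampling of Eq. (3) and the discard step feeding Eq. (4) can be sketched as below. The ratios, the rule for choosing discarded patches (lowest attention weight), and all names are assumptions for illustration.

```python
import numpy as np

def adaptive_mask_indices(patch_weights, mask_ratio, drop_ratio, rng):
    """Weighted sampling S(P_A): high-weight patches are masked,
    low-weight (background) patches are discarded entirely."""
    n = len(patch_weights)
    p = patch_weights / patch_weights.sum()
    n_mask = int(n * mask_ratio)
    # Sample masked indices M with probability proportional to P_A (Eq. 3)
    masked = rng.choice(n, size=n_mask, replace=False, p=p)
    # Discard the lowest-weight patches among the remainder (dropout strategy)
    remaining = np.setdiff1d(np.arange(n), masked)
    order = remaining[np.argsort(patch_weights[remaining])]
    dropped = order[: int(n * drop_ratio)]
    visible = np.setdiff1d(remaining, dropped)
    return masked, dropped, visible

rng = np.random.default_rng(42)
weights = rng.random(16)                        # attention weights of 16 patches
masked, dropped, visible = adaptive_mask_indices(weights, 0.5, 0.25, rng)
print(len(masked), len(dropped), len(visible))  # 8 4 4
```

Only the visible indices would be fed to the encoder; the dropped indices never enter the model, which is where the computational saving comes from.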

By combining the masking, discarding, and retaining strategies, the adaptive masking module can effectively utilize the semantic information in the image while reducing unnecessary computational burden, thereby improving the model’s learning efficiency and performance. Besides, by focusing on key image features, this module allows the model to learn more advanced and richer feature representations, which indicates better learning ability in different downstream tasks and enhanced generalization ability.

Cross-Attention–Based Multimodal Fusion Module

In vitiligo stage classification, a single imaging modality, whether it is a clinical image or a Wood’s lamp image, often fails to provide comprehensive information, necessitating the exploration of novel methods for multimodal image fusion. Clinical vitiligo images provide intuitive visual information about the lesion areas, while Wood’s lamp images display the fluorescent characteristics of the lesion under specific lighting conditions. The complementary nature of these two modalities offers the potential for precise vitiligo stage classification. Therefore, combining these two types of image data to extract rich pathological information is crucial for improving the accuracy of vitiligo stage classification.

In Fig. 3, during the pre-training phase, feature extraction is performed for each modality. Clinical images and Wood's lamp images undergo their respective feature extraction processes and are transformed into corresponding feature representations. Subsequently, these features are input into the cross-attention fusion module for image fusion. As shown in Fig. 4, the proposed multimodal fusion module based on the cross-attention mechanism independently processes clinical images and Wood’s lamp images to extract their respective feature representations. Afterward, through cross-attention layers, the model learns the interrelationships between the two modalities, with features from one modality as queries and features from the other as keys and values. This process allows the model to identify and enhance features relevant to vitiligo stage classification while suppressing irrelevant background information, thereby achieving more accurate identification of lesion areas.

Fig. 3.

Fig. 3

Multi-MAE’s Multimodal Processing during the Pre-training Phase. During the pre-training phase, clinical and Wood’s lamp images undergo separate feature extraction. Then, the features are fused via the cross-attention module for subsequent tasks

Fig. 4.

Fig. 4

Cross-attention–based multimodal fusion module, with inputs being clinical images and Wood’s lamp images. The features of one modality act as the query, and the features of the other modality serve simultaneously as the key and value

Assume the input consists of image features from two modalities, denoted as $F_c \in \mathbb{R}^{N \times D_c}$ and $F_w \in \mathbb{R}^{N \times D_w}$, respectively. Here, $D_c$ and $D_w$ are the feature dimensions of clinical images and Wood’s lamp images, respectively. To merge these two sets of features, we compute their cross-attention through the following steps:

First, generate the query (Q), key (K), and value (V) matrices for each modality. For the first modality (e.g., clinical image features Fc), compute Q, K, and V as follows:

$$Q_x = F_c W_q^x, \quad K_x = F_c W_k^x, \quad V_x = F_c W_v^x, \tag{5}$$

where $W_q^x$, $W_k^x$, and $W_v^x$ are the learnable weight matrices corresponding to the first modality.

For the second modality (e.g., Wood’s lamp image features Fw), perform similar transformations:

$$Q_y = F_w W_q^y, \quad K_y = F_w W_k^y, \quad V_y = F_w W_v^y, \tag{6}$$

where $W_q^y$, $W_k^y$, and $W_v^y$ are the learnable weight matrices corresponding to the second modality.

Next, compute the dot product between queries and keys and normalize it using the Softmax function to obtain the attention weights:

$$A_x = \mathrm{softmax}\left(\frac{Q_x K_y^{\top}}{\sqrt{d}}\right), \quad A_y = \mathrm{softmax}\left(\frac{Q_y K_x^{\top}}{\sqrt{d}}\right), \tag{7}$$

where $d$ is the scaling factor used to stabilize the computation of attention weights. Effective information fusion between modalities can be achieved through this multi-head cross-attention mechanism, enabling deep feature representation learning at the feature level and providing strong support for precise vitiligo stage classification. Finally, these attention weights are used to merge the values (V) to obtain the final result:

$$F_{fusion} = A_x V_y + A_y V_x. \tag{8}$$

Through this fusion strategy, the model improves vitiligo stage classification accuracy and enhances its generalization capability across different lesion features. The proposed cross-attention fusion module offers a new perspective for vitiligo stage classification, particularly when dealing with multimodal data, demonstrating its potential to improve classification accuracy. Future work will focus on optimizing fusion strategies and validating the effectiveness of the proposed method on a broader range of dermatological classification tasks.
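Under the definitions of Eqs. (5) through (8), a single-head version of the fusion can be sketched in plain NumPy; the dimensions and random weights are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention_fuse(Fc, Fw, Wc, Ww, d):
    # Per-modality projections, Eqs. (5)-(6)
    Qx, Kx, Vx = Fc @ Wc["q"], Fc @ Wc["k"], Fc @ Wc["v"]
    Qy, Ky, Vy = Fw @ Ww["q"], Fw @ Ww["k"], Fw @ Ww["v"]
    # Each modality queries the other, Eq. (7)
    Ax = softmax(Qx @ Ky.T / np.sqrt(d))
    Ay = softmax(Qy @ Kx.T / np.sqrt(d))
    # Merge values from the opposite modality, Eq. (8)
    return Ax @ Vy + Ay @ Vx

rng = np.random.default_rng(0)
N, D = 4, 8                                   # 4 tokens, 8-dim features
Fc, Fw = rng.normal(size=(N, D)), rng.normal(size=(N, D))
Wc = {k: rng.normal(size=(D, D)) for k in ("q", "k", "v")}
Ww = {k: rng.normal(size=(D, D)) for k in ("q", "k", "v")}
fused = cross_attention_fuse(Fc, Fw, Wc, Ww, d=D)
print(fused.shape)  # (4, 8)
```

A multi-head variant would split the feature dimension into heads before the attention step and concatenate the per-head outputs.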

Encoder and Decoder

An advanced autoencoder architecture is employed to handle multimodal image data. Unlike conventional MAEs, the encoder module of the proposed model is capable of simultaneously processing input data from different imaging modalities. The encoder is based on the Vision Transformer (ViT) and is extended to accommodate multimodal inputs. It divides clinical and Wood's lamp images into small patches, which are transformed into a sequence of tokens for model processing. Each token represents a small part of the image, and by processing these tokens, the encoder captures lesion features such as texture, color distribution, and shape. Specifically, data from each modality are transformed via independent patch projection layers into a sequence of tokens and fed into a unified Transformer encoder. This design enables the model to capture the intrinsic connections between different modalities and combine their complementary information, providing richer feature representations for the decoder.

The decoder is a crucial component in the multimodal small-sample MAE framework, which aims to reconstruct complete image representations from partially visible information, thereby handling multimodal data such as clinical images and Wood’s lamp images. Each decoder is designed for specific tasks to ensure image reconstruction on multimodal inputs during the pre-training phase. These input data are obtained through random sampling and provided to the decoder in the form of visible tokens, prompting the model to learn how to infer complete image content from limited information. The decoder design structure emphasizes logical rigor and computational efficiency. It first maps the feature output by the encoder to a dimension space suitable for decoding through a linear projection layer. Subsequently, the decoder processes the mapped features using a series of Transformer modules, which can capture complex relationships between input tokens and generate outputs for reconstructing masked regions.
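The ViT-style patch-to-token step the encoder performs on each modality can be sketched as follows; the patch size and image dimensions are illustrative assumptions:

```python
import numpy as np

def patchify(img, P):
    """Split img (H, W, C) into non-overlapping P x P patches and
    flatten each into a token: returns (H//P * W//P, P*P*C)."""
    H, W, C = img.shape
    gh, gw = H // P, W // P
    patches = img[: gh * P, : gw * P].reshape(gh, P, gw, P, C)
    return patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, P * P * C)

# Toy 32x32 RGB image with patch size 16 -> 4 tokens of length 768
img = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
tokens = patchify(img, P=16)
print(tokens.shape)  # (4, 768)
```

In the model, each modality's tokens then pass through its own learned patch projection layer before entering the shared Transformer encoder.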

Experiment

Experimental Setup

The entire model framework was implemented in PyTorch on an NVIDIA GeForce GTX 1080 Ti GPU. The software environment consisted of Ubuntu 16.04, CUDA 9.0, PyTorch 1.8, and Python 3.9. The batch size was set to 8 during training. The Adam optimizer was employed, with an initial learning rate of 0.001 and a minimum learning rate of 0.00001. The first (pre-training) stage ran for 200 epochs, while the second (fine-tuning) stage for vitiligo stage classification (active and stable stages) ran for 100 epochs. The ratio of the training set to the validation set was 8:2.
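The setup specifies a learning rate decaying from 0.001 to a floor of 0.00001; a cosine decay is one common way to realize this, though the exact schedule shape is not stated in the text and is assumed here:

```python
import math

def lr_at_epoch(epoch, total_epochs=200, lr_max=1e-3, lr_min=1e-5):
    """Cosine annealing from lr_max (epoch 0) to lr_min (final epoch)."""
    cos = 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))
    return lr_min + (lr_max - lr_min) * cos

print(round(lr_at_epoch(0), 6))    # 0.001
print(round(lr_at_epoch(200), 6))  # 1e-05
```

With PyTorch, the equivalent would typically be `torch.optim.lr_scheduler.CosineAnnealingLR` with `eta_min=1e-5` wrapped around the Adam optimizer.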

Dataset and Preprocessing

Dataset

This study employs two datasets, as described below.

  1. Dataset 1 consists of clinical images of skin diseases collected by the dermatology department of Xijing Hospital from April 2021 to August 2021. Three departments are involved: the outpatient department, the phototherapy center, and the surgical center. This dataset contains 11,326 images from 2838 patients meticulously annotated by physicians. It encompasses three main categories of skin diseases: typical vitiligo, atypical vitiligo, and other pigmentary skin diseases.

  2. Dataset 2 is also sourced from the dermatology department of Xijing Hospital, comprising 838 images collected in 2023 from 292 patients with skin diseases. The images are organized in pairs, each containing two different imaging modalities: a clinical vitiligo image and the corresponding image under Wood’s lamp. According to professional annotations by physicians, these images have been accurately categorized into two vitiligo stages: active and stable.

In this study, Dataset 1 is utilized for the image reconstruction task during the pre-training phase. Although Dataset 1 only contains clinical images as a single modality, we apply digital image processing techniques to generate Wood’s lamp images. Subsequently, the original clinical images and the generated Wood’s lamp images are jointly used as the multimodal inputs for the pre-training stage. For Dataset 2, in which each subject has both clinical images and Wood’s lamp images, we divide the data into training and validation sets in an 8:2 ratio. The two-modality images in the training and validation sets are used as multimodal inputs for the fine-tuning stage.

Preprocessing

Under Wood’s lamp illumination, vitiligo-affected skin appears pure white and contrasts sharply with the surrounding healthy skin, as shown in Fig. 5, providing a more accurate basis for vitiligo stage classification. However, in underdeveloped areas with insufficient medical resources, the lack of Wood’s lamp equipment makes vitiligo stage classification more challenging. To address this issue, digital image processing techniques such as color space transformation and hue value adjustment are employed, as presented in Fig. 6. Vitiligo stage classification requires both clinical images and Wood's lamp images as model inputs, but Dataset 1 contains only clinical images. We therefore use digital image processing techniques to generate Wood's lamp images from the clinical images in Dataset 1. The clinically collected real images and the generated Wood's lamp images are then jointly used as the multimodal inputs for the pre-training stage.

Fig. 5.

Fig. 5

Examples of clinical images and Wood’s lamp images

Fig. 6.

Fig. 6

Image processing workflow

Advanced digital image processing techniques, such as color space transformation and hue value adjustment, are crucial in our vitiligo-related study. Color space transformation can change the color representation of vitiligo-related images. Since vitiligo is characterized by depigmentation, different color models can highlight the affected areas in distinct ways. For example, certain color spaces can better differentiate the hypopigmented regions of vitiligo from the normal skin, making it easier to analyze the extent and characteristics of the lesions. Hue value adjustment, on the other hand, significantly enhances the color contrast between lesioned and normal areas in vitiligo images. Vitiligo lesions usually present a different hue from the surrounding healthy skin. Adjusting the hue values can enhance the clarity of lesion boundaries and extents, including shape, size, and depigmentation degree, which are vital for vitiligo stage classification.

First, the RGB images are converted into HSV images, making it convenient to process the three components of the HSV color space separately. Differences in hue between the lesion and non-lesion areas are identified by analyzing the images under Wood’s lamp. Initially, the hue ranges of the lesion area ($H_{lesion}$) and non-lesion area ($H_{nonlesion}$) in the Wood’s lamp modality are calculated. Using these hue ranges, the hue values in the image are manually adjusted. Let the original hue be $H_{original}$, the adjusted hue be $H_{adjusted}$, and the adjustment range be $[H_{min}, H_{max}]$. The adjustment formula can be expressed as:

$$H_{adjusted} = \mathrm{clamp}(H_{original} + \delta,\; H_{min},\; H_{max}), \tag{9}$$

where $\delta$ is the manually adjusted hue difference, which can be positive or negative depending on the direction of the hue difference between the lesion and non-lesion areas under the Wood’s lamp modality. The clamp() function restricts a value to the given upper and lower limits. Finally, the processed HSV images are converted back to RGB images for subsequent analysis and application.
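Applied to a single pixel, Eq. (9) can be sketched with the standard-library colorsys module; the delta and hue range below are illustrative, not the values used in the study:

```python
import colorsys

def clamp(v, lo, hi):
    return max(lo, min(hi, v))

def adjust_pixel_hue(rgb, delta, h_min, h_max):
    """rgb: (r, g, b) with components in [0, 1]. Shift the hue by delta,
    clamp it to [h_min, h_max] (Eq. 9), and convert back to RGB."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    h_adj = clamp(h + delta, h_min, h_max)
    return colorsys.hsv_to_rgb(h_adj, s, v)

# Shift a reddish skin-tone pixel toward a blue-violet, Wood's-lamp-like hue
out = adjust_pixel_hue((0.8, 0.6, 0.5), delta=0.5, h_min=0.0, h_max=0.75)
print(all(0.0 <= c <= 1.0 for c in out))  # True
```

A whole-image version would vectorize the same operation over the H channel (e.g., with OpenCV or NumPy) rather than looping over pixels.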

Preliminary Experiments

Validation of Image Processing Methods

Converting clinical images into Wood’s lamp images can be achieved through two common methods: digital image processing and Generative Adversarial Networks (GANs) [27]. The digital image processing method has been described earlier. The GAN method employs a pix2pix [28] model trained on paired data. Initially, pre-training is performed using the paired clinical and Wood’s lamp images from Dataset 2. The trained generator is then used to generate and validate images from Datasets 1 and 2. Visual observation suggests that the outputs of digital image processing and GANs are comparable in some cases, as shown in the left column of Fig. 7. However, in certain instances, as shown on the right side of Fig. 7, the GAN-generated images exhibit distortions, indicating the GAN’s instability on this dataset and substantial noise in its outputs. Additionally, GANs require more memory resources.

Fig. 7.

Fig. 7

Data comparison chart

To quantitatively compare the effectiveness of the two methods, various image similarity metrics are calculated for the two examples in Fig. 7, including cosine similarity, hash functions (mean hash, difference hash, perceptual hash), and Structural Similarity Index Measure (SSIM). The results are presented in Tables 1 and 2, with superior performance highlighted in bold. For cosine similarity, the digital image processing method performs slightly better with scores of 0.99 and 0.996 for Examples 1 and 2, respectively, while the GAN method scores 0.97 and 0.982, respectively. In terms of hash functions, the digital image processing method scores 0.88, 0.66, and 0.89 for the mean, difference, and perceptual hash in Example 1, and 0.86, 0.70, and 1.0 in Example 2. In comparison, the GAN method scores 0.84, 0.59, and 0.78 in Example 1 and 0.75, 0.59, and 0.98 in Example 2. The SSIM evaluation results show that digital image processing achieves scores of 0.91 and 0.93 in Examples 1 and 2, respectively, which are higher than the scores of 0.75 and 0.37 by GAN.

Table 1.

Calculated image similarity metrics for Example 1

Method | Cosine | Hash (mean/difference/perceptual) | SSIM
GAN | 0.970 | 0.84/0.59/0.78 | 0.75
Digital Image Processing | 0.990 | 0.88/0.66/0.89 | 0.91

Table 2.

Calculated image similarity metrics for Example 2

Method | Cosine | Hash (mean/difference/perceptual) | SSIM
GAN | 0.982 | 0.75/0.59/0.98 | 0.37
Digital Image Processing | 0.996 | 0.86/0.70/1.00 | 0.93
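As a reference for the hash columns above, a mean-hash (aHash) similarity can be sketched as follows; the 8x8 grid is the conventional choice and an assumption here, and the toy images are synthetic:

```python
import numpy as np

def average_hash(gray, grid=8):
    """gray: 2-D array. Block-average down to grid x grid, threshold at the
    mean, and return a flat boolean hash of length grid*grid."""
    h, w = gray.shape
    small = gray[: h - h % grid, : w - w % grid]
    small = small.reshape(grid, h // grid, grid, w // grid).mean(axis=(1, 3))
    return (small > small.mean()).ravel()

def hash_similarity(a, b):
    """Fraction of matching hash bits, in [0, 1]."""
    return float((a == b).mean())

rng = np.random.default_rng(1)
img = rng.random((64, 64))
noisy = img + rng.normal(scale=0.01, size=img.shape)
print(hash_similarity(average_hash(img), average_hash(img)))  # 1.0
```

Difference hash and perceptual hash follow the same bit-comparison pattern but derive the bits from adjacent-pixel gradients and low-frequency DCT coefficients, respectively.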

To explore the differences in image similarity metrics between the two methods, we conducted an independent-sample t-test. The t-statistic and p-value are critical parameters for evaluation. When the p-value is less than 0.05, it indicates a significant difference in structural similarity, with the digital image processing method generating images that better match real Wood’s lamp images. Conversely, a p-value greater than 0.05 suggests no significant difference.

In this comparative study examining the effects of digital image processing and GANs in generating Wood's lamp images, the independent-sample t-test for Table 1, with a t-value of 3.19 and a p-value of 0.0333, indicates a statistically significant difference in image similarity metrics between the digital image processing group and the GAN group (α = 0.05). This demonstrates that the images generated by the digital image processing method are closer to real Wood’s lamp images in terms of overall features and structure compared to those generated by the GAN method.

For the independent-sample t-test of Table 2, with a t-value of 1.60 and a p-value of 0.186, no statistically significant difference was observed between the two groups of images at the significance level α = 0.05. However, it is critical to note that the limited sample size (only 2 images per group) may compromise the reliability of the statistical analysis, potentially leading to false-negative results even if true differences exist.

To further validate these findings, an extended analysis was conducted on 100 samples. The mean similarity of 100 GAN-generated images to the original images across the five evaluation metrics (Cosine, aHash, dHash, pHash, and SSIM) was [0.972, 0.85, 0.58, 0.76, 0.78], whereas the means for images generated by the image processing method were [0.992, 0.89, 0.63, 0.87, 0.89]. The independent-sample t-test on the 100-sample dataset revealed a statistically significant difference between the two methods under large-sample conditions (t = 3.2, p = 0.0238). Collectively, these analyses across multiple sample sizes indicate a statistically significant difference in structural similarity between the two methods.

Overall, the t-test results statistically support the conclusion that the digital image processing method generates images more similar to real Wood's lamp images in overall features and structure, and can therefore provide a more reliable data foundation for subsequent vitiligo staging and classification tasks.
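For reproducibility, the Table 1 statistics (t = 3.19, p = 0.0333) can be recovered from the five per-metric scores. A minimal sketch assuming SciPy is available; pairing the two methods' scores metric-by-metric is our assumption, as it is what reproduces the reported values:

```python
import numpy as np
from scipy import stats

# Per-metric scores from Table 1 (cosine, aHash, dHash, pHash, SSIM)
dip = np.array([0.990, 0.88, 0.66, 0.89, 0.91])  # digital image processing
gan = np.array([0.970, 0.84, 0.59, 0.78, 0.75])  # GAN

# Testing the metric-by-metric score pairs reproduces the reported statistics
t, p = stats.ttest_rel(dip, gan)
print(f"t = {t:.2f}, p = {p:.4f}")  # t = 3.19, p = 0.0333
```

With α = 0.05, the p-value of 0.0333 leads to rejecting the null hypothesis of equal similarity, matching the conclusion drawn in the text.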

Comparative Experiments on Pre-Training with Different Datasets

A series of experiments are designed and conducted to evaluate how the model performs when loading different pre-trained parameters. The experiments consider both whether to load pre-trained parameters and the choice of pre-training dataset (the ImageNet natural image dataset, the original skin disease data, and the cropped skin disease data). According to the parameter-loading comparison results in Table 3, the model without pre-trained parameters achieves an accuracy of 87.74%. When pre-trained parameters based on the ImageNet-1K dataset are introduced, the accuracy increases to 88.39%, highlighting the positive effect of large-scale, diverse pre-training data on model generalization. Using the original skin disease data for pre-training, the model achieves an accuracy of 91.61%, demonstrating the advantage of pre-training datasets tailored to the specific task. Notably, pre-training with cropped skin disease data improves the accuracy to 95.48%, emphasizing the importance of data processing strategies in optimizing model performance.

Table 3.

Comparison of loading parameter experiments

Pre-trained parameters Pre-trained loss Accuracy
No Pre-trained Parameters - 87.74%
ImageNet-1K - 88.39%
Original Skin Disease Data 0.098 91.61%
Cropped Skin Disease Data 0.005 95.48%
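In practice, the loading strategy compared above amounts to transferring only the pre-trained weights whose names and shapes match the fine-tuning model (e.g., the encoder transfers, while an ImageNet classification head does not). A framework-agnostic sketch; the parameter names and the NumPy stand-ins for tensors are our illustration, not the paper's code:

```python
import numpy as np

def load_matching_params(model_state, checkpoint):
    """Copy checkpoint entries into the model state when both the parameter
    name and the array shape match; report what was transferred and skipped."""
    loaded, skipped = [], []
    for name, weight in checkpoint.items():
        if name in model_state and model_state[name].shape == weight.shape:
            model_state[name] = weight.copy()
            loaded.append(name)
        else:
            skipped.append(name)
    return loaded, skipped

# Hypothetical states: the encoder matches, the classification head does not
model = {"encoder.w": np.zeros((4, 4)), "head.w": np.zeros((4, 2))}
ckpt = {"encoder.w": np.ones((4, 4)), "head.w": np.ones((4, 1000))}  # 1000-class head
loaded, skipped = load_matching_params(model, ckpt)
```

This mirrors the common "non-strict" loading used when fine-tuning: the task-specific head keeps its fresh initialization while the pre-trained backbone weights are reused.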

Taken together, these experimental results show that the pre-training strategy significantly affects the model's performance on multimodal small-sample vitiligo stage classification. Pre-training datasets tailored to the specific task enhance model accuracy and generalization; therefore, suitable pre-training strategies and data processing methods should be selected to achieve optimal model performance in practical applications.

Results and Discussion

This study employs a series of techniques to enhance the accuracy of vitiligo stage classification (active and stable stages), especially under the challenge of multimodal small-sample datasets. This section conducts comparative experiments to discuss the techniques’ efficacies. First, the effectiveness of the adaptive mask module in extracting vitiligo lesion features is validated. Then, multimodal experiments are carried out to assess the specific contributions of different data sources to model performance and to confirm the advantages of the interactive fusion module in integrating multimodal data. Finally, the proposed model is comprehensively evaluated and compared with mainstream networks.

Validation of the Adaptive Mask Module

This section compares random masking with the adaptive (attention-guided) masking used during the image pre-training stage, to verify whether the adaptive mask module can effectively exploit semantic information in images and improve vitiligo stage classification. According to the results in Table 4, the model with random masking achieves an accuracy of 92.90%, a precision of 92.7%, a recall of 92.3%, and an F1-score of 92.5%. In contrast, the model with adaptive masking achieves an accuracy of 95.48%, a precision of 95.6%, a recall of 94.5%, and an F1-score of 95.0%. These results indicate that the adaptive mask module is more effective at improving vitiligo stage classification. Using attention mechanisms, the module automatically identifies and focuses on key image areas while ignoring irrelevant or redundant information, allowing the model to concentrate on learning important image features and thereby deepening its understanding of image semantics, as reflected in the improved metrics.

Table 4.

Validation of the effectiveness of the adaptive masking module

Masking Method Loss Accuracy Precision Recall F1-Score
Random Masking 0.025 92.90% 92.7% 92.3% 92.5%
Adaptive Masking 0.005 95.48% 95.6% 94.5% 95.0%

In conclusion, the adaptive mask module has a clear advantage in vitiligo stage classification tasks, especially in improving precision, recall, and F1-score. This strategy provides strong support for enhancing model performance through more refined feature selection and utilization. Its ability to effectively utilize semantic information in images translates to tangible benefits in the accuracy of vitiligo stage classification, clinical decision-making, and, ultimately, patient care. Therefore, the adaptive mask module should be prioritized in practical applications, especially in scenarios requiring efficient use of semantic image information.
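The masking rule itself is not spelled out above; one plausible reading, used in related attention-guided variants of MAE, is to rank patches by their attention scores and preferentially mask the most salient ones, so that reconstruction is forced to model the lesion regions rather than background. A NumPy sketch under that assumption (the ranking rule and mask ratio are ours, not the paper's):

```python
import numpy as np

def adaptive_mask(attn_scores, mask_ratio=0.75):
    """Split patch indices into (masked, visible) sets.

    attn_scores: one attention score per patch (higher = more salient).
    The highest-scoring patches are masked -- an assumption on our part,
    mirroring attention-guided masking variants of MAE.
    """
    n_patches = attn_scores.shape[0]
    n_masked = int(round(n_patches * mask_ratio))
    order = np.argsort(attn_scores)[::-1]   # indices in descending saliency
    return order[:n_masked], order[n_masked:]

# Toy example: 8 patches, mask the 6 most salient ones
scores = np.array([0.9, 0.1, 0.5, 0.8, 0.2, 0.7, 0.3, 0.6])
masked, visible = adaptive_mask(scores, mask_ratio=0.75)
```

Random masking corresponds to replacing the `argsort` with a random permutation; the contrast in Table 4 measures exactly the benefit of this saliency-driven choice.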

Experimental Analysis of Each Modality

The experimental results in Table 5 indicate that when using Modality 1 alone, the model achieves an accuracy of 92.90%, a precision of 92.5%, a recall of 92.0%, and an F1-score of 92.2%, providing a baseline. When using Modality 2 alone, performance improves, with accuracy increasing to 94.19%, precision to 93.8%, recall to 93.5%, and F1-score to 93.6%, indicating that Modality 2 provides more useful information than Modality 1. Furthermore, when using Modalities 1 and 2 together with a simple concatenation fusion strategy, performance improves over Modality 1 alone, with accuracy rising from 92.90% to 93.55%. However, it still falls short of using Modality 2 alone, indicating that simple concatenation is insufficient to fully exploit the complementarity of the two modalities.

Table 5.

Results of ablation study of different modalities

Modality 1 Modality 2 Fusion Module Accuracy Precision Recall F1-score
✓ – – 92.90% 92.5% 92.0% 92.2%
– ✓ – 94.19% 93.8% 93.5% 93.6%
✓ ✓ Concatenation Fusion 93.55% 93.4% 93.1% 93.2%
✓ ✓ Cross-attention Fusion 95.48% 95.6% 94.5% 95.0%

When the interactive fusion module is introduced, the model achieves an accuracy of 95.48%, a precision of 95.6%, a recall of 94.5%, and an F1-score of 95.0%, a significant improvement over either modality alone. This underscores the effectiveness of the interactive fusion module in integrating information across modalities and extracting rich features.
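The interactive (cross-attention) fusion can be sketched as one modality's tokens querying the other's: clinical-image tokens form the queries while Wood's lamp tokens supply keys and values (with a symmetric branch in the other direction). A minimal single-head NumPy sketch; token counts, dimensions, and weight names are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, Wq, Wk, Wv):
    """Single-head cross-attention: tokens of one modality (queries) attend
    to tokens of the other modality (context) and return fused features."""
    q, k, v = queries @ Wq, context @ Wk, context @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (n_query, n_context)
    return attn @ v                                  # one fused row per query token

rng = np.random.default_rng(0)
d = 16                                 # illustrative embedding size
clinical = rng.normal(size=(5, d))     # 5 clinical-image patch tokens
woods = rng.normal(size=(7, d))        # 7 Wood's lamp patch tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
fused = cross_attention(clinical, woods, Wq, Wk, Wv)  # shape (5, 16)
```

Unlike concatenation, each fused token is a learned, input-dependent mixture of the other modality's tokens, which is what lets the module exploit the complementarity that simple concatenation misses.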

In summary, adopting an interactive fusion module is crucial for optimizing model performance in vitiligo stage classification tasks. Fusion can maximize the utilization of data with different modalities, thereby improving vitiligo stage classification accuracy. Therefore, appropriate fusion techniques should be considered to optimize model performance in practical applications.

Comparative Experiments with Mainstream Networks

This study proposes a multimodal small-sample vitiligo stage classification method based on an MAE to fully utilize the complementarity of medical image modalities and mitigate data scarcity. Multiple backbone network models, including MobileNet, DenseNet, VGG, and ResNet-50, along with mainstream Masked Image Modeling (MIM) methods, such as BEIT, MaskFeat, SimMIM, and MAE, were used in the comparative experiments. To ensure fairness, all models were applied to both clinical images and Wood's lamp images: for each model, the clinical and Wood's lamp image data were input separately, and the corresponding model architecture was used to extract features from the images of both modalities.

MaskFeat enhances the model's capture of structural information by predicting Histograms of Oriented Gradients (HOG), possibly at the cost of some fine image detail. SimMIM's decoder regresses pixel values with a single linear layer, which simplifies the model structure but may limit its ability to capture complex features. Using accuracy, precision, recall, and F1-score on the vitiligo stage classification task, these models were thoroughly compared with the proposed model (OURS).

After applying these models to the multimodal data, we obtained their performance on the vitiligo stage classification task; an in-depth analysis follows. Among the earlier backbones, MobileNet, DenseNet, VGG, and ResNet-50 each show characteristic behavior. The experimental results in Table 6 show that MobileNet achieves an accuracy of 90.32%, a precision of 87.60%, a recall of 89.70%, and an F1-score of 92.30%; however, its relatively shallow network and simple convolutional operations limit its ability to capture complex lesion features in vitiligo images. DenseNet achieves an accuracy of 90.97%, a precision of 91.00%, a recall of 90.40%, and an F1-score of 93.20%; nevertheless, its dense connections may introduce redundant information, making the model less efficient at separating relevant from irrelevant features for vitiligo stage classification. VGG, which achieves an accuracy of 91.61%, a precision of 89.90%, a recall of 92.50%, and an F1-score of 93.70%, has a large number of convolutional layers and parameters, which can lead to overfitting, especially on small-scale datasets such as ours. ResNet-50 achieves an accuracy of 92.90%, a precision of 91.62%, a recall of 92.75%, and an F1-score of 92.43%; although it outperforms some of the other early models in accuracy and recall, its residual blocks may not be optimized for the specific features of vitiligo images.

Overall, these early models, while somewhat effective for vitiligo stage classification, struggle to fully capture the complex and diverse features of vitiligo lesions. Their relatively lower performance on certain metrics implies room for improvement in lesion feature extraction and classification accuracy, especially for subtle differences in lesion appearance, which may be attributed to network architectures not optimized for the unique characteristics of vitiligo images.

Table 6.

Comparison with mainstream networks

Research Period Network Accuracy Precision Recall F1-score
Early models MobileNet 90.32% 87.60% 89.70% 92.30%
DenseNet 90.97% 91.00% 90.40% 93.20%
VGG 91.61% 89.90% 92.50% 93.70%
ResNet-50 92.90% 91.62% 92.75% 92.43%
Recent models BEIT [26] 90.97% 91.28% 92.97% 91.84%
MaskFeat [29] 90.97% 92.00% 93.88% 92.93%
SimMIM[30] 91.61% 93.94% 93.00% 93.47%
MAE [22] 92.90% 93.07% 95.92% 94.47%
OURS 95.48% 96.00% 96.97% 96.48%

The recent mainstream Masked Image Modeling (MIM) methods, BEIT, MaskFeat, SimMIM, and MAE, also differ in performance. BEIT achieves 90.97%, 91.28%, 92.97%, and 91.84% in accuracy, precision, recall, and F1-score, respectively, but its performance on vitiligo stage classification is limited by its pre-training on general-purpose image datasets. MaskFeat achieves 90.97%, 92.00%, 93.88%, and 92.93% on these metrics; however, its emphasis on structural information may lead it to neglect other important features. SimMIM achieves 91.61%, 93.94%, 93.00%, and 93.47%, but its simplified decoder may not fully capture the complex spatial and semantic relationships in vitiligo images. MAE achieves 92.90%, 93.07%, 95.92%, and 94.47%; yet its masking strategy is not tailored to vitiligo images and may cause the model to misinterpret some masked regions related to lesion features. Although these MIM-based methods show promise in improving performance, each has its own drawbacks.

Although these methods demonstrate advantages in vitiligo stage classification to varying degrees, the proposed model achieves the best performance on all evaluation metrics, with an accuracy of 95.48%, a precision of 96.00%, a recall of 96.97%, and an F1-score of 96.48%, demonstrating its practicality and potential for wider adoption. Future research will further explore the model's optimization space and consider applying the findings to a broader range of image processing fields.

Limitations and Future Directions

Limitations

This study has several limitations. Regarding data, despite the use of two datasets, limitations exist in terms of small data volume and restricted data modalities: Dataset 2 contains only 838 image pairs, which may not comprehensively cover all vitiligo lesion features and clinical scenarios. Additionally, the model relies exclusively on clinical and Wood’s lamp images, excluding other potentially valuable modal data such as dermoscopic and pathological images. This dual limitation restricts the model’s capacity for comprehensive lesion characterization and generalization across diverse clinical presentations.

Moreover, because the data are sourced solely from Xijing Hospital, the sample covers a limited geographical origin and patient group. This may introduce sample bias, potentially undermining the model's performance when applied to other regions or more diverse patient populations.

Regarding the model, although the adaptive masking module and cross-attention fusion module enhance its performance, the model’s architecture is complex. It demands extensive training time and substantial computational resources, posing challenges for rapid deployment and operation on clinical devices with limited resources.

Future Directions

Future research should place primary emphasis on dataset expansion. This means not only increasing the number of Wood's lamp images but also collecting data from multiple hospitals across different geographical regions. Involving a broader spectrum of patient groups, spanning diverse ethnic backgrounds, ages, and genders, can substantially reduce sample bias. Such expansion will be pivotal in improving the model's generalization, ensuring stable and accurate application across a wide variety of clinical contexts.

Concurrently, efforts should be directed toward model optimization and multimodal data integration. More efficient model architectures are needed: simplifying the model's structure can significantly reduce training time and computational resource consumption, enabling rapid deployment on commonly available clinical devices. Moreover, integrating additional modalities, such as dermoscopic and pathological images, is important; by refining the data fusion strategy, the model can capture lesion information more comprehensively, further improving the accuracy and reliability of vitiligo stage classification.

Conclusions

This study proposes the innovative Multi-MAE framework to enhance the accuracy of medical image classification for vitiligo staging (active and stable stages). First, the approach generates pseudo labels based on intrinsic data characteristics to reduce reliance on multimodal annotated data and employs a pre-training strategy to mitigate the scarcity of multimodal data. Second, an adaptive masking module is designed to optimize the identification of key lesion areas in images. Furthermore, a cross-attention-based multimodal fusion module is proposed to enhance the model's utilization of fused features from clinical and Wood's lamp images. Experiments on vitiligo stage classification validate the effectiveness and superiority of the proposed Multi-MAE framework over mainstream networks. We hope this work provides an effective new strategy for automatic vitiligo stage classification from medical images and opens new perspectives for future research.

Author Contribution

Methodology, F.X., Z.L., S.J., and J.Z.; Data curation, C.L., T.G., and J.C.; Writing, F.X.; Project administration, S.L., J.Z., and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China Key Project under Grant 12126606, in part by the R&D project of Pazhou Lab (Huangpu) under Grant 2023K0605, in part by the Sichuan Provincial Science and Technology Planning Project (No. 23DYF2913), in part by the National Natural Science Foundation of China (Grant No. U2333209), in part by the Chengdu Science and Technology Bureau (Grant No. 2024-YF05-01588-SN), in part by the Deyang Science and Technology Bureau (Grant No. 2021JBJZ007), and in part by the Sichuan University-Zigong School-Land Cooperation Program (Grant No. 2023CDZG-8). The APC was funded by Junran Zhang.

Data Availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Declarations

Ethics Approval

This study was approved by the Institutional Review Board/Ethics Committee of Xijing Hospital, Fourth Military Medical University (KY20202019-C-1). It adhered to the tenets of the Declaration of Helsinki and complied with the Chinese Center for Disease Control and Prevention policy on reportable infectious diseases and the Chinese Health and Quarantine Law.

Consent to Participate

All participants involved in this study have provided their informed consent to participate. They were fully informed about the nature, purpose, procedures, potential risks, and benefits of the study. Participation was voluntary, and participants were assured that they had the right to withdraw from the study at any time without any penalty or loss of benefits to which they were otherwise entitled.

Consent for Publication

All participants have also provided their consent to publish the results of this study. They were informed that the data collected would be anonymized and aggregated to protect their privacy and confidentiality. The publication of this manuscript does not contain any individual person’s data in any form that could identify them.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Ebrahim A, Ibrahiem S, El-fallah A, et al. Vitiligo: A comprehensive review. Benha Journal of Applied Sciences 8(2): 93-99, 2023.
2. Diotallevi F, Gioacchini H, De Simoni E, et al. Vitiligo, from pathogenesis to therapeutic advances: State of the art. International Journal of Molecular Sciences 24(5): 4910, 2023.
3. Mohamed M A E, Elgarhy L H, Elsaka A M, et al. Vitiligo: Highlights on Pathogenesis, Clinical Presentation and Treatment. Journal of Advances in Medicine and Medical Research 35(19): 165-187, 2023.
4. Elsherif R, Mahmoud W A, Mohamed R R. Melanocytes and keratinocytes morphological changes in vitiligo patients. A histological, immunohistochemical and ultrastructural analysis. Ultrastructural Pathology 46(2): 217-235, 2022.
5. Mohmoud Z M, Elsayed S S, Ahmed F M. Psychosocial status and quality of life among vitiligo patients. Benha Journal of Applied Sciences 8(4): 67-79, 2023.
6. AL-smadi K, Imran M, Leite-Silva V R, et al. Vitiligo: A Review of Aetiology, Pathogenesis, Treatment, and Psychosocial Impact. Cosmetics 10(3): 84, 2023.
7. Rao T R, Sharmeen H, Tabassum A, et al. Understanding Vitiligo: Causes, Diagnosis, Promising Advances in Treatment and Management and the Impact on Mental Health. Journal of Drug Delivery and Therapeutics 14(4): 130-137, 2024.
8. Grochocka M, Wełniak A, Białczyk A, et al. Management of stable vitiligo—a review of the surgical approach. Journal of Clinical Medicine 12(5): 1984, 2023.
9. Speeckaert R, Caelenberg E V, Belpaire A, et al. Vitiligo: from pathogenesis to treatment. Journal of Clinical Medicine 13(17): 5225, 2024.
10. Lheureux S, Braunstein M, Oza A M. Epithelial ovarian cancer: evolution of management in the era of precision medicine. CA: A Cancer Journal for Clinicians 69(4): 280-304, 2019.
11. Azam M A, Khan K B, Salahuddin S, et al. A review on multimodal medical image fusion: Compendious analysis of medical modalities, multimodal databases, fusion techniques and quality metrics. Computers in Biology and Medicine 144: 105253, 2022.
12. Rahate A, Walambe R, Ramanna S, et al. Multimodal co-learning: Challenges, applications with datasets, recent advances and future directions. Information Fusion 81: 203-239, 2022.
13. Abdi P, Anthony M R, Farkouh C, et al. Non-invasive skin measurement methods and diagnostics for vitiligo: A systematic review. Frontiers in Medicine 10: 1200963, 2023.
14. Delbaere L, Speeckaert R, Herbelet S, et al. Biomarkers and clinical indicators of disease activity in vitiligo. Dermatological Reviews 3(5): 289-307, 2022.
15. Anbar T S, Atwa M A, Abdel-Aziz R T, et al. Subjective versus objective recognition of facial vitiligo lesions: Detection of subclinical lesions by Wood's light. Journal of the Egyptian Women's Dermatologic Society 19(1): 7-13, 2022.
16. Dyer J M, Foy V M. Revealing the unseen: A review of Wood's lamp in dermatology. The Journal of Clinical and Aesthetic Dermatology 15(6): 25, 2022.
17. Ou C, Zhou S, Yang R, et al. A deep learning based multimodal fusion model for skin lesion diagnosis using smartphone collected clinical images and metadata. Frontiers in Surgery 9: 1029991, 2022.
18. Chen Q, Li M, Chen C, et al. MDFNet: Application of multimodal fusion method based on skin image and clinical data to skin cancer classification. Journal of Cancer Research and Clinical Oncology 149(7): 3287-3299, 2023.
19. Wang Y, Feng Y, Zhang L, et al. Adversarial multimodal fusion with attention mechanism for skin lesion classification using clinical and dermoscopic images. Medical Image Analysis 81: 102535, 2022.
20. Jiao R, Zhang Y, Ding L, et al. Learning with limited annotations: a survey on deep semi-supervised learning for medical image segmentation. Computers in Biology and Medicine 169: 107840, 2024.
21. Upadhyay A K, Bhandari A K. Advances in Deep Learning Models for Resolving Medical Image Segmentation Data Scarcity Problem: A Topical Review. Archives of Computational Methods in Engineering 31(3): 1701-1719, 2024.
22. He K, Chen X, Xie S, et al. Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16000-16009, 2022.
23. Xiao J, Bai Y, Yuille A, et al. Delving into masked autoencoders for multi-label thorax disease classification. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 3588-3600, 2023.
24. Bergqvist C, Ezzedine K. Vitiligo: A focus on pathogenesis and its therapeutic implications. The Journal of Dermatology 48(3): 252-270, 2021.
25. Zhang Q, Wei Y, Han Z, et al. Multimodal Fusion on Low-quality Data: A Comprehensive Survey. arXiv preprint arXiv:2404.18947, 2024.
26. Bao H, Dong L, Piao S, et al. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
27. Goodfellow I J, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. Advances in Neural Information Processing Systems 27: 2672-2680, 2014.
28. Isola P, Zhu J Y, Zhou T, et al. Image-to-image translation with conditional adversarial networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1125-1134, 2017.
29. Wei C, Fan H, Xie S, et al. Masked feature prediction for self-supervised visual pre-training. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14668-14678, 2022.
30. Xie Z, Zhang Z, Cao Y, et al. SimMIM: A simple framework for masked image modeling. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9653-9663, 2022.
