Scientific Reports. 2024 Aug 14;14:18872. doi: 10.1038/s41598-024-69698-5

Visual defect obfuscation based self-supervised anomaly detection

YeongHyeon Park 1,2, Sungho Kang 1, Myung Jin Kim 2, Yeonho Lee 1, Hyeong Seok Kim 2, Juneho Yi 1
PMCID: PMC11325017  PMID: 39143358

Abstract

Due to the scarcity of anomaly situations in the early manufacturing stage, an unsupervised anomaly detection (UAD) approach that uses only normal samples for training is widely adopted. This approach is based on the assumption that the trained UAD model will accurately reconstruct normal patterns but struggle with unseen anomalies. To enhance UAD performance, reconstruction-by-inpainting based methods have recently been investigated, especially regarding the masking strategy for suspected defective regions. However, issues remain to overcome: (1) time-consuming inference due to multiple masking, (2) output inconsistency due to random masking, and (3) inaccurate reconstruction of normal patterns for large masked areas. Motivated by this, this study proposes a novel reconstruction-by-inpainting method, dubbed Excision And Recovery (EAR), that features single deterministic masking based on the ImageNet pre-trained DINO-ViT and visual obfuscation for hint-providing. Experimental results on the MVTec AD dataset show that deterministic masking by pre-trained attention effectively cuts out suspected defective regions and resolves issues 1 and 2. Also, hint-providing by mosaicing proves to enhance performance compared to emptying those regions by binary masking, thereby overcoming issue 3. The proposed approach achieves high performance without any change to the model structure, and promising results are shown through laboratory tests with public industrial datasets. Toward adoption of EAR across industries as a practically deployable solution, future steps include evaluating its applicability in relevant manufacturing environments.

Subject terms: Computational science, Computer science, Electrical and electronic engineering


In the manufacturing industry, ensuring product quality is of paramount importance, and this task can be automated by machine vision systems1,2. Machine vision systems for defective product detection can be implemented with machine learning or deep learning-based models. However, a significant challenge arises from the scarcity of anomaly situations, which leads to an imbalanced dataset during the early stages of manufacturing. In such cases, training an anomaly detection (AD) model under full supervision becomes practically infeasible.

Recognizing this predicament, the manufacturing industry has increasingly turned to the unsupervised anomaly detection (UAD) approach. UAD eases the data imbalance problem because it exploits only the prevalent normal samples for training and does not require any defective samples. The rationale behind this approach hinges on the idea that a well-trained UAD model excels at accurately reconstructing normal patterns but falters when trying to reconstruct unseen anomalous patterns. This is referred to as contained generalization ability3.

Recent years have witnessed substantial research effort aimed at enhancing UAD performance by exploring novel neural network (NN) structures and innovative training strategies. These efforts fall into two main categories: (1) adding a module to existing NNs, such as generative adversarial networks (GAN)7–10 or memory modules11–14, and (2) changing the training strategy, e.g., to online knowledge distillation14–16 or the utilization of synthetic data17–20. These methods successfully improve performance by refining widely adopted mainstream NNs such as U-Net21. However, amid the pursuit of ever-more sophisticated techniques to achieve better performance on specific benchmark datasets, these solutions share a common limitation: increased computational expense from employing large-scale deep NNs.

To avoid the above situations, the reconstruction-by-inpainting approach6,22–28 has been investigated to improve UAD performance without increasing the scale of the NN structure. This approach fundamentally prevents accurate reconstruction of unseen anomalous patterns by making them invisible through masking. However, the following problems remain to be addressed: 1) inference latency due to multiple masking or a progressive inpainting strategy6,25–29, 2) output inconsistency due to random masking6,22–24, and 3) inaccurate reconstruction of normal patterns due to a large mask ratio30,31.

To solve these issues, this study introduces a novel approach to enhance UAD performance based on single deterministic masking. The proposed method, dubbed Excision And Recovery (EAR), features attention-based visual defect obfuscation; that is, suspected defective regions are obfuscated by mosaicing, as shown in Fig. 1. EAR leverages the ImageNet4 pre-trained DINO-ViT5, which is known to emphasize class-specific spatial features. This property is exploited to highlight saliency regions within a given image and excise suspected anomalous regions for inpainting. The deterministic single masking strategy allows fast processing and secures output reliability. Also, the problem of inaccurate reconstruction of normal patterns due to a large masked region is eased by the mosaic hint provided in the masked regions. For this, a proper mosaic scale is estimated for the defective region by leveraging the ratio of principal curvatures of the Hessian matrix, which was also used in the scale-invariant feature transform (SIFT)32 to compute the degree of edge response. Thereby, EAR enhances UAD performance without changing the NN structure at all. The details of these design components are described in the “Methods” section.

Figure 1.

An overview of EAR. EAR takes the reconstruction-by-inpainting approach and is characterized by single deterministic masking and visual obfuscation of masked regions for hint-providing. The orange box shows the excision process, which exploits the ImageNet4 pre-trained DINO-ViT5 to mask suspected defective regions. To promote the reconstruction of the region into a normal form, visually obfuscated information produced by mosaicing is provided as a hint. At this time, the mosaic scale, m, is estimated from the saliency region of the given product image to provide a proper hint. Mosaic obfuscation is performed by average pooling over m×m-pixel patches of the image and upscaling the result to the original scale. The blue box shows the recovery process, which reconstructs the corrupted image Ĩ into Î. Abnormality is decided based on the maximum value of D(I, Î), as shown in the gray box. Note that D(I, Î) is a distance map calculated by multi-scale gradient magnitude similarity (MSGMS)6.

Experimental results with the public industrial visual inspection dataset MVTec AD33 demonstrate that EAR further enhances UAD performance compared to NNs of the same or similar scale. Visual reconstruction results in Figs. 4 and 5 indicate that EAR has the desirable contained generalization ability3 for the UAD task: suspected defective regions that are visually obfuscated are reconstructed accurately when the input pattern is in the seen normal category, while the reconstruction is inaccurate when the input includes unseen anomalous patterns.

Figure 4.

Visual comparison of the results when disabling each design component of EAR: visual obfuscation by mosaicing and saliency masking. The EAR variants used to confirm these effects are the following: 1) EAR is the full model that activates all components; 2) EAR w/o obf does not provide visual obfuscation-based hints on masked regions; 3) EAR w/o attn disables the masking component that exploits the ImageNet4 pre-trained DINO-ViT5. The full EAR model accurately reconstructs normal regions within a defective sample, marked in the yellow box. In contrast, anomalous regions, marked in the red box, are transformed into a normal form and yield a large reconstruction error. The EAR w/o obf and EAR w/o attn cases show inaccurate inpainting compared to EAR. Best viewed in color.

Figure 5.

Visual comparison when the input corruption method, including masking and mosaicing, varies. To visualize the results of RIAD6, we implemented and tested it for each subtask. The RIAD6 results show just one masking case among the multiple disjoint masks and a cumulated error map over multiple inferences; they show large edge errors overall. For the red spot pattern on the pill, EAR w/o obf makes an inpainting mistake, and EAR w/o attn produces scattered errors all over the region. EAR shows accurate reconstruction of normal patterns thanks to saliency masking and hint-providing by visual obfuscation. Best viewed in color.

Overall, the contributions of this study are summarized as follows:

  • The proposed pre-trained spatial attention-based single deterministic masking method advances the state of the art in the reconstruction-by-inpainting approach for UAD, securing both higher throughput and output reliability.

  • The proposed hint-providing strategy by visual obfuscation on masked regions further enhances the UAD performance with the proposed mosaic scale estimation method.

The remainder of this paper is organized as follows. The “Related Works” section reviews the existing literature and recent advancements related to this study, highlighting the gaps and limitations that the proposed approach aims to address. In the “Methods” section, the proposed methods, including saliency mask generation and the visual obfuscation-based hint-providing method, are presented in detail. The “Experiments” section presents the experimental settings and results. The “Discussion” section provides an in-depth analysis of the results, interpreting the findings in the context of the research questions. The main findings and contributions of the paper are summarized in the “Conclusion” section.

Related works

In the manufacturing industry, product quality assurance is automated with machine vision systems1,2. For this, non-AI methods that analyze datasets and build statistical models34, or AI-based online AD methods35,36, can be considered. Furthermore, a UAD method can be adopted considering the scarcity of abnormal situations in an early manufacturing stage3. Among these, the reconstruction-by-inpainting approach6, which effectively improves UAD without changing the NN structure, is covered more specifically. This section briefly reviews related work on UAD techniques and reconstruction-by-inpainting techniques.

Simple but powerful UAD models

There have been efforts to enhance UAD performance based on widely known NNs, such as the auto-encoder (AE) or U-Net21, without changing much of their structure. Among AE variants, MS-CAM37 presents a multi-scale channel attention module with an adversarial learning strategy, GANomaly7 adopts a feature distance loss to perform better normal pattern reconstruction, and SCADN38 performs multi-scale striped masking before feeding the input to the NN. Among U-Net21 variants, DAAD12 includes a block-wise memory module, and RIAD6 proposes the reconstruction-by-inpainting strategy with multiple square-patched disjoint masks. These approaches maximize UAD performance while keeping the scale of the NN relatively small.

A U-Net21 structure is also employed in this work; at the same time, a practically deployable solution is pursued that allows NNs to operate properly in industrial environments, e.g., on edge computing devices.

Reconstruction-by-inpainting methods

UAD based on reconstruction-by-inpainting is an effective self-supervision technique for representation learning that prevents a UAD model from accurately reconstructing unseen anomalous patterns6,22–28. Specifically, methods such as random masking6,22–24, multiple disjoint masking6,25, and progressive inpainting from initial masks25–29 have been developed.

The common limitation of multiple masking and progressive inpainting is inference latency due to the multiple inferences. In addition, the random masking strategy causes output inconsistency when applied to the reconstruction-by-inpainting approach. Thus, to develop a practically deployable solution that ensures real-time defect detection and output reliability, the following should be considered: 1) a deterministic mask generation strategy, 2) minimizing the number of masks, and 3) an immediate inpainting strategy rather than a progressive one.

To meet the above requirements, this study exploits a pre-trained attention model for deterministic single masking. The deterministic single masking strategy allows real-time processing and, at the same time, secures the output reliability.

Hint-providing strategies for masked regions

Studies report that attention-based saliency masking39,40 or non-saliency masking41,42 is more effective and helpful for representation learning. Their intention is to eliminate input information that is unnecessary for the objective, i.e., representation learning or object recognition.

However, since those masking methods empty all the information in the suspected anomalous regions, accurate reconstruction of normal patterns becomes hard, especially when the masked region is large. To ease this situation, an additional strategy that randomly leaves a few patches within the masked saliency areas as hints for reconstruction can be considered39,40. This strategy provides initial information for inpainting the masked regions and supports accurate reconstruction. However, the randomness of this patch-based hint-providing causes the output inconsistency problem.

This study presents a visual obfuscation-based hint-providing scheme to promote the accurate reconstruction of normal patterns.

Methods

Overview

Due to the class imbalance problem stemming from the scarcity of abnormal situations, this study adopts a self-supervised learning strategy to conduct target representation learning of normal samples. An overall schematic diagram of EAR is shown in Fig. 1. The excision stage is composed of two steps. First, a deterministic single saliency mask, S, is generated from the attention map, A, by exploiting the ImageNet4 pre-trained DINO-ViT5. The resulting saliency mask, S, indicates suspected anomalous regions. Then, a mosaic hint is provided in the masked regions for reconstruction. To provide a proper hint by obfuscation, the mosaic scale, m, is estimated from the part of the given product image that corresponds to the saliency region. The recovery result, Î, is obtained by feeding the hinted image Ĩ into the U-Net21. The magnitude of the reconstruction error, specifically the maximum value of the multi-scale gradient magnitude similarity (MSGMS)6 between I and Î, is used to determine whether the product is defective.

Saliency mask generation

This study aims to develop a real-time and reliable solution by avoiding inference latency and output inconsistency. For this, a deterministic saliency masking strategy is proposed that exploits a pre-trained self-attention model. Specifically, the ImageNet4 pre-trained DINO-ViT5, trained with a self-distillation strategy, is used in this study. First, an input image I is fed into the DINO-ViT5, and an attention map, A, is obtained by averaging the [CLS] token attention over the multi-heads of the last layer. Then, the attention map, A, is binarized by thresholding at the upper quartile value μ+0.674σ (Q3)43 of the pixel-wise attention scores to generate a binary saliency mask, S. Referring to the probable error44, values above Q3 are regarded as suspected anomalous regions. This allows the mask to be large enough to cover suspected anomalous regions while keeping it reasonably small.
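
To make this step concrete, a minimal sketch is given below. It is not the authors' released code: it assumes the torch.hub entry point dino_vits8 and the get_last_selfattention helper from the official DINO repository, as well as an ImageNet-normalized input tensor.

```python
import torch
import torch.nn.functional as F

# ImageNet pre-trained DINO ViT-S/8 via torch.hub (official facebookresearch/dino repo).
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
model.eval()

@torch.no_grad()
def saliency_mask(image: torch.Tensor, patch: int = 8) -> torch.Tensor:
    """image: ImageNet-normalized (1, 3, H, W), H and W divisible by `patch`.
    Returns a binary saliency mask S of shape (1, 1, H, W)."""
    _, _, H, W = image.shape
    attn = model.get_last_selfattention(image)      # (1, heads, N+1, N+1)
    cls_attn = attn[0, :, 0, 1:].mean(dim=0)        # [CLS]-to-patch attention, averaged over heads
    A = cls_attn.reshape(1, 1, H // patch, W // patch)
    thr = A.mean() + 0.674 * A.std()                # upper-quartile (Q3) threshold
    S = (A > thr).float()                           # suspected anomalous regions
    return F.interpolate(S, size=(H, W), mode='nearest')
```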

S is used to cut out the suspected anomalous regions in normal samples during training, and the UAD model is optimized to inpaint the empty region. After training, the masked region recovered by the UAD model will match accurately when I does not include any unseen anomalous patterns. However, if the masked region covers anomalous patterns, the UAD model will struggle to recover their original defective form. Therefore, defective products can be effectively detected through their relatively large reconstruction errors.

Obfuscation-based hint for reconstruction

Saliency masking empties the defective information in the suspected defective regions to help transform masked unseen anomalous regions into normal forms. However, not leaving any clue in the masked region can cause inaccurate reconstruction of normal patterns, degrading UAD performance.

This study proposes a hint-providing strategy with visual obfuscation on masked saliency regions for accurate reconstruction of normal patterns. For visual obfuscation, mosaicing of a proper scale, depending on the defect scale, is adopted. For mosaicing, a single representative value within each square patch of m×m pixels is created by average pooling; thus, the mosaic scale is represented by m. Determining a proper mosaic scale is described in the section “Determining mosaic scale”. The average-pooled image is upscaled to the original scale with nearest-neighbor interpolation and combined with the saliency mask to provide the masked regions with the proper mosaic hint, as shown in Fig. 1. When the mosaic method described above is denoted by M, the hint-providing method is expressed as (1), where ⊙ denotes element-wise multiplication and S̄ is the complement of S. The processed image Ĩ is fed into the UAD model for reconstruction.

$$\tilde{I} = M(I) \odot S + I \odot \bar{S} \qquad (1)$$

This mosaicing with the proper mosaic scale visually obfuscates anomalous regions to an extent that helps efficient reconstruction of normal patterns while containing accurate reconstruction of anomalous patterns.
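
A minimal sketch of the hint-providing step in Eq. (1) could look as follows; the tensor shapes and the function name are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def mosaic_hint(image: torch.Tensor, S: torch.Tensor, m: int) -> torch.Tensor:
    """image: (1, 3, H, W) with H, W divisible by m; S: binary saliency mask (1, 1, H, W)."""
    mosaic = F.avg_pool2d(image, kernel_size=m)                     # one value per m x m patch
    mosaic = F.interpolate(mosaic, scale_factor=m, mode='nearest')  # upscale to the original size
    return mosaic * S + image * (1.0 - S)                           # Eq. (1): hint only inside the mask
```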

Determining mosaic scale

Depending on the mosaic scale, the level of detail in the hints provided for masked regions differs. Since the mosaic scale determines the reconstruction quality, it directly affects UAD performance. As the optimal mosaic scale for each product is not known in advance, estimating a proper mosaic scale is necessary to give the best possible hint.

To construct a mosaic scale estimation model, the optimal mosaic scale m is first obtained for each product in the MVTec AD dataset33 through grid search. The pixel-wise edge response, r, is then computed from the ratio of principal curvatures of the Hessian matrix H of the image, as in (2).

$$r = \frac{\operatorname{Tr}(H)^{2}}{\operatorname{Det}(H)}, \qquad H = \nabla^{2} I \qquad (2)$$

To summarize the pixel-wise edge response, only the saliency region in I is leveraged, and the top 10% of the responses are averaged; let us denote this by r10. r10 can be successfully related to m, and their linear relation is shown in Fig. 2; they show a strong correlation. Products with detailed features or rough surfaces give high values of r10, while products with relatively smooth surfaces show low values.

Figure 2.

Linear regression model between r10 and m. The m found by grid search is denoted by blue and red circles over r10, with correlation coefficients of −0.939 and −0.497 for the 10 object subsets and 5 texture subsets, respectively. The linear function f used to estimate m̂ is shown by the green line. m̂ is determined by quantizing f(r10) to the nearest power of 2.

The linear function f of the mosaic scale estimation model is optimized for each object and texture subset. The estimated mosaic scale, m̂, is determined by quantizing f(r10) to the nearest power of 2. In the experiments, m̂ is used for EAR training, and the results using m are also presented to verify the effectiveness of the proposed mosaic scale estimation method.
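
The estimation procedure of Eq. (2) and the power-of-2 quantization can be sketched as below; the coefficients a and b stand in for the fitted regression f and are placeholders, not the published fit.

```python
import numpy as np

def estimate_mosaic_scale(gray: np.ndarray, S: np.ndarray, a: float, b: float) -> int:
    """gray: 2-D grayscale image; S: binary saliency mask; a, b: coefficients of the linear f."""
    gy, gx = np.gradient(gray)                     # first derivatives
    hxx = np.gradient(gx, axis=1)                  # Hessian entries via finite differences
    hxy = np.gradient(gx, axis=0)
    hyy = np.gradient(gy, axis=0)
    tr, det = hxx + hyy, hxx * hyy - hxy ** 2
    r = np.where(det > 0, tr ** 2 / np.where(det > 0, det, 1.0), 0.0)  # Eq. (2), per pixel
    vals = np.sort(r[S > 0])                       # edge responses inside the saliency region
    k = max(1, int(0.1 * vals.size))
    r10 = vals[-k:].mean()                         # average of the top 10%
    f_val = max(2.0, a * r10 + b)                  # linear estimate f(r10), clamped to >= 2
    return 2 ** int(round(np.log2(f_val)))         # quantize to the nearest power of 2
```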

Training objectives

A prior study6 has shown satisfactory results in detecting defects of various sizes by employing MSGMS as in (3). Its training objective also includes L2 (pixel-wise distance) and the structural similarity index measure (SSIM)47, which are widely used for training reconstruction models. EAR inherits the above for training and anomaly scoring. In MSGMS, the number of scales, N, is set to 3.

$$L_{\mathrm{msgms}}(I,\hat{I}) = \sum_{n=1}^{N}\left(1 - \frac{2\,g(I_{n})\,g(\hat{I}_{n}) + c}{g(I_{n})^{2} + g(\hat{I}_{n})^{2} + c}\right) \qquad (3)$$

$$L_{\mathrm{comb}} = \lambda_{2} L_{2} + \lambda_{\mathrm{ssim}} L_{\mathrm{ssim}} + \lambda_{\mathrm{msgms}} L_{\mathrm{msgms}} \qquad (4)$$

The three loss terms, L2, Lssim, and Lmsgms, are combined with the weights λ as in (4). Then, the loss transformation method LAMP3 is applied to (4). LAMP3 is known to enhance UAD performance solely by amplifying the loss during training. In addition, it can be applied to any UAD training process because it does not depend on NN structures or preprocessing methods. The final loss function for training EAR is (5).

$$L_{\mathrm{comb}}^{\mathrm{LAMP}}(I,\hat{I}) = -\log\left(1 - L_{\mathrm{comb}}(I,\hat{I})\right) \qquad (5)$$
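
A hedged sketch of the objective in Eqs. (3)–(5) is given below, assuming single-channel tensors, Prewitt-filter gradient magnitudes for g(·), and λ weights that keep the combined loss in [0, 1); the stability constant c is illustrative.

```python
import torch
import torch.nn.functional as F

def grad_mag(x: torch.Tensor) -> torch.Tensor:
    """Gradient magnitude g(.) via Prewitt filters; x: (B, 1, H, W)."""
    kx = torch.tensor([[[[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]]]], device=x.device) / 3.0
    ky = kx.transpose(-1, -2)
    gx, gy = F.conv2d(x, kx, padding=1), F.conv2d(x, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)

def msgms_loss(I, I_hat, N=3, c=0.0026):
    """Eq. (3): sum over N scales of the mean (1 - GMS)."""
    loss = 0.0
    for _ in range(N):
        gi, gr = grad_mag(I), grad_mag(I_hat)
        gms = (2 * gi * gr + c) / (gi ** 2 + gr ** 2 + c)
        loss = loss + (1 - gms).mean()
        I, I_hat = F.avg_pool2d(I, 2), F.avg_pool2d(I_hat, 2)  # next, coarser scale
    return loss

def lamp(comb: torch.Tensor) -> torch.Tensor:
    """Eq. (5): LAMP amplification; assumes comb stays in [0, 1)."""
    return -torch.log((1.0 - comb).clamp_min(1e-6))
```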

Experiments

Experimental setup

To evaluate the performance of EAR, the public industrial visual inspection dataset, MVTec AD33, is used for the experiments. MVTec AD33 provides a total of 15 subtasks with 10 objects and 5 textures. Each training set for these tasks only provides normal, anomaly-free samples. The test set includes both normal and defective samples.

Implementation details The proposed model simply inherits a well-known U-Net21-like structure as the reconstruction model for the experiments. Specifically, a U-Net is constructed as in RIAD6. The reconstruction model is structured with five convolutional blocks for the encoder and decoder respectively, and the i-th encoder block is concatenated with the (5−i)-th decoder block. For the encoder, a 'convolution → batch normalization → leaky ReLU activation' sequence is repeated three times per block; for the decoder, upsampling is followed by the same sequence repeated three times per block. Note that the stride is set to 2 in the third layer of each encoder block for spatial downscaling, and upsampling with scale factor 2 and nearest-neighbor interpolation is applied in the first layer of each decoder block.
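
The encoder and decoder blocks described above might be sketched as follows; the channel widths, the LeakyReLU slope, and the omission of the skip concatenations between the i-th encoder and (5−i)-th decoder blocks are simplifications for illustration.

```python
import torch.nn as nn

def enc_block(cin: int, cout: int, k: int = 3) -> nn.Sequential:
    """Three conv-BN-LeakyReLU layers; the third conv uses stride 2 for downscaling."""
    layers = []
    for i in range(3):
        stride = 2 if i == 2 else 1
        layers += [nn.Conv2d(cin if i == 0 else cout, cout, k, stride, k // 2),
                   nn.BatchNorm2d(cout), nn.LeakyReLU(0.2)]
    return nn.Sequential(*layers)

def dec_block(cin: int, cout: int, k: int = 3) -> nn.Sequential:
    """x2 nearest upsampling in the first layer, then three conv-BN-LeakyReLU layers."""
    layers = [nn.Upsample(scale_factor=2, mode='nearest')]
    for i in range(3):
        layers += [nn.Conv2d(cin if i == 0 else cout, cout, k, 1, k // 2),
                   nn.BatchNorm2d(cout), nn.LeakyReLU(0.2)]
    return nn.Sequential(*layers)
```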

To activate EAR, a pre-trained attention model is required. Many variants of pre-trained ViT are publicly available; EAR adopts one of the state-of-the-art models, ViT-S/8, provided in the official GitHub repository, dino (https://github.com/facebookresearch/dino), published by Caron et al.5. This ViT is known to emphasize class-specific spatial features, which enables unsupervised object segmentation. This property is leveraged to emphasize saliency regions within a given image and cut them out for inpainting.

Mosaic scale estimation This work includes an optimal mosaic scale estimation process. To construct a mosaic scale estimation model, the ground-truth optimal mosaic scale m for each product is needed; these values are initially found by grid search. The grid search results are given in Fig. 3, and the linear regression model for mosaic scale estimation is shown in Fig. 2. A summary of the mosaic scale estimation for each product is given in Table 1. As can be seen in Table 1, m and m̂ match for most objects, but there are mismatches for the textures. Accordingly, the AD performance using m̂ is in general clearly lower than that using m for these cases. The experiments are conducted with m̂ to train a UAD model for each subtask; the UAD results from m are also shown for comparison.

Figure 3.

Grid search results for finding the optimal mosaic scale m for hint-providing. Overall, a larger mosaic scale appears advantageous for obtaining a higher AUROC. However, for products with detailed patterns, such as capsule, pill, and screw, a moderately small mosaic scale is recommended. The correlation between the visual characteristics of the product and the optimal mosaic scale is shown in Fig. 2.

Table 1.

Summary of the optimal mosaic scale m and the estimated mosaic scale m̂.

| Product (objects) | m | m̂ | Product (textures) | m | m̂ |
|---|---|---|---|---|---|
| Bottle | 32 | 32 | Carpet | 64 | 16 |
| Cable | 32 | 16 | Grid | 32 | 64 |
| Capsule | 8 | 8 | Leather | 64 | 32 |
| Hazelnut | 64 | 64 | Tile | 2 | 8 |
| Metal nut | 32 | 16 | Wood | 8 | 2 |
| Pill | 4 | 4 | | | |
| Screw | 8 | 4 | | | |
| Toothbrush | 32 | 32 | | | |
| Transistor | 64 | 64 | | | |
| Zipper | 2 | 4 | | | |

m is found by grid search, and m̂ is the estimated mosaic scale determined by quantizing f(r10) to the nearest power of 2. The linear regression function f for each object and texture subset is shown in Fig. 2.

Training conditions Hyperparameter tuning is performed in all UAD experiments for a fair comparison of each model in its best performance condition. The tuned hyperparameters are: 1) kernel size, 2) learning rate, and 3) learning rate scheduling method. As learning rate scheduling methods, fixed learning rates, learning rate warm-up45, and SGDR46 are used. The hyperparameter values are summarized in Table 2.

Table 2.

Summary of the tuned hyperparameters and their values.

| Hyperparameter | Values |
|---|---|
| Kernel size (k) | 3 and 5 |
| Learning rate (η) | 1e−3, 1e−4, and 1e−5 |
| Learning rate scheduling | Fixed, warm-up45, and SGDR46 |

The optimal combination of hyperparameters is explored in a grid search manner.
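
An illustrative sketch of this grid search is given below; train_and_eval is a hypothetical helper that trains one configuration and returns its AUROC.

```python
from itertools import product

grid = {
    'kernel_size': [3, 5],
    'learning_rate': [1e-3, 1e-4, 1e-5],
    'schedule': ['fixed', 'warmup', 'sgdr'],
}
best_cfg, best_auroc = None, -1.0
for values in product(*grid.values()):
    cfg = dict(zip(grid.keys(), values))
    auroc = train_and_eval(**cfg)   # hypothetical train/eval routine for one configuration
    if auroc > best_auroc:
        best_cfg, best_auroc = cfg, auroc
```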

Evaluation metric To evaluate the performance of the UAD experiments, the area under the receiver operating characteristic curve (AUROC)48 is used. The AUROC is measured based on the anomaly scores for the normal and defective samples within the test set. For anomaly scoring, this study adopts the maximum value of MSGMS between the input I and the reconstruction-by-inpainting result Î of the UAD model, which is capable of detecting defects of various sizes6. When the MSGMS values of the UAD model for unseen anomalous patterns are relatively larger than those for normal patterns, the AUROC will be close to 1.
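
Image-level scoring and AUROC computation can be sketched as follows, assuming dist_maps holds one MSGMS distance map D(I, Î) per test image and labels holds the binary ground truth (1 = defective).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Anomaly score per test image: the maximum value of its MSGMS distance map.
scores = [float(np.max(d)) for d in dist_maps]
auroc = roc_auc_score(labels, scores)   # labels: 1 = defective, 0 = normal
```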

Visual comparison of reconstruction

Visual comparisons of reconstruction results are presented to verify whether the reconstruction of normal patterns is accurate when the proposed method, EAR, is applied. First, as can be seen in Fig. 4, the effect of disabling either of the key design components, saliency masking and visual obfuscation by mosaicing, can be checked. EAR w/o obf disables visual obfuscation for hint-providing; that is, it empties the saliency region without any hint. EAR w/o attn does not utilize the ImageNet4 pre-trained DINO-ViT5 to cut out suspected anomalous regions; thus, it reconstructs the whole image under obfuscation. The results for each model are shown under the best hyperparameter conditions. EAR w/o obf makes an inpainting mistake on the background region marked with a red box, and EAR w/o attn struggles to reconstruct the normal region within the defective sample. The full model, EAR, accurately reconstructs normal regions within a defective sample, marked in the yellow box. In addition, the red-boxed anomalous regions are successfully transformed into a normal form without inpainting mistakes. These reconstruction results confirm that saliency masking and mosaic obfuscation for hint-providing complement each other and play an essential role in achieving contained generalization ability3.

In Fig. 5, visual comparisons of the EAR variants with RIAD6 are presented. RIAD6 features reconstruction by inpainting with multiple disjoint random masks and provides a cumulated error map. The reconstruction result shown for RIAD6 is just one case among the multiple maskings. The overall edge error (MSGMS) is large for RIAD6 because many random patch masks fall on the edges of the object. EAR w/o obf shows inaccurate reconstruction of the normal region due to the large binary mask; especially in the defective pill case, binary masking causes confusion as to whether the empty space should be filled with a red dot pattern or white color. For EAR w/o attn, scattered minute errors are produced all over the region because of the spatial discontinuity of the input image caused by the mosaic. In contrast, EAR accurately reconstructs normal patterns by leveraging the hint provided by mosaic obfuscation, specifically, the logo and digit printing of the capsule. It also successfully transforms the scratched white digit printing ‘500’ into a normal form.

Anomaly detection performance

EAR is trained with anomaly-free samples only. Then, the AUROC is measured for each subtask with the MSGMS-based anomaly scoring method. The measured performance is summarized in Table 3. As this study proposes a strategy to maximize performance without changing the NN structure, the performance is compared with recent studies that use NNs of the same or similar scale.

Table 3.

Summary of the AUROC for the MVTec AD dataset33.

| Product | MS-CAM37 | GANomaly7 | SCADN38 | MemAE11 | U-Net21 | DAAD12 | RIAD6 | EAR (proposed) |
|---|---|---|---|---|---|---|---|---|
| Backbone | AE | AE | AE | AE | U-Net | U-Net | U-Net | U-Net |
| Additional module | Att | Dis | Dis | Mem | – | Dis & Mem | – | – |
| Bottle | 0.940 | 0.892 | 0.957 | 0.930 | 0.863 | 0.976 | 0.999 | 0.997 (0.997) |
| Cable | 0.880 | 0.732 | 0.856 | 0.785 | 0.636 | 0.844 | 0.819 | 0.853 (0.871) |
| Capsule | 0.850 | 0.708 | 0.765 | 0.735 | 0.673 | 0.767 | 0.884 | 0.870 (0.870) |
| Carpet | 0.910 | 0.842 | 0.504 | 0.386 | 0.774 | 0.866 | 0.842 | 0.850 (0.899) |
| Grid | 0.940 | 0.743 | 0.983 | 0.805 | 0.857 | 0.957 | 0.996 | 0.952 (0.959) |
| Hazelnut | 0.950 | 0.794 | 0.833 | 0.769 | 0.996 | 0.921 | 0.833 | 0.997 (0.997) |
| Leather | 0.950 | 0.792 | 0.659 | 0.423 | 0.870 | 0.862 | 1.000 | 1.000 (1.000) |
| Metal nut | 0.690 | 0.745 | 0.624 | 0.654 | 0.676 | 0.758 | 0.885 | 0.856 (0.876) |
| Pill | 0.890 | 0.757 | 0.814 | 0.717 | 0.781 | 0.900 | 0.838 | 0.922 (0.922) |
| Screw | 1.000 | 0.699 | 0.831 | 0.257 | 1.000 | 0.987 | 0.845 | 0.779 (0.886) |
| Tile | 0.800 | 0.785 | 0.792 | 0.718 | 0.964 | 0.882 | 0.987 | 0.918 (0.965) |
| Toothbrush | 1.000 | 0.700 | 0.891 | 0.967 | 0.811 | 0.992 | 1.000 | 1.000 (1.000) |
| Transistor | 0.880 | 0.746 | 0.863 | 0.791 | 0.674 | 0.876 | 0.909 | 0.947 (0.947) |
| Wood | 0.940 | 0.653 | 0.968 | 0.954 | 0.958 | 0.982 | 0.930 | 0.946 (0.985) |
| Zipper | 0.910 | 0.834 | 0.846 | 0.710 | 0.750 | 0.859 | 0.981 | 0.949 (0.955) |
| Average | 0.902 | 0.761 | 0.812 | 0.707 | 0.819 | 0.895 | 0.917 | 0.922 (0.942) |

NNs are structured with the simple, well-known reconstruction backbones AE and U-Net21. For EAR, AUROCs are shown for the two cases of m̂ and m, in m̂ (m) form. ‘Att’, ‘Dis’, and ‘Mem’ abbreviate attention module, discriminator, and memory module, respectively.

EAR achieves the best performance in the hazelnut, pill, and transistor cases compared to the other models. The common characteristic of defective samples in these subtasks is surface damage, which EAR can recover into a normal form. For capsules, screws, and zippers, which show sophisticated features, the AUROC is relatively low compared to the highest performance. This is because the detailed pattern alignment of the screw thread or the zipper teeth may be slightly missed in reconstruction due to saliency masking and visual obfuscation of the suspected anomalous regions. In the case of a screw, the mask created by DINO-ViT5 tends to cover the entire object; that is, the screw becomes an entirely obfuscated object. Since the screw has multiple dense threads, EAR struggles to reconstruct an anomaly-free screw due to the difficulty of identifying the starting point of the threads, which affects the UAD performance on normal cases. In the case of the wood texture, a slight UAD performance degradation is observed due to poor mosaic scale estimation. In contrast to the other four textures, the wood texture is composed of mostly small irregular texture segments interspersed among some large irregular segments, which leads to a large r10 value that falls in the outlier region relative to the other texture subsets. The simple linear regression model we opted for in this study for computational efficiency does not perfectly estimate m̂ for this outlier, and the UAD performance is slightly degraded.

In conclusion, EAR achieves AUROCs of 0.922 and 0.942 using m̂ and m, respectively. The performance of 0.922 with m̂ indicates that EAR with mosaic scale estimation enhances UAD performance beyond the prior state-of-the-art models. Although the proposed method has some challenging cases in providing hints for objects with irregular patterns (wood) or very dense patterns (screw), the overall performance gain is beneficial. We will continue to investigate obfuscation improvements, such as coordinate-adaptive obfuscation, together with exploring the tradeoff between mosaic scale estimation accuracy and computational efficiency.

Training and inference speed

The measured processing times are summarized in Table 4. RIAD6 takes the longest time for both training and inference due to its multiple masking strategy; for reference, the number of masks used in RIAD6 is set to 12, as suggested in the original paper. In Table 4, EAR w/o attn shows the fastest speed because it does not involve generating saliency maps through a pre-trained attention model, the ImageNet4 pre-trained DINO-ViT5. EAR w/o obf and EAR are somewhat slower than EAR w/o attn because they generate saliency maps via the pre-trained attention model; they still show 2.35× and 1.86× faster inference than RIAD6, respectively. EAR offers an inference speed fast enough for real-time processing together with the highest UAD performance.

Table 4.

Processing time for training and inference.

| Model | Training (s) | Inference (ms) |
|---|---|---|
| RIAD6 | 35,478 | 366 |
| EAR w/o obf | 3084 | 156 |
| EAR w/o attn | 3078 | 37 |
| EAR | 3109 | 197 |

In inference, EAR w/o attn is the fastest model; EAR also shows a sufficiently fast inference speed, 1.86× faster than RIAD6.

Ablation study

An ablation study has been conducted to see how deterministic saliency masking and obfuscation by mosaicing for hint-providing affect UAD performance. In addition, this experiment checks the effect of applying the knowledge distillation (KD) method that is part of SQUID14, which uses two identical NNs as teacher and student, respectively, during the training stage.

The results of the ablation study are summarized in Table 5. The EAR w/o obf column shows that high performance is difficult to achieve because binary masking empties all the information in the suspected defective regions, causing inaccurate reconstruction of both normal and anomalous patterns. On the other hand, the EAR w/o attn column confirms that obfuscation by mosaicing achieves better UAD performance than RIAD6 and EAR w/o obf because of the relatively accurate reconstruction of normal patterns, which reduces false positives. See also the visual comparisons of these cases in Fig. 5.

Table 5.

Summary of the ablation study.

| Model | RIAD6 | EAR w/o obf | EAR w/o attn | EAR + KD14 | EAR (proposed) |
|---|---|---|---|---|---|
| Masking | ✓ (multi) | ✓ | – | ✓ | ✓ |
| Hint | – | – | ✓ | ✓ | ✓ |
| KD14 | – | – | – | ✓ | – |
| Bottle | 0.999 | 0.995 | 1.000 | 0.994 (0.995) | 0.997 (0.997) |
| Cable | 0.819 | 0.795 | 0.888 | 0.851 (0.855) | 0.853 (0.871) |
| Capsule | 0.884 | 0.784 | 0.918 | 0.869 (0.869) | 0.870 (0.870) |
| Carpet | 0.842 | 0.848 | 0.718 | 0.846 (0.880) | 0.850 (0.899) |
| Grid | 0.996 | 0.969 | 0.963 | 0.976 (0.976) | 0.952 (0.959) |
| Hazelnut | 0.833 | 0.986 | 0.996 | 0.992 (0.996) | 0.997 (0.997) |
| Leather | 1.000 | 1.000 | 1.000 | 1.000 (1.000) | 1.000 (1.000) |
| Metal nut | 0.885 | 0.832 | 0.841 | 0.868 (0.868) | 0.856 (0.876) |
| Pill | 0.838 | 0.738 | 0.867 | 0.870 (0.873) | 0.922 (0.922) |
| Screw | 0.845 | 0.800 | 0.825 | 0.776 (0.854) | 0.779 (0.886) |
| Tile | 0.987 | 0.928 | 0.939 | 0.956 (0.956) | 0.918 (0.965) |
| Toothbrush | 1.000 | 0.994 | 1.000 | 1.000 (1.000) | 1.000 (1.000) |
| Transistor | 0.909 | 0.891 | 0.943 | 0.895 (0.933) | 0.947 (0.947) |
| Wood | 0.930 | 0.904 | 0.945 | 0.986 (0.995) | 0.946 (0.985) |
| Zipper | 0.981 | 0.900 | 0.963 | 0.951 (0.961) | 0.949 (0.955) |
| Average | 0.917 | 0.891 | 0.920 | 0.922 (0.934) | 0.922 (0.942) |

Except for RIAD6, all the other masking cases employ pre-trained attention-based deterministic single saliency masking. The rightmost two columns report the performance of the visual defect obfuscation methods when using m̂ and m, in m̂ (m) form.

This experiment confirms that UAD performance is further improved when the mosaic is used for hint-providing, as shown in the last column, which is the full EAR. Note that the full EAR model exploits both hint-providing and pre-trained attention-based saliency masking.

When the KD strategy from SQUID14 is additionally applied (the EAR + KD column in Table 5), there is almost no change in performance. Considering the expensive training cost of KD due to the use of two identical NNs (one each for teacher and student), there is no advantage in additionally employing KD for EAR.

Experiment on novel products

An additional experiment is conducted to check whether the linear regression function f obtained from the MVTec AD dataset33 works properly with EAR on another dataset. This experiment uses KolektorSDD249, which contains surface images of electrical commutators. The models used for comparison are f-AnoGAN8, AE-SSIM50, and RIAD6. f-AnoGAN8 is trained with both reconstruction and encoding errors. AE-SSIM50 is a case where Lssim is used instead of L2 in AE training. RIAD6 features the reconstruction-by-inpainting strategy with multiple square-patched disjoint masks. The experiment additionally compares with DRAEM17 and SGSF18. These models employ larger NNs, constructed from one AE plus a U-Net for DRAEM and two U-Nets for SGSF, and both adopt a self-supervised learning strategy for defect segmentation with synthetic anomalous samples.

The results are summarized in Table 6. The measured r10 of KolektorSDD249 is 2.18; m̂ for KolektorSDD249 is then determined as 64 by quantizing f(r10) of the texture subset to the nearest power of 2. The AUROC of EAR is higher than those of the other three UAD models, f-AnoGAN8, AE-SSIM50, and RIAD6, which use NNs of the same or similar scale. EAR also achieves a higher AUROC than DRAEM17 and SGSF18. For other mosaic scales, the changes in AUROC are shown in Table 7. The best AUROC is achieved at m = 32, while m̂ = 64, set by the proposed mosaic scale estimation, shows almost equivalent performance. This suggests that the proposed mosaic scale estimation method is legitimate.

Table 6.

The UAD performance for KolektorSDD249.

| Model | f-AnoGAN8 | AE-SSIM50 | RIAD6 | DRAEM17 | DRAEM17 + SSPCAB51 | SGSF18 | EAR (proposed) |
|---|---|---|---|---|---|---|---|
| Backbone | AE | AE | U-Net | AE + U-Net | AE + U-Net | U-Net × 2 | U-Net |
| Add-module | Dis. | – | – | Synt. | Synt., SSPCAB51 | Synt. | – |
| AUROC | 0.550 | 0.789 | 0.703 | 0.811 | 0.834 | 0.863 | 0.941 |

For EAR, the AUROC is measured with m̂. ‘Dis.’ and ‘Synt.’ abbreviate discriminator and synthetic data utilization for training, respectively.

Table 7.

The UAD performance for various mosaic scales.

| m | 64 (m̂) | 32 (m) | 16 | 8 | 4 | 2 |
|---|---|---|---|---|---|---|
| AUROC | 0.941 | 0.945 | 0.921 | 0.861 | 0.777 | 0.729 |

When the estimated mosaic scale m̂ = 64 is applied, the performance is almost equivalent to that with the best mosaic scale m = 32, which is found by grid search.

The reconstruction results of the two models, RIAD6 and EAR, are shown in Fig. 6. Due to the spatial discontinuity of the square-patched disjoint masks, RIAD6 shows large edge errors overall. EAR shows more accurate normal pattern reconstruction, although there are spotted errors due to misalignment of small dot patterns in the visually obfuscated regions. Similar results were observed in the MVTec AD33 cases.

Figure 6.

Visual comparison of RIAD6 and EAR on the KolektorSDD2 dataset49. This study reproduced and trained RIAD6 for this visualization. The RIAD6 results show only one masking case among the multiple disjoint masks; RIAD6 accumulates error over multiple inferences, which was also observed in the MVTec AD33 experiments. EAR shows relatively accurate reconstruction of normal patterns compared to RIAD6 by exploiting saliency masking and hint-providing by visual obfuscation with mosaic scale estimation. Best viewed in color.

In conclusion, these experimental results suggest that EAR effectively estimates the optimal mosaic scale for a novel product and achieves high UAD performance with computational efficiency. Future research efforts will aim at better reconstruction of detailed information in normal patterns, such as the red dot pattern seen earlier in the pill cases of MVTec AD33 or the rough surface pattern in KolektorSDD249.

Discussion

This work seeks a real-time and reliable solution that avoids inference latency and output inconsistency. With the recent interest in recycling pre-trained attention mechanisms for edge computing36 in various manufacturing industries, EAR, which features an attention-based deterministic single masking strategy with the pre-trained DINO-ViT5, is presented. EAR greatly shortens training and inference time compared to the recent state-of-the-art RIAD6 and can be employed in a plug-and-play manner. The advancement of the masking strategy from random multiple masking6 to single deterministic masking deserves attention in the relevant research community. Also, the proposed optimal visual obfuscation by mosaic scale prediction helps achieve the desired contained generalization ability.

However, despite the performance advantages of EAR mentioned earlier, future steps should deal with the issue of mask coverage: the ImageNet4 pre-trained DINO-ViT5 might fail to completely cover the suspected defective regions in all cases. To address this issue, the relationship between industrial datasets and various pre-trained attention models trained on different datasets5,36,52,53 needs to be explored. Another option for the pre-trained attention model is WinCLIP54, an attention model that finds anomalous regions via prompt guides designed for zero-shot anomaly detection. Through this investigation and subsequent analysis, further research will be conducted to mitigate the possible missing-mask issue when leveraging pre-trained attention for the suspected defect masking strategy.

Conclusion

This study proposes a novel self-supervised learning strategy, EAR, to enhance UAD-purposed reconstruction-by-inpainting models. EAR effectively exploits the ImageNet4 pre-trained DINO-ViT5 to generate a deterministic single saliency mask that cuts out suspected anomalous regions. EAR also provides the best possible hint for reconstruction through visual obfuscation with proper mosaic scale estimation. EAR not only ensures the reliability of the output via the deterministic masking and hint-providing strategy but also achieves fast inference via single masking. Moreover, UAD performance is enhanced because the hint-providing strategy promotes accurate reconstruction of normal patterns and effective translation of anomalous patterns into a normal form.

In laboratory tests with public industrial datasets, EAR shows promising results and is distinguished from other methods by enhancing UAD performance with computational efficiency. In future steps, further evaluation in practical manufacturing environments will be conducted to confirm its practical deployability.

Acknowledgements

We are grateful to SK Planet Co., Ltd., for providing the equipment for the experiment. We also thank all the members of the Computer Vision Lab at Sungkyunkwan University.

Author contributions

Y.P. and J.Y. designed the experiment, analyzed the data, and wrote the manuscript. Y.P. performed the experiments. S.K., M.J.K., and Y.L. reviewed the experimental results and manuscript. H.S.K. conceptualized the study and provided resources for the experiments. J.Y. supervised the entire study and provided critical feedback. All authors have reviewed and provided feedback on the draft manuscript.

Data availability

The MVTec AD dataset is available from the MVTec (https://www.mvtec.com/company/research/datasets/mvtec-ad). Also, the KolektorSDD2 dataset is available from the Visual Cognitive Systems Laboratory (https://www.vicos.si/resources/kolektorsdd2/). Interested researchers can freely access these datasets.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Weimer, D., Scholz-Reiter, B. & Shpitalni, M. Design of deep convolutional neural network architectures for automated feature extraction in industrial inspection. CIRP Ann. 65, 417–420 (2016). doi: 10.1016/j.cirp.2016.04.072
  • 2. Agnisarman, S., Lopes, S., Chalil Madathil, K., Piratla, K. & Gramopadhye, A. A survey of automation-enabled human-in-the-loop systems for infrastructure visual inspection. Autom. Constr. 97, 52–76 (2019). doi: 10.1016/j.autcon.2018.10.019
  • 3. Park, Y. et al. Neural network training strategy to enhance anomaly detection performance: A perspective on reconstruction loss amplification. Proc. 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5165–5169 (2024).
  • 4. Deng, J. et al. ImageNet: A large-scale hierarchical image database. Proc. 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255 (2009).
  • 5. Caron, M. et al. Emerging properties in self-supervised vision transformers. Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 9650–9660 (2021).
  • 6. Zavrtanik, V., Kristan, M. & Skočaj, D. Reconstruction by inpainting for visual anomaly detection. Pattern Recogn. 112, 107706 (2021). doi: 10.1016/j.patcog.2020.107706
  • 7. Akcay, S., Atapour-Abarghouei, A. & Breckon, T. P. GANomaly: Semi-supervised anomaly detection via adversarial training. Proc. Asian Conference on Computer Vision, 622–637 (2019).
  • 8. Schlegl, T., Seeböck, P., Waldstein, S. M., Langs, G. & Schmidt-Erfurth, U. f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Med. Image Anal. 54, 30–44 (2019). doi: 10.1016/j.media.2019.01.010
  • 9. Park, Y., Park, W. S. & Kim, Y. B. Anomaly detection in particulate matter sensor using hypothesis pruning generative adversarial network. ETRI J. 43, 511–523 (2021). doi: 10.4218/etrij.2020-0052
  • 10. Tang, Y., Tang, Y., Zhu, Y., Xiao, J. & Summers, R. M. A disentangled generative model for disease decomposition in chest x-rays via normal image synthesis. Med. Image Anal. 67, 101839 (2021). doi: 10.1016/j.media.2020.101839
  • 11. Gong, D. et al. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. Proc. of the IEEE/CVF International Conference on Computer Vision, 1705–1714 (2019).
  • 12. Hou, J. et al. Divide-and-assemble: Learning block-wise memory for unsupervised anomaly detection. Proc. of the IEEE/CVF International Conference on Computer Vision, 8791–8800 (2021).
  • 13. Kim, D., Park, C., Cho, S. & Lee, S. FAPM: Fast adaptive patch memory for real-time industrial anomaly detection. Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 1–5 (2023).
  • 14. Xiang, T. et al. SQUID: Deep feature in-painting for unsupervised anomaly detection. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 23890–23901 (2023).
  • 15. Deng, H. & Li, X. Anomaly detection via reverse distillation from one-class embedding. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9737–9746 (2022).
  • 16. Song, K., Xie, J., Zhang, S. & Luo, Z. Multi-mode online knowledge distillation for self-supervised visual representation learning. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11848–11857 (2023).
  • 17. Zavrtanik, V., Kristan, M. & Skočaj, D. DRAEM: A discriminatively trained reconstruction embedding for surface anomaly detection. Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 8330–8339 (2021).
  • 18. Xing, P., Sun, Y. & Li, Z. Self-supervised guided segmentation framework for unsupervised anomaly detection. arXiv preprint arXiv:2209.12440 (2022).
  • 19. Zavrtanik, V., Kristan, M. & Skočaj, D. DSR: A dual subspace re-projection network for surface anomaly detection. Proc. Computer Vision, ECCV 2022, 539–554 (2022).
  • 20. Guo, Y., Jiang, M., Huang, Q., Cheng, Y. & Gong, J. MLDFR: A multilevel features restoration method based on damaged images for anomaly detection and localization. IEEE Trans. Ind. Inf. 20, 2477–2486 (2023). doi: 10.1109/TII.2023.3292904
  • 21. Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention, MICCAI 2015 (eds Navab, N. et al.), 234–241 (2015).
  • 22. de Nardin, A., Mishra, P., Piciarelli, C. & Foresti, G. L. Bringing attention to image anomaly detection. Proc. Image Analysis and Processing, ICIAP 2022 Workshops, 115–126 (2022).
  • 23. Jiang, J. et al. Masked Swin Transformer UNet for industrial anomaly detection. IEEE Trans. Ind. Inf. 19, 2200–2209 (2023). doi: 10.1109/TII.2022.3199228
  • 24. Lang, D. M., Schwartz, E., Bercea, C. I., Giryes, R. & Schnabel, J. A. 3D masked autoencoders with application to anomaly detection in non-contrast enhanced breast MRI. arXiv preprint arXiv:2303.05861 (2023).
  • 25. Huang, C., Xu, Q., Wang, Y., Wang, Y. & Zhang, Y. Self-supervised masking for unsupervised anomaly detection and localization. IEEE Trans. Multimed. 25, 4426–4438 (2022). doi: 10.1109/TMM.2022.3175611
  • 26. Nakanishi, H., Suzuki, M. & Matsuo, Y. Fixing the train-test objective discrepancy: Iterative image inpainting for unsupervised anomaly detection. J. Inf. Process. 30, 495–504 (2022).
  • 27. Pirnay, J. & Chai, K. Inpainting transformer for anomaly detection. Proc. Image Analysis and Processing, ICIAP 2022, 394–406 (2022).
  • 28. Bercea, C. I., Neumayr, M., Rueckert, D. & Schnabel, J. A. Mask, stitch, and re-sample: Enhancing robustness and generalizability in anomaly detection through automatic diffusion models. Proc. ICML 3rd Workshop on Interpretable Machine Learning in Healthcare (IMLH) (2023).
  • 29. Li, Z. et al. Superpixel masking and inpainting for self-supervised anomaly detection. Proc. 31st British Machine Vision Conference (BMVC) (2020).
  • 30. He, K. et al. Masked autoencoders are scalable vision learners. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16000–16009 (2022).
  • 31. Li, T. et al. MAGE: Masked generative encoder to unify representation learning and image synthesis. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2142–2152 (2023).
  • 32. Lowe, D. G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60, 91–110 (2004). doi: 10.1023/B:VISI.0000029664.99615.94
  • 33. Bergmann, P., Fauser, M., Sattlegger, D. & Steger, C. MVTec AD: A comprehensive real-world dataset for unsupervised anomaly detection. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019).
  • 34. Beruvides, G., Quiza, R., del Toro, R., Castaño, F. & Haber, R. E. Correlation of the holes quality with the force signals in a microdrilling process of a sintered tungsten-copper alloy. Int. J. Precis. Eng. Manuf. 15, 1801–1808 (2014). doi: 10.1007/s12541-014-0532-5
  • 35. Beruvides, G., Quiza, R., Rivas, M., Castaño, F. & Haber, R. E. Online detection of run out in microdrilling of tungsten and titanium alloys. Int. J. Adv. Manuf. Technol. 74, 1567–1575 (2014). doi: 10.1007/s00170-014-6091-1
  • 36. Park, Y., Kim, M. J., Gim, U. & Yi, J. Boost-up efficiency of defective solar panel detection with pre-trained attention recycling. IEEE Trans. Ind. Appl. 59, 3110–3120 (2023). doi: 10.1109/TIA.2023.3255227
  • 37. Li, X., Zheng, Y., Chen, B. & Zheng, E. Dual attention-based industrial surface defect detection with consistency loss. Sensors 22, 5141 (2022). doi: 10.3390/s22145141
  • 38. Yan, X., Zhang, H., Xu, X., Hu, X. & Heng, P.-A. Learning semantic context from normal samples for unsupervised anomaly detection. Proc. of the AAAI Conference on Artificial Intelligence 35, 3110–3118 (2021).
  • 39. Kakogeorgiou, I. et al. What to hide from your students: Attention-guided masked image modeling. Proc. Computer Vision, ECCV 2022, 300–318 (2022).
  • 40. Liu, Z., Gui, J. & Luo, H. Good helper is around you: Attention-driven masked image modeling. Proc. of the AAAI Conference on Artificial Intelligence 37, 1799–1807 (2023).
  • 41. Bozorgtabar, B. & Mahapatra, D. Attention-conditioned augmentations for self-supervised anomaly detection and localization. Proc. of the AAAI Conference on Artificial Intelligence 37, 14720–14728 (2023).
  • 42. Sim, M., Lee, J. & Choi, H.-J. Attention masking for improved near out-of-distribution image detection. Proc. 2023 IEEE International Conference on Big Data and Smart Computing (BigComp), 195–202 (2023).
  • 43. Tukey, J. W. et al. Exploratory Data Analysis Vol. 2 (Reading, 1977).
  • 44. Dodge, Y. The Oxford Dictionary of Statistical Terms (Oxford University Press, 2003).
  • 45. Goyal, P. et al. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017).
  • 46. Loshchilov, I. & Hutter, F. SGDR: Stochastic gradient descent with warm restarts. Proc. International Conference on Learning Representations (2017).
  • 47. Wang, Z., Bovik, A., Sheikh, H. & Simoncelli, E. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 13, 600–612 (2004). doi: 10.1109/TIP.2003.819861
  • 48. Fawcett, T. An introduction to ROC analysis. Pattern Recogn. Lett. 27, 861–874 (2006). doi: 10.1016/j.patrec.2005.10.010
  • 49. Božič, J., Tabernik, D. & Skočaj, D. Mixed supervision for surface-defect detection: From weakly to fully supervised learning. Comput. Ind. 129, 103459 (2021). doi: 10.1016/j.compind.2021.103459
  • 50. Bergmann, P., Löwe, S., Fauser, M., Sattlegger, D. & Steger, C. Improving unsupervised defect segmentation by applying structural similarity to autoencoders. Proc. of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (2019).
  • 51. Ristea, N.-C. et al. Self-supervised predictive convolutional attentive block for anomaly detection. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13576–13586 (2022).
  • 52. Lee, H., Park, Y. & Yi, J. Enhancing defective solar panel detection with attention-guided statistical features using pre-trained neural networks. Proc. 2024 IEEE International Conference on Big Data and Smart Computing (BigComp), 219–225 (2024).
  • 53. Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. Proc. International Conference on Learning Representations (2021).
  • 54. Jeong, J. et al. WinCLIP: Zero-/few-shot anomaly classification and segmentation. Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 19606–19616 (2023).
