DB-EAC and LSTR: DBnet based seal text detection and Lightweight Seal Text Recognition

Baohua Huang; Aokun Bai; Yuqiong Wu; Chanjuan Yang; Han Sun

doi:10.1371/journal.pone.0301862

PLoS One. 2024 May 16;19(5):e0301862.

Editor: Nouman Ali

PMCID: PMC11098430 PMID: 38753628

Abstract

Recognizing the key text of Chinese seals can speed up document approval and improve the office efficiency of enterprises and government administrative departments. Because seal images are often blurred or occluded, the accuracy of Chinese seal recognition is low, and real datasets are very limited. To address these problems, we improve the differentiable binarization detection algorithm (DBNet) to construct a text region detection model, DB-ECA, and propose a text recognition model named LSTR (Lightweight Seal Text Recognition). An efficient channel attention module is added to the differentiable binarization network to resolve feature pyramid conflicts, and the convolutional layer structure is improved to delay downsampling and reduce semantic feature loss. LSTR uses a lightweight CNN that is better suited to small-sample generalization, and dynamically fuses positional and visual information through a self-attention-based inference layer to predict the label distribution of feature sequences in parallel. The inference layer not only compensates for the weak discriminative power of the shallow CNN layers, but also helps CTC (Connectionist Temporal Classification) align feature regions with target characters accurately. In experiments on our self-built dataset, DB-ECA outperforms five other commonly used detection models, with the best precision, recall, and F-measure of 90.29, 85.17, and 87.65, respectively. LSTR achieves the highest accuracy, 91.29%, among the five recognition models from the last three years that it is compared with, while also having few parameters and fast inference. The experimental results demonstrate the effectiveness of the proposed models.

1. Introduction

A seal is stamped on a document as a mark of identification or signature. Because seals are simple to produce and visually distinctive, they are widely used on documents issued by governments, enterprises, and other organizations. Although production methods and styles differ between countries, most seals have legal effect and occupy an important position in administrative work. Accurately extracting the key information from seals makes it possible to organize and classify documents efficiently and save office time, which gives it great application value. With the application of deep learning to recognition tasks, character recognition has developed rapidly [1, 2]. The detection and recognition of regular printed fonts and horizontal documents already achieve high accuracy, but text recognition in complex and irregular scenes remains very challenging.

Unlike conventional text recognition, seal text recognition faces the following problems: 1) The text is arranged along a curve, the style of the outer ring is not uniform, background text tends to occlude the seal text, and seals often contain complex patterns, all of which increase the difficulty of seal recognition. 2) Because seal data are sensitive and public datasets are lacking, very few real-scene images are available for model training, which leads to poor detection and recognition performance and low application value.

Seal detection and recognition generally consists of preprocessing, text area detection, and text content recognition. Preprocessing includes filtering and denoising, grayscale conversion, edge extraction, and other methods that improve text detection and recognition. Although Chinese seal recognition has made great progress in the past few years, most research focuses on image preprocessing; little attention has been paid to optimizing the text detection and recognition models, and the existing models and datasets are based on low-noise scenes. Because some seals in practical applications are blurred or deformed, these models have low accuracy and generalize poorly. Complex preprocessing also increases the workload and difficulty of deploying a model in real scenes. Some seal detection and recognition models use off-the-shelf end-to-end networks such as ABINet [3] and PseNet [4] as the detector, outputting both the position of the text and the recognized content. Despite their efficiency, their accuracy is not competitive with two-stage models.

Current research on Chinese seal recognition focuses on preprocessing operations such as color contrast enhancement and unfolding and rearranging the circular text, that is, exploiting the seal's characteristic red color by emphasizing the R channel, and brightening and denoising the text region. Ma et al. [5] denoised seal images through preprocessing, used the RGB model to remove the brightness of light-colored parts, repaired parts damaged by contamination, and highlighted the red color that is the key part of the seal, verifying for the first time the role of preprocessing in seal recognition. Yao et al. [6] found that the HSI color model performs better than RGB; they extracted SIFT features from the seal under test, searched the seal library for matching seals to obtain coarse matching results, used the RANSAC algorithm to remove incorrectly matched points, and adjusted the size and angle of the matched seal to improve recognition accuracy. Cai et al. [7] normalized color seals, extracted the color of the seal clay to simplify the image, calculated the magnitude spectrum of the image's FFT, and then constructed the feature matrix corresponding to the structure of the fitness ring, improving the detection rate on images with complex backgrounds and heavy noise. Zhang et al. [8] used a flood-fill (diffuse filling) algorithm to enhance the features of the seal image, and then segmented the image by grouping regions with similar pixel gray values to ensure seal detection accuracy. Xiao et al. [9] used polar coordinate expansion to unify the direction of the seal text, and Bézier curves to fit the upper and lower boundaries of the text region, improving the accuracy of seal region detection. In addition, lightweight CNNs designed by combining cross-stage feature fusion with attention mechanisms [10, 11] can also improve accuracy for small-target detection and recognition.

Seal detection and recognition belongs to the field of natural scene text recognition, which is generally divided into text region detection and text content recognition. For text region detection, DRRG [12] proposes a unified relational reasoning graph network for detecting text of arbitrary shape. It first generates text proposals with a convolutional neural network (CNN), dividing each text instance into a series of rectangular components, and then uses a graph convolutional network (GCN) to perform deep relational reasoning over them. Combined with the text proposal model, the network can be trained end-to-end. FCENet [13] models text instances in the Fourier domain and uses Fourier contour embedding to generate more accurate detection boxes for arbitrarily shaped text contours. TextMountain [14] treats the text center as the mountaintop and the text border as the mountain foot, so that detecting each pixel resembles following a path to the top. The model makes full use of the abundant relationships in the text information to help text instances find the text center. DBNet [15] focuses on improving segmentation results by incorporating differentiable binarization into training. The optimized segmentation network can set the binarization threshold adaptively, which not only simplifies post-processing, but also improves the accuracy and inference speed of text region detection. TCM [16] uses the CLIP model directly for text detection without a pre-training process. It improves the ability of existing methods to train with fewer samples and significantly improves the performance of baseline methods.

Methodologically, scene text recognition can be viewed as a cross-modal mapping from images to character sequences. A recognition algorithm usually consists of two modules: a visual module for feature extraction and a sequence module for text transcription. For example, CRNN [17] uses a CNN to extract visual features, feeds them into a recurrent BiLSTM layer to extract sequence features, and obtains predictions through CTC loss modeling, which can handle sequences of arbitrary length. GTC [18] adds an attention guidance mechanism to CTC to better learn character alignment and feature representation, achieving accurate prediction for both regular and irregular scene text while maintaining fast inference. The advantages of this type of algorithm are high accuracy and a simple model, and it is still preferred by some commercial recognizers. However, its use of contextual semantics is weak, and its performance is poor on blurred, curved, deformed, or occluded text.

Encoder-decoder based approaches became popular after the ViT model [19] introduced the Transformer into the vision domain. NRTR [20] proposes a non-recurrent end-to-end text recognizer that relies solely on the self-attention mechanism to extract image features and model sequences, dispensing with recurrence and convolution, so it can be trained with more parallelism and lower complexity. Since scene images vary greatly in text and background, a modality-transform block is further designed to convert 2D input images to 1D sequences efficiently, and is combined with the encoder to extract more discriminative features. SRN [21] mines semantic information to aid text recognition, introducing a global semantic reasoning module that takes global semantic context into account, which is more robust and more efficient than unidirectional serial semantic transfer. Global semantic context is propagated through multiple parallel paths, combining visual and semantic information more effectively. MGP-STR [22] builds a conceptually simple but powerful visual STR model based on ViT, covering both purely visual models and language enhancement methods, and further proposes a multi-granularity prediction strategy that improves performance by implicitly injecting linguistic information into the model. SVTR [23] uses a single visual model and dispenses with sequence modeling: it first decomposes the image text into small patches called character components, and designs hybrid global and local blocks to perceive inter-character and intra-character coarse-grained features. Characters are then recognized by a simple linear prediction, which is competitive in both inference speed and accuracy. DeepSolo [24] introduces a text matching criterion to provide more accurate supervision signals, allowing a single decoder with explicit points to perform text detection and recognition simultaneously for more efficient training. Encoder-decoder models achieve better accuracy because they take contextual information into account, but inference is slower because of character-by-character transcription. Such models essentially rely on stacked self-attention to learn character-associated features, lack the inductive bias of convolution, and require more training data than the LSTR model. Because seals are confidential, however, large public datasets are lacking, so it is difficult to train an effective recognition model of this kind for seal text.

To solve the above problems, this paper uses a two-stage model for seal text detection and recognition. The segmentation-based differentiable binarization text detection algorithm is naturally suited to curved text and achieves competitive results without complex preprocessing. For recognizing the text in the candidate boxes, we improve the convolutional recurrent neural network by introducing an inference layer (inference block) that improves the fusion of contextual features and pays more attention to the correlation between characters. On the real dataset, high training efficiency and accuracy are maintained even for blurred, distorted, and occluded fonts. First, the seal image is fed into the backbone network ResNet [25] to extract visual features; its BottleNeck layer is improved to postpone downsampling and reduce the loss of image feature information. Second, the feature sequence is fed into the feature pyramid network, to which the efficient channel attention (ECA) module [26] is added; ECA uses 1D convolution for cross-channel interaction to further improve the diversity of the feature sequence. The binary map and threshold map are inferred by fusing multi-scale visual features through feature cascading to determine the location of the detection box. Third, the detection box region is fed into the convolutional layer, and the extracted visual features are passed to the self-attention-based inference layer. Parallel computation with multi-head attention makes sequence label prediction faster and strengthens the learning of character-granularity features, which effectively improves the accuracy of the text recognition model. Finally, CTC transcription computes the sum of the probabilities of all possible paths for each label sequence, with a blank symbol inserted between repeated characters in the text labels to determine whether consecutive identical characters are merged. In this way, the label distributions output by the inference layer are aligned in sequence to obtain the final recognition result.

The main contributions of this paper are as follows:

A two-stage model (text detection followed by text recognition) is adopted, which is easier to optimize module by module than an end-to-end approach. The convolutional layer of the differentiable binarization detection algorithm is improved by delaying downsampling to obtain richer semantic features.
An efficient channel attention module is introduced into the differentiable binarization detection algorithm, which allows the model to interact across channels, enhances its ability to detect multi-scale targets, and speeds up convergence.
A self-attention-based inference layer is proposed to construct LSTR, which improves the ability to learn multi-granularity character features, reduces the waste of training resources, and enhances generalization on noisier datasets.

2. Text detection and recognition model

2.1 Overall structure

Our proposed method can be divided into two parts: (1) we construct the detection model DB-ECA from the segmentation-based differentiable binarization algorithm, which is suitable for curved text, and improve it, reducing the preprocessing workload and streamlining the model structure; (2) we propose a lightweight seal recognition model, LSTR, which uses the inductive bias of CNNs together with self-attention to achieve high recognition accuracy on blurred, deformed, and other irregular Chinese seals without collecting a large real seal dataset. The overall structure is shown in Fig 1.

2.2 DB-ECA

Our DB-ECA network structure is shown in Fig 2. First, visual features are extracted from the image by ResNet and fed to the main stem of the feature pyramid, which downsamples them successively by factors of two to generate feature maps of different sizes. Each level is then upsampled by a factor of two and fused with the feature map of the previous level, merging deep semantic features with shallow image features and thus capturing multi-scale visual features. Second, the fused multi-scale feature maps are resampled to a quarter of the size of the original image and cascaded to obtain F. Finally, the probability map P and the threshold map T are predicted from F, and the approximate binary map B is computed from P and T. During training, P, T, and B are all supervised, with P and B sharing the same supervision signal (label); during inference, only P or B is needed to obtain the text box.

The backbone of the DB detection algorithm is ResNet. The original BottleNeck module in ResNet uses a 1*1 convolution to adjust the number of channels, batch normalization to accelerate convergence and improve generalization, and ReLU to alleviate vanishing gradients and promote the transfer of feature information. However, BottleNeck performs downsampling with a 1*1 convolution kernel and a stride of 2, which skips part of the feature map and causes partial semantic loss. Because CNN weights are shared, the degraded feature sequence continues to be propagated forward through the residual structure, which hinders the extraction of richer visual features and degrades DBNet performance. Therefore, downsampling is postponed to the second step, the 3*3 convolution, whose stride is changed from 1 to 2, and the downsampling in the shortcut is performed by an average pooling layer instead of the 1*1 convolution kernel. Because the kernel width is larger than the stride, the kernel covers all positions of the feature map as it moves, so more pixels contribute to backpropagation. The improved BottleNeck structure is shown in Fig 3(A).

Fig 3 — (A) BottleNeck structure diagram (B) ECA structure diagram.
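The coverage argument for delayed downsampling can be illustrated with a small 1D sketch. The helper `positions_read` is hypothetical and only counts which input positions a strided kernel touches; it is not part of the model code.

```python
def positions_read(n, kernel, stride, pad):
    """Return the set of input indices (0..n-1) touched by a 1D convolution."""
    n_out = (n + 2 * pad - kernel) // stride + 1
    read = set()
    for o in range(n_out):
        start = o * stride - pad
        for i in range(start, start + kernel):
            if 0 <= i < n:  # ignore positions that fall in the zero padding
                read.add(i)
    return read

n = 8
original = positions_read(n, kernel=1, stride=2, pad=0)  # 1*1 kernel, stride 2
delayed = positions_read(n, kernel=3, stride=2, pad=1)   # 3*3 kernel, stride 2

print(sorted(original))  # [0, 2, 4, 6]
print(sorted(delayed))   # [0, 1, 2, 3, 4, 5, 6, 7]
```

Along one axis the 1*1 stride-2 kernel reads only every second position, so in 2D it reads one pixel in four, whereas the 3*3 stride-2 kernel reads every position.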

After extracting visual features from the input image, we use a 1*1 convolution kernel for downsampling, output feature maps scaled by multiples of 2, perform upsampling and feature fusion, and finally unify the maps to a quarter of the original image size for feature cascading. However, if a pixel is treated as a positive sample in the downsampling layer and as background in the upsampling cascade, features at different levels conflict. Such conflicts occupy the main part of the feature pyramid network (FPN) and interfere with gradient computation during training. To improve the performance of the FPN, an efficient channel attention module, ECA, is introduced to fuse multi-scale features before feature cascading, learn the weight coefficients of different channels, and enhance the ability of the feature pyramid to predict multi-scale targets. The structure of the ECA network is shown in Fig 3(B).

ECA adopts a local cross-channel interaction strategy without dimensionality reduction, which avoids the negative effect of dimensionality reduction on channel attention learning, strengthens the correlation between channels, and automatically learns multi-scale features. First, a feature map of dimension H*W*C is input. Second, global average pooling (GAP) compresses the feature map along the spatial dimensions, without reducing the channel dimension, to obtain an aggregated 1*1*C channel descriptor. Third, a 1D convolution of kernel size k is applied to the compressed descriptor so that each channel interacts with its k neighbors (local cross-channel interaction). Because the kernel size determines the receptive field, ECA chooses k adaptively from the number of channels instead of tuning it manually through cross-validation; k determines how many neighboring channels take part in the attention prediction of a channel. Finally, the resulting 1*1*C channel weights are multiplied with the original H*W*C input feature map to output the feature map with channel attention.
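The four steps above can be sketched in plain Python on nested lists. This is only an illustration under assumptions: the function name `eca`, the fixed kernel weights, and the sigmoid gate (taken from the original ECA design rather than stated in the text) are not the authors' code, and in a trained network the kernel weights are learned.

```python
import math

def eca(feature_map, weights):
    """Apply ECA-style channel attention.

    feature_map: list of C channels, each an H x W list of lists.
    weights: 1D convolution kernel of odd length k (hypothetical fixed values).
    """
    C = len(feature_map)
    k = len(weights)
    half = k // 2
    # 1) Global average pooling: one scalar per channel (the 1*1*C descriptor).
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
              for ch in feature_map]
    # 2) 1D convolution across channels with zero padding:
    #    each channel interacts only with its k nearest neighbors.
    mixed = []
    for c in range(C):
        acc = 0.0
        for j in range(k):
            idx = c + j - half
            if 0 <= idx < C:
                acc += weights[j] * pooled[idx]
        mixed.append(acc)
    # 3) Sigmoid turns each response into a channel weight in (0, 1).
    gates = [1.0 / (1.0 + math.exp(-m)) for m in mixed]
    # 4) Rescale every pixel of each channel by its channel weight.
    out = [[[v * g for v in row] for row in ch]
           for ch, g in zip(feature_map, gates)]
    return out, gates

# Toy input: 4 channels of size 2 x 2, channel c filled with the value c.
x = [[[float(c)] * 2 for _ in range(2)] for c in range(4)]
out, gates = eca(x, weights=[1 / 3, 1 / 3, 1 / 3])
print(gates)
```

The output keeps the H*W*C shape of the input; only the per-channel scale changes.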

The kernel size is adapted to the number of channels by the function defined in Eq (1):

k = \psi(C) = \left| \frac{\log_{2}(C)}{\gamma} + \frac{b}{\gamma} \right|_{odd}

(1)

where k denotes the kernel size, i.e., the coverage of local cross-channel interaction, or how many neighbors take part in the attention prediction of a channel; C denotes the number of channels; and the subscript odd indicates that k can only take an odd value. γ and b are set to 2 and 1, respectively, and control the mapping between the number of channels C and the kernel size k. The DB feature pyramid structure with efficient channel attention is shown in Fig 4.
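As a quick worked example of Eq (1) with γ = 2 and b = 1, the sketch below evaluates k for a few channel counts. The rounding rule (truncate, then move even values up to the next odd number) is an assumption that follows the common ECA reference implementation; the text only states that k must be odd.

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive kernel size k = |log2(C)/gamma + b/gamma|_odd."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    # Assumed rounding rule: truncate, then bump even values to the next odd.
    return t if t % 2 == 1 else t + 1

for c in (64, 128, 256, 512):
    print(c, eca_kernel_size(c))
# 64 3
# 128 5
# 256 5
# 512 5
```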

The ECA module is added before feature cascading (CONCAT). The range of local cross-channel interaction is determined by the kernel size k and is implemented efficiently by 1D convolution, so that the model extracts the key character features in the image effectively and wastes fewer training resources. The experimental results show that ECA captures cross-channel interaction in a lightweight manner and improves the accuracy and robustness of the detection model.

When computing the approximate binary map from P and T, standard binarization is not differentiable, so its gradient cannot be backpropagated and it cannot be used directly in training. DBNet therefore uses a differentiable approximate binarization function, calculated as:

\hat{B}_{i, j} = \frac{1}{1 + e^{- k (P_{i, j} - T_{i, j})}}

(2)

$\hat{B}$ is the approximate binary map, and T is the adaptive threshold map learned by the network. k is the amplifying factor, empirically set to 50 to increase the gradient and speed up convergence. i, j denote the pixel coordinates. The loss functions for positive and negative samples are defined as:

l_{+} = - \log \frac{1}{1 + e^{- k x}}

(3)

l_{-} = - \log \left(1 - \frac{1}{1 + e^{- k x}}\right)

(4)

Writing $f(x) = \frac{1}{1 + e^{- k x}}$ with $x = P_{i, j} - T_{i, j}$, the partial derivatives of the two losses with respect to the input x are:

\frac{\partial l_{+}}{\partial x} = - k f (x) e^{- k x}

(5)

\frac{\partial l_{-}}{\partial x} = k f (x)

(6)
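Eqs (2) to (6) can be checked numerically with a short sketch. It assumes f(x) = 1/(1 + e^{-kx}) with x = P - T and k = 50, and compares the closed-form derivatives with central finite differences; the function names are illustrative only.

```python
import math

K = 50  # amplifying factor k, set to 50 as in the text

def f(x, k=K):
    """Differentiable step f(x) = 1 / (1 + exp(-k x)), with x = P - T (Eq 2)."""
    return 1.0 / (1.0 + math.exp(-k * x))

def loss_pos(x, k=K):
    return -math.log(f(x, k))                # Eq (3)

def loss_neg(x, k=K):
    return -math.log(1.0 - f(x, k))          # Eq (4)

def grad_pos(x, k=K):
    return -k * f(x, k) * math.exp(-k * x)   # Eq (5)

def grad_neg(x, k=K):
    return k * f(x, k)                       # Eq (6)

def numeric_grad(fn, x, h=1e-6):
    """Central finite difference, used only to cross-check the formulas."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

for x in (-0.05, 0.0, 0.05):
    print(x,
          grad_pos(x), numeric_grad(loss_pos, x),
          grad_neg(x), numeric_grad(loss_neg, x))
```

At x = 0 both gradients have magnitude k/2 = 25, which shows how k scales the gradient near the decision boundary P = T.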

k is the gradient gain factor: a larger k increases the gradient magnitude and speeds up convergence, which favors optimization and yields better final segmentation results. The total loss function is defined in Eq (7):

L = L_{s} + α \times L_{b} + β \times L_{t}

(7)

L_s is the loss of the probability map P, and L_b is the loss of the binary map B; both use binary cross-entropy (BCE). L_t is the loss of the threshold map. α and β are set to 1 and 10, respectively. The cross-entropy is calculated as in Eq (8):

L_{s} = L_{b} = - \sum_{i \in S_{l}} \left[ y_{i} \log x_{i} + (1 - y_{i}) \log (1 - x_{i}) \right]

(8)

where S_l is the sampled set in which positive and negative samples are drawn at a ratio of 1:3. The formula for L_t is shown in Eq (9):

L_{t} = \sum_{i \in R_{d}} | y_{i}^{*} - x_{i}^{*} |

(9)

R_d is the set of pixels inside the region G obtained by dilating the labeled box by the offset D, and $y_{i}^{*}$ denotes the label of the threshold map.
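The total loss of Eq (7) can be sketched on flat lists of pixel values. This is a simplified illustration: the binary cross-entropy is written in its standard negated form, the 1:3 hard negative sampling and the restriction of L_t to R_d are omitted, and all names and toy values are hypothetical.

```python
import math

def bce(preds, labels, eps=1e-7):
    """Binary cross-entropy summed over the given pixels (Eq 8)."""
    total = 0.0
    for x, y in zip(preds, labels):
        x = min(max(x, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(x) + (1 - y) * math.log(1 - x))
    return total

def l1(preds, labels):
    """Sum of absolute differences on the threshold map (Eq 9)."""
    return sum(abs(y - x) for x, y in zip(preds, labels))

def db_loss(prob, binary, gt, thresh, thresh_gt, alpha=1.0, beta=10.0):
    """Total loss L = L_s + alpha * L_b + beta * L_t (Eq 7)."""
    return bce(prob, gt) + alpha * bce(binary, gt) + beta * l1(thresh, thresh_gt)

# Toy values for four pixels (illustrative only).
gt = [1, 1, 0, 0]
prob = [0.9, 0.7, 0.2, 0.1]
binary = [0.99, 0.95, 0.05, 0.01]
thresh = [0.3, 0.4, 0.3, 0.3]
thresh_gt = [0.3, 0.3, 0.3, 0.3]
print(db_loss(prob, binary, gt, thresh, thresh_gt))
```

With β = 10, an error of 0.1 on a single threshold pixel already contributes 1.0 to the total, which is why the threshold term carries substantial weight.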

2.3 Lightweight Seal Text Recognition model

The Lightweight Seal Text Recognition (LSTR) model consists of three parts. First, visual features are extracted from the image by CNN convolutional layers. Second, the inference block performs sequence modeling and prediction. Finally, CTC transcribes the recognition result, converting the frame-wise feature sequence output by the inference layer into a label sequence. Experiments show that LSTR has fewer parameters and faster inference than ViT-based recognition models, and is also advantageous in accuracy. The LSTR flow is shown in Fig 4.

Compared with a CNN, a ViT backbone is larger and has many more parameters, which leads to higher memory usage and wasted training resources. ViT also has difficulty learning low-level information from small-sample datasets, and the number of layers that can be stacked is limited. Therefore, ResNet is used instead as the convolutional layer for extracting visual features from seal images.

When the image sequence fed to the inference layer is very long, it incurs considerable computation time and cost. CRNN addresses the exploding and vanishing gradient problem of RNNs by introducing LSTM, but when memory is limited, cross-sample batching is restricted, and sequences of 10N or longer remain difficult to handle. Transformer-based methods often lack positional encoding, which makes it difficult to align feature regions with target objects accurately, and they are computationally expensive.

Inspired by the application of BiFormer [27] in vision, we use a bi-level routing attention mechanism that first filters out most of the irrelevant key-value pairs at the coarse-grained region level and then applies token-to-token attention to the small set of remaining relevant tokens. Because attention is not spent on irrelevant tokens, this provides good performance and computational efficiency. We construct the inference layer from stacked multi-head self-attention. The number of operations needed to relate two distant text positions does not grow with their distance, and inter-character visual and positional features are processed in parallel. The self-attention mechanism [28] continuously learns intra- and inter-character multi-granularity features, producing a more interpretable model that generalizes better to irregular text. The inference layer network structure is shown in Fig 5.

First, the visual features are fed into the Multi-Head Attention module to find the associations between characters; different weights are assigned to different strokes and features are extracted at different scales, so that multi-granularity character features can be perceived. Second, Dropout randomly deactivates some neurons to prevent overfitting, and layer normalization stabilizes the feature distribution, accelerates convergence, and alleviates vanishing gradients. Third, the MLP block applies a nonlinear transformation to the input sequence to increase the expressive power of the model, so that relationships between character vectors are captured better. Finally, the output of the MLP block is normalized to improve the prediction accuracy of CTC sequence modeling.

In essence, multi-head attention combines the information learned by the individual attention heads, which captures correlations between distant characters more effectively and filters character features. The single-head attention formula is shown in Eq (10):

\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V

(10)

Q is the query matrix (what is being looked for), K is the key matrix (the information being matched against), and V is the value matrix (the content retrieved). First, the dot product of Q and the transpose of K measures the correlation between the two matrices, i.e., the similarity of the two feature sequences. Second, the result is divided by $\sqrt{d_{k}}$ (the square root of the key dimension) to keep the gradient stable during training, and normalized with Softmax to obtain a weight matrix. Third, the weights are applied to V so that the output focuses on the stroke characteristics of the characters, yielding the attention feature vector.

Single-head attention has difficulty learning global feature weights and can establish only one dependency pattern between Q and K. To capture dependencies over various ranges within the sequence, different stroke features are learned in different subspaces: multi-head attention repeats the attention computation in multiple parallel heads, with different heads processing different information. The multi-head attention formulas are shown in Eqs (11) and (12):

\mathrm{head}_{i} = \mathrm{Attention}(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V})

(11)

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_{1}, \dots, \mathrm{head}_{h}) W^{O}
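The three steps of Eq (10) map directly onto a few lines of code. The following is a minimal sketch for small list-of-lists matrices, written for readability rather than speed; the toy matrices are illustrative.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention (Eq 10) on list-of-lists matrices."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Step 1 and 2: similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Step 3: weighted sum of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```

Each output row is a convex combination of the rows of V, weighted toward the value whose key best matches the query.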

(12)

$W_{i}^{Q}$, $W_{i}^{K}$, and $W_{i}^{V}$ denote the weight matrices of the i-th head for Q, K, and V, respectively. Attention denotes the attention computation, and head_i is the output of the i-th attention head. W^O is the output weight matrix applied to the concatenated outputs of the h heads. In the calculation, the attention result head_i of each head is obtained first; the results are then concatenated and multiplied by the weight matrix W^O to obtain the multi-head attention output.

Finally, the predictions made by the inference block for each feature vector are converted into a label sequence through CTC. A blank symbol "-" is introduced and inserted between repeated characters in a text label, so that genuinely repeated characters are not merged with consecutive duplicate predictions of the same character. The conditional probability on which the CTC loss is built is defined in Eq (13):
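Eqs (11) and (12) can be sketched as follows. The projection matrices here are hypothetical selection matrices that route different feature dimensions to different heads, and W^O is the identity; in a trained model all of these are learned.

```python
import math

def matmul(A, B):
    """Plain matrix product for list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def attention(Q, K, V):
    """Scaled dot-product attention (Eq 10)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        s = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(s)
        e = [math.exp(v - m) for v in s]
        total = sum(e)
        w = [v / total for v in e]
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def multi_head(Q, K, V, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) projection triples, one per head."""
    head_outputs = [attention(matmul(Q, wq), matmul(K, wk), matmul(V, wv))
                    for wq, wk, wv in heads]                          # Eq (11)
    # Concatenate per-head outputs along the feature axis, then apply W_O (Eq 12).
    concat = [sum((h[t] for h in head_outputs), []) for t in range(len(Q))]
    return matmul(concat, W_O)

# Three tokens with four features each (toy values).
X = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
sel_a = [[1, 0], [0, 1], [0, 0], [0, 0]]   # head 1 looks at features 0 and 1
sel_b = [[0, 0], [0, 0], [1, 0], [0, 1]]   # head 2 looks at features 2 and 3
I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

out = multi_head(X, X, X, [(sel_a, sel_a, sel_a), (sel_b, sel_b, sel_b)], I4)
print(out)
```

Each head attends within its own 2-dimensional subspace, and the concatenation restores the original width of 4 before the output projection.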

P (l ∣ x) = \sum_{π \in B^{- 1} (l)} P (π ∣ x)

(13)

where B^{-1}(l) is the set of all paths that collapse to the label sequence l after merging, and π is one of those paths. The probability of each path is the product, over the time steps, of the scores of the characters on that path. Summing the probabilities of all matching paths lets CTC align sequences of indeterminate length without an exact segmentation of the original character sequence. Training maximizes the likelihood P(l | x), i.e., minimizes the loss -log P(l | x), whose gradient is backpropagated through the preceding network to update its parameters so that the most probable character is found for each pixel region.
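For a tiny example, Eq (13) can be evaluated by brute force: enumerate every path, collapse it with the mapping B, and sum the probabilities of the paths that yield the target label. This sketch is for illustration only; practical CTC implementations use a forward-backward dynamic program instead, and the per-frame probabilities below are made up.

```python
from itertools import product

BLANK = "-"

def collapse(path):
    """The mapping B: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for ch in path:
        if ch != prev and ch != BLANK:
            out.append(ch)
        prev = ch
    return "".join(out)

def ctc_probability(label, probs, alphabet):
    """P(l|x): sum of path probabilities over all paths that collapse to `label`.

    probs[t][c] is the probability of symbol c at time step t.
    Enumerates len(alphabet) ** T paths, so it is only usable for tiny inputs.
    """
    total = 0.0
    for path in product(alphabet, repeat=len(probs)):
        if collapse(path) == label:
            p = 1.0
            for t, ch in enumerate(path):
                p *= probs[t][ch]
            total += p
    return total

alphabet = ["a", BLANK]
probs = [{"a": 0.6, BLANK: 0.4}, {"a": 0.5, BLANK: 0.5}]  # two time steps
print(ctc_probability("a", probs, alphabet))
```

Here the paths "aa", "a-", and "-a" all collapse to "a", so P("a" | x) = 0.30 + 0.30 + 0.20 = 0.80. The blank also explains why "aa-a" collapses to "aa" while "aaa" collapses to "a".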

3. Experiment

3.1 Dataset

The dataset used in this paper covers Chinese seal text location detection and text recognition. The text location detection dataset was obtained by the authors through photographing and data augmentation, and the text recognition dataset was constructed by cropping the detected text area boxes with Python. The dataset contains 2351 seals, including 676 real seals and 1675 electronic seals (S1 Appendix). Simulated effects such as rotation and transparency were added when producing the electronic seals to improve training and the model's generalization to seals with different text orientations. Each sample was uniformly scaled to 420×420 pixels and saved in JPG format. The data were divided into training and test sets at a ratio of 8:2.

3.2 Evaluation indicators

ICDAR [29] categorizes text localization into three challenges (challenges 1, 2, and 4) according to the source of the dataset, and each challenge uses a different evaluation method for detection. In this paper, the data come from real-scene collection and computer generation, so the methods of challenges 1 and 2 are applied, and precision P, recall R, and F-measure are used as the evaluation criteria. The definitions are shown in Eqs (14), (15) and (16):

P = \frac{TP}{TP + FP}

(14)

R = \frac{TP}{TP + FN}

(15)

F = \frac{2 \times P \times R}{P + R}

(16)

TP is the number of positive samples correctly predicted as positive, FP is the number of negative samples incorrectly predicted as positive, and FN is the number of positive samples incorrectly predicted as negative.
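Eqs (14) to (16) translate directly into a few lines of code. The sketch below is illustrative only: the function name and the example counts are assumptions, and the zero-denominator guards are an added convention rather than part of the definitions.

```python
def detection_metrics(tp, fp, fn):
    """Return (precision, recall, f_measure) from raw counts, per Eqs (14)-(16)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical counts, chosen only to illustrate the arithmetic.
p, r, f = detection_metrics(tp=80, fp=20, fn=10)
```

With these counts, P = 80/100, R = 80/90, and F simplifies to 2·TP / (2·TP + FP + FN) = 160/190.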

3.3 Experimental platform

The experiments were run under Windows with Python 3.7, CUDA 11.6 and cuDNN 8.4.0, on an i7-9700 CPU and a Tesla V100 32 GB GPU.

3.4 Ablation experiments

In the training of the text detection model DBNet-ECA, the number of epochs is set to 50, the batch size (batch-size) to 8, the number of data-loading sub-processes (num_workers) to 8, and the initial learning rate to 0.001 so that the loss reaches its minimum faster. The preprocessing stage applies basic data augmentation such as rotation by up to ±10 degrees, cropping, partial flipping and color change, and the processed image is resized to 420×420 to improve training efficiency. balance_loss is set to true, so that DBLoss balances positive and negative samples at a ratio of 1:3. The binarization threshold thresh is 0.3, which helps to reduce the misjudgment of background regions as text. The text box threshold box_thresh is set to 0.7, which makes bounding box generation more stable and improves the performance of subsequent text recognition. The optimizer is Adam. The test precision, recall, and F-measure are shown in Table 1.
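The roles of thresh and box_thresh can be illustrated with a simplified post-processing step: pixels whose probability exceeds thresh form candidate regions, and a region is kept only if its mean probability reaches box_thresh. The sketch below is an assumption-laden simplification, not the implementation used in this paper: it uses 4-connected flood fill and axis-aligned boxes, whereas DBNet works with contours and polygon expansion, and the function name and toy probability map are hypothetical.

```python
from collections import deque


def filter_text_regions(prob_map, thresh=0.3, box_thresh=0.7):
    """Binarize a probability map and keep only confident regions.

    prob_map is a list of rows of floats in [0, 1].
    Returns (x_min, y_min, x_max, y_max, mean_score) tuples.
    """
    height, width = len(prob_map), len(prob_map[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if prob_map[y0][x0] <= thresh or seen[y0][x0]:
                continue
            # Breadth-first flood fill collects one connected region.
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            ys, xs, scores = [], [], []
            while queue:
                y, x = queue.popleft()
                ys.append(y)
                xs.append(x)
                scores.append(prob_map[y][x])
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and prob_map[ny][nx] > thresh and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            mean_score = sum(scores) / len(scores)
            if mean_score >= box_thresh:  # discard low-confidence regions
                boxes.append((min(xs), min(ys), max(xs), max(ys), mean_score))
    return boxes


# Toy map: one confident region (0.9) and one weak region (0.5).
prob_map = [[0.0] * 8 for _ in range(6)]
for y in (1, 2):
    for x in (1, 2, 3):
        prob_map[y][x] = 0.9
for y in (4, 5):
    for x in (5, 6, 7):
        prob_map[y][x] = 0.5
boxes = filter_text_regions(prob_map)
```

Both regions pass the binarization threshold of 0.3, but only the first has a mean score above 0.7, so the weak region is discarded before recognition.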

Table 1. Detection model comparison.

Improved ResNet	Model	Precision (%)	Recall (%)	F-measure (%)
	ResNet34-DB	81.43	74.52	77.82
√	ResNet34-DB	82.28	77.43	79.78
	ResNet34-DB-ECA	84.17	79.45	81.74
√	ResNet34-DB-ECA	85.25	81.79	83.48
	ResNet50-DB-ECA	89.42	82.75	85.95
√	ResNet50-DB-ECA	90.29	85.17	87.65

Table 1 compares the precision, recall, and F-measure (the harmonic mean of precision and recall) of ResNet34-DB, ResNet34-DB-ECA and ResNet50-DB-ECA, each with and without the ResNet improvement. ResNet50-DB-ECA with the improved ResNet gives the best test result, with precision, recall, and F-measure of 90.29%, 85.17%, and 87.65%, respectively, but its number of parameters is also relatively large. Improving ResNet raises the precision of the three models by 0.85, 1.08, and 0.87 percentage points, respectively. Without the ResNet improvement, adding the ECA module makes ResNet34-DB-ECA 2.74, 4.93, and 3.92 percentage points higher than ResNet34-DB in precision, recall, and F-measure, respectively, while the model size remains moderate and the number of frames detected per second differs from that of ResNet34-DB by only 2.

Since character recognition requires a larger amount of real data, we pre-train LSTR on the public dataset CTW (Chinese Text in the Wild) for 100 epochs and then, by transfer learning, train it on the homemade dataset. Training on the homemade dataset uses 200 epochs, a batch size of 64, and a maximum predicted character length of 40. The recognized character type is set to ch (Chinese) to reduce interference from digits and letters. The inference layer processes image features with a cross-attention mechanism, with 1024 hidden units, and the input features are divided into 8 attention heads to capture different positional dependencies. The dropout rate is set to 0.1 because the real-scene dataset is small. Shuffle defaults to true, which ensures that the training images are returned in a different order in each epoch and reduces overfitting to the training data. fc_decay in CTC is 0.004, allowing the model to learn more detailed features. The Adam optimizer with a weight decay coefficient of 0.05 is used; it has modest memory requirements, adaptively adjusts the learning rate of each parameter, and is well suited to the model in this paper, which has large gradient noise. Data augmentation operations such as rotation, perspective distortion, and motion blur are applied randomly during training.
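Dividing the input features into attention heads can be sketched as follows. This is a minimal illustration under stated assumptions, not the inference layer of LSTR: the learned projection matrices are omitted, all function names are hypothetical, and a toy feature width of 16 is used instead of 1024. Passing the same sequence as queries, keys and values gives self-attention; passing different sequences gives cross-attention.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = math.sqrt(len(keys[0]))
    output = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale
                           for k in keys])
        output.append([sum(w * v[j] for w, v in zip(weights, values))
                       for j in range(len(values[0]))])
    return output


def split_heads(vectors, num_heads):
    """Cut each feature vector into num_heads equal slices."""
    width = len(vectors[0]) // num_heads
    return [[v[h * width:(h + 1) * width] for v in vectors]
            for h in range(num_heads)]


def multi_head_attention(queries, keys, values, num_heads=8):
    """Attend within each head independently, then concatenate the heads."""
    if len(queries[0]) % num_heads:
        raise ValueError("feature width must be divisible by num_heads")
    heads = [attend(q, k, v)
             for q, k, v in zip(split_heads(queries, num_heads),
                                split_heads(keys, num_heads),
                                split_heads(values, num_heads))]
    return [[x for head in heads for x in head[i]]
            for i in range(len(queries))]


# Toy input: 3 positions with 16 features, so 2 features per head.
features = [[float((i + 1) * (j % 3)) for j in range(16)] for i in range(3)]
fused = multi_head_attention(features, features, features, num_heads=8)
```

If the 1024 hidden units correspond to the attended feature width (an assumption; the text does not say so explicitly), 8 heads would each operate on 128 features.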

As can be seen from Table 2, the ResNet34 results are generally better than those of ResNet18, although the deeper network is also larger. With ResNet34 as the backbone, the test accuracy is 5.11 and 1.38 percentage points higher than with VIT and Swin-T, respectively. The inductive bias of VIT and Swin-T is weaker than that of a CNN on small samples, so they reason poorly about inputs not encountered during training, such as blurred or irregular pixels. More data is needed to learn these assumptions automatically, so their accuracy is lower than that of ResNet34, and the proposed LSTR recognition model combined with the self-attention mechanism gives better results for small-sample Chinese seal recognition.

Table 2. Comparison results of different backbone networks.

Model	Train accuracy%	Test accuracy%
ResNet18+BiLSTM	77.54	75.63
ResNet18+inference block	81.88	80.12
VIT+inference block	85.08	84.21
Swin-T+inference block	88.71	87.94
ResNet34+inference block	91.29	89.32

Table 3 compares the number of parameters and the inference speed of VIT-S, Swin-T, and ResNet34, each combined with the inference block. Speed (ms) is the inference time averaged over 100 Chinese seal text images. To better highlight the advantages of our model, inference time is measured on the CPU. The residual block used in ResNet consists of two 1×1 convolutions and one 3×3 convolution, whereas the number of tokens and the hidden size of VIT remain unchanged during computation and the computational complexity is proportional to the square of the number of tokens, so ResNet has fewer parameters and faster inference. The inference block contains only two fully connected layers. The model in this paper therefore has an advantage not only in accuracy but also in parameter count and inference speed, which demonstrates the applicability of this method in this study.
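An averaged per-image latency of the kind reported as Speed (ms) can be measured with a small timing harness. The sketch below is only one plausible way to do it: the function names, the warm-up calls, and the stand-in predictor are assumptions, and the paper does not describe its timing procedure beyond averaging over 100 images on the CPU.

```python
import time


def mean_latency_ms(predict, images, warmup=5):
    """Average wall-clock time of predict(image) in milliseconds."""
    for image in images[:warmup]:  # warm-up calls are not timed
        predict(image)
    start = time.perf_counter()
    for image in images:
        predict(image)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(images)


# Stand-in for a recognition model; any callable taking one image works.
def dummy_predict(image):
    return sum(image)


images = [[0.0] * 1000 for _ in range(100)]  # 100 placeholder "images"
latency = mean_latency_ms(dummy_predict, images)
```

Timing the whole loop and dividing by the number of images, rather than timing each call separately, keeps the timer overhead out of the per-image figure.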

Table 3. Backbone network performance comparison.

Model	Params(M)	Speed(ms)
VIT-S+inference	55.35	25.13
Swin-T+inference	58.21	17.27
ResNet34+inference	24.5	13.36

3.5 Comparison experiments

To further verify the advantages of the detection model DB-ECA and the recognition model LSTR over existing methods, comparison experiments with different detection and recognition models are conducted on the homemade real-scene dataset, as shown in Fig 6 and Table 4.

Fig 6. The number in the center of each bubble indicates precision, and the area of the bubble is proportional to the F-measure.

Table 4. Comparison of different recognition models.

Model	Accuracy%	Speed(ms)	Params(M)
NRTR	83.25	54.0	31.7
SRN	85.43	24.74	54.7
SVTR-S	87.51	8.51	10.3
ABINet	86.62	15.47	36.7
MGP-T	87.59	20.52	26.4
LSTR	89.32	13.36	24.5

As Fig 6 shows, TextBoxes++ [30] requires a large amount of data for pre-training because of its complex network structure, and its low-level feature expression is weak. CTPN [31] adds an LSTM in the training phase, which easily leads to gradient explosion, and it cannot process multi-directional text. Neither is therefore suitable for small-scale target detection such as the seals in this paper, and their precision is lower than that of DB-ECA by 6.71 and 7.63 percentage points, respectively. Although DRRG [12] can detect arbitrarily shaped text, its detection results depend heavily on the individual detection boxes proposed for the text components. PRPN [32] proposes a two-dimensional asymptotic kernel to meet the requirements of curved text detection, but it has difficulty with text embedded in seals and with other cases where training data are scarce. The precision of DRRG and PRPN is thus lower than that of DB-ECA by 4.96 and 3.48 percentage points, respectively. The experimental results demonstrate that DB-ECA achieves the highest precision and F-measure while maintaining a high FPS.

The transformer-based models NRTR and SRN have higher complexity and more parameters and require large datasets for gradient updating and optimization, especially for specialized data, so they are not suitable for small-sample studies of real seal scenes; their accuracies are lower than that of LSTR by 6.07 and 3.89 percentage points, respectively. SVTR unifies the extraction of visual and sequence features in a single module and has an advantage in speed, but its accuracy is 1.81 percentage points lower than that of LSTR on seal images with heavy noise such as blurring and bending. MGP improves the encoder but lacks a 'localizer' like CTC, and its accuracy is 1.73 percentage points lower than that of LSTR. This shows that LSTR, which adapts to a small-sample dataset through its CNN and uses CTC to align feature regions with the target characters, achieves the highest accuracy in Table 4 with the second-fastest inference speed.

3.6 Parameter analysis

In this section, the impact of improving ResNet and adding efficient channel attention on the DBNet-based detection models is analyzed experimentally. As shown in Table 1, using average pooling instead of a 1×1 convolution kernel and postponing downsampling extract richer character semantic features, and both improve the performance of the detection model. Introducing efficient channel attention with cross-channel interaction before feature concatenation significantly improves the accuracy of the detection model and effectively resolves the mismatch between the weights of the same pixel on feature maps of different scales. Table 5 reports the parameter counts and frames per second of the detection models: ResNet50 has the largest number of parameters and the slowest inference, while adding ECA has little impact on model speed and parameter count. Table 2 compares the recognition models in ablation experiments. Owing to the residual connections and Batch Normalization in ResNet, convergence is faster and the number of parameters is smaller, and the inference block outperforms BiLSTM in models with deeper networks. Compared with the transformer-based recognition models [20, 21], LSTR improves both recognition accuracy and inference speed. This verifies that the combination of a CNN and the inference layer can capture long-range dependencies, improves accuracy on irregular seal text affected by occlusion and blurring by predicting from contextual semantic features, and gains inference speed by replacing loops and recursion with parallel computation. The accuracy of the recognition model on the test set is only slightly lower than on the training set, indicating that LSTR does not overfit during transfer learning with Adam weight decay. This demonstrates that the inductive bias of the CNN helps the inference layer converge faster and gives it good global modeling ability on small samples.
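The cross-channel interaction of efficient channel attention [26] can be sketched in a few steps: global average pooling per channel, a 1-D convolution across neighbouring channels, a sigmoid gate, and channel-wise rescaling. The code below is a plain-Python illustration under stated assumptions rather than the module used in DB-ECA: the function names are hypothetical, the convolution weights are passed in explicitly instead of being learned, and the kernel-size rule uses the defaults gamma = 2 and b = 1 from the ECA paper.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive odd kernel size for the 1-D convolution over channels."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


def eca(feature_map, weights):
    """Apply efficient channel attention to a [C][H][W] nested list.

    weights is the 1-D kernel shared by all channels (learned in a
    real network; supplied by hand here).
    """
    channels = len(feature_map)
    # 1) Global average pooling: one descriptor per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # 2) 1-D convolution across neighbouring channels, zero padded.
    pad = len(weights) // 2
    gates = []
    for c in range(channels):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = c + j - pad
            if 0 <= idx < channels:
                acc += w * pooled[idx]
        # 3) Sigmoid turns the response into a gate in (0, 1).
        gates.append(1.0 / (1.0 + math.exp(-acc)))
    # 4) Rescale every channel by its gate.
    return [[[v * g for v in row] for row in ch]
            for ch, g in zip(feature_map, gates)]


# Two 2x2 channels; the kernel [0, 1, 0] makes each gate depend only
# on the channel's own pooled value, which keeps the result easy to check.
feature_map = [[[0.0, 0.0], [0.0, 0.0]],
               [[2.0, 2.0], [2.0, 2.0]]]
rescaled = eca(feature_map, weights=[0.0, 1.0, 0.0])
```

Because the attention adds only a kernel of a few weights per module, its small effect on parameter count and speed is consistent with the figures in Table 5.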

Table 5. Model performance comparison.

Model	Params (M)	FPS
ResNet34-DB	20.1	34
ResNet34-DB-ECA	21.9	32
ResNet50-DB-ECA	25.6	21

4. Conclusion

In this paper, the DB-ECA seal text location detection model and the Lightweight Seal Text Recognition (LSTR) model are proposed. ECA channel attention is added to resolve the feature pyramid conflict in DBNet, and an inference block is constructed to replace the recurrent layer. The inference layer uses the self-attention mechanism to model character features globally, fuse contextual semantic features, and predict visual features in parallel, improving recognition efficiency and accuracy. Experimental evaluation on a self-made real Chinese seal dataset shows that DB-ECA and LSTR outperform the compared models in detection performance and recognition accuracy and extract the key information of a seal quickly and accurately. This addresses inefficiency in traditional administrative office work, saves material and financial resources, and helps governments, enterprises and other administrative departments move toward intelligent office work. The recognition method achieves good results for Chinese seal text recognition, with 91.29% training accuracy and 89.32% test accuracy (Table 2). However, decoding with the CTC loss function has a certain degree of randomness, which affects the alignment of features and labels. In future work, we therefore consider adding a graph convolutional network (GCN) to the CTC branch of the recognition model to establish the connection between labels and local features and improve the expressive power of the model, so as to further improve the accuracy of Chinese seal text recognition.

Supporting information

S1 Appendix. Dataset sources.

(DOCX)

pone.0301862.s001.docx (18.3 KB, DOCX)

Data Availability

All relevant data are within the paper and its Supporting Information files.

Funding Statement

This work was financially supported by the National Natural Science Foundation of China (Grant no. 61962005), but the funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Mandavia K, Badelia P, Ghosh S, Chaudhuri A. Optical Character Recognition Systems for Different Languages with Soft Computing. Springer: Cham, Switzerland, 2017; Volume 352, pp. 9–41.
2. Nagy G. Twenty years of document image analysis in PAMI. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 38–62.
3. Shan F, Hong X, Zhen M, et al. Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 7098–7107.
4. Wang W, Xie E, Li X, et al. Shape robust text detection with progressive scale expansion network. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019: 9336–9345.
5. Ma Lixia, Zhu Qiuping. The application of image processing application. Journal of Wuhan University, 2004, 48(08): 691–693.
6. Yao Min, Mou Xueer, Chen Peng, et al. Research on detection, positioning and recognition of seals in images. Information Technology and Informatization, 2018, 43(12): 148–150.
7. Cai Liang, Mei Li. Color stamp detection and alignment method based on wedge ring structure. Journal of Zhejiang University, 2006, 51(10): 1696–1700.
8. Zhang Xiang, Qin Yi, Dong Zhicheng, et al. Chinese seal recognition method based on flood filling algorithm. Electronic technology application, 2022, 48(11): 2–6, 12.
9. Xiao Jinsheng, Zhao Tao, Xiong Wenxin, et al. Stamp Text Detection and Recognition Algorithm Based on Angle Optimization Network. Journal of Electronics and Information Technology, 2021, 43(11): 3327–3334.
10. Dehua Zhang, Xinyuan Hao, Linlin Liang, Wei Liu, Chunbin Qin. A novel deep convolutional neural network algorithm for surface defect detection. Journal of Computational Design and Engineering, 2022, 9(5): 1616–1632. doi: 10.1093/jcde/qwac071
11. Zhang D, Hao X, Wang D, et al. An efficient lightweight convolutional neural network for industrial surface defect detection. Artificial Intelligence Review, 2023: 1–27. doi: 10.1007/s10462-023-10438-y
12. Shi Xue Z, Xiaobin Z, Jie-Bo H. Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 9699–9708.
13. Zhu Y, Chen J, Liang L, et al. Fourier contour embedding for arbitrary-shaped text detection. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021: 3123–3131. https://arxiv.org/abs/2104.10442v2.
14. Zhu Y, Du J. Text mountain: Accurate scene text detection via instance segmentation. Pattern Recognition, 2021, 110: 107336.
15. Liao M, Wan Z, Yao C, et al. Real-time scene text detection with differentiable binarization. Proceedings of the AAAI conference on artificial intelligence, 2020, 34(07): 11474–11481.
16. Yu W, Liu Y, Hua W, et al. Turning a CLIP Model into a Scene Text Detector. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 6978–6988. doi: 10.48550/arXiv.2302.14338
17. Shi B, Bai X, Yao C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence, 2016, 39(11): 2298–2304. doi: 10.1109/TPAMI.2016.2646371
18. Hu W, Cai X, Hou J, et al. GTC: Guided training of CTC towards efficient and accurate scene text recognition. Proceedings of the AAAI conference on artificial intelligence. 2020: 11005–11012.
19. Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale. Proceedings of International Conference on Learning Representations. 2020: 1–9.
20. Sheng F, Chen Z, Xu B. NRTR: A no-recurrence sequence-to-sequence model for scene text recognition. 2019 International conference on document analysis and recognition (ICDAR). IEEE, 2019: 781–786.
21. Yu D, Li X, Zhang C, et al. Towards accurate scene text recognition with semantic reasoning networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 12113–12122.
22. Wang P, Da C, Yao C. Multi-granularity Prediction for Scene Text Recognition. Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, 2022: 339–355.
23. Yongkun Du, Zhineng Chen, Caiyan Jia, et al. Scene Text Recognition with a Single Visual Model. International Joint Conference on Artificial Intelligence. 2022: 978–990. doi: 10.48550/arXiv.2205.00159
24. Ye M, Zhang J, Zhao S, et al. Deepsolo: Let transformer decoder with explicit points solo for text spotting. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 19348–19357.
25. Kai He, Zhang X, Ren S, et al. Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770–778.
26. Qilong Wang, Banggu W, Pengfei Z, et al. Efficient Channel Attention for Deep Convolutional Neural Networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 11534–11542.
27. Zhu L, Wang X, Ke Z, et al. BiFormer: Vision Transformer with Bi-Level Routing Attention. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 10323–10333. doi: 10.48550/arXiv.2303.08810
28. Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. Attention is All you Need. Conference and Workshop on Neural Information Processing Systems. 2017: 2150–2164. doi: 10.48550/arXiv.1706.03762
29. Karatzas D, Shafait F, Uchida S, et al. ICDAR 2013 robust reading competition. 2013 12th international conference on document analysis and recognition. IEEE, 2013: 1484–1493.
30. Liao M, Shi B, Bai X. Textboxes++: A single-shot oriented scene text detector. IEEE transactions on image processing, 2018, 27(8): 3676–3690.
31. Tian Z, Huang W, He T, et al. Detecting text in natural image with connectionist text proposal network. Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VIII 14. Springer International Publishing, 2016: 56–72.
32. Zhong Y, Cheng X, Chen T, et al. PRPN: Progressive region prediction network for natural scene text detection. Knowledge-Based Systems, 2022, 236: 107767.

PLoS One. doi: 10.1371/journal.pone.0301862.r001

Decision Letter 0

Nouman Ali

16 Nov 2023

PONE-D-23-32033

DB-EAC and LSTR: DBnet based seal text detection and Lightweight Seal Text Recognition

PLOS ONE

Dear Dr. Huang,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Dec 31 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Nouman Ali

Academic Editor

PLOS ONE

Journal requirements:

1. When submitting your revision, we need you to address these additional requirements.

Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match.

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

4. Thank you for stating the following financial disclosure:

[National Natural Science Foundation of China (Grant no. 61962005)].

Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

5. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

6. We note that Figure 1, 4, 5 and 7 in your submission contain copyrighted images. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

a. You may seek permission from the original copyright holder of Figure 1, 4, 5 and 7 to publish the content specifically under the CC BY 4.0 license.

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

b. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This article discusses the importance of artificial intelligence in the field of seals, aiming to solve the problem of seal information localization and recognition. Therefore, an efficient DBNet-based model is proposed in this article. The paper has a certain degree of innovation, but there are also serious problems that need to be solved. Recommendations for improving the manuscript:

1) Figure 2 is incompletely plotted, for example, activation functions, normalization operations, and so on. Please plot Figure 2 carefully and add relevant descriptions.

2) Please pay attention to the latex writing specifications for equation 10, such as "radical sign" and "superscript and subscript". Because the symbol "^" is often represented as an XOR operation in computer languages such as Java, C++, and Python.

3) In Section 3.2, the article mentions biformer, an attention mechanism that very cleverly lightweights mhsa through operations such as sparse matrices, lightweight convolution, and so on, without reducing accuracy. Please add the cleverness of biformer and reduce the description of mhsa, because mhsa has become common knowledge in the field of attention mechanisms.

4) Please note the specification of equation 16, which uses "F1" instead of "F". In addition, there are formatting errors in the two lines of text below equation 16, such as font and spacing. Please correct them carefully.

5) In Section 4.4, the sentence "The results for training accuracy, recall, and F-value are shown in Table 1." does not match the performance metrics in Table 1; if English is not your first language, please recheck the text after using a translator. The article is written with a terrible attitude. Moreover, why "training accuracy"? Why not use test set results?

6) Why do "Params" in Table 2 and Table 4 not use the same unit?

7) There may be problems with the two sets of experiments in Table 3. How were "ViT + inference block" and "Swin-T + inference block" implemented? The "inference block" is a variant of MHSA, and ViT and Swin-T are also variants of MHSA. How, then, were these experiments carried out? Please give a reasonable explanation.

8) Why is the CPU used in Table 4? What does the GPU do in the experimental environment? Please give a reasonable explanation.

9) Why are the performance metrics described differently in each table? This is very unfriendly to reviewers and readers.

10) The performance metrics of the comparative models in Table 5 are too few, so please add the results.

11) It's a well-known fact that the fastest way to get academic resources is not to work hard, but to just loot them. CV field ≈ ctrl c + ctrl v field.

12) The problem of seal information localization and recognition belongs to the problem of small target recognition, but the coverage of this aspect in the paper is clearly insufficient. For example, the results in these two publications are not even mentioned: DOI: 10.1093/jcde/qwac071, DOI: 10.1007/s10462-023-10438-y, and so on.

Therefore, please draw your own diagram of the general architecture of the model, not the DBNet prototype, so that the subsequent image structure is represented in the general architecture diagram.

Reviewer #2: 1. The author proposes his own method for the detection and recognition of Chinese seal text, which has practical significance. The overall article is innovative and has sufficient workload.

2. The text detection section can be compared with other algorithms, including classic text detection algorithms based on regression methods. Conduct ablation experiments on the improved part and add more comparative images of experimental results.

3. It should be explained whether seal images with completely opposite text directions can be detected and recognized.

4. The data volume is not large enough; more seal images can be collected, or more dataset images can be obtained through transfer learning and other methods.

Reviewer #3: General Comments:

The topic of this manuscript is to improve the accuracy of Chinese seal recognition. In my opinion, there are two innovative points in this paper. Firstly, a model DB-ECA was constructed to improve the differentiable binarization detection algorithm (DBnet). Secondly, a model named LSTR (Lightweight Seal Text Recognition) was proposed for text recognition. In my opinion, the method used in this paper is innovative, but the application area is not very attractive. Therefore, I suggest that the manuscript can be published after major revision. Before acceptance for publication, the paper needs the following improvements:

1. Line 10-12. The specific significance of this study should be supplemented in the beginning of the abstract. Which fields or works will this research be most helpful for? What is the important significance of extracting text information from Chinese seals?

2. Line 18-23. If this journal does not limit the word count of abstract, I suggest supplementing the results and conclusions of this study in the abstract.

3. Line 25-29. I suggest that the author add the significance and importance of this study at the beginning of the introduction.

4. Line 73. Can the section “2. Related Work” be merged with the introduction? Most studies analyze the methods and results of previous research in the introduction. If it can be merged into the Introduction, I suggest rewriting the content of this section. The introduction should be reduced to 5-6 paragraphs with a summary at the end of each paragraph.

5. Line 146-158. Should a title be added before the text of this paragraph?

6. Line 158-160. Is Figure 1 original by the author? Can the first input small picture be replaced with the pictures related to this study? I suggest that the title of this figure be written in more detail. It is necessary to add descriptions of important parts and some small figures.

7. I strongly suggest that the author can supplement a diagram as Figure 1 to introduce the overall process of this study.

8. I suggest merging Figures 2 and 3 into one figure and displaying them separately with panel (A) and (B). In addition, the styles and colors of the two small figures should be consistent.

9. Are the three text recognition results in Figure 7 screenshots of a programming tool? I don't like the style of this figure. Readers outside China cannot understand its meaning. I suggest that Figure 7 be redrawn. The small pictures of the 3 seals and their text recognition results should be divided into 3 different groups and panels. The figure caption should describe the three panels in detail.

10. Line 145-301. I suggest cutting at least 30 lines of text from “Text detection and recognition model”. I think the description in this section is not concise, and some redundant content should be removed.

11. The paper is generally understandable. I suggest that the author carefully check the English writing and use standard terminology from the neural network field.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 May 16;19(5):e0301862. doi: 10.1371/journal.pone.0301862.r002

Author response to Decision Letter 0

19 Feb 2024

Thanks to the comments and suggestions from all reviewers, we revised the paper and described our efforts in the 'Response to Reviewers'.

Attachment

Submitted filename: Response to Reviewers.docx

pone.0301862.s002.docx (769.2 KB, DOCX)

PLoS One. doi: 10.1371/journal.pone.0301862.r003

Decision Letter 1

Nouman Ali

25 Mar 2024

DB-EAC and LSTR: DBnet based seal text detection and Lightweight Seal Text Recognition

PONE-D-23-32033R1

Dear Dr. Huang,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information’ link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Nouman Ali

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Reviewer #1: The authors have revised the manuscript carefully, and the theory is reasonable and has application value. I have no further comments; the manuscript can be accepted in its present form.

Reviewer #3: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: No

**********

PLoS One. doi: 10.1371/journal.pone.0301862.r004

Acceptance letter

Nouman Ali

2 May 2024

PONE-D-23-32033R1

PLOS ONE

Dear Dr. Huang,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Nouman Ali

Academic Editor

PLOS ONE

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Appendix. Dataset sources.

(DOCX)

pone.0301862.s001.docx (18.3 KB, DOCX)

Attachment

Submitted filename: Response to Reviewers.docx

pone.0301862.s002.docx (769.2 KB, DOCX)

Data Availability Statement

All relevant data are within the paper and its Supporting Information files.

1. Mandavia K, Badelia P, Ghosh S, Chaudhuri A. Optical Character Recognition Systems for Different Languages with Soft Computing. Springer: Cham, Switzerland, 2017; Volume 352, pp. 9–41.

2. Nagy G. Twenty years of document image analysis in PAMI. IEEE Trans. Pattern Anal. Mach. Intell., 2000, 22: 38–62.

3. Shan F, Hong X, Zhen M, et al. Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 7098–7107.

4. Wang W, Xie E, Li X, et al. Shape robust text detection with progressive scale expansion network. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 9336–9345.

5. Ma Lixia, Zhu Qiuping. The application of image processing application. Journal of Wuhan University, 2004, 48(08): 691–693.

6. Yao Min, Mou Xueer, Chen Peng, et al. Research on detection, positioning and recognition of seals in images. Information Technology and Informatization, 2018, 43(12): 148–150.

7. Cai Liang, Mei Li. Color stamp detection and alignment method based on wedge ring structure. Journal of Zhejiang University, 2006, 51(10): 1696–1700.

8. Zhang Xiang, Qin Yi, Dong Zhicheng, et al. Chinese seal recognition method based on flood filling algorithm. Electronic technology application, 2022, 48(11): 2–6, 12.

9. Xiao Jinsheng, Zhao Tao, Xiong Wenxin, et al. Stamp Text Detection and Recognition Algorithm Based on Angle Optimization Network. Journal of Electronics and Information Technology, 2021, 43(11): 3327–3334.

10. Dehua Zhang, Xinyuan Hao, Linlin Liang, Wei Liu, Chunbin Qin. A novel deep convolutional neural network algorithm for surface defect detection. Journal of Computational Design and Engineering, 2022, 9(5): 1616–1632. doi: 10.1093/jcde/qwac071

11. Zhang D, Hao X, Wang D, et al. An efficient lightweight convolutional neural network for industrial surface defect detection. Artificial Intelligence Review, 2023: 1–27. doi: 10.1007/s10462-023-10438-y

12. Shi Xue Z, Xiaobin Z, Jie-Bo H. Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 9699–9708.

13. Zhu Y, Chen J, Liang L, et al. Fourier contour embedding for arbitrary-shaped text detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 3123–3131. https://arxiv.org/abs/2104.10442v2

14. Zhu Y, Du J. Text mountain: Accurate scene text detection via instance segmentation. Pattern Recognition, 2021, 110: 107336.

15. Liao M, Wan Z, Yao C, et al. Real-time scene text detection with differentiable binarization. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(07): 11474–11481.

16. Yu W, Liu Y, Hua W, et al. Turning a CLIP Model into a Scene Text Detector. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 6978–6988. doi: 10.48550/arXiv.2302.14338

17. Shi B, Bai X, Yao C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 39(11): 2298–2304. doi: 10.1109/TPAMI.2016.2646371

18. Hu W, Cai X, Hou J, et al. GTC: Guided training of CTC towards efficient and accurate scene text recognition. Proceedings of the AAAI Conference on Artificial Intelligence. 2020: 11005–11012.

19. Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale. Proceedings of International Conference on Learning Representations. 2020: 1–9.

20. Sheng F, Chen Z, Xu B. NRTR: A no-recurrence sequence-to-sequence model for scene text recognition. 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 781–786.

21. Yu D, Li X, Zhang C, et al. Towards accurate scene text recognition with semantic reasoning networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 12113–12122.

22. Wang P, Da C, Yao C. Multi-granularity Prediction for Scene Text Recognition. Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, 2022: 339–355.

23. Yongkun Du, Zhineng Chen, Caiyan Jia, et al. Scene Text Recognition with a Single Visual Model. International Joint Conference on Artificial Intelligence. 2022: 978–990. doi: 10.48550/arXiv.2205.00159

24. Ye M, Zhang J, Zhao S, et al. DeepSolo: Let transformer decoder with explicit points solo for text spotting. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 19348–19357.

25. Kai He, Zhang X, Ren S, et al. Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770–778.

26. Qilong Wang, Banggu W, Pengfei Z, et al. Efficient Channel Attention for Deep Convolutional Neural Networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 11534–11542.

27. Zhu L, Wang X, Ke Z, et al. BiFormer: Vision Transformer with Bi-Level Routing Attention. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 10323–10333. doi: 10.48550/arXiv.2303.08810

28. Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. Attention is All you Need. Conference and Workshop on Neural Information Processing Systems. 2017: 2150–2164. doi: 10.48550/arXiv.1706.03762

29. Karatzas D, Shafait F, Uchida S, et al. ICDAR 2013 robust reading competition. 2013 12th International Conference on Document Analysis and Recognition. IEEE, 2013: 1484–1493.

30. Liao M, Shi B, Bai X. Textboxes++: A single-shot oriented scene text detector. IEEE Transactions on Image Processing, 2018, 27(8): 3676–3690.

31. Tian Z, Huang W, He T, et al. Detecting text in natural image with connectionist text proposal network. Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VIII 14. Springer International Publishing, 2016: 56–72.

32. Zhong Y, Cheng X, Chen T, et al. PRPN: Progressive region prediction network for natural scene text detection. Knowledge-Based Systems, 2022, 236: 107767.
