Scientific Reports. 2025 Jul 3;15:23761. doi: 10.1038/s41598-025-07439-y

Printed document layout analysis and optical character recognition system based on deep learning

Dong-Lin Li 1, Shih-Kai Lee 1, Yin-Ting Liu 1
PMCID: PMC12229442  PMID: 40610547

Abstract

This paper proposes a layout analysis and text recognition system for printed documents based on deep learning. Initially, scanned documents or image files are processed using a layout analysis algorithm based on YOLOv4 and YOLOv8 deep learning to identify the positions of titles, text paragraphs, tables, and images within the document. Each of these categories undergoes specific character segmentation processing. Then, the content is recognized using a text recognition algorithm based on Convolutional Neural Networks (CNN). Finally, the recognized text is integrated and output in editable formats, such as JSON or Microsoft formats. Our proposed method enables convenient, fast, and highly accurate OCR processing on a local computer.

Keywords: OCR, Layout analysis, CNN, YOLO, Deep learning

Subject terms: Computer science, Information technology

Introduction

Optical character recognition (OCR) refers to the process of recognizing and processing the content of paper documents or image files containing text to obtain text content and layout information. This process includes input image preprocessing, recognition processing, and output post-processing, as shown in Fig. 1.

Fig. 1.

OCR processing flow.

OCR is not a new topic that has emerged in recent years. The concept was first invented in 1929 by German scientist Gustav Tauschek, who designed a computational system using light, documents, and templates based on the principle of a punching machine, giving machines rudimentary image recognition capabilities [1, 2]. In 1951, American scientist David Hammond Shepard invented a reader called “Gismo” [3], which can be considered the earliest and most complete OCR device. Gismo was able to recognize music symbols and text produced by a standard typewriter but could only recognize 23 characters. In Japan, where square characters are used, OCR research focusing on numbers began in the 1960s. By the 1970s, some simple number recognition inventions had been made, and the first relatively complete OCR product was the IBM 1418, but it could only recognize printed numbers, English letters, and some punctuation marks in designated fonts [4].

In the 1980s, the rise of neural network theory enabled computers to learn efficiently from data. In 1990, Adam Krzyzak et al. published “Unconstrained handwritten character classification using modified backpropagation model” [5], proposing a recognition system for a large number of handwritten characters. Its structure used a multilayer neural network classifier with a backpropagation adaptive learning method, learning features derived from the Fourier descriptors and structural features of characters that are invariant to size and displacement. The final accuracy was better than that of traditional classifiers.

Yann LeCun first used the idea of “convolution” in his 1989 publication “Handwritten Digit Recognition with a Back-Propagation Network” [6], and in 1993 he demonstrated the use of Convolutional Neural Networks (CNN) to recognize various digit recordings [7], which became a significant starting point for the widespread application of CNN in computer vision.

In 1998, Yann LeCun et al. published “Gradient-Based Learning Applied to Document Recognition” [8]. They mainly used the LeNet-5 architecture model and created training samples with digits, English letters, and special symbols included in the ASCII encoding set. The model was trained using a gradient-based backpropagation learning algorithm, making OCR more related to mathematical calculations than purely optical, by combining deep learning approaches that simulate the human visual system to recognize text images. In recent years, deep learning has developed rapidly, and many emerging applications both domestically and abroad, such as license plate recognition, road sign recognition, document identification, and production line recognition, use text recognition technology. However, these usually focus on numbers and English letters, with less research on Chinese characters. This is likely due to the complex structure of Chinese characters and the vast number of commonly used characters, totaling 4,808 as recorded by the Ministry of Education in Taiwan [9], compared to the 52 uppercase and lowercase English letters, which makes training much more challenging.

The goal of this research is to propose a more comprehensive OCR system for printed Chinese and English text. This system does not just input a single text image into a neural network model to obtain recognition results. Instead, it allows users to directly input a document file or a scanned document image for recognition, encompassing all the processing steps of a complete OCR software. Firstly, the method uses object detection to analyze the positions of all titles and text paragraphs in the document. Then, it segments the detected areas into characters and uses a trained CNN model for text recognition. Finally, it outputs the results in different formats according to their categories.

To ensure the text recognition model has excellent recognition capabilities and can simultaneously process Chinese and English text, the CNN training dataset includes samples of Chinese characters in various fonts, English letters, numbers, and common punctuation marks. It also adds different types of noise interference samples to each category and incorporates data augmentation techniques such as rotation and translation during training, enabling the model to learn more possible scenarios and enhance its recognition capabilities.

Literature review

Document analysis plays a crucial role in enabling efficient and accurate optical character recognition (OCR), particularly for multilingual document images. Script identification has been widely recognized as an essential pre-processing step to select the appropriate OCR model for different scripts. Despite extensive research, script identification still faces challenges such as faded document images, variable illumination conditions, and positional distortions during scanning. Additionally, noise remains a significant obstacle that can only be minimized but not completely eliminated [10]. Comprehensive reviews have classified and compared various script identification schemes, highlighting their merits, limitations, and potential future research directions [10].

Beyond script identification, extracting text information from images is fundamental for efficient indexing and retrieval. Recognizing characters after text extraction simplifies the search process by allowing users to access indexed images without exhaustive manual search. However, text extraction remains challenging, particularly when dealing with skewed and multi-skewed text lines in printed and handwritten documents. Designing a real-time system capable of achieving high recognition rates across varying document types and fonts remains an open challenge [11]. Comparative analyses of different extraction approaches emphasize the critical need to address these issues to advance document image analysis technologies [11].

Machine learning techniques also play a crucial role in enhancing OCR, particularly for character segmentation and recognition in multilingual documents. Recent studies have proposed robust algorithms tailored for Indian documents containing Latin and Devanagari scripts, which often present complex layouts and local skews [12, 13]. These methods involve preprocessing steps such as noise reduction and illumination correction, followed by character segmentation based on structural properties. Graph distance theory is applied to separate overlapping characters, and Support Vector Machine (SVM) classifiers are used to validate segmentation results. For recognition, geometric and shape-based features are computed and classified using k-Nearest Neighbor (k-NN) classifiers, achieving segmentation and recognition accuracies up to 98.86% and 99.84%, respectively.

Segmentation is a fundamental step that directly impacts OCR accuracy. Several approaches have been developed to improve text-line and word segmentation in complex document layouts. The authors of [14] proposed a script-independent segmentation method using Dijkstra’s algorithm for text-line segmentation and wavelet transforms for word segmentation. Their technique achieved 97.6% text-line segmentation accuracy and 98.1% word segmentation accuracy. Alternatively, the authors of [15] introduced a method using the fast marching method for text-line segmentation and wavelet transforms with connected components (CCs) labeling for word segmentation, achieving even higher accuracies of 98.9% and 99.1%, respectively. These segmentation techniques are critical for ensuring high performance in multilingual and mixed-format document recognition systems.

Noise removal is another crucial preprocessing step in document image analysis. In mixed-content documents such as bank cheques and admission forms, noise and the coexistence of handwritten and machine-printed texts pose significant challenges. The authors of [16] proposed a method combining two-dimensional discrete wavelet transforms and semi-decimated discrete wavelet transforms, treating noise explicitly as a separate class during classification. Their method achieved an average identification rate of 98.02%. Similarly, the authors of [17] employed contourlet transform-based feature extraction combined with SVM classifiers, achieving a maximum identification recall rate of 98.9%. These studies confirm that effective noise separation is essential for robust OCR systems.

To further enhance the feasibility of document analysis across multiple languages, including Indian scripts and other multilingual environments, recent efforts have focused on script identification at the word level. The authors of [18] proposed a method using scale- and rotation-robust log-polar wavelet and semi-decimated wavelet features. Text-blobs were segmented using Gaussian filtering, and texture features were extracted in the log-polar domain, effectively mitigating rotational and scale variations and resulting in a maximum recall rate of 98.96%. Similarly, the authors of [19] introduced a method based on log-polar curvelet features, demonstrating strong directional and anisotropic properties for feature extraction. Experiments across multiple datasets, including both printed and handwritten documents, achieved a maximum recall rate of 98.76%.

These advancements demonstrate the effectiveness and feasibility of robust, script-independent document processing systems capable of handling multilingual and complex document scenarios.

Research methodology

This study proposes a document processing methodology based on deep neural networks (DNN) to enhance document recognition and character analysis accuracy, as shown in Fig. 2. The overall process consists of conversion, DNN-based document type classification, binarization processing, cell segmentation, character segmentation, and DNN-based document character recognition.

Fig. 2.

OCR processing flowchart.

First, the system performs conversion to adapt the input document format and uses DNN-based document type classification to identify the document structure, so that an appropriate processing strategy can be chosen. Next, binarization, a critical preprocessing step in table recognition, enhances the contrast between text and background. The table structure is then segmented into individual cells, facilitating precise character recognition in later stages.

To improve text readability, the system applies skew correction so that text lines are properly aligned. Subsequently, character segmentation extracts individual characters, which are then passed to DNN-based document character recognition.

Finally, the system generates the output, which can be utilized for further text analysis or data storage. By integrating deep learning with image processing techniques, this methodology significantly improves document recognition accuracy and can be applied in optical character recognition (OCR) and other text processing applications.

Document analysis

DNN training document type classification

The input to YOLOv4 must be in an image format; therefore, PDF files are first converted to high-quality PNG images to preserve clarity. YOLO is used for document analysis, implemented in Python 3.7 with PyTorch 1.2 and CUDA 10.0 on a Windows 10 system. The model is based on the CSPDarkNet53 architecture, and the training parameter settings, designed to improve efficiency and accuracy, are shown in Table 1.

Table 1.

YOLOv4 training parameter settings.

| Stage | Epoch | Batch size | Learning rate | Optimizer |
|---|---|---|---|---|
| Freeze train | 30 | 8 | 0.001 | Adam |
| Unfreeze train | 370 | 3 | 0.0001 | Adam |

The training data consists of 1720 pages of various documents, each containing at least two different label categories, totaling 8225 labels. The labels are divided into five categories, as shown in Table 2. The dataset is partitioned into training, testing, and validation subsets in an 8:1:1 ratio to ensure robust model evaluation.

Table 2.

Label name and classification criteria.

| Label name | Classification criteria |
|---|---|
| Title | Title, numbered titles, key prompts, single-line annotations |
| Text | Long texts, general text paragraphs |
| Table | Tabular data |
| image | Photos, pictures, trademarks, handwritten signatures |
| chart | Various statistical charts: bar charts, line charts, pie charts, etc. |
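
The 8:1:1 partition of the 1720 pages can be sketched as follows. This is only an illustration: the shuffling seed, the function name `split_811`, and the file names are hypothetical, and the source does not state how pages were assigned to each subset.

```python
import random


def split_811(items, seed=0):
    """Shuffle items and split them into train/test/validation at 8:1:1."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary assumption
    n_train = len(shuffled) * 8 // 10
    n_test = len(shuffled) // 10
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]      # remainder goes to validation
    return train, test, val


# Hypothetical page file names standing in for the 1720 document pages.
pages = [f"page_{i:04d}.png" for i in range(1720)]
train, test, val = split_811(pages)
print(len(train), len(test), len(val))  # 1376 172 172
```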

After conversion, document type classification is performed using a DNN-based approach to accurately determine structural attributes and select optimal processing strategies. This step ensures that different document types receive appropriate preprocessing tailored to their specific characteristics.

Conversion

Before the sections classified as tables or text are processed further, several preprocessing steps need to be performed on the image. Manual scanning of documents can result in slight skewing, which may seem insignificant but can affect character segmentation and the recognition results of the CNN model. Therefore, skew correction is necessary during the text recognition phase to adjust the text paragraphs to the correct horizontal angle. After document analysis, the extracted text paragraphs are first subjected to inverse binarization to obtain the point set of the text area. Then, the minimum enclosing rectangle of this point set and the angle between the rectangle and the horizontal axis are obtained. Finally, angle correction is performed using the rotate function. The results are shown in Fig. 3.

Fig. 3.

OCR processing flow.
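
The angle estimation step described above can be sketched in pure Python as follows. This is a minimal illustration rather than the authors' implementation: the source does not name the library it uses, and the function names `convex_hull` and `estimate_skew_angle` are hypothetical. The sketch relies on the fact that a minimum-area enclosing rectangle has one side collinear with an edge of the convex hull, and it folds the resulting angle into the range [-45°, 45°).

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def estimate_skew_angle(points):
    """Angle (degrees) of the minimum-area enclosing rectangle, in [-45, 45)."""
    hull = convex_hull(points)
    best_area, best_angle = float("inf"), 0.0
    for i in range(len(hull)):
        (x1, y1), (x2, y2) = hull[i], hull[(i + 1) % len(hull)]
        theta = math.atan2(y2 - y1, x2 - x1)
        c, s = math.cos(-theta), math.sin(-theta)
        # Rotate the hull so this edge is horizontal, then take the bounding box.
        xs = [x * c - y * s for x, y in hull]
        ys = [x * s + y * c for x, y in hull]
        area = (max(xs) - min(xs)) * (max(ys) - min(ys))
        if area < best_area:
            best_area, best_angle = area, math.degrees(theta)
    # A rectangle looks the same every 90 degrees, so fold the angle.
    return (best_angle + 45.0) % 90.0 - 45.0


# Synthetic text block: a 100 x 20 rectangle rotated by 5 degrees.
a = math.radians(5.0)
block = [(0, 0), (100, 0), (100, 20), (0, 20), (50, 10)]
rotated = [(x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a))
           for x, y in block]
skew = estimate_skew_angle(rotated)
```

Rotating the image by the negative of the estimated angle would then bring the text lines back to horizontal. Note that in image coordinates the y-axis points downward, so the sign convention has to match the rotation routine that is used.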

To avoid the impact of noise during character segmentation and to improve the accuracy of text recognition, this study incorporates a method similar to median filtering for noise removal. When a standard median filter processes the endpoints or edges of text strokes, the number of white (255) pixels within the filter’s 3 × 3 region may outnumber the black (0) pixels. In such cases, the endpoints of strokes could be erased, affecting the text structure. To ensure that strokes are fuller and more complete, this study modifies the value selection method by taking one position forward from the original median value. This reduces the likelihood of stroke endpoints being erased, resulting in clearer processed text strokes, as shown in Fig. 4.

Fig. 4.

OCR processing flow.
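
The modified filter can be illustrated with a small rank filter. The sketch below assumes that “one position forward” means selecting the fourth-smallest of the nine sorted window values (index 3) instead of the true median (index 4), which biases the result toward black stroke pixels. That reading, the function name `rank_filter_3x3`, and the choice to leave border pixels unchanged are assumptions, not details given in the source.

```python
def rank_filter_3x3(img, rank=3):
    """Apply a 3x3 rank filter; rank=4 is the standard median,
    rank=3 picks the value one position before the median."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]  # border pixels are left unchanged
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = sorted(img[y + dy][x + dx]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = window[rank]
    return out


# A stroke endpoint: 4 black pixels and 5 white pixels in the window.
endpoint = [[0, 0, 255],
            [0, 0, 255],
            [255, 255, 255]]
# An isolated black noise pixel on a white background.
speck = [[255, 255, 255],
         [255, 0, 255],
         [255, 255, 255]]

median_result = rank_filter_3x3(endpoint, rank=4)[1][1]    # endpoint erased
modified_result = rank_filter_3x3(endpoint, rank=3)[1][1]  # endpoint kept
speck_result = rank_filter_3x3(speck, rank=3)[1][1]        # noise removed
```

In this example the standard median turns the endpoint pixel white, while the shifted rank keeps it black and still removes the isolated speck.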

Table recognition processing

Binarization processing

Binarization is a crucial preprocessing step in table recognition. The original table is shown in Fig. 5. To enhance the contrast between the text and the background, the system first applies inverse Otsu binarization to the table image (as shown in Fig. 6). Following this, Connected Component Labeling (CCL) is applied to the white areas, and the table borders are isolated by removing all components except the one labeled 1, which typically represents the outer edges. This process ensures that only the structural elements of the table remain for further processing (as shown in Fig. 7).

Fig. 5.

The tabular data to be processed.

Fig. 6.

Invert binarization processing of the table.

Fig. 7.

Only the table frame remains after labeling the connected components.
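
Inverse Otsu binarization, the first step above, can be sketched without any image library. The code below is a minimal illustration with hypothetical function names (`otsu_threshold`, `inverse_binarize`): it picks the threshold that maximizes the between-class variance of the grayscale histogram, then maps dark pixels (text and table lines) to white (255) and the light background to black (0).

```python
def otsu_threshold(gray):
    """Return the threshold (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    weight_bg, sum_bg = 0, 0
    best_t, best_var = 0, -1.0
    for t in range(256):
        weight_bg += hist[t]          # pixels with value <= t
        sum_bg += t * hist[t]
        weight_fg = total - weight_bg
        if weight_bg == 0 or weight_fg == 0:
            continue
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def inverse_binarize(gray):
    """Dark pixels become 255 (foreground), light pixels become 0."""
    t = otsu_threshold(gray)
    return [[255 if value <= t else 0 for value in row] for row in gray]


# A light page (230) with a short dark vertical stroke (20, 25).
gray = [[230, 230, 230],
        [230, 20, 230],
        [230, 25, 230]]
binary = inverse_binarize(gray)
```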

Segment cell

After binarization, the table structure is further processed to segment individual cells. The result from the previous step is inverted, and Connected Component Labeling is applied again to the white areas (as shown in Fig. 8). The background is then ignored, and region segmentation is performed on the remaining targets to delineate each cell precisely. Once the cells are segmented, character segmentation within each cell is carried out, allowing the extracted characters to be input into a CNN model for recognition (as shown in Fig. 9).

Fig. 8.

After the inversion, perform connected component labeling to obtain the white target region.

Fig. 9.

Cell splitting result.
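
The two labeling passes (isolating the frame, then finding the cells) can be sketched as below. This is an illustrative pure-Python version using 4-connectivity and a raster scan, so that the first foreground pixel encountered, normally the top-left corner of the frame, receives label 1. The connectivity, the function names, and the toy 9 × 9 table are assumptions; the toy table fills the whole image, so there is no outer page background to discard.

```python
from collections import deque


def label_components(mask):
    """4-connected labeling of nonzero pixels in raster order."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or labels[y][x]:
                continue
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


def keep_label(mask, label=1):
    """Keep only the component with the given label (the table frame)."""
    labels, _ = label_components(mask)
    return [[1 if v == label else 0 for v in row] for row in labels]


def cell_boxes(frame):
    """Invert the frame, label the cells, return (x0, y0, x1, y1) boxes."""
    inverted = [[0 if v else 1 for v in row] for row in frame]
    labels, _ = label_components(inverted)
    boxes = {}
    for y, row in enumerate(labels):
        for x, lab in enumerate(row):
            if lab:
                x0, y0, x1, y1 = boxes.get(lab, (x, y, x, y))
                boxes[lab] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return [boxes[k] for k in sorted(boxes)]


# Toy inverse-binarized table: grid lines every 4 pixels, plus one text pixel.
table = [[1 if (y % 4 == 0 or x % 4 == 0) else 0 for x in range(9)]
         for y in range(9)]
table[2][2] = 1                      # stray text pixel inside the first cell
frame = keep_label(table, label=1)   # text pixel is removed, frame remains
cells = cell_boxes(frame)
```

Each returned box can then be cropped from the original image and passed on to character segmentation.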

Text recognition

Character segmentation

Before performing text recognition, a character segmentation step is necessary. The flowchart for this process is shown in Fig. 10.

Fig. 10.

Character segmentation flowchart.

For the text regions detected through document analysis, the goal is to accurately segment all the contained text characters. This helps in obtaining precise results when inputting them into the CNN model later. This step involves the use of image projection techniques. Below is an explanation of how to segment each character from a pure text block in Fig. 11:

  1. Horizontal projection: Perform horizontal projection first. The starting and ending coordinates of each black region on the y-axis represent the positions of each line of text in the image (y-axis coordinates), as shown in Fig. 12a.

  2. Vertical projection: In Fig. 12b, perform vertical projection on each segmented line. The starting and ending coordinates of each black region on the x-axis represent the positions of each character in that line (x-axis coordinates).

  3. Character segmentation: Through the above two steps, the x-axis and y-axis coordinate ranges of all characters in the region can be obtained, and they can be sequentially segmented, as shown in Fig. 12c.

Fig. 11.

Character segmentation flowchart.

Fig. 12.

Text segmentation process (a) horizontal projection: identifies text lines by detecting black region start and end points along the y-axis. (b) Vertical projection: extracts individual characters by detecting black region boundaries along the x-axis within each line. (c) Character segmentation: combines x-axis and y-axis coordinates to accurately segment all characters.
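
The three projection steps can be sketched as follows. The code assumes the text block is given as a 2D list in which 1 marks an ink pixel and 0 marks background; the function names `runs` and `segment_characters` are hypothetical, and the end coordinates are exclusive.

```python
def runs(profile):
    """Return (start, end) pairs, end exclusive, of consecutive nonzero entries."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment_characters(ink):
    """Return character boxes (x0, y0, x1, y1) in reading order."""
    boxes = []
    row_profile = [sum(row) for row in ink]             # horizontal projection
    for y0, y1 in runs(row_profile):                    # one run per text line
        col_profile = [sum(ink[y][x] for y in range(y0, y1))
                       for x in range(len(ink[0]))]     # vertical projection
        for x0, x1 in runs(col_profile):                # one run per character
            boxes.append((x0, y0, x1, y1))
    return boxes


ink = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
boxes = segment_characters(ink)
```

Here the first line yields two characters and the second line yields one, each described by its x and y coordinate ranges.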

However, many Chinese characters have left-right structures. For example, the characters “比”, “加”, and “油” are composed of left and right parts. When performing vertical projection, it can be observed that there are blank areas in the middle of characters like “比” and “加” after zooming in. Therefore, if direct segmentation is performed, this type of character will be split into left and right halves, as shown in Fig. 13.

Fig. 13.

Vertical projection and segmentation results.

Therefore, before performing segmentation, it is necessary to find and merge parts that originally belong to the same character but have been split. These split parts are usually “narrow and long,” meaning that their width is smaller than their height. They can be identified by comparing the blank space between adjacent parts and the ratio of width to height: if the width of both parts is less than 80% of their height, and the length of the blank space between them is less than 30% of the combined length of the two parts, then they are merged into one character for segmentation. The process then continues with further checks and comparisons, as shown in Fig. 14.

Fig. 14.

Segmentation process for enlarged first character.
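
The merging rule for split left-right characters can be sketched as below. Two points are assumptions rather than details from the source: the height of each part is taken to be the height of its text line, and the “combined length of the two parts” is taken to be the sum of their widths. The function name `merge_split_parts` is hypothetical.

```python
def merge_split_parts(spans, line_height, width_ratio=0.8, gap_ratio=0.3):
    """Merge adjacent x-spans (x0, x1) that look like halves of one character."""
    merged = []
    for x0, x1 in spans:
        if merged:
            prev_x0, prev_x1 = merged[-1]
            prev_w, cur_w = prev_x1 - prev_x0, x1 - x0
            gap = x0 - prev_x1
            both_narrow = (prev_w < width_ratio * line_height
                           and cur_w < width_ratio * line_height)
            small_gap = gap < gap_ratio * (prev_w + cur_w)
            if both_narrow and small_gap:
                merged[-1] = (prev_x0, x1)  # fuse into a single character
                continue
        merged.append((x0, x1))
    return merged


# Line height 20: two narrow halves with a 2-pixel gap, then a full-width character.
halves = merge_split_parts([(0, 8), (10, 18), (24, 44)], line_height=20)
# Two narrow parts separated by a wide gap stay separate.
apart = merge_split_parts([(0, 8), (20, 28)], line_height=20)
```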

For cases where the first character of a paragraph is enlarged, this study uses the projection method to determine the height of the first line (H1) after the first round of segmentation. If it is greater than or equal to twice the height of all other lines (H2, H3, etc.), a second round of segmentation is performed. The enlarged first character is segmented independently, and the remaining text lines are segmented and rearranged in the correct order to obtain accurate recognition results, as shown in Fig. 15.

Fig. 15.

The processing flow for handling left-right structured text.
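
The check that triggers the second round of segmentation for an enlarged first character can be written in a few lines. The sketch assumes the line spans come from the first horizontal projection as (y0, y1) pairs with y1 exclusive; the function name `has_enlarged_initial` is hypothetical.

```python
def has_enlarged_initial(line_spans):
    """True if the first line height H1 is at least twice every other line height."""
    heights = [y1 - y0 for y0, y1 in line_spans]
    if len(heights) < 2:
        return False
    return all(heights[0] >= 2 * h for h in heights[1:])


# First span covers an enlarged initial (height 62); later lines are about 30 high.
with_initial = has_enlarged_initial([(0, 62), (70, 100), (108, 139)])
plain_paragraph = has_enlarged_initial([(0, 30), (38, 68), (76, 107)])
```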

This study uses a CNN for text recognition, with Python 3.7 as the development language and PyTorch 1.2 + CUDA 10.0 as the deep learning framework, running on the Windows 10 operating system. After designing and training several different structures, we decided to use a CNN model with 4 convolutional layers, 2 pooling layers, and 2 fully connected layers, with Leaky ReLU as the activation function. The rest of the training parameters are shown in Fig. 16 and Table 3. Before the images are input into the model for training, data augmentation techniques such as color adjustment and affine transformations (rotation and translation) are applied. The brightness, contrast, and saturation of the input images are randomly adjusted by ±30% to simulate various lighting conditions and ink intensities. Additionally, the images undergo rotation within ±3 degrees and horizontal and vertical translation within ±3 pixels to enrich the training samples, as shown in Fig. 17.

Fig. 16.

CNN model architecture.

Table 3.

CNN training parameter settings.

| Epoch | Batch size | Learning rate | Optimizer | Dropout |
|---|---|---|---|---|
| 100 | 64 | 0.0001 | Adam | 0.2 |

Fig. 17.

Rotation and translation of training samples.
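
The augmentation ranges stated above (±30% color adjustment, ±3 degrees, ±3 pixels) can be turned into sampled parameters as in the sketch below. It only builds the 2 × 3 affine matrix and the color factors; applying them to an image is left to whatever image library is used. The 64 × 64 sample size, the rotation about the image center, and all function names are assumptions that do not come from the source.

```python
import math
import random


def affine_about_center(width, height, angle_rad, dx, dy):
    """2x3 matrix for rotation about the image center followed by a shift."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    # p' = R (p - c) + c + t, expanded into a single matrix.
    tx = cx - cos_a * cx + sin_a * cy + dx
    ty = cy - sin_a * cx - cos_a * cy + dy
    return [[cos_a, -sin_a, tx], [sin_a, cos_a, ty]]


def sample_augmentation(width, height, rng, max_deg=3.0, max_shift=3.0, jitter=0.3):
    """Draw one random set of augmentation parameters within the stated ranges."""
    angle = math.radians(rng.uniform(-max_deg, max_deg))
    dx = rng.uniform(-max_shift, max_shift)
    dy = rng.uniform(-max_shift, max_shift)
    color = {name: rng.uniform(1.0 - jitter, 1.0 + jitter)
             for name in ("brightness", "contrast", "saturation")}
    return affine_about_center(width, height, angle, dx, dy), color


def apply_affine(matrix, x, y):
    """Map a single point through the 2x3 affine matrix."""
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])


# Hypothetical 64 x 64 character sample.
fixed = affine_about_center(64, 64, math.radians(3.0), 2.0, -1.0)
center_after = apply_affine(fixed, 31.5, 31.5)  # center moves only by the shift
matrix, color = sample_augmentation(64, 64, random.Random(0))
```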

A rich training dataset can enhance the model’s recognition capabilities. Therefore, this study augmented each training category with not only clean and neat text samples but also samples of various sizes, fonts, and different noise interferences, as shown in Tables 4, 5, 6, and 7.

Table 4.

Various sizes of training samples.

Table 5.

Training samples of various fonts.

Table 6.

Training samples with various proportions of salt-and-pepper noise.

Table 7.

Comparison between the original image and the training sample with Gaussian noise, Gaussian blur.

Through the aforementioned methods of sample augmentation, the training dataset, initially consisting of 5,842 categories, was expanded to approximately 660,000 training samples.

Experimental results and comparison

Document analysis training results

The dataset used for document analysis includes a total of 1720 pages from publicly available sources such as academic papers, press releases, story articles, promotional advertisements, e-books, scanned magazines, and picture books. The dataset contains a total of 8,225 labels. We used 80% of the dataset as the training set, 10% as the validation set, and the remaining 10% as the test set to calculate various evaluation metrics, as shown in Figs. 18 and 19.

Fig. 18.

mAP of document analysis test results.

Fig. 19.

PR curve of document analysis test results.

After 400 epochs of training, the loss value stabilized around 1.3. We tested the model with the validation set at this point and calculated various evaluation metrics, achieving a mean Average Precision (mAP) of 92.9%.

Text recognition training results

The dataset used for text recognition includes 5,842 categories, with an average of about 110 image samples per category, resulting in a total of approximately 660,000 image samples. This study used 80% of the dataset as the training set, 10% as the validation set to observe convergence, and the final 10% as the test set. With document analysis based on YOLOv4 and YOLOv8, various document test samples were used to calculate the Character Error Rate (CER).

CER is a commonly used metric for evaluating the accuracy of speech recognition and text recognition systems. It calculates the proportion of incorrectly recognized characters, accounting for all substitutions, deletions, and insertions in the recognition results, divided by the total number of characters. The formula is as follows:

$$\mathrm{CER} = \frac{S + D + I}{N}$$

S, D, I, and N are the number of substitutions, the number of deletions, the number of insertions, and the number of characters in the reference text (i.e., the ground truth), respectively. A lower CER indicates higher recognition accuracy. This study used several types of document samples with a scanning resolution of 100 dots per inch (DPI) to test currently available OCR software and the proposed model, and calculated their respective CERs for comparison. Commercial software is used as the main comparison object because it is more robust in practical OCR applications. The first and second lowest CER results are highlighted in italic and bold, respectively, in Tables 8, 9, 10, 11, 12, 13, and 14. The data in Table 14 show that the proposed method achieves the lowest or second-lowest CER in most cases, the exception being English recognition.
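
As an illustration of the metric, the CER between a reference string and a recognition result can be computed from the Levenshtein edit distance, which equals the minimum total of S + D + I. The sketch below is a generic implementation, not the evaluation script used in this study; the function name `cer` and the example strings are hypothetical.

```python
def cer(reference, hypothesis):
    """Character error rate: minimum (S + D + I) divided by len(reference)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    n, m = len(reference), len(hypothesis)
    prev = list(range(m + 1))            # distance from empty reference prefix
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution or match
        prev = cur
    return prev[m] / n


substitution = cer("文件辨識", "文件辦識")  # one substituted character out of four
deletion = cer("abcd", "abd")              # one deleted character out of four
insertion = cer("abc", "abxc")             # one inserted character, N = 3
```

Multiplying the result by 100 gives the value as a percentage.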

Table 8.

The CERs of various methods on traditional Chinese character recognition in multiple fonts (1000 each in PMingLiU, SimSun, and Microsoft JhengHei fonts).

| Tesseract | Google | Google Vision API | Adobe Acrobat | Microsoft Office Lens | Ours |
|---|---|---|---|---|---|
| 15.58 | 6.19 | 5.72 | 3.32 | 0.97 | 0.48 |

Table 9.

The CERs of various methods on English character recognition in multiple fonts (1000 each in Calibri and Times New Roman fonts).

| Tesseract | Google | Google Vision API | Adobe Acrobat | Microsoft Office Lens | Ours |
|---|---|---|---|---|---|
| 11.955 | 0.165 | 3.555 | 2.095 | 0.11 | 0.525 |

Table 10.

The CERs of various methods on number character recognition in multiple fonts (500 each in PMingLiU, SimSun, Microsoft JhengHei, and Calibri fonts).

| Tesseract | Google | Google Vision API | Adobe Acrobat | Microsoft Office Lens | Ours |
|---|---|---|---|---|---|
| 0.05 | 0.05 | 0 | 0 | 0.15 | 0 |

Table 11.

The CERs of various methods on 4447 mixed samples of traditional Chinese, English, numbers, and punctuation (traditional Chinese in DFKai-SB font; the others in Times New Roman font).

| Tesseract | Google | Google Vision API | Adobe Acrobat | Microsoft Office Lens | Ours |
|---|---|---|---|---|---|
| 5.79 | 0.82 | 0.84 | 1.32 | 1.85 | 0.65 |

Table 12.

The CERs of various methods on 4447 mixed samples with added blur effects of traditional Chinese, English, numbers, and punctuation. The ratio of Chinese characters to other characters is about 1:1.

| Tesseract | Google | Google Vision API | Adobe Acrobat | Microsoft Office Lens | Ours |
|---|---|---|---|---|---|
| 3.36 | 0.92 | 0.86 | 5.06 | Fail | 0.92 |

Table 13.

The CERs of various methods on 4447 mixed samples of traditional Chinese, English, numbers, and punctuation with added salt-and-pepper noise.

| Tesseract | Google | Google Vision API | Adobe Acrobat | Microsoft Office Lens | Ours |
|---|---|---|---|---|---|
| 3.36 | 2.29 | 2.46 | 6.59 | 5.12 | 1.86 |

Table 14.

Overview CER comparison of Tables 8, 9, 10, 11, 12 and 13.

| Sample type | Tesseract | Google | Google Vision API | Adobe Acrobat | Microsoft Office Lens | Ours |
|---|---|---|---|---|---|---|
| Chinese | 15.58 | 6.19 | 5.72 | 3.32 | 0.97 | 0.48 |
| English | 11.955 | 0.165 | 3.555 | 2.095 | 0.11 | 0.525 |
| Number | 0.05 | 0.05 | 0 | 0 | 0.15 | 0 |
| Mix | 5.79 | 0.82 | 0.84 | 1.32 | 1.85 | 0.65 |
| Mix (blur) | 3.36 | 0.92 | 0.86 | 5.06 | Fail | 0.92 |
| Mix (pepper) | 6.93 | 2.29 | 2.46 | 6.59 | 5.12 | 1.86 |

Ablation study

Our method demonstrates scalability across different scanning resolutions (100, 150, and 300 DPI), as shown in Table 15: the CER for the same content changes little across scanning resolutions. Table 16 shows how the analysis speed and CER vary for different scan resolutions and contents when the original document analysis model, YOLOv4, is replaced by the newer YOLOv8x. The training parameter settings for YOLOv8x are shown in Table 17. The dataset is the JVZY dataset [20], which contains 20,000 paired images and 1.92 million words, annotated with ZhuYin phonetic symbols. This dataset presents challenges such as noise, uneven lighting, and handwriting variants common in real-world Traditional Chinese documents. Using the newer model has little impact apart from reducing the document analysis time and the CER at low resolution.

Table 15.

The CERs of various methods on 4447 mixed samples with different scanning resolutions.

| Scanning resolution | Tesseract | Google | Google Vision API | Adobe Acrobat | Microsoft Office Lens | Ours |
|---|---|---|---|---|---|---|
| 100 DPI | 5.79 | 0.82 | 0.84 | 1.32 | 1.85 | 0.65 |
| 150 DPI | 1.18 | 0.78 | 0.99 | 2.56 | Fail | 0.76 |
| 300 DPI | 1.36 | 1.47 | 1.05 | 2.62 | 2.12 | 0.65 |

Table 16.

The execution time and CER of our method on the JVZY dataset at different DPI levels.

| Scanning resolution (DPI) | 100 | 100 | 200 | 200 | 300 | 300 |
|---|---|---|---|---|---|---|
| Document analysis model | YOLOv4 | YOLOv8x | YOLOv4 | YOLOv8x | YOLOv4 | YOLOv8x |
| Document analysis time (sec.) | 1.74 | 1.2 | 2.22 | 1.48 | 2.77 | 1.81 |
| Table recognition time (sec.) | 2.1 | 2.2 | 4.98 | 4.75 | 6.23 | 6.00 |
| Text recognition time (sec./character) | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| CER | 0.23 | 0.17 | 0.07 | 0.07 | 0.03 | 0.03 |

Table 17.

YOLOv8x training parameter settings.

| Epoch | Batch size | Learning rate | Optimizer | Momentum |
|---|---|---|---|---|
| 100 | 16 | 0.01 | SGD | 0.937 |

To evaluate our method’s effectiveness on multilingual datasets, we used the Devanagari Character Set [21], consisting of 46 character types with 2,000 images per type, totaling 92,000 samples. The dataset was split into training, validation, and test sets in an 8:1:1 ratio. Our model achieved 99.69% accuracy on training, 98.88% on testing, and 98.86% on validation, demonstrating strong generalization across character types.

To further demonstrate cross-lingual performance, we compared our method with PP-OCRv3 [22] using three languages: English (same as Table 9), Traditional Chinese [20], and Devanagari [21]. As shown in Table 18, our method with YOLOv8 consistently outperformed PP-OCRv3, especially in Traditional Chinese and Devanagari, which are more complex and underrepresented in typical OCR systems.

Table 18.

Multi-language average CER comparison table.

| Language | PP-OCRv3 | Ours with YOLOv8 |
|---|---|---|
| English | 0.655 | 0.525 |
| Traditional Chinese | 0.43 | 0.17 |
| Devanagari | 0.63 | 0.01 |

Conclusion

This research proposed a deep learning-based text recognition method that integrates object detection with document structure analysis. By leveraging detection-driven document analysis capabilities, the method can accurately segment and individually process various components within scanned documents—including text, tables, and images—improving the efficiency of content extraction and downstream recognition.

The core recognition module is a CNN-based character recognizer trained on mixed-language data, including Chinese, English, punctuation, and numerals. With carefully designed preprocessing techniques and both offline and online data augmentation strategies, the model achieves low character error rates (CER) across multiple font types and degradation conditions.

Performance evaluations demonstrate the robustness and scalability of our approach. In experiments with the same content at different scanning resolutions, our model achieved CERs of 0.65 at 100 DPI, 0.76 at 150 DPI, and 0.65 at 300 DPI, indicating that differences in resolution have little effect on the stability of our method. Additionally, a better document analysis model can reduce the CER at lower resolutions.

The model’s generalizability across languages was further validated using two benchmark datasets. On the Devanagari Character Dataset, consisting of 92,000 images across 46 character types, our CNN classifier achieved 99.69% accuracy on the training set, 98.88% on the test set, and 98.86% on the validation set.

A comparative evaluation with PP-OCRv3 revealed significant performance gaps. For Traditional Chinese, our model achieved a CER of 0.17 compared to PP-OCRv3’s 0.43; for Devanagari, the gap widened to 0.01 vs. 0.63. Even in English recognition, our model (CER: 0.525) slightly outperformed PP-OCRv3 (CER: 0.655), underscoring the effectiveness of our tailored training approach.

The proposed table recognition pipeline—based on connected component labeling and multi-projection analysis—effectively addresses common OCR challenges such as enlarged initial letters and mixed content segmentation. This approach reduced segmentation errors and improved final recognition accuracy.
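
As a generic illustration of projection analysis, the sketch below splits a small binary image into text lines with a horizontal projection and then splits the first line into character candidates with a vertical projection. It is a textbook projection-profile routine under simplifying assumptions (clean binary input, ink encoded as 1, no skew), not the multi-projection procedure used in this work; the function names and the toy `page` array are hypothetical.

```python
from typing import List, Tuple


def projection_profile(image: List[List[int]], axis: str = "rows") -> List[int]:
    """Count ink pixels (value 1) in every row or every column of a binary image."""
    if axis == "rows":
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def runs_above(profile: List[int], threshold: int = 0) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges where the profile exceeds threshold."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i                      # a run of ink begins
        elif value <= threshold and start is not None:
            runs.append((start, i))        # the run ends at a gap
            start = None
    if start is not None:                  # run reaches the image border
        runs.append((start, len(profile)))
    return runs


# Toy binary page: two text lines separated by an empty row
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
]

lines = runs_above(projection_profile(page, "rows"))          # row ranges of text lines
first_line = page[lines[0][0]:lines[0][1]]                    # crop the first line
chars = runs_above(projection_profile(first_line, "cols"))    # column ranges of characters
```

Raising `threshold` above zero makes the split tolerant of isolated noise pixels, at the cost of possibly cutting thin strokes.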

Overall, our method proves both versatile and practical for multilingual document recognition under real-world noise and degradation conditions. Unlike many commercial OCR systems, our solution performs robustly across languages, formats, and resolutions without requiring manual tuning. Future directions include expanding training data diversity, incorporating transformer-based architectures, and extending support for handwriting and low-resource scripts.

Author contributions

Li is the supervisor of this paper and provided all resources for this paper. Lee and Liu wrote the original manuscript. Li reviewed the manuscript.

Data availability

The datasets used and analysed in this paper are available from the corresponding author on reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Mori, S., Suen, C. Y. & Yamamoto, K. Historical review of OCR research and development. Proc. IEEE 80(7), 1029–1058 (1992).
  • 2. Islam, N., Islam, Z. & Noor, N. A survey on optical character recognition system. arXiv preprint arXiv:1710.05703 (2017).
  • 3. Guzdial, M. & Boulay, B. The history of computing. In The Cambridge Handbook of Computing Education Research. Vol. 11 (2019).
  • 4. Munson, J. H. Computer recognition of hand-printed text. Vis. Lang. 3(1) (1969).
  • 5. Krzyzak, A. Unconstrained handwritten character classification using modified backpropagation model. In Proceedings of 1st International Workshop on Frontiers in Handwriting Recognition. 155–166 (1990).
  • 6. LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W. & Jackel, L. Handwritten digit recognition with a back-propagation network. Adv. Neural Inf. Process. Syst. 2 (1989).
  • 7. LeCun, Y. Convolutional network demo from 1993. YouTube, at www.youtube.com/watch (2014).
  • 8. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998).
  • 9. China (Taiwan), M. Table of Standard Typefaces for Frequently-Used Chinese Characters. (Ministry of Education Republic of China (Taiwan), 1987).
  • 10. Sahare, P. & Dhok, S. B. Script identification algorithms: A survey. Int. J. Multimed. Inf. Retrieval 6, 211–232 (2017).
  • 11. Sahare, P. & Dhok, S. B. Review of text extraction algorithms for scene-text and document images. IETE Tech. Rev. 34(2), 144–164 (2017).
  • 12. Sahare, P. & Dhok, S. B. Multilingual character segmentation and recognition schemes for Indian document images. IEEE Access 6, 10603–10617 (2018).
  • 13. Sahare, P. & Dhok, S. B. Robust character segmentation and recognition schemes for multilingual Indian document images. IETE Tech. Rev. 36(2), 209–222 (2019).
  • 14. Sahare, P. et al. Script independent text segmentation of document images using graph network based shortest path scheme. Int. J. Inf. Technol. 15(4), 2247–2261 (2023).
  • 15. Sahare, P. et al. Script-independent text segmentation from document images. Int. J. Ambient Comput. Intell. (IJACI) 13(1), 1–21 (2022).
  • 16. Sahare, P. & Dhok, S. B. Separation of machine-printed and handwritten texts in noisy documents using wavelet transform. IETE Tech. Rev. 36(4), 341–361 (2019). 10.1080/02564602.2016.1160805.
  • 17. Sahare, P. & Dhok, S. B. Separation of handwritten and machine-printed texts from noisy documents using contourlet transform. Arab. J. Sci. Eng. 43(12), 8159–8177 (2018).
  • 18. Sahare, P. & Dhok, S. B. Script pattern identification of word images using multi-directional and multi-scalable textures. J. Ambient Intell. Hum. Comput. 12(10), 9739–9755 (2021).
  • 19. Sahare, P., Chaudhari, R. E. & Dhok, S. B. Word level multi-script identification using curvelet transform in log-polar domain. IETE J. Res. 65(3), 410–432 (2019).
  • 20. Lo, S.-W., Chou, H.-M. & Wu, J.-H. Joint variation and Zhuyin dataset for traditional Chinese document enhancement. Sci. Data 11(1), 1295 (2024).
  • 21. Rishianand: Kaggle Devanagari Character Set. (2021). https://www.kaggle.com/datasets/rishianand/devanagari-character-set?resource=download. Accessed 28 Apr 2025.
  • 22. Li, C. et al. PP-OCRv3: More attempts for the improvement of ultra lightweight OCR system. arXiv preprint arXiv:2206.03001 (2022).


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
