Printed document layout analysis and optical character recognition system based on deep learning

Dong-Lin Li; Shih-Kai Lee; Yin-Ting Liu

2025 Jul 3;15:23761. doi: 10.1038/s41598-025-07439-y

PMCID: PMC12229442 PMID: 40610547

Abstract

This paper proposes a layout analysis and text recognition system for printed documents based on deep learning. Initially, scanned documents or image files are processed using a layout analysis algorithm based on YOLOv4 and YOLOv8 deep learning to identify the positions of titles, text paragraphs, tables, and images within the document. Each of these categories undergoes specific character segmentation processing. Then, the content is recognized using a text recognition algorithm based on Convolutional Neural Networks (CNN). Finally, the recognized text is integrated and output in editable formats, such as JSON or Microsoft formats. Our proposed method enables convenient, fast, and highly accurate OCR processing on a local computer.

Keywords: OCR, Layout analysis, CNN, YOLO, Deep learning

Subject terms: Computer science, Information technology

Introduction

Optical character recognition (OCR) refers to the process of recognizing and processing the content of paper documents or image files containing text to obtain text content and layout information. This process includes input image preprocessing, recognition processing, and output post-processing, as shown in Fig. 1.

OCR is not a recent topic. The concept was first realized in 1929 by the German scientist Gustav Tauschek, who designed a computational system using light, documents, and templates based on the principle of a punching machine, giving machines a rudimentary image recognition capability¹,². In 1951, the American scientist David Hammond Shepard invented a reader called “Gismo”³, which can be considered the earliest and most complete OCR device. Gismo could recognize music symbols and text produced by a standard typewriter, but only 23 characters. In Japan, where square characters are used, OCR research focusing on numbers began in the 1960s. By the 1970s, some simple number recognition inventions had been made, and the first relatively complete OCR product was the IBM 1418, but it could only recognize printed numbers, English letters, and some punctuation marks in designated fonts⁴.

In the 1980s, the rise of neural network theory enabled computers to learn efficiently from data. In 1990, Adam Krzyzak et al. published “Unconstrained handwritten classification using modified backpropagation model”⁵, proposing a recognition system for a large number of handwritten characters. The system used a multilayer neural network classifier with a backpropagation adaptive learning method. It learned from Fourier descriptors and from structural features of characters that are invariant to size and displacement. The final accuracy was better than that of traditional classifiers.

Yann LeCun first used the idea of “convolution” in his 1989 publication “Handwritten Digit Recognition with a Back-Propagation Network”⁶, and in 1993, he demonstrated the use of Convolutional Neural Networks (CNN) to recognize various digit recordings⁷. This became a significant starting point for the widespread application of CNN in computer vision.

In 1998, Yann LeCun et al. published “Gradient-Based Learning Applied to Document Recognition”⁸. They mainly used the LeNet-5 architecture and created training samples from the digits, English letters, and special symbols in the ASCII character set. The model was trained using a gradient-based backpropagation learning algorithm. By combining deep learning approaches that simulate the human visual system to recognize text images, this work made OCR more a matter of mathematical computation than of purely optical processing. In recent years, deep learning has developed rapidly, and many emerging applications, both domestic and international, such as license plate recognition, road sign recognition, document identification, and production line recognition, use text recognition technology. However, these usually focus on numbers and English letters, with less research on Chinese characters. This is likely due to the complex structure of Chinese characters and the large number in common use: 4,808 as recorded by the Ministry of Education in Taiwan⁹, compared with the 52 uppercase and lowercase English letters, which makes training much more challenging.

The goal of this research is to propose a more comprehensive OCR system for printed Chinese and English text. This system does not just input a single text image into a neural network model to obtain recognition results. Instead, it allows users to directly input a document file or a scanned document image for recognition, encompassing all the processing steps of a complete OCR software. Firstly, the method uses object detection to analyze the positions of all titles and text paragraphs in the document. Then, it segments the detected areas into characters and uses a trained CNN model for text recognition. Finally, it outputs the results in different formats according to their categories.
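The flow described above (detect regions, route each one by category, then merge the results into an editable output) can be sketched as follows. This is only an illustrative outline: the `Region` structure, the placeholder recognizers, the top-to-bottom ordering rule, and the decision to keep image regions as bare bounding boxes are all assumptions made for the example and are not taken from the paper.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    """One detected layout region (hypothetical structure, not from the paper)."""
    label: str  # "title", "text", "table", or "image"
    x: int      # left edge in pixels
    y: int      # top edge in pixels
    w: int      # width in pixels
    h: int      # height in pixels


def recognize_text(region: Region) -> str:
    # Placeholder standing in for character segmentation + CNN recognition.
    return f"<text from {region.w}x{region.h} region>"


def recognize_table(region: Region) -> List[List[str]]:
    # Placeholder standing in for cell segmentation + per-cell recognition.
    return [["<cell>"]]


def assemble_document(regions: List[Region]) -> str:
    """Route each region by label and emit JSON in top-to-bottom order."""
    handlers: Dict[str, Callable[[Region], object]] = {
        "title": recognize_text,
        "text": recognize_text,
        "table": recognize_table,
    }
    items = []
    # Sort by top edge, then left edge, as a simple reading-order heuristic.
    for region in sorted(regions, key=lambda r: (r.y, r.x)):
        handler = handlers.get(region.label)
        # Regions without a handler (e.g. images) keep only their bounding box.
        content = handler(region) if handler else None
        items.append({
            "label": region.label,
            "box": [region.x, region.y, region.w, region.h],
            "content": content,
        })
    return json.dumps(items, ensure_ascii=False, indent=2)
```

A multi-column page would need a more careful reading-order rule than the simple sort used here.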

To ensure the text recognition model has excellent recognition capabilities and can simultaneously process Chinese and English text, the CNN training dataset includes samples of Chinese characters in various fonts, English letters, numbers, and common punctuation marks. It also adds different types of noise interference samples to each category and incorporates data augmentation techniques such as rotation and translation during training, enabling the model to learn more possible scenarios and enhance its recognition capabilities.
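As a rough illustration of the augmentation idea, the sketch below applies a random translation and impulse (pepper-style) noise to a small binary glyph image. It is a minimal pure-Python example under stated assumptions: the image representation, the function names, and the default shift and noise rate are hypothetical, and rotation is omitted for brevity.

```python
import random
from typing import List

Image = List[List[int]]  # 0 = background, 1 = ink


def translate(img: Image, dx: int, dy: int) -> Image:
    """Shift the glyph by (dx, dy) pixels; exposed pixels become background."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = img[y][x]
    return out


def add_impulse_noise(img: Image, rate: float, rng: random.Random) -> Image:
    """Flip each pixel with probability `rate` (binary analogue of salt-and-pepper noise)."""
    return [[1 - p if rng.random() < rate else p for p in row] for row in img]


def augment(img: Image, rng: random.Random,
            max_shift: int = 2, noise_rate: float = 0.02) -> Image:
    """Apply one random translation followed by impulse noise."""
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return add_impulse_noise(translate(img, dx, dy), noise_rate, rng)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which is convenient when comparing training runs.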

Literature review

Document analysis plays a crucial role in enabling efficient and accurate optical character recognition (OCR), particularly for multilingual document images. Script identification has been widely recognized as an essential pre-processing step to select the appropriate OCR model for different scripts. Despite extensive research, script identification still faces challenges such as faded document images, variable illumination conditions, and positional distortions during scanning. Additionally, noise remains a significant obstacle that can only be minimized but not completely eliminated¹⁰. Comprehensive reviews have classified and compared various script identification schemes, highlighting their merits, limitations, and potential future research directions¹⁰.

Beyond script identification, extracting text information from images is fundamental for efficient indexing and retrieval. Recognizing characters after text extraction simplifies the search process by allowing users to access indexed images without exhaustive manual search. However, text extraction remains challenging, particularly when dealing with skewed and multi-skewed text lines in printed and handwritten documents. Designing a real-time system capable of achieving high recognition rates across varying document types and fonts remains an open challenge¹¹. Comparative analyses of different extraction approaches emphasize the critical need to address these issues to advance document image analysis technologies¹¹.

Machine learning techniques also play a crucial role in enhancing OCR, particularly for character segmentation and recognition in multilingual documents. Recent studies have proposed robust algorithms tailored for Indian documents containing Latin and Devanagari scripts, which often present complex layouts and local skews¹²,¹³. These methods involve preprocessing steps such as noise reduction and illumination correction, followed by character segmentation based on structural properties. Graph distance theory is applied to separate overlapping characters, and Support Vector Machine (SVM) classifiers are used to validate segmentation results. For recognition, geometric and shape-based features are computed and classified using k-Nearest Neighbor (k-NN) classifiers, achieving segmentation and recognition accuracies up to 98.86% and 99.84%, respectively.

Segmentation is a fundamental step that directly impacts OCR accuracy. Several approaches have been developed to improve text-line and word segmentation in complex document layouts. One study¹⁴ proposed a script-independent segmentation method using Dijkstra’s algorithm for text-line segmentation and wavelet transforms for word segmentation. This technique achieved 97.6% text-line segmentation accuracy and 98.1% word segmentation accuracy. Alternatively, another study¹⁵ introduced a method using the fast marching method for text-line segmentation and wavelet transforms with connected components (CCs) labeling for word segmentation, achieving even higher accuracies of 98.9% and 99.1%, respectively. These segmentation techniques are critical for ensuring high performance in multilingual and mixed-format document recognition systems.

Noise removal is another crucial preprocessing step in document image analysis. In mixed-content documents such as bank cheques and admission forms, noise and the coexistence of handwritten and machine-printed texts pose significant challenges. One study¹⁶ proposed a method combining two-dimensional discrete wavelet transforms and semi-decimated discrete wavelet transforms, treating noise explicitly as a separate class during classification. This method achieved an average identification rate of 98.02%. Similarly, another study¹⁷ employed contourlet transform-based feature extraction combined with SVM classifiers, achieving a maximum identification recall rate of 98.9%. These studies confirm that effective noise separation is essential for robust OCR systems.

To further enhance the feasibility of document analysis across multiple languages, including Indian scripts and other multilingual environments, recent efforts have focused on script identification at the word level. One study¹⁸ proposed a method using scale- and rotation-robust log-polar wavelet and semi-decimated wavelet features. Text-blobs were segmented using Gaussian filtering, and texture features were extracted in the log-polar domain, which mitigated rotational and scale variations and yielded a maximum recall rate of 98.96%. Similarly, another study¹⁹ introduced a method based on log-polar curvelet features, which have strong directional and anisotropic properties for feature extraction. Experiments across multiple datasets, including both printed and handwritten documents, achieved a maximum recall rate of 98.76%.

These advancements demonstrate the effectiveness and feasibility of robust, script-independent document processing systems capable of handling multilingual and complex document scenarios.

Research methodology

This study proposes a document processing methodology based on deep neural networks (DNN) to enhance document recognition and character analysis accuracy, as shown in Fig. 2. The overall process consists of conversion, DNN-based document type classification, binarization, cell segmentation, character segmentation, and DNN-based character recognition.

First, the system converts the input document into a suitable format and applies the DNN-based document type classification to identify the document structure, so that each region receives an appropriate processing strategy. Next, binarization, a critical preprocessing step in table recognition, enhances the contrast between text and background. The table structure is then segmented into individual cells, facilitating precise character recognition in later stages.
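This passage does not name a specific thresholding method for binarization. As one common choice, and purely as an assumption for illustration, the sketch below uses Otsu's method, which picks the gray level that maximizes the between-class variance of the histogram. The function names and the convention that ink maps to 1 are hypothetical.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels with value <= t form one class, the rest form the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(256))

    best_t, best_var = 0, -1.0
    count0 = 0  # number of pixels with value <= t
    sum0 = 0    # sum of those pixel values
    for t in range(256):
        count0 += hist[t]
        sum0 += t * hist[t]
        if count0 == 0:
            continue  # lower class still empty
        count1 = total - count0
        if count1 == 0:
            break     # upper class empty from here on
        mean0 = sum0 / count0
        mean1 = (sum_all - sum0) / count1
        var_between = count0 * count1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray: List[List[int]]) -> List[List[int]]:
    """Map dark pixels (<= threshold) to 1 (ink) and light pixels to 0."""
    t = otsu_threshold([p for row in gray for p in row])
    return [[1 if p <= t else 0 for p in row] for row in gray]
```

A global threshold of this kind works best on evenly lit scans; unevenly illuminated pages usually call for a local or adaptive threshold instead.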

To improve text readability, the system applies skew correction so that the text is properly aligned and the characters are clearly visible. Character segmentation then extracts individual characters, which are passed to the DNN-based character recognition.
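The segmentation algorithm itself is not detailed in this passage. One widely used baseline, shown here only as an assumed illustration, is a vertical projection profile: count the ink pixels in each column of a binarized, deskewed text line and cut wherever a run of blank columns appears. The function names and the `min_gap` parameter are hypothetical.

```python
from typing import List, Tuple


def column_profile(binary: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in each column of a binarized text line."""
    return [sum(col) for col in zip(*binary)]


def segment_characters(binary: List[List[int]],
                       min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, one per character.

    A character ends once at least `min_gap` consecutive blank columns follow it.
    """
    profile = column_profile(binary)
    spans: List[Tuple[int, int]] = []
    start = None  # first column of the current character, if any
    gap = 0       # blank columns seen since the last ink column
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        # Close the last character, excluding any trailing blank columns.
        spans.append((start, len(profile) - gap))
    return spans
```

Raising `min_gap` merges pieces separated by narrow gaps, which matters for Chinese characters whose left and right components can be split by a blank column; a production system would typically also use expected character width.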

Finally, the system generates the output, which can be utilized for further text analysis or data storage. By integrating deep learning with image processing techniques, this methodology significantly improves document recognition accuracy and can be applied in optical character recognition (OCR) and other text processing applications.

Document analysis

DNN training document type classification

The input to YOLOv4 must be an image, so PDF files are first converted to high-quality PNG images to preserve clarity. YOLO is used for document analysis, implemented in Python 3.7 with PyTorch 1.2 and CUDA 10.0 on a Windows 10 system. The model is based on the CSPDarkNet53 architecture, with training settings chosen to improve efficiency and accuracy; the training parameters are shown in Table 1.

Table 1.

YOLOv4 training parameter settings.

Training stage	Epochs	Batch size	Learning rate	Optimizer
Freeze train	30	8	0.001	Adam
Unfreeze train	370	3	0.0001	Adam
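The two-stage schedule in Table 1 can be written down as a small configuration object. The sketch below is framework-free and illustrative only: the `Stage` class, the `freeze_backbone` flag (which assumes that "Freeze train" means the backbone weights are held fixed), and the helper `stage_for_epoch` are hypothetical names, while the numeric values come from Table 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    epochs: int
    batch_size: int
    learning_rate: float
    optimizer: str
    freeze_backbone: bool  # assumption: "freeze" means backbone weights are fixed


# Values taken from Table 1.
SCHEDULE: List[Stage] = [
    Stage("freeze", 30, 8, 1e-3, "Adam", True),
    Stage("unfreeze", 370, 3, 1e-4, "Adam", False),
]


def stage_for_epoch(epoch: int, schedule: List[Stage] = SCHEDULE) -> Stage:
    """Return the stage that owns a 0-based global epoch index."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    remaining = epoch
    for stage in schedule:
        if remaining < stage.epochs:
            return stage
        remaining -= stage.epochs
    raise ValueError(f"epoch {epoch} is beyond the end of the schedule")
```

With these settings the schedule covers 400 epochs in total, and a training loop could call `stage_for_epoch` at the start of each epoch to pick the batch size and learning rate.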

Label name	Classification criteria
Title	Title, numbered titles, key prompts, single-line annotations
Text	Long texts, general text paragraphs
Table	Tabular data
image	Photos, pictures, trademarks, handwritten signatures
chart	Various statistical charts: bar charts, line charts, pie charts, etc.

Sample type	Tesseract	Google	Google Vision API	Adobe Acrobat	Microsoft Office Lens	Ours
Chinese	15.58	6.19	5.72	3.32	0.97	0.48
English	11.955	0.165	3.555	2.095	0.11	0.525
Number	0.05	0.05	0	0	0.15	0
Mix	5.79	0.82	0.84	1.32	1.85	0.65
Mix(blur)	3.36	0.92	0.86	5.06	Fail	0.92
Mix(pepper)	6.93	2.29	2.46	6.59	5.12	1.86

Resolution	Tesseract	Google	Google Vision API	Adobe Acrobat	Microsoft Office Lens	Ours
100 DPI	5.79	0.82	0.84	1.32	1.85	0.65
150 DPI	1.18	0.78	0.99	2.56	Fail	0.76
300 DPI	1.36	1.47	1.05	2.62	2.12	0.65

Scanning resolution (DPI)	100	100	200	200	300	300
Document Analysis model	YOLOv4	YOLOv8x	YOLOv4	YOLOv8x	YOLOv4	YOLOv8x
Document Analysis Time (sec.)	1.74	1.2	2.22	1.48	2.77	1.81
Table Recognition Time (sec.)	2.1	2.2	4.98	4.75	6.23	6.00
Text Recognition Time (sec./character)	0.01	0.01	0.01	0.01	0.01	0.01
CER	0.23	0.17	0.07	0.07	0.03	0.03

Language	PP-OCRv3	Ours with YOLOv8
English	0.655	0.525
Traditional Chinese	0.43	0.17
Devanagari	0.63	0.01
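The CER row in the tables above refers to character error rate. The exact formula is not spelled out in this section; a common definition, assumed here, is the Levenshtein edit distance between the recognized text and the reference text divided by the reference length. The sketch below returns a fraction, so it would need to be multiplied by 100 if the tabulated values are percentages, which this section does not state.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h into the reference
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate as a fraction of the reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Because the distance is computed over Python string characters, the same function applies to Chinese, English, and mixed text without modification.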
