Integrated framework utilizing scene text detection and recognition techniques for enhancing point of interest extraction from name boards in all Indic languages

Abhishek Kumar Kashyap; Mahima Upadhya; Vikas Singh Panwar; Vikrant Chandrakar

2026 Mar 10;16:12907. doi: 10.1038/s41598-026-40742-w

PMCID: PMC13096107 PMID: 41807484

Abstract

This paper focuses on enhancing text recognition, script identification, language classification, and point of interest (POI) extraction from images captured by Mobile Mapping Systems (MMS). The initiative was undertaken to improve the existing computer vision-based artificial intelligence modules. The advancements are to be contributed to the system’s implementation, improving its functionality and accuracy. The current system, called Text Detection and Recognition (TDR), consists of several neural modules operating in sequential stages. The first module identifies areas of interest within the MMS images, focusing on shop signboards, traffic signs, and directional boards. The second stage detects text words within these areas and crops the relevant pixels from the image. These cropped images are then processed in the third stage, where the language script is detected as one of ten Indian scripts. In the fourth stage, the character recognizer corresponding to the identified script is used to recognize the text. The outputs from these stages are aggregated into a correlated JSON output. Additionally, a parallel fifth stage detects various fields within the MMS images, such as name, address, pin, icon, phone, and GSTIN number, ultimately extracting a comprehensive human-readable address for any POI from the MMS image. The primary focus includes investigating novel text recognition algorithms to improve accuracy and efficiency, exploring various script identification algorithms to enhance language classification, implementing a dictionary-based approach for more accurate word detection, and developing methods for correcting the words that the CRNN model predicts in order to reduce errors. This work is novel because it combines word correction, OCR, detection, classification, and POI field extraction into a single pipeline designed specifically for Indic scripts. By obtaining 96.17% script recognition accuracy, 92.5% word accuracy, and 33% average precision in POI detection, the suggested framework outperforms previous benchmarks such as IndicText (93.6%) and transformer-based OCR (88.5%).

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-026-40742-w.

Keywords: Point of interest (POI) extraction, Scene text recognition, Multilingual script identification, Deep learning-based OCR, Object detection

Subject terms: Engineering, Physics

Introduction

Scene text images are images that contain text naturally occurring in the environment, such as shop signboards, traffic signs, billboards, and other public displays. POI¹ extraction from name boards in many Indic languages is an open and challenging area of computer vision and natural language processing. With urban landscapes becoming increasingly intricate, there is a rising demand for systems that can quickly identify and read text from varied surroundings. This holds particularly in India, where many languages coexist that differ in pronunciation and use different scripts, fonts, and styles. Although this is a challenge, the joint use of scene text detection and recognition methods is a promising direction for reducing errors when extracting name boards from images, which is extremely useful for navigation, tourism, and local business identification². Scene text detection refers to detecting and localizing text in natural images, where factors such as lighting conditions, text orientation, and backgrounds are highly influential³. Because of these variations, traditional methods tend to fail, and deep learning techniques are consequently being developed as more advanced alternatives. Over the last few years, text detectors for uncontrolled environments have improved greatly with the introduction of convolutional neural networks (CNNs)⁴ and other machine learning architectures. In addition, the recognition stage, where extracted text is transformed into machine-readable representations, requires strong optical character recognition (OCR) technologies⁵ capable of addressing the complexities of Indic scripts. The distinctive characteristics of Indic languages make it natural to develop scene text detection and recognition in an integrated framework.
The integrated system uses state-of-the-art segmentation-based methods and individually trained OCR engines to achieve a seamless user experience when interacting with textual information across multiple languages. Extracting points of interest (POIs) from name boards in Indic languages through an integrated scene text detection and recognition framework is an essential step in the progression of urban navigation systems. It drives technological innovation and brings language diversity to the fore in fast-changing urban environments that often risk eradicating it. Guo⁶ has proposed computer vision-based artificial intelligence modules to identify defects. This project seeks to enhance various components of that system.

The existing system, called TDR (Text Detection and Recognition), consists of several neural modules. The first module detects areas of interest in the scene: incoming MMS images are searched for places of interest such as shop signboards, traffic signboards, and green directional boards. In the second stage, text words are detected, and the pixels containing the words are cropped out of the MMS images. The boxes detected in the first stage are intersected with those from the second stage, and the corresponding cropped images are passed to the third stage. In the third stage, the language script is detected; ten Indian scripts are classified at this stage. In the fourth stage, the character recognizer corresponding to the detected script is applied. Finally, all outputs are combined to create a correlated JSON output.
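The intersection of first-stage regions with second-stage word boxes can be illustrated with a minimal sketch. It assumes axis-aligned boxes and a coverage threshold of 0.5; the names `Box`, `overlap_fraction`, `words_in_regions`, and `min_fraction` are hypothetical and are not taken from the TDR implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in image pixels (hypothetical representation)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # Degenerate or inverted boxes count as empty.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def overlap_fraction(word: Box, region: Box) -> float:
    """Fraction of the word box's area that lies inside the region box."""
    inter = Box(max(word.x1, region.x1), max(word.y1, region.y1),
                min(word.x2, region.x2), min(word.y2, region.y2))
    return inter.area() / word.area() if word.area() > 0 else 0.0


def words_in_regions(words, regions, min_fraction=0.5):
    """Keep word boxes that are mostly covered by at least one region box."""
    return [w for w in words
            if any(overlap_fraction(w, r) >= min_fraction for r in regions)]


# Illustrative data: one signboard region and three detected words.
signboard = Box(100, 50, 400, 150)
words = [Box(120, 70, 200, 110),    # fully inside the signboard
         Box(390, 60, 470, 100),    # mostly outside
         Box(500, 300, 560, 330)]   # unrelated text elsewhere in the image
kept = words_in_regions(words, [signboard])
```

Only the first word box survives the filter here. Measuring coverage relative to the word box, rather than using intersection-over-union, is a design choice of this sketch: a small word inside a large signboard has a low IoU but is still clearly part of that board.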

A fifth parallel stage also detects fields in the MMS image, such as name, address, pin, icon, phone, and GSTIN number. A human-readable address is extracted from the MMS image for any point of interest. Kumar et al.⁷ have presented a detailed study of deep learning methods for recognizing Indic scripts; to tackle issues in the previous model, they introduced a new CNN-LSTM network. Nguyen et al.⁸ have proposed enhancing scene text recognition (STR) with a dictionary-guided setup to boost accuracy. STR can be difficult because of varying fonts, sizes, orientations, and backgrounds. Most traditional STR methods rely on Optical Character Recognition (OCR).

Murad and Ali⁹ have presented an end-to-end system for Bangla address information extraction that covers detection, recognition, correction, and parsing. Gunna et al.¹⁰ have presented scene text recognition for Indian languages through transfer learning; a wide variety of fonts improves the ability to recognize non-Latin scene text. Xiong et al.¹¹ have proposed a method that boosts the text spotting rate and works better on complex curved text. Rahul et al.¹² have described a technique that addresses complex backgrounds, which make text detection challenging; the method can be extended to translation and to other languages. Sui et al.¹³ have proposed a framework for integrated text detection and recognition in which shared parameters enhance accuracy and save computational cost. Salunkhe et al.¹⁴ have introduced a method for multilingual text detection in Indian languages, evaluated on ICDAR 2015 and user datasets. Dineshkumar et al.¹⁵ have extracted scene text information by combining character descriptors and stroke configuration maps for recognition. Bixler and Miller¹⁶ have outlined the feasibility of extracting text from scenes, covering techniques for finding text elements and tracking characters. Meena et al.¹⁷ have created a sentiment analysis-based point-of-interest (POI) recommendation framework that mines user reviews to determine preferences and emotional reactions. The system improves the personalization and relevance of location-based suggestions by incorporating sentiment polarity into the recommendation pipeline. Although techniques like ResNet and YOLOv5 are commonplace, this work is novel in that it adapts, customizes, and integrates them into a modular pipeline for Indic multilingual POI extraction. As far as we are aware, no previous research has systematically combined these elements with middleware integration and dictionary-based correction for low-resource Indic scripts.

Research gaps, objective and motivation

In multilingual regions like India, precise language recognition is essential for image and video analysis. Character recognition is challenging in Mobile Mapping System (MMS) imagery because text frequently appears in intricate, cluttered, and low-quality conditions, particularly when scripts share visual resemblance. When used in real-world Indic settings with overlapping fonts and diverse scripts, conventional Text Detection and Recognition (TDR) pipelines perform only reasonably well (e.g., ~87% accuracy). Because Indic scripts remain absent from most popular scene text datasets and pipelines, these issues call for more reliable solutions.

The main goal of this research is to close the performance gap by adding specific, adaptable components to the TDR pipeline. The study aims to improve text recognition through post-OCR dictionary correction, optimize the pipeline for point of interest (POI) processing, and refine script classification models using Indic script data. Standard approaches like ResNet and YOLOv5 are used, but the unique contribution is how they are modified, coordinated, and assessed for the Indian setting. A coherent, error-tolerant pipeline for multilingual scene text understanding is enabled by a post-processing correction layer that uses Levenshtein distance and greedy search, language-specific parameter tuning, and a carefully selected dataset of more than ten Indic scripts from MMS imagery.

Questions for research

The following research questions serve as the basis for this work:

How can script recognition models be improved to differentiate visually similar Indic scripts in poor-quality MMS images?
Can dictionary-driven post-OCR correction significantly increase recognition accuracy in multilingual settings?
How much does integrating upgraded components (YOLOv5, FastAI, CRNN, and correction logic) into a modular TDR pipeline improve POI extraction overall?
In terms of precision, adaptability, and language coverage, how does the improved TDR system compare with current methods?

These questions inform the system’s design choices and performance assessment, with the aim of quantifiable improvements over traditional single-script, high-resolution pipelines.

Principal contributions

Rather than proposing completely new architectures, this work engineers a robust adaptation of current state-of-the-art frameworks with language-specific datasets, dictionary-driven enhancement, and a middleware coordination layer, enabling practical end-to-end POI extraction from noisy MMS imagery in multiple Indic scripts.

This study introduces a scientifically based, engineering-focused pipeline that makes significant progress in multilingual scene text recognition and POI extraction, especially for low-resource Indic scripts under challenging circumstances. Despite using well-known modules like YOLOv5 and ResNet, the framework is unique because of its customized adaptation, integration approach, and post-processing logic, none of which have been thoroughly combined and assessed for Indic MMS imagery in the literature.

The following are the primary scientific and technical contributions:

1. CNN prediction for multilingual Indic languages based on script:

A carefully selected and balanced dataset comprising more than ten Indic scripts is used to train a script recognition component. By creating dataset augmentation techniques (such as contrast jittering, shearing, and noise simulation) to mimic natural MMS image distortions, we were able to improve script-level classification accuracy to 96.17%, which is higher than state-of-the-art benchmarks like IndicText (93.6%).

2. Dictionary-based OCR pipeline integration with a correction engine:

We proposed a post-OCR correction component that corrects CRNN outputs by using TF-IDF embeddings, greedy best-first search, and Levenshtein distance. This language-specific correction layer greatly enhanced word-level recognition in low-resource, noisy scripts, with accuracy gains of up to 17% over raw CRNN predictions.

3. Creation of a reliable middleware layer for flexible coordination of AI modules:

Text detection, script classification, OCR, and POI field extraction are loosely coupled components that rely on a middleware layer for information exchange, confidence-based filtering, and routing. This advances software engineering for practical deployment in modular and error-prone settings.

4. Real-world, low-resource POI field identification with an updated YOLOv5 model:

Our trained YOLOv5 outperformed previous models such as YOLOv4-BiLSTM (21%), achieving an average precision of 33% despite a challenging dataset (class imbalance, fine-grained objects, script clutter). Its outputs support downstream structured inference for names, icons, phone numbers, and GSTINs.

5. Complete POI inference pipeline with structured JSON output:

The finished pipeline is a practical deployment enabler because it produces machine-consumable, semantically organized address data derived from images. The technique supports multilingual POI recognition under real-world occlusion, deformation, and script overlap conditions not covered in previous work.

When combined, these efforts create a brand-new integrated system for Indic multilingual POI extraction that outperforms current techniques in quantifiable ways.
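To make the dictionary-based correction idea concrete, the following is a minimal sketch of nearest-word lookup by Levenshtein distance. It is an illustration under stated assumptions, not the authors' correction engine: it omits the TF-IDF embeddings and greedy best-first search mentioned above, and the names `levenshtein`, `correct_word`, and `max_distance`, as well as the sample dictionary, are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def correct_word(predicted: str, dictionary, max_distance: int = 2) -> str:
    """Return the closest dictionary word, or the input if nothing is close enough."""
    best = min(dictionary, key=lambda w: levenshtein(predicted, w), default=None)
    if best is None or levenshtein(predicted, best) > max_distance:
        return predicted  # leave the OCR output unchanged
    return best


# Illustrative per-language dictionary (assumed content).
dictionary = ["medical", "bakery", "hotel"]
fixed = correct_word("medcal", dictionary)      # one missing character
unchanged = correct_word("xyzxyz", dictionary)  # too far from every entry
```

Python strings are sequences of Unicode code points, so in Indic scripts a dependent vowel sign or other combining mark counts as its own edit unit in this sketch. The `max_distance` cut-off is one possible guard against replacing a correct but out-of-dictionary word with an unrelated entry.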

Methodology

Architecture of the TDR system

The suggested system is built on a five-stage TDR (Text Detection and Recognition) process intended for reliable multilingual scene text understanding and POI extraction from an image. Every stage is in charge of a distinct subtask, and the modular design permits independent optimization and assessment. The improved components, including YOLOv5 for object detection, a FastAI-based ResNet for script classification, and a dictionary-driven correction component, are mapped to the corresponding pipeline stages in the flowchart of the improved TDR system shown in Fig. 1. In particular, Stage 1 uses YOLOv5 to identify potential POI regions, Stage 2 subsequently crops text portions from the bounding boxes, and Stage 3 uses fine-tuned ResNet variants to classify the language. Stage 4 uses a CRNN to perform OCR and applies post-processing corrections determined by greedy search and Levenshtein distance. Lastly, structured POI fields (such as name, phone number, and GSTIN) are extracted in Stage 5 and output in JSON format. The flowchart makes the component-to-stage mapping clear, and the results section discusses how enhancements in each module affect overall effectiveness.

Fig. 1 — Flowchart of the process of the TDR framework.
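As a rough illustration of how stage outputs might be merged into the final JSON record, the sketch below filters word-level results by confidence and attaches the POI fields. The record layout, the `WordResult` and `build_poi_record` names, the `min_conf` threshold, and the sample values are all assumptions made for this example; the paper does not specify its JSON schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class WordResult:
    """Per-word output of Stages 2 to 4 (hypothetical structure)."""
    box: tuple          # (x1, y1, x2, y2) in image pixels
    script: str         # Stage 3 script label
    script_conf: float  # Stage 3 confidence
    text: str           # Stage 4 recognized (and corrected) text
    text_conf: float    # Stage 4 confidence


def build_poi_record(image_id, words, fields, min_conf=0.5):
    """Drop low-confidence words and merge the rest with Stage 5 fields."""
    kept = [w for w in words if min(w.script_conf, w.text_conf) >= min_conf]
    return {
        "image_id": image_id,
        "words": [asdict(w) for w in kept],
        "fields": fields,
    }


# Illustrative inputs (made-up values).
words = [
    WordResult((120, 70, 200, 110), "Devnagari", 0.97, "मेडिकल", 0.91),
    WordResult((210, 70, 300, 110), "English", 0.42, "St0re", 0.35),
]
record = build_poi_record("mms_000123.jpg", words,
                          {"name": "मेडिकल", "phone": "0000000000"})
# ensure_ascii=False keeps Indic text readable in the serialized output.
payload = json.dumps(record, ensure_ascii=False, indent=2)
```

Using the minimum of the two confidences is one simple policy: a word is kept only if both the script classifier and the recognizer are reasonably sure about it.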

Language classification

The project employs a combination of machine learning, deep learning, and artificial intelligence techniques. We employed the Fastai CNN-LSTM methodology, CRNN, attention networks, computer vision, and natural language processing techniques. Training datasets comprising diverse textual samples in various languages are used to train and fine-tune the algorithms. The methodology involves iterative algorithm development, testing, and refinement to achieve the stated objectives. Fastai is a popular library for machine learning, particularly deep learning tasks. It is built on PyTorch and provides a high-level API that simplifies building, training, and deploying deep learning models.

Classification of dataset based on script

We selected a dataset of labeled word images from ten Indic scripts and English to train and assess our script classification component. To guarantee script balance, these samples were taken from actual Mobile Mapping System (MMS) imagery and supplemented with open-source datasets and synthetic augmentation. Table 1 presents the distribution of the dataset by language (Table 2). A total of 45,050 training, 9000 validation, and 9000 test word-level image crops made up the dataset used for training the ResNet-based script classification models. Most samples were extracted from real-world MMS imagery using bounding boxes with text region annotations. To guarantee script diversity and balance, we added samples from publicly accessible, script-tagged datasets such as CVSI (Classification of Video Script Images) and IndicSceneText (limited usage). For scripts with few real-world samples, we created extra synthetic data using Google Fonts, with script-specific styling to simulate a natural appearance.

Table 1.

Distribution of dataset based on language.

Language	Test samples	Validation samples	Training samples
Bengali	900	900	4500
English	900	900	4800
Kannada	900	900	4300
Odia	900	900	4100
Tamil	900	900	4700
Devnagari	900	900	5300
Gujarati	900	900	4100
Malayalam	900	900	4100
Punjabi	900	900	4300
Telugu	900	900	4850
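The totals quoted in the text can be checked against Table 1 with a few lines of code. The dictionary below is transcribed from the table; the name `TABLE_1` and the tuple layout are choices made for this sketch.

```python
# (test, validation, training) sample counts per language, copied from Table 1.
TABLE_1 = {
    "Bengali":   (900, 900, 4500),
    "English":   (900, 900, 4800),
    "Kannada":   (900, 900, 4300),
    "Odia":      (900, 900, 4100),
    "Tamil":     (900, 900, 4700),
    "Devnagari": (900, 900, 5300),
    "Gujarati":  (900, 900, 4100),
    "Malayalam": (900, 900, 4100),
    "Punjabi":   (900, 900, 4300),
    "Telugu":    (900, 900, 4850),
}

# Column-wise sums over all languages.
test_total, val_total, train_total = (sum(col) for col in zip(*TABLE_1.values()))

# Spread of the training split, a rough indicator of class balance.
train_counts = [row[2] for row in TABLE_1.values()]
train_min, train_max = min(train_counts), max(train_counts)
```

The column sums reproduce the 9000 test, 9000 validation, and 45,050 training crops stated above, and the per-language training counts range from 4100 to 5300.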

Training configuration (ResNet50): batch size = 90, epochs = 20, learning rate = 0.0014454, resize = 170, GPU = 6403 MiB
Language	Accuracy	Total count	Correct count	Incorrect count	Architecture
Bengali	93.2	2088	1946	142	ResNet50 and fastai
English	94.35	5715	5392	323	ResNet50 and fastai
Kannada	88.95	1131	1006	125	ResNet50 and fastai
Odia	95.71	2306	2207	99	ResNet50 and fastai
Tamil	90.46	1751	1584	167	ResNet50 and fastai
Devnagari	95.34	3931	3748	183	ResNet50 and fastai
Gujarati	94.54	3040	2874	166	ResNet50 and fastai
Malayalam	94.94	4108	3900	208	ResNet50 and fastai
Punjabi	97.32	2609	2539	70	ResNet50 and fastai
Telugu	96.02	4753	4564	189	ResNet50 and fastai
Average	94.083
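The "Average" row above is the unweighted mean of the per-language accuracies. A short sketch, using counts transcribed from the table, shows how that figure relates to the pooled accuracy over all samples; the variable names are choices made for this example.

```python
# (total count, correct count) per language, copied from the ResNet50 table.
RESNET50_COUNTS = {
    "Bengali":   (2088, 1946),
    "English":   (5715, 5392),
    "Kannada":   (1131, 1006),
    "Odia":      (2306, 2207),
    "Tamil":     (1751, 1584),
    "Devnagari": (3931, 3748),
    "Gujarati":  (3040, 2874),
    "Malayalam": (4108, 3900),
    "Punjabi":   (2609, 2539),
    "Telugu":    (4753, 4564),
}

# Per-language accuracy in percent.
per_language = {lang: 100.0 * correct / total
                for lang, (total, correct) in RESNET50_COUNTS.items()}

# Macro average: every language counts equally (matches the "Average" row).
macro_accuracy = sum(per_language.values()) / len(per_language)

# Micro average: every sample counts equally (pooled over all languages).
total_samples = sum(total for total, _ in RESNET50_COUNTS.values())
total_correct = sum(correct for _, correct in RESNET50_COUNTS.values())
micro_accuracy = 100.0 * total_correct / total_samples
```

With these counts the macro average comes out at about 94.08%, in line with the table, while the pooled figure is roughly 94.68%. The pooled figure is higher because the larger evaluation sets (for example English and Telugu) score above the smaller Kannada and Tamil sets.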

Language	Beam search accuracy	Greedy search accuracy	Levenshtein distance accuracy
Bengali	7%	65%	65%
Gujarati	4%	89%	89%
Kannada	1%	70%	70%
Odia	7%	57%	57%
Tamil	1%	90%	90%
English	36%	36%	36%
Hindi	8%	84%	84%
Malayalam	5%	43%	43%
Punjabi	8%	92%	92%
Telugu	1%	79%	79%

Sl. No	Parameters	Metric	Proposed system	State-of-the-art method	Reference and year
1	Scene text understanding	Word accuracy	92.5%	88.5%	Selvam et al.¹⁹
2	Scene text identification	Word accuracy	92.5%	91.3%	Vijayan et al.²⁰
3	Script detection	Accuracy	96.17%	93.6%	Lunia et al.¹⁸
4	Point of interest field identification	Average precision	0.33	0.21	Khalid et al.²¹
5	Word correction	Accuracy gain	9 to 17%	Not available
