Heliyon. 2024 Oct 10;10(20):e39089. doi: 10.1016/j.heliyon.2024.e39089

Deep transformer-based architecture for the recognition of mathematical equations from real-world math problems

Tanjim Taharat Aurpa a, Kazi Noshin Fariha b, Kawser Hossain b, Samiha Maisha Jeba b, Md Shoaib Ahmed c,d, Md Rawnak Saif Adib b, Farhana Islam a, Farzana Akter a
PMCID: PMC11620133  PMID: 39640623

Abstract

Identifying mathematical equations from real-world math problems is a unique and challenging task within the field of Natural Language Processing (NLP). It has a wide range of applications, such as academics, digital content design, and the development of automatic or interactive learning systems. However, accurately interpreting these equations remains difficult because of the intrinsic complexity of mathematical symbols and the variety of structural formats. Additionally, the unique syntax, diverse symbols, and complex structure of mathematical equations present significant obstacles that traditional NLP methods and Optical Character Recognition (OCR) systems struggle to overcome. In this research, we utilize deep transformer architectures to recognize mathematical equations, using our novel dataset of 3,433 distinct observations. This dataset, collected to include a diverse range of mathematical equations, is used to predict six basic equations (z=x+y, z=x−y, z=x×y, z=x/y, y=x!, and y=√x). We applied different transformer-based architectures, such as BERT, ELECTRA, XLNet, RoBERTa, and DistilBERT, and BERT performed best with 99.80% accuracy. To the best of our knowledge, this is the first NLP work in any language that recognizes equations from mathematical text.

Keywords: Equation recognition, Transformer-based learning, Mathematical task, Bangla language, Bangla text analysis

1. Introduction

Bengali is the official and national language of Bangladesh, spoken by about 98% of the population [27]. It is also recognized as a state language in India, specifically in West Bengal, Tripura, and parts of Assam. With over 230 million speakers, Bengali is the seventh most widely spoken language in the world. Notably, it is the only language whose movement led to the foundation of an independent state. To honor the language martyrs [38] who fought for its recognition, UNESCO declared February 21st as International Mother Language Day [26]. During the COVID-19 pandemic, around 38 million students in Bangladesh whose medium of instruction (MOI) was Bangla struggled with their education due to the sudden shift from offline to online learning, exposing significant gaps in online education infrastructure [28]. This situation underscores the urgent need for advanced and automated educational systems in the Bengali language. Despite its widespread use, Bengali remains underexplored in Natural Language Processing (NLP) research, and the scarcity of appropriate datasets makes research in this area challenging. This highlights the importance of developing technologies for the Bengali education system, particularly for tasks like equation recognition.

Equation recognition, a supervised classification task, is the focus of our research. In this task, the model is trained to categorize input observations, which can be either images of handwritten mathematical expressions or text sequences. Our emphasis is on text-based equation recognition: for a given input sequence, the trained model attempts to find the equation that can be used to solve the math problem described. Such a solution can contribute significantly to mathematical question-making and solving for academicians. In the modern technological era, many automated education systems are being developed, and work like this can be instrumental in implementing automated education systems in mathematics as well as in other areas. Addressing research problems like Bangla Equation Recognition can benefit modern education systems serving students in Bangladesh and other regions where the MOI is Bengali, by utilizing the latest NLP technologies like BERT, and specifically mBERT, for the Bangla language.

Since its introduction, BERT has become exceedingly popular in NLP. This pre-trained language model is transformer-based, using a self-attention mechanism to perform classification and prediction tasks. BERT has outperformed nearly all earlier NLP tools and is widely used for text classification, with many customized architectures proposed for better performance. This pre-trained model has delivered high accuracy in various NLP fields, including question answering (e.g., [24]), sentiment analysis (e.g., [21], [40]), and entity extraction and recognition (e.g., [7], [3], [32], [2]). With the advent of mBERT, BERT evolved to accommodate languages other than English. mBERT is a variant of BERT that handles multiple languages, including Greek, Danish, Turkish, and more. It has demonstrated excellent performance in multilingual text categorization [5], [20], [17], offensive language detection [13], translation quality estimation [12], [15], context-based QA [8], [4], etc.

1.1. Research objectives

Motivated by the above background on mathematical Equation Recognition (ER) research, we propose a methodology using the latest NLP technologies that will contribute to developing automated systems to recognize Bangla mathematical equations from real-world math problems. We selected BERT, the best-performing transformer model, based on our preliminary experiments and observations, and we justify the performance of our model with popular measures. As far as we know, this is the first text-analysis research to recognize Bangla math equations. Our literature review did not reveal any existing work or dataset for recognizing equations from a given text. Therefore, we have created a dataset of real-world math problems annotated with their corresponding equations. Moreover, the dataset reflects this novel idea and may benefit NLP researchers in enhancing modern automated education systems. The main objectives of this research are summarized below:

  • Proposing a framework for Bangla mathematical expression recognition from math problems using the latest transformer-based models, such as BERT, ELECTRA, XLNet, RoBERTa, and DistilBERT.

  • Bringing forward a novel dataset of Bangla real-world math problems and equations that will contribute to the prediction of Bangla mathematical equations.

  • Employing effective measures such as the Confusion Matrix, Evaluation Metrics (Accuracy, Precision, Sensitivity/Recall, Specificity, Macro F1 Score, Micro F1 Score), and loss to justify the performance of the proposed framework and report the observations obtained from experiments on our dataset.

Table 1 contains the full form of different abbreviations mostly used in this paper for a better understanding of this research.

Table 1.

Abbreviation Table.

Abbreviation Full Form
BERT Bidirectional Encoder Representations from Transformers
ELECTRA Efficiently Learning an Encoder that Classifies Token Replacements Accurately
XLNet Transformer-XL
RoBERTa Robustly Optimized BERT Approach
DistilBERT Distilled BERT
LSTM Long Short-Term Memory
CNN Convolutional Neural Network
NLP Natural Language Processing
mBERT Multilingual BERT
AdamW Adaptive Moment Estimation with Weight Decay
LM Language Model
NSP Next Sentence Prediction
MT Machine Translation
GPU Graphics Processing Unit
TPU Tensor Processing Unit
TP True Positive
TN True Negative
FP False Positive
FN False Negative

2. Related work

Numerous researchers have adopted BERT to address various problems due to its excellent performance. Using pre-trained architectures like BERT, the research in [21] explored contextual modeling of embeddings; by simply adding a linear layer for classification on top of the BERT architecture, the authors were able to boost performance. In [24], the authors used BERT in another context-aware study and demonstrated that their approach can outperform existing CNN- and BiLSTM-based algorithms, with an average improvement of 1.65%. The authors of [34] conducted another BERT-based study on a Twitter dataset, pre-training the BERT model to improve the performance of Latvian sentiment analysis on tweets. Another prominent use of BERT is fake news detection. The authors of [29] presented a combined technique that couples BERT with an LSTM, feeding the output of the BERT layer into an LSTM layer. The design increases accuracy by 2.50% on the PolitiFact dataset and 1.10% on the GossipCop dataset.

The use of deep learning in various mathematical problems has produced accurate results. The authors of [39] developed several deep-learning strategies for extracting, analyzing, and constructing mathematical expressions, and created a public dataset with a variety of math-based materials. In [3], the authors introduced an ensemble architecture for recognizing mathematical entities from math statements, achieving the highest accuracy of 99.76% after ensembling the models. Suleiman et al. [33] performed a study on abstractive summarization of mathematical texts using LSTM and attention-based recurrent neural networks, reporting ROUGE-L scores of 39.9 and 43.85 and a ROUGE-1 score of 20.34. The authors of [37] used a novel approach called PDF2LaTeX: a CNN-LSTM-based architecture that extracts text from a PDF image using OCR and converts the mathematical statements into LaTeX. The authors of [35] suggested a CNN-RNN-based technique for extracting mathematical definitions from text.

The recognition of mathematical expressions is one of the most active current problems in this area. The authors of [16] created a machine-learning-based handwritten equation recognition system. The main objective of that work is to build a software application that uses CNNs and image processing to accurately identify handwritten equations, sentences, and paragraphs made of handwritten numbers, characters, words, and mathematical symbols. Many studies have used different technologies. The authors of [41] suggested an encoder-decoder network, the Syntax-Aware Network (SAN), to identify handwritten mathematical expressions, achieving an ExpRate of 81.0. The work in [10] contributes another element in this domain: a bi-directional mutual learning network with attention aggregation. On the CROHME 2014 dataset, they attained a maximum accuracy of 56.85%. The authors of [18] used SVM and CNN to recognize and classify mathematical expressions on the HASYv2 dataset, obtaining accuracies of 62.3% and 76.21% for the SVM and CNN, respectively. Another architecture built on Convolutional Neural Networks (CNN) was developed in [30], which improved the accuracy on the same dataset to 76.71%. Shinde et al. [31] developed a CNN-based equation solver on the MNIST dataset, where the CNN showed an accuracy of 85% for complicated equations. A combined CNN-SVC-based model provided 89.76% accuracy for operator classification and 91.48% for predicting the numbers [19]. Another study [23] offers a different combination of supervised and contrastive learning, improving results on the public CROHME dataset by 3.4%.

To academicians, real-life math problems mostly appear as text. The literature review shows that substantial research has been devoted to analyzing images to recognize mathematical equations, whereas text analysis for equation prediction has yet to be explored.

3. Proposed methodology

Natural language processing (NLP) has witnessed remarkable advances with the development of transformer models, which are neural networks that rely on self-attention mechanisms to encode and decode natural language. Transformer models can capture long-range dependencies and complex semantic relationships in text and can be pre-trained on large-scale corpora to learn general language representations. One of the most influential transformer models is BERT, an encoder-only architecture that can be fine-tuned for various downstream NLP tasks, such as text classification, named entity recognition, and question answering. BERT uses a masked language modeling objective to learn bidirectional context from both left and right tokens and a next-sentence prediction objective to learn sentence-level coherence. BERT has been extended to multilingual settings, such as mBERT, a single model that can handle 104 languages and be used for cross-lingual transfer learning. In this section, we discuss the details of data collection, data preprocessing, and hyperparameter tuning, as well as transformer-based learning, BERT, and mBERT, and how they apply to our research problem.

3.1. Data collection

Equation recognition requires an appropriate dataset, and the lack of one is a reason this research topic has received little attention. Identifying equations in handwritten images has been the subject of much research, whereas recognizing them from text has been given far less weight. To conduct worthwhile research, we require a sufficient quantity of data for training and assessing the method. We have therefore compiled a collection of real Bangla math problems and labeled each with the equation used to solve it. Table 2 provides some examples from our data collection. In addition to the original data, we provide English translations of the sample data; all math problems were translated by hand. The translation highlights the differences between the Bangla and English texts.

Table 2.

Each observation comprises Bangla text, an equation from the dataset sample, and its English translation.


In total, this dataset contains 3,433 observations, each consisting of a distinct mathematical problem and its corresponding equation. We have worked with six different equations and kept an additional class named 'others.' These six equations are z=x+y, z=x−y, z=x×y, z=x/y, y=x!, and y=√x. The largest math text in this dataset has 207 characters and 45 words. The mathematical problems in our dataset are not trivial: on average, they consist of 55 characters and 14 words, indicating their complexity and depth. This complexity underscores the need for a robust dataset and the challenges we face in equation recognition tasks.
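A minimal sketch of loading and inspecting the dataset is given below, assuming it is exported as a CSV file; the file name and the column names "text" (the Bangla math problem) and "equation" (the target label) are hypothetical and may differ from the published release.

```python
import pandas as pd

# Hypothetical export of the dataset; adjust the path and column names as needed.
df = pd.read_csv("shomikoron.csv")

print(len(df))                                 # expected: 3433 observations
print(df["equation"].value_counts())           # distribution over the 7 classes
print(df["text"].str.len().describe())         # character lengths (mean ~55, max 207)
print(df["text"].str.split().str.len().max())  # longest problem in words (~45)
```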

3.2. Data preprocessing

Raw data often includes unnecessary characters and words, which may make categorization difficult. We therefore perform several preprocessing procedures before feeding the data into the classifier to ensure accurate categorization [1]. The following steps are applied to obtain the best results:

  • In addition to words, the raw data includes several special characters (such as $, %, #, *, and -) that can degrade accuracy. We therefore eliminate these characters from our corpus.

  • Additionally, the data contains many Bangla stop words that do not help with the prediction task and can hinder precision. Removing these stop words improved our accuracy.

  • Words in Bangla may have many distinct forms. For instance, the word Image 2 can also be written as Image 3, etc. Therefore, we normalize the corpus using stemming and lemmatization to reduce each word to its root form.

An example of raw and preprocessed data is shown below; the preprocessed data yielded better performance than the raw data.

  • Raw Data

    Image 4

  • Preprocessed Data

    Image 5
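A minimal sketch of these preprocessing steps is shown below, assuming plain Python; the stop-word list and the suffix-stripping rule are small illustrative stand-ins for the full Bangla stop-word list and the stemming/lemmatization used in the paper.

```python
import re

BANGLA_STOPWORDS = {"এবং", "ও", "যে", "এই", "করে"}   # illustrative subset only
SUFFIXES = ("গুলো", "টি", "টা", "দের", "রা")          # illustrative inflectional suffixes

def preprocess(text: str) -> str:
    # 1. Remove special characters such as $, %, #, * and -
    text = re.sub(r"[$%#*\-।,;:!?()]", " ", text)
    tokens = text.split()
    # 2. Drop Bangla stop words
    tokens = [t for t in tokens if t not in BANGLA_STOPWORDS]
    # 3. Naive suffix stripping as a stand-in for stemming/lemmatization
    stemmed = []
    for t in tokens:
        for s in SUFFIXES:
            if t.endswith(s) and len(t) > len(s) + 1:
                t = t[: -len(s)]
                break
        stemmed.append(t)
    return " ".join(stemmed)
```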

3.3. Hyperparameter tuning

Finetuning BERT requires choosing appropriate hyperparameters that control the learning process and affect the model's performance. Some of the most essential hyperparameters for BERT are as follows:

  • Learning rate measures how much the model weights are updated in each iteration. A too-high learning rate can cause the model to diverge or forget the pre-trained knowledge, while a too-low learning rate can cause the model to converge too slowly or get stuck in a local minimum. The optimal learning rate depends on the task, the dataset, and the model size.

  • Batch size specifies how many examples are processed in each iteration. A larger batch size can reduce the gradient estimates' variance and improve the training's stability. However, it can also increase the memory consumption and the risk of over-fitting. A smaller batch size can increase the diversity of the training and the generalization ability of the model, but it can also slow down the convergence and require more iterations. The optimal batch size depends on the task, the dataset, and the available resources.

  • Number of epochs determines how many times the model goes through the entire training dataset.

  • max-len is the maximum length of the input sequence that BERT can handle. It is also known as max-position-embeddings in the BERT configuration. By default, it is set to 512, which means that BERT can process up to 512 tokens (or subwords) in a single input.

Adjusting these hyperparameters can lead to substantial changes in model performance by influencing its capacity, convergence behavior, generalization ability, and sensitivity to data patterns. Careful experimentation and tuning are therefore essential for optimizing accuracy on a specific task. Accordingly, we finetuned our model with several combinations of these hyperparameters, and Table 3 lists the best values for our classifier. The most influential hyperparameters for a transformer-based model are the learning rate, batch_size, max_seq_length, and number of epochs. For our classifier, the best values are learning rate = 2e-5, batch_size = 24, max_seq_length = 50, and epoch = 40. The text sequences used in this study are brief, since we consider relatively simple math problems; therefore, the acceptable range for max_seq_length is between 50 and 60, with 50 giving the best performance. By default, the model is optimized with AdamW [25].

Table 3.

This table lists the model parameters and their corresponding values.

Hyperparameters BERT
learning rate (AdamW) 2e-05
max_len 50
Batch_size 24
verbose 1
epoch 40
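A minimal sketch of this configuration is shown below, assuming the Hugging Face transformers library with TensorFlow/Keras and the multilingual BERT tokenizer; the checkpoint name and the weight-decay value are assumptions, while the remaining values follow Tables 3 and 4.

```python
import tensorflow as tf
from transformers import BertTokenizerFast

MAX_LEN = 50          # max_seq_length / max-len
BATCH_SIZE = 24
EPOCHS = 40
LEARNING_RATE = 2e-5  # best value in the grid search of Table 4

# mBERT tokenizer (checkpoint name assumed)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")

def encode(texts):
    """Tokenize problem texts into fixed-length input ids, attention masks, and type ids."""
    enc = tokenizer(texts, padding="max_length", truncation=True,
                    max_length=MAX_LEN, return_tensors="np")
    return enc["input_ids"], enc["attention_mask"], enc["token_type_ids"]

# AdamW (Adam with decoupled weight decay); available as tf.keras.optimizers.AdamW
# in TensorFlow 2.11+, otherwise via TensorFlow Addons.
optimizer = tf.keras.optimizers.AdamW(learning_rate=LEARNING_RATE, weight_decay=0.01)
```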

3.4. Transformer-based learning

Transformer-based learning [36] is a striking advancement in the widely used and practical area of NLP within artificial intelligence. This architecture uses a potent mix of an attention mechanism, an encoder, and a decoder to handle sequential inputs efficiently. The encoder maps the input sequence (x1, ..., xn) to a continuous representation z = (z1, ..., zn); given z, the decoder then generates the output sequence (y1, ..., ym) one element at a time, auto-regressively. The Transformer's architecture is shown in Fig. 1.

Figure 1.


The architecture of the Transformer model (the left and right halves of this figure illustrate the encoder and decoder, respectively, both built from stacked self-attention and point-wise, fully connected layers).

Transformers need the following two components:

Encoder and Decoder Stacks: Each stack has N = 6 layers, each containing two sublayers: a multi-head self-attention mechanism and a position-wise, fully connected feed-forward network. The output of each sublayer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sublayer and all outputs have dimension d_model = 512.

Multi-Head Attention: The Transformer uses multi-head self-attention in three ways. In the encoder-decoder attention layers, the queries come from the previous decoder layer, while the memory keys and values come from the encoder's output. In the encoder's self-attention layers, queries, keys, and values all come from the output of the previous encoder layer. In the decoder's self-attention, illegal connections are masked out of the softmax input to preserve the auto-regressive property. Scaled dot-product attention is computed over queries and keys of dimension d_k and values of dimension d_v, with the queries, keys, and values packed into matrices Q, K, and V, as given in Equations (1) and (2).

$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$ (1)
$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)W^{O},\quad \text{where } \mathrm{head}_i=\mathrm{Attention}(QW_i^{Q},KW_i^{K},VW_i^{V})$ (2)

where $W_i^{Q}\in\mathbb{R}^{d_{\mathrm{model}}\times d_k}$, $W_i^{K}\in\mathbb{R}^{d_{\mathrm{model}}\times d_k}$, $W_i^{V}\in\mathbb{R}^{d_{\mathrm{model}}\times d_v}$, and $W^{O}\in\mathbb{R}^{hd_v\times d_{\mathrm{model}}}$.
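To make Equation (1) concrete, the following NumPy sketch computes scaled dot-product attention; multi-head attention (Equation (2)) applies it h times on learned linear projections and concatenates the results. The shapes in the toy example are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output of shape (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

# Toy example: 4 query positions, 6 key/value positions, d_k = 64, d_v = 32
Q, K, V = np.random.randn(4, 64), np.random.randn(6, 64), np.random.randn(6, 32)
print(scaled_dot_product_attention(Q, K, V).shape)           # (4, 32)
```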

3.5. Bidirectional encoder representations from transformers (BERT)

BERT is a multilayer bidirectional Transformer encoder [14]. Its input is a token sequence that can unambiguously represent either a single sentence or a pair of sentences. The two stages that make up BERT are as follows:

  • Pre-training BERT: BERT is trained on two unsupervised tasks, Masked LM (Language Model) and NSP (Next Sentence Prediction). Masked LM masks a collection of random tokens and predicts them, yielding a pre-trained bidirectional model. NSP predicts whether the second sentence in a pair actually follows the first, which helps the model comprehend the relationship between two input sentences. BERT was pre-trained on text passages from the English Wikipedia (excluding lists, headings, and tables) and BooksCorpus (800 million words) [42].

  • Fine-tuning BERT: BERT can be adapted to several downstream tasks by choosing the appropriate inputs for single sentences or sentence pairs. Fine-tuning starts from the pre-trained parameters, all of which may then be updated using labeled data for the downstream task.

Fig. 2 depicts the BERT architecture with its two stages, pre-training and fine-tuning, and how the input corpora are passed to the classifier.

Figure 2.


BERT architecture (The two processes of the BERT architecture are pretraining and fine-tuning.).

As shown in Fig. 2, for math equation recognition the classifier receives the Bangla text as a single packed sequence. For fine-tuning, BERT introduces a start vector $S\in\mathbb{R}^{H}$ and an end vector $E\in\mathbb{R}^{H}$. The probability that word $i$ is the start of the answer span is computed as the dot product between $S$ and the token representation $T_i$, followed by a softmax over all tokens in the paragraph:

$P_i=\frac{e^{S\cdot T_i}}{\sum_j e^{S\cdot T_j}}$ (3)

The analogous formula with $E$ is used for the end of the answer span, and $S\cdot T_i+E\cdot T_j$ gives the score of the candidate span from position $i$ to position $j$. The maximum-scoring span with $j\geq i$ is taken as the prediction, and the training objective is the sum of the log-likelihoods of the correct start and end positions.

3.6. Multilingual BERT (mBERT)

mBERT [22] is a BERT architecture pre-trained on 104 languages, including Bangla. To probe mBERT, the researchers sampled 10,000 sentences of at least 20 characters from the Wikipedia of each language, allocating 5,000 sentences for testing and 5,000 for validation. The model demonstrates the ability to distinguish between components that are language-neutral and those that are language-specific. Word alignment, language identification, language similarity, parallel sentence retrieval, and machine translation are just a few of the probing tasks that researchers investigated with the help of mBERT.

One of the main advantages of BERT over other transformer-based models is its bidirectionality, which means that it can process the input text from both left to right and right to left, capturing the context on both sides of a word. Another advantage is that BERT suits the Bangla language well, because it can capture the rich morphology, syntax, and semantics of Bangla, which are often challenging for other models. Pretraining BERT on a large and diverse corpus of Bangla text, such as the Bangla2B+ dataset [9], makes these properties even easier to capture.

3.7. Proposed framework

Having explained the preliminary concepts, we now describe the proposed framework. The methodology can be divided into three parts: data preprocessing, model training, and model evaluation. Initially, we extracted the math problems and equations from the dataset and preprocessed the data. Next, we fed the preprocessed text to the BERT tokenizer, which produces an attention mask and a tokenized input sequence.

The tokenized input sequence and attention mask are then passed to the proposed BERT model. This model has three input layers followed by a BERT layer; the output of the BERT layer is fed to a dense layer, and the softmax function is used for class prediction.

After training the model, we determine the confusion matrix and evaluation metrics to understand the model's performance, using Accuracy, Precision, Sensitivity/Recall, F1 Score, Macro F1 Score, and Micro F1 Score. Finally, the model is ready to recognize the relevant equation for a given unseen mathematical problem. Fig. 3 represents the system architecture of our methodology.

Figure 3.


System Architecture of the proposed framework.
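Below is a minimal sketch of the classification model described above (three input layers, a BERT layer, and a dense softmax head), assuming TensorFlow/Keras and the Hugging Face transformers library; the checkpoint name and layer arrangement are assumptions consistent with the description, not the exact published implementation.

```python
import tensorflow as tf
from transformers import TFBertModel

MAX_LEN = 50
NUM_CLASSES = 7   # six equations plus the 'others' class

def build_classifier():
    # Three input layers: token ids, attention mask, and token type ids
    input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")
    token_type_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="token_type_ids")

    # Multilingual BERT encoder (checkpoint name assumed)
    bert = TFBertModel.from_pretrained("bert-base-multilingual-cased")
    bert_out = bert(input_ids=input_ids,
                    attention_mask=attention_mask,
                    token_type_ids=token_type_ids)
    pooled = bert_out.pooler_output          # [CLS] representation

    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(pooled)
    return tf.keras.Model(inputs=[input_ids, attention_mask, token_type_ids],
                          outputs=outputs)

model = build_classifier()
model.compile(optimizer=tf.keras.optimizers.AdamW(learning_rate=2e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Training would then follow Table 3, e.g.:
# model.fit(train_inputs, train_labels, validation_data=(test_inputs, test_labels),
#           batch_size=24, epochs=40, verbose=1)
```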

4. Experimental evaluation

Deep learning models need high-performance computing configurations to process data in parallel effectively. Our study used Google Colab, whose performance as a tool for accelerating deep learning applications is described in [11]. Google Colab is a cloud-based Jupyter Notebook platform equipped with the necessary GPU and TPU resources: it provides 12 GB of GPU RAM on an NVIDIA Tesla K80 GPU and runs on the Ubuntu operating system. It also includes a complete set of pre-configured modules and packages created explicitly for deep learning applications, along with the Python runtime.

4.1. Proposed model's performance

This part of the paper presents the performance of the proposed model. Here, we discuss the different observations obtained from our proposed model and dataset.

For any deep learning model, hyperparameters are crucial, as they can strongly affect performance, so we tuned them carefully. To find the best hyperparameters for our proposed model, we trained the model with different hyperparameter values and observed which combination provides the highest performance. Table 4 reports the accuracy of the proposed model for different hyperparameters.

Table 4.

Accuracy of the model over different values of the hyperparameter.


Here, we have trained the model with five different learning rates (1e-5, 2e-5, 3e-5, 4e-5, and 5e-5), three different batch sizes (12, 24, and 32), and two different input sequence lengths (50 and 60). The model showed the best performance with a learning rate of 2e-5, a batch size of 24, and a maximum input sequence length of 50. This combination is colored red in Table 4.

After fixing the hyperparameters for the proposed model, we trained the model over 40 epochs. We trace the accuracy and loss of our model over all the epochs to understand how the model learns. In Fig. 4, we plot both the training and testing accuracy over 40 epochs, where the maximum training accuracy of the model is 99.96% and the maximum testing accuracy is 99.80%. In this figure, the red curve represents training accuracy and the blue curve represents testing accuracy. Besides accuracy, we also tracked the loss. Fig. 5 shows the training and testing loss over 40 epochs: the minimum training loss is 0.0013 and the minimum testing loss is 0.0463, with the red and blue curves representing the training and testing loss, respectively.

Figure 4.


The training and testing accuracy of the proposed model.

Figure 5.


The training and testing loss of the proposed model.

The accuracy and loss values indicate that our proposed model's performance is acceptable. The low training and testing losses indicate that the misclassification rate of our classifier is low, and the high accuracy values mean that our model correctly predicted most of the testing observations.

For further justification of our model, we also compute the confusion matrix. The heatmap in Fig. 6 shows the confusion matrix of our classifier, representing the model's correct and incorrect classifications on the test data.

Figure 6.


Heatmap representing confusion matrix.

Using the confusion matrix, we determine the True Positive(TP), True Negative(TN), False Positive(FP), and False Negative(FN). After that, we determine some more Evaluation Metrics using the following equations (4), (5), (6), (7), (8), (9), and (10).

$\mathrm{Precision}=\frac{TP}{TP+FP}\times 100\%$ (4)
$\mathrm{Sensitivity/Recall}=\frac{TP}{TP+FN}\times 100\%$ (5)
$\mathrm{Specificity}=\frac{TN}{TN+FP}$ (6)
$\mathrm{F1\ Score}=\frac{2\cdot\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$ (7)
$\mathrm{Macro\ F1\ Score}=\frac{\sum_{i=1}^{N_{\mathrm{classes}}}\mathrm{F1\ Score}_i}{N_{\mathrm{classes}}}$ (8)
$\mathrm{Micro\ F1\ Score}=\frac{TP}{TP+\frac{1}{2}(FN+FP)}$ (9)
$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$ (10)

The values of these evaluation metrics are reported in Table 5. These evaluation metrics, including accuracy, precision, recall (sensitivity), F1 score, and the confusion matrix, are reported together to assess the classification model's performance comprehensively. Accuracy gives a general overview of correct predictions, while precision focuses on the proportion of correctly predicted positive instances among all predicted positives. Recall emphasizes the ratio of correctly predicted positive instances among all actual positives. The F1 score balances precision and recall, offering a single metric for overall performance. Finally, the confusion matrix provides a detailed breakdown of predictions, aiding in error analysis and model improvement. Together, these metrics enable practitioners to make informed decisions about model selection and refinement based on specific task requirements and constraints.

Table 5.

Values of different Evaluation Metrics for the proposed model and dataset.

Metric                    z=x−y    z=x/y    y=x!    z=x+y    y=√x    z=x×y    Others
True Positive             104      93       88      107      165     111      13
True Negative             578      592      598     575      521     574      673
False Positive            3        0        0       1        0       1        0
False Negative            1        1        0       3        0       0        0
Precision (%)             97.20    100      100     99.08    100     99.11    100
Sensitivity/Recall (%)    99.04    98.93    100     97.27    100     100      100
Specificity (%)           99.48    100      100     99.83    100     99.83    100
F1 Score (%)              98.11    99.47    100     100      98.17   100      99.55
Macro Average F1 Score (%): 99.33
Micro Average F1 Score (%): 99.80
Accuracy (%): 99.27
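As an illustration of how these metrics can be computed from the test-set predictions, the sketch below assumes scikit-learn and uses small placeholder label arrays; the actual values in Table 5 come from our full test split.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_recall_fscore_support)

y_true = [0, 1, 2, 3, 4, 5, 6, 0, 1, 2]   # placeholder ground-truth class indices
y_pred = [0, 1, 2, 3, 4, 5, 6, 0, 1, 3]   # placeholder model predictions

print(confusion_matrix(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))
print("per-class P/R/F1:", precision_recall_fscore_support(y_true, y_pred, zero_division=0))
print("macro F1 :", f1_score(y_true, y_pred, average="macro", zero_division=0))
print("micro F1 :", f1_score(y_true, y_pred, average="micro", zero_division=0))
```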

4.2. Comparison with other transformer models

After observing the performance of our proposed model, we compare the model with other transformer-based architectures. We have chosen ELECTRA, XLNet, RoBERTa, and DistilBERT. We determine both Accuracy and Macro F1 Score for all of these transformer-based architectures. Fig. 7 represents the accuracy and F1 Score of the models we have mentioned.

Figure 7.


Comparison of different transformer-based models on our dataset.

Among them, our proposed model notably outperformed the others with an accuracy of 99.80% and an F1 Score of 99.33%. ELECTRA shows the second-best performance on our mathematical equation dataset, with 90.06% accuracy and an F1 Score of 89.81%. The accuracy and F1 Score for XLNet are 87.78% and 85.76%, respectively. We also trained RoBERTa and DistilBERT: RoBERTa showed 83.05% accuracy and an 82.20% F1 Score, while DistilBERT showed 83.76% accuracy and an 81.33% F1 Score.

To justify the model's generalizability and increase its usability for the broader audience, we have utilized the English translation of this dataset. Moreover, we compare the results with those of other existing transformers. Fig. 8 shows the comparison of different transformer models. Here, the proposed BERT classifier again outperformed the other transformer models with 99.27% accuracy and 98.87% Macro Average F1 Score.

Figure 8.


Comparison of different transformer-based models on English dataset.

When we pass unseen input text to our trained model, it predicts the correct equations. For example, for the unseen Bangla text shown in Image 7 (English translation: 'Three people were standing and two more came to the bus stand, and now there are five people in total.'), the model predicted the equation z=x+y. This is a correct prediction, and the model likewise predicted the right equation for other mathematical statements.
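A minimal inference sketch along these lines is given below, reusing the tokenizer, preprocess function, and trained model objects from the earlier sketches; the label list and its ordering are hypothetical and must match the label encoding used during training.

```python
import numpy as np

# Hypothetical label ordering; must match the encoding used for the training labels.
LABELS = ["z=x+y", "z=x-y", "z=x*y", "z=x/y", "y=x!", "y=sqrt(x)", "others"]

def predict_equation(text: str) -> str:
    enc = tokenizer([preprocess(text)], padding="max_length", truncation=True,
                    max_length=50, return_tensors="np")
    probs = model.predict({"input_ids": enc["input_ids"],
                           "attention_mask": enc["attention_mask"],
                           "token_type_ids": enc["token_type_ids"]})
    return LABELS[int(np.argmax(probs, axis=-1)[0])]

# Passing the bus-stand problem above should yield "z=x+y".
```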

5. Discussion

Mathematical equation identification is a fundamental component that has the potential to transform how mathematical content is handled. Our approach of using deep learning to create algorithms that can recognize equations independently is a significant advancement, considerably lowering the time and effort necessary for mathematical formula entry. The novelty of our work lies in the creation of a new dataset for Bangla Equation Recognition, a resource that was previously unavailable. We prepared this dataset by collecting real-world math problems and annotating them with their corresponding equations.

The lack of a dataset to train and evaluate the model was a significant barrier to this research. Image datasets for equation recognition, especially handwritten-image datasets, are available in different languages; in practice, however, math problems are written as text, and it is text that needs to be analyzed. Initially, we did not find any suitable dataset for our research, so we created a novel dataset for recognizing mathematical equations in Bangla. We gathered real mathematical texts and used them to identify various mathematical equations. We deal with seven classes in all: z=x+y, z=x−y, z=x×y, z=x/y, y=x!, y=√x, and 'others.' We think equation recognition research will benefit from this dataset, and it can also support Bangla Natural Language Processing (NLP) research. We meticulously evaluated the performance of our proposed model, reporting the training and testing accuracy curves, the training and testing loss curves, and the model's accuracy, loss, Macro Average F1 Score, and Micro Average F1 Score. We also presented the proposed classifier's confusion matrix. This comprehensive evaluation, not only in the Bangla language but also on the translated English dataset, provides a robust understanding of our model's capabilities.

This study addresses the problem using recent deep transformer models, namely BERT, ELECTRA, XLNet, RoBERTa, and DistilBERT. Among them, BERT performed best, owing to its multilingual support through the mBERT variant. We computed the chosen metrics for each architecture to compare it with BERT, and BERT outperformed the other transformer models, XLNet, ELECTRA, RoBERTa, and DistilBERT, with a training accuracy of 99.96% and a validation accuracy of 99.80%. The performance of ELECTRA in this study is also noteworthy. The model's training and testing losses were reduced to very low values. In Section 4.2, we compare the performance of our model with that of the other methods and show that it performs better. mBERT has already demonstrated impressive performance in natural language processing for many languages. One major reason for mBERT's remarkable performance on our dataset is that the model is pre-trained on the Bangla language, the same language in which our dataset is developed; for this reason, we chose mBERT as our proposed classifier. The mBERT model is a variation of the BERT model with two steps, pretraining and finetuning. Through the pretraining tasks, MLM and NSP, BERT learns the language so that it can perform well during finetuning. One challenge of this research was the lack of an existing dataset. To justify generalizability, we therefore trained the model on both the Bangla dataset and its English (Google Translate) counterpart, and in both cases the model provided outstanding performance. Completing the 40 epochs required 1 hour, 9 minutes, and 23 seconds on a single NVIDIA Tesla K80 GPU. Increasing the computational power may reduce the training time and enable training with larger batch sizes.

The proposed model's high accuracy may raise concerns about overfitting. To examine the experimental behavior of our proposed model in detail, we have therefore computed various measures. In the confusion matrix, the counts of True Positives and True Negatives are much higher than those of False Positives and False Negatives; the heatmap in Fig. 6 shows the confusion matrix of our proposed classifier. This means that the model correctly recognizes most of the equations. We achieved not only high accuracy but also satisfactory precision, recall, and F1 scores, as reported in Table 5.

We believe our proposed classifier and dataset can contribute significantly to the advancement of real-world educational systems. After deployment, academicians can use this trained model to create and solve mathematical equations. It can be integrated into educational software to help students learn and solve mathematical problems and to assist teachers in grading. The potential impact of our work on the educational landscape is inspiring. Moreover, mathematics underpins much scientific research, so this work can help resolve math problems faster: researchers can extract and analyze equations from scientific papers, enabling easier information retrieval and cross-referencing of mathematical models, and engineers can extract equations from design documents or technical manuals to use in simulations, calculations, or optimizations. Proper utilization of the dataset can also facilitate conversion to LaTeX, a popular typesetting system for academic and technical documents.

Despite covering an important NLP research problem, this work has some limitations. The language-specific nature of the endeavor is a significant limitation, which we try to mitigate by also using translated data. In our research, we used 3,433 observations across seven different classes, which proved sufficient to fine-tune the transformer models studied here. However, in the future, we want to increase the number of observations so that this dataset can be used practically for larger projects. Another major limitation of our work is that we have considered only basic equations, whereas mathematics contains far more complex equations that are equally important for developing realistic systems. In future work, we intend to cover more math equations.

6. Conclusion and future work

The primary goal of this work is to develop an effective and automated method for recognizing equations from Bangla text, which can play a forward-thinking role in the Bangla educational system. We have created a model that can extract equations from Bangla text. The model is built on BERT, a recent transformer-based deep learning model that relies on the self-attention mechanism and language-model pre-training. We applied the proposed methodology to a real-world benchmark dataset that is new to the Bangla language, and this dataset may provide a valuable resource for Bangla NLP.

This work has some possible uses and consequences, such as -

  • It can be used as an educational tool for students and teachers who want to learn and teach mathematics in Bangla. It can help them check their solutions, understand the steps, and practice different types of equations.

  • It can be used as a research tool for scientists and engineers who work with mathematical models and equations in Bangla. It can help them solve complex problems, verify their results, and explore new possibilities.

  • It can be used as a cultural tool for preserving and promoting the Bangla language and script. It can help showcase the beauty and diversity of Bangla mathematics and encourage more people to use and appreciate it.

  • It can have some positive consequences, such as:
    • Improving the quality and accessibility of mathematics education and research in Bangla. Enhancing the computational and cognitive skills of the users.
    • Increasing the awareness and recognition of Bangla mathematics and culture.
  • It can also have some negative consequences, such as:
    • Reducing human involvement and creativity in solving mathematical problems.
    • Creating dependency and over-reliance on the system.
    • Exposing the system to errors and vulnerabilities.

In order to build a more effective real-world system for equation recognition from Bangla text, we plan to deploy this work as an embedded system in the future. We also wish to apply this methodology to other recognition tasks. Additionally, we intend to increase the size of our dataset so that we may equip Bangla Equation Recognition with a broadly applicable data source.

CRediT authorship contribution statement

Tanjim Taharat Aurpa: Writing – review & editing, Writing – original draft, Visualization, Supervision, Methodology, Formal analysis, Conceptualization. Kazi Noshin Fariha: Writing – original draft, Visualization, Methodology, Data curation. Kawser Hossain: Writing – original draft, Visualization, Methodology, Data curation. Samiha Maisha Jeba: Writing – original draft, Visualization, Validation, Methodology. Md Shoaib Ahmed: Writing – review & editing, Writing – original draft, Validation, Project administration. Md. Rawnak Saif Adib: Writing – review & editing. Farhana Islam: Writing – review & editing. Farzana Akter: Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


Contributor Information

Tanjim Taharat Aurpa, Email: aurpa0001@bdu.ac.bd.

Kazi Noshin Fariha, Email: noshinfariha4200@gmail.com.

Kawser Hossain, Email: 18103210kawser@gmail.com.

Samiha Maisha Jeba, Email: jebam615@gmail.com.

Md Shoaib Ahmed, Email: shoaibmehrab011@gmail.com.

Md. Rawnak Saif Adib, Email: saifadib.cse@iubat.edu.

Farhana Islam, Email: farhana0001@bdu.ac.bd.

Farzana Akter, Email: farzana0001@bdu.ac.bd.

Data availability

The details of our data can be found at https://github.com/Taharat22/Shomikoron [6].

References

  • 1. Ahmed M.S., Aurpa T.T., Anwar M.M. 2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE; 2020. Online topical clusters detection for top-k trending topics in Twitter; pp. 573–577.
  • 2. Ashrafi I., Mohammad M., Mauree A.S., Nijhum G.M.A., Karim R., Mohammed N., Momen S. Banner: a cost-sensitive contextualized model for bangla named entity recognition. IEEE Access. 2020;8:58206–58226.
  • 3. Aurpa T.T., Ahmed M.S. An ensemble novel architecture for bangla mathematical entity recognition (mer) using transformer based learning. Heliyon. 2024. doi: 10.1016/j.heliyon.2024.e25467.
  • 4. Aurpa T.T., Ahmed M.S., Rifat R.K., Anwar M.M., Ali A.S. Uddipok: a reading comprehension based question answering dataset in bangla language. Data Brief. 2023;47. doi: 10.1016/j.dib.2023.108933.
  • 5. Aurpa T.T., Ahmed M.S., Sadik R., Anwar S., Adnan M.A.M., Anwar M.M. International Conference on Hybrid Intelligent Systems. Springer; 2021. Progressive guidance categorization using transformer-based deep neural network architecture; pp. 344–353.
  • 6. Aurpa T.T., Fariha K.N., Hossain K. Shomikoron: dataset to discover equations from bangla mathematical text. Data Brief. 2024;55. doi: 10.1016/j.dib.2024.110742.
  • 7. Aurpa T.T., Jeba S.M., Ahmed M.S., Ullah M.A., Mehzabin M., Anwar M.M. Bangla_mer: a unique dataset for bangla mathematical entity recognition. Data Brief. 2024;54. doi: 10.1016/j.dib.2024.110407.
  • 8. Aurpa T.T., Rifat R.K., Ahmed M.S., Anwar M.M., Ali A.S. Reading comprehension based question answering system in bangla language with transformer-based learning. Heliyon. 2022;8. doi: 10.1016/j.heliyon.2022.e11052.
  • 9. Bhattacharjee A., Hasan T., Ahmad W.U., Samin K., Islam M.S., Iqbal A., Rahman M.S., Shahriyar R. Banglabert: language model pretraining and benchmarks for low-resource language understanding evaluation in bangla. 2021. arXiv:2101.00204 arXiv preprint.
  • 10. Bian X., Qin B., Xin X., Li J., Su X., Wang Y. Proceedings of the AAAI Conference on Artificial Intelligence. 2022. Handwritten mathematical expression recognition via attention aggregation based bi-directional mutual learning; pp. 113–121.
  • 11. Carneiro T., Da Nóbrega R.V.M., Nepomuceno T., Bian G.B., De Albuquerque V.H.C., Reboucas Filho P.P. Performance analysis of Google colaboratory as a tool for accelerating deep learning applications. IEEE Access. 2018;6:61677–61685.
  • 12. Chowdhury S., Baili N., Vannah B. Ensemble fine-tuned mbert for translation quality estimation. 2021. arXiv:2109.03914 arXiv preprint.
  • 13. Colla D., Caselli T., Basile V., Mitrović J., Granitzer M. Proceedings of the Fourteenth Workshop on Semantic Evaluation. 2020. Grupato at semeval-2020 task 12: retraining mbert on social media and fine-tuned offensive language models; pp. 1546–1554.
  • 14. Devlin J., Chang M.W., Lee K., Toutanova K. Bert: pre-training of deep bidirectional transformers for language understanding. 2018. arXiv:1810.04805 arXiv preprint.
  • 15. Gonen H., Ravfogel S., Elazar Y., Goldberg Y. It's not Greek to mbert: inducing word-level translations from multilingual bert. 2020. arXiv:2010.08275 arXiv preprint.
  • 16. Kishor K., Tyagi R., Bhati R., Rai B.K. Proceedings of International Conference on Recent Trends in Computing: ICRTC 2022. Springer; 2023. Develop model for recognition of handwritten equation using machine learning; pp. 259–265.
  • 17. Krishnan J., Anastasopoulos A., Purohit H., Rangwala H. Cross-lingual text classification of transliterated Hindi and malayalam. 2021. arXiv:2108.13620 arXiv preprint.
  • 18. Kukreja V., Ahuja S., et al. 2021 9th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO). IEEE; 2021. Recognition and classification of mathematical expressions using machine learning and deep learning methods; pp. 1–5.
  • 19. Kukreja V., et al. 2022 International Conference on Decision Aid Sciences and Applications (DASA). IEEE; 2022. A hybrid svc-cnn based classification model for handwritten mathematical expressions (numbers and operators); pp. 321–325.
  • 20. Kulkarni A., Mandhane M., Likhitkar M., Kshirsagar G., Jagdale J., Joshi R. Experimental evaluation of deep learning models for marathi text classification. 2021. arXiv:2101.04899 arXiv preprint.
  • 21. Li X., Bing L., Zhang W., Lam W. Exploiting bert for end-to-end aspect-based sentiment analysis. 2019. arXiv:1910.00883 arXiv preprint.
  • 22. Libovickỳ J., Rosa R., Fraser A. How language-neutral is multilingual bert? 2019. arXiv:1911.03310 arXiv preprint.
  • 23. Lin Q., Huang X., Bi N., Suen C.Y., Tan J. Proceedings of the Asian Conference on Computer Vision. 2022. Cclsl: combination of contrastive learning and supervised learning for handwritten mathematical expression recognition; pp. 3724–3739.
  • 24. Liu A., Huang Z., Lu H., Wang X., Yuan C. China National Conference on Chinese Computational Linguistics. Springer; 2019. Bb-kbqa: Bert-based knowledge base question answering; pp. 81–92.
  • 25. Loshchilov I., Hutter F. Decoupled weight decay regularization. 2017. arXiv:1711.05101 arXiv preprint.
  • 26. Rahman M.M. Linguistic diversity and social justice in (bangla) desh: a socio-historical and language ideological perspective. J. Multiling. Multicult. Dev. 2020;41:289–304.
  • 27. Rahman T. A multilingual language-in-education policy for indigenous minorities in Bangladesh: challenges and possibilities. Curr. Issues Lang. Plann. 2010;11:341–359.
  • 28. Rahman T., Ahmed R. Combatting the impact of COVID-19 school closures in Bangladesh. Technical Report. World Bank Group; 2021. https://blogs.worldbank.org/en/endpovertyinsouthasia/combatting-impact-covid-19-school-closures-bangladesh
  • 29. Rai N., Kumar D., Kaushik N., Raj C., Ali A. Fake news classification using transformer based enhanced lstm and bert. Int. J. Cogn. Comput. Eng. 2022;3:98–105.
  • 30. Sakshi, Sharma C., Kukreja V. Cyber Intelligence and Information Retrieval: Proceedings of CIIR 2021. Springer; 2021. Cnn-based handwritten mathematical symbol recognition model; pp. 407–416.
  • 31. Shinde R., Dherange O., Gavhane R., Koul H., Patil N. Handwritten mathematical equation solver. Int. J. Eng. Appl. Sci. Technol. 2022;6:146–149.
  • 32. Souza F., Nogueira R., Lotufo R. Portuguese named entity recognition using bert-crf. 2019. arXiv:1909.10649 arXiv preprint.
  • 33. Suleiman D., Awajan A. Deep learning based abstractive text summarization: approaches, datasets, evaluation measures, and challenges. Math. Probl. Eng. 2020;2020:1–29.
  • 34. Utka A., et al. Human Language Technologies–the Baltic Perspective: Proceedings of the Ninth International Conference Baltic HLT 2020. IOS Press; 2020. Pretraining and fine-tuning strategies for sentiment analysis of latvian tweets; p. 55.
  • 35. Vanetik N., Litvak M., Shevchuk S., Reznik L. Proceedings of the Twelfth Language Resources and Evaluation Conference. 2020. Automated discovery of mathematical definitions in text; pp. 2086–2094.
  • 36. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser Ł., Polosukhin I. Advances in Neural Information Processing Systems. 2017. Attention is all you need; pp. 5998–6008.
  • 37. Wang Z., Liu J.C. Proceedings of the ACM Symposium on Document Engineering 2020. 2020. Pdf2latex: a deep learning system to convert mathematical documents from pdf to latex; pp. 1–10.
  • 38. Wilce J.M. The Sociology of Language and Religion: Change, Conflict and Accommodation. Springer; 2010. Society, language, history and religion: a perspective on bangla from linguistic anthropology; pp. 126–155.
  • 39. Youssef A., Miller B.R. Deep learning for math knowledge processing. In: Intelligent Computer Mathematics: 11th International Conference, CICM 2018, Hagenberg, Austria, August 13-17, 2018, Proceedings. Springer; 2018. pp. 271–286.
  • 40. Yu J., Jiang J. IJCAI. 2019. Adapting bert for target-oriented multimodal sentiment classification.
  • 41. Yuan Y., Liu X., Dikubab W., Liu H., Ji Z., Wu Z., Bai X. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. Syntax-aware network for handwritten mathematical expression recognition; pp. 4553–4562.
  • 42. Zhu Y., Kiros R., Zemel R., Salakhutdinov R., Urtasun R., Torralba A., Fidler S. Proceedings of the IEEE International Conference on Computer Vision. 2015. Aligning books and movies: towards story-like visual explanations by watching movies and reading books; pp. 19–27.


