Int J Inf Technol. 2023 Apr 27;15(4):2273–2282. doi: 10.1007/s41870-023-01273-z

A two-staged NLP-based framework for assessing the sentiments on Indian supreme court judgments

Isha Gupta 1, Indranath Chatterjee 2, Neha Gupta 1
PMCID: PMC10133901  PMID: 37256028

Abstract

Topic modeling is a powerful technique for uncovering hidden patterns in large documents. It can identify themes that are highly connected and lead to a certain region while accounting for temporal and spatial complexity. In addition, sentiment analysis can determine the sentiments of media articles on various issues. This study proposes a two-stage natural language processing-based model that uses Latent Dirichlet Allocation (LDA) to identify critical topics related to each type of legal case or judgment and the Valence Aware Dictionary and sEntiment Reasoner (VADER) algorithm to assess people's sentiments on those topics. By applying these strategies, this research aims to influence public perception of controversial legal issues. This study is the first of its kind to use topic modeling and sentiment analysis on Indian legal documents and paves the way for a better understanding of legal documents.

Keywords: Sentiment analysis, Topic modeling, Legal documents, Latent Dirichlet allocation, Valence aware dictionary sentiment reasoner, Google news feed

Introduction

The Internet is a vast repository of data that spans various fields, including medical, engineering, technology, e-commerce, legal, historical, and geographical [1]. However, the sheer volume of unstructured data available on the web can make it challenging to extract meaningful insights from it [2]. Text mining is a powerful tool for converting unstructured text into a structured format to identify significant patterns and fresh insights [1, 3].

One useful application of text mining is topic modeling, which can help identify the general subjects covered by a large corpus of data [4]. Topic modeling is an unsupervised technique that discovers hidden patterns in large amounts of data [6, 7]. It clusters groups of words that semantically represent the same topic, which saves the time and effort of reading the corpus manually [5]. While it does not produce a summary of the whole document [7], topic modeling outputs the topics that are dominant in the document. A good topic model produces topics that a human can readily interpret. Topic modeling is mainly of three types [8], with Latent Dirichlet Allocation (LDA) being one of the most effective strategies [9]. LDA can be viewed as a Bayesian form of Latent Semantic Analysis (LSA), in which the distributions are sampled over a probability simplex and the model has a generative procedure [10].

LSA, on the other hand, uses Term Frequency-Inverse Document Frequency (TF-IDF) to evaluate textual documents [11]. Both LSA and LDA are based on the distributional hypothesis. The primary difference is that LDA assumes that the distribution of topics in a document and the distribution of words in each topic are Dirichlet distributions, whereas LSA assumes no distribution, which results in less interpretable vector representations of topics and documents [12].
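The generative procedure that LDA assumes can be illustrated with a short sketch. This is a toy sampler written only with the Python standard library; the vocabulary, the number of topics, and the hyperparameter values `alpha` and `beta` are illustrative assumptions, not values from this study.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(vocab, topic_word, alpha, n_words, rng):
    """Generate one toy document following the LDA generative story."""
    n_topics = len(topic_word)
    # Topic proportions for this document, drawn from Dirichlet(alpha).
    doc_topic = sample_dirichlet([alpha] * n_topics, rng)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position, then a word from that topic.
        topic = rng.choices(range(n_topics), weights=doc_topic)[0]
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return doc_topic, words

rng = random.Random(0)
vocab = ["court", "appeal", "land", "bail", "tax", "award"]  # hypothetical vocabulary
# One word distribution per topic, each drawn from Dirichlet(beta); beta = 0.5 is assumed.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(3)]
doc_topic, words = generate_document(vocab, topic_word, alpha=0.5, n_words=8, rng=rng)
```

Fitting LDA amounts to inverting this process: given only the observed words, the model infers the per-document topic proportions and the per-topic word distributions.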

Probabilistic Latent Semantic Analysis (pLSA) was proposed to address the representation problem in LSA by replacing Singular Value Decomposition (SVD) with a probabilistic model [13]. It models every entry in the TF-IDF matrix as a probability.
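As a point of reference for the TF-IDF weighting mentioned above, the following is a minimal sketch of one common TF-IDF variant (relative term frequency multiplied by the natural log of the inverse document frequency). The three tokenized documents are hypothetical, and other libraries use slightly different smoothing, so the exact weights are an assumption of this sketch.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical tokenized documents, not data from the study.
docs = [
    ["court", "bail", "accused"],
    ["court", "land", "compensation"],
    ["court", "tax", "refund"],
]
weights = tf_idf(docs)
```

In this example "court" occurs in every document, so its weight is zero, while a term that occurs in only one document, such as "bail", receives a positive weight.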

Sentiment analysis (SA) is another important application of text mining that focuses on analyzing people's feelings, sentiments, or attitudes toward something [14, 15]. SA can be applied to topics, events, products, or organizations, and it works at three levels [17]: document-level, sentence-level, and aspect level [18]. Sentiment classification mainly works on three techniques: supervised learning, unsupervised learning, and hybrid technique [19], which is a combination of supervised and unsupervised techniques.

Combining SA and topic modeling can help deduce a document's importance: topic modeling finds abstract topics and related words or patterns in the document, and these features help determine the sentiment more efficiently. In this paper, we propose combining these two techniques to evaluate the sentiments related to the topics found in legal judgments. Legal documents are critical in all sectors, educational or work-related, and are prepared by a legal officer or corporate lawyer so that they can be presented in a trial court. Legal documents are often challenging for non-experts to read because of their technical vocabulary, complex syntax and semantics, and use of unusual meanings, doublets, and triplets. Furthermore, legal documents tend to be lengthy, making it essential to develop an automated tool that can help us understand them better. Topic modeling techniques can play a pivotal role in developing this tool by identifying the most relevant topics covered in legal judgments.

Section 1 provides an introduction to the concept of topic modeling, its significance, and its distinctive features in comparison to summarization techniques. Furthermore, this section delves into the relevance of sentiment analysis (SA) within the context of legal documents. Section 2 delves into the prior research conducted on topic modeling and sentiment analysis, highlighting the pertinent literature on the topic. Section 3 outlines the proposed methodology and its two-stage framework, providing a comprehensive overview of the approach taken. Section 4 elaborates on the experimental details of the proposed model, including the methodology employed in the implementation of the framework. Section 5 presents the results of the experiment and provides a detailed analysis of the outcomes. The discussion will address the implications of the findings and outline avenues for future research.

Related work

Researchers have extensively studied sentiment analysis (SA) and topic modeling in various domains, and their importance in different applications has been demonstrated through numerous studies. In recent literature, a bibliometric analysis of over 3,710 publications from 1971 to 2018 in a particular journal was conducted to identify significant topical elements and to examine publication and reference patterns, among other things. Using word cloud analysis and topic modeling, the authors of this study revealed key trends and topics in the data [16, 17, 20].

Another study investigated the impact of the ongoing pandemic on hydro-meteorological disasters, such as floods and typhoons, in 24 countries [22]. To provide an overview of the concerns in these countries, the researchers employed Latent Dirichlet Allocation (LDA), a computational topic modeling technique, to extract key terms and topics from numerous reports and news. This interdisciplinary study offers insights that can be beneficial for policymakers and researchers to address the challenges of responding to such disasters during a pandemic.

Moreover, a different study introduced various topic modeling approaches that can handle the relationships between topics, changes in topics over time, and short texts such as those encountered in social media or other sparse text data. The paper also briefly reviewed the algorithms used to optimize and infer parameters in topic modeling, which are crucial for producing meaningful results, regardless of the approach [23].

The authors of [24] outlined the potential of structural topic modeling for organizational research and gave a step-by-step tutorial on how to apply it. The application example built on 428,492 reviews of Fortune 500 companies from the online platform Glassdoor, on which employees can evaluate organizations. The research demonstrated how structural topic models make it possible to inductively identify topics that matter to employees and to quantify their relationship with employees' perception of organizational culture. The paper discussed the advantages and limitations of topic modeling as a research method and outlined how future studies can apply the technique to study organizational phenomena.

The study [25] developed the embedded topic model, a generative model of documents that combines traditional topic models with word embeddings. More specifically, it models each word with a categorical distribution whose natural parameter is the inner product between the word's embedding and an embedding of its assigned topic. To fit the model, the authors developed an efficient amortized variational inference algorithm. The model found interpretable topics even with large vocabularies that include rare words and stop words.

The authors of the study [26] surveyed the research literature dealing with four challenges: appropriate pre-processing of the text collection; adequate selection of model parameters, including the number of topics to be generated; evaluation of the model's reliability; and the process of validly interpreting the resulting topics. They proposed a methodology that addresses these challenges. The objective was to make LDA topic modeling more accessible to communication researchers and to ensure compliance with disciplinary standards. To that end, the research developed a brief hands-on user guide for applying LDA topic modeling. The study demonstrated the value of the approach with empirical data from an ongoing research project.

Extensive work has been done in the past on SA, and many concerns and issues are still being addressed by researchers. We review the latest work on SA here. One of the main concerns with SA is the polarization of Twitter sentiment. The suggested method [27] categorizes attitudes using the eight fundamental emotions of Plutchik's wheel of emotions, which makes the task more manageable. Other elements have been applied following the Rule Based Emotion Classification (RBEM) algorithm to determine the polarity of messages. The proposed algorithm has demonstrated reasonable accuracy.

The authors [28] assessed the effectiveness of the various episodes of "Mann Ki Baat," a program that the Indian Prime Minister launched in 2014. This was done in two steps. First, SA was performed on the written episodes of this radio show. Second, Twitter posts made by the general public about the subjects covered in the various episodes were analyzed. The outcomes reveal that this show has benefited Indian citizens in a variety of ways. Additionally, their method validated this outcome with an accuracy of 85.4%.

This study [29] presents a hybrid SA strategy in which lexicon-based approaches are employed in conjunction with deep learning models to increase sentiment accuracy. Studies entail examining the influence of TextBlob on model classification accuracy in comparison to the original annotations while keeping in mind the possibility of fraudulent annotations.

The goal of the study [30] was to evaluate Persian tweets to assess Iranians' attitudes toward COVID-19 vaccination and toward domestic and foreign COVID-19 vaccines. The authors identified the sentiments of the retrieved tweets using a deep learning SA model based on a CNN-LSTM architecture.

Several studies have also combined topic modeling and SA. In the study [31], topic modeling and classification approaches are used to create a hybrid model for extracting customer opinions from tweets of the Abuja Electricity Distribution Company (AEDC). The electricity business can use SA to enhance the quality of its services. Dominant topics were generated from the tweets using the LDA topic modeling technique. The proposed model achieved a prediction accuracy of 94.8%.

In the study [32], a large dataset of geo-tagged tweets containing specific keywords relating to climate change is analyzed using volume analysis. LDA was used for topic modeling to generate the various themes of discussion, and the Valence Aware Dictionary and sEntiment Reasoner (VADER) was used for SA to determine the broad attitudes and viewpoints contained in the dataset. These techniques are used to investigate the nature of the climate change discussion between different countries over time. SA showed that the overall discussion is negative, particularly when users are reacting to political or extreme weather events. Topic modeling showed that the topics of discussion on climate change are diverse, but a few topics are more prevalent than others.

The authors [33] proposed a cross-mapping table approach based on a location's popularity, ratings, latent topics, and sentiment. The results show that the combined features of LDA, SVM, ratings, and cross-mappings are useful for improved performance.

The motivation behind the study [34] was to use the real experiences of customers who have flown with different airlines. The data consisted of more than 14,000 online reviews of 27 airlines. The objective was to identify which kinds of significant words appear in the online reviews.

The study [35] proposed an ontology and LDA (OLDA) based topic modeling and word embedding approach for sentiment classification. The proposed framework retrieves transportation content from social networks, removes irrelevant content to extract meaningful information, and generates topics and features from the extracted data using OLDA. Machine learning classifiers are used to evaluate the proposed word embedding framework. The method achieved an accuracy of 93%, which shows that the proposed approach is effective for sentiment classification.

The study [36] presented a bibliometric review of SA based on a structural topic modeling approach to obtain a broad overview of the research field. The authors additionally used techniques such as regression analysis, geographic visualization, social network analysis, and the Mann-Kendall trend test. The findings gave a comprehensive understanding of the trends and topics in SA, which could help in effectively monitoring future research works and projects. This review proposed a framework for conducting a comprehensive bibliometric analysis.

To date, little or no work has applied topic modeling to legal judgments. A few related works in this field are described here. The authors of [21] investigated 56 distinct strategies for analyzing text-based similarity across legal case reports on a dataset of Indian Supreme Court cases. Thirty of the 56 strategies are modifications of existing procedures, while the remaining 26 were proposed by the authors. Models such as BERT and Law2Vec are included among the techniques considered. They found that more conventional approaches (such as TF-IDF and LDA) that rely on a bag-of-words representation perform better than more advanced context-aware approaches (such as BERT and Law2Vec) for determining document-level similarity. Finally, they selected their five best-performing strategies for evaluating similarity across case reports based on experimental validation.

The paper [37] applied LDA topic modeling to a dataset of 3931 journal articles and investigated three questions: Which topics within legal research on AI can be identified? When were these topics addressed? Can similar papers be identified? The topic modeling resulted in a total of 32 meaningful topics. It was also found that legal research on AI increased as of 2016, with topics becoming more granular and diverse over time. Finally, a comparison of the similarity assessments produced by the algorithm and by a human expert suggests that the assessments often match.

The study [38] proposed the Supreme Court classifier, a framework that addresses the problem of classifying legal court opinion documents. The research compares methodologies that use traditional machine learning and neural network (NN)-based approaches. The authors also present a CNN used with pre-trained word vectors that outperforms the state of the art when applied to their dataset. The authors used the Washington University School of Law Supreme Court Database (SCDB) to train and evaluate the framework. The best system (word2vec + CNN) achieves 72.4% accuracy when classifying the court decisions into 15 broad SCDB categories and 31.9% accuracy when classifying among 279 finer-grained SCDB categories.

The work [39] described and evaluated the use of BERT for topic modeling in legal documents. The authors focused on a subset of landmark cases from the US case law dataset to evaluate the effect of topic modeling through domain-specific embeddings pre-trained with LEGAL-BERT. The study investigated several variations of generating sentence embeddings from the cases.

Table 1 summarizes the work done in the area of topic modeling by different authors. To the best of our knowledge, no one has applied topic modeling and SA together to legal documents, especially Indian judgments. Studies in other domains make use of SA and topic modeling, as discussed above, and most of them rely on Twitter data. This motivates our work in this field. The main objective of the paper is to employ topic modeling and SA as a coupled model for analyzing legal documents from Indian court judgments.

Table 1. Applications of Topic Modeling

| Reference | Topic modeling technique | Dataset | Sample size | Objective |
|---|---|---|---|---|
| [32] | LDA | Tweets | 3,90,016 | Infer different topics of discussion on global climate change |
| [40] | LDA | Twitter | NA | Identification of noteworthy topics of Twitter messages |
| [41] | LDA | Online review data | 23,614 | Social media mining for product planning |
| [42] | Nonnegative matrix factorization (NMF) | Cases reported by Los Angeles police department | 1,027,168 | Classification of crime into discrete categories |
| [43] | Top2Vec | News headlines | 100,000 | Investigation of COVID-19 news |
| [44] | LDA | News | 8000 | Inferring connections to the sociological view of culture |
| [45] | LDA | Blogs | 1,300,000 | Discussion on change in climate |
| [46] | LDA | Tweets | 1,09,076 | Analyze scholars' Twitter usage in CS conferences |

Proposed methodology

In this study, we propose a two-staged NLP framework that leverages the LDA and VADER algorithms to identify topics and the sentiments expressed by individuals towards those topics. Our methodology consists of two stages: Stage I employs topic modeling to identify topics from large documents, while Stage II uses VADER to extract sentiments related to a specific topic. The flowchart of our proposed technique is depicted in Fig. 1.

Fig. 1. Flowchart/Framework of the algorithm

Stage 1: In Stage 1, we begin by gathering large documents from diverse sources. These sources can be any written or printed materials that contain a substantial amount of data in a single document. To process the data, we first convert it into a text file, which undergoes several pre-processing procedures: stopword elimination, tokenization (breaking text into tokens), stemming (reducing a word to its root form), and lemmatization (grouping the inflected forms of a word under its base form). Once pre-processing is complete, the documents are ready for topic modeling. We use the LDA implementation in the Java-based MALLET tool to perform topic modeling. LDA generates a topics-per-document model and a words-per-topic model, using Dirichlet distributions as the modeling framework. After applying LDA, we obtain clusters of features, where each cluster represents a group of related words and is named by an expert in the field.
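The pre-processing steps named in Stage 1 can be sketched as follows. This is a minimal illustration using only the Python standard library: the stop-word list is a tiny hypothetical sample, and `crude_stem` is a deliberately simple stand-in for a real stemmer or lemmatizer, not the tooling used in the study.

```python
import re

# Tiny illustrative stop-word list; a real run would use a much fuller list.
STOP_WORDS = {"the", "of", "and", "to", "in", "is", "was", "by", "a", "an", "that", "for"}

def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lowercase, keep alphabetic tokens only, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("The appellants filed 2 appeals in the High Court.")
# tokens == ["appellant", "fil", "appeal", "high", "court"]
```

The over-stemmed "fil" shows why a linguistically informed stemmer or lemmatizer is preferable in practice; the sketch only illustrates the order of the steps.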

Stage 2: In Stage 2, we focus on SA, which involves studying the sentiments of individuals towards a specific topic. Here, we use the topics generated in Stage 1 as input. We extract Google News articles related to each topic using the Google News API and pre-process the extracted text. We then apply VADER to obtain sentiment scores for the news related to each topic. Finally, we analyze the data using visualization techniques.

Experimental setup

The experimental setup for the proposed methodology is as follows. The experiments were conducted on a High-Performance Computing facility equipped with an AMD Ryzen Threadripper PRO 3945WX processor with 12 cores, 64 GB of quad-channel DDR4 RAM, and an NVIDIA RTX A5000 graphics card with 24 GB of GDDR6 memory. The experiments were implemented in the Python programming language.

For Stage 1 (Topic Modeling), the dataset comprised 700 legal judgments, downloaded as Portable Document Format (PDF) files from the Supreme Court of India website, following the study in [47]. For Stage 2 (News from Google feed), the Google News API was used to extract the news feed, with the topics generated from the clusters of features obtained in Stage 1 as the search keys.

The experiment was conducted in two stages. Stage I aimed to identify the critical topics of legal documents using topic modeling techniques, while Stage II aimed to evaluate the sentiments of people based on news articles published via Google News on the topics identified in Stage I.

Stage I

  • Data Collection: The judgments are downloaded from the Supreme Court of India website.

  • Data Pre-processing: The documents were converted to text files. Because these judgments contain many common keywords, the text files were pre-processed by converting the text to lowercase and removing stop words, punctuation, and numbers. The text files were then tokenized and stemmed.

  • Topic Modeling: LDA requires the number of topics to be specified in advance, so the model was trained for several different numbers of topics. We ran the dataset empirically with the number of topics in the range [10, 25] at intervals of 5 and found that n = 15 performs well. We therefore used 15 topics in our experiment, generating fifteen separate topic clusters, each with its top correlated words.
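To make the topic modeling step concrete, the following is a toy collapsed Gibbs sampler for LDA written with the Python standard library only. It is a sketch of the general technique, not the MALLET implementation used in the study; the hyperparameters `alpha` and `beta`, the iteration count, and the four example documents are assumptions chosen for illustration.

```python
import random

def fit_lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns per-document topic proportions."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # topic counts per document
    topic_word = [[0] * n_vocab for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                           # total words per topic
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            k = rng.randrange(n_topics)
            doc_assign.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(doc_assign)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed topic proportions per document (each row sums to 1).
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]

# Hypothetical pre-processed documents, not data from the study.
docs = [
    ["bail", "accused", "police", "bail"],
    ["land", "plot", "compensation", "land"],
    ["accused", "police", "evidence"],
    ["plot", "land", "building"],
]
theta = fit_lda_gibbs(docs, n_topics=2, n_iters=50)
```

The empirical search described above would correspond to calling such a routine once for each candidate in `range(10, 30, 5)`, that is 10, 15, 20, and 25 topics, and comparing the resulting topics.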

Stage II

  • Web Scraping: Four topics were randomly selected from the topics produced in Stage I and were given names for clarity. Using these topic names as search keys, news related to the topics was scraped using the Google News API.

  • Data Preprocessing: The news articles were pre-processed by converting the sentences to lowercase, tokenizing, stemming, and removing stop words.

  • Sentiment Polarity: The sentiment polarity of each news article was computed using the VADER algorithm. The VADER score ranges from -1 (strongly negative) to +1 (strongly positive). All the articles were divided into positive (> 0), negative (< 0), and neutral (= 0) categories.
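The categorization rule in the last step can be sketched as below. The scores are hypothetical values standing in for VADER output; with the open-source `vaderSentiment` package, one way to obtain such a score would be `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]`, although the source does not state which implementation was used.

```python
from collections import Counter

def polarity_label(score):
    """Map a sentiment score in [-1, 1] to positive, negative, or neutral."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def summarize(scores):
    """Count how many articles fall into each sentiment category."""
    return Counter(polarity_label(s) for s in scores)

# Hypothetical scores for five articles on one topic (not data from the study).
example_scores = [0.62, -0.31, 0.0, 0.12, -0.77]
counts = summarize(example_scores)
# counts == Counter({"positive": 2, "negative": 2, "neutral": 1})
```

Per-topic counts of this kind are what a bar chart of positive, negative, and neutral articles would display.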

Results & discussion

The 700 legal documents downloaded from the Supreme Court of India website as Portable Document Format (PDF) files were converted to text files and pre-processed using stemming and lemmatization techniques. The resulting vocabulary size was 39,828, and the average number of words per document was around 2000.

To identify critical topics of legal documents, we applied topic modeling to these files, selecting 15 topics empirically. Each topic provided a set of features or words; Table 2 lists the topics and their associated features. We assigned a descriptive name to each topic. Topic 0 contained generalized features and was labeled General, while the remaining topics were named according to their focus. These included writ cases, constitutional matters, land/property disputes, criminal matters, disaster management, Indian Penal Code matters, vehicle-related cases, property-related issues, Insolvency Act matters, arbitration trials, service agreements, criminal cases, sales tax cases, and capital punishment in India.

Table 2. Labels and top 20 words for 15 topics from the proposed LDA topic model

| Topic No | Top Words | Label of Topic |
|---|---|---|
| 0 | provisions, right, SCC, public, authority, decision, person, power, time, state, sub, provision, government, judicial, provided, manner, matter, necessary, effect, article | General |
| 1 | high, learned, judgment, said, filed, passed, submitted, counsel, civil, writ, view, appellants, present, proceedings, matter, date, period, impugned, orders, notice | Writ |
| 2 | state, article, constitution, scheduled, backward, commission, list, reservation, election, committee, constitutional, amendment, classes, government, union, parliament, members, power, judgment, castes | Constitutional Matters |
| 3 | land, building, government, project, development, plan, area, public, state, construction, authority, central, notification, buildings, plot, forest, use, proposed, heritage, compensation | Land/property dispute |
| 4 | accused, police, evidence, prosecution, witness, mohmed, confession, stated, body, examination, deposed, time, persons, criminal, said, ali, victim, recovered, taken, kumar | Criminal |
| 5 | government, state, submitted, covid, persons, disability, union, disabilities, workers, national, policy, disaster, pandemic, central, writ, special, children, scheme, states, relief | Disaster management |
| 6 | accused, offence, criminal, bail, high, fir, police, offences, complaint, investigation, magistrate, trial, person, singh, code, crpc, charge, sections, ipc, proceedings | IPC |
| 7 | bank, company, compensation, alcohol, commission, vehicle, complaint, insurance, loss, accident, consumer, person, cheque, locker, said, national, agreement, forum, claim, liability | Vehicle-Related |
| 8 | suit, property, company, plaintiff, decree, defendant, possession, sale, deed, filed, family, rule, land, parties, trial, companies, tata, judgment, civil, held | Property-related |
| 9 | resolution, corporate, plan, debtor, financial, creditors, code, authority, adjudicating, debt, nclt, insolvency, coc, creditor, ibc, process, approval, cirp, committee, company | Insolvency Act |
| 10 | arbitration, award, agreement, parties, contract, tribunal, arbitral, arbitrator, scc, party, commercial, dispute, limitation, proceedings, period, held, disputes, judgment, civil, foreign | Arbitration |
| 11 | service, candidates, appointment, rules, post, selection, state, years, age, year, medical, vacancies, government, examination, rule, officers, committee, college, high, list | Service agreement |
| 12 | accused, evidence, deceased, prosecution, trial, singh, high, ipc, witnesses, state, appellants, injuries, stated, persons, judgment, police, scc, incident, criminal, learned | Criminal |
| 13 | goods, tax, power, refund, payment, customs, rate, services, state, sale, input, import, itc, high, income, credit, purchase, paid, assesse, supply | Sales Tax |
| 14 | death, sentence, state, imprisonment, years, victim, offence, life, criminal, accused, ipc, circumstances, conviction, scc, committed, crime, evidence, child, sexual, punishment | Capital Punishment |

After performing topic modeling on the 700 pre-processed and lemmatized legal documents, correlation tests were conducted to obtain the correlation value of each document with each of the 15 topics. The boxplot in Fig. 2 shows the distribution of these values per topic and indicates that the values are more dispersed for Topic 1 and Topic 12. Topic 0 is right-skewed, Topic 1 is approximately normally distributed, and Topic 12 is right-skewed; each topic has many outliers beyond its whiskers. Table 3 displays the maximum and minimum probability values of each topic, together with the corresponding document numbers. The highest probability value, 0.965, was obtained for Topic 1 with document 590. Figure 3 shows the highest association of each topic with a document number, presenting the same results as Table 3.

Fig. 2. Boxplot visualization of various topics with the documents

Table 3. Documents with the highest and lowest probability per topic

| Topic no | Max probability value | Max probability document no | Min probability value | Min probability document no |
|---|---|---|---|---|
| 0 | 0.788751323 | 441 | 0.00012936 | 425 |
| 1 | 0.965711919 | 590 | 8.21518E-05 | 586 |
| 2 | 0.740050002 | 144 | 1.03875E-06 | 213 |
| 3 | 0.696587298 | 80 | 2.50045E-06 | 321 |
| 4 | 0.815674705 | 653 | 8.81921E-07 | 65 |
| 5 | 0.826009937 | 304 | 1.80191E-06 | 580 |
| 6 | 0.757856215 | 223 | 2.21043E-06 | 403 |
| 7 | 0.692951989 | 399 | 1.38036E-06 | 403 |
| 8 | 0.770479731 | 148 | 1.14694E-06 | 65 |
| 9 | 0.794709373 | 117 | 5.02766E-07 | 65 |
| 10 | 0.738390115 | 224 | 1.96209E-06 | 580 |
| 11 | 0.761465274 | 206 | 1.62472E-06 | 213 |
| 12 | 0.95425968 | 481 | 7.8978E-06 | 321 |
| 13 | 0.711524975 | 216 | 7.24678E-07 | 65 |
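A summary of this kind can be derived from a document-topic probability matrix as sketched below. The matrix here is a hypothetical three-document, two-topic example rather than the study's data, and the zero-based document indices are an assumption of the sketch.

```python
def extreme_documents(doc_topic):
    """For each topic, return (max_prob, max_doc, min_prob, min_doc)."""
    n_topics = len(doc_topic[0])
    rows = []
    for k in range(n_topics):
        # Probabilities of topic k across all documents.
        column = [row[k] for row in doc_topic]
        max_doc = max(range(len(column)), key=column.__getitem__)
        min_doc = min(range(len(column)), key=column.__getitem__)
        rows.append((column[max_doc], max_doc, column[min_doc], min_doc))
    return rows

# Hypothetical document-topic matrix: one row per document, one column per topic.
doc_topic = [
    [0.90, 0.10],
    [0.25, 0.75],
    [0.40, 0.60],
]
summary = extreme_documents(doc_topic)
# summary == [(0.90, 0, 0.25, 1), (0.75, 1, 0.10, 0)]
```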

Fig. 3. Displaying Top Title Topic-wise

Table 4. Comparison table for the Proposed Approach

| Paper | Techniques | Objective |
|---|---|---|
| [48] | LDA, SA | Stock Market Prediction |
| [32] | LDA, SA | Global climate change |
| [49] | Topic Modeling, SA | Product opportunities |
| [34] | Topic modeling, SA | Airline Reviews |
| [33] | LDA, SA | Tourist Spots |
| [50] | Topic Modeling, SA | Airport service experience |
| [51] | Topic Modeling, SA | Online Education in the COVID-19 Era |
| [52] | Topic Modeling, SA | Bangladesh Airlines |

To extract the Google News feed related to four randomly chosen topics from the topic model, namely land or property dispute cases, capital punishment in India, insolvency and bankruptcy cases, and service-related cases or matters, web scraping was conducted using the Google News API. After extracting the news, VADER was applied to obtain a polarity value for each article, where '1' indicates positive sentiment, '0' indicates neutral sentiment, and '-1' indicates negative sentiment. Figure 4a indicates that people or media have almost equal positive and negative sentiments toward land/property dispute cases. Figure 4b suggests that news related to capital punishment in India is perceived more negatively, while Fig. 4c suggests that news related to insolvency and bankruptcy cases/matters is perceived more positively. Figure 4d shows that news related to service-related laws is also perceived more positively.

Fig. 4. Graphs representing SA on different topics. (a) SA of Google news on Land/Property Dispute Cases. (b) SA of Google news on Capital Punishment Cases in India. (c) SA of Google news on Insolvency & Bankruptcy Cases. (d) SA of Google news on service-related laws in India

The analysis of the legal judgments using topic modeling reveals that there are numerous themes of debate, with some being more prevalent than others. To ascertain the effectiveness of our proposed methodology, it is worth noting that no prior studies have attempted to use topic modeling and sentiment analysis in the context of legal documents or judgments. As such, a comparative analysis with previous works is not feasible. Nonetheless, Table 4 presents past research that has utilized topic modeling and sentiment analysis in other domains, demonstrating the novelty of our approach in the legal domain.

The proposed approach is therefore a qualitative model rather than a quantitative one. It is unique in that it uses a completely novel dataset. The model saves considerable time, as it reduces the need to read long, jargon-heavy documents that may be difficult to understand.

Conclusion

This paper presents a pioneering study that investigates the application of topic modeling and sentiment analysis in Indian legal documents. Our proposed methodology effectively extracts topics and identifies related sentiments in lengthy legal documents, with promising results that have the potential to enhance users' ability to comprehend legal judgments and identify relevant sentiments in a shorter amount of time. While this area has been largely unexplored by previous studies, our approach provides a valuable contribution to the field. However, there are still significant challenges that need to be addressed, such as the lack of optimized topic models for legal data. Overall, this study represents a significant milestone in the exploration of topic modeling and sentiment analysis in Indian legal documents, providing valuable insights to legal professionals and researchers.

Authors' contributions

All the authors of this manuscript contributed to the conceptual framework and design of the study. Material preparation, data collection, and analysis were performed by IG and IC. IG wrote the first draft of the manuscript. IC and NG have edited and revised the manuscript. All authors read and approved the final manuscript.

Funding

The authors declare that no funding has been received to perform this study.

Data availability

The authors declare that this study is fully reproducible, and the data used in this research work may be shared with readers at their request.

Declarations

Conflict of interest

The authors declare no conflict of interest.

Consent for publication

The authors give their full consent for the publication of identifiable details, which can include a photograph(s) and/or details within the text (“Material”) to be published in this esteemed journal in the form of an article.

References

  • 1.Hearst M. What is text mining. SIM UC Berkeley. 2003;5:2234. [Google Scholar]
  • 2.Kumar A, Dabas V, Hooda P. Text classification algorithms for mining unstructured data: a SWOT analysis. Int J Inf Technol. 2020;12(4):1159–1169. doi: 10.1007/s41870-017-0072-1. [DOI] [Google Scholar]
  • 3.Ding K, Choo WC, Ng KY, Ng SI. Employing structural topic modelling to explore perceived service quality attributes in Airbnb accommodation. Int J Hosp Manag. 2020;91:102676. doi: 10.1016/J.IJHM.2020.102676. [DOI] [Google Scholar]
  • 4.Khurana D, Koli A, Khatter K, Singh S. Natural language processing: state of the art, current trends and challenges. Multimed Tools Appl. 2022 doi: 10.1007/s11042-022-13428-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Vayansky I, Kumar SAP. A review of topic modeling methods. Inf Syst. 2020;94:101582. doi: 10.1016/j.is.2020.101582. [DOI] [Google Scholar]
  • 6.Koltcov NSISOK. Topic modelling for qualitative studies. J Inf Sci. 2015;26(5):599–613. doi: 10.1177/0165551515617393. [DOI] [Google Scholar]
  • 7.Asmussen CB, Møller C. Smart literature review: a practical topic modelling approach to exploratory literature review. J. Big Data. 2019 doi: 10.1186/s40537-019-0255-7. [DOI] [Google Scholar]
  • 8.Negara ES, Triadi D, Andryani R. Topic Modelling Twitter Data with Latent Dirichlet Allocation Method. ICECOS Int Conf Electr Eng Comput Sci. 2019 doi: 10.1109/ICECOS47637.2019.8984523. [DOI] [Google Scholar]
  • 9.Reisenbichler M, Reutterer T. Topic modeling in marketing: recent advances and research opportunities. J Bus Econ. 2019;89(3):327–356. doi: 10.1007/s11573-018-0915-7. [DOI] [Google Scholar]
  • 10.Yu H, Yang J. A direct LDA algorithm for high-dimensional data—with application to face recognition. Pattern Recognit. 2001;34(10):2067–2070. doi: 10.1016/S0031-3203(00)00162-X. [DOI] [Google Scholar]
  • 11.Iqbal F, et al. A Hybrid Framework for Sentiment Analysis Using Genetic Algorithm Based Feature Reduction. IEEE Access. 2019;7:14637–14652. doi: 10.1109/ACCESS.2019.2892852. [DOI] [Google Scholar]
  • 12.Landauer TK. LSA as a theory of meaning. In Handbook of latent semantic analysis. 2007 doi: 10.4324/9780203936399. [DOI] [Google Scholar]
  • 13.Lu Y, Mei Q, Zhai C. Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA. Inf Retr Boston. 2011;14:178–203. doi: 10.1007/s10791-010-9141-9. [DOI] [Google Scholar]
  • 14.Liu B. Sentiment analysis: Mining opinions, sentiments, and emotions. USA: Cambridge University Press; 2015. [Google Scholar]
  • 15.Liu B. Sentiment analysis and opinion mining. Synth Lect Hum Lang Technol. 2012;5(1):1–184. doi: 10.2200/S00416ED1V01Y201204HLT016. [DOI] [Google Scholar]
  • 16.Farhadloo M, Rolland E. Fundamentals of sentiment analysis and its applications. In Studies in Computat Intell. 2016;639:1–24. [Google Scholar]
  • 17.Chen X, Zou D, Xie H. Fifty years of British Journal of Educational Technology: A topic modeling based bibliometric perspective. Br J Educ Technol. 2020;51(3):692–708. doi: 10.1111/bjet.12907. [DOI] [Google Scholar]
  • 18.Zhang L, Liu B. Aspect and Entity Extraction for Opinion Mining. In: Chu WW, editor. Data Mining and Knowledge Discovery for Big Data: Methodologies, Challenge and Opportunities. Berlin, Heidelberg: Springer; 2014. [Google Scholar]
  • 19.Ghosh S, Hazra A, Raj A. A comparative study of different classification techniques for sentiment analysis. Int J Synth Emot. 2020;11(1):49–57. doi: 10.4018/IJSE.20200101.oa. [DOI] [Google Scholar]
  • 20.Wawre SV, Deshmukh SN. Sentiment Classification using Machine Learning. Techniques. 2016;5:2015–2017. [Google Scholar]
  • 21.Mandal A, Ghosh K, Ghosh S, Mandal S. Unsupervised approaches for measuring textual similarity between legal court case reports. Artif Intell Law. 2021;29(3):417–451. doi: 10.1007/s10506-020-09280-2. [DOI] [Google Scholar]
  • 22.Malakar K, Lu C. Hydrometeorological disasters during COVID-19: Insights from topic modeling of global aid reports. Sci Total Environ. 2022;838:155977. doi: 10.1016/j.scitotenv.2022.155977. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Vayansky I, Kumar SAP. A review of topic modeling methods. Inf Syst. 2020;94:101582. doi: 10.1016/j.is.2020.101582. [DOI] [Google Scholar]
  • 24.Schmiedel T, Müller O, vom Brocke J. Topic modeling as a strategy of inquiry in organizational research: a tutorial with an application example on organizational culture. Organ Res Methods. 2019;22(4):941–968. doi: 10.1177/1094428118773858. [DOI] [Google Scholar]
  • 25.Dieng AB, Ruiz FJR, Blei DM. Topic modeling in embedding spaces. Trans Assoc Comput Linguist. 2020;8:439–453. doi: 10.1162/tacl_a_00325. [DOI] [Google Scholar]
  • 26.Maier D, et al. Applying LDA topic modeling in communication research: toward a valid and reliable methodology. Commun Methods Meas. 2018;12(2–3):93–118. doi: 10.1080/19312458.2018.1430754. [DOI] [Google Scholar]
  • 27.Kumar P, Vardhan M. PWEBSA: twitter sentiment analysis by combining plutchik wheel of emotion and word embedding. Int J Inf Technol. 2022;14(1):69–77. doi: 10.1007/s41870-021-00767-y. [DOI] [Google Scholar]
  • 28.Garg K. Sentiment analysis of Indian PM’s ‘Mann Ki Baat’. Int J Inf Technol. 2020;12(1):37–48. doi: 10.1007/s41870-019-00324-8. [DOI] [Google Scholar]
  • 29.Aljedaani W, et al. Sentiment analysis on Twitter data integrating TextBlob and deep learning models: The case of US airline industry. Knowled-Based Syst. 2022;255:109780. doi: 10.1016/j.knosys.2022.109780. [DOI] [Google Scholar]
  • 30.Bokaee Nezhad Z, Deihimi MA. Twitter sentiment analysis from Iran about COVID 19 vaccine. Diabetes Metab Syndr Clin Res Rev. 2022;16:102367. doi: 10.1016/j.dsx.2021.102367. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Ugochi O, Prasad R, Odu N, Ogidiaka E, Ibrahim BH. Customer opinion mining in electricity distribution company using twitter topic modeling and logistic regression. Int J Inf Technol. 2022;14(4):2005–2012. doi: 10.1007/s41870-022-00890-4. [DOI] [Google Scholar]
  • 32.Dahal B, Kumar SAP, Li Z. Topic modeling and sentiment analysis of global climate change tweets. Soc Netw Anal Min. 2019;9(1):1–20. doi: 10.1007/s13278-019-0568-8. [DOI] [Google Scholar]
  • 33.Shafqat W, Byun YC. A recommendation mechanism for under-emphasized tourist spots using topic modeling and sentiment analysis. Sustain. 2020 doi: 10.3390/SU12010320. [DOI] [Google Scholar]
  • 34.Kwon HJ, Ban HJ, Jun JK, Kim HS. Topic modeling and sentiment analysis of online review for airlines. Inf. 2021;12(2):1–14. doi: 10.3390/info12020078. [DOI] [Google Scholar]
  • 35.Ali F, et al. Transportation sentiment analysis using word embedding and ontology-based topic modeling. Knowled-Based Syst. 2019;174:27–42. doi: 10.1016/j.knosys.2019.02.033. [DOI] [Google Scholar]
  • 36.Chen X, Xie H. A structural topic modeling-based bibliometric study of sentiment analysis literature. Cognit Comput. 2020;12(6):1097–1129. doi: 10.1007/s12559-020-09745-1. [DOI] [Google Scholar]
  • 37.C. Rosca, B. Covrig, C. Goanta, G. van Dijck, and G. Spanakis, 2020 Return of the AI: An Analysis of Legal Research on Artificial Intelligence Using Topic Modeling. In NLLP@ KDD. 3–10.
  • 38.Undavia S, Meyers A, Ortega JE. A Comparative Study of Classifying Legal Documents with Neural Networks. In: Fed Confer Comp Sci Inform Syst (FedCSIS). 2018;2018:515–522. [Google Scholar]
  • 39.Silveira R, Fernandes CG, Neto JAM, Furtado V, Pimentel Filho JE. Topic Modelling of Legal Documents via LEGAL-BERT. Proc. 2021;1613:73. [Google Scholar]
  • 40.D. A. Ostrowski, “Using latent dirichlet allocation for topic modelling in twitter,” Proc. 2015 IEEE 9th Int. Conf. Semant. Comput. IEEE ICSC 2015, pp. 493–497, 2015, doi: 10.1109/ICOSC.2015.7050858.
  • 41.Jeong B, Yoon J, Lee J-M. Social media mining for product planning: A product opportunity mining approach based on topic modeling and sentiment analysis. Int J Inf Manage. 2019;48:280–290. doi: 10.1016/j.ijinfomgt.2017.09.009. [DOI] [Google Scholar]
  • 42.Kuang D, Brantingham PJ, Bertozzi AL. Crime topic modeling. Crime Sci. 2016 doi: 10.1186/s40163-017-0074-0. [DOI] [Google Scholar]
  • 43.Ghasiya P, Okamura K. Investigating COVID-19 News across Four Nations: A Topic Modeling and Sentiment Analysis Approach. IEEE Access. 2021;9:36645–36656. doi: 10.1109/ACCESS.2021.3062875. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.DiMaggio P, Nag M, Blei D. Exploiting affinities between topic modeling and the sociological perspective on culture: Application to newspaper coverage of U.S. government arts funding. Poetics. 2013;41(6):570–606. doi: 10.1016/J.POETIC.2013.08.004. [DOI] [Google Scholar]
  • 45.Elgesem D, Steskal L, Diakopoulos N. Structure and Content of the Discourse on Climate Change in the Blogosphere: The Big Picture. Environ Commun. 2015;9(2):169–188. doi: 10.1080/17524032.2014.983536. [DOI] [Google Scholar]
  • 46.Parra D, Trattner C, Gómez D, Hurtado M, Wen X, Lin YR. Twitter in academic events: A study of temporal usage, communication, sentimental and topical patterns in 16 Computer Science conferences. Comput Commun. 2016;73:301–314. doi: 10.1016/J.COMCOM.2015.07.001. [DOI] [Google Scholar]
  • 47.Supreme Court of India. Judgments. https://main.sci.gov.in/judgments Accessed 04 Oct 2022.
  • 48.Nguyen TH, Shirai K. Topic modeling based sentiment analysis on social media for stock market prediction. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. 2015;1:1354–1364. [Google Scholar]
  • 49.Jeong B, Yoon J, Lee JM. Social media mining for product planning: A product opportunity mining approach based on topic modeling and sentiment analysis. Int J Inf Manage. 2019;48(April):280–290. doi: 10.1016/j.ijinfomgt.2017.09.009. [DOI] [Google Scholar]
  • 50.Kiliç S, Çadirci TO. An evaluation of airport service experience: An identification of service improvement opportunities based on topic modeling and sentiment analysis Res. Transp Bus Manag. 2022;43:100744. doi: 10.1016/j.rtbm.2021.100744. [DOI] [Google Scholar]
  • 51.Waheeb SA, Khan NA, Shang X. Topic modeling and sentiment analysis of online education in the covid-19 era using social networks based datasets. Electronics. 2022;11:5. doi: 10.3390/electronics11050715. [DOI] [Google Scholar]
  • 52.Hasib KM, Towhid NA, Alam MGR. Topic modeling and sentiment analysis using online reviews for bangladesh airlines ieee 12th annual information technology. Elect Mob Commun Conf (IEMCON) 2021;2021:428–434. doi: 10.1109/IEMCON53756.2021.9623155. [DOI] [Google Scholar]

Articles from International Journal of Information Technology are provided here courtesy of Nature Publishing Group