Skip to main content
Data in Brief logoLink to Data in Brief
. 2024 Feb 22;53:110232. doi: 10.1016/j.dib.2024.110232

A structured sentiment analysis dataset based on public comments from various domains

Zhongliang Wei 1,, Shunxiang Zhang 1
PMCID: PMC10910210  PMID: 38439992

Abstract

A structured sentiment analysis dataset, derived from social media comments, is introduced in this paper. The dataset spans 22 diverse domains and comprises over 200,000 reviews, providing a rich resource for sentiment analysis tasks in the Chinese language context. Each comment within the dataset has been manually annotated with a sentiment label, either positive, negative, or neutral, and grouped by topic. This meticulous annotation process ensures the dataset's reliability for training, validating, and testing sentiment analysis models. The construction of the dataset involved a three-step process. Initially, data was collected from the topics that garnered high attention and discussion rates, thereby reflecting the authentic opinions of users. Following data collection, preprocessing was undertaken to remove extraneous elements, while preserving emoticons that are crucial for sentiment analysis. The final step involved manual annotation by researchers, who assigned sentiment labels to each comment based on various factors. The dataset stands as a valuable contribution to the field of natural language processing, particularly for sentiment analysis tasks in the Chinese language context.

Keywords: Sentiment classification, Text mining, Triple classification, Natural language processing


Specifications Table

Subject Computer Science, Natural Language Processing.
Specific subject area Chinese language, Structured dataset, Sentiment analysis, machine learning, deep learning
Data format Analyzed
Type of data Text, Table
Data collection The data were acquired by manually scraping relevant data from Chinese social media platform: Weibo. These data were categorized into 22 domains based on the topics and comments. The data collection was conducted through web browsers and the data were stored in Microsoft EXCEL files.
Data source location The structured sentiment analysis dataset was constructed at Anhui University of Science and Technology, Huainan, China.
Data accessibility Repository name: Zenodo DOI:10.5281/zenodo.10488076
Direct URL to Data: https://zenodo.org/records/10488077

1. Value of the Data

  • This dataset is a Chinese text sentiment analysis dataset, covering 22 domains and nearly 200,000 pieces of data information, and it can be used to analyze the sentiment of Chinese text. It is suitable for researchers who are interested in exploring the sentiment dynamics and variations across different topics and domains in the Chinese social media context.

  • To the best of our knowledge, this dataset is a rare Chinese text dataset with structured collection for topics and reviews. It can facilitate the development and evaluation of sentiment analysis models that can leverage the structured information.

  • Researchers can use this dataset for sentiment analysis by dictionary-based methods, machine learning-based methods or deep learning-based methods. It can also serve as a benchmark dataset for comparing the performance of different methods and models on Chinese sentiment analysis tasks.

  • This dataset provides a different perspective on Chinese media information, and researchers can incorporate this structured information into their proposed sentiment computing models. For example, researchers can use this dataset to investigate how the sentiment of a topic is influenced by the sentiment of its related reviews.

2. Background

Sentiment analysis, alternatively recognized as opinion mining or emotion AI, is a cognitive process that involves the systematic examination, processing, summarization, and inference of subjective text imbued with emotional connotations.

Integral to the effectiveness of sentiment analysis is the requirement for a substantial volume of text data to train models. Weibo is the largest Chinese text data generation platform compared to other Chinese social media. According to Weibo Reports Third Quarter 2023 Unaudited Financial Results [1], in September 2023, Weibo had 605 million monthly active users and 260 million daily active users, so it is necessary to construct a dataset on Weibo for Chinese sentiment analysis.

In the course of our investigation into sentiment analysis [2], [3], [4], we have deliberately directed our attention towards the pivotal role that datasets play in shaping the outcomes of sentiment analysis models. As our research unfolded, we observed a discernible pattern in the structure of topics and reviews prevalent in social media contexts [5]. This observation prompted the conceptualization and implementation of a dataset aligning with similar structures in 2022. Significantly, recent developments indicate a noteworthy interest from other researchers who are actively pursuing avenues stemming from this line of research [6].

3. Data Description

The dataset denoted as ``ch_22d_org'' is characterized by its composition of content spanning across 22 distinct domains. Each domain is systematically assigned a numerical order for reference, delineated as follows: Emotion, Celebrities, Finance, Law, Sports, Cinema, Shows, Campus, Tourism, TV Dramas, Technology, Health, Games, Military, Digital, Constellations, Fitness, Comedy, Animation, International, Covid-19, and Government. Notably, the dataset's organizational structure entails the segregation of content from each domain into individual Excel files, facilitating a clear demarcation of thematic categories. The comprehensive arrangement of domains and their respective content is visually represented in Fig. 1, elucidating the overarching framework of the dataset.

Fig. 1.

Fig. 1

The overall structure of the dataset.

The uniform structure of each Excel file within the ``ch_22d_org'' dataset is characterized by a consistent arrangement of fields. These fields, encapsulated within the title field, include numerical identifiers, category classification, topical information, reviews, and corresponding labels. The correlation between the ``topic'' and ``review'' fields within each data table adheres to a 1-to-10 relationship. This structural characteristic, as expounded in Section VALUE OF THE DATA, underscores the organized structure of the dataset. This structure is a distinctive feature that sets it apart from other Chinese datasets. This unique 1-to-10 relationship is illustratively exemplified in Table 1, providing a tangible representation of the structured framework inherent in the dataset.

Table 1.

The organizational structure of data.

No. Domain Topic Review Label
17-000011-0 Fitness Trying to lose fat and build muscle this year I love it. It's my dream body. Positive
17-000011-1 Fitness Trying to lose fat and build muscle this year I wouldn't believe you if you said you didn't work out at all. Negative
17-000011-2 Fitness Trying to lose fat and build muscle this year That's a great body. Positive
17-000011-3 Fitness Trying to lose fat and build muscle this year Do you have a link to that top? Neutral
17-000011-4 Fitness Trying to lose fat and build muscle this year It's a great body. It grows where it's supposed to. Positive
17-000011-5 Fitness Trying to lose fat and build muscle this year I want to build muscle. Neutral
17-000011-6 Fitness Trying to lose fat and build muscle this year I'm envious. Positive
17-000011-7 Fitness Trying to lose fat and build muscle this year What a perfect body! Positive
17-000011-8 Fitness Trying to lose fat and build muscle this year It's a shame to lose weight. Negative
17-000011-9 Fitness Trying to lose fat and build muscle this year Don't be like this. Negative

As shown in Table 1, a noteworthy attribute of the dataset lies in the consistent alignment of topics, each corresponding to 10 distinct reviews. The label field within the table is intentionally annotated manually, featuring classifications into positive, negative, and neutral categories. This meticulous labeling process is integral to the structured nature of the dataset, a quality that permeates consistently across its entirety. This structural coherence is evident not only in the thematic content but also in ancillary fields, such as the serial number field.

Examining the serial number field as an illustrative example, the structured comment groups are organized systematically. The serial numbers, ranging from 17-000011-0 to 17-000011-9, delineate a cohesive unit of comments. The leftmost numerical identifier ‘17’ signifies the 17th domain, corroborated by the category label ‘fitness.’ The intermediary numerical sequence ‘000011’ signifies the initial group of structured comment information, a numerical range that extends, in some instances, up to 001200 within specific Excel files. Finally, the numeric range ‘0 to 9’ corresponds sequentially to different comments under the same topic within this structured group.

To further elucidate the dataset's compositional characteristics, Table 2 presents a comprehensive overview of comment counts and label statistics for each domain. This tabulation serves to provide quantitative insights into the distribution of comments and associated sentiment labels, contributing to a more nuanced understanding of the dataset's content and structure. emotion, celebrities, finance, law, sports, cinema, variety shows, campus, tourism, TV dramas, technology, health, games, military, digital, constellations, fitness, comedy, animation, international, epidemic, and government.

Table 2.

The number of comments and label statistics for each domain.

Domain File Name Number of Topics Number of Reviews positive negative neutral
Emotion 01-情感.xlsx 1100 11,000 1566 715 8719
Celebrities 02-明星.xlsx 1100 11,000 7135 1684 2181
Finance 03-财经.xlsx 1100 11,000 2935 3287 4778
Law 04-法律.xlsx 1100 11,000 5744 2465 2791
Sports 05-体育.xlsx 688 6880 3239 1655 1986
Cinema 06-放映厅.xlsx 1100 11,000 5454 3855 1691
Shows 07-综艺.xlsx 1100 11,000 5424 1621 3955
Campus 08-校园.xlsx 1100 11,000 2206 5897 2897
Tourism 09-旅游.xlsx 1200 12,000 6734 309 4957
TV Dramas 10-电视剧.xlsx 1082 10,820 6451 2735 1634
Technology 11-科技.xlsx 403 4030 1575 942 1513
Health 12-养生.xlsx 650 6500 4773 1587 140
Games 13-游戏.xlsx 1100 11,000 4812 2655 3533
Military 14-军事.xlsx 1100 11,000 4013 3828 3159
Digital 15-数码.xlsx 1100 11,000 5886 3306 1808
Constellations 16-星座.xlsx 1110 11,100 5716 3356 2028
Fitness 17-健生.xlsx 860 8600 4663 2032 1905
Comedy 18-搞笑.xlsx 1100 11,000 4401 4266 2333
Animation 19-动漫.xlsx 1100 11,000 4238 2060 4702
International 20-国际.xlsx 800 8000 5077 1820 1103
Covid-19 21-疫情.xlsx 70 700 302 208 190
Government 22-政务.xlsx 571 5710 2606 924 2180
Total 20,624 206,240 94,950 51,207 60,183

Illustrated in Fig. 2 is a graphical representation delineating the proportions of positive, negative, and neutral labels within the dataset. This visualization serves to intuitively convey the distribution of label types both collectively across all data and individually for each of the 22 domains. The figure offers a comprehensive overview, enabling a nuanced understanding of the sentiment label distribution within the dataset.

Fig. 2.

Fig. 2

The percentage of three labeled information in pie charts.

By presenting the proportional breakdown of sentiment labels, Fig. 2 facilitates a visual exploration of how positive, negative, and neutral sentiments are distributed across the entire dataset and within specific domains. This graphical representation contributes to the interpretability of the dataset's sentiment dynamics, allowing researchers to discern patterns and variations in sentiment expressions across diverse domains.

4. Experimental Design, Materials and Methods

The construction of this dataset was built in the following three steps, as shown in Fig. 3.

  • 1.

    Data collection for our study involved a substantial workload, particularly across 22 distinct domains. To streamline the process, we allocated the task among 11 students, each responsible for gathering data from two specific domains. The data was acquired by accessing designated Weibo Topics through a web browser. Starting from February 8, 2022, and ending on May 31, 2022, a period of nearly 100 days, the data were obtained by accessing specified Weibo topics through the web browser. These topics were selected based on their daily popularity, which is characterized by a high level of attention and discussion, as well as the diversity and authenticity of Weibo users' opinions. For each topic, we selectively collected 10 reviews based on the popularity of the reviews. We consider that the more popular reviews are more compatible with the current topic. The resulting data was meticulously organized, with different types of information classified and stored in separate Excel files. Each Excel file corresponds to a specific domain and is named using the ``No.-domain.xlsx'' convention.

  • 2.

    Following data collection, we undertook a data preprocessing step to enhance the quality and uniformity of the collected information. The process involved simple cleaning procedures, such as the removal of website links starting with ``http/https'' and sensitive user information initiated with ``@''. Using the Python re library, we implemented regular expressions to identify and replace these unwanted elements. Emoticons, such as ``[开心] '' and ``[悲伤] '', were retained during this stage, as they contribute to expressing emotions and enriching the text's expressiveness. The preprocessed data was then overwritten onto the original Excel files, preserving the initial format and structure.

  • 3.

    Upon completing the preprocessing phase, the dataset underwent manual annotation, wherein each Weibo comment was systematically assigned a sentiment label to signify its emotional orientation. The sentiment labels employed comprised three categories: positive, negative, and neutral. The assessment of the emotional stance of Weibo comments was based on various factors such as tone, vocabulary, and the presence of emoticons. Specifically, comments expressing positive emotions like agreement, support, or satisfaction were designated as positive, while those conveying negative sentiments such as opposition, criticism, or dissatisfaction were labeled as negative. Instances where comments did not overtly express emotions or exhibited unclear emotional ambiguity were categorized as neutral. The labeling procedure was initially completed by the researcher responsible for gathering information about the current domain, followed by meticulous cross-checking and validation by another researcher to ensure accuracy and consistency of the sentiment labels. These researchers were all second-year or third-year students pursuing their master's degrees, and all of their research areas were also in natural language processing, that ensured the professionalism and accuracy of the data collection and data labeling. Because of the complexity of the Chinese language, about 10% of the data was ambiguous during the labeling process, which required active communication and collaboration between the two researchers to resolve.

Fig. 3.

Fig. 3

Dataset construction process.

Concluding these sequential steps, the finalized dataset emerged, poised for application in sentiment analysis tasks within the domain of natural language processing. This curated dataset serves as a systematically organized and structured emotional data resource, encapsulating Weibo users' opinions and sentiments. The dataset, now publicly accessible on an open platform, facilitates researchers in selectively utilizing the entirety or specific subsets of the data tailored to their research content for sentiment analysis endeavors.

Limitations

The data collection methodology employed in constructing this dataset was limited to categorizing sentiments into three broad classes: positive, negative, and neutral. Notably, it did not extend to encompassing multi-class classification, which would involve discerning specific emotions such as joy, anger, grief, and happiness within the comments. This apparent limitation suggests an avenue for future research to delve deeper into a more nuanced sentiment analysis framework.

Furthermore, the dataset lacks explicit sentiment orientation labels for each individual Topic. Researchers engaging in the complex task of sentiment analysis with this dataset are required to independently assess and determine the sentiment orientation of each Topic. This omission poses an additional layer of complexity, as the absence of predefined sentiment orientation adds an element of subjectivity to the interpretation of results. Future efforts may consider incorporating this aspect into dataset augmentation, providing a more comprehensive resource for sentiment analysis tasks.

Ethics Statement

The collected data has been fully anonymous, and the data redistribution policies of social media platforms have been complied with [7].

CRediT authorship contribution statement

Zhongliang Wei: Conceptualization, Methodology, Data curation, Visualization, Writing – original draft, Writing – review & editing. Shunxiang Zhang: Validation, Supervision, Writing – review & editing.

Acknowledgements

Funding: This work was supported by the Natural Science Research Project of Anhui Educational Committee (grant number: KJ2021A0449), and The University Synergy Innovation Program of Anhui Province (grant number: GXXT-2021-008). In addition, the authors would like to express their gratitude to the graduate students who participated in the collection, preprocessing and labelling of the data.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability

References

  • 1.Sina Weibo, Weibo Reports Third Quarter 2023 Unaudited Financial Results. https://www.prnewswire.com/news-releases/weibo-reports-third-quarter-2023-unaudited-financial-results-301982934.html. (Accessed 12 February 2024).
  • 2.Zhang S.X., Wei Z.L., Wang Y., Liao T. Sentiment analysis of Chinese micro-blog text based on extended sentiment dictionary. Fut. Gener. Comput. Syst. 2018;81:395–403. doi: 10.1016/j.future.2017.09.048. [DOI] [Google Scholar]
  • 3.Zhang S.X., Hu Z.Y., Zhu G.L., et al. Sentiment classification model for Chinese micro-blog comments based on key sentences extraction. Soft Comput. 2021;25:463–476. doi: 10.1007/s00500-020-05160-8. [DOI] [Google Scholar]
  • 4.Zhang S.X., Yu H.B., Zhu G.L. An emotional classification method of Chinese short comment text based on ELECTRA. Conn. Sci. 2022;34(1):254–273. doi: 10.1080/09540091.2021.1985968. [DOI] [Google Scholar]
  • 5.Wei Z.L., Liu W.J., Zhu G.L., Zhang S.X., Hsieh M.-Y. Sentiment classification of Chinese Weibo based on extended sentiment dictionary and organisational structure of comments. Conn. Sci. 2022;34(1):409–428. doi: 10.1080/09540091.2021.2006146. [DOI] [Google Scholar]
  • 6.Liu K., Hai M. Rumor detection of Covid-19 related microblogs on Sina Weibo. Procedia Comput. Sci. 2023;221:386–393. doi: 10.1016/j.procs.2023.07.052. [DOI] [Google Scholar]
  • 7.Sina Weibo, Weibo Online Service Agreement. https://open.weibo.com/wiki/. (Accessed 10 December 2023).

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement


Articles from Data in Brief are provided here courtesy of Elsevier

RESOURCES