Role of twitter user profile features in retweet prediction for big data streams

Saurabh Sharma; Vishal Gupta

doi:10.1007/s11042-022-12815-1

2022 Mar 26;81(19):27309–27338. doi: 10.1007/s11042-022-12815-1

PMCID: PMC8960086 PMID: 35368857

Abstract

The study of the factors that influence information sharing on Twitter is a very active research area. This paper explores the impact of numerical features extracted from user profiles on retweet prediction from the real-time raw feed of tweets. The originality of this work lies in the fact that the proposed model is based on simple numerical features with low computational complexity, which makes it a scalable solution for big data analysis. The work proposes three new features derived from the tweet author's profile to capture the unique behavioral pattern of the user, namely “Author total activity”, “Author total activity per year”, and “Author tweets per year”. The feature set is tested on a dataset of 100 million random tweets collected through the Twitter API. Regression on binary labels gave an accuracy of 0.98 for user-profile features and 0.99 when these were combined with tweet content features. The regression analysis to predict the retweet count gave an R-squared value of 0.98 with combined features. The multi-label classification gave an accuracy of 0.9 for combined features and 0.89 for user-profile features. The user-profile features performed better than the tweet content features, and the combination of the two performed better still. The model is suitable for near real-time analysis of live streaming data coming through the Twitter API and provides a baseline pattern of user behavior based only on numerical features available from user profiles.

Keywords: Twitter, Social media analysis, Retweet prediction, User behavior, User profiling, Big data analysis

Introduction

Today we live in a world where people actively participate in online platforms for social interaction. In one form or another, online social networks are part of our daily lives. Social media platforms provide different types of services, ranging from sharing personal views, collaborating with others, spreading information of interest, and exploring new ideas to discussing real-life events and participating in evolving communities. Every social media network has its own purpose: for example, Facebook is primarily used to connect with family and friends, LinkedIn is used to connect with people from professional circles, Instagram is used to share multimedia content, Pinterest is used to explore interesting pins of others, and Tumblr is used to find and follow blogs from various categories [4, 49].

Over the last 10 years, research on social media analysis has grown, with studies ranging from ROI for organizations, prediction of real-life changes influenced by social media, and descriptive analysis of real-life events as discussed on online platforms [10] to viral marketing, social issues, health issues, natural disasters, emergencies, online surveys, countering fake information, detecting cyberbullying and abusive language, e-learning, online monitoring, etc.

In social media analysis research, Twitter is a very popular choice among researchers because its data can be accessed simply through an API. The raw feed from the Twitter API is rich in information, in terms of both tweet content features and user profile features. The real potential of API data is that it can be used for real-time analysis and also for batch processing of huge amounts of data [4, 37].

Motivation

In the past few years, noteworthy efforts have been made to study the activities of online users and to understand their behavioral patterns in various research domains [29, 35, 36, 38, 44]. Online activities make every user distinct from other users, and these differences become visible as strong patterns over time. The behavior pattern, or signature style, of a user is very useful in authentication, identification, and access control applications [19].

User centric approach

The proposed work is an attempt to predict retweets from the point of view of a single user. A Twitter user is the only entity that takes action of its own free will. As shown in Fig. 1, a Twitter user receives a huge amount of information from various sources. This information overload has a deep impact on user actions, because a user cannot consume information at the pace at which it arrives. This leads to a situation where a user may take either an active action or a passive action on the current piece of information. All actions in which a user generates new content are active actions, and all actions in which a user does not generate new content are passive actions. Retweeting is a passive action because, without adding any new information, a user lets the existing information flow towards their followers in the network.

Fig. 1. Twitter user as information processing node

These actions form the basis of user behavior, and all of them are recorded in the user profile. For example, a user profile contains information about how many tweets a user has posted (active action) and how many tweets the user has marked as favorites (passive action). The number of tweets retweeted by a user is not recorded in the profile; hence, the research problem is to predict retweets by analyzing all the other actions performed by the user since the account was created.

User data and twitter dataset

Reproducing a Twitter dataset is a major issue for user behavior analysis. Public datasets release only tweet content features, or sometimes just TweetIDs. Hydrating a dataset from TweetIDs after 4 years results in the loss of 30% of the dataset [48]. The terms and conditions of the Twitter API do not allow fetching user profiles from TweetIDs. The proposed work is an attempt to provide an alternative way to handle this problem by using public Twitter archives [24, 40].

Fig. 2 shows three layers of features that can be used for retweet prediction. The first layer consists of user features, which are available with every tweet collected using the API. The numerical features can be used as they are, and some further features can be computed with basic mathematical operations. The second layer, tweet content features, is partially available in the API, and more features, such as NLP features, can be created using complex algorithms. The third layer of features is not directly available in the random feed of tweets. These features must be generated using various methods of data collection, complex algorithms, and different assumptions about the structure of the network. Recent studies have used various combinations of features from all three layers. However, those methods are not reproducible because user information cannot be shared publicly.

User profiles are the most significant part of user behavior analysis and are easily available with every tweet coming from the random feed. Zubiaga et al. [48] found that the most common method of data collection from Twitter is the Twitter streaming API. Using the limited features available in the Twitter API can be one way to produce domain-independent, language-independent, general-purpose analysis on very large datasets. Recent studies [24, 48] have found that concerns about user privacy and restrictions imposed by social media companies on the distribution and sharing of datasets make it very difficult to reproduce the same dataset for social media analysis.

A study [40] comparing Twitter datasets and Twitter archives suggested that freely available archives should be used as an alternative way to reproduce and distribute datasets. The available archives are collections of the live feed of random tweets captured using the Twitter API. Each tweet contains all data fields available in the API as a JSON document. The significance of using archives is that they contain the full user profile along with tweet content features.
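
As an illustration of how the numerical user-profile fields could be read from one archived tweet, the following is a minimal sketch. It assumes that each archive line is a JSON tweet object in the Twitter v1.1 layout (a `user` object carrying `followers_count`, `friends_count`, `statuses_count`, `favourites_count`, `listed_count`, `verified`, and `created_at`). The sample record and the function name `extract_user_features` are hypothetical and are not taken from the paper.

```python
import json
from datetime import datetime, timezone

# Hypothetical sample record; field names follow the Twitter v1.1 tweet object.
SAMPLE_LINE = json.dumps({
    "id_str": "1",
    "text": "hello world",
    "retweet_count": 3,
    "user": {
        "created_at": "Wed Aug 01 00:00:00 +0000 2012",
        "followers_count": 120,
        "friends_count": 80,
        "statuses_count": 6000,
        "favourites_count": 3000,
        "listed_count": 2,
        "verified": False,
    },
})

NUMERIC_USER_FIELDS = (
    "followers_count", "friends_count", "statuses_count",
    "favourites_count", "listed_count",
)


def extract_user_features(line: str) -> dict:
    """Return the numerical profile fields of the tweet author."""
    tweet = json.loads(line)
    user = tweet["user"]
    # Missing or null counters are treated as zero (an assumption).
    features = {name: int(user.get(name) or 0) for name in NUMERIC_USER_FIELDS}
    features["verified"] = int(bool(user.get("verified")))
    # v1.1 timestamps look like "Wed Aug 01 00:00:00 +0000 2012".
    created = datetime.strptime(user["created_at"], "%a %b %d %H:%M:%S %z %Y")
    features["account_created"] = created.astimezone(timezone.utc)
    return features


print(extract_user_features(SAMPLE_LINE))
```

Only dictionary lookups and one timestamp parse are involved, which is in line with the low-computation argument made above.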

Significance of proposed work

Objectives:

To provide a baseline pattern of retweet prediction (using 100 million random tweets) for a domain-independent data feed with a minimum feature set and low computational requirements.
To propose a method for user behavior research that is reproducible and scalable, and that uses a public dataset without violating the terms and conditions of the Twitter API.
To reduce the complexity of social media analysis for big data streams using basic numerical features.
To predict retweets for every random user, irrespective of whether the user is a normal user or an influencer/celebrity.

Conditions:

The dataset contains a random feed without any specific domain, topic, or other conditions.
The proposed feature set is created from features available in Twitter streaming API only.
The dataset, containing full user profiles, is freely available for research.
The feature set includes only numerical values for fast processing and to reduce the computational complexity of text features.

Outcomes:

The user profile features performed better than tweet content features for retweet prediction.
The basic numerical features are very useful for real time user behavior analysis.
The proposed feature set requires no preprocessing, which makes it fast and scalable for processing big data streams.
The proposed feature set has shown promising results for regression and classification algorithms.
The proposed work can make predictions for every user profile, whether influential or normal.

The rest of the article is organized as follows. Related work on retweet prediction is discussed in Section 2. Section 3 describes the methodology of the study. The evaluation of the proposed work using machine learning algorithms is presented in Section 4. Section 5 presents the conclusions and the future scope of this study.

Related works

To understand user behavior, one interesting research question is why a user shares only a few tweets within the network and not all of them. A probable reason is information overload: it is practically impossible for a user to keep sharing every incoming tweet. Hemsley [25] found that approximately 47% of tweets did not get retweets [14]. This presents an opportunity to study and analyze various factors of user actions to predict information-sharing behavior.

Recent studies on information sharing have proposed various methods to answer these questions. Studies focused on the content of tweets have used sentiment analysis, location-based features, NLP techniques, hashtags (#), cashtags ($), URLs, and various text-based statistical features [10, 22, 26, 45]. Text-based approaches demand heavy computational resources and, in some cases, all past tweets of the user [10, 23, 27, 43, 47]. The tradeoff between accuracy and computational resources is the bottleneck in scaling up to big data analysis and real-time analysis of live data streams.

Graph-based approaches are commonly limited to well-defined network boundaries and some static assumptions about the growth of the network [8]. In practice, replicating these studies is a major computational challenge, and it is very difficult to reproduce the same accuracy every time because the network structure evolves.

Retweet cascade techniques need data for the first k retweets or a first 5–10 min window of temporal features for retweet prediction. The problem with this method is that the timestamp and user profile of each retweeter are needed to create a retweet cascade for every single tweet. These approaches are not useful for live feed data, because it is not possible to monitor every single tweet for its upcoming retweets before starting to predict [14, 18, 28, 31, 46, 47].

Retweet prediction is a very popular way of understanding the dynamics of information sharing on Twitter. In recent years, various combinations of features have been proposed for more accurate retweet prediction. The features range from simple statistical features to more complex features, including language-specific NLP features, network structure and centrality-based features, temporal features consisting of the first n retweets, etc. There are three main questions for understanding information sharing on Twitter. The first question is which tweet will get retweets, and why. The second question is what significance the network structure and the position of a user in the network have for successful information diffusion. The third question is which user will retweet a tweet, and why. Answering these questions requires information about the tweet content, the network structure, the user profile of the author of the source tweet, and the user profiles of the users who will retweet it further.

Hemsley [25] used network structure features to predict the extent of information sharing for political messages and found that users with a medium-sized network are more successful in spreading political information than influential users with a large network. Dinh & Parulian [15] used a cascade model for retweets, quote tweets, and reply tweets among COVID-related tweets. They found that the average cascade length is 4 h for retweets, 3 days for quote tweets, and 2 days for reply tweets. This pattern indicates that the active actions of users, in the form of quotes and replies, have more impact than the passive action of retweeting. Chen et al. [10] studied information sharing in the domain of disaster-related tweets using NLP and network features and found that tweets with neutral and positive sentiment had a larger reach than negative information. The opposite holds for political messages. Interestingly, they also found that if negative information gets a few retweets, it then gets more responses than positive posts. Panic and worry about the disaster lead users to share negative information more rapidly.

For handling big data streams, recent studies have proposed some very promising solutions. Murshed et al. [34] proposed a model to calculate the overall accuracy of a Twitter dataset using three different methods. Atish’s measures outperformed other methods. They found that language issues related to spelling, grammar, and an unstructured style of writing make it very challenging to achieve a higher level of accuracy. Singh et al. [42] proposed a framework for processing big data using a machine learning approach. The proposed framework showcased fast processing using distributed computing and the ability to scale the performance of machine learning algorithms. Clustering an incoming data stream is very difficult for standard machine learning algorithms. Arpaci et al. [5] proposed evolutionary clustering for Twitter streams of COVID-related tweets, using a dataset of 43 M+ tweets. Duan et al. [16] proposed an algorithm, SELM (Spark Extreme Learning Machine), for multi-classification of big data using an Apache Spark cluster. The proposed algorithm performed better and achieved a higher speedup than traditional ELM (Extreme Learning Machine) algorithms.

Information sharing can be analyzed from three different points of view. The first view [10, 14, 20, 26] is to predict whether a tweet will get a retweet or not. The second view [35, 36, 38, 44] asks why the tweets of some users get more retweets than the tweets of other users. The third view [18, 31, 46, 47] is to predict which user will retweet a post, and why. To answer these questions, many recent studies have proposed a large number of new features and claimed better results. However, every study is unique in terms of its dataset, domain, set of assumptions, manually coded features, and nature of findings. These studies are therefore difficult to replicate for domain-independent, real-time analysis with a standard feature set.

A brief summary of related work categorized by feature set used is given in Table 1.

Table 1.

Brief summary of related work

Research Work	Year	Dataset	Features Used	Topic
BPF A Unified Factorization model for predicting retweet behaviors [47]	2020	Sina weibo Dataset 16,002,390 microblogs 7,982,752 users	Network features, NLP features, Tweet cascades	Random
Composing tweets to increase retweets [26]	2019	Twitter Dataset a subset of a large corpus of about 1.77 million topic-author controlled tweets	NLP features, User profile, Tweet cascades	Random
COVID-19 pandemic and information diffusion analysis on Twitter [15]	2020	Twitter Dataset, 675,228 tweets	Network features	COVID-19
Crowd or Hubs information diffusion patterns in online social networks in disasters [18]	2020	Twitter Dataset 14 million tweets	NLP features, Tweet cascades	Hurricane Harvey
Followers Retweet The Influence of Middle-Level Gatekeepers on the Spread of Political Information on Twitter [25]	2019	Twitter Datasets 20,580 tweets, 755,957 tweets	Tweet cascades	Random
HawkesEye Detecting Fake Retweeters Using Hawkes Process and Topic Modeling [17]	2020	Twitter Dataset 30,000 tweet objects, 2,508 retweeters	NLP features, Manually coded User profile features	Random
Popularity Prediction for Single Tweet based on Heterogeneous Bass Model [22]	2020	Twitter Dataset 2,516,440 tweets 2,122,135 users	NLP features, User Profile features	Random
Predicting Rumor Retweeting Behavior of Social Media Users in Public Emergencies [46]	2020	Sina weibo Datasets historical tweets 1: 284,238 historical tweets 2: 203,523	NLP features, Tweet cascades	Public emergencies
Predicting User Retweeting Behavior in Social Networks With a Novel Ensemble Learning Approach [7]	2020	Sina weibo Dataset 762,936 microblogs published by 68,817 users	NLP features, Network features, User Profile features	COVID-19
Prediction of Likes and Retweets Using Text Information Retrieval [13]	2020	Twitter Dataset 2 million Tweets,	NLP features	Data science
R-Map A Map Metaphor for Visualizing Information Reposting Process in Social Media [9]	2019	Sina weibo Dataset	Network features, NLP features, Tweet cascades	Random
Temporal Sequence of Retweets Help to Detect Influential Nodes in Social Networks [6]	2019	Twitter Datasets 1,244,645 Tweets, 763,109 Tweets	Network features, Tweet cascades	Random
Uncovering sentiment and retweet patterns of disaster-related tweets from a spatiotemporal perspective – A case study of Hurricane Harvey [10]	2020	Twitter Dataset, 7,041,866 tweets	NLP features	Hurricane Harvey

Sr. No.	Feature	Description	Whether the Feature can be computed from Twitter API	Recent Studies
1.	Total hashtag	Count of hashtags in a tweet	Yes	[29, 35, 36, 41, 43, 46]
2.	Total link	Count of links in a tweet	Yes	[29, 33, 35, 43, 44]
3.	Total mention	Count of user mentions in a tweet	Yes	[35, 36, 41]
4.	Total retweet	Count of retweets received by a tweet	Yes	[29, 33, 36, 43]
5.	Is marked Favourite	Check if the favorite count is zero or not	Yes	[10, 36]
6.	Publication time of Tweet	Timestamp in 24 h format as per local time zone of a user account	Yes	[35]
7.	Original tweet or retweet	Check if a tweet is an original post or a retweet	Yes	[36]
8.	Tweet text	The textual content of a tweet post in UTF-8 format	Yes	[36, 43]
9.	URLs	Hypertext of URL posted in a tweet	Yes	[36]
10.	Tweet ID	Unique ID of a tweet	Yes	[36]
11.	Word Count	Total number of words in the text of a tweet	Yes	[43, 44]
12.	Character Count	Total number of characters in the text of a tweet	Yes	[26]
13.	Symbols and acronyms Count	Total number of symbols and acronyms in the text of a tweet	Yes	[26]
14.	Punctuations	Total number of punctuations in a tweet	Yes	[26]
15.	Creation Time of Tweet	The timestamp of posting a tweet	Yes	[10, 26, 29, 43]
16.	Hashtag ratio	The ratio of all hashtags to all tweets	No	[48]
17.	Link ratio	The ratio of all links to all tweets.	No	[48]
18.	Mention ratio	The ratio of all mention to all tweets.	No	[48]
19.	Retweet ratio	The ratio of all retweets to all tweets.	No	[48]
20.	Total likes count	Count of all tweets liked	No	[1, 35]
21.	Tweet similarity	The similarity of tweet text using cosine similarity.	No	[48]
22.	Unique URL ratio	The ratio of unique URLs posted to total tweets.	No	[48]
23.	Duplicate tweet count	Count of tweets posted as duplicate	No	[48]
24.	Unique hashtag	Count of unique hashtags used in all tweets.	No	[41]
25.	Unique mention	Count of unique mentions in all tweets.	No	[41]
26.	Maximum frequency of hashtag	Hashtag with maximum frequency in all tweets.	No	[41]
27.	Average frequency of hashtag	Mean value of hashtags used in all tweets.	No	[41]
28.	Average frequency of mention	Mean value of mentions used in all tweets.	No	[1]
29.	Average frequency of URLs	Mean value of URLs posted in all tweets.	No	[41]
30.	Deviation of hashtag	Hashtags population deviation in all tweets.	No	[1]
31.	Deviation of link	Links population deviation in all tweets.	No	[1]
32.	Deviation of mention	Mentions population deviation in all tweets.	No	[1]
33.	Deviation of re-tweet	Retweets population deviation in all tweets.	No	[1]
34.	Deviation of tweet length	Tweet length population deviation in all tweets.	No	[1]
35.	Deviation of hashtag position aggregate	Population deviation of hashtag position aggregate.	No	[1]
36.	Deviation of link position aggregate	Link position population deviation aggregate.	No	[1]
37.	Deviation of mention position aggregate	Mention position population deviation aggregate.	No	[1]
38.	Average daily tweet	The ratio of all tweets to count of days between first and last tweet.	No	[1]
39.	Average tweet length	Mean value of the lengths of all tweets.	No	[1]
40.	Average sentiment polarity	Mean value of the polarity of sentiment for every posted tweet.	No	[1, 12]
41.	Average sentiment subjectivity	Mean of sentiment subjectivity for every posted tweet.	No	[1, 2]
42.	Average TF-IDF score	Mean value of TF-IDF weight of the tweets.	No	[1]
43.	Popularity ratio	The ratio of the favourites count plus re-tweet count to the number of all tweet count.	No	[1]
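
To make the distinction between the two groups of rows in the table above concrete, the sketch below computes a few per-tweet content counts from the text of a single tweet; by contrast, the ratio and deviation features marked "No" need the author's full tweet history. The regular expressions and the function name `content_features` are illustrative assumptions. The tweet object of the API also lists hashtags, URLs, and mentions in its `entities` field, which would usually be the more reliable source.

```python
import re

# Simple patterns for illustration only; real tweets may need stricter rules.
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")


def content_features(text: str) -> dict:
    """Compute basic numerical content features from one tweet's text."""
    words = text.split()
    chars = len(text)
    return {
        "char_count": chars,
        "word_count": len(words),
        # Guard against empty text to avoid division by zero.
        "word_to_char_ratio": len(words) / chars if chars else 0.0,
        "hashtags_count": len(HASHTAG.findall(text)),
        "mentions_count": len(MENTION.findall(text)),
        "urls_count": len(URL.findall(text)),
    }


example = "Great talk by @alice on #bigdata https://example.com/x"
print(content_features(example))
```

Every value depends on one tweet only, so the features can be computed as each tweet arrives in the stream.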

Sr. No.	Feature	Description	Whether the Feature can be computed from Twitter API	Recent Studies
1.	Author total Activity	Sum of all the tweets posted by a user and all the tweets liked by a user	Yes	Proposed
2.	Author total Activity per year	Sum of all the tweets posted by a user and all the tweets liked by a user, divided by user account age in years	Yes	Proposed
3.	Author tweets per year	Count of all the tweets posted by a user divided by user account age in years	Yes	Proposed
4.	Screen name length	Count of characters in the screen name of a user.	Yes	[30]
5.	User location	If user location is mentioned or not.	Yes	[1, 10, 36]
6.	Age in days (Creation date of User Account)	The number of days since User Account created.	Yes	[35, 36, 44, 49]
7.	Followers count	Followers count of the user.	Yes	[26, 35, 36, 43, 44, 48]
8.	Friends count	Friends count of the user.	Yes	[26, 33, 35, 36, 43, 44]
9.	Statuses count	Number of statuses posted by a user	Yes	[1, 26, 35, 43, 44, 48]
10.	Favorites count	Count of tweets a user has marked as favorite.	Yes	[33, 48]
11.	User description	Check If the user description is provided or left blank.	Yes	[2]
12.	Account verified	Check if the user account is marked as verified or not.	Yes	[11]
13.	Default profile image	Check if the profile image is default or changed by the user.	Yes	[3]
14.	Listed count	Count of lists where the user account is listed.	Yes	[33, 35]
15.	Account reputation	Normalized ratio of user followers to user friends.	Yes	[41]
16.	Follower following ratio	The ratio of the count of user followers to user friends.	Yes	[48]
17.	Following follower ratio	The ratio of the count of user friends to user followers.	Yes	[49]
18.	User ID	Unique ID of the User	Yes	[36]
19.	User Name	Display name of the user account	Yes	[36]
20.	Profile URL	Check if profile URL is provided or not	Yes	[1]
21.	Default profile	Check if the profile theme is default or changed by the user.	Yes	[1]
22.	User Time zone	Check if user time zone is present or not.	Yes	[1]
23.	Geo-enabled	Check if geotagging is enabled or not by the user.	Yes	[1]
24.	Tweet text of all past Tweets	Collection of Text of all posted tweets	No	[22]
25.	Sentiment Score	Sentiment score based on tweet text	No	[10, 22]
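
The three proposed features in rows 1 to 3 of the table above follow directly from their descriptions. The sketch below is one possible reading and not the authors' code. It assumes that "tweets posted" maps to the profile field `statuses_count` and "tweets liked" to `favourites_count`, that the account age is measured up to the time the tweet was observed, that a year is 365.25 days, and that accounts younger than one day are treated as one day old to avoid division by zero.

```python
from datetime import datetime, timezone

DAYS_PER_YEAR = 365.25  # assumption: average year length


def proposed_author_features(statuses_count: int,
                             favourites_count: int,
                             created_at: datetime,
                             observed_at: datetime) -> dict:
    """Compute the three proposed author features from profile counters."""
    age_days = max((observed_at - created_at).days, 1)  # floor at one day
    age_years = age_days / DAYS_PER_YEAR
    total_activity = statuses_count + favourites_count
    return {
        "author_total_activity": total_activity,
        "author_total_activity_per_year": total_activity / age_years,
        "author_tweets_per_year": statuses_count / age_years,
    }


features = proposed_author_features(
    statuses_count=6000,
    favourites_count=2000,
    created_at=datetime(2014, 1, 1, tzinfo=timezone.utc),
    observed_at=datetime(2018, 1, 1, tzinfo=timezone.utc),
)
print(features)
```

In this hypothetical example the account is four years old, so 8000 total actions correspond to 2000 actions per year and 6000 tweets correspond to 1500 tweets per year.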

Features	Count	Mean	std	Min	Max	Skewness	Kurtosis
Tweet char count	101,681,675	85.24	44.51	1	494	−0.08	−1.40
Tweet emojis count	101,681,675	0.46	1.95	0	140	20.97	833.89
Tweet word count	101,681,675	11.32	7.97	1	70	0.54	−0.78
Tweet emojis to char ratio	101,681,675	0.01	0.05	0	1	17.84	362.05
Tweet word to char ratio	101,681,675	0.08	0.08	0	1	1.68	13.16
Hashtags count	101,681,675	0.33	1.03	0	46	5.01	35.18
Urls count	101,681,675	0.18	0.41	0	5	2.05	4.01
User mentions count	101,681,675	0.91	0.95	0	28	4.08	30.33
Is quoted	101,681,675	−0.85	0.52	0	1	3.24	8.52
Is reply	101,681,675	−0.63	0.77	0	1	1.63	0.67
Author favorites count	101,681,675	8170.77	28,736.30	0	2,792,266	6.23	57.57
Author followers count	101,681,675	322,862.39	2,803,970.00	0	106,873,281	16.92	347.31
Author friends count	101,681,675	3178.34	30,755.90	0	4,710,009	19.22	621.58
Author Tweets count	101,681,675	28,310.74	424,569.00	0	27,837,830	11.24	259.67
Author total activity	101,681,675	36,481.99	426,640.00	0	27,838,020	5.85	63.66
Author total Activity per year	101,681,675	6636.31	50,133.70	0	5,508,960	5.85	63.66
Author tweets per year	101,681,675	4705.38	49,218.70	0	5,508,960	11.24	259.67
Retweet Count (as Label)	101,681,675	2341.21	18,934.10	0	3,614,140	23.49	1473.16
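
The skewness and kurtosis columns above can be read with the help of a small worked example. Because several kurtosis values are negative, the table must report excess kurtosis (a normal distribution scores 0). The source does not state which estimator was used, so the sketch below uses the plain population-moment definitions as an assumption; the function name `describe` and the sample data are hypothetical.

```python
from math import sqrt


def describe(values):
    """Return (mean, std, skewness, excess kurtosis) from population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        # A constant column has no shape; report zeros by convention.
        return mean, 0.0, 0.0, 0.0
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return mean, sqrt(m2), skewness, excess_kurtosis


# Hypothetical counts: most tweets get nothing, one gets many retweets.
print(describe([0, 0, 0, 0, 10]))
```

A long right tail, as in this toy sample, pushes the skewness above zero, which is the pattern the retweet count label shows in the table.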

RQ 1: Binary Prediction	RQ 2: Regression Analysis	RQ 3: Classification
Precision	R-squared	Precision
Recall	Mean Square Error	Recall
F1-Measure	Root mean Square Error	F1-Measure
Log loss	Mean Absolute Error
AUC	Median Absolute Error
Accuracy
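
For reference, the binary prediction metrics listed for RQ 1 can be computed from true labels and predicted probabilities as in the sketch below (AUC is omitted for brevity). These are the standard textbook definitions and not code from the paper. The 0.5 decision threshold, the function name `binary_metrics`, and the sample values are assumptions for illustration.

```python
from math import log


def binary_metrics(y_true, y_prob, threshold=0.5, eps=1e-15):
    """Precision, recall, F1, accuracy and log loss for binary labels."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    # Clip probabilities so that log(0) never occurs.
    clipped = [min(max(p, eps), 1 - eps) for p in y_prob]
    log_loss = -sum(t * log(p) + (1 - t) * log(1 - p)
                    for t, p in zip(y_true, clipped)) / len(y_true)
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "log_loss": log_loss}


# Hypothetical labels (1 = retweeted) and predicted probabilities.
print(binary_metrics([1, 1, 0, 0], [0.9, 0.4, 0.2, 0.6]))
```

Log loss uses the predicted probabilities rather than the thresholded labels, so it penalizes confident wrong predictions more heavily than accuracy does.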

Logistic Regression
	Tweet Feature	Author Feature	Proposed Combined Features
Log loss	0.13111	0.06666	0.03719
AUC	0.98531	1	0.99937
Accuracy	0.97	0.98	0.99
Logistic Model Tree
	Tweet Feature	Author Feature	Proposed Combined Features
Log loss	0.12247	0.00045	0.00126
AUC	0.98579	1	1
Accuracy	0.97	1	1

	Random Forest Regression
	Tweet Content	Author Features	Proposed Combined Features
R-squared	0.6701	0.8162	0.9824
Mean Square Error	447,262,685.9598	240,524,382.1370	22,716,410.1733
Root mean Square Error	21,148.5859	15,508.8485	4766.1735
Mean Absolute Error	4540.3229	2121.2072	465.9543
Median Absolute Error	46.4760	8.2595	6.5000
	Decision Tree Regression
	Tweet Content	Author Features	Proposed Combined Features
R-squared	0.6776	0.8005	0.9756
Mean Square Error	417,917,933.0894	256,786,293.8943	31,564,528.4161
Root mean Square Error	20,443.0412	16,024.5528	5618.2318
Mean Absolute Error	4468.0276	2139.7003	435.7641
Median Absolute Error	31.7551	1.0000	1.0000
	Gradient Boosted Regression
	Tweet Content	Author Features	Proposed Combined Features
R-squared	0.239	0.699	0.745
Mean Square Error	1,000,375,856.76	405,327,941.74	326,527,452.02
Root mean Square Error	31,628.718	20,132.75	18,070.07
Mean Absolute Error	8533.43	4743.67	4582.83
Median Absolute Error	3261.74	834.79	859.04
	Support Vector Regression
	Tweet Content	Author Features	Proposed Combined Features
R-squared	0.029	0.021	0.0257
Mean Square Error	1,271,439,879.27	1,386,167,002.3	1,417,650,832.75
Root mean Square Error	35,657.25	37,231.26	37,651.7
Mean Absolute Error	6137.07	6535.5	6584.76
Median Absolute Error	29.13	5.08	25.61
	Bayesian Ridge Regression
	Tweet Content	Author Features	Proposed Combined Features
R-squared	0.024	0.346	0.36
Mean Square Error	1,290,683,865.92	860,205,619.56	850,860,029.43
Root mean Square Error	35,926.08	29,329.26	29,169.5
Mean Absolute Error	11,324.58	7611.37	7961.72
Median Absolute Error	6295.9	2529.78	3041.12
	Stochastic Gradient Descent Regression
	Tweet Content	Author Features	Proposed Combined Features
R-squared	0.025	0.34	0.35
Mean Square Error	1,272,648,394.91	851,245,028.42	838,362,217.15
Root mean Square Error	35,674.19	29,176.1	28,954.48
Mean Absolute Error	11,309.67	7873.71	8166.94
Median Absolute Error	6297.14	3041.28	3142.62
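
The regression metrics reported above are linked by simple definitions: the root mean square error is the square root of the mean square error (for example, the square root of 22,716,410.1733 is approximately 4766.17, matching the Random Forest row for combined features). The sketch below gives the standard definitions. The function name `regression_metrics` and the sample values are hypothetical, chosen to show how a single large error separates the mean and median absolute errors.

```python
from math import sqrt
from statistics import median


def regression_metrics(y_true, y_pred):
    """R-squared, MSE, RMSE, MAE and median absolute error."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        # R-squared is undefined for a constant target; report 0.0 then.
        "r2": 1.0 - ss_res / ss_tot if ss_tot else 0.0,
        "mse": mse,
        "rmse": sqrt(mse),
        "mae": sum(abs_errors) / n,
        "medae": median(abs_errors),
    }


# Hypothetical retweet counts with one viral tweet.
print(regression_metrics([0, 10, 20, 1000], [0, 12, 18, 900]))
```

In this toy case the mean absolute error is 26 while the median absolute error is 2. A gap of this kind is consistent with the tables above, where the heavily skewed retweet counts produce mean absolute errors far larger than the median absolute errors.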

Number of Bins = 2
	Tweet Feature			Author Feature			Proposed Combined Features
	Precision (P)	Recall (R)	F1-score	Precision (P)	Recall (R)	F1-score	Precision (P)	Recall (R)	F1-score
Accuracy			0.98			1			1
macro average	0.98	0.98	0.98	1	1	1	1	1	1
weighted average	0.98	0.98	0.98	1	1	1	1	1	1
Number of Bins = 3
	Tweet Feature			Author Feature			Proposed Combined Features
	P	R	F1	P	R	F1	P	R	F1
Accuracy			0.87			0.89			0.9
macro average	0.77	0.67	0.66	0.6	0.67	0.63	0.84	0.71	0.71
weighted average	0.85	0.87	0.83	0.8	0.89	0.84	0.89	0.9	0.87
Number of Bins = 4
	Tweet Feature			Author Feature			Proposed Combined Features
	P	R	F1	P	R	F1	P	R	F1
Accuracy			0.77			0.79			0.8
macro average	0.53	0.51	0.46	0.52	0.56	0.53	0.57	0.54	0.51
weighted average	0.71	0.77	0.7	0.72	0.79	0.75	0.74	0.8	0.74
Number of Bins = 5
	Tweet Feature			Author Feature			Proposed Combined Features
	P	R	F1	P	R	F1	P	R	F1
Accuracy			0.68			0.71			0.72
macro average	0.37	0.43	0.37	0.38	0.48	0.42	0.39	0.45	0.4
weighted average	0.61	0.68	0.62	0.62	0.71	0.66	0.63	0.72	0.65
Number of Bins = 6
	Tweet Feature			Author Feature			Proposed Combined Features
	P	R	F1	P	R	F1	P	R	F1
Accuracy			0.65			0.67			0.68
macro average	0.29	0.36	0.3	0.35	0.39	0.34	0.31	0.38	0.32
weighted average	0.58	0.65	0.59	0.62	0.67	0.62	0.59	0.68	0.61
Number of Bins = 7
	Tweet Feature			Author Feature			Proposed Combined Features
	P	R	F1	P	R	F1	P	R	F1
Accuracy			0.57			0.64			0.63
macro average	0.22	0.3	0.23	0.38	0.38	0.34	0.37	0.37	0.33
weighted average	0.53	0.57	0.53	0.63	0.64	0.61	0.63	0.63	0.6
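
The gap between the macro and weighted averages in the tables above (for example, a macro F1 of 0.66 against a weighted F1 of 0.83 for tweet features with three bins) is consistent with imbalanced classes, where the largest bin dominates the weighted average. The sketch below shows the two averaging rules. The per-class scores and class sizes are hypothetical and are not taken from the paper, and the function name `macro_and_weighted` is illustrative.

```python
def macro_and_weighted(per_class_scores, support):
    """Average per-class scores without and with class-size weights."""
    # Macro: every class counts equally, regardless of its size.
    macro = sum(per_class_scores) / len(per_class_scores)
    # Weighted: each class counts in proportion to its number of samples.
    total = sum(support)
    weighted = sum(s * n for s, n in zip(per_class_scores, support)) / total
    return macro, weighted


# Hypothetical three-bin case: one large, well-predicted bin and two small,
# poorly predicted bins.
f1_per_bin = [0.95, 0.60, 0.40]
samples_per_bin = [800, 150, 50]
print(macro_and_weighted(f1_per_bin, samples_per_bin))
```

With these illustrative numbers the macro average is 0.65 and the weighted average is 0.87, so a model can look strong on the weighted figure while still performing poorly on rare bins.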

Research Work	Year	Domain Independent	Independent of Network features	Independent of NLP features	Handling Big Data Streams	User Profile Features included	Independent of Historical Tweets or Cascade Retweets	Complexity of features used
Ensemble learning approach [7]	2020	Yes	No	No	No	Yes	Yes	High
A media synchronicity theory for effective communication during disasters [42]	2020	Yes	Yes	No	No	No	No	High
Covid 19 information diffusion model [15]	2020	No	No	Yes	No	No	No	High
Popularity prediction using heterogeneous Bass model [21]	2020	Yes	Yes	No	No	Yes	Yes	High
Hawkes process and topic modeling [17]	2020	No	Yes	No	No	No	No	High
Network analysis for predicting influence of user nodes [22]	2019	No	Yes	Yes	No	No	No	Low
Temporal features for influential node detection [6]	2019	Yes	No	Yes	No	No	No	Medium
Visualizing reposting process [9]	2019	Yes	No	No	No	No	No	High
Spatiotemporal features for sentiment and retweet analysis [10]	2020	No	Yes	No	No	No	Yes	High
Text information retrieval for retweet and like prediction [13]	2020	No	Yes	No	No	No	Yes	High
Analysis of user nodes for information diffusion patterns [18]	2020	No	Yes	No	No	No	No	High
unified factorization model for retweet prediction [43]	2020	Yes	No	No	Yes	Yes	No	High
Tweet content analysis for increased retweets [24]	2019	No	Yes	No	No	Yes	No	High
A user retweet prediction method for hot topics [14]	2021	No	No	No	No	No	No	High
COVID-19 pandemic machine learning to measure twitter users’ perceptions [23]	2021	No	Yes	No	No	No	No	High
Retweet Prediction based on Topic, Emotion and Personality [21]	2021	Yes	No	No	No	Yes	No	Medium
Value-Based Retweet Prediction on Twitter [29]	2021	Yes	Yes	No	No	Yes	No	Medium
Proposed Work	2022	Yes	Yes	Yes	Yes	Yes	Yes	Low

Dataset: The dataset of 100 million random tweets was created from the online Twitter archive of August 2018 [39, 40].
