Abstract
Social media has gained huge importance in our lives, and with it has come an enormous demand for high social popularity. With the emergence of many social media platforms and an overload of information, attaining high popularity requires efficient usage of hashtags, which can increase the reachability of a post. However, since many users have little awareness of how to choose appropriate hashtags, there is a need for an efficient system that recommends relevant hashtags, which in turn can enhance the social popularity of a post. In this paper, we propose a novel method, TRANSIT (hashTag RecommendAtion for eNhancing Social popularITy), to recommend context-relevant hashtags that enhance popularity. Our proposed method utilizes the trending nature of hashtags by using post keywords along with the popularity of users and posts. Since the prevalent evaluation techniques in this field are unreliable and non-uniform, we have devised a novel evaluation algorithm that is more robust and reliable. The experimental results show that our proposed method significantly outperforms the current state-of-the-art methods.
Keywords: Hashtag recommendation, Popularity prediction, Social media analysis, Information spread, Data mining
Introduction
Social media has become an integral part of our day-to-day lives. It is the new normal wherein one’s online presence and popularity in social media are construed as a major contributing factor to one’s social position. Especially after the COVID-19 pandemic, the number of users on social media has increased enormously due to the rise in digital interaction. For instance, the number of Instagram users crossed the one billion mark as of 2020 (Andujar 2020). Therefore, there is a growing demand for popularity, be it from individual users or from giant companies that use social media for advertisements, across social media platforms such as Twitter, Instagram, Facebook and LinkedIn. However, with 300 million posts (Baltaci and Ersoz 2022) being uploaded every day on these platforms, there is an immense information overload which makes it quite difficult for a single post to attain high popularity solely based on the post content. These platforms therefore offer a feature called hashtags that can help posts gain higher social popularity.
Hashtags are alphanumeric strings preceded with a hash (#) symbol and are annotations that represent the context of posts (Caleffi 2015). Hashtags direct social media algorithms to show a post in the feeds of interested audience. Moreover, hashtags are used by users to retrieve all posts related to a specific topic. Therefore, it has been claimed that hashtags are an important driving factor in attaining popularity of a post (Yamasaki et al. 2017; Wang et al. 2019). The social popularity of a post is computed in terms of its various engagement metrics such as likes, comments and shares which are attained when users engage with that post.
Therefore, if a post on a particular topic does not use the related hashtags, then it might lose out on a majority of the audience interested in that topic. Users search for posts of their interest using hashtags and as a result, posts annotated with those hashtags are rendered by the social media algorithms. Therefore, if a post does not contain the relevant hashtags, then it may not appear as a search result leading to a loss of potential audience and lesser social popularity. On the other hand, using appropriate hashtags can enhance the reachability of a post by making it visible to a wider audience. Moreover, due to the nature of social media algorithms, users interested in topics similar to that of the post can potentially engage with that post. As a result, using appropriate hashtags leads to higher user engagement thereby boosting the social popularity of posts. However, a lot of users might be unaware about the importance of hashtags and even if they are aware, they might be uncertain about suitable hashtags to be assigned.
In this work, we aim to recommend hashtags which are relevant to the post and can also enhance its social popularity. In this less explored field, there have been research works which aim to boost social popularity by using different social media entities. Yamasaki et al. (2017) used popular posts as a basis to select the candidate hashtags which can deliver higher social popularity and are simultaneously relevant to the post content. Wang et al. (2019) recognized the effect of recommending hashtags used by popular users and combined this factor with that of popular posts. This also stands as the current state-of-the-art method which recommends the highest popularity delivering hashtags. In these previous works (Yamasaki et al. 2017; Wang et al. 2019), the authors targeted trending hashtags using different social media entities. For example, if a celebrity (a popular user) uses a hashtag h, it is likely that a lot of users will take interest in posts annotated with the hashtag h and might also use it. Intuitively, recommending h to users who are posting on the same topic would be a smart move, as more users will engage with the post, leading to higher social popularity of that post and the corresponding user. A similar analogy can be drawn for popular posts as well. A post that gains high popularity creates a larger audience for each of the hashtags used by that post. Therefore, other posts using the same hashtags will garner a wider audience, leading to higher social popularity. In both cases, it can be noted that hashtags with a larger audience are given an upper hand at the time of recommendation. In other words, using trending hashtags can lead to higher social popularity.
Unlike existing works, we introduce a new method, TRANSIT, which incorporates keywords used in posts, besides users and posts, to select the trending hashtags and thereby achieve better performance than existing methods.
Furthermore, the predominant method of evaluation in this field is to directly upload the post with recommended hashtags on social media platform and monitor the social popularity of the post attained after regular time intervals (Wang et al. 2020). However, social media is a dynamic ecosystem which is difficult to predict and hence unreliable. This experimental setup might lead to different results for different settings like time, location, etc., and hence may not be regarded as a uniform method of evaluation. For instance, a poorly recommended hashtag might accidentally lead to high popularity due to incidence of a related event after the post is uploaded. In this case, the credit of higher popularity would be wrongly assigned to the recommendation method and this would lead to an incorrect analysis. Similarly, there can be other variables like time, location, current affairs and many more which are beyond the scope of recommendation methods but can affect this mode of evaluation. Hence, we have devised a novel evaluation algorithm which is uniform and reliable unlike the current methods. We have obtained an improvement in performance using our proposed method over the prevalent baseline methods. Contributions of our work are enumerated as follows:
We proposed a novel method, which considers keywords of posts as a cornerstone besides users and posts to recommend context relevant hashtags that can enhance the social popularity of a post.
We designed a novel evaluation algorithm which is uniform and reliable unlike the current simulation-based methods.
We created a new social media dataset that contains recent posts fetched from Twitter on a wide range of topics. This dataset can be used for carrying out research in text-based hashtag recommendation and popularity prediction.
We obtained better performance using our proposed framework over the state-of-the-art methods.
The rest of the research paper is organized as follows. In Sect. 2, we present our formal problem definition. In Sect. 3, we discuss the related works of this field. Further, we describe the fundamental concepts in Sect. 4. Section 5 explains our methodology and Sect. 6 presents our experimental evaluations. Section 7 presents our conclusions.
Problem statement
In this section, we define our problem statement formally. Our problem statement can be divided into two subproblems which are stated below.
Problem 1: Hashtag recommendation
Given a dataset D comprising posts denoted by $p_i$, where $1 \le i \le N$, with each post $p_i$
- being posted by a user $u_i$ having $f_i$ followers, where $u_i \in U$, the set of all users in the dataset D,
- annotated with a set of hashtags $H_i$ such that $H_i \subseteq H$, the set of all hashtags in the dataset D,
- having Social Post Popularity $SPP_i$, computed from $l_i$, $c_i$ and $s_i$, which denote the number of likes, comments and shares attained by the post $p_i$, respectively,
we recommend a set of hashtags $H'_i$ for each post $p_i$, where $H'_i \subseteq H$, such that if the content of post $p_i$ is annotated with the recommended set of hashtags $H'_i$ and is uploaded on the concerned social media platform, then it should achieve social post popularity $SPP'_i$, where $SPP'_i > SPP_i$.
Problem 2: Evaluation of recommended hashtags
Suppose we have a dataset comprising posts denoted by $p_j$ along with their actual social post popularity $SPP_j$. Now, given the content of a post $p_j$ annotated with a set of recommended hashtags $H'_j$, the task is to
- predict the social post popularity $SPP'_j$ of this new post,
- compare $SPP'_j$ with the actual social post popularity $SPP_j$ of $p_j$ and compute the success rate of the recommendation method.
The success rate can be calculated as the percentage of posts in this dataset where $SPP'_j > SPP_j$.
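The success rate described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `success_rate` is hypothetical, and the strict comparison (predicted popularity greater than actual popularity) is an assumption about the condition that defines a successful recommendation.

```python
def success_rate(predicted, actual):
    """Percentage of posts whose predicted popularity exceeds the actual one.

    `predicted` and `actual` are parallel sequences of popularity scores.
    The strict '>' comparison is an assumption made for this sketch.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        return 0.0
    wins = sum(1 for p, a in zip(predicted, actual) if p > a)
    return 100.0 * wins / len(predicted)


# Two of the four posts improve on their actual popularity.
rate = success_rate([12.0, 5.0, 9.0, 7.5], [10.0, 6.0, 9.0, 7.0])
print(rate)
```

Because the score depends only on a fixed dataset and a fixed predictor, repeated runs give the same value, which is the uniformity that the live-upload evaluation lacks.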
Related work
There have been various studies that emphasize the importance of hashtags in social media (Caleffi 2015) and the impact of hashtags on the internet as studied by Baltaci and Ersoz (2022). Hashtags play a vital role to associate the content with a context leading to multiple theories and methods to analyze it (Chang 2010; Ferragina et al. 2015). The previous works related to our research come under the fields of hashtag recommendation, popularity prediction and hashtag recommendation for boosting popularity which are stated in the following subsections.
Hashtag recommendation
This field aims at recommending hashtags which are semantically relevant to the context of a post. Various techniques and models have been introduced for the same. Lops et al. (2013) and Zhang et al. (2009) used Collaborative Filtering and Hu et al. (2017) used Tagcoor to incorporate the concept of hashtag co-occurrence in recommending hashtags that are semantically relevant. Sigurbjörnsson and Van Zwol (2008) and Si et al. (2009) also included the correlation between hashtags, textual content and the concerned user.
Kumar et al. (2021) used semantic features (based on word-embeddings) and user influence features (based on users’ influential position) to recommend hashtags. Liang et al. (2010) deduced analogies between users and hashtags and recommended the latter accordingly. Folksonomy (Ibba et al. 2015) has also been used for semantic hashtag recommendation (Jäschke et al. 2007; Hong et al. 2018) wherein content is categorized into a class and the associated hashtags are recommended. Ben-Lhachemi (2018) proposed an unsupervised approach that relies on the density-based clustering of the embedded representation of social media posts for selection of hashtags used by similar posts for recommendation. Cantini et al. (2021) used a BERT-based hashtag recommendation methodology that maps semantic features of social media posts in the latent representation of their hashtags. Li et al. (2019) proposed a topical co-attention LSTM network to recommend hashtags by jointly considering semantic and topical information. Furthermore, Guan et al. (2009) used a graph-based approach for hashtag recommendation by taking inspiration from the Page Rank algorithm (Page et al. 1999) and Text Rank algorithm (Mihalcea and Tarau 2004). Apart from text, there have been some works which adapt a multimodal approach wherein textual as well as visual features are taken into consideration while recommending hashtags. For instance, Bansal et al. (2022) used a multimodal deep learning framework for recommending personalized hashtags. Deep learning was also used by Nguyen et al. (2017) for personalized hashtag recommendation.
Popularity prediction
This field primarily concerns predicting the social popularity that a post will achieve when uploaded on a social media platform. The social popularity of a post can be formulated from various engagement metrics such as likes, comments and shares (Yamasaki et al. 2014). For prediction of post popularity, He et al. (2014) used user comments along with textual features. To extract features from the textual data of a post, various methods were introduced by Manning et al. (2014). Later on, different Natural Language Processing models were introduced for textual feature extraction, including the renowned BERT model (Devlin et al. 2018), which stands as one of the most widely used models for this purpose. Apart from textual features, some researchers adopted a multimodal approach by using the context features of images and videos in a post in order to predict the popularity (Zohourian et al. 2018). Meghawat et al. (2018) used social features of the user along with the visual and textual features of a post. Karthikeyan et al. (2019) also introduced multilingual capability in popularity prediction. Huang et al. (2018) used a machine learning model, Random Forest, for prediction of post popularity. Since hashtags are an important driving factor in achieving popularity for a post, Ma et al. (2012) discussed the popularity of hashtags themselves.
Hashtag recommendation for boosting popularity
This field is different from the one presented in Sect. 3.1 because the motive here is to recommend hashtags which should be simultaneously relevant to the context of the concerned post and capable of getting high popularity for that post. This field involves generating scores for all the hashtags and selecting the highest ranked hashtags for recommendation.
For ranking hashtags, Folkrank (Hotho et al. 2006) had been a prevalent technique which was further improved by Gemmell et al. (2009) by using Collaborative Filtering. Landia et al. (2012) introduced the usage of Folkrank with textual data where words had been used in a graph-based structure and hashtag ranks were calculated. This field also involves research in the construction of various transition matrices which are used to rank the hashtags. Yamasaki et al. (2017) devised a Folk Popularity matrix which was based on post popularity. Wang et al. (2019) introduced a new matrix called AUFP that integrated a matrix based on user popularity with the Folk Popularity matrix. AUFP also stands as the current state-of-the-art method in this field. Our work focuses on introducing a new transition matrix which outperforms AUFP and will be presented in Sect. 5.
Evaluation techniques
Many researchers have conducted simulation-based tests to evaluate how effectively the hashtags recommended by different techniques boost the popularity of a post. The recommended hashtags were attached to the content of a post and uploaded on the respective social media platform. This approach was adopted for each set of hashtags recommended by different methods, and their respective post popularity figures were recorded after fixed intervals of time (Sigurbjörnsson and Van Zwol 2008; Yamasaki et al. 2017; Wang et al. 2019). However, this mode of evaluation is not reliable due to the dynamic nature of social media. We have explained the limitations of this evaluation technique and overcome them with our novel evaluation algorithm based on popularity prediction, as proposed in Sect. 5.
The fundamental concepts
Our proposed framework is based on seven fundamental concepts which are stated as follows:
Hashtags associated with popular users are important.
The more hashtags associated with a user, the lesser is the contribution of a single hashtag toward the popularity of that user. However, hashtags with a relatively higher association with the user have a higher contribution than the others.
Hashtags associated (co-occurred) with important hashtags are also important.
Hashtags associated with popular posts are important.
The more hashtags associated with a post, the lesser is the contribution of a single hashtag toward the popularity of that post.
Hashtags associated with popular words are important.
The more hashtags associated with a word, the lesser is the contribution of a single hashtag toward the popularity of that word. However, hashtags with a relatively higher association with the word have a higher contribution than the others.
The first five concepts were introduced previously by Wang et al. (2020). In this research work, we introduce two new concepts, 6 and 7. The justifications behind each of the concepts are stated below.
Concept 1, 4 Hashtags associated with popular users/posts are important. Both concepts express the relation between popular users/posts and the importance of hashtags used by them. For instance, if a popular user uses a hashtag h, it is likely that a lot of users would take interest in posts with the hashtag h and might also use it. Naturally, recommending h to the users who are posting on the same topic would be beneficial, as more users will engage with that post, leading to higher social popularity of that post and the corresponding user. Further, a post which gains high popularity leads to a larger audience for each of the hashtags used by that post. Therefore, other posts using similar hashtags will experience a wider audience, leading to higher social popularity.
Concept 2, 5 The more hashtags associated with a user/post, the lesser is the contribution of a single hashtag toward the popularity of the user/post. However, hashtags with relatively higher association with the user has higher contribution in comparison to that of others. These two concepts highlight the link between the number of hashtags associated with user/post and the contribution of those hashtags toward the popularity of that user/post. If a popular user uses a lot of hashtags then the followers of that user would be divided in using and following those hashtags. The reason is that the followers of a user are interested in the content generated by that user. However, that user might generate content on different fronts and each front might have a different audience. Therefore, if there are a lot of such fronts resulting in a lot of hashtags being used by that user, then the followers of that user would be divided in consuming the content from different fronts generated by that user. As a result, the credit of popularity of the user would be distributed among all hashtags used by that user. However, if one of the fronts of the user has a larger audience, meaning the corresponding hashtag of the front is more trending than those of other fronts, then that hashtag would be comparatively more important than other hashtags used by that user. Similarly, if a post is associated with multiple hashtags, then the credit of popularity of that post would be distributed among those hashtags. If there are a lot of hashtags in a post, then each hashtag would receive a small share of the credit of popularity of that post. On the other side, with a fewer hashtags, the contribution of each of those hashtags will be higher toward the popularity of the corresponding post.
Concept 3 Hashtags associated (co-occurred) with important hashtags are also important. The reason is that hashtags co-occurring with trending hashtags will also be consumed by the large audience of the trending hashtags and therefore, they will be trending too. However, they might not be as trending as the original trending hashtag they co-occur with, depending on their degree of co-occurrence with the trending hashtag. Here, we define the degree of co-occurrence of hashtag h1 with hashtag h2 as the ratio of the number of posts where h1 and h2 appear together to the number of posts using h2.
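The degree of co-occurrence can be illustrated with a short sketch. The function name `cooccurrence_degree` and the toy posts are hypothetical, and the choice of denominator (the number of posts using the second hashtag) is this sketch's reading of the definition rather than a confirmed detail of the method.

```python
def cooccurrence_degree(posts, h_a, h_b):
    """Fraction of the posts using h_b that also use h_a.

    `posts` is a list of hashtag sets, one set per post. Taking the posts
    that use h_b as the denominator is an assumption of this sketch.
    """
    with_b = [tags for tags in posts if h_b in tags]
    if not with_b:
        return 0.0
    both = sum(1 for tags in with_b if h_a in tags)
    return both / len(with_b)


posts = [{"#ai", "#ml"}, {"#ai"}, {"#ai", "#ml", "#data"}, {"#data"}]
ml_given_ai = cooccurrence_degree(posts, "#ml", "#ai")  # 2 of the 3 "#ai" posts
ai_given_ml = cooccurrence_degree(posts, "#ai", "#ml")  # 2 of the 2 "#ml" posts
print(ml_given_ai, ai_given_ml)
```

Note that the measure is not symmetric: a hashtag that always accompanies a trending hashtag can have a high degree in one direction and a low degree in the other.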
Concept 6 Hashtags associated with popular words are important. Analogous to Concepts 1 and 4, this concept connects the popularity of a word with the importance of hashtags associated with the word. Here, we define a word as a string of alphanumeric characters. Posts generally comprise multiple sentences and each sentence comprises multiple words. Further, hashtags of a post are a part of its content and hence can be regarded as words of a post preceded by a hash symbol.
Now, if a hashtag is trending then it appears in a large number of posts. Since hashtags can be considered as words of a post, we can term the trending hashtags as popular words. Further, from concept-3 which says hashtags associated (co-occurred) with important hashtags are also important, we can re-frame this concept and state that hashtags associated with popular words are important.
Concept 7 The more hashtags associated with a word, the lesser is the contribution of a single hashtag toward the popularity of the word. However, hashtags with relatively higher association with the word have higher contribution. A hashtag becomes trending because of the audience of the corresponding topic and also due to the social traffic of other related topics, which are represented by different hashtags. The reason is that, in general, a post contains multiple hashtags and a particular hashtag co-occurs with other hashtags as well. Therefore, if hashtag h1 co-occurs with hashtag h, h has an audience which also includes the audience of h1. In other words, h1 contributes to the trending nature of h. This raises the question of how the contributions of the hashtags that co-occur with h, toward the trending nature of h itself, compare with each other. The answer depends on the degree of co-occurrence of the hashtag with h. If the audience of a hashtag h1 repeatedly sees the hashtag h due to a high degree of co-occurrence of h with h1, it is highly likely that they will start showing interest in h and engage with posts containing h, leading to an increase in the trending nature of h. On the other hand, if the degree of co-occurrence of h and h2 is lower, then the audience of h2 will see the hashtag h less often. Consequently, it is less likely that they will start showing interest in h and engage with posts containing h. Therefore, we can say that the contribution of h1 toward the trending nature of h is higher than that of h2.
Consequently, if hashtag h co-occurs with very few hashtags, then the degree of co-occurrence of h with each of these few hashtags will be higher. Since there are generally multiple hashtags in a post, if hashtag h is in a post and it is co-occurring with another hashtag, then we have fewer options to select from for a co-occurring hashtag. As a result, each of those co-occurring hashtags has a high probability of co-occurring with h, leading to its high degree of co-occurrence with h. Therefore, each of those few hashtags would have a significant contribution toward the trending nature of h. On the other hand, if h co-occurs with a lot of other hashtags, then we would have a lot of options to select from for the hashtag that would co-occur with h in a post. This would lead to a lower probability of each of those hashtags being selected as a co-occurring hashtag, resulting in its low degree of co-occurrence with h. As a result, each of those hashtags would have a less significant contribution toward the trending nature of h than in the earlier case. Hence, we can conclude that the more hashtags co-occur with a hashtag h, the lesser is the contribution of each hashtag toward the trending nature of h. Since we can represent the hashtag h as a word (from Concept 6), we can conclude that the more hashtags are associated with a word, the lesser is the contribution of each hashtag toward the popularity of that word. However, out of the set of hashtags co-occurring with h, if one of them has a relatively higher degree of co-occurrence with h, then it would have a higher contribution toward the trending nature of h than the others. In other words, hashtags with a relatively higher association with the word have a higher contribution than the others.
Methodology
System architecture
The overall architecture of our proposed framework TRANSIT (hashTag RecommendAtion for eNhancing Social popularITy) is illustrated in Fig. 1. TRANSIT comprises four components namely data preparation, hashtag recommendation, popularity prediction and success rate computation. The data preparation component consists of two modules namely data cleaning and processing and data splitting. The hashtag recommendation component is made up of three modules: transition matrix construction, relevance computation and Markov chain technique. The third component is popularity prediction that consists of training of popularity prediction models as first module and popularity score generation as second module. The last component is success rate computation which yields the final results of our framework. Each component is explained along with its modules in detail in the subsequent sections.
Fig. 1. System architecture
Data preparation
Data cleaning and processing
This module consists of two steps, namely data cleaning and data processing. The collected dataset can contain data outside the scope of our work, such as non-English posts and multimodal posts. Removing these posts is the first step and constitutes data cleaning.
The second step involves computing the ground truth popularities from various engagement metrics, extracting hashtags and a few other operations. This is done in order to convert the dataset into the input format required by the next step, i.e., data splitting. These data cleaning and processing steps are covered in detail in the experimental evaluation (Sect. 6).
Data splitting
Once the dataset is cleaned and processed, we need to split it with a ratio (r) into two parts, namely the hashtag recommendation dataset (HRD), which is used by the hashtag recommendation component, and the popularity prediction dataset (PPD), which is used by the popularity prediction component. The reason is that the Machine Learning-based popularity prediction models should be trained and tested on separate posts. Moreover, the split should be made in such a manner that both datasets comprise the same set of hashtags, so that the feature vectors of both datasets are consistent. Consequently, when a hashtag is used by just a single post, that post must be present in both datasets. This results in duplicates across the two datasets, which can be minimized by trying out different values of the ratio (r). The complete algorithm to split the dataset is explained in Algorithm 1.
In this algorithm, for each hashtag, we divide the posts using that hashtag into two datasets, HRD and PPD. However, there might be some cases where a post is present in both datasets to preserve the consistency of the hashtag-to-index mapping in both datasets. This algorithm requires the construction of a data structure that stores the mapping between a hashtag and the posts using it, depicted as D in Line 1. As depicted in Lines 4–6, if there is only one post using a hashtag h, then this post needs to be duplicated so that the hashtag h is present in both datasets. Further, in cases where there are multiple posts using the hashtag h, we store the counts of posts which are already assigned to HRD and PPD, respectively, in Lines 7–8. A post might have multiple hashtags, so while processing a hashtag h1, a post which also uses another hashtag h2 is assigned to one of the datasets. Therefore, while processing hashtag h2, it is necessary to check for posts that were already assigned to a dataset while processing a previous hashtag. Further, if all posts of a hashtag are assigned to one of the datasets, then some of the posts need to be duplicated in the other dataset, as depicted in Lines 10–15.
Furthermore, if the split ratio is (r), it means that all the posts using a hashtag should be split according to the ratio (r). Therefore, in Lines 16–17, the desired number of posts in each dataset is computed. There are then four cases. The first case is the one where the counts of posts already assigned to HRD and to PPD are both less than the corresponding desired number of posts. In this case, we resolve the deficit of one of the datasets and assign the remaining unassigned posts to the other dataset, as shown in Lines 19–22. The second case is where there is a deficit for HRD and a surplus for PPD. In this case, we assign all unassigned posts to HRD, as illustrated in Lines 23–24. The third case is the reverse of the second case, and here we assign all unassigned posts to PPD, as shown in Lines 25–27. In the fourth case, where both PPD and HRD are in surplus, we assign all the unassigned posts to the dataset with the smaller surplus, as shown in Lines 28–32. Lastly, in Line 33, we return our final datasets HRD and PPD.
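The core constraint of the split, that every hashtag must appear in both datasets, can be illustrated with a simplified sketch. This is not Algorithm 1: it does not reproduce the four-case deficit and surplus handling, and the function name `split_posts`, the rounding rule and the repair step are assumptions made only for illustration.

```python
from collections import defaultdict


def split_posts(post_hashtags, r):
    """Split post indices into (hrd, ppd) so every hashtag occurs in both.

    `post_hashtags` is a list of hashtag sets; `r` is the target fraction
    of each hashtag's posts that should go to HRD. Simplified sketch only.
    """
    by_tag = defaultdict(list)  # hashtag -> indices of posts using it
    for idx, tags in enumerate(post_hashtags):
        for tag in tags:
            by_tag[tag].append(idx)

    hrd, ppd = set(), set()
    for tag in sorted(by_tag):
        posts = by_tag[tag]
        if len(posts) == 1:
            # A hashtag used by a single post forces a duplicate.
            hrd.add(posts[0])
            ppd.add(posts[0])
            continue
        # Desired HRD count, clamped so both sides can receive a post.
        want_hrd = max(1, min(len(posts) - 1, round(r * len(posts))))
        in_hrd = sum(1 for p in posts if p in hrd)
        for idx in posts:
            if idx in hrd or idx in ppd:
                continue  # already assigned while handling another hashtag
            if in_hrd < want_hrd:
                hrd.add(idx)
                in_hrd += 1
            else:
                ppd.add(idx)
        # Repair step: duplicate one post if a side has none for this tag.
        if not any(p in hrd for p in posts):
            hrd.add(posts[0])
        if not any(p in ppd for p in posts):
            ppd.add(posts[0])
    return hrd, ppd


sample_posts = [{"a"}, {"a", "b"}, {"b"}, {"c"}, {"a"}]
hrd, ppd = split_posts(sample_posts, 0.5)
print(sorted(hrd), sorted(ppd))
```

In the sample, the post using only hashtag "c" ends up in both datasets, which is the duplication that the choice of (r) is meant to keep small.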
Hashtag recommendation
This component recommends hashtags which are relevant to context of the post and also capable of enhancing its social popularity. This is established using four modules namely transition matrix computation, relevance computation, Markov chain technique and post feature extraction as explained below.
Transition matrix construction
This module has been the main area of focus for all previous research works in this field (Yamasaki et al. 2017; Wang et al. 2020). In this module, a transition matrix is constructed on the basis of the concepts introduced in Sect. 4 and is used in the next module, i.e., the Markov chain technique. This transition matrix is the main deciding factor for generating the scores of all hashtags, after which the highest ranked hashtags are recommended.
Folk Popularity Matrix (AFP): The Folk Popularity transition matrix (Yamasaki et al. 2017), abbreviated as AFP, is based on post popularity and is constructed by the matrix multiplication of two matrices, namely a post popularity matrix, which captures Concept 4, and the transpose of a post association matrix, which captures Concept 5. The entry in the ith row and jth column of the post popularity matrix is the social post popularity of the jth post if it is associated with the ith hashtag. Here, association implies that the ith hashtag is used by the jth post. Further, this matrix is normalized row-wise. The entry in the ith row and jth column of the post association matrix is the numerical association of the jth post and the ith hashtag. Here, numerical association refers to a discrete function yielding 1 if the ith hashtag is used by the jth post and 0 otherwise. Further, this matrix is normalized column-wise.
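The construction described above can be sketched in plain Python. This is an illustrative reading of the description, not code from the cited work: the function names are hypothetical, hashtags and posts are identified by integer indices, and entries for unassociated hashtag and post pairs are assumed to be zero.

```python
def normalize_rows(m):
    """Scale each row to sum to 1 (rows of zeros are left as zeros)."""
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def normalize_cols(m):
    """Scale each column to sum to 1 (columns of zeros are left as zeros)."""
    col_sums = [sum(col) for col in zip(*m)]
    return [[v / s if s else 0.0 for v, s in zip(row, col_sums)] for row in m]


def folk_popularity_matrix(post_hashtags, post_popularity, n_hashtags):
    """Hashtag-by-hashtag matrix: popularity matrix times association^T."""
    n_posts = len(post_hashtags)
    pop = [[0.0] * n_posts for _ in range(n_hashtags)]    # Concept 4
    assoc = [[0.0] * n_posts for _ in range(n_hashtags)]  # Concept 5
    for j, tags in enumerate(post_hashtags):
        for i in tags:
            pop[i][j] = float(post_popularity[j])
            assoc[i][j] = 1.0
    pop = normalize_rows(pop)
    assoc = normalize_cols(assoc)
    # Entry (i, k) = sum over posts j of pop[i][j] * assoc[k][j].
    return [
        [sum(p * a for p, a in zip(pop[i], assoc[k])) for k in range(n_hashtags)]
        for i in range(n_hashtags)
    ]


# Post 0 uses hashtags 0 and 1 (popularity 10); post 1 uses 1 and 2 (popularity 30).
afp = folk_popularity_matrix([{0, 1}, {1, 2}], [10, 30], 3)
for row in afp:
    print(row)
```

With this normalization every row of the result sums to 1 whenever each hashtag is used by at least one post, which is what allows the matrix to act as a transition matrix in the Markov chain step.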
User Popularity Matrix (AUP): The User Popularity transition matrix (Wang et al. 2019), abbreviated as AUP, is based on user popularity and is constructed by the matrix multiplication of two matrices, namely a user popularity matrix, which captures Concept 1, and the transpose of a user association matrix, which captures Concept 2. The entry in the ith row and jth column of the user popularity matrix is the user popularity of the jth user if the user is associated with the ith hashtag. Here, association implies that the ith hashtag is used by the jth user. Further, this matrix is normalized row-wise. The entry in the ith row and jth column of the user association matrix is the numerical association of the jth user and the ith hashtag. Here, numerical association refers to a discrete function yielding the frequency of usage of the ith hashtag by the jth user and 0 in case of no usage. Further, this matrix is normalized column-wise.
User-aware Folk Popularity Matrix (AUFP): The User-aware Folk Popularity matrix (Wang et al. 2020), abbreviated as AUFP, is constructed by the element-wise multiplication of the matrices AUP and AFP stated above. Similarly, Wang et al. (2019) used a transition matrix which is the addition of the matrices AUP and AFP. The transition matrix AUFP is the current state-of-the-art transition matrix. In all the matrices stated here, Concept 3 is taken into account when we multiply the first matrix (the user or post popularity matrix) with the transpose of the second matrix (the user or post association matrix). Therefore, all the concepts are taken into account while constructing these transition matrices. In the algorithm to construct the basic matrices AUP and AFP, we need two metrics. Both matrices are formed by the multiplication of two matrices. The first matrix requires the metric of popularity of the user/post and the second matrix requires the metric of association of the user/post with hashtags. The popularity of a user/post can be derived from various engagement metrics such as likes, comments, shares and followers (Yamasaki et al. 2014). The association of a user/post with hashtags can be calculated from the usage of those hashtags by that user/post.
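The two ways of combining AUP and AFP mentioned above differ only in the element-wise operation. The sketch below uses small made-up matrices and a hypothetical helper `combine`; whether the combined matrix is re-normalized afterwards is not stated in the text, so no normalization is applied here.

```python
import operator


def combine(a, b, op):
    """Apply a binary operation element-wise to two equally shaped matrices."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    return [[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy 2 x 2 matrices, made up for illustration only.
aup = [[0.2, 0.8], [0.5, 0.5]]
afp_toy = [[0.5, 0.5], [0.1, 0.9]]

product_form = combine(aup, afp_toy, operator.mul)  # element-wise product
sum_form = combine(aup, afp_toy, operator.add)      # element-wise sum
print(product_form)
print(sum_form)
```

The product form rewards hashtag pairs that score well on both user and post popularity, while the sum form lets a high score on either one carry the pair.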
In order to present our word-based transition matrix with a similar analogy, it is required to numerically define the popularity of words and the association of words with hashtags. To calculate the popularity score of each word, we have used the TextRank (Mihalcea and Tarau 2004) algorithm. Moreover, for calculating the association of a hashtag with a word, we used Association Rule Mining (Agrawal et al. 1993). Association Rule Mining is a statistical procedure used to find correlations in a dataset and hence we adopt it to find the association of hashtags and words. It builds on measures such as support and confidence. We have assumed that if a word w and a hashtag appear together in a post then they are associated with each other.
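Therefore, if we define S as the support of a hashtag h and the word w appearing together in a post, then we can write S as:
$$S = \frac{n(w, h)}{N} \tag{1}$$
Here, n(w, h) denotes the number of posts in which w and h appear together and N denotes the total number of posts. Similarly, if we define S_w as the support of the word w, where n(w) is the number of posts containing w, we can write S_w as:
$$S_w = \frac{n(w)}{N} \tag{2}$$
Now, if we define C as the confidence of h and w appearing together in a post, then we can write C as
$$C = \frac{S}{S_w} \tag{3}$$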
Finally, we have devised a formula to denote the association between the word w and hashtag as
| 4 |
The complete algorithm to calculate association between hashtags and words is described in Algorithm-2. In this algorithm, for each post, we first extract all unique words and hashtags in W & H, respectively, in Lines 4–6. Then, in Lines 7–10, we form all combinations of the words of W and hashtags of H and increment the corresponding counts in the and . Now, all these counters stored in and need to be divided by N to compute the supports as shown in Lines 12 and 15. Further, to compute confidence of (word, hashtag), we divide support of (word, hashtag) by support of word as shown in Line 18. Finally, we compute the association of word and hashtag using Eq. (4) in Line 19.
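The counting steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm-2: the function name `support_and_confidence` and the input layout (one `(words, hashtags)` pair per post) are hypothetical, N is assumed to be the number of posts, and the final association formula of Eq. (4) is deliberately left out.

```python
from collections import Counter
from itertools import product


def support_and_confidence(posts):
    """posts: list of (words, hashtags) pairs, one pair per post (assumed layout)."""
    n_posts = len(posts)
    word_count = Counter()  # number of posts containing word w
    pair_count = Counter()  # number of posts containing both w and h
    for words, hashtags in posts:
        unique_words, unique_tags = set(words), set(hashtags)
        word_count.update(unique_words)
        # every (word, hashtag) combination of this post co-occurs once
        pair_count.update(product(unique_words, unique_tags))

    # supports: counters divided by the number of posts
    support_word = {w: c / n_posts for w, c in word_count.items()}
    support_pair = {wh: c / n_posts for wh, c in pair_count.items()}
    # confidence: support of (word, hashtag) divided by support of the word
    confidence = {(w, h): s / support_word[w]
                  for (w, h), s in support_pair.items()}
    return support_word, support_pair, confidence
```

Using sets per post ensures that a word or hashtag repeated within one post is counted only once, which matches the extraction of unique words and hashtags described in the text.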
Clearly, is not a discrete function unlike that of posts and users because the association of a word with a hashtag requires a continuous distribution. For instance, there are two hashtags , and a word w and we need to find and using a discrete binary distribution. Without loss of generality, if we take to be semantically more relevant to w than , we would like to be greater than . This means that we have to assign and . However, considering one of the associations as 0 does not capture the reality because both the hashtags and are associated with the word w. So, we can not handle this case with just two states of a discrete binary distribution and we would require at least three states. Similarly, if we increase the number of words and number of hashtags, we would require more and more states. Therefore, to prevent creating a lot of discrete states, we have taken a continuous distribution to denote the association of hashtag and word. We have found one of the avenues to denote the association using the above formula and there might be different other ways of doing so which might come under the scope of a separate research work.
Word and User aware Folk Popularity Matrix (AUFWP): We propose a new transition matrix, AUFWP, which is based on user, post and word popularity. This matrix is obtained by the element-wise multiplication of AUFP (mentioned above) and AWP.
$$A_{UFWP} = A_{UFP} \odot A_{WP} \tag{5}$$
AWP is based on word popularity and is created by matrix multiplication of two matrices namely , which captures Concept 6, and transpose of , which captures Concept 7. The ith row and jth column of the matrix constitutes the word popularity of the jth word if it is associated with the ith hashtag. Further, the matrix is normalized row-wise. Here, word popularity refers to the Text Rank score of the word. The ith row and jth column of the matrix constitutes the numerical association of the jth word and the ith hashtag. Further, the matrix is normalized column-wise. Here, association of jth word () and ith hashtag () refers to , as discussed above.
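To make the construction concrete, the sketch below builds a word popularity matrix from a list of word scores and a hashtag-by-word association table, and then combines two matrices element-wise. It is only an illustration under assumptions: the function names, the list-of-lists layout and the handling of empty rows or columns (left as zeros) are hypothetical choices, and the association values are taken as given.

```python
def normalize_rows(A):
    """Scale each row so that it sums to 1 (all-zero rows stay zero)."""
    out = []
    for row in A:
        s = sum(row)
        out.append([x / s if s else 0.0 for x in row])
    return out


def normalize_columns(A):
    """Scale each column so that it sums to 1 (all-zero columns stay zero)."""
    sums = [sum(col) for col in zip(*A)]
    return [[x / s if s else 0.0 for x, s in zip(row, sums)] for row in A]


def word_popularity_matrix(popularity, association):
    """popularity[j]: popularity score of word j (e.g. a TextRank score).
    association[i][j]: association of word j with hashtag i, 0 if none.
    Returns the T x T product of the first matrix and the transposed second one."""
    # first matrix: word popularity where the word is associated with the hashtag
    P = normalize_rows([[popularity[j] if a > 0 else 0.0
                         for j, a in enumerate(row)] for row in association])
    # second matrix: numerical association, normalized column-wise
    Q = normalize_columns(association)
    return [[sum(p * q for p, q in zip(p_row, q_row)) for q_row in Q]
            for p_row in P]


def elementwise_product(A, B):
    """Element-wise (Hadamard) product of two equally sized matrices."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
```

A call such as `elementwise_product(A_ufp, word_popularity_matrix(scores, assoc))` would then correspond to combining AUFP with AWP, where `A_ufp`, `scores` and `assoc` are placeholders for data prepared elsewhere.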
All transition matrices are constructed such that they yield the most trending hashtags when used in the Markov chain technique. However, they are incapable of recognizing the relevance of the hashtags to the concerned post. Therefore, this needs to be handled separately in the relevance computation module.
Relevance computation
This module ensures relevance of the recommended hashtags. The output of this module is a column vector where T represents the total number of unique hashtags in given dataset. This column vector is termed as the preference vector wherein we assign a weight to each of the hashtags and write it in the corresponding index of that hashtag. In the previous research works (Yamasaki et al. 2017; Wang et al. 2020), different methods have been undertaken to figure out the most relevant hashtags from already existing hashtags for concerned posts. After figuring out these hashtags, the corresponding indices of these chosen hashtags were set as 1 and others were set as 0 in the preference vector. However, this method has a limitation that it does not consider different degrees of relevance of chosen hashtags to the concerned post as there are only two states (0 and 1) to express the preference weights.
We have come up with a novel weighted preference vector where we adopted a continuous distribution instead of a discrete distribution to accommodate a higher number of degrees of relevance. We extract keywords of a post using the KeyBERT model. Keywords are the words which are highly significant and close to the context of the post. Keywords can be worthy candidates for relevant hashtags as they represent the context of the post, which is precisely the function of a hashtag. KeyBERT generates a relevance score for each of the extracted keywords. Instead of using 0 and 1, we directly used the relevance scores to set the values of the preference vector. Since we are using different weights for each of the preferred hashtags, we name this vector the weighted preference vector. This gives us more freedom to assign different degrees of relevance to keywords of the concerned post. This weighted preference vector is then used in the Markov chain technique module, which is explained below.
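A minimal sketch of how such a weighted preference vector could be assembled is given below. The (keyword, score) pairs are assumed to come from a keyword extractor such as KeyBERT. The function name, the `hashtag_index` mapping, the rule of matching a keyword to a hashtag by identical text, the normalization to a sum of 1 and the uniform fallback are all assumptions made for illustration.

```python
def weighted_preference_vector(keyword_scores, hashtag_index):
    """keyword_scores: list of (keyword, relevance score) pairs.
    hashtag_index: dict mapping each hashtag (without '#') to its vector index."""
    p = [0.0] * len(hashtag_index)
    for keyword, score in keyword_scores:
        idx = hashtag_index.get(keyword)
        if idx is not None:  # keyword also exists as a hashtag in the dataset
            p[idx] = max(p[idx], score)
    total = sum(p)
    if total == 0:
        # assumption: no keyword matched any hashtag, fall back to equal weights
        return [1.0 / len(p)] * len(p)
    # scale so that the weights sum to 1
    return [x / total for x in p]
```

Unlike a binary preference vector, the relative sizes of the relevance scores are preserved after scaling, so a keyword with three times the score receives three times the weight.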
Markov chain technique
Markov Chain is the central module of the hashtag recommendation component. There are three steps namely generating scores for all the hashtags, selecting the highest ranked hashtags for recommendation and generating a hashtag encoding from the recommended hashtags.
Previous research (Yamasaki et al. 2017; Wang et al. 2019) in this field has used the Markov chain technique to generate scores of hashtags and rank them by taking inspiration from the Page Rank algorithm (Page et al. 1999). In this technique, a transition matrix is constructed and used in the iterative algorithm. If we denote r(M, t) as a vector of scores of each hashtag after t iterations (N is the number of hashtags) when we use the transition matrix M in the Markov chain technique, then we can write the iterative formula as
$$r(M, t) = d \, M \, r(M, t-1) + (1 - d) \, p \tag{6}$$
Here, d denotes the damping factor, which is introduced to handle the spider trap problem and the dead end problem (Bidoni et al. 2014); we took d as 0.85, as taken by Wang et al. (2020). Here, p refers to the weighted preference vector defined above. To maintain the semantics of the Markov chain technique, the transition matrix needs to be column stochastic (the sum of values in each column should equal 1) and the sum of values of the preference vector p should also equal 1.
Since r(M, t) denotes the probability distribution of a hashtag getting selected after t iterations, the sum of values of r(M, t) will also be equal to 1 (equal to the sum of values of the preference vector p). For the initial value of the vector of scores of all hashtags, i.e., r(M, 0), we can assign equal probability to each hashtag such that the sum of these probabilities is 1. Theoretically, the vector of scores of hashtags would converge completely after infinite iterations but for practical purposes, we run this iterative algorithm for a finite number of steps to approximately converge the scores. This iterative algorithm is used to recommend the highest ranked hashtags. In a similar fashion as Wang et al. (2020), we can recommend hashtags based on the equation
| 7 |
Here, is the number of iterations after which the vector of scores of hashtags approximately converges and M is the transition matrix used in the iterative algorithm. Here, the difference between and is the setting of the weighted preference vector. In the case of , we use a weighted preference vector in which we set higher weights for hashtags of our choice (keywords of the post). On the other hand, for , we use a weighted preference vector which gives equal weights to all hashtags such that the sum of all the weights is the same as that of . So, the final vector of scores might have higher weights for the hashtags which are trending in nature as well as co-occurring with the hashtags of our choice. Hence, the recommended hashtags will be relevant to the context.
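The iterative scoring can be sketched as below, assuming a PageRank-style update in which the new score vector is the damped product of the transition matrix with the previous scores plus the remaining share of the preference vector. The function name, the list-of-lists matrix layout and the fixed iteration count are illustrative assumptions. The sketch only produces the score vectors for a given preference vector and does not reproduce how the two score vectors are combined in Eq. (7).

```python
def rank_hashtags(M, p, d=0.85, iterations=100):
    """Iterate r <- d * M r + (1 - d) * p for a fixed number of steps.

    M: column-stochastic T x T transition matrix, given as a list of rows.
    p: preference vector whose entries sum to 1.
    d: damping factor (0.85 as stated in the text).
    """
    T = len(p)
    r = [1.0 / T] * T  # initial scores: equal probability for every hashtag
    for _ in range(iterations):
        r = [d * sum(M[i][j] * r[j] for j in range(T)) + (1 - d) * p[i]
             for i in range(T)]
    return r
```

Running the function once with a weighted preference vector and once with a uniform one, for example `rank_hashtags(M, p_weighted)` and `rank_hashtags(M, [1.0 / T] * T)`, yields the two score vectors whose preference settings are contrasted above.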
The trending nature of recommended hashtags is ensured by the iterative algorithm stated above. However, it does not guarantee absolute relevance of recommended hashtags to the context of the concerned post. The weighted preference vector is used to give higher weights to relevant hashtags and produced good results when tested manually. One crucial point is that the final choice of hashtags among our recommendations rests with the user. This has a twofold advantage. First, every user is likely to use different hashtags, as they will choose their set of hashtags from our recommendations and the intent of this choice will vary from person to person. Consequently, all posts on a topic do not use the same hashtags and diversity is maintained. Second, although the preference vector handles the factor of relevance quite well, if it fails to do so in a few cases, users can apply their own judgment to choose the most relevant hashtags from our recommendations. Although this algorithm may not guarantee relevance every time, it produces positive results in general.
After obtaining the column vector , we can use this to select the highest ranked hashtags for recommendation. The selection criteria can vary depending on the algorithm designer’s discretion. After getting the recommended hashtags, we have to construct a vector called hashtag encoding where T is the total number of unique hashtags of the dataset. Here, we have to set the corresponding indices of the selected hashtags as 1 and remaining indices as 0 in the hashtag encoding. The last step in the hashtag recommendation component is to generate a feature vector which is based on hashtag encoding and post features encoding, generated by the post features extraction module explained below.
Post feature extraction
The role of this module is to extract features from a post and use them to generate an encoding in the form of a column vector. In our case, the post features encoding is a 768-dimensional vector generated by passing the text of the post to a BERT model (Devlin et al. 2018). The model can be chosen as per the concerned social media platform. For example, we used the TweetBERT (Qudar and Mago 2020) model for the Twitter dataset. This post features encoding is then concatenated with the hashtag encoding to form a single column vector, also known as the feature vector.
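The construction of the feature vector can be illustrated with the short sketch below. The top-k selection rule, the function names and the `hashtag_index` mapping are assumptions for illustration (the text leaves the selection criterion to the designer), and the post encoding is treated as an already computed list of 768 numbers rather than being produced by an actual BERT model.

```python
def select_top_k(scores, index_to_hashtag, k):
    """Pick the k highest scored hashtags (one possible selection criterion)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [index_to_hashtag[i] for i in order[:k]]


def hashtag_encoding(recommended, hashtag_index):
    """Binary vector of length T with 1 at the indices of recommended hashtags."""
    encoding = [0.0] * len(hashtag_index)
    for tag in recommended:
        encoding[hashtag_index[tag]] = 1.0
    return encoding


def feature_vector(post_encoding, recommended, hashtag_index):
    """Concatenate the post features encoding with the hashtag encoding."""
    return list(post_encoding) + hashtag_encoding(recommended, hashtag_index)
```

With a 768-dimensional post encoding and T unique hashtags, the resulting feature vector has 768 + T entries.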
Popularity prediction
This component is the central module of our evaluation procedure. The predominant method of evaluation in previous research works (Yamasaki et al. 2017; Wang et al. 2020) of this field involves uploading a post with the recommended hashtags and monitoring the post popularity gained after regular time intervals. This technique was deployed for recommendations of various methods and then the post popularities were compared to analyze the performance of recommendation methods. However, this technique might not be reliable. Social media is a dynamic ecosystem which is quite unpredictable and hence this technique will yield non-uniform results for different settings like time, location, campaigns and many more. There are a lot of parameters which are not in our control and are outside the scope of recommendation methods but can seriously affect the results of this experimental setup. Hence, there is a need to standardize this procedure and come up with an algorithm which is uniform and reliable.
Therefore, we introduced TRANSIT, which incorporates a separate field of research on social media, i.e., popularity prediction of posts. To the best of our knowledge, the venture to combine these two fields (hashtag recommendation for boosting popularity and popularity prediction) into a single work is the first of its kind. The field of popularity prediction generally involves a Machine Learning model (He et al. 2014; Yamasaki et al. 2014; Meghawat et al. 2018; Zohourian et al. 2018) which takes as input the features of the post and its metadata, including hashtags, and predicts the post popularity that would be achieved in the future on the concerned social media platform. This component of popularity prediction involves two modules, namely training popularity prediction models and popularity score generation, which are explained below.
Training popularity prediction models
This module takes the popularity prediction dataset as input and uses it to train various Machine Learning-based models of popularity prediction. To maintain consistency, the feature vectors of posts used for training these models have the same structure as the feature vectors generated by the hashtag recommendation component. The feature vectors of posts of the popularity prediction dataset are constructed by concatenating the extracted post features and the hashtag encoding of the corresponding ground truth hashtags. The models which we have used in our work are described in experimental evaluation (Sect. 6). We adopt the best performing model to be used in the next module for predicting the popularities of the posts with the recommended hashtags.
Popularity score generation
After getting the best performing trained popularity prediction model, this module uses it to predict the popularity scores for feature vectors produced by hashtag recommendation component. Each transition matrix yields a separate set of recommended hashtags for a single post and hence, for each transition matrix and each post, we have a corresponding feature vector. In other words, for a post , we recommend hashtags using a transition matrix and obtain a predicted popularity using the trained popularity prediction model. These scores are computed for each post and each transition matrix and are compiled together to pass to the final component, i.e., success rate computation.
Success rate computation
The main goal of our work is to recommend hashtags that are relevant to the concerned post and also boost its popularity. Consequently, we have targeted posts which are not popular, because there is no need to enhance the popularity of posts which are already highly popular. Therefore, this component takes as input the hashtag recommendation dataset along with the predicted popularities for each post of this dataset and each transition matrix. Further, it splits the posts into three categories, namely low, average and high popularity posts, based on the ground truth post popularities. These ground truth post popularities can be formulated from various engagement metrics, namely likes, comments and shares, and are described in experimental evaluation (Sect. 6). We discard the high popularity category as the posts of that group are already very popular and do not need enhancement. Therefore, we report our results in the low and average popularity categories. Within a single category, for each transition matrix M, we compute the percentage of posts of that category where the predicted popularity (corresponding to hashtags recommended using M) is higher than the ground truth popularity of the corresponding post. We term this percentage the success rate of M in the corresponding popularity category. Our method achieved a higher success rate than the current state-of-the-art transition matrix for both low and average popularity posts.
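The success rate computation for one transition matrix can be sketched as follows. The thresholds `low_max` and `avg_max` that separate the three categories are hypothetical parameters, since the category boundaries are not specified here, and the function name and input layout are likewise assumptions.

```python
def success_rates(posts, low_max, avg_max):
    """posts: list of (ground_truth, predicted) popularity pairs obtained with
    one transition matrix. low_max and avg_max are assumed category thresholds."""
    buckets = {"low": [], "average": []}
    for truth, predicted in posts:
        if truth <= low_max:
            buckets["low"].append(predicted > truth)
        elif truth <= avg_max:
            buckets["average"].append(predicted > truth)
        # posts above avg_max are high popularity posts and are discarded
    # percentage of posts whose predicted popularity exceeds the ground truth
    return {name: 100.0 * sum(hits) / len(hits) if hits else None
            for name, hits in buckets.items()}
```

Calling this function once per transition matrix on the same posts gives directly comparable percentages, because the category membership depends only on the ground truth popularity.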
To summarize, we perform the data preparation and split the data into two datasets: the hashtag recommendation dataset and the popularity prediction dataset. The hashtag recommendation dataset is used to construct the transition matrices and set the weighted preference vectors. The transition matrices and weighted preference vectors are passed to the Markov chain technique component, wherein the hashtags are assigned scores, recommended and stored in a hashtag encoding. Meanwhile, the post features encodings are constructed for the posts of the hashtag recommendation dataset and are concatenated with the corresponding hashtag encodings to form the feature vectors. Further, the popularity prediction dataset is used to train the Machine Learning-based popularity prediction models and the best performing model is selected. Then, the feature vectors are passed to this model to predict the popularities. Finally, the hashtag recommendation dataset is divided into low, average and high popularity categories based on ground truth popularities and we evaluate the success rate of each of the transition matrices in the low and average popularity categories.
Adaptation to dynamic nature of social media
Machine Learning models suffer from performance degradation when there is a shift in distribution of data on which the models are trained. Therefore, like any other Machine Learning model, the popularity prediction models would need to be retrained and matrices used for hashtag recommendation would need to be reconstructed after some intervals of time to keep up with the newer data.
However, the rate of performance degradation of both popularity prediction and hashtag recommendation models would not be high. In general, gaining popularity takes time for users, posts as well as hashtags. Similarly, their popularity persists for an even longer time. In other words, the overall distribution of popularity of various users, posts, hashtags and words does not change drastically in a short period of time. Our main motive is to rank the hashtags and select the highest ranked ones for recommendation. The scores of the hashtags act as a guide for this task. Therefore, slight changes in the actual scores of hashtags do not change the order or rank of hashtags much. As a result, even if our models do not capture these slight changes in actual scores, we will still be recommending the highest ranked hashtags. However, there might be a problem if there is a large change in the actual scores. Nevertheless, large changes in scores will practically take a considerable amount of time, as stated before. Therefore, if the interval of time after which matrices are reconstructed and models are retrained is adjusted while keeping the average time taken for gaining popularity in mind, then our proposed framework will continue to work smoothly amid the dynamism of social media. This interval of time may vary for different social media platforms depending on their general usage pattern.
At the time when a new hashtag, user, post or word is just introduced, it is not very useful to include it in our models. Our main goal is to recommend hashtags which are already popular and, while doing so, we make use of users, posts and words that are already popular. Naturally, a newly introduced hashtag, word, user or post has practically zero popularity. Therefore, these newly introduced entities will not affect the selection of the highest ranked hashtags. As a result, it might not be required to include newly introduced entities into our system at the time of their introduction or creation. As stated before, these nascent entities will take time to gain considerable popularity. Therefore, inclusion of these entities can be safely delayed. Hence, whenever the models are retrained and matrices are reconstructed, these new entities can be incorporated into our framework. In total, to adjust to the dynamic nature of social media, the hashtag recommendation matrices need to be reconstructed and the popularity prediction models need to be retrained after specific intervals of time, and this interval should be selected as per the usage pattern of the concerned social media platform. Having described the complete algorithm of TRANSIT, we state the details of the experimental setup in the next section.
Experimental evaluation
This section comprises the details of the experiments we conducted. We describe the details of data collection, cleaning and processing, computation of ground truth popularities, data splitting, performance of popularity prediction models and our final results in the following subsections.
Data collection, cleaning and processing
Data collection
In order to perform the experiments, we collected data from Twitter through the Twitter API using the Tweepy library. In order to increase the diversity, we collected nearly an equal number of posts for each keyword. Moreover, we accumulated nearly ten keywords for every topic and repeated this for various topics such as politics, wildlife, food, travel, sports, technology, entertainment and many more. For instance, under the topic “Entertainment”, we listed various keywords like #movies, #music, #comedy, #celebrities among many others. Similarly, some of the keywords under the topic “Career” included #students, #engineers, #hiring and many more. In a similar fashion, we listed all the topics and their keywords manually.
Data cleaning and processing
Due to the rapid and volatile nature of Twitter, the collected data were noisy. Data were cleaned by removing posts which were duplicates, had null values, were romanized or were not in English. Multilingual posts can also be taken into consideration for future work. Further, only those posts were retained which consisted solely of text. Multimodal content can be treated as part of future research work.
The cleaned data were pre-processed by removing links, converting text to lowercase, removing mentions and removing non-alphanumeric characters except full stop and space. Further, hashtags were extracted from these pre-processed posts. Different attributes of posts were extracted such as user_id, original post, hashtags used in the post, post id, number of likes, number of comments, number of shares and number of followers of the user who created that post. The final dataset comprises 10,298 users with a total of 15,000 posts containing 21,536 hashtags.
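The pre-processing steps listed above could look roughly like the sketch below. The regular expressions and the function name are assumptions, not the authors' code. In particular, the sketch extracts hashtags before the `#` character is stripped along with the other non-alphanumeric characters, which is an assumed ordering needed to keep the hashtags recoverable.

```python
import re


def preprocess(text):
    """Return (cleaned_text, hashtags) for one post."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove links
    text = text.lower()                                  # convert to lowercase
    text = re.sub(r"@\w+", " ", text)                    # remove mentions
    hashtags = re.findall(r"#(\w+)", text)               # extract before '#' is stripped
    text = re.sub(r"[^a-z0-9. ]", " ", text)             # keep alphanumerics, '.', space
    text = re.sub(r" +", " ", text).strip()              # collapse repeated spaces
    return text, hashtags
```

Links are removed first so that the later character filter does not leave fragments of URLs behind in the cleaned text.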
Ground truth popularity computation
Ground truth popularity is calculated using various engagement metrics such as followers, shares, likes and comments, and this comes under the scope of data processing. If a post scores higher on these engagement metrics, it indicates that many users have consumed and reacted to that post. This increases the radiality of the post, getting it suggested in the feeds of other users as well and leading to higher popularity. Hence, the popularity of a post is directly proportional to its likes, shares and comments. Through these observations we can derive a formula for calculating post popularity as
| 8 |
Here, , , are proportionality weights, L refers to total likes of the post, C refers to total comments of the post, S refers to total shares of the post. In a real-life scenario, if a user gets interested in a post, then that user hits the like button. However, if the user is more interested in that post, then that user also posts a comment on the post. Moreover, commenting on a post also makes that post available in the feeds of friends of that user. If the user is really influenced by the post, then that user explicitly shares that post with their network.
This would lead to more users engaging with that post, leading to even higher popularity. Therefore, a share has the highest contribution to the popularity of the post, with comments being the second highest and likes being the third highest. We can accordingly state that the weight of shares is greater than the weight of comments, which in turn is greater than the weight of likes. A user is considered popular if the number of followers of that user is high. Further, if a post created by that user gains a lot of traction, which is measured in terms of engagement metrics of that post, the consumers of content of that post also tend to be more interested in that user and might eventually follow that user. Hence, the social popularity of a user is directly proportional to the number of followers of that user as well as the sum of social popularity of the posts created by that user. Therefore, user popularity can be calculated as
| 9 |
Here, , , , are proportionality weights, L refers to total likes of all posts, C refers to total comments of all posts, S refers to total shares of all posts created by that user and F denotes the number of followers of that user. eps is a small value added to both the factors so that if one of the factors is zero, then it does not zero out the other factor. Ground truth popularities are calculated in order to feed into the transition matrices which are used to recommend hashtags that will increase the popularity. These formulas are robust and reliable as they cover all the angles and characteristics that make an entity popular and more visible on social media. The post popularity varies from 0.0 (minimum value) to 89,540.0 (maximum value). Mean was recorded to be 105.2 and standard deviation was recorded as 1408.4. The user popularity varies from 1.0 (minimum value) to 7.95 (maximum value). Mean was recorded to be 1.00 and standard deviation was recorded as 7.86 .
Data splitting
This work comprises two subparts, namely hashtag recommendation for popularity boosting and popularity prediction. Therefore, the complete dataset is split with ratio of into two datasets, namely the popularity prediction dataset and the hashtag recommendation dataset. Splitting must satisfy the condition that both subparts contain the same hashtags while the number of duplicate posts across them is minimized. The motive behind this condition is to maintain consistency in the hashtag encoding features in both the popularity prediction dataset and the hashtag recommendation dataset.
Our algorithm, as stated in Algorithm-1, divides the dataset into two parts such that both these datasets have the same set of hashtags. However, this is only one of multiple possible ways of splitting the data. The hashtag recommendation dataset consists of 7851 users having 10,130 posts. It contains 21,536 hashtags and 34,951 unique words (non-stop words). The popularity prediction dataset consists of 8499 users having 11,467 posts. It contains 21,536 hashtags.
Popularity prediction models
We conducted experiments on various Machine Learning models for the task of popularity prediction, namely Support Vector Regression (SVR) (Drucker et al. 1996; Awad and Khanna 2015), Extreme Gradient Boosting (XGBoost) (Carmona et al. 2019), Random Forest (Biau and Scornet 2016; Huang et al. 2018) and Deep Learning (Nguyen et al. 2017). All the mentioned models were suitable for our use case. We used the grid search technique to obtain the best hyperparameter values for SVR, XGBoost and Random Forest. We constructed a custom deep learning framework with six layers containing 1024, 256, 64, 16, 4 and 1 neurons, respectively, trained using the Adam optimizer for 100 epochs with a learning rate of 0.05. We also experimented with an ensemble model which averages the predicted popularities of the above four models. We computed the mean absolute errors for the four models. The training error is obtained on the popularity prediction dataset and the testing error is evaluated on the hashtag recommendation dataset. SVR achieved the best results among all. The training mean absolute error for SVR was 8.79 and the testing error was 9.81, obtained by setting epsilon (eps) to 1 and the regularization parameter (C) to 7, with a maximum of 12,500 iterations for convergence. These values are justified if we consider the distribution of ground truth post popularities of our dataset (refer Sect. 6.2). These figures are reported for posts that come under either the low popularity zone or the average popularity zone.
Experimental results
We introduced a novel method for evaluating the popularity of posts annotated with recommended hashtags, in contrast to the current simulation-based tests. Simulation-based tests consist of using recommended hashtags in a post on a social media platform such as Instagram, Twitter or Facebook and recording the engagement metrics such as likes, shares and comments after certain time intervals. The results of various recommendation methods were computed and compared as the post got exposed to real-world traffic (Yamasaki et al. 2017; Wang et al. 2019). The real-world simulation includes significant irregularities leading to unpredictable results owing to the dynamic nature of social media, as explained in Sect. 5. Our evaluation procedure consists of comparing the predicted popularity of posts annotated with hashtags that were recommended using the transition matrix to the ground truth popularity of the post. A feature vector is constructed by using the post features encoding and the hashtag encoding of the hashtags recommended using the transition matrix. This vector is fed into the trained model for predicting the post popularity. The predicted popularities are compared with the corresponding ground truth popularities of posts. The percentage of posts is calculated where the numerical value of predicted popularity is more than the ground truth popularity. This percentage is called the success rate of the transition matrix which was used to recommend the hashtags. This procedure is conducted in three subdivisions on the basis of ground truth popularity values, i.e., low, average and high. However, high popularity is dropped as there is no need to recommend hashtags for already highly popular posts.
Having stated the evaluation algorithm, we present our experimental results in three subsections. We first compare the effectiveness of our method with that of the existing methods. Further, we present a qualitative analysis of the hashtags recommended by different methods for some example posts taken from our collected dataset. Lastly, we illustrate our analysis of the popularity zones for which the success rates are computed.
Effectiveness comparison
In this section, we present the success rates of each of the existing methods and of our proposed method for both the low and average popularity zones. As shown in Figs. 2 and 3, our evaluation algorithm is consistent with the order of performance of the existing methods. AUFP is the current state-of-the-art method and has a success rate of 79.33% and 16.98% in low and average popularity zones, respectively. This is higher than other existing methods like AUP (60.51% in low and 11.63% in average), AFP (72.52% in low and 14.44% in average) and (62.41% in low and 12.35% in average). We constructed the matrix AWP in Sect. 5 while stating our proposed method AUFWP. We also used AWP as a transition matrix and obtained 72.11% success rate in low popularity posts and 16.58% success rate in average popularity posts.
Fig. 2. Success rates of different transition matrices in low popularity zone
Fig. 3. Success rates of different transition matrices in avg. popularity zone
Our proposed method AUFWP surpasses all methods with a success rate of 82.81% in low popularity posts and 18.81% in average popularity posts. This is a significant improvement in success rate of 3.5 percentage points in the low popularity zone and 1.8 percentage points in the average popularity zone over the current state-of-the-art method AUFP. Further, AUFWP has an improvement of 22.3 percentage points in the low and 7.17 percentage points in the average popularity zone over AUP. Moreover, our method surpasses AFP by 10.29 percentage points in the low and 4.36 percentage points in the average popularity zone. Our proposed method incorporates the factor of keywords besides that of users and posts. Therefore, our proposed method produces better results than the existing methods and recommends hashtags which boost the popularity while being relevant to the context of a post.
Qualitative analysis
In this section, we present a qualitative analysis of the hashtags recommended by the existing methods and by our proposed method for two posts taken from the hashtag recommendation dataset. Fig. 4a shows a tweet with 21 ground truth hashtags and a ground truth popularity of 12.00. Hashtags that are relevant to the post are marked in green, whereas irrelevant hashtags are marked in red. AUFWP delivers the highest popularity among all the methods, with a predicted popularity of 36.96. AUFWP recommends 19 relevant hashtags, which shows that our proposed method performs on par with the state-of-the-art method in terms of relevance.
Fig. 4.
Example posts depicting hashtags recommended by different transition matrices along with predicted popularities
Other methods like , AWP and AFP perform very well in terms of relevance with all 21 hashtags being relevant to the post. However, they fail in delivering high popularity. On the other hand, AUP delivers relatively higher popularity but it fails in recommending relevant hashtags. Similarly, in Fig. 4b, there is a tweet with 11 ground truth hashtags with ground truth popularity of 1.00. AUFWP outperforms all other methods with a projected popularity of 18.44.
Further, as it recommends ten relevant hashtags, AUFWP is on par with AUFP in terms of relevance. AFP and AWP perform quite well on relevance, with all 11 hashtags marked green, but fail to achieve high popularity. On the other hand, and AUP perform quite well in boosting popularity but fail on relevance. We also measured the Relevance Score between the hashtags recommended by our proposed method and the concerned posts using TweetBERT. For each post, the Relevance Score is the average of the cosine similarities between the encodings of the recommended hashtags and the encoding of the post. Averaging the Relevance Score over all posts of the hashtag recommendation dataset, we obtained a score of 0.927. Therefore, our proposed framework recommends hashtags which enhance popularity while also being semantically relevant to the concerned post.
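The Relevance Score described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the encodings are assumed to be non-zero vectors of equal length, and the toy three-dimensional vectors merely stand in for TweetBERT encodings, which are not computed here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relevance_score(post_vec, hashtag_vecs):
    """Average cosine similarity between one post encoding and the
    encodings of the hashtags recommended for that post."""
    sims = [cosine_similarity(post_vec, h) for h in hashtag_vecs]
    return sum(sims) / len(sims)


def dataset_relevance(posts):
    """Mean of the per-post relevance scores.
    `posts` is a list of (post_vec, hashtag_vecs) pairs."""
    scores = [relevance_score(p, hs) for p, hs in posts]
    return sum(scores) / len(scores)


# Toy vectors standing in for real encodings (assumption, for illustration).
post = [1.0, 0.0, 0.0]
hashtags = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(relevance_score(post, hashtags))
```

In the toy example, one hashtag vector coincides with the post vector and the other is orthogonal to it, so the score sits halfway between full and no similarity.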
Distribution of post popularity
In this section, we describe the low, average and high popularity zones. The numerical value of popularity ranges from 0.0 to 89,540.0, where the upper bound for low popularity is 12 and the upper bound for average popularity is 100. These upper bounds were chosen based on the number of posts that fall into each category. The 11,467 posts of the popularity prediction dataset are divided into low, average and high popularity zones based on their ground truth popularities, which yields 7575 posts in the low popularity zone, 2782 posts in the average popularity zone and 1110 posts in the high popularity zone. Similarly, the 10,130 posts of the hashtag recommendation dataset are divided into 6627 posts in the low popularity zone, 2527 posts in the average popularity zone and 976 posts in the high popularity zone.
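The zone assignment and the resulting split can be sketched as below. This is a hedged illustration rather than the original implementation: the function names are hypothetical, and treating both upper bounds as inclusive is an assumption, since the text does not say which zone a post with popularity exactly 12 or 100 belongs to. The counts are those reported above.

```python
from collections import Counter

LOW_UPPER = 12.0       # upper bound of the low popularity zone
AVERAGE_UPPER = 100.0  # upper bound of the average popularity zone


def popularity_zone(popularity):
    """Map a popularity value to a zone (bounds assumed inclusive)."""
    if popularity <= LOW_UPPER:
        return "low"
    if popularity <= AVERAGE_UPPER:
        return "average"
    return "high"


def zone_counts(popularities):
    """Count how many posts fall into each zone."""
    return Counter(popularity_zone(p) for p in popularities)


def zone_shares(counts):
    """Convert zone counts into percentages of the dataset."""
    total = sum(counts.values())
    return {zone: round(100 * n / total, 1) for zone, n in counts.items()}


# Zone counts reported in the text for the two datasets.
reported = {
    "popularity prediction": {"low": 7575, "average": 2782, "high": 1110},
    "hashtag recommendation": {"low": 6627, "average": 2527, "high": 976},
}
for name, counts in reported.items():
    print(name, sum(counts.values()), zone_shares(counts))
```

Expressing the counts as shares makes the two datasets directly comparable despite their different sizes.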
In real life, the majority of posts never become popular on social media, some posts gain moderately higher popularity, and very few posts gain very high popularity. It can be seen from Fig. 5 that the distribution of posts among the popularity zones is similar for both datasets and is consistent with this real-life scenario.
Fig. 5. Distribution of posts in popularity zones
Conclusion
In this paper, we introduced a novel method, TRANSIT, to recommend hashtags that enhance the popularity of a social media post. Our proposed method utilizes word popularity in addition to user and post popularity. We also proposed a novel evaluation algorithm which is more reliable and uniform than the current simulation-based techniques. The experimental results showed that the proposed method improves the success rate over the current state-of-the-art method by about 3.5 percentage points for low popularity posts and 1.8 percentage points for average popularity posts. Thus, our proposed transition matrix-based hashtag recommendation method enhances the social popularity of posts while also recommending relevant hashtags.
Author contribution
Purnadip Chakrabarti and Eish Malvi handled the methodology, material preparation, data gathering, analysis, and preparation of the original draft. Shubhi Bansal contributed to the design and conception of the study, methodology, and data collection, along with editing the original draft. Dr. Nagendra Kumar provided the conceptualization, methodology, investigation, supervision, and review and editing of the manuscript.
Declarations
Conflict of interest
The authors declare no conflict of interest.
Contributor Information
Purnadip Chakrabarti, Email: ee190002048@iiti.ac.in.
Eish Malvi, Email: cse190001015@iiti.ac.in.
Shubhi Bansal, Email: phd2001201007@iiti.ac.in.
Nagendra Kumar, Email: nagendra@iiti.ac.in.
References
- Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD international conference on Management of data, pp 207–216
- Andujar A (2020) Analysing WhatsApp and Instagram as blended learning tools. In: Recent tools for computer-and mobile-assisted foreign language learning. IGI Global, pp 307–321
- Awad M, Khanna R (2015) Support vector regression. In: Efficient learning machines. Springer, pp 67–80
- Baltaci S, Ersoz AR (2022) Social media engagement, fear of missing out and problematic internet use in secondary school children. Int Online J Educ Sci 14(1)
- Bansal S, Gowda K, Kumar N (2022) A hybrid deep neural network for multimodal personalized hashtag recommendation. In: IEEE transactions on computational social systems
- Ben-Lhachemi N, et al (2018) Using tweets embeddings for hashtag recommendation in Twitter. Procedia Comput Sci 127:7–15. doi: 10.1016/j.procs.2018.01.092
- Biau G, Scornet E (2016) A random forest guided tour. Test 25(2):197–227. doi: 10.1007/s11749-016-0481-7
- Bidoni ZB, George R, Shujaee K (2014) A generalization of the pagerank algorithm. In: ICDS 2014, the eighth international conference on digital society, pp 108–113
- Caleffi P-M (2015) The ‘hashtag’: a new word or a new rule? SKASE J Theor Linguist 12(2):46–70
- Cantini R, Marozzo F, Bruno G, Trunfio P (2021) Learning sentence-to-hashtags semantic mapping for hashtag recommendation on microblogs. ACM Trans Knowl Discov Data (TKDD) 16(2):1–26
- Carmona P, Climent F, Momparler A (2019) Predicting failure in the US banking sector: an extreme gradient boosting approach. Int Rev Econ Finance 61:304–323. doi: 10.1016/j.iref.2018.03.008
- Chang H-C (2010) A new perspective on Twitter hashtag use: diffusion of innovation theory. Proc Am Soc Inf Sci Technol 47(1):1–4
- Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional transformers for language understanding. In: arXiv preprint. arXiv:1810.04805
- Drucker H, Burges CJ, Kaufman L, Smola A, Vapnik V (1996) Support vector regression machines. In: Advances in neural information processing systems, p 9
- Ferragina P, Piccinno F, Santoro R (2015) On analyzing hashtags in twitter. Proc Int AAAI Conf Web Soc Media 9(1):110–119. doi: 10.1609/icwsm.v9i1.14584
- Gemmell J, Schimoler T, Ramezani M, Christiansen L, Mobasher B (2009) Improving folkrank with item-based collaborative filtering. In: Recommender systems and the social web
- Guan Z, Bu J, Mei Q, Chen C, Wang C (2009) Personalized tag recommendation using graph-based ranking on multi-type interrelated objects. In: Proceedings of the 32nd international ACM SIGIR conference on research and development in information retrieval, pp 540–547
- He X, Gao M, Kan M-Y, Liu Y, Sugiyama K (2014) Predicting the popularity of web 2.0 items based on user comments. In: Proceedings of the 37th international ACM SIGIR conference on research and development in information retrieval, pp 233–242
- Hong Yu, Zhou B, Deng M, Feng H (2018) Tag recommendation method in folksonomy based on user tagging status. J Intell Inf Syst 50(3):479–500. doi: 10.1007/s10844-017-0468-1
- Hotho A, Jäschke R, Schmitz C, Stumme G (2006) Folkrank: a ranking algorithm for folksonomies
- Hu J, Yamasaki T, Aizawa K (2017) Tag recommendations in social media for popularity boosting. In: ITE technical report 41.05 multimedia storage (MMS)/consumer electronics (CE)/human information (HI)/media engineering (ME)/artistic image technology (AIT). The Institute of Image Information and Television Engineers, pp 209–214
- Huang F, Chen J, Lin Z, Kang P, Yang Z (2018) Random forest exploiting post-related and user-related features for social media popularity prediction. In: Proceedings of the 26th ACM international conference on Multimedia, pp 2013–2017
- Ibba S, Orrù M, Pani FE, Porru S (2015) Hashtag of instagram: from folksonomy to complex network. In: KEOD, pp 279–284
- Jäschke R, Marinho L, Hotho A, Schmidt-Thieme L, Stumme G (2007) Tag recommendations in folksonomies. In: European conference on principles of data mining and knowledge discovery. Springer, pp 506–514
- Karthikeyan K, Wang Z, Mayhew S, Roth D (2019) Cross-lingual ability of multilingual BERT: an empirical study. In: International conference on learning representations
- Kumar N, Baskaran E, Konjengbam A, Singh M (2021) Hashtag recommendation for short social media texts using word-embeddings and external knowledge. Knowl Inf Syst 63(1):175–198. doi: 10.1007/s10115-020-01515-7
- Landia N, Anand SS, Hotho A, Jäschke R, Doerfel S, Mitzlaff F (2012) Extending FolkRank with content data. In: Proceedings of the 4th ACM RecSys workshop on recommender systems and the social web, pp 1–8
- Li Y, Liu T, Jingwen H, Jiang J (2019) Topical co-attention networks for hashtag recommendation on microblogs. Neurocomputing 331:356–365. doi: 10.1016/j.neucom.2018.11.057
- Liang H, Xu Y, Li Y, Nayak R, Tao X (2010) Connecting users and items with weighted tags for personalized item recommendations. In: Proceedings of the 21st ACM conference on Hypertext and hypermedia, pp 51–60
- Lops P, De Gemmis M, Semeraro G, Musto C, Narducci F (2013) Content-based and collaborative techniques for tag recommendation: an empirical evaluation. J Intell Inf Syst 40(1):41–61. doi: 10.1007/s10844-012-0215-6
- Ma Z, Sun A, Cong G (2012) Will this hashtag be popular tomorrow? In: Proceedings of the 35th international ACM SIGIR conference on research and development in information retrieval, pp 1173–1174
- Manning CD, Surdeanu M, Bauer J, Finkel JR, Bethard S, McClosky D (2014) The Stanford CoreNLP natural language processing toolkit. In: Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pp 55–60
- Meghawat M, Yadav S, Mahata D, Yin Y, Shah RR, Zimmermann R (2018) A multimodal approach to predict social media popularity. In: 2018 IEEE conference on multimedia information processing and retrieval (MIPR). IEEE, pp 190–195
- Mihalcea R, Tarau P (2004) Textrank: bringing order into text. In: Proceedings of the 2004 conference on empirical methods in natural language processing, pp 404–411
- Nguyen HTH, Wistuba M, Grabocka J, Drumond LR, Schmidt-Thieme L (2017) Personalized deep learning for tag recommendation. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 186–197
- Page L, Brin S, Motwani R, Winograd T (1999) The PageRank citation ranking: bringing order to the web. Tech. rep. Stanford InfoLab
- Qudar MMA, Mago V (2020) Tweetbert: a pretrained language representation model for twitter text analysis. In: arXiv preprint. arXiv:2010.11091
- Si X, Liu Z, Li P, Jiang Q, Sun M (2009) Content-based and graph-based tag suggestion. In: DC@ PKDD/ECML
- Sigurbjörnsson B, Van Zwol R (2008) Flickr tag recommendation based on collective knowledge. In: Proceedings of the 17th international conference on World Wide Web, pp 327–336
- Wang X, Zhang Y, Yamasaki T (2019) User-aware folk popularity rank: user-popularity-based tag recommendation that can enhance social popularity. In: Proceedings of the 27th ACM international conference on multimedia, pp 1970–1978
- Wang X, Zhang Y, Yamasaki T (2020) Earn more social attention: user popularity based tag recommendation system. In: Companion proceedings of the web conference 2020, pp 212–216
- Yamasaki T, Sano S, Aizawa K (2014) Social popularity score: predicting numbers of views, comments, and favorites of social photos using only annotations. In: Proceedings of the first international workshop on internet-scale multimedia management, pp 3–8
- Yamasaki T, Hu J, Sano S, Aizawa K (2017) FolkPopularityRank: tag recommendation for enhancing social popularity using text tags in content sharing services. In: IJCAI, pp 3231–3237
- Zhang Y, Zhang N, Tang J (2009) A collaborative filtering tag recommendation system based on graph. In: ECML PKDD discovery challenge, pp 297–306
- Zohourian A, Sajedi H, Yavary A (2018) Popularity prediction of images and videos on Instagram. In: 2018 4th international conference on web research (ICWR). IEEE, pp 111–117


