PLOS One. 2025 Jan 30;20(1):e0309688. doi: 10.1371/journal.pone.0309688

Signals of propaganda—Detecting and estimating political influences in information spread in social networks

Alon Sela 1,2,*, Omer Neter 3,4, Václav Lohr 5, Petr Cihelka 5, Fan Wang 3, Moti Zwilling 6, John Phillip Sabou 5, Miloš Ulman 5
Editor: Gilad Ravid
PMCID: PMC11781619  PMID: 39883667

Abstract

Social networks are a battlefield for political propaganda. Protected by the anonymity of the internet, political actors use computational propaganda to influence the masses. Their methods include the use of synchronized or individual bots, multiple accounts operated by one social media management tool, or different manipulations of search engines and social network algorithms, all aiming to promote their ideology. While computational propaganda influences modern society, it is hard to measure or detect. Furthermore, with the recent exponential growth of large language models (LLMs), and the growing concerns about information overload, which makes the alternative truth spheres noisier than ever before, the complexity and magnitude of computational propaganda are also expected to increase, making its detection even harder. Propaganda in social networks is disguised as legitimate news sent from authentic users. It smartly blends real users with fake accounts. We seek here to detect efforts to manipulate the spread of information in social networks through one of the fundamental macro-scale properties of rhetoric: repetitiveness. We use 16 data sets of a total size of 13 GB, 10 related to political topics and 6 related to non-political ones (large-scale disasters), each ranging from tens of thousands to a few million tweets. We compare them and identify statistical and network properties that distinguish between these two types of information cascades. These features are based on the repetition distributions of hashtags and user mentions, as well as on the network structure. Together, they enable us to distinguish (p-value = 0.0001) between the two classes of information cascades. In addition, by constructing a bipartite graph connecting words and tweets for each cascade, we develop a quantitative measure and show how it can be used to distinguish between political and non-political discussions.
Our method is indifferent to the cascade's country of origin, language, or cultural background, since it is based only on the statistical properties of repetitiveness and on the structure of the word-to-tweet bipartite networks.

Introduction

In most democratic systems, politicians need to convince the masses to choose them over other candidates. One important technique that is commonly used in political struggles includes repetitive broadcasting [1, 2] of a simple and clear message. These focused and short messages, also called slogans [3], tend to be catchy, simple, and clear. Examples of such political slogans are “Make America Great Again” used by Donald Trump in the US Presidential Election of 2016, or “Yes We Can” used by the Obama 2008 campaign. Such slogans in political campaigns have a greater degree of repetitiveness compared to “normal” non-political discussions.

Pre-election political messages tend to be more aggressive [4], populistic [5] and in general, use similar techniques as the ones used in commercial communication [6, 7]. These commercial communication techniques repetitively broadcast a few well-defined messages, to penetrate the minds of the exposed audience and affect the memory recall for a product [8] or similarly, a political player.

Political campaigns use massive broadcasts to spread their messages and capture the voters' attention. Such mass broadcasts are generally performed across many media in parallel and in high volumes. They have become an essential technique in modern Russian propaganda. Such propaganda methods use a "rapid, continuous, and repetitive" broadcast, in a "high numbers of channels", and have "a shameless willingness to disseminate partial truths or outright fictions" [9]. While such propaganda techniques, sometimes referred to as "brainwashing", are associated with Communist or Fascist regimes, it is naive to assume they are not applied in modern Western democracies.

For example, massive use of bots to increase the repetitiveness of a message has been detected in the Canadian [10], Indian [11], UK [12], and US [13] elections, and bots are probably operating in any country where elections occur. Bots are complex automated accounts that operate in a social network and help spread their master's messages. Bot operations include a wide range of techniques, e.g., fake accounts, cyborgs (human-machine cooperation), and search engine manipulations [10, 12–16]. Furthermore, with the latest developments in large language model techniques, and the effort to shrink these models' computational requirements so that they can be used by everyone, bots are expected to become even more sophisticated and harder to detect.

In addition to bots (some of which are illegal since they pretend to be real humans), information spreaders also use legal methods to increase the spread of a political agenda and one's influence, including the use of opinion leaders or social media management platforms [14, 17]. One should note, however, that we use the term "propaganda" in this study in the context of an effort to change the public's opinion. While the initial use of this term has mainly been associated with totalitarian regimes, it has also been used for many years in the softer domain of marketing [18]. We use this term in its wider and contemporary context [19], which defines propaganda as tools that include "filtered digital content, targeted advertising, and differential product pricing to online users". We follow this line and consider propaganda to be any intended effort to push a message or an agenda to audiences as large as possible, as opposed to simply publishing the message on the web and having others comment on and discuss its content through a natural discussion. This wide definition enables us to move from the negative association of "bad Russia vs. good USA" to a more quantitative and neutral definition that includes any intervention in the natural process of information spread. In this respect, any centralized (or decentralized) effort to change the opinion of the masses is, to some degree, seen as propaganda, even when the causes are good and true.

As such, the use of bots, either by a classical Russian information bureau or by a commercial communication firm is considered here as some type of propaganda. For a broader review of the topic of bots, cyborgs, and other modern information manipulations, their detection methods, their working mechanism, and their goals, we recommend the broad taxonomy of [15].

Social network platforms such as Twitter (now named "X"), YouTube, and Facebook constantly struggle against powerful and sophisticated entities that try to manipulate their platforms. These manipulations, whether by bots or by other algorithmic methods, operate on behalf of countries, state-sponsored agencies, individual politicians, stock manipulators, or simply people aiming to spread conspiracy theories. All these parties constantly try to change and influence the opinions in their society by massively broadcasting specific agendas on social platforms.

In the current study, we compare 22 data sets of large information cascades: 10 cascades related to political topics and 12 related to non-political ones. As a comparison group for the political cascades, we use several non-political data sets related to disasters. We believe that disaster-related information cascades offer lower direct gains from the manipulation of information and are therefore probably a good candidate for a comparison group to the political cascades. This belief is based on the following arguments:

  1. Natural disaster-related information cascades might be less biased because a disaster or national emergency is experienced by larger portions of the society (less likely to be extremist or marginal parts of society). This applies even if the disaster is regional, e.g., the 9/11 terror attacks, the 2011 Fukushima nuclear disaster, the 2013 Vanuatu tsunami, the 2015 Nepal earthquake, the 2017 Gulf Coast hurricanes, etc. In these cases, many of the users expressing their opinions are “regular users” with no aim nor experience in methods to spread their message. They communicate on social networks simply to share, get informed, and help others.

  2. People’s tendency to share their experiences in hard times is natural. It is not done for the sake of financial or power gains, but rather as a natural humanistic act [20–22].

  3. Disaster cascades are less likely to be manipulated because there is less to gain from such events. Moreover, in the early stages of a disaster, its final consequences are still unclear. While politicians might associate themselves with the good management of a catastrophic event, they will probably be more careful not to be perceived as opportunistically promoting themselves on the backs of their suffering people.

  4. Overall, disaster events seem to reflect a genuine human need of “regular people” to share and seek information. Political discussions, on the other hand, aim to convince others of the rightness of the ideology of one party and the falseness of the ideology of the opposing party.

We therefore hypothesize that political cascades, as opposed to disaster cascades, will include higher levels of repetitiveness due to their rhetorical nature. By comparing the statistical properties of the politically related cascades to those related to disasters, we hope to differentiate between these two types of information cascades, thus possibly revealing large-scale signals of external interventions and information manipulation, which we name "Signals of Propaganda".

The rest of the article is ordered as follows. In the Background section, we introduce some of the most relevant literature related to "classical" and computational propaganda. We also explain some essential theory related to the power law distribution, which is the core basis of our statistical analysis. The Materials and Methods section then describes the data sets, the data collection process, and the methods used to study the power law distribution [23, 24]. The code used to conduct the main analysis of this part can be found in the repository (DOI 10.5281/zenodo.10805274) of the work. We also show additional separating properties of the two types of cascades based on their transformation to a bipartite graph connecting words and tweets, and develop a quantitative measure to detect bias. The Results section presents a comparative visualization of the two types of cascades based on these separative features. It also presents the separation achieved by the mathematical quantity that we develop to distinguish between political and non-political cascades. The Discussion and Conclusion sections finalize the study, address some known limitations, and propose possible future research directions.

Background

Historical overview of propaganda

Propaganda is an inseparable part of the history of wars. It was used in ancient times as a psychological method to weaken one's enemies by spreading blends of scary stories with mystic, unrealistic, or grandiose events: stories that strengthened one king's image while weakening the perceived power of the opponent. For example, Roman troops were known for their brutal ways of planting fear in their opponents' hearts. These methods helped them defeat rebellious populations, as horror stories of their cruel war acts moved faster than the troops themselves [25]. In the early 20th century, propaganda was redefined [26]. Both the Nazi fascist propaganda machine and the Communist Cold War propaganda methods used mass media to mold collective beliefs and control their citizens. The most important difference between 20th and 21st century propaganda is the medium in use. Cold War propaganda mainly used mass media (radio, newspapers, or TV) as its medium for spreading ideology. Unlike mass media, which can be controlled by governments, social networks, especially when consumed via smartphone applications, are based on many-to-many communication channels. This forms a different spreading dynamic that requires the multiplication of social network accounts, i.e., the use of fake personas, to spread messages by governmental agencies and private parties alike, and also to trick and manipulate the recommendation agents (search engines) [27] that connect news to the users consuming it.

Propaganda and the repetition of messages

The relation between the degree of repetition of a message and its effect on the person receiving it is complex. The experimental work of [8] found that repetition first increases the rate of recall from one's memory, but with more than 4 or 5 exposures, further repetitions decrease recall. Other studies found similar results but also claimed that the persuasiveness level of the message mediates this effect [28]. In general, politicians tend to use higher than average levels of repetition in their rhetorical speeches [29]. The Nazi Propaganda Minister Joseph Goebbels, one of the most notorious propaganda masters, summarized his principles of propaganda in his diary [30], where he claimed that propaganda "must be utilized again and again", that "A propaganda theme must be repeated, but not beyond some point of diminishing effectiveness", and that "propaganda must label events and people with distinctive phrases or slogans".

Computational propaganda

Modern propaganda is sometimes known as computational propaganda [12, 13, 31]. This new type of propaganda abuses the same old sociological and psychological patterns as 20th century propaganda while using new technologies and tools. It is recognized by repetitiveness, a blend of true and false messages, and the use of fear and anger to help spread the messages. Computational propaganda methods include, for example, organized groups that disseminate similar messages on online social media platforms [32] and the synchronization of these groups to increase their web presence. Techniques such as an early "ping-pong"-like exchange of messages within Spreading Groups [14] enhance the future spread of a message and propagate it to larger audiences.

The operation of groups of bots blended with real users (also named "cyborgs") is another method to trick the social network algorithms [33]. Twitter and Facebook officially publicize their efforts to remove such groups of bots and fake accounts from their platforms, but in parallel they permit the creation of automated programmable APIs [34] that enable the construction of such tools. Regardless of the debate over whether the media giants can or cannot stop the use of bots, it is clear that the number of bots continuously grows [35]. Bots were found to influence public opinion in Chinese political cascades on Twitter and Weibo. Interestingly, not only anti-governmental messages use bots, but also pro-government actors [36]. Bots were found to operate in the UK pre-Brexit debate [12] and are used by countries as well as by private entities that operate groups of bots, cyborgs, and trolls [31] to spread their messages.

While most researchers agree that bots operate on most social network platforms, their identification seems to be becoming harder. The recent advancements in large language models [37, 38] such as GPT-4–5, with the ability of AI to smartly imitate human patterns [39], increase the sophistication of bot operations on social platforms. The sophistication of bots is growing and is expected to grow further. Also, in the period of Information Overload [40], we are likely to see an increasing noise-to-signal ratio between alternative and real truth, making the detection of the former harder. While the detection of bots and similar professional information spreaders is likely to become harder than ever before, the goal of bots remains unchanged: to increase the size of their audiences as they spread info-bites to us, their human customers. Thus, we need a method to capture their influence through the statistical properties of manipulated information spread. These directions are explained in the next section.

Long tailed distribution in information cascades

The normal distribution is the most fundamental distribution in statistics and is applied in almost every scientific discipline. Its importance derives from the Central Limit Theorem and the fact that the sum (or average) of a large enough number of repeated experiments, e.g., independent coin flips with a probability p of success, is distributed according to the normal distribution. This robust statistical tool has an important underlying assumption that needs to be fulfilled: that the result of each experiment does not depend on previous results. When a researcher inspects the effect of a given drug on a defined disease, the researcher first needs to verify that there is no interaction between the different subjects of the experiment, where the success of one trial would influence the outcome of another.

In some cases, however, this assumption does not hold. If a researcher studies the sizes of cities, the growth of a city might not be independent of its current size. A large city attracts more people than a small town; thus, city growth is not independent, but rather dependent on size. City sizes will thus be distributed according to a power law distribution [41] and not a normal distribution, as would be expected in the case of independent trials.

This idea is critical for the analysis of information cascades. Power law distributions are found in a different class of phenomena, where positive feedback exists. City populations, the intensity of wars, the sizes of electrical blackouts, the relation between palm heights and diameters [42], or the number of citations of academic articles all have positive feedback in their growth dynamics and, as can easily be seen, are all distributed according to power law (or long-tailed) distributions [41]. While there are claims that fitting data to a power law by graphical methods based on a linear fit on the log-log scale is biased and inaccurate [43], other researchers claim that binning is a good method and that the critiques of the LME methods have no real basis [44]. We thus applied exponential binning to the data, which helps remove the possible bias due to the long tail of the distribution. In regard to information cascades, such positive feedback exists in modern information cascades, since the "hotter" a topic is, the more people discuss it. The more it is discussed, the more it captures people's attention, which positively influences the cascade's dimensions. We thus expect information cascades to follow a power law distribution.
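As a rough illustration of this procedure, the following sketch estimates a power-law slope by exponential (logarithmic) binning followed by a least-squares fit on the log-log scale. It assumes numpy; the function name and bin count are ours, not taken from the paper's repository.

```python
import numpy as np

def fit_powerlaw_slope(counts, n_bins=20):
    """Estimate the power-law exponent of a frequency distribution by
    exponential (logarithmic) binning followed by a least-squares fit
    on the log-log scale. `counts` are raw repetition counts, e.g. how
    many times each hashtag appeared in a cascade."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]
    # Exponentially spaced bin edges reduce the noise of the long tail.
    edges = np.logspace(0, np.log10(counts.max() + 1), n_bins)
    hist, edges = np.histogram(counts, bins=edges)
    centers = np.sqrt(edges[:-1] * edges[1:])   # geometric bin centers
    # Normalize by bin width so wide tail bins are not over-weighted.
    density = hist / np.diff(edges)
    mask = density > 0
    # Linear fit on the log-log scale: log(y) = slope * log(x) + intercept.
    slope, intercept = np.polyfit(np.log10(centers[mask]),
                                  np.log10(density[mask]), 1)
    return slope, intercept
```

For long-tailed data, the fitted slope is negative, and its absolute value corresponds to the decay exponent discussed in the Results section.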

The relationship between power laws and propaganda passes through the positive feedback mechanism, which is one of the processes that create power laws. Propaganda is not effective if it does not include the echoing of the message through the authentic audience. We showed in a previous work on this topic [14] that an effective way to spread a message is to echo it first in a recruited group, which we name a "spreading group", and then let the message spread from this echoing outside the group to the authentic users. We also name this group of initial spreaders a "message detonator", since it acts to activate the spread. This process, where authentic users are more likely to believe a message when they are exposed to it from several directions, and thus tend to spread it themselves, is exactly the positive feedback mechanism that we believe creates the power law distribution. Furthermore, the data simply shows that the hashtags and users are distributed according to a long-tailed distribution. This usually implies some dependence or positive feedback in the process, as opposed to a sum of independent Bernoulli processes, which results in a normal distribution.

Materials and methods

The data

We used three different data repositories and over 13 GB of data to analyze the difference between the political and disaster cascades. First, we downloaded 6 data sets related to political information cascades from open source repositories such as Kaggle [45], and 6 other datasets related to disasters from the Digital Library Repository [46]. The sizes of these data sets are shown in S1 and S2 Tables in the SI section. To ensure that the collection and analysis methods complied with the terms and conditions of the data sources, and also due to the sizes of the datasets, we published in the repository of this article the links to all datasets, and only included a small data set to demonstrate the code itself. All code, links to the open-source datasets, and the tables used in this article are found in the repository (DOI 10.5281/zenodo.10805274) of the work.

These data sets are used to compute the slopes of the distributions, reflecting the speed of decay in each field, and to compare these macro-scale properties of political and disaster events. The distributions of users/hashtags on a log-log scale and the computed exponent slopes of these datasets' distributions are presented in Figs 2 and 3, respectively. The data was binned on a logarithmic scale before the slope estimation to eliminate the long-tail bias, a common practice in this field. We further elucidate the reasons for using this method in the Discussion section. The slopes in power law distributions represent the exponent of the distribution $y = c \cdot x^{-\alpha}$, where $\alpha$ is the exponent determining its speed of decay.

For constructing the bipartite graphs, we used the data sets from Kaggle for the political cascades, and for the disasters, we used a third data repository, Figshare [47]. Here, we could not initially get the entire tweet message, which was required to create the word-to-tweet graph, but only the tweet ID. We thus used a Twitter scraping tool named Hydrator [48, 49], which scrapes Twitter and collects tweet messages according to tweet IDs. To use this tool, one first needs to subscribe to the Twitter (now X) developer platform and receive appropriate keys and tokens from the Twitter platform API. We used this tool after verifying that its use fully complies with the terms and conditions of Twitter, as clearly defined in their developer platform [50]. The full list of datasets used to compare the word occurrence on the bipartite graphs is presented in S3 Table in the SI section.

Quantifying the bias in information cascade

Based on the findings described above, we continued and developed a quantitative measure to differentiate between political and non-political cascades. Note that the measure is language-independent since it is based on the internal statistical property of propaganda to include higher repetition rates.

To explain the quantitative measure, one needs to note that the main difference between the distribution of hashtags and the curve-fit line is the deviation of some points from the line toward the upper-left side. This deviation can be observed in the political cascades (left column) of Fig 2. The deviation reflects several hashtags in the political discussion that repeat more often than would be expected.

Based on this pattern, we develop a new measure that captures the political bias, named Power law MSE (PLMSE). The name was given since its development is based on the concept of mean square error (MSE) with an adaptation to fit power law distributions, i.e. PL MSE.

The measure is presented in Eq 1 below, where n is the number of points in the histogram of the cascade, i is the index of each point when the data is sorted from most frequent to least frequent, and $y_i$ is the frequency of the hashtag at index i, i.e., the observed value at x = i. $\alpha$ represents the slope of the least-squares optimal curve-fit line, and c represents the intercept of this linear line on a log-log scale. Note that the first term of the summation, $\frac{1}{i\,(i+1)}$, is the weight of each point in the final PLMSE score. The first points, i.e., i = 1, receive the highest weight. We further discuss this weighting method in the discussion section. The second term in the summation, $\log^2\!\left(\frac{y_i}{c}\cdot i^{\alpha}\right)$, is the squared error of each point, i.e., the distance of each point from the theoretical power law distribution curve on a log scale, where i is the point on the x-axis, $\alpha$ is the power law exponent (which on a log-log scale appears as a simple slope), and c is the intercept. Last, the factor $\frac{n+1}{n}$ ensures that the sum of all the weights in the PLMSE score is always equal to 1, regardless of the number of points n.

$$\mathrm{PLMSE} = \frac{n+1}{n}\cdot\sum_{i=1}^{n}\frac{1}{i\,(i+1)}\cdot\log^{2}\!\left(\frac{y_i}{c}\cdot i^{\alpha}\right) \tag{1}$$
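A direct implementation of Eq 1 might look as follows. This is a minimal numpy sketch; the function name is ours, and we assume $\alpha$ and c were already obtained from the least-squares fit on the log-log scale.

```python
import numpy as np

def plmse(frequencies, alpha, c):
    """Power-law MSE (Eq 1): the weighted squared log-distance of each
    sorted frequency y_i from the fitted power-law curve y = c * i**(-alpha).
    The weights 1/(i*(i+1)), rescaled by (n+1)/n, always sum to 1."""
    y = np.sort(np.asarray(frequencies, dtype=float))[::-1]  # most frequent first
    n = len(y)
    i = np.arange(1, n + 1)
    weights = 1.0 / (i * (i + 1))
    # Squared error on the log scale: log^2((y_i / c) * i**alpha).
    sq_err = np.log((y / c) * i ** alpha) ** 2
    return (n + 1) / n * np.sum(weights * sq_err)
```

For data that follows the fitted curve exactly, the measure is 0; deviations of the most frequent hashtags (small i) dominate the score because of the $\frac{1}{i(i+1)}$ weights.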

Explaining the PLMSE formula

The following section explains the PLMSE formula and each of its components. Overall, Eq 1 has 3 main components: (1) $\frac{n+1}{n}$, (2) $\frac{1}{i\,(i+1)}$, and (3) $\log^{2}\!\left(\frac{y_i}{c}\cdot i^{\alpha}\right)$. Let us start from part (3). We denote by i the index of the bin (x-axis), and similarly $y_i$ is the frequency of the distribution at point i, in our case the y-axis. To compute a simple MSE on a linear line, we need to compute the expression $\frac{1}{n}\cdot\sum_{i=1}^{n}\left(y_i-(a\cdot i+b)\right)^2$, where a and b are the linear line coefficients. In our case the data is not linear, but assuming it follows a power law distribution, it should be linear on a log-log scale.

We thus transform:

  • $\hat{i} : i \mapsto \log(i)$

  • $\hat{y}_i : y_i \mapsto \log(y_i)$

  • line: $\hat{y}_i = -\alpha\cdot\hat{i} + \log(c)$

We now define the squared error in terms of the Log-Log line:

  • $\sum_{i=1}^{n} w_i\cdot\left(\hat{y}_i - \left(-\alpha\cdot\hat{i}+\log(c)\right)\right)^2 = \sum_{i=1}^{n} w_i\cdot\left(\log(y_i) - \left(-\alpha\cdot\log(i)+\log(c)\right)\right)^2$

  • $= \sum_{i=1}^{n} w_i\cdot\left(\log(y_i) - \log(c) + \log\left(i^{\alpha}\right)\right)^2$

  • $= \sum_{i=1}^{n} w_i\cdot\left(\log\left(\frac{y_i}{c}\right) + \log\left(i^{\alpha}\right)\right)^2 = \sum_{i=1}^{n} w_i\cdot\log^{2}\!\left(\frac{y_i}{c}\cdot i^{\alpha}\right)$

As for part (2) of the equation, we constructed modified weights that assign higher weight to the upper-left data points of the fitted line. This uneven weighting is needed because, in a power law distribution, these points represent many of the observations, while the lower-right part of the distribution (the long tail) represents only a few. Thus, to give a fair representation as required to compute the slope properly, and to construct proper weights, first notice that the following identity holds:

$$\sum_{i=1}^{n}\frac{1}{i\,(i+1)} = \sum_{i=1}^{n}\left(\frac{1}{i}-\frac{1}{i+1}\right) = \left(\frac{1}{1}-\frac{1}{2}\right)+\left(\frac{1}{2}-\frac{1}{3}\right)+\cdots+\left(\frac{1}{n}-\frac{1}{n+1}\right) = 1-\frac{1}{n+1} = \frac{n}{n+1}.$$

Then, as for part (1) of the equation, this factor normalizes the weights $w_i = \frac{n+1}{n}\cdot\frac{1}{i\,(i+1)}$ so that their sum in the PLMSE score is always equal to 1:

$$\sum_{i=1}^{n} w_i = \frac{n+1}{n}\cdot\sum_{i=1}^{n}\frac{1}{i\,(i+1)} = 1$$
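The telescoping identity above can be verified exactly with rational arithmetic; a small sketch using only Python's standard library (the function name is ours):

```python
from fractions import Fraction

def weight_sum(n):
    """Sum of the normalized PLMSE weights (n+1)/n * sum_i 1/(i*(i+1)),
    computed exactly; the telescoping sum guarantees the result is 1."""
    partial = sum(Fraction(1, i * (i + 1)) for i in range(1, n + 1))
    return Fraction(n + 1, n) * partial
```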

Bipartite graphs transformations

The comparison between the political and non-political cascades based on hashtag distributions alone can suffer from an internal bias, since political users might tend to use hashtags more often than regular users. To correct this issue, we look at the words used in the tweets, regardless of whether they are hashtags or not. Thus, we transformed the data and constructed a bipartite graph (see the illustration in Fig 1). In this graph, we can compare the word recurrence in the tweets themselves and bypass the possible problem of uneven use of hashtags by political and non-political actors.

Fig 1. Illustration of word-to-tweet bipartite graph.

Fig 1

We started with a conventional stop-word removal, stemming, and lemmatization process that removed common words such as “and”, “or”, “if”, etc., from the tweet text. These steps were performed with the NLTK Python package. Then, for each tweet, we constructed weighted links between the tweet and the words appearing in it, where the weight corresponds to the number of times a word appears in the tweet. Fig 1 illustrates this process. These links form a graph for each cascade, connecting words to tweets. These bipartite graphs were constructed both for the political and for the non-political topics. We computed and compared several graph properties on these bipartite graphs, e.g., degree distribution, number of communities, and average degree. We also looked at network and node properties such as the betweenness centrality, the variance of the nodes' degrees, the average largest component size (GC), and the distribution of component sizes. Last, we used the properties that were found most significant in the bipartite graphs as a separative construct.
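The construction can be sketched as follows. This is a simplified illustration using only the standard library; the tiny stop-word list stands in for the full NLTK stop-word, stemming, and lemmatization pipeline used in the study, and the function names are ours.

```python
import re
from collections import defaultdict

# A tiny illustrative stop-word list; the study used NLTK's full pipeline.
STOPWORDS = {"and", "or", "if", "the", "a", "an", "to", "of", "in", "is", "it"}

def build_word_tweet_graph(tweets):
    """Return a weighted bipartite adjacency mapping: edges[tweet_id][word]
    is the number of times `word` occurs in that tweet after stop-word
    removal. Tweet nodes are on one side, word nodes on the other."""
    edges = defaultdict(lambda: defaultdict(int))
    for tweet_id, text in enumerate(tweets):
        for word in re.findall(r"[#a-z']+", text.lower()):
            if word not in STOPWORDS:
                edges[tweet_id][word] += 1
    return edges

def word_degree(edges, word):
    """Number of distinct tweets a word is connected to (its graph degree)."""
    return sum(1 for tweet_id in edges if word in edges[tweet_id])
```

From such an adjacency one can derive the degree distribution, component sizes, and the other graph properties compared in the study.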

Results

We first show the slopes of each of the 12 distributions: the 6 political cascades and the 6 disaster cascades. We use the least squares method, after binning the data on a logarithmic scale [44], to find the slopes on a log-log scale, and compare the exponent parameter $\alpha$ of the power law distribution $y(x) = c\cdot x^{-\alpha}$ between the two groups.
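The group comparison itself is a standard two-sample t-test on the fitted slopes; sketched below with scipy and purely hypothetical slope values (for illustration only, not the paper's fitted numbers):

```python
from scipy.stats import ttest_ind

# Hypothetical fitted log-log slopes, for illustration only.
political_slopes = [-0.76, -0.85, -0.91, -0.94, -0.98, -1.02]
disaster_slopes = [-1.12, -1.18, -1.21, -1.24, -1.28, -1.31]

# A significant difference (small p-value) would indicate that the hashtag
# frequencies of political cascades decay more slowly than those of disasters.
t_stat, p_value = ttest_ind(political_slopes, disaster_slopes)
```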

We see a clear difference between both the distribution of hashtag repetitions and the distribution of user repetition between the political and the disaster cascades.

The smaller mean slope in the hashtags' appearance for the political cascades reflects a slower decay of hashtag frequency. This is due to some hashtags that repeatedly appear in the political messages (repeating hashtags in political slogans), while for the disaster messages, there is only one such hashtag, i.e., the name of the disaster itself. After inspecting the hashtags more deeply, we find that in the disasters, the name of the disaster event itself, e.g., #florence, #dorian, #harvey, etc., appears in almost all the tweets. In contrast, in the political cascades, we observe (left column of Fig 2) several points above the line, reflecting hashtags that repeat more frequently than expected. Examples of political terms that were commonly repeated in the political cascades are "#MakeAmericaGreatAgain", "#ImWithHer", "#HillaryClinton", "#NeverHillary", "#realdonaldtrump", "#NeverTrump", or "#fakenews" in the 2016 US elections. These demonstrate the repetitive nature of political slogans.

Fig 2. Distribution of hashtags on a log-log scale for 6 political (left column) vs. 6 disasters (right column) cascades.

Fig 2

We only show here 6 disaster and 6 political cascade distributions; the slopes and PLMSE scores for all the data are shown in Fig 4(C) and 4(D). Mean slopes are -0.938 / -1.214 for the political/disaster cascades, a significant difference (t-test, p-value = 0.0041). For the political cascades, most slope values satisfy |α| < 1, while for the disaster cascades, all satisfy |α| > 1. When using all 10 political cascades (and not only 6), the significance grows even further (slopes = -0.973 / -1.208, t-test, p-value = 0.00187).

Slopes (exponents) of hashtags distribution

One important and meaningful property of power law distributions is their slope. Note that a power law graph represents a general function of the type $y = c\cdot x^{-\alpha}$, where c is a constant and $\alpha$ is the exponent (the slope on a log-log plot). We can observe in Fig 2 that the slopes are generally smaller in the political cascades (left column) compared to the non-political ones (right column). The slopes of the hashtag distributions in the political cascades (left column of Fig 2) range between α = 0.763 for the Catalonia politics cascade and α = 0.976 for the Bangladesh politics cascade. Compared to them, most disaster cascades have steeper slopes (i.e., a larger absolute value). The average slope for the political cascades is $\bar{\alpha} = 0.97$, while for the disaster cascades it is $\bar{\alpha} = 1.21$. These values differ significantly (t-test, p-value = 0.0018). The sharper slopes in the power law distributions of the disaster cascades (i.e., larger absolute values of the negative slope) represent a sharper decline in the likelihood of a hashtag repeating. This can result from many messages and hashtags that appear only once or twice, and very few hashtags that appear more times. In comparison, in the political cascades, a larger number of hashtags appear repeatedly in many of the political campaigns, as the political players use rhetorical repetition in their messages to implant the messages in their audience's minds. Note that in the power law distribution equation, the probability p of an event occurring x times is $p(X = x) = c\cdot x^{-\alpha}$, where $\alpha$ is the (absolute) slope and c is the y-intercept. Thus a larger (absolute) value of $\alpha$ implies a faster decay of the power law and a sharper difference between hashtags appearing many times and those appearing only a few times.
The deviation from the fitted slope in the upper left part of the curve results from those few hashtags that appear in the political cascades more often than they naturally should, suggesting higher repetition levels than we would expect. More importantly, the difference between the slopes of the two types of cascades is significant both when accounting for all 10 political cascades against the 6 disaster cascades, as seen in Fig 4 (t-test, p-value = 0.002), and when using only the 6 political cascades of Fig 2 (t-test, p-value = 0.004). These results suggest a clear difference between the hashtag distributions of the political and disaster cascades.
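The repetition distribution underlying these plots can be computed directly from the raw hashtag counts. The sketch below is not the authors' code; the tweet format (a list of dicts with a hypothetical "hashtags" key) is an assumption for illustration only.

```python
from collections import Counter

def hashtag_repetition_distribution(tweets):
    """Count how many distinct hashtags appear exactly x times.

    Returns a mapping x -> number of hashtags with count x, i.e. the
    empirical repetition distribution (up to normalization) whose
    log-log slope is compared across cascades.
    """
    # Count occurrences of each hashtag across all tweets
    tag_counts = Counter(tag.lower()
                         for tweet in tweets
                         for tag in tweet.get("hashtags", []))
    # Histogram of the counts themselves
    return Counter(tag_counts.values())

tweets = [
    {"hashtags": ["vote", "vote2020"]},
    {"hashtags": ["vote"]},
    {"hashtags": ["vote", "rally"]},
]
dist = hashtag_repetition_distribution(tweets)
# "vote" appears 3 times; "vote2020" and "rally" once each
assert dist == {3: 1, 1: 2}
```

A political cascade would typically show relatively more mass at large x (hashtags repeated many times), which is what flattens its log-log slope.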

Slope comparison for user distributions

A comparison of the slopes and scores of the user distributions for the disaster and political cascade groups is presented in Fig 3. The left column presents the political cascades and the right column the disaster cascades. The user distribution differs substantially between the two topics. In disasters, the points fit the theoretical power-law curve rather well. This, however, is not the case in the political cascade group (left column), where we observe a sharp decrease in the lower right part of the distribution. We are not certain of the reasons for this sharp fall.

Fig 3. Distribution of users on a log-log scale for political (left column) vs. disaster (right column) topics.

Fig 3

As in Fig 2, we show here 6 disaster cascades compared to 6 political cascades. The mean slopes are -2.21 for the political topics compared to -0.589 for the disaster cascades, a clearly significant difference (t-test, p-value = 1.4E-5). Full slope and score comparisons, accounting for the additional 4 political cascades, are presented in Fig 4(A) and 4(B).

In the disaster cascades, the slope magnitudes are substantially lower. For the user distributions of the political topics, the exponent slopes range between −1.32 and −3.09, compared to the disaster discussions, where they range between −0.545 and −0.664. These differences are clearly significant (t-test, p-value = 1.4E-05). The sharper decay among political users can be understood through the strong emotional aspect of political discussions: one either talks about politics intensively or barely talks about it at all [51]. In the disaster cascades, user engagement is more evenly distributed and better fits the expected "natural" power-law curve.
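The significance test used throughout is a standard two-sample t-test on the per-cascade slope estimates. A minimal stdlib sketch (the slope values below are illustrative placeholders within the reported ranges, not the paper's exact per-cascade numbers):

```python
import math
from statistics import mean, variance

def two_sample_ttest(a, b):
    """Student's two-sample t-test (equal-variance pooled form).

    Returns the t statistic and degrees of freedom; the p-value can then
    be read off a t-table or scipy.stats.t.sf.
    """
    na, nb = len(a), len(b)
    # Pooled variance estimate across both samples
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2

# Illustrative user-distribution slopes, one per cascade
political = [-1.32, -1.85, -2.10, -2.45, -2.60, -3.09]
disasters = [-0.545, -0.56, -0.58, -0.60, -0.63, -0.664]
t, df = two_sample_ttest(political, disasters)
```

With slope samples this far apart, |t| is large and the p-value falls well below conventional significance thresholds, matching the qualitative picture in the text.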

Separation through word-to-tweet bipartite networks

In addition to the separation method described above, which is based on the exponent (slope), we also transformed the data into a bipartite network. We show that this transformation improved the separation. Furthermore, we use our newly developed quantitative measure, the PLMSE of Eq 1, and show how it can be used to better separate political from non-political cascades.

This separation is shown without the network transformation in Fig 4 and with it in Fig 5 below. First, in Fig 4, we show the simpler separation by the users' slopes (A) and the users' PLMSE scores (B). In both sub-images, the orange points are the political cascades and the blue points are the disaster cascades. In (C) and (D), we show a similar separation, but now by the slopes of the hashtag distributions (C) and their PLMSE scores (D).

Fig 4. Separation between political (orange) and disasters (blue) cascades.

Fig 4

(A) Slope of the user distribution, (B) PLMSE scores of the user distribution, (C) slope of the hashtag distribution, (D) PLMSE scores of the hashtag distribution. Note the slope differences between users and hashtags, and between the political and disaster cascades. The means of the political and non-political cascades differ significantly. For the user distributions, the mean slopes are -2.05 vs. -0.59 (t-test, p-value = 1.4E-05). Similarly, the users' PLMSE scores differ, 8.14 vs. 0.53 (t-test, p-value = 0.00045). For the hashtags, the mean slopes are -0.97 vs. -1.21 for politics/disasters (t-test, p-value = 0.0018), while the PLMSE scores are 1.39 vs. 2.1 (p-value = 0.0026).

Fig 5. Separation of political and natural events by word-to-tweet bipartite graphs.

Fig 5

(A) Slope of the degree distributions of the bipartite graphs. (B) The average degree of the bipartite graphs. (C) Two-dimensional separation: number of clusters in the graph (x-axis) vs. degree-distribution slope (y-axis). (D) Two-dimensional separation: number of clusters in the graph (x-axis) vs. average degree (y-axis).

We observe a good separation between the political and disaster cascades in (A) and (B). For the hashtag slopes (C) and their scores (D), the means differ, but no single line separates all points.

In the distribution of hashtags (C), the slopes of the political topics (orange) are generally smaller than those of the disasters (blue). Note that a small slope indicates a slower decay, and thus more words and hashtags that appear many times; in the disasters, hashtags appearing many times are less frequent. This is possibly due to the more repetitive use of words (and hashtags) in the political cascades, resulting from rhetorical propaganda and a style that relies on repetition, i.e., the drumming of slogans. Also, since the users' slopes alone do not separate the two types of cascades well enough, we add the bipartite graph transformation to the separation.

In Fig 5, we show the political / disaster separation in a two-dimensional space: (A) and (B) show only the slope of the degree distribution (A) and the average degree (B) of the bipartite network, while (C) and (D) add these dimensions to the number of clusters in the bipartite network.

To construct a bipartite graph from each cascade, we needed the entire tweet message, which was not available in the initial datasets. We therefore collected another set of 6 disaster-related datasets (JSON format) from [52], where the complete tweet records of the disaster cascades were found. The political tweets were collected as before from Kaggle [53]. As before, the political topics appear in orange and the disasters in blue. In Fig 5(A) we show the slopes of the degree distributions for both types of cascades, and in Fig 5(B) the networks' average degrees. An interpretation of these results might suffer from our own confirmation bias, and the data transformation sometimes makes their interpretation less clear. Nevertheless, the separation is rather clear and can be observed in Fig 5(A)–5(D). We also see that adding the number of clusters contributes some information, but not such that deeply changes the separation obtained by a single dimension.
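The bipartite construction and the network statistics used in Fig 5 (average degree, number of clusters) can be sketched with the standard library alone. This is a simplified illustration, not the authors' pipeline: tweets are represented as token lists, clusters are taken to be connected components, and tokenization details are omitted.

```python
from collections import defaultdict, deque

def bipartite_stats(tweets):
    """Build a word-to-tweet bipartite graph and return
    (average degree, number of connected components).

    Each tweet node connects to the distinct words it contains.
    """
    adj = defaultdict(set)
    for i, words in enumerate(tweets):
        t = ("tweet", i)
        for w in set(words):
            adj[t].add(("word", w))
            adj[("word", w)].add(t)
    n_nodes = len(adj)
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    avg_degree = 2 * n_edges / n_nodes
    # Count connected components ("clusters") with BFS
    seen, components = set(), 0
    for node in adj:
        if node in seen:
            continue
        components += 1
        queue = deque([node])
        seen.add(node)
        while queue:
            for nbr in adj[queue.popleft()]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
    return avg_degree, components

tweets = [["flood", "rescue"], ["flood", "aid"], ["vote", "rally"]]
avg_deg, n_clusters = bipartite_stats(tweets)
# 8 nodes, 6 edges -> average degree 1.5; two disconnected components
assert (avg_deg, n_clusters) == (1.5, 2)
```

Repetitive political language would share many word nodes across tweets, raising the average degree and merging clusters, which is the intuition behind the extra separation dimensions in Fig 5.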

Discussion

We collected information cascades of two different types from Twitter: political topics and disasters. The disasters included topics such as earthquakes, hurricanes, mass shootings, and large fires. The comparison between these two types of cascades builds on our previous preliminary results [54], which suggested an intrinsic difference between political and non-political cascades. This difference is mainly due to efforts to influence the spread of political discussions, and also due to the tendency of political discussions to include a greater degree of repetitiveness.

In contrast, large-scale disasters generate more genuine social (media) discussions, which are based on an authentic human need to discuss, worry, and mostly share information in times of stress, or seek helpful information in times of disaster. Political campaigns, on the other hand, try to use social networks for their benefit, as a tool to draw attention, influence potential voters, or debate with political opponents. The goals and internal motivations of users who tweet about politics and those who tweet about large-scale disasters might also be different to some degree, and in this work, we try to capture these differences on a statistical macro scale level.

We find that the distribution of users in the political cascades strongly deviates from the power-law distribution; we cannot truly claim that it follows a power law at all. The sharp drop in the lower right side of all the political cascade distributions (as observed in Fig 3) indicates a user distribution in politics different from that of the natural disasters.

As for the hashtag distributions, we see in the political topics a slower decay (smaller slope), indicating more hashtags that are used repetitively.

We also develop here a quantification method that detects the deviation from the natural power-law line. We name this measure the PLMSE (Eq 1) and demonstrate its efficiency. One should note that future work can improve the PLMSE equation; one possible candidate is PLMSE = ((1 − ψ)/ψ) · Σ_{i=1}^{n} ψ^i · log²(y_i / (C·i^α)), where the optimal value of the tuning parameter 0 ≤ ψ ≤ 1 should be searched for. While we do not claim Eq (1) is optimal, by demonstrating its efficiency we open the path to further research on measuring the fit to a power law through a mean-square-error (PLMSE) formula between the theoretical line and the data.
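In the spirit of Eq 1, a PLMSE-style score can be sketched as the mean squared log-distance between the observed frequencies and the fitted power-law line. The exact weighting of the published PLMSE may differ; this is an illustrative implementation only.

```python
import math

def plmse(y, c, alpha):
    """Mean squared log-deviation of observed frequencies y_i (i = 1..n)
    from a fitted power-law line C * i**(-alpha).

    A perfect power law scores ~0; the larger the score, the stronger the
    deviation from the theoretical line.
    """
    n = len(y)
    return sum(math.log(yi / (c * (i + 1) ** (-alpha))) ** 2
               for i, yi in enumerate(y)) / n

# Data generated exactly on the fitted line scores (numerically) zero
perfect = [100.0 * i ** (-1.2) for i in range(1, 6)]
# Data with deviations (e.g., over-repeated hashtags) scores higher
noisy = [120.0, 40.0, 30.0, 12.0, 25.0]
assert plmse(perfect, 100.0, 1.2) < plmse(noisy, 100.0, 1.2)
```

Under this reading, the lower PLMSE of the political hashtag distributions reflects how the repetition-inflated head of the curve pulls the fit, while disaster cascades sit closer to a clean power law in the user dimension.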

Another issue and possible limitation that needs to be considered is the fit of the distributions to a power law. Some studies claim that the distribution of hashtags is indeed a power law [55], while others claim it is a generalized Zipf's law [56]. While a true fit of many distributions to a power law is rarer than expected [57], such an exact fit is not critical to the practical aspect of our work. This is because the conceptual novelty of our work is the ability to differentiate between political and non-political cascades based on statistical patterns of repetition of words and users. These patterns hold even if the distribution does not fit an exact power law but merely some fat-tailed distribution.

We used logarithmic binning and least-squares estimation (LSE) after binning on a logarithmic scale. Some researchers claim that the Maximum Likelihood (ML) method, strongly advocated by Clauset and others [57, 58], is the correct way to estimate the power-law slope; other researchers, after generating power-law distributions with known parameters and recovering their (known) slopes, claim otherwise [44]. This experimental work states that "the criticism about the inaccuracy of LSE in fitting power-law distributions is complete nonsense" and that "our experiments uncover a fundamental flaw in the widely known CSN2009 method proposed by Clauset et al.: it tends to discard the majority of power-law data and fit the long-tailed noises."

Furthermore, it has also been shown [59] that when the data is not a clean power law but rather an approximation of one (as in our case), logarithmic binning yields an unbiased slope estimate using LSE methods and performs better than the ML method. Along this line, we also tried the Maximum Likelihood method [57, 58], but it simply did not produce slopes that fitted our data visually. Since we could observe the slope differences visually, we preferred the slope-measuring method of logarithmic binning with LSE, which better describes the slope observed in the data and correctly captures the observed differences between the political and non-political discussions.
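The binning-plus-LSE procedure the paragraph above prefers can be sketched as follows. Binning boundaries (powers of 2) and the use of bin midpoints are illustrative choices, not necessarily the authors' exact settings.

```python
import math

def loglog_lse_slope(counts):
    """Estimate a power-law exponent by least squares on logarithmically
    binned data: group x-values into dyadic bins [2^b, 2^(b+1)), average
    the y-values in each bin, then fit a line to (log2 x, log2 y).
    """
    bins = {}
    for x, y in counts.items():
        b = int(math.log2(x))  # bin index for x in [2^b, 2^(b+1))
        bins.setdefault(b, []).append(y)
    # One point per bin: bin midpoint (in log2 units) vs. log2 of mean y
    pts = [(b + 0.5, math.log2(sum(ys) / len(ys))) for b, ys in bins.items()]
    # Ordinary least squares for the slope of log y vs. log x
    n = len(pts)
    mx = sum(px for px, _ in pts) / n
    my = sum(py for _, py in pts) / n
    num = sum((px - mx) * (py - my) for px, py in pts)
    den = sum((px - mx) ** 2 for px, _ in pts)
    return num / den

# Synthetic y = 1000 * x^-1.5 should recover a slope near -1.5
data = {x: 1000.0 * x ** (-1.5) for x in range(1, 200)}
slope = loglog_lse_slope(data)
```

On clean synthetic data this recovers the generating exponent to within the binning error, which is the kind of sanity check [44] and [59] use to compare LSE against ML fitting.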

A limitation that also needs to be discussed is that we used both tweets and retweets, which do not directly form an information cascade. Moreover, on Twitter, if user B retweets user A and user C retweets user B, the data will appear as if both B and C retweeted user A, and the real path remains hidden. In this sense, the cascades in our data set do not truly represent direct cascades, but rather discussions of topics. For example, if the hashtag #Harvey appears after hurricane Harvey, we collect the discussion related to the hurricane, which was not discussed before the hurricane occurred. While the tweets were collected at the time the hashtag was relevant, we have no true knowledge of the exact cascade structure in terms of who spread to whom. We thus in fact measure the discussion about a topic, and not necessarily the direct cascade spreading from one person to another.

Another issue that needs to be considered relates to the user distributions. While the distribution of users differs greatly between the political and non-political cascades, as seen clearly in Fig 3, we are also more careful about relying on this difference. One should note that Twitter limits regular (non-commercial) users to 5,000 accounts, but paid accounts might not have this limitation; we are not sure of X's current policy on this issue. Thus, the extreme difference between the user distributions of the political and disaster cascades may also be due to political cascades including more paid accounts.

Finally, an interesting future direction could inspect additional topics, such as sports, fashion, and music hits, to see whether they behave like the political or the disaster cascades. An example of such a topic is the COVID-19 pandemic, which created a global movement of vaccine supporters and deniers. A question arises whether COVID-19 vaccine-related discussions were more similar to political or to disaster cascades. Based on a single cascade data set from [60], we can see (S1 Fig) that the word distribution of this COVID-19 cascade has a slope of −0.648 (R² = 0.96), which better resembles a political than a disaster cascade. This only suggests a possible further study to determine whether COVID-19-related information was politicized.

Indeed, the line separating propaganda, information bias, and simple repetition on Twitter is a fine one. First, we discuss the relations between information bias and propaganda, and then the role of authentic users in the spread of propaganda. In this work, we name as propaganda any information bias by an external force aiming to bias the natural spread of ideas. This is a very broad definition, since it also includes commercial communication efforts to spread a brand name, as well as the messages of a charismatic opinion leader. It also includes the "darker" aspects of the phenomenon, such as the use of centrally or decentrally organized bots and fake accounts to spread misinformation. What we see in our work is that although the existence of charismatic opinion leaders who consistently repeat specific messages is surely a natural phenomenon (and cannot be considered propaganda by itself), the distribution of such charismatic users and topics is more common in the political topics than in the natural-disaster topics. This can be clearly observed in Figs 2 and 3: the distribution slopes of the hashtags used in the political topics are less steep, so the decay in hashtag usage is slower, representing higher repetitiveness in the language of the political topics. In contrast, for the users, the slope is steeper in politics; some users are very active while the majority are barely active at all. These active users in politics, who repeatedly spread the same hashtags again and again, are thus the statistical footprint of propaganda. Furthermore, as long as we keep the neutral definition of propaganda as a repetitive broadcast of a message, regardless of whether the message is acceptable or unacceptable, we avoid the need for any moral evaluation of the topic.
In this perspective, "Just do it" is propaganda just as "From the river to the sea, Palestine will be free" is, since we ignore our personal view of the content and inspect only whether the distribution of these topics is biased or unbiased. Indeed, the term "propaganda" is generally reserved for a "bad" spread of ideas that includes massive brainwashing; we claim that any information bias should be considered propaganda, whether for good or for bad reasons. Furthermore, while "yes we can" may be a legitimate message spread by authentic users, our "statistical" definition of propaganda does not limit the term to the direct effect between a government and its people. Such a narrow definition would not classify the burning of old scripts (whether by German Nazis or by young Mao enthusiasts) as an act resulting from propaganda when no government gave a direct order to burn such books, even though these were clearly acts of propaganda that resulted only from crowd-to-crowd message spread. To conclude: in this work we define propaganda as any effort to bias the spread of information. This effort is generally hidden. Bots, fake accounts, and smart slogans are all part of this effort, and although most of its effects cannot be observed individually, our claim here is that they can be observed through their statistical properties.

Last, with the fast emergence of large language models such as GPT-4 or Google Gemini, we expect greater sophistication in machine-generated content and smart bots. Our method may be more valuable than ever in this coming future.

Conclusion

We inspected several information cascades, one group related to political topics and the other to disasters, comprising in total over 9 million records and 9 GB of data. We found that the distributions of hashtag repetitions and user repetitions differ (p-value = 1.15E−5 for users and p-value = 0.004 for hashtags). These results suggest higher repetitiveness of hashtags in the political cascades, manifested by a smaller exponent of the hashtag distribution compared to the disaster (non-political) cascades. The user distributions of the political cascades, in contrast, have larger slopes, indicating a faster decay in user repetition, resulting from the tendency of a person either to talk politics or to barely discuss it at all. We additionally propose a quantification measure named PLMSE, based on the known MSE measure adapted to power-law curves, and show how its use helps to separate the two types of cascades.

Overall, the user distributions of the political discussions tend to generate larger slopes of |α| > 1.1, while disasters tend to produce milder slopes of |α| < 0.67 (Fig 4A). In the hashtag distributions, the political topics tend to generate smaller slopes, with mean(α) = 0.97, while the disaster hashtag distributions tend to decay faster, with mean(α) = 1.21 (see Fig 4C). As for the PLMSE scores, the hashtags' mean value is 1.4 in politics compared to 2.1 for the disasters, and the users' mean score is 8.97 compared to the disasters' 0.522. These differences are observed both visually and numerically in Fig 4B. Transforming the data into a word-to-tweet bipartite graph provides an additional separation space, and thus an even better separation between these two types of cascades.
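Taken together, the reported slope ranges invite a simple rule of thumb. The following toy classifier is only a sketch distilled from the thresholds quoted above (|α| > 1.1 for political user slopes, |α| < 0.67 for disaster user slopes, mean hashtag slopes 0.97 vs. 1.21); it is an illustration, not a validated classifier from the paper.

```python
def classify_cascade(user_slope, hashtag_slope):
    """Toy majority-vote rule over the two slope features.

    Steep user-distribution decay and shallow hashtag decay both point
    toward a political cascade; the reverse points toward a disaster.
    Thresholds are illustrative assumptions drawn from Fig 4.
    """
    votes = 0
    votes += 1 if abs(user_slope) > 1.1 else -1      # steep user decay
    votes += 1 if abs(hashtag_slope) < 1.1 else -1   # slow hashtag decay
    return "political" if votes > 0 else "disaster"

# Mean slopes reported in the paper fall on the expected sides
assert classify_cascade(-2.2, -0.97) == "political"
assert classify_cascade(-0.59, -1.21) == "disaster"
```

A practical deployment would add the PLMSE scores and the bipartite-graph features as further voting dimensions rather than relying on slopes alone.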

We proposed here several attributes to estimate and separate political from non-political cascades. It is commonly believed that the level of bias in political cascades is higher, although we know disaster cascades also include some degree of manipulation. Authentic cascades and discussions are assumed to be those in which people truly share and seek information and are less likely to try to manipulate its flow. Political cascades, on the other hand, have a more defined goal: to convince others. We believe our proposed method can be useful as a first, high-level assessment tool to detect the presence of propaganda or bias in information cascades and discussions in social media. With the coming age of computational propaganda based on large language model tools, the ability to detect intentional bias and external efforts to spread ideas might be more relevant than ever before.

Supporting information

S1 Table. Data sets used for hashtag analysis.

(XLSX)

pone.0309688.s001.xlsx (9.9KB, xlsx)
S2 Table. Data sets used for user analysis.

(XLSX)

pone.0309688.s002.xlsx (9.9KB, xlsx)
S3 Table. Data sets used for bipartite graphs network analysis.

(XLSX)

pone.0309688.s003.xlsx (10KB, xlsx)
S4 Table. Hashtags data points.

(XLSX)

pone.0309688.s004.xlsx (9.7KB, xlsx)
S5 Table. Users data points.

(XLSX)

pone.0309688.s005.xlsx (9.8KB, xlsx)
S6 Table. Bipartite network data points.

(XLSX)

pone.0309688.s006.xlsx (10.2KB, xlsx)
S1 Fig. Power law slope of COVID-19 bigrams (Banda et al, 2020).

The slope is more similar to the political cascades than it is to the disaster cascades.

(TIFF)

pone.0309688.s007.tiff (1.8MB, tiff)
S2 Fig. Additional slope images (complementary to Figs 2 and 3) for hashtags of Egypt and Iran politics.

(TIFF)

pone.0309688.s008.tiff (1.2MB, tiff)

Data Availability

All the code is available in the study's GitHub repository: https://zenodo.org/records/10805274 or https://github.com/OmerNeter/tweets_politics_project. The data cannot be fully shared publicly because of its size and because we are not its owners. Due to the sizes of the datasets, which are over 200 MB (even when zipped), and since we downloaded the data from open repositories rather than owning it, we added links to the dataset locations in the project's Git repositories. We also added several datasets to enable a smooth run of the code; these additional datasets are found via links in the repository. They link to (1) Kaggle, (2) FigShare, and (3) the Digital Library data repository. In the root of the Git repository, we added a table named "DataSources.xlsx" specifying the download link for each data source. Additionally, the statistical properties of each network data source (#clusters, #nodes, #edges, slope) are found in the "Network_dataset.csv" file in the root of the repository. The minimal data set underlying the results described in this manuscript can be found in the GitHub repository, https://github.com/OmerNeter/tweets_politics_project, in the files 'Fig 4 A-D.xlsx' and 'Fig 5.xlsx'.

Funding Statement

This work was supported financially by the Ariel Cyber Innovation Center in conjunction with the Israel National Cyber Directorate in the Prime Minister's Office; Award Number: None | Recipient: Alon Sela, Ph.D. The grant was used to pay the project's programmer, Mr. Omer Neter, who is also one of the authors of this paper. We declare that the funders had no role in study design, data collection and analysis, the decision to publish, or preparation of the manuscript, nor in any other scientific or non-scientific matter beyond providing the initial funds.

References

  • 1. Vickers B. Repetition and emphasis in rhetoric: Theory and practice. Swiss Papers in English Language and Literature. 1994;7:85–114. [Google Scholar]
  • 2. McQuarrie EF, Mick DG. Figures of rhetoric in advertising language. Journal of consumer research. 1996;22(4):424–438. doi: 10.1086/209459 [DOI] [Google Scholar]
  • 3. Hodges A. Yes, we can: The social life of a political slogan. Contemporary critical discourse studies. 2014;349:366. [Google Scholar]
  • 4. Bykov IA, Balakhonskaya LV, Gladchenko IA, Balakhonsky VV. Verbal aggression as a communication strategy in digital society. In: 2018 IEEE Communication Strategies in Digital Society Workshop (ComSDS). IEEE; 2018. p. 12–14. [Google Scholar]
  • 5. De Vreese CH, Esser F, Aalberg T, Reinemann C, Stanyer J. Populism as an expression of political communication content and style: A new perspective. The international journal of press/politics. 2018;23(4):423–438. doi: 10.1177/1940161218790035 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Aalberg T, de Vreese CH. Comprehending populist political communication. Populist Political Communication in Europe, New York: Routledge. 2017; p. 1–13. [Google Scholar]
  • 7. Serazio M. Branding politics: Emotion, authenticity, and the marketing culture of American political communication. Journal of Consumer Culture. 2017;17(2):225–241. doi: 10.1177/1469540515586868 [DOI] [Google Scholar]
  • 8. Cacioppo JT, Petty RE. Effects of message repetition and position on cognitive response, recall, and persuasion. Journal of personality and Social Psychology. 1979;37(1):97. doi: 10.1037/0022-3514.37.1.97 [DOI] [Google Scholar]
  • 9. Paul C, Matthews M. The Russian “firehose of falsehood” propaganda model. Rand Corporation. 2016;2(7):1–10. [Google Scholar]
  • 10. McKelvey F, Dubois E. Computational propaganda in Canada: The use of political bots; 2017. [Google Scholar]
  • 11. Neyazi TA. Digital propaganda, political bots and polarized politics in India. Asian Journal of Communication. 2020;30(1):39–57. doi: 10.1080/01292986.2019.1699938 [DOI] [Google Scholar]
  • 12.Howard PN, Kollanyi B. Bots, strongerin, and brexit: Computational propaganda during the uk-eu referendum. Available at SSRN 2798311. 2016;.
  • 13. Lightfoot S, Jacobs S. Political propaganda spread through social bots. Media, Culture, & Global Politics. 2017; p. 1–22. [Google Scholar]
  • 14. Sela A, Milo O, Kagan E, Ben-Gal I. Improving information spread by spreading groups. Online Information Review. 2019;. doi: 10.1108/OIR-08-2018-0245 [DOI] [Google Scholar]
  • 15. Latah M. Detection of malicious social bots: A survey and a refined taxonomy. Expert Systems with Applications. 2020;151:113383. doi: 10.1016/j.eswa.2020.113383 [DOI] [Google Scholar]
  • 16. Vasilkova V, Legostaeva N. Social bots in political communication. RUDN Journal of Sociology. 2019;19(1):121–133. doi: 10.22363/2313-2272-2019-19-1-121-133 [DOI] [Google Scholar]
  • 17. Sela A, Cohen-Milo O, Kagan E, Zwilling M, Ben-Gal I. Using Connected Accounts to Enhance Information Spread in Social Networks. In: Cherifi H, Gaito S, Mendes JF, Moro E, Rocha LM, editors. Complex Networks and Their Applications VIII. Cham: Springer International Publishing; 2020. p. 459–468. [Google Scholar]
  • 18. McGarry ED. The Propaganda Function in Marketing. Journal of Marketing (pre-1986). 1958;23:131. doi: 10.1177/002224295802300202 [DOI] [Google Scholar]
  • 19. Hobbs R. Propaganda in an age of algorithmic personalization: Expanding literacy research and practice. Reading Research Quarterly. 2020;55(3):521–533. doi: 10.1002/rrq.301 [DOI] [Google Scholar]
  • 20. Kim J, Hastak M. Social network analysis: Characteristics of online social networks after a disaster. International Journal of Information Management. 2018;38(1):86–96. doi: 10.1016/j.ijinfomgt.2017.08.003 [DOI] [Google Scholar]
  • 21. Keim ME, Noji E. Emergent use of social media: a new age of opportunity for disaster resilience. American journal of disaster medicine. 2011;6(1):47–54. doi: 10.5055/ajdm.2011.0044 [DOI] [PubMed] [Google Scholar]
  • 22. Alexander DE. Social media in disaster risk reduction and crisis management. Science and engineering ethics. 2014;20(3):717–733. doi: 10.1007/s11948-013-9502-z [DOI] [PubMed] [Google Scholar]
  • 23. Muchnik L, Pei S, Parra LC, Reis SD, Andrade JS Jr, Havlin S, et al. Origins of power-law degree distribution in the heterogeneity of human activity in social networks. Scientific reports. 2013;3(1):1–8. doi: 10.1038/srep01783 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Muchnik L, Pei S, Parra L, Reis S, Andrade J Jr, Havlin S, et al. Heterogeneity of Human Activity Levels Gives Rise to Power-Law Distribution in Online Social Networks. In: APS March Meeting Abstracts. vol. 2013; 2013. p. F28–004. [Google Scholar]
  • 25. Taylor PM. Munitions of the mind: A history of propaganda from the ancient world to the present era. Manchester University Press; 2013. [Google Scholar]
  • 26. Lasswell HD. The theory of political propaganda. American Political Science Review. 1927;21(3):627–631. doi: 10.2307/1945515 [DOI] [Google Scholar]
  • 27. Sela A., Shekhtman L., Havlin S., Ben-Gal I. Comparing the diversity of information by word-of-mouth vs. web spread. Europhysics Letters, 2016; 114(5), 58003. doi: 10.1209/0295-5075/114/58003 [DOI] [Google Scholar]
  • 28. Cacioppo JT, Petty RE. Effects of message repetition on argument processing, recall, and persuasion. Basic and applied social psychology. 1989;10(1):3–12. doi: 10.1207/s15324834basp1001_2 [DOI] [Google Scholar]
  • 29. Omozuwa VE, Ezejideaku E. A stylistic analysis of the language of political campaigns in Nigeria: Evidence from the 2007 general elections. OGIRISI: a New Journal of African Studies. 2008;5:40–54. [Google Scholar]
  • 30. Doob LW. Goebbels’ principles of propaganda. Public Opinion Quarterly. 1950;14(3):419–442. doi: 10.1086/266211 [DOI] [Google Scholar]
  • 31.Martino GDS, Cresci S, Barrón-Cedeno A, Yu S, Di Pietro R, Nakov P. A survey on computational propaganda detection. arXiv preprint arXiv:200708024. 2020;.
  • 32.Orlov M, Litvak M. Using behavior and text analysis to detect propagandists and misinformers on Twitter. In: Annual International Symposium on Information Management and Big Data. Springer; 2018. p. 67–74.
  • 33. Zhang J, Zhang R, Zhang Y, Yan G. The rise of social botnets: Attacks and countermeasures. IEEE Transactions on Dependable and Secure Computing. 2016;15(6):1068–1082. doi: 10.1109/TDSC.2016.2641441 [DOI] [Google Scholar]
  • 34.Meta. Facebook for developers Using Personas; 2022. Available from: https://developers.facebook.com/docs/messenger-platform/send-messages/personas/.
  • 35. Hagen L, Neely S, Keller TE, Scharf R, Vasquez FE. Rise of the machines? Examining the influence of social bots on a political discussion network. Social Science Computer Review. 2020; p. 0894439320908190. [Google Scholar]
  • 36. Bolsover G, Howard P. Chinese computational propaganda: automation, algorithms and the manipulation of information about Chinese politics on Twitter and Weibo. Information, communication & society. 2019;22(14):2063–2080. doi: 10.1080/1369118X.2018.1476576 [DOI] [Google Scholar]
  • 37.Beltagy I, Lo K, Cohan A. SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:190310676. 2019;.
  • 38. Elkins K, Chun J. Can GPT-3 pass a writer’s Turing Test? Journal of Cultural Analytics. 2020;5(2):17212. doi: 10.22148/001c.17212 [DOI] [Google Scholar]
  • 39.Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language models are few-shot learners. arXiv preprint arXiv:200514165. 2020;.
  • 40. Hołyst JA, Mayr P, Thelwall M, Frommholz I, Havlin S, Sela A, et al. Protect our environment from information overload. Nature Human Behaviour. 2024;8(3):402–403. doi: 10.1038/s41562-024-01833-8 [DOI] [PubMed] [Google Scholar]
  • 41. Barabási AL, Albert R. Emergence of scaling in random networks. science. 1999;286(5439):509–512. doi: 10.1126/science.286.5439.509 [DOI] [PubMed] [Google Scholar]
  • 42. Rich PM. Mechanical architecture of arborescent rain forest palms. Principes. 1986;30(3):117–131. [Google Scholar]
  • 43. Goldstein ML, Morris SA, Yen GG. Problems with fitting to the power-law distribution. The European Physical Journal B-Condensed Matter and Complex Systems. 2004;41:255–258. doi: 10.1140/epjb/e2004-00316-5 [DOI] [Google Scholar]
  • 44.Zhong X, Wang M, Zhang H. Is least-squares inaccurate in fitting power-law distributions? The criticism is complete nonsense. In: Proceedings of the ACM Web Conference 2022. p. 2748–2758.
  • 45.Mooney P. Twitter Election Data Archives—Kaggle; 2019. Available from: https://www.kaggle.com/datasets/paultimothymooney/twitter-election-data-archives.
  • 46.UNT L. Digital Library Data Repository; 2021. https://digital.library.unt.edu/.
  • 47.FigShare. FigShare Repository data; 2022. https://figshare.com/search?q=hurricane&itemTypes=3.
  • 48.Hydrator. GitHub—DocNow/hydrator: Turn Tweet IDs into Twitter JSON & CSV from your desktop!; 2020. Available from: https://github.com/DocNow/hydrator.
  • 49.Learn how to easily hydrate tweets Using the Hydrator app and twarc tool by DocNow Medium by Aruna Pisharody Available from: https://towardsdatascience.com/learn-how-to-easily-hydrate-tweets-a0f393ed340e.
  • 50.X developer platform—Developer Terms Redistribution of X content Available from: https://developer.twitter.com/en/developer-terms/more-on-restricted-use-cases.
  • 51. Mascheroni G, Murru MF. “I can share politics but I don’t discuss it”: everyday practices of political talk on Facebook. Social Media+ Society. 2017;3(4):2056305117747849. doi: 10.1177/2056305117747849 [DOI] [Google Scholar]
  • 52. Zubiaga A. A longitudinal assessment of the persistence of Twitter datasets. Journal of the Association for Information Science and Technology. 2018;69:974–984. doi: 10.1002/asi.24026 [DOI] [Google Scholar]
  • 53. Goldstein Michel L and Morris Steven A and Yen Gary G Problems with fitting to the power-law distribution Twitter Election Data Archives; [Google Scholar]
  • 54. Sela A, Sabou JP, Cihelka P, Ganiyev M, Ulman M. Everything But Social: A Study of Twitter Accounts and Social Media Management Tool. In: Agrarian Perspective XXIX. Trends and Challenges of Agrarian Sector; 2020. p. 326–336. [Google Scholar]
  • 55. Zhou B, Pei S, Muchnik L, Meng X, Xu X, Sela A, et al. Realistic modelling of information spread using peer-to-peer diffusion patterns. Nature Human Behaviour. 2020;4(11):1198–1207. doi: 10.1038/s41562-020-00945-1 [DOI] [PubMed] [Google Scholar]
  • 56. Chen HH, Alexander TJ, Oliveira DF, Altmann EG. Scaling laws and dynamics of hashtags on Twitter. Chaos: An Interdisciplinary Journal of Nonlinear Science. 2020;30(6). doi: 10.1063/5.0004983 [DOI] [PubMed] [Google Scholar]
  • 57. Clauset A, Shalizi CR, Newman ME. Power-law distributions in empirical data. SIAM review. 2009;51(4):661–703. doi: 10.1137/070710111 [DOI] [Google Scholar]
  • 58. Bauke H. Parameter estimation for power-law distributions by maximum likelihood methods. The European Physical Journal B. 2007;58:167–173. doi: 10.1140/epjb/e2007-00219-y [DOI] [Google Scholar]
  • 59. Milojević Staša, Power law distributions in information science: Making the case for logarithmic binning. Journal of the American Society for Information Science and Technology, 61, 12, 2417–2425, 2010, Wiley. doi: 10.1002/asi.21426 [DOI] [Google Scholar]
  • 60.Banda JM, Tekumalla R, Wang G, Yu J, Liu T, Ding Y, et al. A large-scale COVID-19 Twitter chatter dataset for open scientific research–an international collaboration. arXiv preprint arXiv:200403688. 2020;. [DOI] [PMC free article] [PubMed]

Decision Letter 0

Gilad Ravid

12 Sep 2023

PONE-D-23-25017

“Signals of Propaganda” - Detecting and Estimating Political Influences in Information Cascades in Social Networks

PLOS ONE

Dear Dr. Sela,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Oct 27 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Gilad Ravid, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. In your Methods section, please include additional information about your dataset and ensure that you have included a statement specifying whether the collection and analysis method complied with the terms and conditions for the source of the data.

3. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

4. Please note that funding information should not appear in any section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript.

5. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. 

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

6. Thank you for stating the following financial disclosure: 

   "no"

Please state what role the funders took in the study.  If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript." 

If this statement is not correct you must amend it as needed. 

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

7. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

"Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

8. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

9. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information

Additional Editor Comments:

Please look at the detailed review by the reviewers. In addition to their helpful response, I would like to add the following review:

a. Propaganda is based on "partial truth or outright fiction" (page 2). Since you did not analyze the tweets' content, you cannot classify the tweets as true or not; hence, you actually measure and analyze the cascading of information. I think you should consider re-framing the paper as cascading rather than propaganda.

b. The example of tree height as a power-law distribution should be supported by the citations you cite; those citations research the connection between tree diameter and its height. Actually, Barabási and Albert discussed that height cannot be power-law distributed, as in such a case we would anticipate finding trees 1 km in height.

c. Methodology – Please describe the criterion for collecting the data for each dataset. If data is collected by their hashtag, that might explain the PL distribution.

d. How is the data binned into bins?

e. PLMSE – a mean squared error cost function can be applied to the power law without modification (the mean of the squared difference between the real value and the predicted one). What is the need for the new equation? How is it related to the original concept?

f. On page 7, the authors describe the network analysis they conducted. The results of this analysis should have been reported—for example, betweenness centrality, GC, and more.

g. You mention the outlying frequency of the name-of-the-disease hashtag (page 8). Since regression is sensitive to outliers, how would the results change if you omitted this point?

h. Did you check the t-test assumptions? Which flavor of t-test was used? If you used a two-group unequal-variance test, the p-value is only marginally significant.

i. The data series for hashtags and users and the network data series differ. This looks like cherry-picked series. It would be better to combine all the datasets into one set so that you can perform all of your analyses on it.

j. Have you considered that the politics users' distribution is not a power law?

k. Fig 4 – What is the x-axis in graphs A and B? In all four graphs, there are more than six red points (only six are described in the SI tables).

l. Fig 5 – The analysis in parts A and B does not require network analysis; these are the distributions of hashtags in tweets and tweets to hashtags, and B is the average number of hashtags.
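Comments d and e above ask how the data is binned and how PLMSE relates to an ordinary mean squared error. As background for readers, below is a minimal sketch of one standard slope-estimation procedure, logarithmic binning followed by least squares in log-log space; the bin count, the synthetic sample, and all other settings are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def log_binned_slope(values, n_bins=20):
    """Estimate a power-law slope by least squares on logarithmically
    binned densities (illustrative; n_bins=20 is an arbitrary choice)."""
    values = np.asarray(values, dtype=float)
    edges = np.logspace(np.log10(values.min()), np.log10(values.max()), n_bins + 1)
    counts, _ = np.histogram(values, bins=edges)
    density = counts / np.diff(edges)           # normalize counts by bin width
    centers = np.sqrt(edges[:-1] * edges[1:])   # geometric bin centers
    mask = density > 0                          # drop empty bins before taking logs
    slope, _ = np.polyfit(np.log10(centers[mask]), np.log10(density[mask]), 1)
    return slope

# Synthetic Pareto sample with exponent 2.5, via inverse-transform sampling
rng = np.random.default_rng(0)
sample = (1 - rng.random(100_000)) ** (-1 / (2.5 - 1))
print(log_binned_slope(sample))  # estimate of the true exponent, -2.5
```

Logarithmic binning evens out the per-bin sample sizes in the heavy tail, which is why it is often preferred over linear binning when fitting slopes by least squares (cf. the Milojević reference in the bibliography).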

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: I Don't Know

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Sela et al. explores cascades of tweets relating to political topics and disasters, finding that for disasters the cascade distribution better follows a power-law whereas for political topics, the distribution falls off more quickly. The authors hypothesize about why this might be, noting various ideas from the literature. The paper is on an interesting topic and suggests an interesting comparison, along with novel findings. At the same time, several points should be clarified before publication.

On page 2 the authors suggest they use the disaster information as a control group for political cascades. I’m a bit concerned about this as each group has unique aspects and there are not other types of information explored beyond disasters. Perhaps suggesting this is just a comparison rather than a control is sufficient, or more carefully qualifying the idea of this as a control group. Likewise, the comment on page 3 that disaster cascades are not likely to be manipulated is not supported and I am not sure that this is certainly true.

The PLMSE measure is not fully clear to me. Conceptually I believe the authors are fitting a power-law to the data and then asking how far from the fit the data appears. However, Eq. 1 does not really appear similar to a MSE of a fit. Perhaps they could show the derivation more carefully (in SI) to explain better? Likewise, the authors describe ’n’ as being the number of points, yet it seems to me that this might be the number of bins in the power-law distribution? Otherwise, how do they have an expectation for what each point should be (since they’re drawn from a distribution)?

In Figure 3, the authors don’t show the US elections it seems. Is there a reason for this? Are the results similar for the US as the others? Likewise the ‘user_index’ label is unclear to me. Is this just the number of users that e.g., have 1 tweet, 10 tweets, etc. on the topic? The caption for this and Fig. 2 should be more detailed.

The authors' claim that the separation in Fig. 4 between disasters and politics is clear seems a bit overstated. Perhaps the authors could provide a metric on how good their method is at making predictions? Could they maybe calculate an AUROC based on the scores or give another standard metric? Eyeballing the graph, for example in 4D, many points in politics are comparable to disasters (e.g., USA politics 2011).

The paper could use a thorough proofreading. For example, on page 1 ‘includes a repetitive broadcasting’ should be ‘includes repetitive broadcasting’; on page 2 ‘of a political discussions’ should be ‘of political discussions’; on page 5, ‘would be explain’ should be ‘will be explained’; also on page 5 ‘the researcher study’ should be ‘the researcher studies’; also on page 5 ‘get Power law’ should be ‘get a power law’; on page 6 ‘data sets where used’ should be ‘data sets were used’; also on page 6 ‘scarping’ should be ‘scraping’ and ‘scraps’ should be ‘scrapes’; on page 8 ‘twits’ should be ‘tweets’; on page 14 ‘additianl’ should be ‘additional’;

On page 2 the statement that bots are illegal might be overstated, rather I think it is just a violation of terms of service potentially.

Reviewer #2: The paper proposes the analysis of several datasets of Twitter posts to evaluate the statistical differences between political and non-political discussions.

The proposed evaluation starts from the hypothesis that political discussions (as propaganda) generally leverage the repetition of messages and involve a limited number of users when compared with non-political discussions. The authors are able to distinguish between political and non-political datasets by studying the properties of graphs generated with the different datasets. Moreover, the authors propose the construction of a bipartite graph in which each tweet is connected with the words it contains.

The results emerge by studying how closely the graph properties exhibit theoretical power-law distributions and focusing on the evaluated parameters to derive the slope and an additional score called PLMSE (power-law mean square error).

Strong points:

- an attractive and reasonable approach

- a pleasant writing style

Weak points:

- not convincing experiments

- lack of crucial details

- limited number of tests

- detection motivation not fully plausible

- typos

The bipartite graph-based approach is attractive and sounds relevant with respect to the traditional NLP-based ones.

However, while the paper's approach is attractive, its overall quality and impact seem limited, and the results appear not entirely sound.

First of all, many important details appear to be overlooked when describing the evaluation methodology. For example, it is unclear how the authors processed the tweets and which words were included in the graphs (what about the stopwords and alike?) or which are the details of the operation "The data was binned prior to the slope estimation". The plots reported in the figures are inconsistent and incomplete: it is unclear why some datasets are used for showing some properties but not for others (for example, USA-2016 and Catalonia-2006 are missing in Figure 3).

Another questionable point is the claim on page 7, where considering "words used in the tweet regardless if they are hashtags or not" would remove the bias that "political users might tend to use hashtags more often than regular users". This way, political discussions will include the words related to the tweet *and* the hashtags downgraded to regular words.

The use of disaster-related datasets is a reasonable sample. However, it is only one of the possible types of "cascades" found on Twitter. How do the results differ when considering discussions about other events, like the Super Bowl, long-awaited movies, or World Cup exhibitions?

Furthermore, motivating the proposed methodology to detect political discussion sounds relatively trivial, considering that the presence of given hashtags proves that a tweet is political.

Finally, several typos undermine the overall quality of the paper: it seems that the paper did not receive a last text revision. A prominent example is the word "sty;e" on page 11, showing that a spell-check was missing before the submission to a prominent venue like PONE.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2025 Jan 30;20(1):e0309688. doi: 10.1371/journal.pone.0309688.r002

Author response to Decision Letter 0


5 May 2024

Responses to the editor's comments and to each of the reviewers can be found in the attached "Response to Reviewers" file.

_________________________________________________________________________________________________

Response to reviewers

Thank you for taking the time and effort to read and comment on our manuscript "Signals of Propaganda: Detecting and Estimating Political Influences in Information Cascades in Social Networks". Thank you also for your beneficial comments, which we feel truly helped us improve our manuscript.

In our study, we evaluate how information cascades that include non-organic spread, i.e., cascades suspected of containing bias resulting from an effort to spread an agenda (and as such termed "computational propaganda"), can be detected on Twitter via their macro-scale network properties, without examining the text itself.

Our proposed detection method is based on the property of rhetorical speech (which is a part of propaganda) to include a certain degree of repetition in its messages, i.e., "brainwashing" or "slogans", and on the tendency of political discussions to repeatedly broadcast similar messages to their target audience.

This creates a distribution in which several terms repeat more often than in an "organic" discussion; as a comparison group, we use large-scale disasters.

We show how such repetitions can be detected through their power-law exponent features, for both the user and hashtag distributions. We also show how transforming the cascades into bipartite graphs, connecting tweets with the words appearing in them, can help find this type of repetitiveness when hashtags are lacking or are unevenly distributed between the political cascades and the comparison group.
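As a toy illustration of this bipartite construction (the tweets below are invented; real preprocessing details such as stop-word handling are described in the paper), each tweet can be linked to the words it contains, and the degree of a word, i.e., the number of distinct tweets it appears in, read off directly:

```python
from collections import defaultdict, Counter

# Invented example tweets, purely for illustration
tweets = [
    "build the wall now",
    "build the wall",
    "the wall the wall",
    "flood waters rising near the river",
]

# One side of the bipartite graph: tweet -> set of words it contains
tweet_to_words = {f"t{i}": set(t.split()) for i, t in enumerate(tweets)}

# The other side: word -> set of tweets it appears in
word_to_tweets = defaultdict(set)
for tweet_id, words in tweet_to_words.items():
    for w in words:
        word_to_tweets[w].add(tweet_id)

# Degree of a word node = number of distinct tweets containing it
word_degree = Counter({w: len(ts) for w, ts in word_to_tweets.items()})
print(word_degree.most_common(3))
```

In a repetitive, slogan-like discussion, a few words accumulate disproportionately high degree, fattening the head of this distribution relative to an organic conversation.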

As a comparison group to the political information cascades, we use information cascades related to large-scale disasters. We believe these cascades reflect a more "organic" cascade type, since people in harsh situations share their knowledge and emotions for purely humane reasons, not for the sake of spreading an agenda or for personal gain; thus, disaster cascades are less centralized in nature.

We also developed a distinctive quantitative measure, named the PLMSE measure, based on the distinction above. We find that political discussions tend to be more centralized, i.e., fewer users repeat similar words more often than would occur in an "organic" discussion. Authentic social network conversations, on the other hand, are more heterogeneous in terms of their vocabulary.
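As an illustrative reading of the idea only (the paper's actual Eq. 1 and its binning may differ in detail), a PLMSE-style score can be sketched as the mean squared log-residual around a power-law fit to the rank-frequency curve:

```python
import numpy as np

def power_law_mse(freqs):
    """Illustrative stand-in for a PLMSE-style score: fit a line to the
    rank-frequency curve in log-log space and return the mean squared
    error of the log residuals. (Our reading of the concept, not the
    paper's exact Eq. 1.)"""
    freqs = np.sort(np.asarray(freqs, dtype=float))[::-1]  # rank-frequency order
    x = np.log10(np.arange(1, len(freqs) + 1))
    y = np.log10(freqs)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return float(np.mean(residuals ** 2))

# A clean Zipf-like power law vs. a distribution with an over-represented
# head (a few terms repeated far more than a power law predicts)
zipf_like = [1000 / r for r in range(1, 101)]
head_heavy = [20000, 15000] + [1000 / r for r in range(3, 101)]
print(power_law_mse(zipf_like), power_law_mse(head_heavy))
```

A clean power law scores near zero, while a head-heavy distribution scores visibly higher, which is the intuition behind using such a score to separate centralized from organic cascades.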

Following the reviewers' comments, we added an entire section explaining our PLMSE formula. We also added statistical tests to show the significance of the separation, along with many other corrections. We reorganized the GitHub repository entirely to enable reproducible work.

Thank you for this opportunity and your consideration of this manuscript as well as for the additional time that we received for its corrections and the deep remarks which improved the readability and clarity of our work.

Responses to the first reviewer are shown directly after this letter, while responses to the second reviewer start on page 6. Additional comments and corrections following the editor's comments (which in some cases were similar to the reviewers' comments) are allocated to the letter-to-the-editor document and start here on page 10. When the editor's and a reviewer's comments concerned the same topic, we answered only the editor in order to reduce redundancy.

We look forward to hearing back from you soon.

With kind regards,

Alon Sela (on behalf of the authors).

Reviewer 1

Comment: Reviewer #1: Sela et al. explores cascades of tweets relating to political topics and disasters, finding that for disasters the cascade distribution better follows a power-law whereas for political topics, the distribution falls off more quickly. The authors hypothesize about why this might be, noting various ideas from the literature. The paper is on an interesting topic and suggests an interesting comparison, along with novel findings.

At the same time, several points should be clarified before publication.

Answer: Thank you for this positive general evaluation.

Comment: On page 2 the authors suggest they use the disaster information as a control group for political cascades. I’m a bit concerned about this as each group has unique aspects and there are not other types of information explored beyond disasters. Perhaps suggesting this is just a comparison rather than a control is sufficient, or more carefully qualifying the idea of this as a control group. Likewise, the comment on page 3 that disaster cascades are not likely to be manipulated is not supported and I am not sure that this is certainly true.

Answer: Thank you for this comment. Indeed, following the reviewer's comment, we now express our belief that there is a weaker degree of manipulation in disasters compared to politics in a more careful manner.

Comment: The PLMSE measure is not fully clear to me. Conceptually I believe the authors are fitting a power-law to the data and then asking how far from the fit the data appears. However, Eq. 1 does not really appear like a MSE of a fit. Perhaps they could show the derivation more carefully (in SI) to explain better? Likewise, the authors describe ’n’ as being the number of points, yet it seems to me that this might be the number of bins in the power-law distribution? Otherwise, how do they have an expectation for what each point should be (since they’re drawn from a distribution)?

Answer: Thank you for this important comment. We added to the article an explanation of the PLMSE formula. Indeed, the PLMSE formula was not clear, and we hope now it is clearer.

Comment: In Figure 3, the authors don’t show the US elections it seems. Is there a reason for this? Are the results similar for the US as the others? Likewise, the ‘user_index’ label is unclear to me. Is this just the number of users that e.g., have 1 tweet, 10 tweets, etc. on the topic?

Answer: Thank you for this comment. In Fig 3 we show 6 politics vs. 6 disaster data sets, and in Fig 4 we show the entire data. We showed only part of the data as histograms due to space limits. We added to the Git repository the additional images from which we compute the slopes, which are the values that separate the classes. As for user_index, it is indeed the rank of a user when users are sorted from the most to the least frequently appearing in the data.

Comment: The caption for this and Fig. 2 should be more detailed.

Answer: Thank you for this comment. We added to the caption the following explanation.

Comments: The authors' claim that the separation in Fig. 4 between disasters and politics is clear seems a bit overstated. Perhaps the authors could provide a metric on how good their method is at making predictions? Could they maybe calculate an AUROC based on the scores or give another standard metric? Eyeballing the graph, for example in 4D, many points in politics are comparable to disasters (e.g., USA politics 2011).

Answers: Thank you for this important comment. We added a t-test; the computed p-values were added to the manuscript in the caption of Fig 4.
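For reference, a two-sample comparison of this kind can be sketched with Welch's unequal-variance t statistic and Welch-Satterthwaite degrees of freedom; the slope values below are invented for illustration and are not the paper's measured values.

```python
import numpy as np

def welch_t(a, b):
    """Welch's unequal-variance t statistic and Welch-Satterthwaite
    degrees of freedom for two independent samples."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical slope values for the two cascade classes (invented numbers)
politics_slopes = [-1.2, -1.1, -1.3, -1.15, -1.25]
disaster_slopes = [-2.0, -2.1, -1.9, -2.05, -1.95]
t, df = welch_t(politics_slopes, disaster_slopes)
print(round(t, 2), round(df, 1))
```

The p-value then comes from the t distribution with df degrees of freedom (e.g., via scipy.stats.t.sf); Welch's variant is the safer default when the two groups' variances cannot be assumed equal, which is exactly the concern raised in the editor's comment h.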

Comments: ‘includes a repetitive broadcasting’ should be ‘includes repetitive broadcasting’; on page 2 ‘of a political discussions’ should be ‘of political discussions’; on page 5, ‘would be explain’ should be ‘will be explained’; also on page 5 ‘the researcher study’ should be ‘the researcher studies’; also on page 5 ‘get Power law’ should be ‘get a power law’; on page 6 ‘data sets where used’ should be ‘data sets were used’; also on page 6 ‘scarping’ should be ‘scraping’ and ‘scraps’ should be ‘scrapes’; on page 8 ‘twits’ should be ‘tweets’; on page 14 ‘additianl’ should be ‘additional’.

Answer: Thank you, and our apologies for this issue. We thoroughly spell-checked the article again. We believe that all spelling mistakes have now been corrected.

Comments: On page 2 the statement that bots are illegal might be overstated, rather I think it is just a violation of terms of service potentially.

Answer: Thank you for this comment. We rewrote the sentence in a more subtle manner to correctly describe the reality that bots are, in many cases, legal.

Reviewer 2

Comment: The paper proposes the analysis of several datasets of Twitter posts to evaluate the statistical differences between political and non-political discussions.

The proposed evaluation starts from the hypothesis that political discussions (as propaganda) generally leverage the repetition of messages and involve a limited number of users when compared with non-political discussions. The authors are able to distinguish between political and non-political datasets by studying the properties of graphs generated with the different datasets. Moreover, the authors propose the construction of a bipartite graph in which each tweet is connected with the words it contains.

The results emerge by studying how closely the graph properties exhibit theoretical power-law distributions and focusing on the evaluated parameters to derive the slope and an additional score called PLMSE (power-law mean square error).

Strong points:

- an attractive and reasonable approach

- a pleasant writing style

Answer: Thank you very much for this positive evaluation. We will do our best to address the weak points.

Comment:

Weak points:

- not convincing experiments

- lacking of crucial details

- limited number of tests

- detection motivation not fully plausible

- typos

Answer: Thank you for this evaluation. We added t-tests to statistically support our claim that the political and non-political cascades differ. We did not perform experiments, but rather analyzed large datasets, some of which include millions of tweets. Altogether, we analyzed networks with over 19 million edges and 3 million nodes. For the analysis that did not include the full cascade network, we analyzed almost 24 million records, from which we drew these results.

This is not a negligible data set, and each cascade topic can comprise many millions of records. The full dataset sizes and the records used for the network and non-network analyses are found in the project's Git repository, in the files "DataSources.xlsx" and "Network_datasets.csv". We addressed the comments and concerns raised by the reviewers.

We added statistical significance tests to estimate the difference between the attributes of the political and non-political cascades, and also added a mathematical explanation of our PLMSE method for numerically estimating a cascade's type.
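As a minimal illustration of this kind of estimation (a sketch only: the array of counts is hypothetical and this is not the authors' exact code), the slope can be fitted by least squares in log-log space, and a PLMSE-style score computed as the mean squared deviation of the data from that fit:

```python
import numpy as np

# Hypothetical repetition counts (e.g., how many tweets each word appears in).
counts = np.array([100, 47, 30, 22, 18, 15, 13, 11, 10, 9])
rank = np.arange(1, len(counts) + 1)

# Fit a line in log-log space; its slope estimates the power-law exponent.
log_x, log_y = np.log10(rank), np.log10(counts)
slope, intercept = np.polyfit(log_x, log_y, 1)

# PLMSE-style score: mean squared deviation of the data from the fitted line.
predicted = slope * log_x + intercept
plmse = np.mean((log_y - predicted) ** 2)
print(round(slope, 3), round(plmse, 4))
```

A roughly Zipf-like count vector, as above, yields a slope near -1 and a small PLMSE; data that deviates from a power law yields a larger PLMSE.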

Typos have been corrected.

Comment: The bipartite graph-based approach is attractive and sounds relevant with respect to the traditional NLP-based ones. However, while the paper's approach is attractive, its overall quality and impact seem limited, and the results appear not entirely sound.

First of all, many important details appear to be overlooked when describing the evaluation methodology. For example, it is unclear how the authors processed the tweets and which words were included in the graphs (what about the stop words and alike?) or which are the details of the operation.

Answer: Thank you for this comment. We added the following paragraph to the article to clearly explain this process.
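For illustration, the tweet-word bipartite construction can be sketched as follows (the tweets and identifiers are hypothetical, and this is not the authors' exact pipeline):

```python
# Hypothetical mini-corpus; the real cascades hold up to millions of tweets.
tweets = {
    "t1": "election fraud claims spread fast",
    "t2": "election results announced today",
    "t3": "flood warning issued for the coast",
}

# Bipartite adjacency: each tweet node links to the word nodes it contains.
tweet_to_words = {tid: set(text.lower().split()) for tid, text in tweets.items()}
word_to_tweets = {}
for tid, words in tweet_to_words.items():
    for w in words:
        word_to_tweets.setdefault(w, set()).add(tid)

# A word's degree is its repetition count across tweets; the distribution
# of these degrees is what the slope analysis is applied to.
degree = {w: len(tids) for w, tids in word_to_tweets.items()}
print(degree["election"])  # appears in t1 and t2 -> 2
```

In the study the degree distribution of the word side of this graph is then binned and fitted as described elsewhere in the paper.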

Comment: "The data was binned prior to the slope estimation". The plots reported in the figures are inconsistent and incomplete: it is unclear why some datasets are used for showing some properties but not for others (for example, USA-2016 and Catalonia-2006 are missing in Figure 3).

Answer: Thank you for this comment. Your observation that “some datasets are used for showing some properties but not for others” in Figure 3 is correct. Figs 2 and 3 each show 6 political and 6 non-political cascade data sets. Figs 4 and 5, in comparison, show 6 disaster cascades but 10 points (not 6) for political cascades. We show only 6 x 2 images in Figs 2 and 3 simply for ease of presentation, and because we found only 6 disaster datasets of large enough size.

Furthermore, based on Figs 2 and 3 alone, one cannot truly conclude that these images are different. They are rather a demonstration that illustrates the phenomenon and the deviation, mainly on the upper-left side in the political cascades. The images show the conceptual idea but do not quantify it. This quantification is shown in Figs 4 and 5, in contrast, where each cascade is marked by only a single point (rather than an entire graph). Thus, we can show all data sets that were available to us: 10 political cascades and 6 disaster cascades.

Following the reviewer's comment, we corrected the captions of Figs 2 and 3 and emphasized that they show only a part of the datasets.

In addition, following the reviewer's comments, we added to the SI section a table containing the data points for all 10 political cascades and 6 disaster cascades.

Comment: Another questionable point is the claim on page 7, where considering "words used in the tweet regardless if they are hashtags or not" would remove the bias that "political users might tend to use hashtags more often than regular users". This way, political discussions will include the words related to the tweet *and* the hashtags downgraded to regular words.

Answer: Thank you for this correct comment. When analyzing the tweets, we first removed stop words. We performed a standard lemmatization, which includes this word removal. Thus, our analysis of the bipartite graphs does not include words such as 'and', 'or', etc. We used the NLTK and scikit-learn Python packages, which already support this.
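A minimal sketch of such filtering (with an illustrative, much-reduced stop-word list rather than the full lists shipped with NLTK or scikit-learn):

```python
# Illustrative stop-word set; NLTK and scikit-learn provide much larger
# built-in English lists used in the actual pipeline.
STOP_WORDS = {"and", "or", "the", "a", "an", "of", "to", "is", "in", "for"}

def clean_tokens(text):
    """Lowercase, split, and drop stop words before graph construction."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(clean_tokens("The spread of propaganda in a network"))
# -> ['spread', 'propaganda', 'network']
```

The surviving tokens are the word nodes entered into the bipartite graph.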

Comment: The use of disaster-related datasets is a reasonable sample. However, it is only one of the possible types of "cascades" found on Twitter. How do the results differ when considering discussions about other events, like the Super Bowl, long-awaited movies, or World Cup exhibitions?

Answer: Thank you for this comment. We understand the point of comparing additional cascades to the 16 available ones. We added the COVID-19 frequent-bigrams cascade to the SI section and found that, according to its slope (-0.648), the COVID-19 cascade is more similar to a political cascade than to a disaster cascade. We plan to perform such an analysis on the coming USA election, but that analysis is out of the scope of the current work.

Comment: Furthermore, motivating the proposed methodology to detect political discussion sounds relatively trivial, considering that the presence of given hashtags proves that a tweet is political.

Answer: Thank you for this comment. Indeed, finding political topics is trivial. Our method does not merely identify political discussions (which clearly does not require much effort), but rather detects interventions in a discussion that are not "organic", in the sense that the distributions deviate from those of an organic discussion. We use politics because politics has a clear goal: to push an agenda. In other cases and topics this is not clear. For example, in health-related topics, it is not always clear whether a discussion is organic. Furthermore, if we do not know in advance whether a discussion is political, for example Global Warming or Immigration, our method permits us to determine, for any given topic, whether it is more organic (an authentic users' discussion) or more political (specific stakeholders aiming to spread their agenda). One should note that we did not inspect the textual meaning or content of the hashtag itself, nor whether it was a hashtag or a simple word. We only look at the distributions and their slope parameters.

The method does not pretend to distinguish solely political propaganda, but mainly an amplification of some words, which appears in political messages but may also appear in commercial communication.

Indeed, we mention in the introduction on page 2 that “Pre-election political messages tend to be more aggressive, populistic and in general, use similar techniques as the ones used in commercial communication. These commercial communication techniques repetitively broadcast a few well-defined…”

Attachment

Submitted filename: Response to reviewers.pdf

Decision Letter 1

Gilad Ravid

26 Jul 2024

PONE-D-23-25017R1

Signals of Propaganda - Detecting and Estimating Political Influences in Information Cascades in Social Networks

PLOS ONE

Dear Dr. Sela,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Sep 09 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Gilad Ravid, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

The authors improved the paper and answered most of the reviewers' concerns. Reviewer 1 is satisfied with the results, but Reviewer 2 did not respond to requests to review the paper again.

Some minor issues still need to be resolved before we publish the paper.

a. the paper positions itself as dealing with propaganda and cascades. As they appear in the paper, the two terms are not used according to their true meaning. As Twitter is a crowded, uncontrolled system, it is not precise to name information distribution as propaganda (neither good nor bad). This is even stronger when the authors examine hashtags, which, as they claim, are keywords that describe the tweet. Hence, if many use the same hashtag, it is not a sign of propaganda. For example, many use the hashtag "#YesWeCan" to indicate their support for Obama's campaign, but it is hard to call it a sign of propaganda. Also, I have trouble finding a relationship between the users' power-law distribution and propaganda. As regards the cascade term, the research does not look at the cascade of information; a cascade happens when someone transfers ideas or information he hears from someone else. There is no indication of transferability in the study.

The authors use the term "political cascade" to denote the political networks. Politicians can cascade information, ideas, and knowledge, but one cannot cascade politics.

Line 242: should be a curve fit line (you miss the "v").

Line 252 vs. line 267: is Yi an index or a frequency? (I think the latter.)

Line 273: Please use the standard notation in the equation; the predicted value is Y hat (^), and the upper bar is reserved for the mean.

Line 280: both sides of the equations are exactly the same.

Throughout the article and equations, the authors interchangeably treat alpha as the slope and as minus the slope. For example, line 333 states that alpha equals -763 (you are missing the decimal point; it should be -0.763), but on line 276 the line equation considers alpha as positive (the minus is part of the equation).

Please add comments and refer to some prior works on the differences and problems of fitting a power law directly vs. fitting to the log-log of the values. Goldstein, M. L., Morris, S. A., & Yen, G. G. (2004). Problems with fitting to the power-law distribution. The European Physical Journal B-Condensed Matter and Complex Systems, 41, 255-258, is one candidate reference (the problem relates to the changing distribution of the log(error) you introduce).

Line 254 vs. line 312: did you perform least-squares error fitting or maximum-likelihood fitting?

Line 306: In your answers to the reviewers, you stated that you didn't perform betweenness calculations; in the paper, you did.

Line 337 vs. Fig 2, Bangladesh 2019 slope is 0.976

Line 340 vs. caption of Fig 2: what is the mean, 0.97 or 0.938?

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: (No Response)

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have addressed my concerns.

In particular they added a section on their method of PLMSE better explaining the model. They also added statistical tests that are more convincing for the segregation between the datasets.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2025 Jan 30;20(1):e0309688. doi: 10.1371/journal.pone.0309688.r004

Author response to Decision Letter 1


14 Aug 2024

A detailed response to reviewers' document has been uploaded to the system.

Attachment

Submitted filename: Response to Reviewers.14.8.2024.docx

pone.0309688.s010.docx (1.3MB, docx)

Decision Letter 2

Gilad Ravid

16 Aug 2024

Signals of Propaganda - Detecting and Estimating Political Influences in Information Cascades in Social Networks

PONE-D-23-25017R2

Dear Dr. Sela,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager at Editorial Manager® and clicking the ‘Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Gilad Ravid, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

All comments addressed appropriately

Reviewers' comments:

Acceptance letter

Gilad Ravid

7 Nov 2024

PONE-D-23-25017R2

PLOS ONE

Dear Dr. Sela,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission,

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Gilad Ravid

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Data sets used for hashtag analysis.

    (XLSX)

    pone.0309688.s001.xlsx (9.9KB, xlsx)
    S2 Table. Data sets used for user analysis.

    (XLSX)

    pone.0309688.s002.xlsx (9.9KB, xlsx)
    S3 Table. Data sets used for bipartite graphs network analysis.

    (XLSX)

    pone.0309688.s003.xlsx (10KB, xlsx)
    S4 Table. Hashtags data points.

    (XLSX)

    pone.0309688.s004.xlsx (9.7KB, xlsx)
    S5 Table. Users data points.

    (XLSX)

    pone.0309688.s005.xlsx (9.8KB, xlsx)
    S6 Table. Bipartite network data points.

    (XLSX)

    pone.0309688.s006.xlsx (10.2KB, xlsx)
    S1 Fig. Power law slope of COVID-19 bigrams (Banda et al, 2020).

    The slope is more similar to the political cascades than it is to the disaster cascades.

    (TIFF)

    pone.0309688.s007.tiff (1.8MB, tiff)
    S2 Fig. Additional slope images (complementary to Figs 2 and 3) for hashtags of Egypt and Iran politics.

    (TIFF)

    pone.0309688.s008.tiff (1.2MB, tiff)
    Attachment

    Submitted filename: Response to reviewers.pdf

    Attachment

    Submitted filename: Response to Reviewers.14.8.2024.docx

    pone.0309688.s010.docx (1.3MB, docx)

    Data Availability Statement

    All the code is available in the GitHub repository of the study: https://zenodo.org/records/10805274 or https://github.com/OmerNeter/tweets_politics_project. The data cannot be fully shared publicly because of its size (over 200 MB even when zipped) and because we are not its owners, having downloaded it from open repositories. We therefore added links to the data set locations in the project's Git repositories. We also added several datasets to enable a smooth run of the code; these additional datasets are reachable via links in the repository to (1) Kaggle, (2) Figshare and (3) the Digital Library data repository. In the root of the Git repository, we added a table named "DataSources.xlsx" that specifies, for each data source, the link for downloading it. Additionally, statistical properties of each network data source (#clusters, #nodes, #edges, slope) are found in the "Network_dataset.csv" file in the root of the repository. The minimal data set underlying the results described in our manuscript can be found in the GitHub repository, https://github.com/OmerNeter/tweets_politics_project, in the files ‘Fig 4 A-D.xlsx’ and ‘Fig 5.xlsx’.

