A model for the Twitter sentiment curve

Giacomo Aletti; Irene Crimaldi; Fabio Saracco

doi:10.1371/journal.pone.0249634

PLoS One. 2021 Apr 15;16(4):e0249634. doi: 10.1371/journal.pone.0249634

Giacomo Aletti ¹, Irene Crimaldi ², Fabio Saracco ²,*

Editor: Haoran Xie³

PMCID: PMC8049311 PMID: 33857207

Abstract

Twitter is among the most used online platforms for political communication, due to the concision of its messages (which is particularly suitable for political slogans) and their quick diffusion. Especially when the topic stimulates the emotionality of users, content on Twitter is shared with extreme speed, and thus studying tweet sentiment is of utmost importance to predict the evolution of the discussions and the register of the related narratives. In this article, we present a model able to reproduce the dynamics of the sentiments of tweets related to specific topics and periods and to provide a prediction of the sentiment of future posts based on the observed past. The model is a recent variant of the Pólya urn, introduced and studied in Aletti and Crimaldi (2019, 2020), which is characterized by a “local” reinforcement, i.e. a reinforcement mechanism mainly based on the most recent observations, and by a random persistent fluctuation of the predictive mean. In particular, this latter feature is capable of capturing the trend fluctuations in the sentiment curve. While the proposed model is extremely general and may also be employed in other contexts, it has been tested on several Twitter data sets and showed better performance than the standard Pólya urn model. Moreover, the different performances on different data sets highlight different emotional sensitivities with respect to a public event.

1 Introduction

In the last few years, the internet has become the main source of news for citizens both in the EU [1] and in the USA [2]. Such a rapid change in the media system has created a symmetric change in the way news is delivered: before the diffusion of the web, information was intermediated by journals, newspapers, radio and TV newscasts, which represented the authority, being publicly responsible for the diffusion of reliable news. Nowadays, such intermediation is no longer present: every blog or account on Facebook or Twitter assumes truthfulness just for existing online [3–6]. Due to this abrupt change of paradigm in the consumption of news, we observe a great increase in the diffusion of misinformation [7–9], which appears on the web via the use of automated [10–16] or genuine accounts [4, 16–20]. It has been observed that the diffusion of disinformation or misinformation campaigns leans on the emotionality of users [3, 4, 6, 21].

Twitter is one of the most famous microblogging services, where people freely express their views and feelings in short messages, called tweets [22]. Twitter is known to be used especially for political communication [23], due to the limited number of characters, perfectly suitable for political slogans, and to the quick sharing of messages. Due to the availability of its data via the official API, it represents an extremely rich resource of “spontaneous emotional information” [24]. Sentiment analysis, also known as opinion mining, is a collection of techniques for automatically detecting the positive or negative connotation of texts. An overview of the latest tools, updates and open issues in sentiment analysis can be found in [25–27] (see also the references therein). Some examples of applications where predictions are formulated based on the sentiment extracted from online texts are provided in [28–35]. In [36], sentiment analysis is used to investigate emotion transmission in e-communities, while in [37] it is employed to investigate the interplay between macroscopic socio-economic, political or cultural events and public mood trends, showing that these events have a significant and immediate effect on various aspects of public mood. Ref. [24] provides a matrix-factorization method to predict individuals’ opinions toward specific topics that they had not directly expressed. In [38], the authors consider the sentiment curve of Twitter posts over time in order to infer the causes of sentiment variations, leveraging the idea that the emerging topics discussed in the variation period could be highly related to the reasons behind the variations. In [39], the authors present data prediction as a process based on two different levels of granularity: i) a fine-grained analysis to make tweet-level predictions on various aspects, such as sentiment, topics, volume, location and time-frame, and ii) a coarse-grained analysis to predict the outcome of a real-world event, by aggregating and combining the fine-grained predictions. With respect to this classification, the present work can be placed in the stream of literature regarding the fine-grained analysis to model/predict the sentiment of Twitter posts.

While an important body of research targets the issue of predicting information cascades [40–47], to the best of our knowledge, there are no works that provide models for the evolution of Twitter sentiment. We aim at filling this gap, presenting a model that is able to reproduce the sentiment curve of the tweets related to specific topics and periods and to provide a prediction of the sentiment of future posts based on the observed past. We achieve this purpose by employing a recent variant of the Pólya urn, introduced in [48] and called the Rescaled Pólya (RP) urn. In brief, the RP urn model differs from the standard Pólya urn in the presence of a “local” reinforcement, i.e. elements that are recently observed have a greater impact on the near future and may be identified as the “fashion” of the moment. In online social network applications, this local reinforcement aims at representing the persistence of an emotional response to a public event, capturing the phenomenon observed in [3]. Moreover, it is able to correctly reproduce the sentiment dynamics of the tweets, outperforming the standard Pólya urn model, as we will see, on several different data sets. Its prediction ability is also quite high. It is important to note that we also include a delay in information: indeed, it is plausible that, when the user decides to write the tweet posted at time-step n + 1, he/she only knows the previous tweets until a certain time-step t(n) < n.

Finally, we underline that the proposed model may be also employed in other contexts.

The rest of the paper is structured as follows. In Section 2 we present the model: after introducing the standard Pólya model in Subsection 2.1, in Subsection 2.2 we formally describe the Rescaled Pólya urn model in general; we then focus on the case with two colors and, next to the general model (Complete model), we identify two special cases (“Only fashion” model and “No fashion” model). Finally, in Subsection 2.3, we explain how we include a delay in information. In Section 3, we describe the considered datasets and illustrate the performed analysis and the obtained results. In Section 4 we comment on the results and draw our conclusions. The paper is complemented by an appendix regarding the evolution of the estimated model parameters and additional analyses.

2 Model

2.1 Standard Pólya urn

The standard Pólya urn (see [49–51]) is a stochastic model driven by a reinforcement mechanism (also known as the “rich get richer” principle): the probability that a given event occurs increases with the number of times the same event occurred in the past. This rule is a key feature governing the dynamics of many biological, economic and social systems (see, e.g. [51]) and it seems plausible that it also plays a role in the sentiment dynamics of Twitter posts, as the emotional state of an individual influences the emotions of others [36, 52]. The Pólya urn model has been widely studied and generalized (some recent variants can be found in [48, 53–65]) and in its simplest form, with c colors, works as follows. An urn contains $N_{0 i}$ balls of color i, for i = 1, …, c, and, at each discrete time-step, a ball is extracted from the urn and then returned to the urn together with α > 0 additional balls of the same color. Therefore, if we denote by $N_{n i}$ the number of balls of color i in the urn at time-step n, we have

\begin{matrix} N_{n i} = N_{n - 1\, i} + α ξ_{n i} = N_{0 i} + α \sum_{h = 1}^{n} ξ_{h i} \quad \text{for } n \geq 1, \end{matrix}

where ξ_ni = 1 if the extracted ball at time-step n is of color i, and ξ_ni = 0 otherwise. The parameter α regulates the reinforcement mechanism: the greater α, the greater the dependence of N_ni on $\sum_{h = 1}^{n} ξ_{h i}$ .
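As an illustration only (it is not part of the original analysis), the update rule above can be simulated with a few lines of Python. The function name `simulate_polya` and the chosen parameter values are hypothetical; the sketch assumes that a ball of color i is drawn with probability proportional to the current number of balls of that color.

```python
import random


def simulate_polya(n0, alpha, steps, seed=0):
    """Simulate a standard Polya urn with c = len(n0) colors.

    n0    : initial numbers of balls per color (N_0i)
    alpha : number of balls of the drawn color added after each extraction
    Returns the list of drawn color indices and the final composition.
    """
    rng = random.Random(seed)
    balls = list(n0)
    draws = []
    for _ in range(steps):
        # Draw color i with probability N_{n-1,i} / sum_j N_{n-1,j}.
        i = rng.choices(range(len(balls)), weights=balls)[0]
        # Reinforcement: N_ni = N_{n-1,i} + alpha * xi_ni.
        balls[i] += alpha
        draws.append(i)
    return draws, balls


# Illustrative values (assumptions, not taken from the paper).
draws, balls = simulate_polya(n0=[1.0, 1.0], alpha=2.0, steps=1000)
```

The final composition equals the initial one plus α times the number of times each color was drawn, which is exactly the second equality in the displayed formula.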

2.2 Rescaled Pólya (RP) urn

The “Rescaled” Pólya (RP) urn model, introduced in [48], is characterized by the introduction of the parameter β, together with the initial parameters (b_0i)_{i = 1, …, c} and (B_0i)_{i = 1, …, c}, next to the parameter α of the original model, so that

\begin{matrix} N_{n i} = b_{0 i} + B_{n i} \quad \text{with} \quad B_{n i} = β B_{n - 1\, i} + α ξ_{n i}, \quad n \geq 1 . \end{matrix}

Therefore, the urn initially contains $b_{0 i} + B_{0 i} > 0$ balls of color i and the parameter β ≥ 0, together with α > 0, regulates the reinforcement mechanism. More precisely, the term $β B_{n - 1\, i}$ links $N_{n i}$ to the “configuration” at time-step n − 1 through the “scaling” parameter β, and the term $α ξ_{n i}$ links $N_{n i}$ to the outcome of the extraction at time-step n through the parameter α. Note that the case β = 1 corresponds to the standard Pólya urn with an initial number $N_{0 i} = b_{0 i} + B_{0 i}$ of balls of color i. When β ∈ [0, 1), this variant of the Pólya urn is characterized by a “local” reinforcement, i.e. a reinforcement mechanism mainly based on the most recent observations, and by a random persistent fluctuation of the predictive mean $ψ_{n i} = P (ξ_{n + 1\, i} = 1 \mid \text{“past”})$. As we will show, this latter feature is capable of capturing the trend fluctuations in the sentiment curve of Twitter posts (see Figs 1–6).

Figs 1–6. In each panel, the yellow line is the cubic spline smoothing of the time series of the observed tweets $ξ_{n + 1}$, together with the default confidence interval (gray); the red line represents the cubic spline smoothing of the time series of the estimated predictive means ${\hat{ψ}}_{n}$ (defined in Subsec. 3.2), obtained with the complete RP model; the black and the blue lines provide similar approximations obtained with the other models (black = Only fashion RP model, blue = Standard Pólya model). In each panel, the smoothing is obtained with a given number of nodes: k = 3 (top left panel), 5 (top middle panel), 10 (top right panel), 20 (bottom left panel), 30 (bottom middle panel), 50 (bottom right panel).

More formally, given a vector $x = {(x_{1}, \dots, x_{c})}^{⊤} \in \mathbb{R}^{c}$ , we define $| x | = \sum_{i = 1}^{c} | x_{i} |$ . Moreover, we set $b_{0} = {(b_{0 1}, \dots, b_{0 c})}^{⊤}$ and $B_{0} = {(B_{0 1}, \dots, B_{0 c})}^{⊤}$ , we assume $| b_{0} | > 0$ and we define $p_{0} = \frac{b_{0}}{| b_{0} |}$ . At each discrete time-step n + 1 ≥ 1, a ball is drawn at random from the urn, obtaining the random vector $ξ_{n + 1} = {(ξ_{n + 1\, 1}, \dots, ξ_{n + 1\, c})}^{⊤}$ defined as

\begin{matrix} ξ_{n + 1\, i} = \begin{cases} 1 & \text{when the extracted ball at time-step } n + 1 \text{ is of color } i \\ 0 & \text{otherwise} . \end{cases} \end{matrix}

The number of balls in the urn is then updated as follows:

\begin{matrix} N_{n + 1} = b_{0} + B_{n + 1} \quad \text{with} \quad B_{n + 1} = β B_{n} + α ξ_{n + 1}, \end{matrix}

(1)

which gives (since |ξ_n+1| = 1)

\begin{matrix} | B_{n + 1} | = β | B_{n} | + α . \end{matrix}

(2)

Therefore, setting $r_{n}^{*} = | N_{n} | = | b_{0} | + | B_{n} |$ , we get

\begin{matrix} r_{n + 1}^{*} = r_{n}^{*} + (β - 1) | B_{n} | + α . \end{matrix}

(3)

Moreover, denoting by $F = {(F_{n})}_{n \geq 0}$ the filtration representing the information along time-steps (formally, this means to set $F_{0}$ equal to the trivial σ-field and $F_{n} = σ (ξ_{1}, \dots, ξ_{n})$ for n ≥ 1), the conditional probabilities ψ_n = (ψ_n1, …, ψ_nc)^⊤ of the extraction process, also called predictive means, are

\begin{matrix} ψ_{n i} = E [ξ_{n + 1\, i} | F_{n}] = P (ξ_{n + 1\, i} = 1 | F_{n}) = \frac{N_{n i}}{| N_{n} |} = \frac{b_{0 i} + B_{n i}}{r_{n}^{*}}, \quad i = 1, \dots, c, \quad n \geq 0 . \end{matrix}

(4)

This urn model has been studied in [48, 53]. All the mathematical proofs and details can be found in these papers.
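To make the dynamics (1)–(4) concrete, the following minimal Python sketch simulates an RP urn and records the predictive means. It is only an illustration under stated assumptions: the function name `simulate_rp_urn` and the parameter values are hypothetical and are not taken from [48, 53].

```python
import random


def simulate_rp_urn(b0, B0, alpha, beta, steps, seed=0):
    """Simulate the Rescaled Polya urn defined by (1) and (4).

    b0, B0 : lists with the initial parameters b_0i and B_0i
    Returns the drawn colors, the predictive means psi_0, ..., psi_{steps-1}
    and the final vector B_steps.
    """
    rng = random.Random(seed)
    B = list(B0)
    draws, psi_path = [], []
    for _ in range(steps):
        N = [b + x for b, x in zip(b0, B)]   # N_n = b_0 + B_n
        total = sum(N)                       # r*_n = |N_n|
        psi_path.append([x / total for x in N])  # psi_n as in (4)
        i = rng.choices(range(len(N)), weights=N)[0]
        B = [beta * x for x in B]            # rescaling by beta
        B[i] += alpha                        # reinforcement of the drawn color
        draws.append(i)
    return draws, psi_path, B


# Illustrative values (assumptions, not estimates from the data).
draws, psi_path, B = simulate_rp_urn(
    b0=[1.0, 1.0], B0=[1.0, 1.0], alpha=1.0, beta=0.9, steps=200
)
```

With β < 1 the total |B_n| stays bounded, in agreement with (2), whereas setting `beta=1.0` reproduces the standard Pólya urn.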

2.2.1 Two colors (c = 2)

With two colors, the only quantities of interest are ξ_n = ξ_n1 = 1 − ξ_n2 and ψ_n = ψ_n1 = 1 − ψ_n2. In the sequel, we consider the RP urn model with β = 1 (i.e. the standard Pólya urn model) and with β < 1. In the first case, we have

\begin{matrix} ψ_{n} = \frac{N_{0 1} + α \sum_{h = 1}^{n} ξ_{h}}{| N_{0} | + α n} . \end{matrix}

In the second case, by (1), (2), (3) and (4), using $\sum_{m = 0}^{n - 1} x^{m} = (1 - x^{n}) / (1 - x)$ , we obtain

\begin{matrix} r_{n}^{*} = | b_{0} | + \frac{α}{1 - β} + β^{n} (| B_{0} | - \frac{α}{1 - β}) ⟶ r^{*} = | b_{0} | + \frac{α}{1 - β} \end{matrix}

and

\begin{matrix} ψ_{n} = \frac{b_{0 1} + β^{n} B_{0 1} + α \sum_{h = 1}^{n} β^{n - h} ξ_{h}}{| b_{0} | + \frac{α}{1 - β} + β^{n} (| B_{0} | - \frac{α}{1 - β})} . \end{matrix}

Since β < 1, the dependence of ψ_n on ξ_h increases exponentially with h, because of the factor $β^{n - h}$ , and so the main contribution is given by the most recent extractions. We refer to this phenomenon as “local” reinforcement. The case β = 0 is an extreme case, for which ψ_n depends only on the last extraction ξ_n. Note that, when β = 1, i.e. the case of the standard Pólya urn, all the past observations ξ_h contribute equally to ψ_n, with a weight equal to α. This different dependence on the past leads to a different behaviour of ψ_n along time-steps (see [48]): in the standard Pólya urn, the process (ψ_n) asymptotically stabilizes, converging almost surely toward a random variable, while in the RP urn, the process (ψ_n) persistently fluctuates (see Figs 1–6).
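For the reader's convenience, the closed forms used for β < 1 can be recovered by unrolling recursions (1) and (2). The following derivation is only a sketch and uses no symbols beyond those already defined:

```latex
\begin{aligned}
|B_n| &= \beta |B_{n-1}| + \alpha
       = \beta^{n} |B_0| + \alpha \sum_{m=0}^{n-1} \beta^{m}
       = \beta^{n} |B_0| + \alpha \, \frac{1-\beta^{n}}{1-\beta}, \\
r_n^{*} &= |b_0| + |B_n|
       = |b_0| + \frac{\alpha}{1-\beta}
         + \beta^{n} \Bigl( |B_0| - \frac{\alpha}{1-\beta} \Bigr), \\
B_{n 1} &= \beta B_{n-1\, 1} + \alpha \xi_{n}
       = \beta^{n} B_{0 1} + \alpha \sum_{h=1}^{n} \beta^{n-h} \xi_{h}, \\
\psi_n &= \frac{b_{0 1} + B_{n 1}}{r_n^{*}} .
\end{aligned}
```

Since $β^{n} \to 0$, the second line gives the limit $r^{*} = | b_{0} | + α / (1 - β)$, and the third line shows that the observation ξ_h enters ψ_n with weight proportional to $β^{n - h}$.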

If we set

\begin{matrix} p_{0} = p_{0 1} = \frac{b_{0 1}}{| b_{0} |}, (1 - γ^{*}) = \frac{| b_{0} |}{r^{*}}, {\tilde{B}}_{n} = \frac{B_{n 1}}{| B_{n} |}, \end{matrix}

we get, for large n,

\begin{matrix} \begin{matrix} ψ_{n + 1} & = \frac{b_{0 1}}{r_{n + 1}^{*}} + \frac{B_{n + 1\, 1}}{r_{n + 1}^{*}} = \frac{| b_{0} |}{r_{n + 1}^{*}} p_{0} + \frac{| B_{n + 1} |}{r_{n + 1}^{*}} {\tilde{B}}_{n + 1} \\ & = \frac{| b_{0} |}{r_{n + 1}^{*}} p_{0} + \frac{r_{n + 1}^{*} - | b_{0} |}{r_{n + 1}^{*}} {\tilde{B}}_{n + 1} \\ & \sim \frac{| b_{0} |}{r^{*}} p_{0} + \frac{r^{*} - | b_{0} |}{r^{*}} {\tilde{B}}_{n + 1} = (1 - γ^{*}) p_{0} + γ^{*} {\tilde{B}}_{n + 1} \end{matrix} \end{matrix}

and

\begin{matrix} \begin{matrix} {\tilde{B}}_{n + 1} & = \frac{B_{n + 1\, 1}}{| B_{n + 1} |} = \frac{β}{| B_{n + 1} |} B_{n 1} + \frac{α}{| B_{n + 1} |} ξ_{n + 1} \\ & = β \frac{r_{n}^{*} - | b_{0} |}{r_{n + 1}^{*} - | b_{0} |} {\tilde{B}}_{n} + \frac{α}{r_{n + 1}^{*} - | b_{0} |} ξ_{n + 1} \\ & \sim β {\tilde{B}}_{n} + \frac{α}{r^{*} - | b_{0} |} ξ_{n + 1} = β {\tilde{B}}_{n} + (1 - β) ξ_{n + 1} . \end{matrix} \end{matrix}

Summing up, for large n the model dynamics can be approximated by

\begin{matrix} ψ_{n + 1} = (1 - γ^{*}) p_{0} + γ^{*} {\tilde{B}}_{n + 1}, {\tilde{B}}_{n + 1} = β {\tilde{B}}_{n} + (1 - β) ξ_{n + 1}, \end{matrix}

where $p_{0}, γ^{*}, β, {\tilde{B}}_{0}$ are the parameters. Note that α does not appear among the parameters of the above approximated dynamics, but it is included in the new parameter γ*. Moreover, the contribution of ${\tilde{B}}_{0}$ becomes negligible exponentially fast, because we have ${\tilde{B}}_{n} = β^{n} {\tilde{B}}_{0} + (1 - β) \sum_{h = 1}^{n} β^{n - h} ξ_{h}$ , with β < 1. Therefore, the fundamental parameters are p₀, γ* and β: p₀ is a deterministic component, γ* tunes the weight in the predictive mean $ψ_{n + 1}$ of the random “fluctuation” component ${\tilde{B}}_{n + 1}$ with respect to the deterministic one, and β regulates the dependence of the present state ${\tilde{B}}_{n + 1}$ on the previous state ${\tilde{B}}_{n}$ and on the present observation $ξ_{n + 1}$ . We refer to ${({\tilde{B}}_{n})}_{n}$ as the “fashion” process, since it reproduces the trend variations of the considered phenomenon (in our case, the sentiment of Twitter posts). In the applications below, we consider the following cases:

Complete RP model: The three parameters θ = (p₀, γ*, β) are free to vary.
“Only Fashion” RP model: γ* = 1 (and p₀ = 0 irrelevant). This means that the predictive mean is not driven by any deterministic component, but it coincides with the fashion process. The free parameter is given by θ = β.
“No Fashion” RP model: γ* = 0 (and β = 0 irrelevant). In this case ψ_n is equal to the constant p₀ and, consequently, the free parameter is given by θ = p₀.
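The approximated dynamics and the three cases listed above can be sketched in Python as follows. This is a minimal illustration, not the estimation procedure used in the paper: the function name `predictive_means`, the starting value of the fashion process and the parameter values are all assumptions made for the example.

```python
def predictive_means(xi, p0, gamma, beta, b_tilde0=0.5):
    """Return psi_1, ..., psi_n for the 0/1 observations xi_1, ..., xi_n.

    Implements psi_{n+1} = (1 - gamma) * p0 + gamma * B~_{n+1} with
    B~_{n+1} = beta * B~_n + (1 - beta) * xi_{n+1}.
    b_tilde0 is an assumed starting value of the fashion process.
    """
    b_tilde = b_tilde0
    psi = []
    for x in xi:
        b_tilde = beta * b_tilde + (1.0 - beta) * x  # fashion process update
        psi.append((1.0 - gamma) * p0 + gamma * b_tilde)
    return psi


# Hypothetical sequence of observations (1 = first color, 0 = second color).
xi = [1, 1, 0, 1, 0, 0, 0, 1]

complete = predictive_means(xi, p0=0.5, gamma=0.6, beta=0.8)      # Complete RP model
only_fashion = predictive_means(xi, p0=0.0, gamma=1.0, beta=0.8)  # gamma* = 1
no_fashion = predictive_means(xi, p0=0.5, gamma=0.0, beta=0.0)    # gamma* = 0
```

In the “No Fashion” case the output is constant and equal to p₀, while in the “Only Fashion” case it coincides with the fashion process itself.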

2.3 Model with delay

In applications, the extractions from the urn typically correspond to actions performed by agents. Therefore, it is plausible that there is a delay in information, in the sense that, when the agent decides to make the action that will appear at time-step n + 1, he/she only knows what happened until a certain time-step t(n) < n, i.e. the actions at time-steps 1, …, t(n). For instance, in our framework, the actions are the tweets and so it is plausible that, when the author of the tweet posted at time-step n + 1 is writing, he/she only knows the previous tweets until a certain time-step t(n) < n. In other words, we can imagine that an agent, after reading the tweets posted until time-step t(n), starts to write his/her tweet and posts it at time-step n + 1. Therefore, tweet n + 1 is not affected by tweets posted at time-steps t(n) + 1, …, n. When this is the case, the predictive means for action n + 1 are given by the composition of the urn at time-step t(n). In particular, if the number of colors is c = 2 and we denote by $I_{n}$ the information the agent has when performing action n + 1 (formally, $I_{0}$ equal to the trivial σ-field and $I_{n} = σ (ξ_{1}, \dots, ξ_{t (n)})$ ), we have

\begin{matrix} {\hat{ψ}}_{n} = E [ξ_{n + 1} | I_{n}] = P (ξ_{n + 1} = 1 | I_{n}) = \frac{N_{t (n) 1}}{N_{t (n) 1} + N_{t (n) 2}} = ψ_{t (n)} . \end{matrix}

(5)

Assuming that we know the real time at which the actions appeared (i.e., in our framework, the real time at which the posts were posted), a possible way to define t(n) is the following. Fix a value D > 0, divide (real) time into blocks of length D (choose D so that each block contains at least one action), define j(n + 1) as the index of the time block containing action n + 1 and set

\begin{matrix} t (n) = \max \{ t \in \mathbb{N} : j (t) \leq {(j (n + 1) - 2)}_{+} \} . \end{matrix}

It follows that, for every action that appeared in a certain time block j, the missing information consists of the actions that appeared in the immediately preceding time block (i.e. block j − 1) plus the preceding actions of the same block. As a consequence, the quantity D is a lower bound for the delay and 2D an upper bound: agents lose at least D units of time and not more than 2D units of time.

In the following, we refer to this variant of the RP urn as “RP urn model with delay”.
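The definition of t(n) can be illustrated with a short Python sketch. It relies on conventions that are assumptions of this example and are not stated in the text: time blocks are indexed from 1, the first block starts at the first posting time, and t(n) = 0 means that the agent knows no previous action. The function name `delayed_indices` is hypothetical.

```python
import bisect


def delayed_indices(times, D):
    """Compute t(n) for every action from sorted posting times.

    times : sorted real times (e.g. in seconds) of actions 1, 2, ...
    D     : length of the time blocks
    Returns a list `known` where known[n] = t(n) is the number of earlier
    actions known to the author of action n + 1 (0 = none).
    """
    t0 = times[0]
    # Block index j(k) >= 1 of each action (assumed convention).
    blocks = [int((s - t0) // D) + 1 for s in times]
    known = []
    for j in blocks:
        limit = max(j - 2, 0)  # (j(n + 1) - 2)_+
        # Largest t with j(t) <= limit; blocks is non-decreasing.
        known.append(bisect.bisect_right(blocks, limit))
    return known


# Hypothetical posting times in seconds, with blocks of D = 60 seconds.
known = delayed_indices([0, 10, 70, 130, 200, 250], D=60)
```

In this example the action posted at time 130 lies in block 3, so its author only knows the two actions of block 1.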

3 Results

3.1 Data

Data have been collected from the Twitter platform, using the official API to stream the exchange of messages on several topics. In the following, the various datasets are described in more detail:

Italy, Migration debate

Data were collected through the Filter API from the 23rd of January to the 22nd of February 2019 and targeted the Italian debate on migration. Data were previously analysed in [16]. The dataset includes information on whether each user is automated or not (BOT or not). The embedded BOT detection algorithm is a lightweight version of the classifier proposed in [10]; more details on the dataset can be found in [16].

Italy, 10 days of traffic

The dataset collects the entire traffic, compatibly with the Filter API sampling, of messages in Italian from the 1st to the 10th of September 2019: the keywords used for the query were the Italian vowels, in order to collect all messages that may contain some word. The dataset includes information on whether each user is automated or not (BOT or not). The BOT detection algorithm used was developed in [14].

Italy, COVID-19 epidemic

The dataset covers the period from February 21st to April 20th 2020, including tweets in the Italian language, and was previously analysed in [20]. The keywords used for the query are related to the COVID-19 epidemic; more details can be found in the original reference. The dataset includes information on whether each account is automated or not (BOT or not), detected using the algorithm developed in [14].

For every message, the relative sentiment was calculated using the polyglot Python module developed in [66], which provides a numerical value v ∈ [−1, 1] for the sentiment. We fix a threshold T = 0.35 and classify tweets with v > T as having positive sentiment and tweets with v < −T as having negative sentiment. We discard tweets with a value v ∈ [−T, T]. (There is no particular reason for our choice of the value of T: we take the value 0.35 only because it divides the interval [−1, 1] into three parts of almost the same length. In Appendix, Sec. B, we show the results for other values of the threshold T.)
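The thresholding step can be sketched in Python as follows. The sketch takes the sentiment values v as given and does not call the polyglot module; the function name `classify` and the encoding 1 = positive, 0 = negative are assumptions made for illustration.

```python
def classify(scores, T=0.35):
    """Map sentiment values v in [-1, 1] to binary observations.

    Returns 1 for tweets with v > T (positive) and 0 for tweets with
    v < -T (negative); tweets with v in [-T, T] are discarded.
    """
    xi = []
    for v in scores:
        if v > T:
            xi.append(1)
        elif v < -T:
            xi.append(0)
        # Values in [-T, T] fall through and are dropped.
    return xi


# Hypothetical sentiment values.
xi = classify([0.9, -0.5, 0.1, 0.35, -0.35, -1.0, 0.36])
```

Note that the boundary values v = T and v = −T are discarded, consistently with the closed interval [−T, T].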

Tables 1–3 show some descriptives of the samples obtained with T = 0.35:

Table 1. “Migration” sample: Descriptives of the sample obtained with T = 0.35.

Migration	Entire	Only BOTs’ posts
Posts	367367	4124
Percentage of positive posts	49.60%	47.97%

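Table 4. “Migration” sample (T = 0.35, D = 3′, S = 100 slots of equal size): Comparison of the different considered models in terms of (6).

Migration	Standard Pólya	Complete RP	Only Fashion RP	No Fashion RP	Theoretical value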
Entire	200.91%	206.28%	206.22%	200.78%	200.93%
OnlyBOT	194.52%	199.63%	199.75%	194.24%	194.74%

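Table 6. “Covid” sample (T = 0.35, D = 3′, S = 1000 slots of equal size): Comparison of the different considered models in terms of (6).

Covid	Standard Pólya	Complete RP	Only Fashion RP	No Fashion RP	Theoretical value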
Entire	199.97%	203.15%	203.15%	199.96%	200.02%
OnlyBOT	187.60%	190.43%	190.47%	187.58%	187.85%

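Table 5. “10 days of traffic” sample (T = 0.35, D = 30′′, S = 100 slots of equal size): Comparison of the different considered models in terms of (6).

10 days of traffic	Standard Pólya	Complete RP	Only Fashion RP	No Fashion RP	Theoretical value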
Entire	160.75%	160.88%	160.88%	160.74%	160.75%
OnlyBOT	159.57%	159.70%	159.62%	159.57%	159.58%

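Table 7. “Migration” (T = 0.35, entire, D = 3′, S = 100 slots of equal size): MSE for different levels of smoothing.

smoothing	Only Fashion RP	Complete RP	Standard Pólya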
no smooth	2.44 × 10⁻¹	2.43 × 10⁻¹	2.50 × 10⁻¹
k = 3	3.44 × 10⁻⁹	1.41 × 10⁻⁶	3.03 × 10⁻⁴
k = 5	1.19 × 10⁻⁸	3.23 × 10⁻⁶	3.43 × 10⁻⁴
k = 10	2.64 × 10⁻⁷	1.74 × 10⁻⁵	1.64 × 10⁻³
k = 20	1.04 × 10⁻⁶	2.98 × 10⁻⁵	2.73 × 10⁻³
k = 30	2.79 × 10⁻⁶	4.03 × 10⁻⁵	3.83 × 10⁻³
k = 50	7.18 × 10⁻⁶	5.41 × 10⁻⁵	4.85 × 10⁻³

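Table 12. “Covid” (T = 0.35, only BOTs’ posts, D = 3′, S = 1000 slots of equal size): MSE for different levels of smoothing.

smoothing	Only Fashion RP	Complete RP	Standard Pólya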
no smooth	2.45 × 10⁻¹	2.45 × 10⁻¹	2.49 × 10⁻¹
k = 3	3.23 × 10⁻⁶	5.33 × 10⁻⁵	3.38 × 10⁻³
k = 5	1.16 × 10⁻⁵	5.16 × 10⁻⁵	3.38 × 10⁻³
k = 10	2.84 × 10⁻⁵	6.88 × 10⁻⁵	3.53 × 10⁻³
k = 20	5.70 × 10⁻⁵	9.78 × 10⁻⁵	3.80 × 10⁻³
k = 30	1.67 × 10⁻⁴	1.81 × 10⁻⁴	4.01 × 10⁻³
k = 50	3.05 × 10⁻⁴	2.94 × 10⁻⁴	4.38 × 10⁻³

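Table 8. “Migration” (T = 0.35, only BOTs’ posts, D = 3′, S = 100 slots of equal size): MSE for different levels of smoothing.

smoothing	Only Fashion RP	Complete RP	Standard Pólya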
no smooth	2.43 × 10⁻¹	2.41 × 10⁻¹	2.50 × 10⁻¹
k = 3	2.62 × 10⁻⁶	3.56 × 10⁻⁵	7.66 × 10⁻⁴
k = 5	4.19 × 10⁻⁴	1.90 × 10⁻⁴	1.10 × 10⁻³
k = 10	1.03 × 10⁻⁴	3.50 × 10⁻⁴	3.36 × 10⁻³
k = 20	5.36 × 10⁻⁴	8.19 × 10⁻⁴	6.58 × 10⁻³
k = 30	9.09 × 10⁻⁴	1.22 × 10⁻³	9.13 × 10⁻³
k = 50	2.53 × 10⁻³	2.36 × 10⁻³	1.34 × 10⁻²

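Table 9. “10 days of traffic” (T = 0.35, entire, D = 30′′, S = 100 slots of equal size): MSE for different levels of smoothing.

smoothing	Only Fashion RP	Complete RP	Standard Pólya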
no smooth	2.32 × 10⁻¹	2.32 × 10⁻¹	2.33 × 10⁻¹
k = 3	3.15 × 10⁻⁹	2.61 × 10⁻⁷	1.22 × 10⁻⁵
k = 5	3.86 × 10⁻⁹	8.09 × 10⁻⁷	3.34 × 10⁻⁵
k = 10	1.94 × 10⁻⁸	2.02 × 10⁻⁶	6.88 × 10⁻⁵
k = 20	7.81 × 10⁻⁸	2.65 × 10⁻⁶	8.80 × 10⁻⁵
k = 30	1.74 × 10⁻⁷	2.86 × 10⁻⁶	9.65 × 10⁻⁵
k = 50	1.08 × 10⁻⁶	5.15 × 10⁻⁶	1.53 × 10⁻⁴

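Table 10. “10 days of traffic” (T = 0.35, only BOTs’ posts, D = 30′′, S = 100 slots of equal size): MSE for different levels of smoothing.

smoothing	Only Fashion RP	Complete RP	Standard Pólya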
no smooth	2.31 × 10⁻¹	2.31 × 10⁻¹	2.31 × 10⁻¹
k = 3	4.10 × 10⁻⁷	6.67 × 10⁻⁷	5.73 × 10⁻⁶
k = 5	7.95 × 10⁻⁷	2.02 × 10⁻⁵	5.97 × 10⁻⁵
k = 10	6.81 × 10⁻⁶	2.35 × 10⁻⁵	7.19 × 10⁻⁵
k = 20	1.59 × 10⁻⁵	5.43 × 10⁻⁵	1.52 × 10⁻⁴
k = 30	2.59 × 10⁻⁵	5.98 × 10⁻⁵	1.75 × 10⁻⁴
k = 50	9.80 × 10⁻⁵	1.23 × 10⁻⁴	3.49 × 10⁻⁴

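Table 11. “Covid” (T = 0.35, entire, D = 3′, S = 1000 slots of equal size): MSE for different levels of smoothing.

smoothing	Only Fashion RP	Complete RP	Standard Pólya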
no smooth	2.46 × 10⁻¹	2.46 × 10⁻¹	2.50 × 10⁻¹
k = 3	3.98 × 10⁻⁸	7.37 × 10⁻⁶	2.58 × 10⁻³
k = 5	5.51 × 10⁻⁸	7.53 × 10⁻⁶	2.64 × 10⁻³
k = 10	1.54 × 10⁻⁷	8.63 × 10⁻⁶	2.92 × 10⁻³
k = 20	7.93 × 10⁻⁷	9.37 × 10⁻⁶	3.10 × 10⁻³
k = 30	1.06 × 10⁻⁶	9.80 × 10⁻⁶	3.24 × 10⁻³
k = 50	2.06 × 10⁻⁶	1.10 × 10⁻⁵	3.46 × 10⁻³

Table 13. Comparison of the different considered models in terms of (6).

Sample	Standard Pólya	Complete RP	Only Fashion RP	No Fashion RP	Theoretical value
Migration (T = 0, D = 3’, S = 100)	202.23%	202.25%	202.22%	202.22%	202.23%
Migration (T = 0.5, D = 3’, S = 100)	194.48%	194.51%	194.48%	194.48%	194.48%
Migration (T = 0.35, D = 3’, slots = days)	198.84%	203.11%	203.46%	198.02%	198.86%
Migration (T = 0.5, D = 3’, slots = days)	192.67%	198.40%	198.78%	191.10%	192.70%
10 days traffic (T = 0, D = 30”, S = 100)	165.68%	165.79%	165.79%	165.68%	165.68%
10 days traffic (T = 0.5, D = 30”, S = 100)	160.18%	160.32%	160.32%	160.17%	160.18%
10 days traffic (T = 0.35, D = 30”, slots = days)	158.10%	158.23%	158.23%	158.08%	158.10%
10 days traffic (T = 0.5, D = 30”, slots = days)	157.50%	157.64%	157.64%	157.48%	157.50%
Covid (T = 0, D = 3’, S = 100)	201.60%	204.33%	204.33%	201.51%	201.70%
Covid (T = 0.5, D = 3’, S = 100)	199.04%	202.24%	202.25%	198.91%	199.10%
Covid (T = 0.35, D = 3’, slots = days)	198.01%	201.08%	201.11%	197.82%	198.01%
Covid (T = 0.5, D = 3’, slots = days)	197.10%	200.20%	200.23%	196.85%	197.10%

smoothing	Only Fashion RP	Complete RP	Standard Pólya
no smooth	2.42 × 10⁻¹	2.42 × 10⁻¹	2.50 × 10⁻¹
k = 3	2.66 × 10⁻⁷	2.29 × 10⁻⁶	3.56 × 10⁻⁴
k = 5	6.47 × 10⁻⁷	2.49 × 10⁻⁶	3.75 × 10⁻⁴
k = 10	2.75 × 10⁻⁶	3.75 × 10⁻⁵	2.79 × 10⁻³
k = 20	6.37 × 10⁻⁶	5.80 × 10⁻⁵	3.92 × 10⁻³
k = 30	1.66 × 10⁻⁵	8.32 × 10⁻⁵	5.33 × 10⁻³
k = 50	3.15 × 10⁻⁵	1.09 × 10⁻⁴	6.54 × 10⁻³

smoothing	Only Fashion RP	Complete RP	Standard Pólya
no smooth	2.37 × 10⁻¹	2.37 × 10⁻¹	2.37 × 10⁻¹
k = 3	6.56 × 10⁻⁹	5.77 × 10⁻⁹	3.68 × 10⁻⁵
k = 5	7.12 × 10⁻⁹	5.97 × 10⁻⁹	5.20 × 10⁻⁵
k = 10	1.86 × 10⁻⁸	1.62 × 10⁻⁸	6.71 × 10⁻⁵
k = 20	7.66 × 10⁻⁸	6.46 × 10⁻⁸	7.92 × 10⁻⁵
k = 30	2.69 × 10⁻⁷	2.44 × 10⁻⁷	8.49 × 10⁻⁵
k = 50	1.23 × 10⁻⁶	1.02 × 10⁻⁶	1.25 × 10⁻⁴

smoothing	Only Fashion RP	Complete RP	Standard Pólya
no smooth	2.47 × 10⁻¹	2.47 × 10⁻¹	2.50 × 10⁻¹
k = 3	3.20 × 10⁻⁸	4.23 × 10⁻⁶	2.37 × 10⁻³
k = 5	4.57 × 10⁻⁸	4.32 × 10⁻⁶	2.40 × 10⁻³
k = 10	2.09 × 10⁻⁷	5.23 × 10⁻⁶	2.65 × 10⁻³
k = 20	7.14 × 10⁻⁷	5.56 × 10⁻⁶	2.72 × 10⁻³
k = 30	1.04 × 10⁻⁶	6.02 × 10⁻⁶	2.89 × 10⁻³
k = 50	1.70 × 10⁻⁶	6.91 × 10⁻⁶	3.06 × 10⁻³

Abstract

1 Introduction

2 Model

2.1 Standard Pólya urn

2.2 Rescaled Pólya (RP) urn

Fig 1. “Migration” (T = 0.35, entire, D = 3′, S = 100 slots of equal size): Sentiment curves.

Fig 6. “Covid” (T = 0.35, only BOTs’ posts, D = 3′, S = 1000 slots of equal size): Sentiment curves for BOTs’ posts.

Fig 2. “Migration” (T = 0.35, only BOTs’ posts, D = 3′, S = 100 slots of equal size): Sentiment curves for BOTs’ posts.

Fig 3. “10 days of traffic” (T = 0.35, entire, D = 30′′, S = 100 slots of equal size): Sentiment curves.

Fig 4. “10 days of traffic” (T = 0.35, only BOTs’ posts, D = 30′′, S = 100 slots of equal size): Sentiment curves for BOTs’ posts.

Fig 5. “Covid” (T = 0.35, entire, D = 3′, S = 1000 slots of equal size): Sentiment curves.

2.2.1 Two colors (c = 2)

2.3 Model with delay

3 Results

3.1 Data

Table 1. “Migration” sample: Descriptives of the sample obtained with T = 0.35.

Table 3. “Covid” sample: Descriptives of the sample obtained with T = 0.35.

Table 2. “10 days of traffic” sample: Descriptives of the sample obtained with T = 0.35.

3.2 Analysis of the prediction ability

Table 4. “Migration” sample (T = 0.35, D = 3′, S = 100 slots of equal size): Comparison of the different considered models in terms of (6).

Table 6. “Covid” sample (T = 0.35, D = 3′, S = 1000 slots of equal size): Comparison of the different considered models in terms of (6).

Table 5. “10 days of traffic” sample (T = 0.35, D = 30′′, S = 100 slots of equal size): Comparison of the different considered models in terms of (6).

3.3 Fluctuations of the sentiment curve

Table 7. “Migration” (T = 0.35, entire, D = 3′, S = 100 slots of equal size): MSE for different levels of smoothing.

Table 12. “Covid” (T = 0.35, only BOTs’ posts, D = 3′, S = 1000 slots of equal size): MSE for different levels of smoothing.

Table 8. “Migration” (T = 0.35, only BOTs’ posts, D = 3′, S = 100 slots of equal size): MSE for different levels of smoothing.

Table 9. “10 days of traffic” (T = 0.35, entire, D = 30′′, S = 100 slots of equal size): MSE for different levels of smoothing.

Table 10. “10 days of traffic” (T = 0.35, only BOTs’ posts, D = 30′′, S = 100 slots of equal size): MSE for different levels of smoothing.

Table 11. “Covid” (T = 0.35, entire, D = 3′, S = 1000 slots of equal size): MSE for different levels of smoothing.

4 Discussion and conclusions

Appendix

A Parameters evolution

Fig 7. “Migration” (T = 0.35, entire, D = 3′): Model parameters evolution with S = 100 slots of equal size (i.e. 3673 observations).

Fig 8. “Migration” (T = 0.35, only BOTs’ posts, D = 3′): Model parameters evolution with S = 100 slots of equal size (i.e. 41 observations).

Fig 9. “10 days of traffic” (T = 0.35, entire, D = 30′′): Model parameters evolution with S = 100 slots of equal size (i.e. 31646 observations).

Fig 10. “10 days of traffic” (T = 0.35, only BOTs’ posts, D = 30′′): Model parameters evolution with S = 100 slots of equal size (i.e. 1023 observations).

Fig 11. “Covid” (T = 0.35, entire, D = 3′): Model parameters evolution with S = 1000 slots of equal size (i.e. 2037 observations).

Fig 12. “Covid” (T = 0.35, only BOTs’ posts, D = 3′): Model parameters evolution with S = 1000 slots of equal size (i.e. 48 observations).

B Additional analyses

Table 13. Comparison of the different considered models in terms of (6).

Table 14. “Migration” (T = 0.5, entire, D = 3′, S = 100 slots of equal size): MSE for different levels of smoothing.

Table 25. “Covid” (T = 0.5, entire, D = 3′, slots = days): MSE for different levels of smoothing.

Fig 13. “Migration” (T = 0.5, entire, D = 3′, S = 100 slots of equal size): Sentiment curves.

Fig 24. “Covid” (T = 0.5, entire, D = 3′, slots = days): Sentiment curves.

Fig 14. “Migration” (T = 0, entire, D = 3′, S = 100 slots of equal size): Sentiment curves.

Fig 15. “Migration” (T = 0.35, entire, D = 3′, slots = days): Sentiment curves.

Fig 16. “Migration” (T = 0.5, entire, D = 3′, slots = days): Sentiment curves.

Fig 17. “10 days of traffic” (T = 0.5, entire, D = 30′′, S = 100 slots of equal size): Sentiment curves.

Fig 18. “10 days of traffic” (T = 0, entire, D = 30′′, S = 100 slots of equal size): Sentiment curves.

Fig 19. “10 days of traffic” (T = 0.35, entire, D = 30′′, slots = days): Sentiment curves.

Fig 20. “10 days of traffic” (T = 0.5, entire, D = 30′′, slots = days): Sentiment curves.

Fig 21. “Covid” (T = 0.5, entire, D = 3′, S = 100 slots of equal size): Sentiment curves.

Fig 22. “Covid” (T = 0, entire, D = 3′, S = 100 slots of equal size): Sentiment curves.

Fig 23. “Covid” (T = 0.35, entire, D = 3′, slots = days): Sentiment curves.

Table 15. “Migration” (T = 0, entire, D = 3′, S = 100 slots of equal size): MSE for different levels of smoothing.

Table 16. “Migration” (T = 0.35, entire, D = 3′, slots = days): MSE for different levels of smoothing.

Table 17. “Migration” (T = 0.5, entire, D = 3′, slots = days): MSE for different levels of smoothing.

Table 18. “10 days of traffic” (T = 0.5, entire, D = 30′′, S = 100 slots of equal size): MSE for different levels of smoothing.

Table 19. “10 days of traffic” (T = 0, entire, D = 30′′, S = 100 slots of equal size): MSE for different levels of smoothing.

Table 20. “10 days of traffic” (T = 0.35, entire, D = 30′′, slots = days): MSE for different levels of smoothing.

Table 21. “10 days of traffic” (T = 0.5, entire, D = 30′′, slots = days): MSE for different levels of smoothing.

Table 22. “Covid” (T = 0.5, entire, D = 3′, S = 100 slots of equal size): MSE for different levels of smoothing.

Table 23. “Covid” (T = 0, entire, D = 3′, S = 100 slots of equal size): MSE for different levels of smoothing.

Table 24. “Covid” (T = 0.35, entire, D = 3′, slots = days): MSE for different levels of smoothing.

Acknowledgments

Data Availability

Funding Statement

References

Associated Data

Data Availability Statement
