PLOS One. 2025 Jun 26;20(6):e0324697. doi: 10.1371/journal.pone.0324697

Social context in political stance detection: Impact and extrapolation

Ramon Villa-Cox 1,2,*, Evan M Williams 2, Kathleen M Carley 2
Editor: Carlos Henrique Gomes Ferreira 3
PMCID: PMC12201627  PMID: 40570076

Abstract

Stance detection is an important task with a wide range of high-impact social applications, including opinion polling and detecting propaganda, misinformation, and hate speech. In this work, we explore the performance and extrapolation power of political stance-detection models using an existing large-scale weakly-labeled Twitter dataset collected around the 2019 South American Protests. We construct transformer-based user and tweet encoders to embed users in a low-dimensional space using their posts and interactions. We then train heterogeneous graph attention networks to predict user stances and contrast their ability to extrapolate stance predictions to different country contexts and to future events. We find that leveraging users’ ego-network in political stance detection improves in-country model performance for every country we examine. More notably, we find that leveraging a user’s social context greatly enhances the ability of our stance detection models to extrapolate to new country contexts and future data.

1 Introduction

Public opinion toward governments and their policies has widespread social implications. Public support is often viewed as a kind of currency in political science: more public support can allow governments more freedom in their actions, whereas widespread public disapproval can constrain the set of actions that a government can take. Studies have found public opinion to be a strong predictor of policy [1,2], and information about public opinion can impact the decisions and planning of politicians, businesses, foreign governments, and people. It is therefore unsurprising that global election and opinion polling has grown into a 6.78 billion dollar industry [3]. However, opinion polling often lags behind real events, and in unstable political situations, this can engender widespread uncertainty. Surveys, especially in dynamic or unstable political situations, are constricted by resources, and the diversity of participants is confined by the survey method. To mitigate these constraints, researchers have explored automated opinion mining, a stance detection sub-task, in social media settings.

In some cases, large-scale social media data has served as an alternative way to gauge public opinion [4]. However, it has long been recognized in the stance detection field that considering only an isolated sentence will provide an incomplete assessment of a user’s stance towards a predefined target [5]. Despite this, little research has been done to integrate the different levels of contextual user information available in social media data into stance detection models. More importantly, there is a need to evaluate the impact that this additional context can have on final classification performance [6].

In this work, we explore the in-sample performance and generalizability of country-level models trained at the various levels of contextual information available as users interact through social media. We focus on three context levels that convey increasingly more information about a given user, namely, the document/post level, the user level, and the social network level. For this task, we use a weakly-labeled large-scale Twitter (now X) dataset that encompasses the widespread 2019 protests that erupted in Ecuador, Chile, Bolivia, and Colombia [7]. The regional nature of the dataset presents the opportunity to test the generalization capabilities of our proposed models on out-of-sample country data at different levels of contextual information. The different cultural and ideological motivations for the protests provide relevant roadblocks to test the robustness of the proposed estimators. However, given that these protests occurred concurrently, the dataset does not facilitate the testing of robustness over time. This is an outstanding problem in social media settings, as relevant textual features can change significantly in a short period. To address this, we constructed an additional weakly-labeled dataset around the 2020 Chilean Referendum. In summary, the contributions of this paper are the following:

  • We propose a compartmentalized stance-detection architecture using transformers and graph neural networks that leverage a user’s ego-network, a user’s tweet timelines, and weakly-labeled protest-related tweets to predict their stance towards the government of each country. This architecture, albeit computationally expensive, allows us to evaluate the effect, both on in-sample performance and generalizability, of making more contextual information available to a classifier.

  • Through ablation studies, we examine the cross-country performance of each stance model in a one-shot prediction setting. We found that increasing the contextual information available to a model (as defined by each compartment of the architecture) not only improved the performance of a classifier within the country it was trained but also made it more robust to out-of-sample predictions.

  • We collect a novel weakly-labeled Chilean referendum dataset to explore the ability of each Chilean model, trained at different context levels, to generalize to future data. As before, we found that increasing context improves the model’s generalization for future posts of users seen during the Chilean protests and for users exclusively seen in the referendum collection. We find that including a user’s heterogeneous ego-network information, which differentiates between replies and retweets, yields much larger improvements both in related country-context assessments and in robustness over time.

  • We make publicly available a variant of a BERT language model [8], which we call twBETO, developed for this work and trained on a substantial corpus of 150 million Spanish tweets. This model is not only trained on more tweets than other Spanish BERT models for Twitter [9] but also has better coverage of non-European Spanish dialects.

To the best of our knowledge, the evaluation of the effects of context on the generalization capabilities of these models has not been explored in other work and has been identified as an outstanding issue in the task of stance classification [6]. Moreover, even though the subject of interest of our study is political stances in South American politics during charged political events, the method proposed can nevertheless be used in understanding social change and political movements in different cultures and continents. Finally, while most user-level stance detection applications are generally benign or beneficial, e.g., targeted advertising, they certainly have the potential to be abused by malign governments or corporations in targeting vulnerable or dissident groups. Our findings underscore the importance of anonymity and privacy on social media platforms, and remaining cognizant of one’s digital footprint.

2 Related work

Stance detection is an essential component of many tasks associated with online social network moderation and analysis, including opinion polling and the detection of propaganda, misinformation, and hate speech. In some domains, sentiment analysis may be a reasonable approximation of stance, but it has been shown that, on Twitter, sentiment polarity is a poor proxy, particularly around contentious political issues [10]. Posts with a positive sentiment score can be used to oppose a topic and vice versa. As such, the overall sentiment of the text is not necessarily relevant to the classification [11]. Work in this area has concentrated on exploring stance in conversations (also known as rumor stance classification [12]) and on debates with respect to a predefined topic or target (known as target-stance classification). However, progress in this area has been limited by its reliance on small datasets, primarily created around challenge competitions like SemEval-2016 [10] through crowd-sourced human annotation. The work undertaken in this paper is an instance of target-stance classification. In what remains of this section we provide a brief overview of the main research that has been applied in the relevant areas. For a more detailed description of these areas and the main datasets available, we refer the reader to other comprehensive reviews of the state of the art in the subject [6].

Target Stance Classification

Target Stance Classification focuses on classifying the stance of a user or document for a predefined topic or target. Task 6 of SemEval-2016 is a common benchmark for target stance classification. Authors have achieved state-of-the-art results on this benchmark using architectures ranging from end-to-end neural ensemble models [13] to hand-crafted feature-based classifiers [14]. The latter rely on domain knowledge to determine the most informative features for a given problem. Due to the limited amount of data available, algorithms that rely on hand-crafted features are still prominent and achieve performance competitive with deep learning algorithms. Additionally, the vast majority of stance-detection literature focuses on classifying documents. Considering users adds complexity to the task, as users can post many unrelated documents or express contradictory or nuanced stances towards targets that may not be captured in a single post. Even though there exists work that has explored user-level stance detection on Twitter data [15], the proposed approach requires pairwise similarity calculations for all users of interest, which limits its scalability to large social media networks like those we use in this work. To make the approach tractable, the authors kept only the 5k most active users, so it cannot readily scale to millions of users.

GNNs for Stance Detection

Graph Neural Networks (GNNs) have become increasingly popular in recent years, but to our knowledge, have only recently been used for stance detection at the user level. GNNs have been used in tasks related to stance in conversations at a document level, including several papers that explore GNNs for fake news classification and rumor detection. For example, a gated graph neural network model, PCGNN [16], was proposed for rumor detection on Tweet threads, achieving state-of-the-art results on the PHEME dataset [17]. Reference [18] proposes a modified label propagation algorithm that is benchmarked against graph neural network architectures for user-level stance prediction. Reference [19] proposes an unsupervised BERT training approach that leverages user retweet interactions and fine-tunes the resulting model for user political stance detection. Most works focus on text-level classification. While this is effective for fine-grained stance detection, no strong inference can be made about the alignment or intentions of the user who posted the false or misleading content. To combat misinformation in an environment with state-backed propaganda campaigns, troll farms, and bot networks, stance detection at the user level is essential for moderation at scale.

Generalizing Stance classification

Several recent efforts have been made to study the generalization capabilities of stance classifiers to different domains, contexts, or targets. One vein of this research, known as cross-target stance detection, focuses on applying a target stance classifier trained on one domain to a different, but related, domain. Work in this area has explored how to leverage labeled instances from related political topics to increase generalizability across targets [20,21], or how effectively stance predictions transfer across languages [22]. Similarly, other work has explored how synthetic data, generated from related targets, can improve cross-target stance detection on Tweets [23]. However, in contrast to our work, research in this field improves the generalizability of classifiers across different targets by leveraging their semantic relationship at a document level. For this reason, each of these works explores stance at the tweet/document level, rather than at a user level. More recent work [24] proposes a methodology for multi-modal cross-target stance detection on 3,871 Twitter users. However, their model uses followership and friendship networks for each user, which are often not feasible to extract on large networks given API limitations. Importantly, our work differs from these approaches as, for a fixed target, we explore how varying the amount of contextual information for a given user (an individual post, all of a user’s posts, or their local ego-network) improves within-sample and out-of-sample predictions.

A similar vein of research, known as context-sensitive classification, models context in the form of spatial and temporal locality and is crucial to many tasks, including search engines, and context-aware Web applications [25]. This is highly relevant for stance detection in online social media where users engage with one another in a variety of ways, and thus focusing on an isolated sentence will likely result in an incomplete assessment of a user’s stance, as they can employ humor, satire, ask a question, or talk about something else entirely [5]. Although some works have implicitly demonstrated the benefits of this approach by incorporating different levels of context in their architecture [26,27], there is a need to systematically quantify the resulting gains in classifier performance as different levels of contextual information become available to the model. This line of questioning has been identified as an outstanding issue in the task of stance classification [6], and is something this work attempts to explore.

3 Data description

We ground our analysis in the wave of protests that effectively paralyzed the South American region at the end of 2019. The protests in Ecuador, Chile, and Colombia were led by left-wing movements that sought to resist austerity measures, whereas the Bolivian protests were a right-wing response to an alleged electoral fraud undertaken in favor of the president, who was seeking reelection. We use a dataset that contains the stances of hundreds of thousands of Twitter users towards their respective governments during the event [7]. The labels were constructed using a weak-labeling methodology that leveraged the users’ endorsement of hand-labeled political figures as well as their usage of stance-tags (hashtag campaigns that have well-defined stances towards each government and that occur at the end of a tweet). The methodology relies on the hypothesis that users are more likely to tweet (or retweet) stance-tags or political figures that are aligned with their stances during these events. For this reason, stance labels are assigned to a user if the percentage of tweets with a consistent stance-tag or retweets from political figures with a consistent stance is above a given threshold (the authors use 90% as a threshold based on a labeled validation set). A final stance label is assigned to a user if the stance obtained by both signals (usage of hashtags and retweet of political figures) is consistent. In Table 1 we provide some examples of the most prominent stance-tags observed for each country collection.

Table 1. Example of the most prominent stance-tags used for each country. Pro-government hashtags tend to support the country’s president (Evo, Piñera, Lenin), the police (ESMAD or Carabineros) or attack the opposition leader (Petro, Correa, CONAIE). Against-government hashtags label the country’s president as a dictator, assassin or traitor, or support the protest (“paro”).

| Country | Pro Government | Against Government |
|---|---|---|
| Bolivia | #AbajoElGolpe #ElMundoConEvo #EvoNoEstasSolo #GolpeDeEstadoEnBolivia | #AutoGolpeDeEstado #EvoDictador #CarlosMesaPresidente #BoliviaNoHayGolpe |
| Chile | #YoApoyoACarabineros #FuerzaPiñera #FueraComunismoDeChile #DejenTrabajar | #CarabinerosAsesinos #FueraPiñera #AsambleaConstituyenteoNADA #ChileEnDictadura |
| Colombia | #ApoyoAlESMAD #ApoyoAlPresidente #NoMasPetro #ColombiaNoPara | #ESMADAsesino #DuqueAsesino #VivaElParoNacional #ChaoDuque |
| Ecuador | #LeninNoCedas #CONAIEterroristas #FueraComunistas #NoAlGolpeCorreista | #DictaduraEnEcuador #LeninChao #ElParoNoPara #LeninMorenoTraidor |
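The weak-labeling rule described above can be illustrated with a minimal sketch. It is not the implementation used in [7]: the function names are hypothetical, the inclusive comparison against the 90% threshold is an assumption, and so is the choice to leave a user unlabeled when either signal is missing.

```python
from collections import Counter
from typing import Optional, Sequence

THRESHOLD = 0.90  # share of consistent signals required (value reported in the text)


def signal_stance(stances: Sequence[str], threshold: float = THRESHOLD) -> Optional[str]:
    """Return the dominant stance of one signal if its share reaches the threshold."""
    if not stances:
        return None
    label, count = Counter(stances).most_common(1)[0]
    # Assumption: the comparison is inclusive; the source only says "above a threshold".
    return label if count / len(stances) >= threshold else None


def user_stance(tag_stances: Sequence[str],
                retweet_stances: Sequence[str],
                threshold: float = THRESHOLD) -> Optional[str]:
    """Label a user only when the stance-tag and retweet signals agree."""
    from_tags = signal_stance(tag_stances, threshold)
    from_retweets = signal_stance(retweet_stances, threshold)
    # Assumption: a user with a missing or inconsistent signal stays unlabeled.
    if from_tags is not None and from_tags == from_retweets:
        return from_tags
    return None
```

Here `tag_stances` holds the stance of each stance-tagged tweet of a user and `retweet_stances` the stance of each retweeted political figure.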

The authors validate the quality of the weakly-annotated labels based on a hand-labeled sample of users and by showing that the constructed labels partition the users in communities that are polarized in their language and news-sharing behavior in a way consistent with the ideological underpinnings of each protest. By leveraging an established machine translation algorithm [28], they show that terms related to left-leaning ideologies in one community tend to be discussed in similar contexts as right-leaning terms (e.g., Socialism mistranslates to Fascism); terms related to law and order in one group are discussed in a similar context as the other discusses oppression, or that opposition leaders are discussed in similar contexts as government representatives. However, they show that, while some ideological consistencies are maintained when comparing protest movements from different countries, differences in dialects start to dominate the polarization measures. For more details on the polarization dimensions identified in the data, refer to [7]. These semantic regularities suggest that text-based features should be informative for political stance classification, but this information should degrade when applied to different country contexts. We explore this formally in this work.

The protest dataset, which was collected between September 25 and December 24 of 2019, contains 550k labeled users split unevenly across the four countries and contains over 36 million labeled tweets. It contains an additional 1.1 million unlabeled neighbors and 40 million unlabeled tweets. Stances are imbalanced in 3 of the 4 countries. In Table 2, we present the distribution of the different users in each country and the number of valid Spanish tweets available. Valid tweets included Tweets in Spanish, as determined by Twitter’s API, with more than 5 tokens after the pre-processing step. This included the removal of all trailing hashtags, as the weak-labels were assigned in part by leveraging the usage of labeled hashtags at the end of a tweet.
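As an illustration of the valid-tweet filter, the following sketch removes trailing hashtags and applies the token-count cutoff. The regular expression, the whitespace tokenization, and the `lang == "es"` check are assumptions made for the example; the original pipeline relies on the language field returned by Twitter’s API and on its own pre-processing step.

```python
import re

# One or more hashtags (with surrounding whitespace) at the very end of a tweet.
TRAILING_HASHTAGS = re.compile(r"(?:\s*#\w+)+\s*$")
MIN_TOKENS = 5  # a tweet is kept only if it has more than this many tokens


def strip_trailing_hashtags(text: str) -> str:
    """Drop hashtags at the end of a tweet; hashtags inside the text are kept."""
    return TRAILING_HASHTAGS.sub("", text).rstrip()


def is_valid_tweet(text: str, lang: str) -> bool:
    """Spanish tweet with more than MIN_TOKENS whitespace tokens after cleaning."""
    cleaned = strip_trailing_hashtags(text)
    return lang == "es" and len(cleaned.split()) > MIN_TOKENS
```

Removing the trailing hashtags before training matters because the weak labels were derived in part from them, so leaving them in the text would leak the label to the classifier.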

Table 2. Distribution of labeled users, their first-order neighbors (based on the response network) and their tweets (including retweets) for each of the countries studied and the 2020 Chilean Constitutional Referendum.

| Country | Users: Against | Users: Pro | Users: Neighbors | Tweets: Against | Tweets: Pro | Tweets: Neighbors |
|---|---|---|---|---|---|---|
| Bolivia | 58,508 | 54,347 | 292,684 | 3,508,300 | 3,583,943 | 12,507,155 |
| Chile | 220,391 | 33,331 | 409,014 | 14,659,535 | 2,775,496 | 14,211,964 |
| Colombia | 79,874 | 28,322 | 257,912 | 5,328,651 | 1,983,161 | 7,501,917 |
| Ecuador | 51,466 | 25,567 | 170,780 | 3,352,028 | 1,336,546 | 5,963,914 |
| Chilean Referendum | 10,423 | 11,206 | 96,202 | 6,454,647 | 6,060,679 | 1,288,351 |

For training purposes, we constructed training sets for each country by sampling 80% of the users (with their corresponding tweets), using 10% for validation and a 10% held-out test set. Splitting at the user level guarantees that there are no duplicate users between the different splits. Moreover, even though some retweets posted by users might target tweets in different splits, we ensured the corresponding validation and test nodes were masked during training. This is consistent with a transductive learning setting, which is common practice when training GNNs [29].
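A user-level 80/10/10 split of this kind can be sketched as follows. This is a minimal example under stated assumptions: sampling is uniform at random with a fixed seed (the text does not say whether the split was stratified by stance), and `split_users` and `split_tweets` are hypothetical helper names.

```python
import random


def split_users(user_ids, train=0.8, valid=0.1, seed=0):
    """Partition unique users into disjoint train/valid/test sets."""
    users = sorted(set(user_ids))
    random.Random(seed).shuffle(users)
    n_train = int(len(users) * train)
    n_valid = int(len(users) * valid)
    return {
        "train": set(users[:n_train]),
        "valid": set(users[n_train:n_train + n_valid]),
        "test": set(users[n_train + n_valid:]),
    }


def split_tweets(tweets, splits):
    """Route (user_id, text) pairs to the split that owns their author."""
    by_split = {name: [] for name in splits}
    for user_id, text in tweets:
        for name, members in splits.items():
            if user_id in members:
                by_split[name].append((user_id, text))
                break
    return by_split
```

Because tweets follow their author, no user can contribute text to more than one split.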

It is important to note that given the temporal proximity of the protests, and their corresponding collection dates, there is a non-trivial percentage of users that were active during more than one event. In Table 3, we present the percentage of users collected for each country (column) that were also contained in the training set of another country (row). The largest overlaps occur when comparing the Chilean collection with the other countries, as users involved in this discussion represent more than 20% of the user base of other countries. This is expected considering that it involved at least twice as many users as the other collections. For all other country pairs, the overlap varies between 8 and 17 percent. However, this percentage diminishes considerably when considering tweets that formed part of multiple collections. As we see in Table 3, the overlap with the Chilean collection drops an average of 13 percentage points, while for all other country pairs their similarity is now below 7 percent (Ecuador and Colombia share no tweets, which is consistent with their collection dates). To avoid bias in our results, given that one of the objectives of this study is to evaluate the extrapolation capabilities of models trained at different levels of contextual user information, we mask users seen in multiple collections when performing cross-country experiments.

Table 3. Percentage of users, and their respective tweets, in the protest data of country j (column) that were also present in the training set of a country i (row).

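| Training set (row) | Bolivia: Users | Bolivia: Tweets | Chile: Users | Chile: Tweets | Colombia: Users | Colombia: Tweets | Ecuador: Users | Ecuador: Tweets | Chilean Referendum: Users | Chilean Referendum: Tweets |
|---|---|---|---|---|---|---|---|---|---|---|
| Bolivia | - | - | 11.45 | 6.33 | 13.54 | 4.47 | 23.99 | 5.67 | - | - |
| Chile | 25.79 | 18.74 | - | - | 20.2 | 8.26 | 30.05 | 13.36 | 43.90 | 0.00 |
| Colombia | 13.10 | 4.58 | 8.60 | 2.85 | - | - | 17.00 | 0.00 | - | - |
| Ecuador | 16.44 | 2.92 | 9.10 | 2.30 | 12.08 | 0.00 | - | - | - | - |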
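The overlap percentages reported in Table 3 and the masking applied in cross-country experiments can be sketched with plain set operations. The function names are hypothetical and the example is only meant to make the two definitions concrete.

```python
def overlap_percentage(eval_users, train_users):
    """Percentage of country j's users that also appear in country i's training set."""
    eval_users = set(eval_users)
    if not eval_users:
        return 0.0
    return 100.0 * len(eval_users & set(train_users)) / len(eval_users)


def mask_seen_users(eval_users, train_users):
    """Keep only evaluation users that the training country never saw."""
    return set(eval_users) - set(train_users)
```

For instance, with evaluation users {1, 2, 3, 4} and training users {3, 4, 5}, the overlap is 50% and only users 1 and 2 would be scored in the cross-country evaluation.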

Although the regional nature of the dataset allows us to test the robustness of our models at different levels of contextual user information used, it does not allow us to effectively test this robustness over time. To address this issue, we constructed an additional weakly-labeled dataset around the 2020 Chilean Referendum.

3.1 The 2020 Chilean Plebiscite

Throughout the 2019 Chilean protests, different social movements made calls in favor of drafting a new constitution. This social pressure reached a boiling point in November of that same year, which led the Chilean National Congress to agree to hold a National Plebiscite. The Plebiscite was scheduled for the start of 2020 but was delayed because of the Coronavirus pandemic. In October of 2020, it was overwhelmingly approved with 78% of the vote. Using Twitter’s v2 full-archive search endpoint feature which was available on their Academic Research Track, we collected tweets from September 25 to November 10 of 2020 (a month prior and two weeks after the plebiscite took place). This research program allowed full historical access to publicly available tweets matching complex queries. The queries matched 124 hashtags and terms relevant to the event (e.g.: #Plebiscito2020, #PlebiscitoChile, #NuevaConstitucion, etc.). We then labeled these hashtags to identify useful “Stance Tags”.

The labeling procedure was as follows: an expert in the South American region who is fluent in Spanish reviewed a sample of tweets at the end of the collection period. The annotator then established whether each hashtag was used consistently in favor of or against the approval of the new Constitution (or of the referendum process in general); if this was not possible, the hashtag was labeled “Undetermined”. This was done in separate meetings with the authors of this paper, where the reasoning for each label was openly discussed. This process resulted in 27 “Undetermined”, 64 “Against”, and 32 “Approve” stance tags. Instead of using hand-labeled Political Figures for the second stance signal, as was done for the construction of the protest data [7], we opted to identify the hashtags used in the user description that explicitly rejected or approved the referendum in the first person. This resulted in 16 “Approve” (#Apruebo, #AprueboCC, etc.) and 21 “Against” (#YoRechazo, #Rechazo, etc.) hashtags. This set is used to label users based on their user descriptions. We found that the usage of stance tags in the descriptions as a replacement signal, despite requiring less annotation effort than labeling political figures, also resulted in meaningful polarized stance partitions based on the same language metrics used to validate the protest weak-labels.

Following the previously described methodology for the construction of the South American protest data [7], tweets or descriptions were assigned a stance if they used stance tags with consistent stances. We only proceeded with users that had at least 8 tweets with a consistent stance or if at least one description was consistent. A stance was assigned to a user if at least 90% of their tweets had the same stance. For description-based stances, a user was assigned a stance if all different user descriptions observed during the period were assigned the same stance. The final user stance was determined by combining the labels obtained from both sources and validating the stance assigned to a subset of the users. The final number of user counts and their distributions are presented in Table 2. Moreover, as shown in Table 3, 43.89% of users labeled were part of the 2019 Chilean protest training set while, by construction, no tweets were part of the protest data. We also note that only 0.015% of retweets that occurred in the referendum collection reference a tweet that occurred during the protests (this is consistent with the diffusion dynamics of retweets [30]). Moreover, to improve the resolution of the networks available for the labeled users, we collected their timelines during the event. Timelines were not always collected in the original protest data, but we hypothesize that the additional timeline context for each user will improve the quality of user embeddings for the referendum data.
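The referendum labeling rules can be sketched as below. Several details are assumptions made for the example and are flagged in the comments: that the 8-tweet minimum counts stance-tagged tweets, and that the two sources are combined by requiring agreement when both are available and otherwise using whichever exists. The text only states that the labels from both sources were combined and then validated on a subset of users.

```python
from collections import Counter

MIN_TWEETS = 8    # minimum number of stance-tagged tweets (interpretation of the text)
MIN_SHARE = 0.90  # share of tweets that must carry the same stance


def tweet_stance(tweet_stances):
    """Stance from tweets: enough tagged tweets and at least 90% agreement."""
    if len(tweet_stances) < MIN_TWEETS:
        return None
    label, count = Counter(tweet_stances).most_common(1)[0]
    return label if count / len(tweet_stances) >= MIN_SHARE else None


def description_stance(description_stances):
    """Stance from descriptions: every labeled description must agree."""
    unique = set(description_stances)
    return unique.pop() if len(unique) == 1 else None


def combine(tweet_label, description_label):
    """Assumed combination rule: agree when both exist, else use the one available."""
    if tweet_label is None:
        return description_label
    if description_label is None:
        return tweet_label
    return tweet_label if tweet_label == description_label else None
```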

Following X’s January 2023 User Protection Policy update, tweet IDs related to sensitive political events cannot be publicly shared. We respect this policy, and instead share only anonymized network edges, the type of tweet the edge represents (Original, Reply, or Quote), and user weights for replication. These resources can be found at: https://doi.org/10.5281/zenodo.14207926. The code for this paper is publicly released at https://github.com/rvillaco/Protest_Stance_Detection and the weights for the trained models are available at https://doi.org/10.5281/zenodo.15571840.

4 Materials and methods

The focus of this work is to explore the effect of increasing the level of contextual user information available to a classifier in the task of political target-stance classification. In line with this goal, we build a compartmentalized architecture that allows the evaluation of component outputs at various levels of contextual information leveraged.

In Fig 1, we present the different context-based components of the architecture developed. Note that the output of each component not only serves as input for the next, but can also be used for stance prediction with the information available at its contextual level. Training and predictions for each component were done independently. In training, we minimize the cross-entropy loss using Adam with weight decay, a linear schedule with warmup, and a maximum learning rate of 2e-4 (1e-3 for the network component). We train all models for 10 epochs with early stopping based on validation loss. Due to the unbalanced nature of the dataset in most countries (e.g., Chile’s label distribution was 87–13%), we opted for a dynamic resampling approach (with replacement) that weighted tweets and users based on the inverse frequency of their corresponding label. This under-samples the majority label and over-samples the minority label, and is performed at the start of each epoch. We found that this strategy improved training stability and the validation performance of our transformer models when compared to other static over- and under-sampling strategies or to weighting the loss function by the inverse class frequency. We hypothesize that this is due to the inclusion of the $n_{max}$ hyperparameter, which limits the number of tweets considered for a user in a given batch (see Sect 4.2 for more details). In what follows, we describe the different components of our proposed architecture. As our focus is on exploring the effects of integrating contextual user information and not on training the best possible model for each country, we used only the more balanced Bolivian dataset to tune hyperparameters and select architectures.
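The dynamic resampling step can be illustrated with a short sketch that uses only the Python standard library. It assumes that one epoch draws as many items as the original training set contains, which the text does not specify, and `resample_epoch` is a hypothetical name.

```python
import random
from collections import Counter


def resample_epoch(items, labels, rng):
    """Draw len(items) samples with replacement, weighting each by 1 / freq(label)."""
    counts = Counter(labels)
    weights = [1.0 / counts[label] for label in labels]
    # Each class receives the same total weight, so the minority class is
    # over-sampled and the majority class is under-sampled in expectation.
    return rng.choices(items, weights=weights, k=len(items))
```

Calling `resample_epoch` again at the start of every epoch yields a different, roughly balanced sample each time, which is what distinguishes this approach from a static over- or under-sampled training set.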

Fig 1. Proposed compartmentalized architecture for Context-Sensitive Stance Classification.

The white arrow denotes the starting input for each component. The left figure presents the Tweet Level encoding applied to a batch of $n_{max}$ tweets from a given user producing a vector of dimensions $(n_{max}, h_{hid})$ per user. For the Tweet Embeddings to serve as input for the User Transformer (depicted in the middle panel) they are reshaped and a [CLS] parameter vector is appended at the start of each user’s tweets. In this layer, a batch of $b_{size}$ users is processed by $L_x$ encoder stacks obtaining the final User Embeddings. On the right, we show the heterogeneous GAT model, wherein a user receives information from its neighborhood $\mathcal{N}(i)$ and their corresponding User Embeddings. The attention mechanism learns different attention weights $\alpha_{i,j}$ for $j \in \mathcal{N}(i)$ which reflect the importance of a neighbor $j$ to the label of the $i$-th user.

4.1 Tweet encoder

The first component of the model creates embeddings for the different tweets produced by the users. For this purpose we pre-trained a BERT [8] language model, which we call TwBETO v0, following the robust approach introduced in RoBERTa [31]. We opted for the smaller architecture dimensions introduced in DistilBERT [32], namely, 6 hidden layers with 12 attention heads and a hidden dimension of 768. We also reduce the model’s maximum sequence length to 128 tokens, following another BERT instantiation trained on English Twitter data [33]. We utilize the RoBERTa implementation in the Hugging Face library [34] and optimize the model using Adam with weight decay, a linear schedule with warmup and a maximum learning rate of 2e-4. We use a global batch size (via gradient accumulation) of 5k across 4 Titan XP GPUs (12 GB RAM each) and trained the model for 650 hours.
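The learning-rate schedule mentioned above (linear warmup followed by linear decay) can be written as a small function. The number of warmup and total steps is not reported, so the values used in the usage note are purely illustrative. The shape matches what the `get_linear_schedule_with_warmup` helper in the Hugging Face library produces.

```python
def linear_warmup_lr(step, total_steps, warmup_steps, max_lr=2e-4):
    """Learning rate at a given optimizer step for a linear warmup/decay schedule."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to max_lr during warmup.
        return max_lr * step / max(1, warmup_steps)
    # Decay linearly from max_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return max_lr * remaining / max(1, total_steps - warmup_steps)
```

With a hypothetical 100 warmup steps out of 1,100 total, the rate rises from 0 to 2e-4 over the first 100 steps, then decays back to 0 by step 1,100.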

The model was trained on a corpus of 155M Spanish tweets (4.5B word tokens), with language determined by Twitter’s API. The corpus includes only original tweets (retweets are filtered out) with more than 6 tokens, and long tweets were truncated to 64 word tokens. The data was compiled from the following sources:

  • 110M Tweets (3B word tokens) from the South American protests collected from September 20 to December 31 of 2019.

  • 25M Tweets (0.7B word tokens) collected around the Coronavirus pandemic from April 01 to December 31 of 2020.

  • 3M Tweets (0.3B word tokens) collected around the Chilean referendum from September 25 to November 10 of 2020.

  • 17M (0.5B word tokens) rehydrated targets across all the collections listed.

Tweets are pretokenized using the “TweetTokenizer” from the NLTK toolkit [35], and we use the emoji package to translate emojis into word tokens (in Spanish). We also preprocess the Tweets by replacing user mentions with “USER_AT” and, using the tweet JSON, replacing media urls with “HTTPMEDIA” and web urls with “HTTPURL”. We found that this new model produced significantly better quality embeddings than the only other currently available Spanish BERT variant for Twitter [9]. We hypothesize that this is because the latter was trained mainly on European Spanish, with less data, and without the RoBERTa pretraining framework. We make the pretrained TwBETO v0 language model available through the Hugging Face model hub (https://huggingface.co/Ramavill/twBETO_v0).
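The placeholder substitution can be approximated with regular expressions, as in the sketch below. This is a simplification rather than the original pipeline. Media urls are passed in explicitly through the hypothetical `media_urls` argument instead of being read from the tweet JSON, mentions and urls are matched with regexes rather than with the NLTK tokenizer, and the emoji translation step is omitted.

```python
import re

MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")


def normalize_tweet(text, media_urls=()):
    """Replace media urls, web urls, and user mentions with placeholder tokens."""
    # Media urls must be replaced first; afterwards they no longer look like urls.
    for url in media_urls:
        text = text.replace(url, "HTTPMEDIA")
    text = URL.sub("HTTPURL", text)
    text = MENTION.sub("USER_AT", text)
    return " ".join(text.split())  # collapse repeated whitespace
```

For example, `normalize_tweet("@juan mira esto https://t.co/abc https://t.co/img", media_urls=["https://t.co/img"])` yields `"USER_AT mira esto HTTPURL HTTPMEDIA"`.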

The output of our TwBETO model, after being fed a batch of tweets, is pooled and passed through an activation layer. We use the standard pooling method for BERT models, namely taking the first element of the output of the final layer (corresponding to the “<s>” token) as input for a fully connected layer followed by an activation. This output serves as our final Tweet Embeddings of size $h_{hid} = 768$ per tweet. For the Tweet-level Context prediction, the embeddings are used as input for a classification head comprising a fully-connected layer and a softmax to predict the stance of a batch of tweets. Given that our main focus is to evaluate the effect of increasing the contextual user information available to the model, we do not fine-tune the TwBETO parameters in any training setting.
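To make the pooling and classification head concrete, the following dependency-free sketch operates on plain Python lists. The tanh activation is an assumption borrowed from the standard BERT pooler, since the text does not name the activation, and all weights in a real model would be learned tensors rather than lists.

```python
import math


def dense(vector, weights, bias):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tweet_embedding(last_hidden_state, pool_w, pool_b):
    """Pool the first-token state of the final layer into a Tweet Embedding."""
    first_token = last_hidden_state[0]  # hidden state of the first token
    # Assumption: tanh, as in the standard BERT pooler.
    return [math.tanh(v) for v in dense(first_token, pool_w, pool_b)]


def predict_stance(embedding, head_w, head_b):
    """Classification head: fully connected layer followed by a softmax."""
    return softmax(dense(embedding, head_w, head_b))
```

Here `last_hidden_state` is the list of final-layer token vectors for a single tweet, and `predict_stance` returns one probability per stance class.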

4.2 User encoder

As shown in the center part of Fig 1, the second component of our architecture consists of a stack of Transformer Encoder blocks [36] which operate on the Tweet Embeddings for a given user. As this requires a fixed input size, we introduced a parameter n_max that determines the maximum number of tweets to consider for each user. In this work, we use n_max = 15. When users exceeded this limit, we sampled n_max tweets before assigning them to a batch. This way, we avoided wasting information as different tweets can be included each time a user is sampled. We hypothesize that this, combined with the resampling approach taken to account for class imbalance, improved training stability as repeated users had a high probability of including different tweets (at least 83% of the users in each training set had 15 or more tweets in the dataset). Given the power law distribution of user tweet counts and to reduce sampling times when constructing the batches, we chose only to keep the first 150 tweets available for each user. For the Tweet Embeddings to serve as input for this component, we reshape them and prepend a [CLS] parameter vector to each user’s tweets, obtaining a tensor with dimensions (b_size, 1 + n_max, h_hid), where b_size is the number of users included in the batch and n_max is the aforementioned maximum number of tweets allowed per user in the batch. We used trainable positional embeddings to maintain the temporality of the tweets and introduced a second type of trained embeddings to encode the type of each tweet as Original, Reply or Quote. Twitter users can interact in various ways and we hypothesized that different interaction types would have different impacts on stance. This was confirmed in our ablation study (done for Bolivia), as the inclusion of each of these components improved the validation macro-F1 score at this context level. These 3 embedding tensors are then added and normalized (by layer normalization) and serve as input for our encoder stack. Based on validation results, we use an encoder stack of size L_x = 3, an intermediate embedding size of 2048, 6 attention heads and a GELU activation for the transformer stacks.
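The per-user sampling step and the resulting input shape can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the input list is assumed to be in chronological order, and keeping the sampled tweets in that order (so positional embeddings remain meaningful) is our assumption rather than something the text specifies. Padding for users with fewer than n_max tweets is not covered.

```python
import random


def sample_user_tweets(tweets, n_max=15, cap=150, rng=None):
    """Pick at most n_max tweets from the first `cap` tweets of one user.

    `tweets` is assumed to be sorted chronologically. The sampled tweets
    are returned in their original order (an assumption of this sketch).
    """
    rng = rng or random.Random()
    pool = tweets[:cap]
    if len(pool) <= n_max:
        return list(pool)
    kept = sorted(rng.sample(range(len(pool)), n_max))
    return [pool[i] for i in kept]


def batch_shape(b_size, n_max=15, h_hid=768):
    """Encoder input shape: one [CLS] slot plus n_max tweet slots per user."""
    return (b_size, 1 + n_max, h_hid)
```

Because a fresh sample is drawn each time a user is placed in a batch, a user with many tweets contributes different subsets across epochs, which is the behavior the paragraph above relies on.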

The output of the last encoder layer is pooled following the same strategy used for the TwBETO model as this allows the [CLS] parameter to attend to all tweets in the sequence and be optimized for stance classification. The output of this pooling layer serves as our final User Embeddings which maintain a dimension size of h_hid = 768. As before, we feed the embeddings to a classification head to perform user-level classification, or use them as input for the next component of our architecture.

4.3 Network-based prediction

We take the embeddings produced by the user encoder and predict users’ stances using their interactions in the social network. We explored several different graph neural network architectures, including GraphSage layers [37], Graph Convolutional layers [38], and W-L Graph Convolutional layers [39]. Ultimately, we achieved the highest macro F1 score using Graph Attention Network (GAT) layers [40]. In the social setting imposed by Twitter’s platform, there are also different user information context levels that can be leveraged by our Network-based classifier to improve stance detection. To test this, we trained separate homogeneous models for the retweet network, the network of responses (the union of replies and quotes) and the combined network (obtained by the union of both). Finally, we trained a heterogeneous model where we distinguish response and retweet edges.

For homogeneous graph experiments, after testing different alternatives, we use two GAT layers with dropout, each with four attention heads, followed by 4 batch-normed linear layers with dropout and ReLU activations. The corresponding output is fed to a classification head for our final network-based user stance classification. We observed that our heterogeneous model quickly overfit the data with our homogeneous architecture, so for the heterogeneous model, we removed 2 linear layers and used only one attention head (for each edge type). Given the size of the underlying graphs, we sample neighborhoods for each node using the neighbor loader proposed in [37].

In our GAT models, for a given user i, the input to the attention layer is the User Embeddings of a random sample of k one-hop neighbors of i, denoted as 𝒩_k(i). In our homogeneous GAT networks, the attention mechanism calculates the importance of 𝒩_k(i) to the label of i. In our heterogeneous GAT networks, the attention mechanism calculates the importance of the neighbors in 𝒩_k(i) connected through relation Φ (retweet or response) to the label of i.
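For reference, the attention computation of a standard single-head GAT layer, as introduced in [40], can be written as below. Here h_j is the User Embedding of neighbor j, W and a are the layer's learned weight matrix and attention vector, and σ is a nonlinearity. This is the textbook formulation and only a sketch of what the models above compute: whether i is included in its own sampled neighborhood, and how the per-relation outputs are combined in the heterogeneous model, are not specified in the text.

```latex
e_{ij} = \mathrm{LeakyReLU}\!\left(\mathbf{a}^{\top}\left[\mathbf{W}\mathbf{h}_i \,\Vert\, \mathbf{W}\mathbf{h}_j\right]\right),
\qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{l \in \mathcal{N}_k(i)} \exp(e_{il})},
\qquad
\mathbf{h}_i' = \sigma\!\left(\sum_{j \in \mathcal{N}_k(i)} \alpha_{ij}\,\mathbf{W}\mathbf{h}_j\right)
```

In the heterogeneous setting, one such set of coefficients would be computed per relation Φ, with separate parameters for retweet and response edges.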

4.4 Baselines

To demonstrate the overall effectiveness of our approach, we sought to re-implement as baselines, on our data, four recent user-level stance detection approaches that leverage user-interaction networks. However, two of these approaches were not possible to re-implement, as we discuss next. The first proposed an unsupervised clustering methodology for user-level stance detection [15] that did not scale to the dataset explored in this work, as it requires pairwise similarity calculations for all users of interest. The second approach proposed a multi-modal cross-target stance detection architecture [24] that uses followership and friendship networks for each user, which were not possible to recover due to API constraints.

The first approach that we successfully adapted to our dataset was TSPA [18], which uses label propagation first over the retweet network, and then over the reply network. We implement the TSPA version without weighted propagation as it achieves comparable performance to weighted propagation, and it provides us with a baseline that uses only network information. This helps us understand the signals associated with the natural community structure that exists within the network. Like the original authors, we were unable to run the algorithm on the Chilean dataset given its size. We followed the same pre-labeling approach used by the authors, but note that not all labeled users could be assigned predictions as some components of the interaction network for each country did not receive labels. For fairness of comparison, we evaluate the accuracy and F1 only on users for which the algorithm assigned labels (a total of only 21 users across the Bolivia, Colombia and Ecuador test sets were unlabeled). As TSPA is an extension of the label-propagation algorithm, cross-country baselines are not possible with the algorithm. TSPA uses only network signals, and apart from initial hashtag label seeds, does not incorporate any external user text or features.
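To make the idea of sequential propagation concrete, the sketch below runs a generic majority-vote label propagation first over retweet edges and then over reply edges. It is not the TSPA algorithm of [18]: the update rule, the treatment of edges as undirected, and the function names are all assumptions made for illustration.

```python
from collections import Counter


def propagate(edges, labels, max_iter=20):
    """Spread seed labels over an undirected graph by neighbor majority.

    Labeled nodes keep their label. An unlabeled node takes the strict
    majority label among its labeled neighbors, if one exists.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    labels = dict(labels)
    for _ in range(max_iter):
        updates = {}
        for node, nbrs in neighbors.items():
            if node in labels:
                continue
            votes = Counter(labels[n] for n in nbrs if n in labels)
            ranked = votes.most_common(2)
            if ranked and (len(ranked) == 1 or ranked[0][1] > ranked[1][1]):
                updates[node] = ranked[0][0]
        if not updates:
            break
        labels.update(updates)
    return labels


def two_stage(retweet_edges, reply_edges, seeds):
    """Propagate over the retweet graph first, then over the reply graph."""
    return propagate(reply_edges, propagate(retweet_edges, seeds))
```

Nodes in components that contain no seed never receive a label, which mirrors the observation above that some users could not be assigned predictions.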

The final baseline implemented was Retweet-BERT [19], which performs unsupervised training on Sentence-BERT to contrastively embed user descriptions based on user retweet networks. They then fine-tune this trained model on a user political stance task. As not all users in our data have descriptions, and many user descriptions changed, we implement the unsupervised fine-tuning step only on the first description for each labeled user using the retweet network of all labeled users. We then fine-tune this model on training users with descriptions and evaluate the model on validation and test users with descriptions. As users were randomly split into train, validation, and test sets, this approach impacted each split roughly equally. In each country, between 70% and 80% of labeled users had descriptions. As data were weakly labeled using hashtags, hashtags could provide too easy a signal. Consequently, all hashtags and urls were removed from tweets and user descriptions, both for the baselines and for our proposed architectures.

5 Results & discussion

5.1 Main results

In this section, we present the results of the different components of our architecture to evaluate the effects of increasing the level of context available to a classifier. We use the Tweet-level classification as a baseline for our ablation studies. It is important to note that this task is evaluated at a tweet level (not at a user level) by assigning each tweet the stance of its corresponding user. We define four different baseline experiments to assess the performance of classifiers trained at this context level:

  • Weak-Labeled Tweets: This model was trained using tweets that contained “Stance Tags” as identified in [7]. All trailing hashtags were removed during pre-processing to avoid overfitting to these tokens.

  • All Original Tweets: This model was trained using all original, reply, and quote tweets produced by the different weakly-labeled users.

  • User Average of Original Tweets: This model was trained by averaging the predicted labels assigned to a given user. A user is assigned the majority of their predicted labels. When no majority exists, a stance is assigned randomly.

  • User Average of All Tweets: This User-Level prediction exercise is similar to the one described above, but also includes the retweets of a user. Retweets are assigned the label predicted for their target.
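The user-level averaging rule in the last two baselines (majority of predicted tweet labels, random assignment when no majority exists) can be sketched as follows. The function name and the optional `rng` argument are hypothetical and only serve the illustration.

```python
import random
from collections import Counter


def user_stance(predicted_labels, rng=None):
    """Majority vote over one user's per-tweet predictions.

    When several labels tie for the highest count, one of them is
    chosen at random, as described for the user-average baselines.
    """
    if not predicted_labels:
        raise ValueError("a user needs at least one predicted label")
    rng = rng or random.Random()
    counts = Counter(predicted_labels)
    top = max(counts.values())
    winners = sorted(label for label, c in counts.items() if c == top)
    return winners[0] if len(winners) == 1 else rng.choice(winners)
```

For the variant that includes retweets, each retweet would first be given the label predicted for its target tweet and then be counted like any other tweet.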

We include the last two user-level predictions to evaluate how effective our proposed User Transformer is at aggregating tweet information. We also include the TSPA model [18] (see Sect 4.4) as a network baseline that uses only a user’s ego-network information. This helps us understand the signals associated with the natural community structure that exists within the network. However, as mentioned before, TSPA could not be run on Chile due to its data size. We evaluate the performance of each model using accuracy, macro F1 (due to the observed asymmetries in the label distribution), and the Area Under the Receiver Operating Characteristic (ROC AUC) on the held-out test set. The latter metric accounts for the possible differences in the optimal decision threshold that might exist across the different datasets. To assess whether the differences seen are statistically significant, we recomputed the model performance metrics based on 1000 bootstrapped samples of the corresponding test set. We report the median result of the bootstrapped empirical distribution for each metric and compute its corresponding confidence band (these are presented in the Supporting Information).
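The bootstrap procedure can be sketched as below for the macro F1 score. This is a minimal sketch, assuming percentile-based bands and resampling of test items with replacement. The function names, the seed, and the exact percentile indexing are our assumptions, not details given in the text.

```python
import random
import statistics


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_band(y_true, y_pred, metric, n_boot=1000, level=0.95, seed=0):
    """Median and percentile band of `metric` over resampled test sets."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    for _ in range(n_boot):
        # Resample test items with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(metric([y_true[i] for i in idx],
                             [y_pred[i] for i in idx]))
    values.sort()
    lo = values[int((1 - level) / 2 * (n_boot - 1))]
    hi = values[int((1 + level) / 2 * (n_boot - 1))]
    return statistics.median(values), (lo, hi)
```

Two models can then be compared by checking whether their bands overlap, which is one common (and conservative) way to read such bands.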

In Table 4, we present the results of the ablation studies. As shown, there is a clear performance improvement when more contextual information is available to the classifier. As expected, focusing only on the weakly-labeled tweets presents an easier task than focusing on all original tweets, but still falls short of User-Level averages. When retweets are included, the performance improves but is still significantly below the User Transformer in all countries and across all metrics (see Table 1 of S1 for the 95% confidence bands of each metric). Interestingly, we observed that, for Colombia, adding retweets hurts the user-average performance. The User Transformer does not exhibit these disparate behaviors, highlighting its ability to identify tweets that are more relevant to the stance of the user. While the user-level transformer significantly outperformed Retweet-BERT [19] in all settings, it yielded statistically significant lower accuracy and M-F1 scores than TSPA in Bolivia and Ecuador. In the case of Colombia, the differences are not statistically significant, which is consistent with what was noted before for the Retweet Average baseline. TSPA’s relatively strong performance demonstrates the strong stance signals present in each network, even without user text features. Nonetheless, the User Transformer significantly outperforms it, across all countries, when considering their ROC AUC score, which shows that neither model uniformly outperforms the other. Importantly, and as hypothesized at the start of this work, leveraging the ego-network of the user, in conjunction with the semantic features aggregated by the User-level Transformer, can consistently improve upon this already high bar. In all four countries, the heterogeneous model, which leverages all network information, yielded macro F1 scores greater than 94.1 and accuracy scores greater than 95.7. The difference in model performance is statistically significant in Chile, while for the remaining countries, other network models or the User Transformer can match its performance in some metric. However, as we will explore next, the differences in performance are exacerbated when exploring each model’s generalization capabilities.

Table 4. Performance of in-country Stance Classifiers at different context levels.

Model performance metrics correspond to the median of 1000 bootstrapped samples of the test set. For each metric, the classifiers whose performance is not statistically different, at a 95% confidence level, from the best-performing model are highlighted in bold. We also highlight in bold the name of the best-performing model if it obtains the best results (or is statistically tied) in all presented metrics. The corresponding confidence bands, computed from the bootstrapped empirical distribution of each metric, are included in Table 1 of S1.

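| Context level | Model | Chile Acc. (%) | Chile M-F1 (%) | Chile ROC AUC (%) | Bolivia Acc. (%) | Bolivia M-F1 (%) | Bolivia ROC AUC (%) | Ecuador Acc. (%) | Ecuador M-F1 (%) | Ecuador ROC AUC (%) | Colombia Acc. (%) | Colombia M-F1 (%) | Colombia ROC AUC (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tweet Level | Weak-Labeled Tweets | 85.48 | 83.90 | 85.60 | 83.75 | 73.82 | 88.38 | 77.42 | 76.53 | 79.70 | 82.47 | 82.14 | 82.97 |
| Tweet Level | All Original Tweets | 75.50 | 75.49 | 76.28 | 72.03 | 66.83 | 77.44 | 69.95 | 65.55 | 72.61 | 73.45 | 71.12 | 83.92 |
| User-Level | Avg. Tw.-Level, All Original | 86.07 | 75.99 | 86.19 | 79.53 | 79.18 | 83.93 | 81.40 | 78.45 | 83.80 | 80.72 | 73.66 | 81.21 |
| User-Level | Avg. Tw.-Level, With Retweets | 94.79 | 88.70 | 95.77 | 89.03 | 88.97 | 95.71 | 92.30 | 90.60 | 95.81 | 81.41 | 67.71 | 93.02 |
| User-Level | Retweet-BERT | 92.84 | 83.42 | 92.63 | 75.77 | 75.77 | 84.11 | 71.74 | 63.03 | 73.70 | 79.25 | 72.43 | 80.01 |
| User-Level | Transformer | 95.98 | 91.53 | 97.71 | 94.06 | 94.05 | 96.60 | 95.00 | 94.35 | 98.11 | 95.61 | 94.24 | 98.24 |
| Network-Level | TSPA | n/a | n/a | n/a | 95.02 | 95.02 | 95.55 | 95.87 | 95.33 | 96.89 | 96.34 | 95.19 | 94.29 |
| Network-Level | Homogeneous, Response | 95.61 | 90.58 | 97.69 | 94.01 | 93.94 | 96.82 | 94.20 | 92.89 | 97.65 | 95.11 | 93.57 | 98.22 |
| Network-Level | Homogeneous, Retweet | 96.23 | 91.54 | 97.77 | 94.00 | 93.99 | 96.56 | 95.03 | 94.39 | 98.13 | 95.68 | 94.39 | 98.44 |
| Network-Level | Homogeneous, Combined Network | 95.74 | 90.00 | 97.70 | 94.06 | 94.06 | 96.58 | 94.66 | 94.02 | 98.13 | 95.60 | 94.23 | 98.46 |
| Network-Level | Heterogeneous | 97.41 | 94.41 | 98.62 | 95.76 | 95.76 | 97.01 | 96.22 | 95.76 | 98.64 | 97.01 | 96.08 | 98.91 |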

5.2 Robustness analysis

Cross-country robustness.

Even though the addition of social context can lead to an in-sample increase in classification performance, we also seek to evaluate how context can affect the generalizability of our proposed classifiers. Given that our target countries share a common language, we hypothesized that the models should be able to extrapolate the stance learned from one country to another. Protests in Bolivia had opposing ideological motivations in comparison with the motivations observed in the other countries, which we hypothesized would degrade cross-country performance significantly. The results for the cross-country ablation are shown in Table 5. For these experiments, we excluded from the test country (column) any user that was part of the training set of the other country’s classifier (row). Refer to Table 3 for more details on the overlap between the different collections. As before, we recomputed the model performance metrics based on 1000 bootstrapped samples of the corresponding cross-country test set. To save space, we only include the macro F1 score and the ROC AUC metric for the best-performing classifiers at each context level. As mentioned in Sect 4.4, we do not include the TSPA baseline as it is an extension of a label propagation algorithm that relies on hashtag seeds that are not reliably used in different protest collections.

Table 5. Macro F-1 (%) score and Area Under the ROC curve for out-of-sample cross-country predictions for classifiers at different context levels.

Model performance metrics correspond to the median of 1000 bootstrapped samples from the column country, excluding users seen during training of the country classifier (row). For each metric, the classifiers whose performance is not statistically different, at a 99% confidence level, from the best-performing model are highlighted in bold. We also highlight in bold the name of the best-performing model if it obtains the best results (or is statistically tied) in all presented metrics. For the Bolivian case study, the worst-performing classifier is highlighted. The corresponding confidence bands, computed from the bootstrapped empirical distribution of each metric, are included in Table 2 of S1.

| Trained on (row) | Model | Chile M-F1 | Chile ROC AUC | Bolivia M-F1 | Bolivia ROC AUC | Ecuador M-F1 | Ecuador ROC AUC | Colombia M-F1 | Colombia ROC AUC |
|---|---|---|---|---|---|---|---|---|---|
| Chile | Avg. Tweet-Level (WR) | n/a | n/a | 37.37 | 19.68 | 57.67 | 88.05 | 89.92 | 95.98 |
| Chile | User-Level Transformer | n/a | n/a | 35.73 | 27.29 | 81.24 | 88.66 | 84.32 | 93.31 |
| Chile | Hetero. Network-Level | n/a | n/a | 22.81 | 18.36 | 88.95 | 97.23 | 92.57 | 98.49 |
| Bolivia | Avg. Tweet-Level (WR) | 9.86 | 11.78 | n/a | n/a | 26.84 | 27.69 | 17.36 | 14.78 |
| Bolivia | User-Level Transformer | 9.94 | 18.74 | n/a | n/a | 25.26 | 45.41 | 18.36 | 24.06 |
| Bolivia | Hetero. Network-Level | 3.12 | 7.53 | n/a | n/a | 25.74 | 24.89 | 5.41 | 9.52 |
| Ecuador | Avg. Tweet-Level (WR) | 58.18 | 93.63 | 33.87 | 13.13 | n/a | n/a | 80.01 | 86.44 |
| Ecuador | User-Level Transformer | 80.43 | 90.87 | 23.57 | 15.98 | n/a | n/a | 77.90 | 89.75 |
| Ecuador | Hetero. Network-Level | 92.13 | 97.17 | 9.80 | 5.08 | n/a | n/a | 95.50 | 98.33 |
| Colombia | Avg. Tweet-Level (WR) | 80.86 | 91.93 | 39.05 | 38.79 | 45.69 | 80.26 | n/a | n/a |
| Colombia | User-Level Transformer | 73.48 | 93.26 | 38.59 | 29.92 | 77.61 | 86.30 | n/a | n/a |
| Colombia | Hetero. Network-Level | 85.59 | 96.51 | 42.37 | 27.49 | 73.20 | 92.90 | n/a | n/a |

As shown, the Bolivian case serves as an adversarial setting for classifiers trained in other countries, which suggests that, when applied to this country, the semantic features leveraged by the classifiers are operating on an ideological dimension. This is consistent with results presented in [7], where the authors found that language polarization remained along ideological lines when comparing protests of any of the three countries with Bolivia. This was not the case when comparing protests of the other three countries. As expected, Chile, Colombia, and Ecuador exhibit strong pairwise performance consistent with their ideological alignment. We can also note that the Tweet-Level performance varies significantly, falling in some cases 30 points below the user Transformer’s macro F1, which is far more stable (73–85% in these countries). However, when contrasting their ROC AUC performance, both models remain competitive, which suggests that there are country-specific thresholds that can significantly improve the average tweet-level classification performance. Nonetheless, given that we seek to evaluate the one-shot performance of the different models (including its optimal decision threshold for the respective country), we will focus mainly on differences in macro F1 performance for the robustness analysis.

We found that on average, excluding Bolivian predictions, the heterogeneous network model increased cross-country prediction macro F1 score by 7.62 points and significantly outperforms all other models in all metrics and all country combinations except for one discussed below. In the best case, the Ecuadorian instance yielded a zero-shot macro F1 of 95.5 on Colombian data, improving on its User Transformer counterpart by 15 points. Interestingly enough, the only case where the heterogeneous model underperforms its User Transformer counterpart (by 4 points) is when applying the Colombian models to Ecuadorian data. It is worth noting that although the heterogeneous model achieves this in countries with similarly motivated protests, it also significantly underperforms other alternatives in the Bolivian case (except for the Colombian classifiers). This is consistent with the observed tendency of deep learning models to be overconfident in their predictions and highlights the importance of domain knowledge when applying these models to different domains. If we leverage the knowledge of the opposed motivations for the Bolivian protests by inverting the heterogeneous model’s predictions, its performance would be in line with what was observed for the other countries. We take the approach of inverting the predictions when we report the results of the 2020 Chilean referendum.
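The effect of inverting a binary classifier on its ROC AUC can be checked with a small sketch. It relies on the identity that complementing every score turns an AUC of a into 1 - a. Under that identity, for instance, a ROC AUC of 5.08% in Table 5 would correspond to 94.92% after inversion. The helper names are hypothetical, the pairwise AUC computation is quadratic and meant only for small illustrations, and macro F1 has no such simple complement, so it would need to be recomputed from the inverted predictions.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def invert(scores):
    """Flip a binary classifier by complementing its class-1 scores."""
    return [1.0 - s for s in scores]
```

Inversion of this kind requires outside knowledge that the two contexts have opposed motivations. It cannot be inferred from the model's own confidence.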

Robustness over time.

To test the effect of context over time, we applied the Chilean protests classifiers to predict the stance of users, in a zero-shot setting, towards the 2020 Plebiscite vote for drafting a new constitution. In contrast to previous experiments, we report the results for all users, regardless of whether or not they were seen when training the Chilean models. However, we segment the results based on this condition, as it allows us to contrast the performance of the different models on new users. We take this approach given that (1) it is often realistic in applied settings and (2) there is no overlap between the tweets seen in both collections (see Table 3). Additionally, only 0.015% of retweets in the referendum collection reference a tweet that occurred during the protests.

We hypothesized that opposition to the government during the protests should signal endorsement for the new constitution (the models should be good inverse classifiers). Table 6 presents the results of this exercise. We also include the predicted Referendum vote obtained when applying each classifier, based on the two-hop neighborhood of labeled users. The benefits of the heterogeneous model are more starkly observed in this setting, as it significantly outperforms all other variants across all metrics. However, surprisingly, the homogeneous network models underperform the User-level Transformer in all settings and these differences are statistically significant at a 99% confidence level. We hypothesize that this might be due in part to differences in network construction: the Referendum data included the timeline of all users collected, whereas this was not done for the country-level protest data. This implies that the networks constructed for the former dataset are denser and include more reply and quote interactions between users. Hence, it is more likely that we capture both positive and negative responses from one user to others of the same or different stances. While the User-level Transformer can leverage the semantic context of each tweet as it pertains to the stance of the user, the homogeneous models are forced to collapse these disparate signals into one interaction link. As such, we see that the Retweet-based homogeneous model shows the least degradation (as retweets provide more consistent endorsements), while the combined network model drops significantly below other simple average user-level models. However, when the model is allowed to leverage the heterogeneous signals provided by retweets and other responses, its performance improves dramatically. Moreover, except for the heterogeneous network model, the performance of all classifiers is significantly reduced when dealing with completely new users at a 99% confidence level. This again highlights how the disparate signals present in a user’s ego-network can improve the robustness of the classifier over time. The User-level Transformer shows the second least degradation, while other models show a 10-point drop in performance. With regards to the predicted referendum vote (shown as “Pr-Ref” in Table 6), we note that semantic-based classifiers tend to undershoot the observed Referendum tallies (the final vote was 78% in favor of the new constitution), while the single-network-based classifiers do the opposite. Interestingly, the heterogeneous model predicts the voting ratio almost perfectly, which is encouraging. However, we stress that, given that there is no evidence that the sample constructed is representative of the Chilean population, the closeness of this prediction is merely suggestive of good performance and more research is necessary to assess if this behavior is truly generalizable. The network models’ strong performance, when allowed to leverage the heterogeneous interactions between users, can be in part explained by the fact that language and conversation topics change faster than social ties, and the fact that social ties themselves can alter language [41]. As a result, features extracted from interaction networks may carry strong signals that are more generalizable to different cultural contexts.

Table 6. Out of sample Predictions for the Chilean Referendum at different context levels.

These correspond to the classifier trained on the 2019 Chilean Protest Data, but with inverted labels (“Pro” government are considered “Against” the referendum and vice-versa). Model performance metrics correspond to the median of 1000 bootstrapped samples of the Referendum collection, disaggregated by whether a user was seen during training of the Chilean Protest classifier. For each metric, the classifiers whose performance is not statistically different, at a 99% confidence level, from the best-performing model are highlighted in bold. We also highlight in bold the name of the best-performing model if it obtains the best results (or is statistically tied) in all presented metrics. The corresponding confidence bands, computed from the bootstrapped empirical distribution of each metric, are included in Table 3 of S1.

| Context level | Model | Accuracy (%), New Users | Accuracy (%), Protests User | M-F1 (%), New Users | M-F1 (%), Protests User | ROC AUC, New Users | ROC AUC, Protests User | Pr-Ref (%) |
|---|---|---|---|---|---|---|---|---|
| Tweet-Level | Weak-Labeled Tweets | 69.74 | 70.70 | 61.70 | 70.13 | 78.97 | 79.09 | 57.82 |
| Tweet-Level | All Original Tweets | 64.48 | 70.75 | 63.68 | 67.69 | 72.61 | 74.61 | 65.18 |
| User-Level | Avg. Tweet-Level, All Original | 77.50 | 81.61 | 77.25 | 79.74 | 82.90 | 84.17 | 63.81 |
| User-Level | Avg. Tweet-Level, With Retweets | 80.83 | 88.59 | 80.83 | 86.00 | 97.14 | 97.90 | 67.81 |
| User-Level | Transformer | 89.43 | 91.19 | 89.18 | 90.24 | 95.97 | 96.56 | 63.36 |
| Network-Level | Homogeneous, Response | 76.04 | 84.78 | 76.03 | 81.26 | 93.34 | 93.65 | 86.3 |
| Network-Level | Homogeneous, Retweet | 80.12 | 87.64 | 80.12 | 85.11 | 95.51 | 96.58 | 82.3 |
| Network-Level | Homogeneous, Combined Network | 71.84 | 82.19 | 71.69 | 77.45 | 90.98 | 91.65 | 87.6 |
| Network-Level | Heterogeneous | 98.78 | 99.43 | 98.74 | 99.36 | 99.84 | 99.85 | 78.5 |

To summarize, in all but one of the robustness experiments conducted, our heterogeneous graph neural network yielded the highest accuracy and F1 scores. The strong performance of the heterogeneous model aligns with intuition, as different edges on Twitter have different social functions. Retweeting with no commentary is more likely an endorsement than an argument, whereas replying could imply a fight or it could imply an agreement. To further test this explanation, for each node i in the labeled Bolivian retweet and response networks, we calculated the percentage of neighbors of node i that shared the same label as node i and then took the average of those percentages. In the retweet network, the average neighborhood-agreement percentage was 93%. In the Response (reply, quote, mention) networks, neighborhood agreement was only 69%. This suggests that the ability to differentiate between relation types should help the model’s performance, which was validated by the results of the referendum experiment. Our results confirm that including interaction types in models can yield modest improvements in predictive power in in-country stance detection tasks. However, we find that the inclusion of a user’s heterogeneous ego-network information yields much larger improvements in related country-context assessments and in robustness over time.
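The neighborhood-agreement statistic described above can be sketched as follows. This is a minimal illustration: treating edges as undirected, restricting the computation to edges between labeled users, and skipping users without labeled neighbors are assumptions of this sketch, and the function name is hypothetical.

```python
def neighborhood_agreement(edges, labels):
    """Average share of a node's labeled neighbors that share its label.

    Only edges between two labeled nodes are used, and nodes without
    any labeled neighbor are left out of the average.
    """
    neighbors = {}
    for u, v in edges:
        if u in labels and v in labels:
            neighbors.setdefault(u, set()).add(v)
            neighbors.setdefault(v, set()).add(u)
    shares = [
        sum(labels[n] == labels[node] for n in nbrs) / len(nbrs)
        for node, nbrs in neighbors.items()
    ]
    return sum(shares) / len(shares) if shares else float("nan")
```

Running this separately on the retweet edges and on the response edges gives the two percentages that are compared in the text.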

6 Limitations and future work

There are several limitations to our study. First, we use binary stance labels in our models, so users with no or contradictory/nuanced opinions may be incorrectly categorized into pro- or anti-categories. Also, we did not attempt to remove bots or trolls, nor did we survey in-country Twitter user demographics, so these datasets may be noisy proxies for population opinions. However, despite this, we found the stance distribution of users in the Chilean referendum data was nearly identical to the final referendum vote. Nonetheless, given that the construction of the referendum data is in no way representative of the Chilean population at the time, further research needs to be conducted to assess whether it is truly generalizable. A possible avenue to explore would be to reweight the sample with estimated demographic information (see [42] for a promising approach to this problem). Finally, our network model only uses interactions and the text of posts for classification. This excludes many other attributes available on individual posts that might correlate with stance, including URLs, followership, posting patterns, bios, and shared multimedia. In future work, we plan to explore the integration of multimodal data as well as the integration of other attributes that may be useful in detecting stance.

7 Ethical considerations

We comply with Twitter’s fair use policy by presenting our results at an aggregate level. We make no effort to identify individual users, and the primary interest of this work is to assess how classifier performance is affected when we leverage the different levels of contextual information available as users interact through social media. However, in this research, we demonstrate how indirect interactions on these platforms can still provide strong signals for identifying user stances, even in cases where users may not explicitly express their opinions. In other words, advertisers and governments may be able to target users using indirect ties.

In a best-case scenario, this might allow organizations to send users relevant advertisements or to provide beneficial information on government services to relevant groups. However, in worst-case scenarios, systems like this also have the potential to be abused, as corporations or governments could choose to target vulnerable or dissident groups. This further underscores the importance of social network platforms protecting user anonymity and privacy and the importance of being cognizant of traces one leaves online. We hope that our work can help inform future discussions on protecting user privacy.

8 Conclusions

In this work, we explored the value of contextual user information in the task of target-stance classification during the 2019 South American Protests. For this purpose, we constructed a compartmentalized architecture that relied on Transformers for the Tweet and User level contexts, and GNNs to leverage social media relations. We found that increasing context not only improved the performance of a classifier within the country in which it was trained but also made it more robust to out-of-sample predictions. We found these out-of-sample improvements were substantial both when comparing a classifier’s performance across varying country contexts and over time.

Supporting information

S1. Confidence Bands.

Confidence bands for the Macro F-1 score and Area Under the ROC curve of the within-country and out-of-sample cross-country predictions at different context levels. The confidence bands are computed from the corresponding percentiles of the 1000 bootstrapped samples constructed for each experiment.

(PDF)


Acknowledgments

The collection of the data required for this work was possible due to access to Twitter’s v2 full-archive search endpoint granted to us by their now discontinued Academic Research Program. We would also like to thank the two anonymous journal reviewers for helpful comments on an earlier draft of this work.

Data Availability

In compliance with Twitter’s (now known as X) 2023 Terms of Service, we are not able to share tweet or user ids, nor the corresponding text for each post. However, to ensure reproducibility of the results presented in the paper, we make available all predictions obtained for the different models for anonymized users and their corresponding anonymized tweets. These can be used to recreate the country-level results and the different robustness analyses. The code developed is publicly available at https://github.com/rvillaco/Protest_Stance_Detection and https://doi.org/10.5281/zenodo.15571840. The anonymized data, including the User Embeddings, required to reproduce the results presented in the paper are available at https://doi.org/10.5281/zenodo.14207926.

Funding Statement

The research for this paper was supported in part by the ARMY Scalable Technologies for Social Cybersecurity, Office of Naval Research, MURI: Persuasion, Identity, & Morality in Social-Cyber Environments, and Office of Naval Research, MURI: Near Real Time Assessment of Emergent Complex Systems of Confederates under grants W911NF20D0002, N000142112749, and N000141712675. RVC was also supported by the Secretaría de Educación Superior, Ciencia, Tecnología e Innovación (https://siau.senescyt.gob.ec/convocatorias), Ecuador; the Center for Informed Democracy and Social-cybersecurity (https://www.cmu.edu/ideas-social-cybersecurity) and the Center for Computational Analysis of Social and Organizational Systems (CASOS) at Carnegie Mellon University. The views and conclusions are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the ARMY, the ONR, or the US or Ecuadorian Government. None of the sponsors or funders of this work played any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Burstein P. The impact of public opinion on public policy: A review and an agenda. PRQ. 2003;56(1):29–40.
  • 2. Lax JR, Phillips JH. The democratic deficit in the states. Am J Political Sci. 2012;56(1):148–66.
  • 3. GlobeNewswire. $6.78 billion public opinion and election polling global market to 2030 - identify growth segments for investment. https://www.globenewswire.com/news-release/2021/07/29/2271041/28124/en/6-78-Billion-Public-Opinion-and-Election-Polling-Global-Market-to-2030-html. 2021.
  • 4. Tsakalidis A, Aletras N, Cristea AI, Liakata M. Nowcasting the stance of social media users in a sudden vote: The case of the Greek referendum. In: Proceedings of the 27th ACM international conference on information and knowledge management, 2018. p. 367–76.
  • 5. Du Bois JW. The stance triangle. Stancetaking in discourse: Subjectivity, evaluation, interaction. 2007;164(3):139–82.
  • 6. Küçük D, Can F. Stance detection: A survey. ACM Comput Surv. 2020;53(1):1–37.
  • 7. Villa-Cox R, Zeng HS, KhudaBukhsh AR, Carley KM. Linguistic and news-sharing polarization during the 2019 South American protests. In: International conference on social informatics, 2022. p. 76–95.
  • 8. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, 2019. p. 4171–86.
  • 9. González JA, Hurtado L-F, Pla F. TWilBert: Pre-trained deep bidirectional transformers for Spanish Twitter. Neurocomputing. 2020;426:58–69.
  • 10. Mohammad SM, Sobhani P, Kiritchenko S. Stance and sentiment in tweets. ACM Trans Internet Technol. 2017;17(3):26.
  • 11. Du J, Xu R, He Y, Gui L. Stance classification with target-specific neural attention networks. In: International joint conferences on artificial intelligence, 2017.
  • 12. Zubiaga A, Liakata M, Procter R, Bontcheva K, Tolmie P. Crowdsourcing the annotation of rumourous conversations in social media. In: Proceedings of the 24th international conference on World Wide Web, 2015. p. 347–53.
  • 13. Siddiqua UA, Chy AN, Aono M. Tweet stance detection using an attention based neural ensemble model. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, 2019. p. 1868–73.
  • 14. Al-Ghadir AI, Azmi AM, Hussain A. A novel approach to stance detection in social media tweets by fusing ranked lists and sentiments. Inform Fusion. 2021;67:29–40.
  • 15. Darwish K, Stefanov P, Aupetit M, Nakov P. Unsupervised user stance detection on twitter. In: Proceedings of the international AAAI conference on web and social media, 2020.
  • 16. Wu Z, Pi D, Chen J, Xie M, Cao J. Rumor detection based on propagation graph neural network with attention mechanism. Expert Syst Appl. 2020;158:113595. doi: 10.1016/j.eswa.2020.113595
  • 17. Kochkina E, Liakata M, Zubiaga A. All-in-one: Multi-task learning for rumour verification. In: Proceedings of the 27th international conference on computational linguistics, 2018. p. 3402–13.
  • 18. Williams EM, Carley KM. TSPA: Efficient target-stance detection on Twitter. In: 2022 IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM), 2022. p. 242–6.
  • 19. Jiang J, Ren X, Ferrara E. Retweet-BERT: political leaning detection using language features and information diffusion on social networks. In: Proceedings of the international AAAI conference on web and social media, 2023. p. 459–69.
  • 20. Xu C, Paris C, Nepal S, Sparks R. Cross-target stance classification with self-attention networks. In: Proceedings of the 56th annual meeting of the association for computational linguistics (Volume 2: Short Papers), 2018. p. 778–83.
  • 21. Liang B, Fu Y, Gui L, Yang M, Du J, He Y. Target-adaptive graph for cross-target stance detection. In: Proceedings of the web conference 2021, 2021. p. 3453–64.
  • 22. Mohtarami M, Glass J, Nakov P. Contrastive language adaptation for cross-lingual stance detection. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), 2019. p. 4442–52.
  • 23. Conforti C, Berndt J, Pilehvar MT, Giannitsarou C, Toxvaerd F, Collier N. Synthetic examples improve cross-target generalization: A study on stance detection on a Twitter corpus. In: Proceedings of the eleventh workshop on computational approaches to subjectivity, sentiment and social media analysis, 2021. p. 181–7.
  • 24. Khiabani PJ, Zubiaga A. Few-shot learning for cross-target stance detection by aggregating multimodal embeddings. IEEE Trans Comput Soc Syst. 2023;11(2):2081–90.
  • 25. Denning PJ. The locality principle. Communication networks and computer systems: A tribute to professor Erol Gelenbe. World Scientific. 2006. p. 43–67.
  • 26. Lai M, Patti V, Ruffo G, Rosso P. Stance evolution and twitter interactions in an Italian political debate. In: International conference on applications of natural language to information systems, 2018. p. 15–27.
  • 27. Li C, Porco A, Goldwasser D. Structured representation learning for online debate stance prediction. In: Proceedings of the 27th international conference on computational linguistics, 2018. p. 3728–39.
  • 28. Smith SL, Turban DH, Hamblin S, Hammerla NY. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In: International conference on learning representations. 2017.
  • 29. Ciano G, Rossi A, Bianchini M, Scarselli F. On inductive-transductive learning with graph neural networks. IEEE Trans Pattern Anal Mach Intell. 2022;44(2):758–69. doi: 10.1109/TPAMI.2021.3054304
  • 30. Babcock M, Cox RAV, Kumar S. Diffusion of pro-and anti-false information tweets: the Black Panther movie case. Comput Math Organ Theory. 2019;25:72–84.
  • 31. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D. RoBERTa: A robustly optimized BERT pretraining approach. 2019. https://arxiv.org/abs/1907.11692
  • 32. Sanh V, Debut L, Chaumond J, Wolf T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint. 2019. doi: 10.48550/arXiv.1910.01108
  • 33. Nguyen DQ, Vu T, Nguyen AT. BERTweet: A pre-trained language model for English Tweets. In: Proceedings of the 2020 conference on empirical methods in natural language processing: System demonstrations, 2020. p. 9–14.
  • 34. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, et al. Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 conference on empirical methods in natural language processing: System demonstrations, 2020. p. 38–45.
  • 35. Bird S, Klein E, Loper E. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc. 2009.
  • 36. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN. Attention is all you need. Adv Neural Inf Process Syst. 2017;30.
  • 37. Hamilton WL, Ying R, Leskovec J. Inductive representation learning on large graphs. In: Proceedings of the 31st international conference on neural information processing systems, 2017. p. 1025–35.
  • 38. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. 2016. https://arxiv.org/abs/1609.02907
  • 39. Morris C, Ritzert M, Fey M, Hamilton WL, Lenssen JE, Rattan G. Weisfeiler and Leman go neural: Higher-order graph neural networks. In: Proceedings of the AAAI conference on artificial intelligence, 2019. p. 4602–9.
  • 40. Veličković P, Cucurull G, Casanova A, Romero A, Li P, Bengio Y. Graph attention networks. In: Proceedings of the sixth international conference on learning representations, 2018.
  • 41. Laitinen M, Fatemi M, Lundberg J. Size matters: Digital social networks and language change. Front Artif Intell. 2020;3:46. doi: 10.3389/frai.2020.00046
  • 42. Wang Z, Hale S, Adelani DI, Grabowicz P, Hartman T, Flöck F, et al. Demographic inference and representative population estimates from multilingual social media data. In: The World Wide Web conference, 2019. p. 2056–67.

Decision Letter 0

Carlos Henrique Gomes Ferreira

4 Oct 2024

PONE-D-24-21648

Evaluating the Effect of Heterogeneous User Interactions in Out-of-Sample Stance Classification

PLOS ONE

Dear Dr. Villa-Cox,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Nov 18 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Carlos Henrique Gomes Ferreira, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

3. Thank you for stating the following in the Acknowledgments Section of your manuscript: "The research for this paper was supported in part by the ARMY Scalable Technologies for Social Cybersecurity, Office of Naval Research, MURI: Persuasion, Identity, & Morality in Social-Cyber Environments, and Office of Naval Research, MURI: Near Real Time Assessment of Emergent Complex Systems of Confederates under grants W911NF20D0002, N000142112749, and N000141712675. It was also supported by the Secretaría de Educación Superior, Ciencia, Tecnología e Innovación (SENESCYT), Ecuador; the Center for Informed Democracy and Social-cybersecurity (IDeaS) and the Center for Computational Analysis of Social and Organizational Systems (CASOS) at Carnegie Mellon University. The views and conclusions are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the ARMY, the ONR, or the US Government".

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: "The research for this paper was supported in part by the ARMY Scalable Technologies for Social Cybersecurity, Office of Naval Research, MURI: Persuasion, Identity, & Morality in Social-Cyber Environments, and Office of Naval Research, MURI: Near Real Time Assessment of Emergent Complex Systems of Confederates under grants W911NF20D0002, N000142112749, and N000141712675. RVC was also supported by the Secretaría de Educación Superior, Ciencia, Tecnología e Innovación (https://siau.senescyt.gob.ec/convocatorias), Ecuador; the Center for Informed Democracy and Social-cybersecurity (https://www.cmu.edu/ideas-social-cybersecurity) and the Center for Computational Analysis of Social and Organizational Systems (CASOS) at Carnegie Mellon University. The views and conclusions are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the ARMY, the ONR, or the US or Ecuadorian Government. None of the sponsors or funders of this work played any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript."

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

4. When completing the data availability statement of the submission form, you indicated that you will make your data available on acceptance. We strongly recommend all authors decide on a data sharing plan before acceptance, as the process can be lengthy and hold up publication timelines. Please note that, though access restrictions are acceptable now, your entire data will need to be made freely accessible if your manuscript is accepted for publication. This policy applies to all data except where public deposition would breach compliance with the protocol approved by your research ethics board. If you are unable to adhere to our open data policy, please kindly revise your statement to explain your reasoning and we will seek the editor's input on an exemption. Please be assured that, once you have provided your new statement, the assessment of your exemption will not hold up the peer review process.

5. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript investigates the impact of different levels of context in detecting the stance of social media users towards political events, with a specific focus on the 2019 South American Protests. It also evaluates the models' ability to extrapolate the prediction to the contexts of other countries. The authors have collected a large dataset of tweets in 2019 and an additional dataset in 2020 related to the Chilean Plebiscite. The different levels of context are: tweet-level, user-level (i.e., including previous tweets), and egonet-level (i.e., including interactions with other users). They pre-train a RoBERTa model to represent tweet content, a transformer-like model to represent user history, and a heterogeneous Graph Attention Network to represent social media interactions and their types. The proposed model is compared with Retweet-BERT and TSPA (a label-propagation-based algorithm with pre-calculated interaction weights). The proposed model outperforms the baselines, and the additional levels of context have a positive impact on its performance when retweets are considered at the user level and a heterogeneous GAT is used.

Strengths

S1. Large, exclusive dataset on the 2019 South American Protests and on the 2020 Chilean Plebiscite.

S2. A Spanish variant of BERT (twBETO), trained on 150 million Spanish tweets, will be made available.

S3. Proposed model achieves excellent performance on political stance detection on the aforementioned dataset.

S4. Manuscript is well-written and easy to follow.

Weaknesses

W1. Title is more general than the work itself: the work is specific to political stance detection; other issues with title (see C1).

W2. Validation based on how closely they approximate the final vote of the Chilean Plebiscite has no statistical validity as they do not account for how representative the sample is (see C2).

W3. The cross-target comparison is based on Macro-F1, despite the different class proportions (see C3).

W4. A strong baseline of comparison was dismissed without clear evidence that it does not scale (see C4).

W5. Comparison between methods lacks confidence intervals (consider bootstrapping the test set).

W6. Lacks comparison with LLMs, e.g., LLaMa, GPT3.5, GPT4.0 (it would be good to include a comparison at least in the zero shot setting, but it is not essential for acceptance in my opinion).

Comments

C1. Current title is "Evaluating the Effect of Heterogeneous User Interactions in Out-of-Sample Stance Classification".

C1a. The use of "Stance Classification" is too general, and it may give the impression that the model can be applied to any stance detection task. Examples where it doesn't apply: detect if the author agrees or not with a previous comment; query if the author agrees with a given statement "X".

--> Recommendation: use "Political Stance Detection".

C1b. Unclear what is meant by "effect of heterogeneous user interactions". The effect being evaluated is the **impact** on Political Stance Detection Performance. What is being evaluated is the **impact** of **accounting for**. It is hard to understand the meaning of 'heterogeneous' directly from the title.

C2. Given that there is no evidence that the user sample is representative of Chile's population, the fact that it is closer to the final vote cannot be used to claim that the proposed model is good. The correct way of doing this is by performing demographic inference and reweighing samples accordingly (see ref below). Authors could attempt to do this based on the author names and description with the due ethical considerations.

Zijian Wang, Scott Hale, David Ifeoluwa Adelani, Przemyslaw Grabowicz, Timo Hartman, Fabian Flöck, and David Jurgens. 2019. Demographic Inference and Representative Population Estimates from Multilingual Social Media Data. In The World Wide Web Conference (WWW '19). Association for Computing Machinery, New York, NY, USA, 2056–2067. https://doi.org/10.1145/3308558.3313684

Recommendation: either follow the work above to apply the correct weighting OR downplay the fact that the proposed model is close to the final vote and add this disclaimer.
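As an illustration of the reweighting recommended in C2, the following is a minimal sketch of post-stratification, assuming a demographic stratum has already been inferred for each user. The stratum names, population shares, and predicted labels are hypothetical and are not taken from the manuscript or from Wang et al.

```python
from collections import defaultdict


def poststratified_share(users, population_share):
    """Estimate the share of the positive stance after reweighting strata.

    users: list of (stratum, predicted_label) pairs, with label in {0, 1}.
    population_share: dict mapping stratum -> share of the target population.
    """
    counts = defaultdict(int)
    positives = defaultdict(int)
    for stratum, label in users:
        counts[stratum] += 1
        positives[stratum] += label

    # Only strata observed in the sample can be estimated, so the
    # population shares are renormalised over the observed strata.
    covered = sum(population_share[s] for s in counts)
    estimate = 0.0
    for stratum, n in counts.items():
        stratum_rate = positives[stratum] / n
        estimate += (population_share[stratum] / covered) * stratum_rate
    return estimate


# Hypothetical sample in which younger users are over-represented (8 of 10).
sample = [("18-29", 1)] * 6 + [("18-29", 0)] * 2 + [("30+", 1)] + [("30+", 0)]
census = {"18-29": 0.25, "30+": 0.75}  # assumed population shares

raw = sum(label for _, label in sample) / len(sample)
weighted = poststratified_share(sample, census)
print(f"raw share: {raw:.4f}, post-stratified share: {weighted:.4f}")
```

In this toy sample the unweighted share overstates support because the over-represented stratum leans more strongly toward the positive stance; the reweighted estimate corrects for that imbalance only to the extent that the inferred strata and the assumed population shares are accurate.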

C3. F1 scores are a function of the decision threshold. The optimal decision threshold varies based on the predicted scores distribution, hence it changes across datasets.

Recommendation: Authors should opt for metrics that are robust to the differences between domains, such as AUC or area under the precision-recall curve (AUPR).
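The point raised in C3 can be illustrated with a small sketch. It assumes two hypothetical score distributions that rank the same users identically but sit on different scales, as might happen when a model is moved to a new country. The data and function names are illustrative only.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 of the positive class when predicting 1 for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(scores, labels):
    """ROC AUC as the probability that a random positive outranks a
    random negative, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 1, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]  # hypothetical in-domain scores
scores_b = [s * 0.5 for s in scores_a]     # same ranking, compressed scale

f1_a = f1_at_threshold(scores_a, labels, 0.5)
f1_b = f1_at_threshold(scores_b, labels, 0.5)
auc_a = roc_auc(scores_a, labels)
auc_b = roc_auc(scores_b, labels)
print(f"F1 at 0.5: A={f1_a:.3f}, B={f1_b:.3f}")
print(f"ROC AUC:   A={auc_a:.3f}, B={auc_b:.3f}")
```

With a fixed threshold of 0.5, F1 collapses on the rescaled scores even though the ranking of users is unchanged, whereas the rank-based AUC is identical for both sets of scores.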

C4. The manuscript states in l. 94 that:

"Even though there exists work that has explored user-level stance detection on Twitter data [15], the proposed approach requires pairwise similarity calculations for all users of interest, which limits the scalability of the approach"

Looking at Darwish et al. 2020, it seems that their algorithm depends on UMAP for dimensionality reduction and DBSCAN for clustering. Both algorithms have hyperparameters that allow them to scale by considering a smaller number of neighbors.

It is not clear that the algorithm cannot run on the present data. In fact, Darwish et al. have applied it to datasets of 1.8M, 2.4M and two datasets of 2.6M tweets, which are in the same scale as the datasets in the manuscript.

C5. Abstract refers to "users' social networks" which is a bit ambiguous. To distinguish from the platforms, authors should consider using the term 'egonets'.

C6. To deal with class imbalance, the authors use resampling, which tends to leave out some examples from the majority class. Have you considered using class weights in the loss function?
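The alternative suggested in C6 can be sketched as follows, assuming inverse-frequency class weights and a cross-entropy loss normalised by the total weight. The labels and predicted probabilities are hypothetical, and this is not the loss used in the manuscript.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class).

    probs[i] maps each class to its predicted probability for example i.
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm


# Hypothetical imbalanced data: 8 "pro" users and 2 "against" users.
labels = ["pro"] * 8 + ["against"] * 2
# A degenerate model that always favours the majority class.
probs = [{"pro": 0.9, "against": 0.1}] * len(labels)

weights = inverse_frequency_weights(labels)
uniform = {c: 1.0 for c in weights}

unweighted_loss = weighted_cross_entropy(probs, labels, uniform)
weighted_loss = weighted_cross_entropy(probs, labels, weights)
print(weights)
print(f"unweighted: {unweighted_loss:.4f}, weighted: {weighted_loss:.4f}")
```

Under these weights both classes contribute equally to the loss, so a model that ignores the minority class is penalised more heavily, and no majority-class examples need to be discarded.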

Minor comments

l. 474 Undefined reference to Table 2.

Reviewer #2: In this paper, the authors propose a framework for stance detection based on three encoders (i.e., components) for content, user, and social network in the context of political protests, specifically the 2019 South America Protests.

Strength: The authors deal with a relevant and state-of-the-art research issue: evaluating layered stance detection approaches that go beyond text content. In this case, the text content (tweets), the user text pattern, and the user social network ties are explored. Additionally, the authors evaluate the impact of location and time on their framework performance by exploring data extracted from different countries.

Weakness: Some issues regarding the impact of each model component and the methodology to train and evaluate them must be clarified. First, the social network component and its emerging communities may alone answer the problem of user stance in favor or against protests. Second, the labeling process for training the model is explained too superficially for the manuscript to be self-contained.

In the following, I discuss the paper per section, including more details about the two issues reported above.

1. Introduction

Regarding the first contribution, I would recommend that the authors also mention the impact of each compartment of the proposed architecture on the stance-detection problem. As reported in the following comments, it is important to justify the construction of an architecture based on three compartments (tweet, user, and network layers), given its high computational cost.

2. Related Work

I recommend explaining specific technical terms such as “hand-labeled” and “hand-crafted” to make the manuscript clear for readers from broad research areas.

3. Data Description

Authors must provide further information about the adopted methodology to construct a ground truth dataset of labeled comments and evaluate the proposed framework. First of all, the weak-labeling step of the methodology must be explained in order to produce a self-contained manuscript. Questions that are raised from the current data description are:

- How is the “endorsement of hand-labeled political figures” leveraged?

- Which hashtag campaigns have well-defined stances toward each government? I recommend providing a table with prominent examples.

- Which polarized communities are observed according to user partitions from the constructed labels? It appears a network of users supports the weak-labeling methodology, but no information is available on how this network and connections between users are constructed.

The observation of polarized communities raises the issue of whether the users’ network and its emerging communities alone answer the problem of user stance in favor or against protests. What is the relevance (i.e., the impact) of the users’ text content for detecting their stances?

This issue should be addressed in the analyses, and the impact of each component on the stance problem in the context should be clarified.

Why is the weak labeling not used for the Chilean Referendum dataset?

4. Materials and Methods

Some notes and recommendations for better comprehension of Fig1:

- Include input arrows for each layer to indicate where to start reading each layer, mainly the first and second ones, which have more details.

- Should the output dimension of the first layer (tweet encoder) be equal to the input dimension of the second layer (user encoder)? Please check it and explain it in the text.

- The Lx label's current position in the figure is difficult to understand. It must be checked.

- Labels in the third layer (network) must be explained.

- The quality of Figure 1, shown on page 26 of the reviewer file, is very bad. It is important to fix it.

Section 4.1: The dimensions of the user tweets input and the tweet embeddings output are not explained in the text. Please explain it.

5. Results

Regarding the results shown in Table 3, are the performances of the proposed User-Level Transformer and TSPA approach significantly different? A confidence interval for the performance metrics could be calculated to analyze whether they are significantly different. Approaches such as Normal Approximation Interval or Bootstrapping the Test Sets are currently good practices.
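The test-set bootstrap mentioned above can be sketched as follows. The example assumes two models evaluated on the same test items and uses accuracy rather than macro-F1 for brevity. The predictions are synthetic, and the function name and parameters are illustrative.

```python
import random


def bootstrap_diff_ci(preds_a, preds_b, gold, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B) on a shared
    test set, resampling test items with replacement."""
    rng = random.Random(seed)
    n = len(gold)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(preds_a[i] == gold[i] for i in idx) / n
        acc_b = sum(preds_b[i] == gold[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic test set of 200 users with binary stance labels.
gold = [1, 0] * 100
# Model A is wrong on items 0-19 (90% accuracy).
preds_a = [1 - g if i < 20 else g for i, g in enumerate(gold)]
# Model B is wrong on items 10-69 (70% accuracy).
preds_b = [1 - g if 10 <= i < 70 else g for i, g in enumerate(gold)]

lower, upper = bootstrap_diff_ci(preds_a, preds_b, gold)
print(f"95% CI for accuracy difference: [{lower:.3f}, {upper:.3f}]")
```

If the resulting interval excludes zero, the difference between the two models is unlikely to be an artifact of the particular test sample. Resampling the same indices for both models keeps the comparison paired.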

The network-level appears to improve the results significantly. Again, communities that emerge from the network, which were used to label users, are probably the most relevant information for user stance in favor or against protests. Thus, the result in Table 3 is the expected one (i.e., no novelties). Authors should discuss, based on their data, the relevance of exploring text content once you have the users-network, which is probably the most relevant information for stance in the analyzed context.

The results shown in Section 5.2 are very interesting. Specifically, the case of Bolivia shows that models trained with data from a given location cannot always be applied to other locations, e.g., countries, even though they share similar cultures and languages. On the other hand, the issue discussed above becomes more evident as cross-country results show that the users' network is the information that allows models to reach the best results (e.g., accuracy and F1 greater than 90% in most evaluated cases). Again, it is important to discuss the relevance of exploring text content once you have the users' network, considering the computing cost to deal with each one.

Please correct the broken reference to Table 5 on page 13/18, line 474.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Fabricio Murai

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2025 Jun 26;20(6):e0324697. doi: 10.1371/journal.pone.0324697.r002

Author response to Decision Letter 1


25 Nov 2024

Dear Reviewers and Editor,

We thank you for your thorough and helpful comments. We have attempted to address each of them; the comments have greatly improved the quality of the paper. Below we outline, point by point, how we addressed each comment:

EDITOR COMMENTS

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming.

We corrected the title so that it complies with the style requirements.

2. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match.

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

We ensured that the funding information in the Financial Disclosure section is correct.

3. We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement.

We removed all funding information from the 'Acknowledgments' section and updated it accordingly.

4. When completing the data availability statement of the submission form, you indicated that you will make your data available on acceptance. We strongly recommend all authors decide on a data sharing plan before acceptance, as the process can be lengthy and hold up publication timelines.

We comply with the suggestion and make all data publicly available before acceptance. The data and code can be accessed through the following URLs:

https://doi.org/10.5281/zenodo.14207926

https://huggingface.co/Ramavill/twBETO_v0

https://github.com/rvillaco/Protest_Stance_Detection

5. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

We removed all supporting information files from the submission as they are now freely accessible through the URLs mentioned above.

REVIEWER 1 RESPONSES

C1. Current title is "Evaluating the Effect of Heterogeneous User Interactions in Out-of-Sample Stance Classification".

C1a. The use of "Stance Classification" is too general, and it may give the impression that the model can be applied to any stance detection task. Examples where it doesn't apply: detect if the author agrees or not with a previous comment; query if the author agrees with a given statement "X" --> Recommendation: use "Political Stance Detection".

C1b. Unclear what is meant by "effect of heterogeneous user interactions". The effect being evaluated is the **impact** on Political Stance Detection Performance. What is being evaluated is the **impact** of **accounting for**. It is hard to understand the meaning of 'heterogeneous' directly from the title.

We agree that the title was overly general and appreciate the comments. We updated the title to “Social context in political stance detection: Impact and extrapolation”.

C2. Given that there is no evidence that the user sample is representative of Chile's population, the fact that it is closer to the final vote cannot be used to claim that the proposed model is good. The correct way of doing this is by performing demographic inference and reweighing samples accordingly (see ref below). Authors could attempt to do this based on the author names and description with the due ethical considerations. Zijian Wang, Scott Hale, David Ifeoluwa Adelani, Przemyslaw Grabowicz, Timo Hartman, Fabian Flöck, and David Jurgens. 2019. Demographic Inference and Representative Population Estimates from Multilingual Social Media Data. In The World Wide Web Conference (WWW '19). Association for Computing Machinery, New York, NY, USA, 2056–2067. https://doi.org/10.1145/3308558.3313684 Recommendation: either follow the work above to apply the correct weighting OR downplay the fact that the proposal model is close to the final vote and add this disclaimer.

We agree with the observation and attempted to reweight the observations according to estimated demographic information. However, given the size of the referendum collection, the profile pictures were not saved when the collection took place. We attempted to download the pictures based on the saved JSONs but a large percentage of the user profile links were no longer available. Moreover, given the number of users (over 100k) and the new API rate limits imposed after the change from Twitter to X, it was not viable to attempt to download the profile pictures of the accounts that are still active (it is likely a considerable number of accounts no longer exist).

We tried to estimate the demographic information based only on text features (replacing the pictures with random noise, as provided in the paper’s repository), but the performance of the prediction dropped considerably below the ablation results provided in the paper (gender macro F1 dropped by 20%). We measured this on a small sample of 50 accounts of politicians, reporters, and institutions that we constructed to test the method. The authors note that “We recommend using image data whenever possible to get the most accurate predictions”, but we hypothesize that the larger drop is probably also caused by the difference between European and Latin American Spanish. As discussed in our paper, we also observed a drop in performance when comparing TWilBert (a model trained in European Spanish) with TwBETO (the model released in this paper).

Given these caveats, we decided to downplay the fact that the heterogeneous network model is closer to the final vote and added the disclaimer as recommended. This was done by removing the mention of this validation from the introduction, discussing this limitation in the “Robustness over time” subsection, and adding the disclaimer and reference to the reweighting approach in the limitation section.

C3. F1 scores are a function of the decision threshold. The optimal decision threshold varies based on the predicted scores distribution, hence it changes across datasets.

Recommendation: Authors should opt for metrics that are robust to the differences between domains, such as AUC or area under the precision-recall curve (AUPR).

Agreed. We added the ROC AUC metric to all tables and analyzed it accordingly.

C4. The manuscript states in l. 94 that:

"Even though there exists work that has explored user-level stance detection on Twitter data [15], the proposed approach requires pairwise similarity calculations for all users of interest, which limits the scalability of the approach" Looking at Darwish et al. 2020, it seems that their algorithm depends on UMAP for dimensionality reduction and DBSCAN for clustering. Both algorithms have hyperparameters that allow them to scale by considering a smaller number of neighbors.

It is not clear that the algorithm cannot run on the present data. In fact, Darwish et al. have applied it to datasets of 1.8M, 2.4M and two datasets of 2.6M tweets, which are in the same scale as the datasets in the manuscript.

We have added additional clarification on this point. While Darwish et al. have datasets with millions of tweets, they only ran their approach on a maximum of 5,000 users (Table 2 in their paper). As the authors outline in the Finding Stance Clusters section, they calculate three pairwise cosine similarity matrices for every user in their selected subset and run UMAP on those matrices. For the users in Chile alone, generating a single cosine similarity matrix for all of our users using their approach was projected to take over 100 days. We added a sentence about the cosine similarity bottleneck in the Related Work section.
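To illustrate why the pairwise step dominates, the following minimal sketch counts the user pairs and the size of one dense similarity matrix. The function names are hypothetical, the 100,000-user figure is a round number chosen for illustration rather than the size of the Chilean user set, and 4-byte (float32) storage is an assumption.

```python
def pair_count(n_users: int) -> int:
    """Number of distinct user pairs, n * (n - 1) / 2."""
    return n_users * (n_users - 1) // 2


def dense_matrix_bytes(n_users: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for one dense n x n similarity matrix (float32 assumed)."""
    return n_users * n_users * bytes_per_value


# 5,000 is the maximum reported for Darwish et al.; 100,000 is hypothetical.
for n in (5_000, 100_000):
    gigabytes = dense_matrix_bytes(n) / 1e9
    print(f"{n:>7} users: {pair_count(n):>13,} pairs, {gigabytes:.1f} GB per matrix")
```

Because the pair count grows quadratically, a 20-fold increase in users implies roughly a 400-fold increase in similarity computations, and the approach described above requires three such matrices.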

C5. Abstract refers to "users' social networks" which is a bit ambiguous. To distinguish from the platforms, authors should consider using the term 'egonets'.

Agreed, we updated the abstract accordingly and other references found in the main text that might lead to a similar confusion.

C6. To deal with class imbalance, authors use resampling, which tends to leave out some examples from the majority class. Have you considered using class weights in the loss function?

We did try weighting observations by the inverse of their class frequency and also other static resampling approaches, but found that dynamic resampling improved training stability as mentioned in the paper. We hypothesize that this is in part due to the dynamic sampling of n_max tweets for each user at the start of a batch, which implies that even repeated users have a good chance of including different tweets when added to the batch. At least 83% of the users in each training set had 15 or more tweets in the dataset. We added this discussion as it was not mentioned before.
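The two alternatives discussed here can be contrasted in a minimal sketch. This is not the authors' implementation: the function names, the data layout (`users` mapping a user id to a label and a list of tweets), and the equal per-class batch split are all assumptions made for illustration.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Static alternative: weight each class by the inverse of its frequency."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}


def sample_balanced_batch(users, batch_size, n_max, rng):
    """Dynamic alternative: draw a class-balanced batch of users and
    re-sample up to n_max tweets per user on every call.

    `users` maps user id -> (label, list of tweets); hypothetical layout.
    """
    by_class = {}
    for uid, (label, _) in users.items():
        by_class.setdefault(label, []).append(uid)
    per_class = batch_size // len(by_class)
    batch = []
    for label, uids in by_class.items():
        # Sampling with replacement lets minority-class users repeat.
        for uid in rng.choices(uids, k=per_class):
            tweets = users[uid][1]
            k = min(n_max, len(tweets))
            # A repeated user usually contributes a different tweet subset.
            batch.append((uid, label, rng.sample(tweets, k)))
    return batch
```

Under these assumptions, a repeated minority-class user is not an exact duplicate within the batch, since the tweet subset is redrawn each time the user is selected.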

Minor comments

l. 474 Undefined reference to Table 2.

Corrected broken reference.

REVIEWER 2 RESPONSES

Weakness: Some issues regarding the impact of each model component and the methodology to train and evaluate them must be clarified. First, the social network component and its emerging communities may alone answer the problem of user stance in favor or against protests. Second, the labeling process for training the model is superficially explained to make a self-contained manuscript.

On the first point, we agree that the network alone carries significant signal, and have added additional language highlighting that the TSPA baseline serves as a network baseline without any account features. TSPA’s performance is surprisingly strong even without user text attributes, although the User Transformer outperforms it in some countries (when considering the ROC AUC performance). We added language further emphasizing that point in the Methods and Results sections. We further expand this point in the responses below.

On the second point, we added further description of the labeling methodology. See responses to the data description comments for further details.

1 Introduction: Regarding the first contribution, I would recommend that the authors also mention the impact of each compartment of the proposed architecture on the stance-detection problem. As reported in the following comments, it is important to justify the construction of an architecture based on three compartments (tweet, user, and network layers), given its high computational cost.

We added more details justifying this architecture choice in the introduction, as it allows us to evaluate the effect of context on the in-sample performance and generalization (this is highlighted across the first 3 contributions). Other details on how this effect is manifested are addressed in the points discussed below and in the original results section.

2. Related Work: I recommend explaining specific technical terms such as “hand-labeled” and “hand-crafted” to make the manuscript clear for readers from broad research areas.

Added further description to clarify the aforementioned terms.

3. Data Description

Authors must provide further information about the adopted methodology to construct a ground truth dataset of labeled comments and evaluate the proposed framework. First of all, the weak-labeling step of the methodology must be explained in order to produce a self-contained manuscript. Questions that are raised from the current data description are:

- How is the “endorsement of hand-labeled political figures” leveraged?

The methodology relies on the hypothesis that users are more likely to tweet (or retweet) stance-tags (hashtag campaigns with well-defined stances towards each government and that occur at the end of a tweet) or political figures that are aligned with their stances during these events. For this reason, stance labels are assigned to a user if the percentage of tweets with a consistent stance-tag or retweets from political figures with a consistent stance is above a given threshold (the authors use 90% as a threshold based on a labeled validation set). A final stance label is assigned to a user if the stance obtained by both signals (usage of hashtags and retweet of political figures) is consistent. This discussion was added to the paper.
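The labeling rule described above can be sketched as follows. This is a minimal illustration rather than the original implementation: the function names are hypothetical, the share is assumed to be computed among a user's stance-bearing items only, and treating a share exactly equal to the threshold as passing is also an assumption.

```python
from collections import Counter


def signal_stance(stances, threshold=0.9):
    """Return the dominant stance of one signal if its share reaches the
    threshold, otherwise None. `stances` holds one stance per stance-bearing
    item (a tweet with a stance-tag, or a retweet of a labeled figure)."""
    if not stances:
        return None
    stance, count = Counter(stances).most_common(1)[0]
    return stance if count / len(stances) >= threshold else None


def weak_label(hashtag_stances, retweet_stances, threshold=0.9):
    """Assign a final label only when both signals agree on a stance."""
    from_hashtags = signal_stance(hashtag_stances, threshold)
    from_retweets = signal_stance(retweet_stances, threshold)
    if from_hashtags is not None and from_hashtags == from_retweets:
        return from_hashtags
    return None  # user stays unlabeled


# Toy inputs, not drawn from the dataset.
print(weak_label(["pro"] * 19 + ["anti"], ["pro"] * 10))
print(weak_label(["pro"] * 10, ["anti"] * 10))
```

Users for whom either signal is missing, inconsistent, or below the threshold remain unlabeled, which trades coverage for label precision.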

- Which hashtag campaigns have well-defined stances toward each government? I recommend providing a table with prominent examples.

Added a table with prominent examples.

- Which polarized communities are observed according to user partitions from the constructed labels? It appears a network of users supports the weak-labeling methodology, but no information is available on how this network and connections between users are constructed.

The authors validate the quality of the weakly-annotated labels based on a hand-labeled sample of users and by showing that the constructed labels partition the users in communities that are polarized in their language and news-sharing behavior in a way consistent with the ideological underpinnings of each protest. These communities are not explicitly defined, but are shown to use semantically consistent text and expressions by leveraging an established machine translation algorithm. They show that terms related to left-leaning ideologies in one community tend to be discussed in similar contexts as right-leaning terms (e.g., Socialism mistranslates to Fascism); terms related to law and order in one group are discussed in a similar context as the other discusses oppression, or that opposition leaders are discussed in similar contexts as government representatives. This discussion was added with examples of how this polarization is manifested in the data description section.

The observation of polarized communities raises the issue of whether the users’ network and its emerging communities alone answer the problem of user stance in favor or against protests. What is the relevance (i.e., the impact) of the users’ text content for detecting their stances?

This issue should be addressed in the analyses, and the impact of each component on the stance problem in the context should be clarified.

Added a brief discussion of this objective in the data section and explore it in more detail in the results section as described in the responses below.

Why is the weak labeling not used for the Chilean Referendum dataset?

As discussed in the “The 2020 Chilean Plebiscite” subsection, we applied the same weak-labeling methodology, based on two stance signals, to construct the referendum dataset. This section also provides further details on the weak-labeling methodology used for the construction of the protest dataset, addressing in part the previous comment and making the paper more self-contained (given that we follow the same algorithm).

However, as mentioned, the difference lies in that instead of using hand-labeled Political Figures, as was done to construct the weak-labels for the South American protests, we opted to identify the hashtags used in the user description that were explicitly rejecting or approving the referendum in first person. This simplified the labeling procedure given that we labeled other stance tags for the referendum. We found that the usage of stance tags in the descriptions as a replacement signal also resulted in meaningful polarized stance partitions based on the same language metrics used to validate the protest weak-labels (we did not explore polarization in news-sharing behavior during the referendum as it escaped the scope of this work). This discussion was added to the paper to clarify this topic.

4. Materials and Methods

Some notes and recommendations for better comprehension of Fig1:

- Include input arrows for each layer to indicate where to start reading each layer, mainly the first and second ones, which have more details.

Added a white arrow to denote the first input to each component. Other than this, the figure already included arrows that showed the flow through the architecture. However, if the white arrow for the last component adds too much clutt

Attachment

Submitted filename: Rebuttal_Letter.pdf

pone.0324697.s002.pdf (217.9KB, pdf)

Decision Letter 1

Carlos Henrique Gomes Ferreira

11 Mar 2025

PONE-D-24-21648R1

Social context in political stance detection: Impact and extrapolation

PLOS ONE

Dear Dr. Villa-Cox,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Apr 25 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Carlos Henrique Gomes Ferreira, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I thank the authors for the detailed responses to my comments and suggestions. I have carefully read the answers to both reviews and changes made to the manuscript.

In particular, I appreciate their efforts in attempting to run demographic inference on their large dataset of tweet samples to determine whether the sample could be considered representative of Chile's population. I acknowledge that the changes to X/Twitter's API have hindered research efforts based on the platform's data and agree with the authors' decision of downplaying the proximity of the model's output to the final vote results as a form of validation. In addition, I appreciate the authors' efforts of making the data and code available for reproducibility purposes.

(New comment) One new issue came up after the inclusion of ROC-AUC results -- the statement "Protests in Bolivia had opposing ideological motivations in comparison with the motivations observed in the other countries," coupled with the shockingly low AUC results (specifically, all of them much lower than 0.5 ['random']) in Bolivia suggests a simple fix: to swap the positive and negative labels. High AUC values would actually lead to an opposing finding: that the results can generalize well across countries provided that there is a suitable mapping for the labels.

Recommendation: either the authors swap the labels or explain why this is not the right choice in their context.

Reviewer #2: I appreciate the authors' effort in preparing the manuscript after the first round of peer review. I believe the authors improved the quality of their manuscript compared to the previous version.

Below are some recommendations to be considered in the final version of the paper if it is accepted for publication in the journal.

Regarding confidence intervals in tables, it is not easy to see the best-performing model. Thus, the best-performing model should be bold for easier reading.

Show clearly the measure used to define the best-performing model. Is the measure F1 or ROC AUC?

Besides, I recommend clearly showing how the confidence interval was calculated. If it was based on mean values, why not show the mean plus-minus its error for the used confidence interval instead of percentiles? It is important to clarify these points in some paragraphs of the results section.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Fabricio Murai

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]


PLoS One. 2025 Jun 26;20(6):e0324697. doi: 10.1371/journal.pone.0324697.r004

Author response to Decision Letter 2


16 Mar 2025

Dear Reviewers and Editor,

We appreciate the consideration given to our response and thank you for the further feedback. In what follows we address the minor revisions posited in this new round of reviews.

Journal Requirements

1. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

We reviewed our reference list and did not find any retracted references. However, while doing so, we found several references that listed the arXiv preprint when a peer-reviewed version was available. We updated these references and, as requested, list the changes made (the checkmark signals the updated version included in the new submission).

Updated References:

o GlobeNewswire. $6.78 Billion Public Opinion and Election Polling Global Market to 2030 - Identify Growth Segments for Investment; 2021. Available from: https://www.proquest.com/wire-feeds/6-78-billion-public-opinion-election-polling/docview/2555898927/se-2?accountid=9902.

+ GlobeNewswire. $6.78 Billion Public Opinion and Election Polling Global Market to 2030 - Identify Growth Segments for Investment; 2021. Available from: https://www.globenewswire.com/news-release/2021/07/29/2271041/28124/en/6-78-Billion-Public-Opinion-and-Election-Polling-Global-Market-to-2030-Idhtml.

o Kochkina E, Liakata M, Zubiaga A. All-in-one: Multi-task learning for rumour verification. arXiv preprint arXiv:180603713. 2018;.

+ Kochkina, E., Liakata, M., & Zubiaga, A. (2018, August). All-in-one: Multi-task Learning for Rumour Verification. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 3402-3413).

o Xu C, Paris C, Nepal S, Sparks R. Cross-target stance classification with self-attention networks. arXiv preprint arXiv:180506593. 2018;.

+ Xu, C., Paris, C., Nepal, S., & Sparks, R. (2018, July). Cross-Target Stance Classification with Self-Attention Networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume : Short Papers) (pp. 778-783).

o Mohtarami M, Glass J, Nakov P. Contrastive language adaptation for cross-lingual stance detection. arXiv preprint arXiv:191002076. 2019;

+ Mohtarami, M., Glass, J., & Nakov, P. (2019, November). Contrastive Language Adaptation for Cross-Lingual Stance Detection. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 4442-4452).

o Khiabani PJ, Zubiaga A. Few-shot Learning for Cross-Target Stance Detection by Aggregating Multimodal Embeddings. arXiv preprint arXiv:230104535. 2023;.

+ Khiabani, P. J., & Zubiaga, A. (2023). Few-shot learning for cross-target stance detection by aggregating multimodal embeddings. IEEE Transactions on Computational Social Systems, 11(2), 2081-2090.

o Smith SL, Turban DH, Hamblin S, Hammerla NY. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:170203859. 2017;.

+ Smith, S. L., Turban, D. H., Hamblin, S., & Hammerla, N. Y. (2017, February). Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In International Conference on Learning Representations.

o Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:191003771. 2019;

+ Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., ... & Rush, A. M. (2020, October). Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations (pp. 38-45).

o Bird S, Klein E, Loper E. NLTK book; 2009.

+ Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc.

REVIEWER 1 RESPONSE

1. One new issue came up after the inclusion of ROC-AUC results -- the statement "Protests in Bolivia had opposing ideological motivations in comparison with the motivations observed in the other countries," coupled with the shockingly low AUC results (specifically, all of them much lower than 0.5 ['random']) in Bolivia suggests a simple fix: to swap the positive and negative labels. High AUC values would actually lead to an opposing finding: that the results can generalize well across countries provided that there is a suitable mapping for the labels.

Recommendation: either the authors swap the labels or explain why this is not the right choice in their context

We agree with the observation that we could leverage the low AUC results for the Bolivian case, together with our domain knowledge of the different ideological motivations behind the Bolivian protests, to maintain the high generalization capacity of the heterogeneous model (as mentioned in the paper, we take this approach when applying the Chilean classifier to the referendum data). However, we chose not to do so for the cross-country robustness analysis, as the performance of the different classifiers (trained in other countries) when applied to the Bolivian case allows us to elaborate on the following interesting discussion (taken from lines 520 to 525):

“As shown, the Bolivian case serves as an adversarial setting for classifiers trained in other countries, which suggests that, when applied to this country, the semantic features leveraged by the classifiers are operating on an ideological dimension. This is consistent with results presented in villa2022linguistic, where the authors found that language polarization remained along ideological lines when comparing protests of any of the three countries with Bolivia. This was not the case when comparing protests of the other three countries.”

And also to highlight (taken from lines 547 to 552):

“ … the importance of domain knowledge when applying these models to different domains. If we leverage the knowledge of the opposed motivations for the Bolivian protests by inverting the heterogeneous model's predictions, its performance would be in line with what was observed for the other countries. We take the approach of inverting the predictions when we report the results of the 2020 Chilean referendum.”

We believe that this discussion sufficiently addresses the correct observation made in the review, and we did not make additional modifications. However, if further clarification is needed, or if this discussion should appear earlier in the section, we would be happy to revise it.
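The relationship behind the reviewer's suggestion, that inverting the predictions (or swapping the labels) turns an AUC of a into 1 - a, can be checked with a minimal sketch. The `roc_auc` helper is a plain pairwise implementation written for illustration, and the labels and scores are toy values, not results from the paper.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (pairwise Mann-Whitney form)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data in which the classifier ranks most negatives above the positives.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.2, 0.1, 0.6, 0.9, 0.7, 0.4]

auc = roc_auc(labels, scores)
auc_inverted_scores = roc_auc(labels, [1 - s for s in scores])
auc_swapped_labels = roc_auc([1 - y for y in labels], scores)

print(auc, auc_inverted_scores, auc_swapped_labels)
```

Both transformations reverse every positive-negative comparison, so an AUC well below 0.5 indicates a ranking that is informative but oriented against the label mapping, rather than one that is uninformative.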

REVIEWER 2 RESPONSES

1. Regarding confidence intervals in tables, it is not easy to see the best-performing model. Thus, the best-performing model should be bold for easier reading

Show clearly the measure used to define the best-performing model. Is the measure F1 or ROC AUC?

As suggested, the name of the best-performing classifier is now bold if it obtained the best performance (or was statistically tied) in both F1 and ROC AUC. We do not favor one metric over the other, as both provide important information on model performance, and we discuss the relevant differences in the results section. We updated the relevant part of each table description, which now reads:

“Model performance metrics correspond to the median of 1000 bootstrapped samples of the test set. For each metric, the classifiers whose performance is not statistically different, at a 95% confidence level, from the best-performing model are highlighted in bold. We also highlight in bold the name of the best-performing model if it obtains the best results (or is statistically tied) in all presented metrics. The corresponding confidence bands, computed from the bootstrapped empirical distribution of each metric, are included in the appendix.”

2. Besides, I recommend clearly showing how the confidence interval was calculated. If it was based on mean values, why not show the mean plus-minus its error for the used confidence interval instead of percentiles? It is important to clarify these points in some paragraphs of the results section.

We presented the median results, obtained from the bootstrapped empirical distribution, for each metric and its corresponding confidence band (either 95% for the main results or 99% for the robustness analysis). We chose not to present mean values plus or minus their error, as that would be an approximation that was unnecessary given that we had access to the empirical distribution. We also had to report the bands rather than a plus-minus notation because the empirical distributions were not symmetrical (they showed slight skewness). As suggested, we added the following discussion to better clarify these points in the results section:

“We report the median result of the bootstrapped empirical distribution for each metric and compute its corresponding confidence band to assess whether the differences are statistically significant (these are presented in the appendix).”
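The procedure described above can be sketched as a percentile bootstrap over the test set. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `macro_f1` and `bootstrap_band`, the toy labels and predictions, and the simple index-based percentile rule are all hypothetical choices.

```python
import random
import statistics


def macro_f1(labels, preds):
    """Macro-averaged F1 for binary labels (classes 0 and 1)."""
    per_class = []
    for cls in (0, 1):
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


def bootstrap_band(labels, preds, metric, n_boot=1000, level=0.95, seed=0):
    """Return (median, lower, upper) of a metric over bootstrapped test sets."""
    rng = random.Random(seed)
    n = len(labels)
    values = []
    for _ in range(n_boot):
        # Resample test-set indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(metric([labels[i] for i in idx], [preds[i] for i in idx]))
    values.sort()
    # Percentile band read directly from the sorted empirical distribution
    # (assumption: a simple truncated-index rule, no interpolation).
    alpha = (1.0 - level) / 2.0
    lower = values[int(alpha * (n_boot - 1))]
    upper = values[int((1.0 - alpha) * (n_boot - 1))]
    return statistics.median(values), lower, upper


# Hypothetical toy test set: balanced labels, one in five predictions wrong.
labels = [i % 2 for i in range(200)]
preds = [y if i % 5 else 1 - y for i, y in enumerate(labels)]

median, lower, upper = bootstrap_band(labels, preds, macro_f1)
print(f"Macro F1 median: {median:.3f}, 95% band: [{lower:.3f}, {upper:.3f}]")
```

Because the band is taken from percentiles of the empirical distribution, it does not need to be symmetric around the median, which is why a plus-minus notation would not fit a skewed distribution.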

Attachment

Submitted filename: Second_Revision.pdf

pone.0324697.s003.pdf (217.9KB, pdf)

Decision Letter 2

Carlos Henrique Gomes Ferreira

30 Apr 2025

Social context in political stance detection: Impact and extrapolation

PONE-D-24-21648R2

Dear Dr. Villa-Cox,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Carlos Henrique Gomes Ferreira, Ph.D.

Academic Editor

PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I thank the authors for carefully addressing my previous comments. I agree with the authors that the decision not to flip the predictions for the Bolivian case makes sense from a methodological standpoint and that the reason is also presented and well justified in the current version of the manuscript.

Reviewer #2: I appreciate the authors' effort in preparing the manuscript after the second round of peer review. I believe the authors improved the quality of their manuscript compared to the previous version, and all the comments have been addressed.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Fabricio Murai

Reviewer #2: No

**********

Acceptance letter

Carlos Henrique Gomes Ferreira

PONE-D-24-21648R2

PLOS ONE

Dear Dr. Villa-Cox,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few days to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Carlos Henrique Gomes Ferreira

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1. Confidence Bands.

    Confidence bands for the Macro F-1 score and Area Under the ROC curve of the within-country and out-of-sample cross-country predictions at different context levels. The confidence bands are computed from the corresponding percentiles of the 1000 bootstrapped samples constructed for each experiment.

    (PDF)

    pone.0324697.s001.pdf (78.4KB, pdf)
    Attachment

    Submitted filename: Rebuttal_Letter.pdf

    pone.0324697.s002.pdf (217.9KB, pdf)
    Attachment

    Submitted filename: Second_Revision.pdf

    pone.0324697.s003.pdf (217.9KB, pdf)

    Data Availability Statement

    In compliance with Twitter’s (now known as X) 2023 Terms of Service, we are not able to share tweet or user ids, nor the corresponding text for each post. However, to ensure reproducibility of the results presented in the paper, we make available all predictions obtained for the different models for anonymized users and their corresponding anonymized tweets. These can be used to recreate the country-level results and the different robustness analyses. The code developed is publicly available at: • https://github.com/rvillaco/Protest_Stance_Detection • https://doi.org/10.5281/zenodo.15571840 The anonymized data, including the User Embeddings, required to reproduce the results presented in the paper are available at: • https://doi.org/10.5281/zenodo.14207926.


    Articles from PLOS One are provided here courtesy of PLOS
