Abstract
This data article describes a dataset generated from the official communications of the World Health Organization (WHO) on X (formerly Twitter). The dataset comprises 46,667 tweets published between 2008 and 2021. For each post, the dataset includes public engagement metrics: the number of likes, retweets, and replies. The value of these data lies in their extensive longitudinal coverage, which allows quantitative analysis of public discourse about, and the popular legitimacy of, a major International Organization (IO), particularly in response to key global events such as the COVID-19 pandemic. This dataset provides a unique resource for researchers in global health governance, computational social science, crisis communication, and public relations.
Keywords: World Health Organization, WHO, Social media metrics, X (Twitter), Popular legitimacy, Longitudinal data, Computational social science, COVID-19
Specifications Table
| Subject | Social Sciences |
| Specific subject area | Social media metrics, Data Science, Pre- and Post-COVID-19, WHO. |
| Type of data | Structured text data (Raw). |
| Data collection | We collected a corpus of 46,667 tweets published by the World Health Organization (WHO) on X between April 23, 2008, and November 8, 2021. The data were acquired through the X Application Programming Interface (API) using R and Python. The dataset enables a longitudinal analysis of the WHO's public communications and audience engagement. Our methodology centers on “ratiometrics,” in which we analyze the ratios between engagement metrics, primarily the retweet-to-reply ratio, to measure public support and controversy surrounding WHO narratives, especially in the context of the COVID-19 pandemic [1]. |
| Data source location | Institution: HEC Montréal; City/Town/Region: Montréal, Québec, Canada; Country: Canada; Latitude and Longitude (Data Storage and Processing Centers): 45.5051° N, 73.6188° W; Primary Data Sources: The dataset was collected from Twitter (now X) using the Twitter API through R- and Python-based tools. The resulting data were processed and stored in a secure research infrastructure maintained by HEC Montréal. |
| Data accessibility | The data are available on GitHub; Repository name: WHO_tweets [2]; Direct URL to data: https://github.com/warint/WHO_tweets; Data identification number: doi:10.5281/zenodo.17457000 |
| Related research article | This dataset is derived from an original corpus first analyzed in our forthcoming Journal of Medical Internet Research (JMIR) article, “Social Media Metrics as Proxies for Popular Legitimacy: Analyzing Pre- and Post-COVID-19 Public Engagement with the World Health Organization on X” [1]. For the purpose of public release, we have carefully anonymized and condensed the original data. This modification protects user privacy and makes the dataset more accessible and manageable for reuse by the wider research community. Full details of the original data collection can be found in the aforementioned publication. |
Source: The Authors, 2025.
1. Value of the Data
Our dataset contributes to social media analytics, with particular relevance for research on the WHO’s public communications and audience engagement. Comprising 46,667 unique posts from the official WHO X (Twitter) account over 13 years (2008–2021), it is a distinct and valuable resource for researchers across the social sciences and computational fields. Its main strength is that it provides a large-scale, empirical lens on the dynamic relationship between an International Organization (IO) and the global public. The dataset aligns with the FAIR principles, ensuring that the data are Findable, Accessible, Interoperable, and Reusable.
The dataset is publicly available on GitHub in two formats to ensure broad utility and ease of use: (a) standard CSV files for universal compatibility across analytical platforms and (b) lightweight QS files optimized for rapid processing in R. This dual-format release is supported by detailed documentation published herein, which provides clear variable definitions, content descriptions, and methodological notes. Together, the versatile formats and thorough documentation promote research transparency, facilitate replicability of our findings, and encourage broad reuse of the data.
The primary value of this data lies in the following areas:
- (a) Enables empirical assessment of IO popular legitimacy: The data provide a quantitative measure of popular legitimacy, a concept often studied through normative or elite-focused approaches. By using the extensive public engagement metrics (likes, retweets, replies) as proxies for public trust and support, researchers can empirically test theories of institutional legitimacy and public accountability, shifting the focus from official rhetoric to mass public response.
- (b) Offers longitudinal pre- and post-crisis analysis: With a temporal coverage from 2008 through 2021, the dataset clearly demarcates the period of relative stability (2008–2019) and the period of intense global crisis (the COVID-19 pandemic, 2020–2021). This temporal segmentation allows researchers to perform robust longitudinal trend evaluations to precisely measure the impact of a global health crisis on public scrutiny and organizational support.
- (c) Supports ratiometric and controversy analysis: The inclusion of both raw metrics and the derived retweet-to-reply ratio makes it possible to distinguish between public support (high retweets relative to replies) and public controversy or dissent (high replies relative to retweets). This investigation of engagement polarity enables researchers to test hypotheses about public support versus controversy and to compare these signals with textual sentiment or other metrics.
- (d) Resource for computational social science and AI development: The dataset is an extensive, cleaned collection of real-world public communication data. It can be immediately utilized for training and testing Natural Language Processing (NLP) models, time-series forecasting models, and machine learning algorithms focused on sentiment analysis, controversy detection, and public risk perception in international communication.
- (e) Facilitates comparative crisis communication studies: Researchers studying crisis management, public health communication, or strategic communication can use this dataset to benchmark the WHO’s performance against other IOs or national governments during various global events, offering insights into effective and ineffective digital communication strategies.
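The ratiometric and pre/post-crisis uses described in items (b) and (c) can be illustrated with a minimal sketch. It uses the column names from Table 1, but the sample rows are invented for illustration, the cut-off date of January 1, 2020 is an assumption based on the 2008–2019 and 2020–2021 periods named above, and the decision to skip posts with zero replies is also an assumption (the article does not state how such posts were treated).

```python
from datetime import date
from statistics import median

# Hypothetical sample rows using column names from Table 1 (values are invented).
rows = [
    {"date": date(2018, 5, 2), "retweets_count": 120, "replies_count": 6},
    {"date": date(2019, 11, 20), "retweets_count": 45, "replies_count": 3},
    {"date": date(2020, 3, 11), "retweets_count": 900, "replies_count": 300},
    {"date": date(2021, 1, 15), "retweets_count": 80, "replies_count": 0},
]

# Assumption: January 1, 2020 separates the stable period from the crisis period.
CRISIS_START = date(2020, 1, 1)


def retweet_reply_ratio(retweets, replies):
    """Return retweets / replies, or None when the ratio is undefined."""
    return retweets / replies if replies > 0 else None


def median_ratio_by_period(records):
    """Return the median retweet-to-reply ratio for the 'pre' and 'post' periods."""
    groups = {"pre": [], "post": []}
    for record in records:
        ratio = retweet_reply_ratio(record["retweets_count"], record["replies_count"])
        if ratio is None:
            continue  # assumption: posts without replies are left out
        period = "pre" if record["date"] < CRISIS_START else "post"
        groups[period].append(ratio)
    return {name: median(values) for name, values in groups.items() if values}


result = median_ratio_by_period(rows)
print(result)
```

A lower median in the later period would indicate more replies relative to retweets, which the ratiometric reading above associates with controversy rather than support.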
2. Background
The WHO plays a central role in coordinating global public health efforts. The effectiveness of the WHO, particularly in crisis scenarios such as the COVID-19 pandemic, is directly linked to its popular legitimacy, the public’s perception of its trustworthiness, and support for its mandate [3,4]. Traditional assessments of IO legitimacy have primarily relied on normative theory or analyses of elite discourse, leaving a substantial gap in the empirical understanding of mass public sentiment [5].
Social media platforms have become primary arenas for global public discourse, offering real-time, high-volume data ideal for measuring public engagement. On X, engagement metrics, specifically likes, retweets, and replies, serve as observable, quantifiable proxies for popular support and scrutiny. A crucial innovation in this context is the retweet-to-reply ratio, which can signal the underlying nature of public reaction (i.e., whether engagement is driven by consensus/support or controversy/dissent) [5,6]. To further contextualize these metrics, it is essential to account for demographic variables. Recent studies highlight that gender, in particular, drives distinct thematic priorities and engagement behaviours during global health crises like the COVID-19 pandemic [7,8].
The dataset presented here was generated to address the empirical deficit in assessing the popular legitimacy of the WHO. By compiling and structuring 13 years (2008–2021) of WHO communications and corresponding public engagement data on X, this resource enables a rigorous, longitudinal analysis of how popular legitimacy evolves in response to health crises like the pandemic.
3. Data Description
The dataset consists of 46,667 unique tweets collected from the official WHO X account. The dataset spans a period of just over 13 years, from the WHO's first post on April 23, 2008, up to November 8, 2021.
The data is organized as a tabular file, with each record (row) corresponding to an individual post and its associated metrics. The provided Tweet IDs can be hydrated via the X API. This process will restore the full set of fields, enabling researchers to replicate the original analyses from our study and perform their own research (Table 1).
Table 1.
Attributes and their description.
| Attribute | Description |
|---|---|
| id | Unique identifier for a tweet post |
| conversation_id | Unique identifier for the conversation thread to which the tweet belongs |
| replies_count | Number of replies received by the tweet |
| retweets_count | Number of times the tweet was retweeted |
| likes_count | Number of likes received by the tweet |
| date | Date of the tweet in YYYY-MM-DD format |
| retweets_reply_ratio | Retweet-to-reply ratio for WHO’s tweets: 2008–2021 |
| likes_reply_ratio | Like-to-reply ratio for WHO’s tweets: 2008–2021 |
Source: The Authors, 2025.
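As a reading aid for Table 1, the following minimal sketch parses a CSV with those eight attributes and converts each field to a suitable type. The two sample rows are invented, the function name `load_records` is hypothetical, and the handling of empty ratio cells is an assumption; the actual file name and any missing-value conventions should be checked in the repository.

```python
import csv
import io
from datetime import date

# Column names as listed in Table 1.
EXPECTED_COLUMNS = [
    "id", "conversation_id", "replies_count", "retweets_count",
    "likes_count", "date", "retweets_reply_ratio", "likes_reply_ratio",
]

# Hypothetical two-row sample; the values are illustrative only.
SAMPLE_CSV = """id,conversation_id,replies_count,retweets_count,likes_count,date,retweets_reply_ratio,likes_reply_ratio
1001,1001,4,48,52,2017-10-01,12.0,13.0
1002,1001,2,30,10,2017-10-02,15.0,
"""


def to_float_or_none(text):
    """Convert a cell to float; treat an empty cell as missing (assumption)."""
    return float(text) if text.strip() else None


def load_records(handle):
    """Read rows from an open CSV handle and check the Table 1 columns exist."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    records = []
    for row in reader:
        records.append({
            # IDs are kept as strings so large identifiers are never rounded.
            "id": row["id"],
            "conversation_id": row["conversation_id"],
            "replies_count": int(row["replies_count"]),
            "retweets_count": int(row["retweets_count"]),
            "likes_count": int(row["likes_count"]),
            "date": date.fromisoformat(row["date"]),
            "retweets_reply_ratio": to_float_or_none(row["retweets_reply_ratio"]),
            "likes_reply_ratio": to_float_or_none(row["likes_reply_ratio"]),
        })
    return records


records = load_records(io.StringIO(SAMPLE_CSV))
print(len(records), records[0]["date"])
```

To read a downloaded file instead of the in-memory sample, the same function can be called on `open(path, newline="", encoding="utf-8")`.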
3.1. Dataset attributes
The key attributes that define each record in the dataset are listed in Table 1.
4. Experimental Design, Materials and Methods
This corpus was designed to facilitate extensive, replicable, and temporal examinations of the WHO’s public narratives and perceived authority, specifically within the context of an international health emergency such as COVID-19. It encompasses 46,667 posts retrieved from the WHO’s official X account (@WHO), covering 13 years from April 2008 to November 2021. The final public version of the dataset includes seven variables, selected from an original corpus of 36 variables plus a novel retweets-to-reply ratio variable generated during our analysis. In line with X content redistribution guidelines, the final dataset was dehydrated to retain only the unique tweet IDs and non-raw content metadata. This allows future researchers to rehydrate the original full content for independent analysis [9].
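Rehydration typically means sending the stored tweet IDs back to the platform in batches. The sketch below only builds the request URLs and makes no network calls. The endpoint (`https://api.x.com/2/tweets`), the limit of 100 IDs per request, and the field names are assumptions based on the X API v2 tweet lookup and should be checked against the current developer documentation; authentication, rate limiting, and access tiers are not covered, and the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumptions: X API v2 tweet lookup endpoint and its per-request ID limit.
LOOKUP_URL = "https://api.x.com/2/tweets"
BATCH_SIZE = 100


def batched(items, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_lookup_urls(tweet_ids, fields=("created_at", "lang", "public_metrics")):
    """Return one lookup URL per batch of tweet IDs (IDs passed as strings)."""
    urls = []
    for batch in batched(list(tweet_ids)):
        query = urlencode({
            "ids": ",".join(batch),
            "tweet.fields": ",".join(fields),
        })
        urls.append(f"{LOOKUP_URL}?{query}")
    return urls


# Hypothetical IDs standing in for the `id` column of the dataset.
ids = [str(1000 + i) for i in range(250)]
urls = build_lookup_urls(ids)
print(len(urls))
```

Each URL would then be requested with a valid bearer token. Tweets that have since been deleted or protected will not be returned, so a rehydrated corpus may be smaller than the original one.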
Data processing was carried out using both R and Python scripts. The ‘qs’ package was employed for optimized data persistence and output within R, while supplementary preprocessing routines relied on conventional Python libraries. The dataset is provided as both a CSV file and a QS file, hosted on GitHub. Detailed metadata and technical specifications accompany both file formats to ensure reproducibility and cross-platform compatibility, consistent with the FAIR data principles.
4.1. Summary statistics and engagement metrics of this dataset
The median tweet date is October 1, 2017. On average, tweets include 1.49 hashtags, with a median of one, indicating a consistent use of hashtags across the dataset. Interaction distributions exhibit substantial dispersion across various metrics.
Table 2 summarizes the engagement metrics (replies, retweets, and likes), highlighting their heavy-tailed distributions. To facilitate shape comparison, we report distributional moments (Skewness, Kurtosis) alongside the Interquartile Range (IQR), upper percentiles (P90, P95, P99), Standard Deviation (SD), and extreme values (Max).
Table 2.
Distributional moments and standardized quantiles of engagement metrics.
| Metric | Mean | Median | SD | IQR | P90 | P95 | P99 | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|
| Replies | 12.40 | 4 | 90.12 | 6 | 19 | 38 | 155.00 | 15,180 | 108.23 | 17,338.08 |
| Retweets | 120.83 | 49 | 535.79 | 78 | 210 | 360 | 1,290.36 | 52,439 | 43.52 | 3,160.78 |
| Likes | 171.64 | 53 | 881.73 | 127 | 301 | 538 | 1,818.68 | 53,831 | 31.77 | 1,360.77 |
Source: The Authors, 2025.
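The kinds of statistics reported in Table 2 can be computed with the Python standard library, as in the rough sketch below. Several choices here are assumptions, since the article does not specify them: the population standard deviation, moment-based skewness and kurtosis (non-excess, so a normal distribution would score 3), and the "inclusive" percentile interpolation. Different conventions will give slightly different numbers, so this sketch is not expected to reproduce Table 2 exactly.

```python
from statistics import mean, median, pstdev, quantiles


def describe(values):
    """Summarize one engagement metric, e.g. a list of reply counts.

    Requires at least two values that are not all identical.
    """
    n = len(values)
    mu = mean(values)
    sd = pstdev(values)  # assumption: population SD
    # Third and fourth central moments.
    m3 = sum((x - mu) ** 3 for x in values) / n
    m4 = sum((x - mu) ** 4 for x in values) / n
    # 99 cut points: pct[k - 1] is the k-th percentile.
    pct = quantiles(values, n=100, method="inclusive")
    return {
        "mean": mu,
        "median": median(values),
        "sd": sd,
        "iqr": pct[74] - pct[24],
        "p90": pct[89],
        "p95": pct[94],
        "p99": pct[98],
        "max": max(values),
        "skewness": m3 / sd ** 3,
        "kurtosis": m4 / sd ** 4,  # non-excess kurtosis
    }


symmetric = describe([1, 2, 3, 4, 5])
heavy_tailed = describe([1, 2, 3, 4, 100])
print(symmetric["skewness"], heavy_tailed["skewness"])
```

A symmetric sample yields a skewness of zero, whereas a single very large value, as with a viral post, pushes skewness and kurtosis upward. This is the pattern the heavy-tailed figures in Table 2 reflect.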
In addition, the Gini coefficients of 0.689 for retweets and 0.751 for likes confirm the non-uniform distribution of social engagement, though the concentration is marginally less acute than that observed in the comparative dataset.
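For readers unfamiliar with the measure, a Gini coefficient of 0 means that every post receives the same engagement, and values near 1 mean that engagement is concentrated in very few posts. The sketch below uses a standard sorted-rank formula for non-negative values; it is one common estimator and not necessarily the one used to obtain the figures above.

```python
def gini(values):
    """Gini coefficient of non-negative values via the sorted-rank formula.

    G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with x sorted
    in ascending order and ranks i = 1..n.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # convention chosen here for empty or all-zero input
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


equal = gini([5, 5, 5, 5])           # identical engagement on every post
concentrated = gini([0, 0, 0, 10])   # one post receives everything
print(equal, concentrated)
```

With this estimator, the maximum for n posts is (n - 1) / n, which approaches 1 as the number of posts grows.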
To delineate interaction dynamics more accurately, the like-to-retweet ratio was calculated for posts with at least one retweet. The median ratio is 1.238, indicating that posts typically accrue more likes than retweets. The interquartile range of 1.965 suggests considerable variability in how the public interacts with the content. Certain posts generate a relatively higher volume of affirmations compared to re-disseminations, while others demonstrate the inverse pattern. Table 3 shows the variability of engagement baselines across language, media, and tweet types.
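The like-to-retweet calculation described above can be sketched as follows. The sample pairs are invented and the quartile method ("inclusive" interpolation) is an assumption, so the output is illustrative and not a reproduction of the reported 1.238 and 1.965.

```python
from statistics import median, quantiles

# Hypothetical (likes_count, retweets_count) pairs; values are invented.
posts = [(10, 10), (30, 10), (20, 10), (40, 10), (50, 10), (5, 0)]


def like_retweet_summary(pairs):
    """Median and IQR of likes / retweets for posts with at least one retweet."""
    ratios = [likes / retweets for likes, retweets in pairs if retweets >= 1]
    q1, _, q3 = quantiles(ratios, n=4, method="inclusive")
    return {"n": len(ratios), "median": median(ratios), "iqr": q3 - q1}


summary = like_retweet_summary(posts)
print(summary)
```

Restricting the calculation to posts with at least one retweet avoids division by zero, which is why the last sample pair is excluded from the summary.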
Table 3.
Stratified median engagement metrics (2008–2021).
| Stratum | N | Median Likes | Median Replies | Median Retweets | Median Retweets to Reply | Median Likes to Reply |
|---|---|---|---|---|---|---|
| Language: | ||||||
| English | 45,785 | 54 | 4 | 49 | 12.1 | 13.5 |
| Non-English | 882 | 20 | 1 | 21 | 12.0 | 12.4 |
| Media type: | ||||||
| Text only | 28,548 | 23 | 3 | 35 | 11.0 | 8.3 |
| Visual (Image/video) | 18,119 | 138 | 6 | 79 | 13.8 | 22.6 |
| Tweet type (Standalone vs Threaded Reply): | ||||||
| Quote | 2,234 | 52 | 4 | 24 | 5.8 | 13.0 |
| Standalone/Thread-Start | 15,888 | 17 | 4 | 63 | 15.3 | 6.0 |
| Threaded Reply | 28,545 | 73 | 4 | 45 | 11.0 | 17.4 |
Source: The Authors, 2025.
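A stratified summary of the kind shown in Table 3 amounts to grouping posts by a categorical field and taking the median of each metric within each group. In the sketch below, the `media_type` field and its values are hypothetical: the public release does not contain language, media, or tweet-type columns, so such a field would have to come from rehydrated tweets.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows; `media_type` is not part of the public dataset.
rows = [
    {"media_type": "text", "likes_count": 10, "replies_count": 1},
    {"media_type": "text", "likes_count": 20, "replies_count": 2},
    {"media_type": "text", "likes_count": 30, "replies_count": 3},
    {"media_type": "visual", "likes_count": 100, "replies_count": 4},
    {"media_type": "visual", "likes_count": 200, "replies_count": 6},
]


def stratified_medians(records, stratum_key, metrics):
    """Return the group size and per-metric median for each stratum."""
    groups = defaultdict(list)
    for record in records:
        groups[record[stratum_key]].append(record)
    summary = {}
    for stratum, members in groups.items():
        summary[stratum] = {"n": len(members)}
        for metric in metrics:
            summary[stratum][metric] = median(m[metric] for m in members)
    return summary


table = stratified_medians(rows, "media_type", ["likes_count", "replies_count"])
print(table)
```

Medians are preferred over means here because the engagement distributions are heavy-tailed, so a handful of viral posts would otherwise dominate each stratum.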
Limitations
Our work offers a robust foundation for evaluating popular legitimacy, but also presents some limitations due to the scope and collection method used:
- Platform specificity: This research focuses on communication patterns on X. Extending the investigation to other Social Media Platforms (SMPs), such as Facebook or TikTok, may offer a fuller portrait of public opinion, as different platforms exhibit distinct engagement characteristics and user demographics. In addition, our study cannot account for the unquantifiable influence of proprietary platform-side dynamics, including internal moderation and algorithmic deboosting.
- Shifting user demographics: Our conclusions may be influenced by the evolving user base and demographics of X over the long period of the study. Changes in users' age distribution and migration to competing social media platforms introduce a confounding variable that was outside the focus of our research. Future research may build on the aggregate findings by applying demographic inference methods to investigate how gender and regional variations influence public responsiveness to WHO narratives.
- Passive indifference: The chosen engagement metrics (likes, retweets, replies) cannot directly capture passive indifference or disengagement, that is, users who simply ignore the WHO’s posts. Although specific trends may suggest growing apathy, this is not a direct measurement and warrants further study.
- Inability to control exogenous factors: The study’s focus on engagement trends makes it difficult to definitively isolate the impact of exogenous factors (such as changes in the WHO’s tweeting frequency, broad shifts in platform activity, or media reporting) from genuine changes in popular legitimacy.
Ethics Statement
Our analysis of WHO data followed established ethical and privacy standards. Research using material collected from social platforms requires scrupulous compliance with ethical protocols that safeguard individuals and promote responsible data practices. In this work, the compilation and examination of data pertaining to the WHO conformed to prevailing norms for the use of digital data. Specific effort was devoted to the ethical stewardship of publicly accessible content and the mitigation of potential harm.
The assembled resource relies on publicly available data retrieved from the verified WHO account. Per recognized academic practice, the scrutiny of general public social media activity may be performed responsibly without securing explicit individual consent, provided that no information capable of identifying a person is released. In accordance with the X rules and policies on content redistribution, no personally identifiable information is disclosed [10]. The dataset was prepared for public release: we removed personal metadata, such as usernames, user IDs, profile images, and other potentially identifying fields, following guidance from the research ethics literature [11,12] and X’s policy [9].
This dataset provides the posting date, but researchers should be aware that factors such as time of day, language of the post, and embedded links may influence algorithmic visibility and reply probability. These variables become available to researchers after rehydrating the tweets using the tweet IDs provided in our dataset.
The collection focused exclusively on the WHO’s original posts and the corresponding aggregate engagement metrics (counts of likes, retweets, and replies). All material employed in the quantitative analysis was thoroughly de-identified. Crucially, the process did not involve the acquisition, examination, or storage of any personal details from individual X users, such as their user identifiers, profile specifics, or the substance of their response messages.
Given that this work is based on publicly accessible and non-identifiable data, it does not involve human participants in a way that requires official review by an Institutional Review Board and was consequently deemed exempt from such endorsement [13]. All operational steps in data collection and management adhered rigorously to the platform’s terms of service and data access policies [9]. Regarding data visibility and the preservation of anonymity for SMP data used in studies, individual permission was not mandatory due to the content's general accessibility and the precautions established during the data collection process [14]. Consequently, this scholarly work poses no direct harm to private individuals.
Acknowledgements
The authors would like to thank our anonymous referees. The usual caveats apply.
Declaration of Competing Interest
The authors declare no competing financial or personal interests that could have influenced the research presented in this article.
Data Availability
GitHub: WHO_tweets (Original data).
References
- 1. Warin T., Melchior C., De Marcellis-Warin N. Social Media metrics and popular legitimacy: content analysis of pre– and post–COVID-19 public engagement with the World Health Organization on X. J. Med. Internet Res. 2025;27. doi: 10.2196/69959.
- 2. Melchior C., De Marcellis-Warin N., Warin T. Dataset on public engagement with the World Health Organization on X (Twitter) from 2008 to 2021. 2025. https://github.com/warint/WHO_tweets
- 3. Dingwerth K., Schmidtke H., Weise T. The rise of democratic legitimation: why international organizations speak the language of democracy. Eur. J. Int. Relat. 2019;26:714–741. doi: 10.1177/1354066119882488.
- 4. Gupta S., Pande N., Arumugam T., Sanjeev M.A. Reputational impact of Covid-19 pandemic management on World Health Organization among Indian public health professionals. J. Public Aff. 2022;23:1–13. doi: 10.1002/pa.2842.
- 5. Minot J.R., Arnold M.V., Alshaabi T., Danforth C.M., Dodds P.S. Ratioing the President: an exploration of public engagement with Obama and Trump on Twitter. PLOS ONE. 2021;16. doi: 10.1371/journal.pone.0248880.
- 6. Antonakaki D., Spiliotopoulos D., Samaras C.V., Pratikakis P., Ioannidis S., Fragopoulou P. Social media analysis during political turbulence. PLOS ONE. 2017;12. doi: 10.1371/journal.pone.0186836.
- 7. Hudders L., De Jans S. Gender effects in influencer marketing: an experimental study on the efficacy of endorsements by same- vs. other-gender social media influencers on Instagram. Int. J. Advert. 2021;41:128–149. doi: 10.1080/02650487.2021.1997455.
- 8. Al-Rawi A., Grepin K., Li X., Morgan R., Wenham C., Smith J. Investigating public discourses around gender and COVID-19: a social Media analysis of Twitter data. J. Healthc. Inform. Res. 2021;5:249–269. doi: 10.1007/s41666-021-00102-x.
- 9. X. Developer Policy – X Developers. 2025. https://developer.x.com/en/developer-terms/policy (accessed October 12, 2025).
- 10. Fiesler C., Proferes N. “Participant” perceptions of Twitter research ethics. Soc. Media Soc. 2018;4. doi: 10.1177/2056305118763366.
- 11. Beninger K. Social Media users’ Views on the ethics of Social Media research. In: SAGE Handb. Soc. Media Res. Methods. SAGE Publications Ltd; 2016. pp. 57–73.
- 12. Salganik M.J. Bit By Bit: Social Research in the Digital Age. Princeton University Press; Princeton: 2017. (accessed December 17, 2025)
- 13. Ess C., Jones S. Ethical decision-making and internet research: recommendations from the AoIR Ethics Working Committee. Readings in Virtual Research Ethics. 2004. doi: 10.4018/978-1-59140-152-0.ch002.
- 14. Pfeffer J., Mooseder A., Lasser J., Hammer L., Stritzel O., Garcia D. This sample seems to be good enough! Assessing coverage and temporal reliability of Twitter’s academic API. Proc. Int. AAAI Conf. Web Soc. Media, arXiv. 2023. pp. 720–729.
