“I’m in the Bluesky Tonight”: Insights from a year worth of social data

Andrea Failla; Giulio Rossetti

doi:10.1371/journal.pone.0310330

2024 Nov 5;19(11):e0310330.

Editor: Fabio Saracco³

PMCID: PMC11537377 PMID: 39499680

Abstract

Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. We present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social to address this pressing issue. The dataset contains the complete post history of over 4M users (81% of all registered accounts), totalling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions. Since Bluesky allows users to create and like feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped “like” interactions. This dataset allows novel analysis of online behavior and human-machine engagement patterns. Notably, it provides ground-truth data for studying the effects of content exposure and self-selection and performing content virality and diffusion analysis.

Introduction

Online social platforms (OSPs) have traditionally been a conspicuous data source for studying online human behaviors. From discourse analysis [1, 2], to studying d/misinformation [3, 4], coordinated behaviors [5, 6], radicalization [7, 8], echo-chamber effects [9, 10], and critical event detection [11, 12], computational social science has long built its development on social media data and answered its questions through it. However, in early 2023, Twitter/X and Reddit announced their plans to discontinue free access to their API services. These decisions have slowed down the advancement of computational social science research. Indeed, in the post-API era [13], researchers are left with limited options when it comes to finding suitable data. One option is to exploit custom web scrapers, i.e., programs that pretend to be regular users and collect data as they navigate OSPs. This strategy, however, is incredibly time-consuming and often violates the platforms’ terms of use. Moreover, it is hardly reusable, as a change in the website’s source code may easily break the scraper. Another possibility is to use search engines to query for specific OSPs and collect the result pages via the engine’s API. In other words, one may circumvent an OSP’s API restrictions by collecting data via a (often more permissive) search engine API. This strategy is less costly but was shown to be strongly biased in favor of (i) popular social media users, (ii) positive content, and (iii) non-political content [14]. The last option is to rely on older datasets. However, these quickly become outdated, and repeatedly reusing the same data sample introduces inherent biases and prevents the results from generalizing. Moreover, many OSPs (including Twitter/X) only allow sharing identifiers of social media posts, which must be used to obtain complete post metadata via the OSP’s API, which is now prohibitive.

Contrary to the general trend, however, the year-old decentralized OSP Bluesky Social (hereafter, Bluesky) has recently opened its APIs to developers, offering a potential solution to the widespread data shortage. Additionally, Bluesky offers unique features that make it a valuable resource for new studies, such as a new open federation technology—the AT protocol [15]—and augmented algorithmic choice. Regarding this last point, Bluesky allows users to create and like custom content recommendation algorithms called feed generators [16]. This feature opens up new ways in which human-AI relationships can be investigated, naturally favoring interesting research questions, e.g., does choosing your algorithm increase/decrease the risk of opinion polarization and of coming across d/misinformation? How do custom feeds relate to/affect the identification of reliable information sources? With this work, we aim to start bridging these gaps with a manifold contribution. First, we introduce a curated Bluesky dataset comprising more than 4M accounts (∼81% of all registered users according to the latest information available [17]) along with their complete posting activity (∼235M posts) and follower/followee relations; as part of this dataset, we also release the output of 11 feed generator algorithms available on the platform (i.e., the full collection of posts retrieved by such algorithms), along with data on who liked such posts and when. The dataset is complemented by Python scripts implementing the data collection and processing pipelines, potentially allowing other researchers to collect/process more data according to their needs; moreover, we promote a preliminary descriptive analysis of Bluesky’s structure, dynamics, and content. 
Aside from the work introducing the AT protocol [15] and a study on migration from Twitter to other OSPs [18], ours is the first work focusing on Bluesky Social, and the data is likely to be one of the highest-coverage datasets on online social platforms.

Bluesky Social

At launch (February 17th, 2023), the Bluesky service was available to new users only under invitation from an existing user or Bluesky PBC. The beta program has had moderate success since, capturing a considerable slice of ex-Twitter/X users during the 2023 mass migration [18]. Despite the launch of Meta’s much more popular competitor, Threads, in the same year [19], Bluesky reached more than three million users in November. Moreover, following the removal of the invite-only policy, the platform has reported an unprecedented increase in new user activity, totalling 5 million users in February 2024. Regarding user experience, Bluesky is comparable to more established microblogging services like X and Mastodon. Fig 1 (left) displays a typical Bluesky screen. Users can post short-form content, such as texts up to 300 characters and up to four images. Posts may also contain links to external websites and mentions of other users, and may be tagged via hashtags. On creation, posts with attached media can be labelled with tags that advise viewer discretion (e.g., adult content warnings). Like most social media sites, users can interact with other posts by replying, sharing, or liking. Bluesky distinguishes between reposting, i.e., sharing another user’s post as is, and quoting, i.e., reposting and adding a comment. The follower/following feature is implemented in a directed fashion, meaning user A befriending user B does not imply B befriending A. Perhaps the most distinctive feature of Bluesky is its feeds functionality. The service allows users to choose the algorithm(s) that power their home feed, allowing them to view, e.g., only posts by the accounts they follow, posts containing specific words or entities, and more complex filters. Feeds are implemented in a way that lets the user subscribe to multiple feeds and effortlessly switch from one to another (see Fig 1 (middle)).
Apart from the default Following feed (not shown), yielding content from followed users in reverse-chronological order, the user in Fig 1 is subscribed to the Discover and For You feeds, which yield trending content. These aim to mimic the standard feeds of other platforms, such as TikTok or Instagram. The logged-in user is also subscribed to a News feed that shows “[h]eadlines from verified news organisations” [20] as well as the Cat Pics feed (on display), yielding posts with cat pictures [21]. While the former three were created by Bluesky developers, the latter are created by other users: indeed, Bluesky also offers the possibility to build and share new feed algorithms via freely available software tools.

Fig 1. Screenshots of the *home* (left), *feeds* (middle), and *profile* (right) tabs from Bluesky’s official iOS app (v1.71). In the home tab, the top row is a scrollable bar listing the user’s bookmarked feed generators. The post at the top only contains an image and received 12 comments, 167 reposts, and 1447 likes. The post at the bottom contains both text and an image. The *feeds* tab contains the list of bookmarked feed generators, along with a feed search bar. Finally, the *profile* tab shows the logged-in user’s profile. Shared pictures were obfuscated to avoid sharing potentially copyrighted material. All panels were captured by the author on his device.

Materials and methods

Data collection

We collected publicly available user data from Bluesky using its official Developer API [22]. The process consisted of three phases. In the first phase (from February 25th to March 2nd, 2024), we collected the users’ followers. Starting from Bluesky’s official account, @bsky.app, we obtained its list of followers. With a breadth-first approach, we repeated this procedure for every user found until no new user was encountered. Once a sample of 1M unique users was collected, we distributed subsequent requests among ten machines to reduce collection time.
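The breadth-first procedure can be sketched as follows. This is a minimal illustration rather than the authors' crawler: `get_followers` is a hypothetical callable standing in for whatever API request returns an account's followers, and the toy handles are invented.

```python
from collections import deque
from typing import Callable, Iterable, Set


def crawl_followers(seed: str, get_followers: Callable[[str], Iterable[str]]) -> Set[str]:
    """Breadth-first traversal of follower lists, starting from `seed`."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        user = queue.popleft()
        for follower in get_followers(user):
            if follower not in seen:  # enqueue each account only once
                seen.add(follower)
                queue.append(follower)
    return seen


# Toy follower lists standing in for API responses (hypothetical handles).
toy = {"bsky.app": ["a", "b"], "a": ["b", "c"], "b": [], "c": ["a"]}
found = crawl_followers("bsky.app", lambda u: toy.get(u, []))
```

The loop stops when the queue is empty, i.e., when no new user is encountered. Accounts that follow nobody reachable from the seed are missed by this pass, which is consistent with the second phase using following lists to improve coverage.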

In the second phase (from March 19th to March 21st, 2024), we parsed the collected users to obtain their following lists (that is, for each user, the list of accounts she follows) and thus improve coverage. The procedure resulted in a sample of 4,099,699 unique users. Based on the latest publicly available official information [17], our sample covers ∼81% of Bluesky accounts. In this phase, we also collected posts from user timelines and feeds. We collected all posts from users in our sample via the dedicated API, totalling 237,121,706 posts. In this phase, additional processing was required to identify “reposts”. Indeed, while quotes and replies can be easily identified because the original post is clearly referenced in metadata, to the best of our knowledge, it is impossible to tell whether an item is a repost from metadata alone. The only clue is the post’s author: if while collecting posts of user A we found a post by user B, we labelled this as a repost. Unfortunately, we cannot capture self-reposts, although these might be irrelevant depending on the context. Moreover, Bluesky allows self-reposting the same post only once, which might imply a low number of self-reposts in our dataset. We also collected posts from various topics, including science, news, and social issues, that appeared in specific popular feeds. These feeds were manually selected among the most popular feeds as of March 18th.
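The repost heuristic described above can be sketched in a few lines. The field name `author` and the record layout are assumptions made for illustration; they are not the API's actual schema.

```python
from typing import Dict, List


def label_reposts(timeline_owner: str, items: List[Dict]) -> List[Dict]:
    """Mark an item as a repost when its author differs from the timeline owner."""
    labelled = []
    for item in items:
        is_repost = item["author"] != timeline_owner
        labelled.append({**item, "is_repost": is_repost})
    return labelled


timeline = [
    {"author": "A", "text": "my own post"},
    {"author": "B", "text": "someone else's post"},  # found while collecting A's posts
    {"author": "A", "text": "a self-repost looks like an original"},
]
labels = [p["is_repost"] for p in label_reposts("A", timeline)]
```

The third toy item shows the limitation noted in the text: a self-repost has the same author as the timeline owner, so this rule cannot detect it.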

Finally, in the third phase (April 23rd-25th), we collected likes to both the feeds (i.e., whether and when a user liked a feed generator) and their posts (i.e., whether and when a user liked a post yielded by a feed generator). In total, we obtained information on 18,324 feed generator likes and 4,895,318 post likes. To ensure completeness, we waited one month before obtaining “like” interactions: since posts lose their impact after a while [23], it is unlikely that they will receive many new likes after that time.

Data processing

The data collection pipeline produced the following outputs: (i) a collection of files containing, for each user, her followers, (ii) a collection of files containing, for each user, her followees, (iii) a collection of files containing user posts, each post represented as a JSON-formatted line, (iv) a collection of files containing posts appearing in specific feeds, each post represented as a JSON-formatted line, (v) information on who liked posts appearing in specific feeds and when, and (vi) information about who liked specific feed generators and when. Collections (i) and (ii) were aggregated into a single file. User handles were pseudo-anonymized during this process by assigning a progressively increasing integer value.
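Assigning progressively increasing integers to handles can be sketched as below. The starting value (0), the order of assignment (first appearance), and the example handles are assumptions; the text only states that the integers increase progressively.

```python
from typing import Dict, Iterable


def assign_ids(handles: Iterable[str]) -> Dict[str, int]:
    """Map each distinct handle to the next unused integer, in order of first appearance."""
    mapping: Dict[str, int] = {}
    for handle in handles:
        if handle not in mapping:
            mapping[handle] = len(mapping)
    return mapping


# Hypothetical follower pairs (u follows v), expressed with raw handles.
edges = [("alice.example", "bob.example"), ("bob.example", "carol.example")]
ids = assign_ids(h for edge in edges for h in edge)
anonymized = [(ids[u], ids[v]) for u, v in edges]
```

Keeping a single mapping across all collections is what lets the same integer identify a user in the follower, post, and like data.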

Collections (iii) and (iv) required several processing steps; first, we filtered out post metadata such as entities (e.g., hashtags and mentions) and attached media. Secondly, we filtered out posts with incorrect or ill-formatted timestamps. Indeed, Bluesky allows users and third-party applications to change a post’s creation date via the developer API. While this flexibility is, in principle, beneficial (it allows, for instance, importing posts from other sites such as Twitter/X, or moving content between Bluesky servers), it also has potentially undesirable side effects, such as incorrect date formats and fake dates. We retained only posts between February 17th, 2023, and March 18th, 2024 (inclusive) and truncated timestamps to minute precision. Keeping posts published until March 18th (and not later) was done as an attempt to reduce potential misalignments between relational (i.e., followers) and post data; in doing so, we ensure a degree of alignment by considering posts produced up until the final parse on the follower network (March 19th). Thirdly, we obtained each user’s instance by relying on handles. Bluesky handles are web domains of the form un.sd where un is a unique username chosen at creation that identifies an account within the network, and sd is the server domain, i.e., of the server hosting that user’s data. The server domain may contain dots (see Fig 1, rightmost). We used regular expressions to extract server domains. Then, we mapped user handles (both in the corresponding field and the posts’ text) to the integer values obtained when processing collections (i) and (ii) and also assigned a unique ID to each post. Moreover, since language metadata is often inconsistent (that is, a post in English may be labelled as english, en, eng, or other variations, e.g., capitalization, incorrect spelling, etc.), we manually mapped language metadata to the ISO 639-2 standard.
In this case, we also found multiple occurrences of ill-formatted/non-standard language metadata (e.g., posts labelled as not-a-lang, whatever). In these cases, we kept the post in the dataset but removed the language tag. After these steps, 235,567,116 posts are left in the dataset. Finally, to further enrich the dataset and favor future analysis, we estimated the sentiment of English posts leveraging a case-sensitive RoBERTa model fine-tuned on ∼129M English tweets [24]. To reduce noise, since a post can be tagged with multiple languages, we classified only those tagged as English and where no other language appears. As a result, we extended the dataset with the sentiment labels (0: negative, 1: neutral, 2: positive), and the model’s confidence scores rounded to the third decimal place. Please refer to Tables 1 and 2 for further information on post metadata. Finally, usernames and posts in collections (v) and (vi) were also assigned pseudonymized IDs and temporal information was aggregated at the minute level.
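Three of the processing steps above (timestamp filtering and truncation, server-domain extraction, and the English-only filter used before sentiment classification) can be sketched as follows. The input timestamp format (ISO 8601 strings), the UTC interpretation of the date bounds, the exact regular expression, and the use of `eng` as the English tag are assumptions, not details confirmed by the text.

```python
import re
from datetime import datetime, timezone
from typing import List, Optional

START = datetime(2023, 2, 17, tzinfo=timezone.utc)
END = datetime(2024, 3, 19, tzinfo=timezone.utc)  # exclusive bound: keeps all of March 18th
HANDLE_RE = re.compile(r"^([^.]+)\.(.+)$")  # username, then server domain (may contain dots)


def to_minute_stamp(raw: str) -> Optional[int]:
    """Return a YYYYmmddhhMM integer, or None if the date is malformed or out of range."""
    try:
        ts = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    except ValueError:
        return None
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive timestamps are UTC
    if not (START <= ts < END):
        return None
    return int(ts.astimezone(timezone.utc).strftime("%Y%m%d%H%M"))


def server_domain(handle: str) -> Optional[str]:
    """Return everything after the first dot of a handle of the form un.sd."""
    match = HANDLE_RE.match(handle)
    return match.group(2) if match else None


def english_only(langs: List[str]) -> bool:
    """True when the post is tagged as English and no other language appears."""
    return len(langs) > 0 and all(tag == "eng" for tag in langs)
```

For example, `server_domain("alice.bsky.social")` yields `bsky.social`, and a post tagged `["eng", "jpn"]` is skipped by the English-only filter.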

Table 1. Post metadata.

Category	Field	type	#non-null
User	user_id	int	235,567,116
User	instance	str	235,567,116
Content	post_id	int	235,567,116
	date	int	235,567,116
	text	str	235,567,116
	langs	list	220,628,598
	labels	list	4,027,096
	like_count	int	235,567,116
	reply_count	int	235,567,116
	repost_count	int	235,567,116
	sent_label	int	128,664,788
	sent_score	float	128,664,788
Relational	reply_to	int	87,704,964
	replied_author	int	87,704,964
	thread_root	int	87,704,964
	thread_root_author	int	87,704,964
	repost_from	int	63,549,643
	reposted_author	int	63,549,643
	quotes	int	12,110,474
	quoted_author	int	12,110,474


Table 2. Post metadata description.

Category	Field	description
User	user_id	an identifier univocally associated with each author/user.
User	instance	the name of the instance that the user is registered to
Content	post_id	an identifier univocally associated with each post
	date	the post date and time formatted as YYYYmmddhhMM.
	text	the post’s text content
	langs	ISO 639-2 language codes
	labels	the content warning label(s) that the post is tagged with
	like_count	the number of likes as per the post metadata
	reply_count	the number of replies as per the post metadata
	repost_count	the number of reposts as per the post metadata
	sent_label	the text’s sentiment
	sent_score	the sentiment model’s confidence
Relational	reply_to	the ID of the post to which the current post replies
	replied_author	the ID of the replied post’s author
	thread_root	the ID of the post that initiated the discussion thread
	thread_root_author	the ID of the root post’s author
	repost_from	the ID of the reposted post
	reposted_author	the ID of the reposted post’s author
	quotes	the ID of the quoted post
	quoted_author	the ID of the quoted post’s author


Data records

We make our dataset available for further research by releasing it publicly on Zenodo [25] (DOI: https://doi.org/10.5281/zenodo.11082878). To guarantee the reproducibility and transparency of the research, all the code used to collect and clean the data is included in the repository, together with the code to reproduce the experiments. In summary, the Zenodo repository contains the following files:

followers.csv.gz. This compressed file contains the anonymized follower edge list. Once decompressed, each row consists of two comma-separated integers u, v, representing a directed following relation (i.e., user u follows user v).
posts.tar.gz. This compressed folder contains data on the individual posts collected. Decompressing this file results in 100 files, each containing the full posts of up to 50,000 users. Each post is stored as a JSON-formatted line (see Table 2 for details on specific fields);
interactions.csv.gz. This compressed file contains the anonymized interactions edge list. Once decompressed, each row consists of six comma-separated integers representing a comment, repost, or quote interaction. These integers correspond to the following fields, in this order: user_id, replied_author, thread_root_author, reposted_author, quoted_author, and date (see Table 1). At least one of replied_author, thread_root_author, reposted_author, and quoted_author is non-null for all rows;
graphs.tar.gz. This compressed folder contains edge list files for the graphs emerging from reposts, quotes, and replies. Each interaction is timestamped. The folder also contains timestamped higher-order [26] interactions emerging from discussion threads, each containing all users participating in a thread.
feed_posts.tar.gz. This compressed folder contains posts that appear in 11 thematic feeds. Decompressing this folder results in 11 files containing posts from one feed each. Posts are stored as a JSON-formatted line. Fields correspond to those in Table 1, except for those related to sentiment analysis (sent_label, sent_score), and reposts (repost_from, reposted_author);
feed_bookmarks.csv. This file contains users who liked any of the collected feed generators. Each record contains three comma-separated values: the feed name (as per Table 3), the user ID, and the timestamp.
feed_post_likes.tar.gz. This compressed folder contains data on likes to posts appearing in the feeds, one file per feed. Each record in the files contains the following information, in this order: the ID of the “liker”, the ID of the post’s author, the ID of the liked post, and the like timestamp;
scripts.tar.gz. A collection of Python scripts, including the ones originally used to crawl the data and to perform experiments.
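As a starting point for working with these files, the sketch below streams an edge list and a JSON-lines post file without loading them fully into memory. It assumes the edge list has no header row; the in-memory toy data stands in for the real files.

```python
import csv
import gzip
import io
import json
from typing import Dict, Iterator, TextIO, Tuple


def read_edges(stream: TextIO) -> Iterator[Tuple[int, int]]:
    """Stream (u, v) pairs, meaning "u follows v", from comma-separated text."""
    for row in csv.reader(stream):
        if row:  # skip empty lines
            yield int(row[0]), int(row[1])


def read_posts(stream: TextIO) -> Iterator[Dict]:
    """Stream posts from text in which each non-empty line is one JSON object."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


# Toy gzip payload standing in for followers.csv.gz.
blob = gzip.compress(b"1,2\n2,3\n")
with gzip.open(io.BytesIO(blob), mode="rt", newline="") as stream:
    edges = list(read_edges(stream))

posts = list(read_posts(io.StringIO('{"post_id": 7}\n\n{"post_id": 8}\n')))
```

With the released data, the same functions would be called on `gzip.open("followers.csv.gz", mode="rt", newline="")` and on each file extracted from posts.tar.gz.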

Table 3. Collected feed statistics.

name	#posts	#authors	#post likes	#feed likes
#Disability	566	411	4,657	1,244
#UkrainianView	2,098	172	52,308	1,026
AcademicSky	913	352	4,344	803
BlackSky	86,490	1,564	1,714,160	2,590
BookSky	738	275	4,117	1,813
Game Dev	635	504	4,736	1,531
GreenSky	662	190	8,689	1,025
News	42,112	75	2,115,322	1,314
Political Science	357	46	3,799	1,651
Science	33,831	1,716	980,724	4,506
What’s History	161	71	2,462	821


Our dataset abides by principles for FAIR data [27] since it is:

Findable. We release our data on Zenodo, a service that stores data and assigns a DOI to the repository, ensuring findability. Moreover, we also indexed it on the SoBigData Research Infrastructure (URL: http://sobigdata.eu/), an EU-funded RI providing curated datasets and algorithms advocating open science;
Accessible. Given the above, we ensure our data is freely accessible to anyone with an internet connection;
Interoperable. All data is released in CSV or JSON files, enabling easy manipulation and analysis with most programming languages;
Reusable. Our data collection process is transparent, and the shared data is appropriately described. Thus, it can be reused and inputted into many analytical pipelines.

Ethics statement

The release of a large dataset of online interactions may raise significant ethical considerations that demand transparent handling. As stated by the service’s Privacy Policy, “any information [a user] add[s] to [their] public profile and the information [they] post on the Bluesky App will be public” [28], and there is currently no option to turn a profile private. Therefore, all information we collect is strictly public, including usernames, posts, and any attached metadata. Nonetheless, we strive to preserve privacy and anonymity for users in our sample. We have removed usernames from our dataset to mitigate privacy risks and replaced them with numerical IDs. We filtered out metadata that may univocally identify individuals or their content (e.g., post URIs) and aggregated temporal data at the minute level. The only data we collect regarding user profiles is the server they are registered to (information that is unlikely to provide any means of identifying individuals). Therefore, user bios, pictures, and registration dates, all potential identity markers, were not collected. These steps are detailed in the Data Processing section. Our dataset complies with Bluesky’s Terms of Service as well as with the latest European Union provisions in terms of data protection, particularly GDPR.

Data analysis

In this Section, we describe the dataset in detail, providing insights into social topology, user activity and content. We also highlight potential applications of the dataset for social media mining.

Social structure: Network topology and federation

Followers network

We model the system emerging from follower-followee relations as a directed graph G = (V, E), such that V = {v₁, …, v_n} is the set of nodes/users, and E = {e_ij, …, e_km} is the set of edges/relations. Our snapshot of the Bluesky network counts 4,099,699 nodes connected by 144,581,603 edges. A considerable fraction of these relations, ∼39%, are mutual. The distributions of in-degrees and out-degrees, representing the number of followers and followees for each account, follow a power law. Thus, few accounts hold most of the social capital, while the vast majority have only a few inward/outward connections. This characteristic is also shared by other online social networks such as Twitter/X and Facebook [29, 30]. The highest in-degree is 770,556 (i.e., of the most followed account), while the highest out-degree is 225,094 (i.e., of the account that follows others the most).
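The quantities in this paragraph can be computed directly from an edge list. The sketch below uses one common definition of reciprocity (the share of directed edges whose reverse edge also exists); whether this matches the exact definition behind the ∼39% figure is an assumption.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def reciprocity(edges: Iterable[Tuple[int, int]]) -> float:
    """Fraction of distinct directed edges (u, v) for which (v, u) also exists."""
    edge_set = {(u, v) for u, v in edges if u != v}
    if not edge_set:
        return 0.0
    mutual = sum(1 for u, v in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)


def degrees(edges: Iterable[Tuple[int, int]]) -> Tuple[Counter, Counter]:
    """Return (in-degree, out-degree) counters for edges meaning "u follows v"."""
    in_deg, out_deg = Counter(), Counter()
    for u, v in edges:
        out_deg[u] += 1  # u gains a followee
        in_deg[v] += 1   # v gains a follower
    return in_deg, out_deg


toy_edges: List[Tuple[int, int]] = [(1, 2), (2, 1), (3, 1), (4, 1), (1, 5)]
r = reciprocity(toy_edges)
in_deg, out_deg = degrees(toy_edges)
```

In the toy graph, two of the five edges are reciprocated, and node 1 has three followers and two followees.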

Interaction networks

Bluesky’s post metadata allows modelling its social topology in different ways. Interaction networks can be built from the dataset leveraging replies, reposts, or quotes metadata according to the desired semantics and can possibly be multilayer [31, 32], weighted [33], and evolve in time [34, 35]. For instance, the conversation network —where nodes are users and directed edges u → v represent replies—resulting from the dataset is made up of 1.5M nodes connected by 23.4M edges. We find a remarkably high edge reciprocity of 57%, suggesting that users who receive a reply often reply back. Another interesting network is the one obtained by combining reposts and quotes (this is akin to retweet networks, a popular modelling choice for content/opinion diffusion studies on social media). The resulting network contains 1.4M nodes connected by 33.8M edges. The reciprocity rate is 8%, which is expected as most users typically share posts from a few hubs (e.g., news profiles and influencers). Apart from these hubs—some of which are reposted/quoted by more than 50,000 distinct users—we identify some account “boosters”, i.e., accounts that share content from the same user thousands of times, effectively boosting her outreach. Indeed, by assigning edge weights based on (directed) interaction frequency, we find 140 unique node pairs interacting more than 1,000 times. Here, we have provided a brief description of some interaction layers, but future works could explore their joint topology, temporal dynamics [34, 35], as well as other structures such as mentions networks [36] and higher-order topologies [27, 37]. Moreover, by integrating interaction networks with follower relations, future studies could more adequately investigate content diffusion patterns on a comprehensive topology. 
Finally, the different semantic nuances of quotes and reposts potentially allow distinguishing between reposting as a means of endorsement and reposting as a means of criticizing, a well-known problem in content diffusion analysis [38].
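Weighting edges by directed interaction frequency, as done above to find "boosters", can be sketched as follows. The pair layout (source, target) and the toy threshold are illustrative; the text uses a threshold of 1,000 interactions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def weighted_edges(interactions: Iterable[Tuple[int, int]]) -> Counter:
    """Count how many times each directed pair (source, target) interacts."""
    return Counter(interactions)


def heavy_pairs(weights: Counter, threshold: int) -> List[Tuple[int, int]]:
    """Return the pairs whose interaction count exceeds `threshold`."""
    return sorted(pair for pair, weight in weights.items() if weight > threshold)


# Toy repost/quote events: user 7 shares user 1's content five times.
events = [(7, 1)] * 5 + [(8, 1)] * 2 + [(1, 7)]
weights = weighted_edges(events)
boosters = heavy_pairs(weights, threshold=3)
```

Because the count is directed, `(7, 1)` and `(1, 7)` are tallied separately.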

Federation

The federated model is one of Bluesky’s key characteristics. However, the platform only opened to federation on February 22nd, 2024. Before such a date, users could only register to the bluesky.social instance or create instances in a test sandbox network separated from the main one. Although it is impossible to tell whether a user has been registered to a server from the beginning or simply moved to a server at a later date, we argue that this is relatively unimportant since this does not affect what content a user is exposed to (see the protocol paper [15] for more details). Still, analyzing how users are distributed among instances can provide fundamental insights into the platform’s structure and dynamics [39].

We display the instances with at least 500 members and 100,000 posts in Fig 2. Despite Bluesky’s federation being relatively novel, the dataset contains posts from users registered to 6,181 servers. In the 26 days spanning from February 22nd to March 18th, ∼20M posts (8% of all posts) were shared on new servers, suggesting fast growth in the federation network. Aside from the default one, which is by far the most populated (98% of all users) and active (92% of all posts), other popular instances include common domains such as .com and .net, which are widely used by companies and news organizations. Some domains relate to specific countries/languages (.de, .fr, .ca), jobs and interests (.dev, .art) and other personal characteristics such as belonging to/supporting the LGBT+ community (.gay). In principle, this might allow topic-specific social studies on the platform. For instance, country/language domains could be used as proxies for coarse-grained geolocation data, which is not currently available in post metadata.

Posting activity

Users show moderate engagement with the Bluesky platform. Out of the ∼4M users, nearly 2.4M (58%) shared at least one post. On average, these accounts shared 99 posts each (σ = 717.63), with a median of 8. Of the 235M posts in our data, 63M (27%) are reposts, and 12M (5%) are quotes, indicating substantial content dissemination within the platform. Moreover, 20M discussion threads containing 88M replies can be identified.

Fig 3 shows the number of daily posts on the platform. Values are averaged via a seven-day rolling window to improve readability. Posts steadily increased from March through November, with some downward oscillations (e.g., around July-August). Activity on the platform begins to stabilize at ∼1M daily posts starting mid-October through February. A steep increase was registered after February 6th, when Bluesky’s invite-only policy was lifted, effectively doubling the engagement on the platform. After that, activity decreased but remained higher than before the policy was lifted.
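A seven-day rolling average can be computed as in the sketch below. It uses a trailing window and averages over fewer values at the start of the series; the text does not say whether the window used for the figures is trailing or centered, so both choices are assumptions.

```python
from collections import deque
from typing import List, Sequence


def rolling_mean(values: Sequence[float], window: int = 7) -> List[float]:
    """Trailing mean over the last `window` values; early entries use what is available."""
    buffer = deque(maxlen=window)
    smoothed = []
    for value in values:
        buffer.append(value)  # the oldest value drops out once the buffer is full
        smoothed.append(sum(buffer) / len(buffer))
    return smoothed


daily = [10, 20, 30, 40]
smoothed = rolling_mean(daily, window=2)
```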

For a clearer picture of Bluesky’s activity, we compute the daily average number of posts per active user. Formally, let U be the set of users that posted at least once on day d; we compute:

\begin{equation} \frac{\sum_{u \in U} p_{u}}{|U|}, \tag{1} \end{equation}

where p_u is the number of posts shared by user u on day d. The red curve in Fig 4 depicts this measure’s trend. Once again, values are averaged via a seven-day rolling window. In this chart, the light blue area outlines the interquartile range (IQR), i.e., where the middle 50% of p_u values are located; the blue curve outlines the average value within the IQR, providing a score less sensitive to outliers. During our observation period, the global average consistently falls beyond the IQR, highlighting the presence of very active users. We speculate that these might be (i) news organizations that continuously post updates and/or (ii) spammers and bots, as some accounts share up to 5K daily posts, which is unlikely for real accounts. The latter scenario also allows employing this dataset for bot detection [40] and coordinated behaviors analysis [5]. The IQR average is mostly constant at around two posts per user. Notably, the February spike observed in Fig 3 does not affect these values.
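The two averages discussed here can be illustrated for a single day as follows. The quartile method (`inclusive`) and the rule for the IQR average (the mean of the values lying between the first and third quartiles) are assumptions; the text does not specify how the IQR average was computed.

```python
from statistics import mean, quantiles
from typing import List, Tuple


def daily_activity(posts_per_user: List[int]) -> Tuple[float, float]:
    """Return (global mean, mean of the values inside the interquartile range).

    `posts_per_user` holds p_u for every user active on one day (at least two users).
    """
    global_mean = mean(posts_per_user)  # Eq (1): sum of p_u divided by |U|
    q1, q2, q3 = quantiles(posts_per_user, n=4, method="inclusive")
    # Fall back to the median if no observation lies between the quartiles.
    inside = [p for p in posts_per_user if q1 <= p <= q3] or [q2]
    return global_mean, mean(inside)


# Toy day: seven ordinary users and one account posting 500 times.
counts = [1, 1, 2, 2, 3, 3, 4, 500]
global_mean, iqr_mean = daily_activity(counts)
```

In the toy data, one very active account lifts the global mean to 64.5, while the IQR mean stays at 2.5, which mirrors the gap between the two curves described above.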

Finally, to understand whether users are active for long or short periods of time, we turn to Fig 5, which shows the Cumulative Distribution Function (CDF) of days elapsed from each user’s first to last posts. Globally, 50% of users were active for 50 days or longer, and 22% were active for 150 days or longer. To control for the effect of the invite-only removal, we additionally differentiate between invite-only (i.e., users whose first post is recorded before February 6th, 2024), and free access (i.e., users whose first post is recorded after the access policy was modified). Invite-only users have been mostly active for at least 75 days. Regarding the free access users, 40% were active for 25 days or more.
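The activity span behind this CDF can be derived from the `user_id` and `date` fields. The sketch below parses dates in the YYYYmmddhhMM format of Table 2 and counts whole days between each user's first and last post; counting whole days (rather than calendar dates) is an assumption.

```python
from datetime import datetime
from typing import Dict, Iterable, Tuple


def parse_stamp(stamp: int) -> datetime:
    """Parse a YYYYmmddhhMM integer into a datetime."""
    return datetime.strptime(str(stamp), "%Y%m%d%H%M")


def active_days(posts: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Map each user_id to the whole days elapsed between first and last post."""
    first: Dict[int, datetime] = {}
    last: Dict[int, datetime] = {}
    for user_id, stamp in posts:
        ts = parse_stamp(stamp)
        first[user_id] = min(first.get(user_id, ts), ts)
        last[user_id] = max(last.get(user_id, ts), ts)
    return {u: (last[u] - first[u]).days for u in first}


def share_active_at_least(spans: Dict[int, int], days: int) -> float:
    """Fraction of users whose span is `days` or longer (one point of the complementary CDF)."""
    return sum(1 for d in spans.values() if d >= days) / len(spans)


posts = [(1, 202303011200), (1, 202304301200), (2, 202402101000), (2, 202402111000)]
spans = active_days(posts)
share = share_active_at_least(spans, 50)
```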

Content analysis

Languages

In the following, we characterize posts based on their language metadata. When a user submits a post on Bluesky, she can select the language(s) that appear in the post via a dropdown menu. Users can choose up to three language tags since a post can be written in multiple languages. Language metadata was first added at the end of June 2023; thus, the languages of the earlier posts are unknown. In total, the dataset contains 227 unique language tags. The most frequent language is English, with over 132M posts (note that multilingual posts are counted once for each language). Japanese and German also form sizeable clusters, with 33.9M and 26M posts, respectively. The dataset also contains 5.2M posts with two language tags and 2.5M with three. Among the posts with two tags, we find English and Japanese (98K), English and Spanish (52K), and English and Portuguese (52K). Among the posts with three tags, we find English, Japanese, and Korean (233K), German, English, and French (103K), and English, Portuguese, and Spanish (87K). The wide variety of languages and the presence of multilingual posts make this dataset a valuable choice for multilingual studies on online social platforms.
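The counting convention used above, where a multilingual post contributes once to each of its languages, can be sketched as follows. The tag values in the toy sample are illustrative ISO 639-2 codes.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def language_counts(posts_langs: Iterable[List[str]]) -> Tuple[Counter, Counter]:
    """Return (posts per language, posts per number of distinct tags)."""
    per_language: Counter = Counter()
    tags_per_post: Counter = Counter()
    for langs in posts_langs:
        unique = set(langs)          # guard against a tag repeated within one post
        per_language.update(unique)  # one count per language that appears in the post
        tags_per_post[len(unique)] += 1
    return per_language, tags_per_post


sample = [["eng"], ["eng", "jpn"], ["deu"], [], ["eng", "por", "spa"]]
per_language, tags_per_post = language_counts(sample)
```

Because of this convention, the per-language totals can sum to more than the number of posts.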

Feeds

Feed generators are one of Bluesky’s distinguishing features. We have collected posts and metadata from 11 feed generators, shown in Table 3. The feeds in our sample cover a broad range of topics. Among these, some are related to socio-political issues (#UkrainianView, GreenSky), and sciences and academia (Science, AcademicSky, Political Science, What’s History). Moreover, the News feed contains posts from verified news organizations. Other notable feeds relate to minorities/discriminated groups, such as #Disability and BlackSky. Among these, BlackSky has a peculiar opt-in mechanism for choosing whose posts to show. People who identify as black can ask the feed administrator to be added to the feed. Once they have been added, anything they post will be shown in the feed. Finally, the BookSky and Game Dev feeds refer to book recommendations and game development, presumably online spaces where users may look for advice/suggestions on these topics. To glance at the feeds’ contents, the rightmost column in Table 4 shows the most frequent words in each feed. These were obtained via a standard text processing pipeline, including lowercasing, lemmatization, and removing punctuation, stopwords, numbers, emojis, and URLs. At a glance, most of these frequent words are somewhat related to the main topic of each feed. For instance, on GreenSky, terms refer to climate change, emissions, and non/renewable energy sources. This suggests that, despite their relatively small size, feeds may be used as effective proxies in topic-specific studies, in the same way as hashtags are on Twitter or subreddits on Reddit.
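A simplified version of such a word-frequency pipeline is sketched below. It omits lemmatization, uses a toy stopword list, and keeps only ASCII letters (so it would discard non-Latin scripts); these simplifications are assumptions made to keep the example dependency-free, not the pipeline used for Table 4.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

STOPWORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "on", "for"}  # toy list
URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[a-z]+")  # letters only: drops digits, punctuation, and emojis


def top_words(texts: Iterable[str], k: int = 10) -> List[Tuple[str, int]]:
    """Return the k most frequent non-stopword tokens across all texts."""
    counts: Counter = Counter()
    for text in texts:
        cleaned = URL_RE.sub(" ", text.lower())  # lowercase, then strip URLs
        counts.update(t for t in TOKEN_RE.findall(cleaned) if t not in STOPWORDS)
    return counts.most_common(k)


feed = ["Climate change is real: https://example.org/report", "The climate and energy 2024!"]
words = top_words(feed, k=10)
```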

Table 4. Most frequent words appearing in each feed.

Feed	Most frequent words
#Disability	disability, disabled, work, covid, today, help, life, week, read, long
#UkrainianView	ukrainianview, russian, russia, ukraine, ukrainian, war, air, another, fuck, support
AcademicSky	academicsky, edusky, academia, highered, student, research, university, academic, psychscisky, bitly
BlackSky	black, love, today, work, feel, white, life, woman, shit, post
BookSky	book, booksky, read, reading, review, author, story, today, finished, horror
Game Dev	game, gamedev, dev, design, art, indiedev, indiegame, screenshotsaturday, work, whatagamedevlookslike
GreenSky	climate, energy, change, carbon, power, emission, fuel, fossil, work, global
News	news, ukraine, russian, state, russia, president, trump, please, feed, orgs
Political Science	polisky, feed, polisci, political, list, book, student, post, science, gendersky
Science	science, feed, post, today, paper, please, research, data, work, study
What’s History	history, skystorians, book, american, war, eel, historical, japanese, woman, read
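
As an illustration of how such word lists can be produced, here is a minimal sketch of a frequency pipeline using only the Python standard library. It is not the authors' code: the stopword list is a tiny placeholder, lemmatization is omitted (so "emissions" is not reduced to "emission"), and the regular expression keeps only ASCII letters, which is a simplification suited to English text.

```python
import re
from collections import Counter

# Placeholder stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for", "it"}
URL_RE = re.compile(r"https?://\S+|www\.\S+")
WORD_RE = re.compile(r"[a-z]{2,}")


def tokenize(text):
    """Lowercase, drop URLs, keep alphabetic tokens, remove stopwords."""
    text = URL_RE.sub(" ", text.lower())
    # Keeping only runs of letters discards punctuation, numbers, and emojis.
    return [tok for tok in WORD_RE.findall(text) if tok not in STOPWORDS]


def top_words(texts, k=10):
    """Return the k most frequent tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(k)]


# Hypothetical posts, made up for this example.
feed_posts = [
    "Climate change is real 🌍 https://example.org/report",
    "Fossil fuel emissions rose 2% in 2023",
    "The climate and energy debate: #climate",
]
print(top_words(feed_posts, k=3))
```

Because hashtags are reduced to their letters, tags such as #gamedev end up as ordinary tokens, which matches the presence of feed-specific tags among the words in Table 4.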

Information on when a user liked a feed generator may indicate how much she is exposed to content about that topic. In principle, this may allow the effects of content exposure to be studied in a controlled way, i.e., by comparing activity before/after liking the feed. On a more global scale, feed liking information can outline patterns of increasing/decreasing popularity of certain topics, possibly influenced by real-world events. For instance, the News feed was liked the most in October 2023 (516 times), with a spike of 126 likes on October 21st. We hypothesize this may be tied to the ongoing conflict in Gaza. The Science feed, instead, was mostly liked in the summertime, receiving nearly 1,200 likes in July. Moreover, like data on posts can give more fine-grained information about the popularity of specific topics, posts, or users. For instance, in the Science feed, the most liked post has 250,600 likes. It received the most likes right after it was posted, namely 15K on February 10th, 2024, and 17K the next day. After that, its popularity decreased, although it received 1K to 4K daily likes up until mid-April.
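
The monthly and daily figures above come from aggregating timestamped likes. A minimal sketch of that aggregation follows; the timestamps are invented and the ISO-8601 string format is an assumption about how like events might be stored, not a description of the released files.

```python
from collections import Counter
from datetime import datetime

# Hypothetical like events for one feed (ISO-8601 strings without time zone).
like_timestamps = [
    "2023-10-20T08:15:00",
    "2023-10-21T09:00:00",
    "2023-10-21T17:30:00",
    "2023-11-02T12:00:00",
]


def likes_per_period(timestamps, fmt):
    """Count likes per period, where fmt is a strftime pattern for the bucket."""
    return Counter(datetime.fromisoformat(ts).strftime(fmt) for ts in timestamps)


monthly = likes_per_period(like_timestamps, "%Y-%m")
daily = likes_per_period(like_timestamps, "%Y-%m-%d")

peak_month, peak_month_likes = monthly.most_common(1)[0]
peak_day, peak_day_likes = daily.most_common(1)[0]
print(peak_month, peak_month_likes)
print(peak_day, peak_day_likes)
```

The same bucketing applied to a single user's posts, split at the moment she liked a feed, would give the before/after comparison mentioned above.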

Sentiment and topics

Sentiment can be an interesting indicator of the overall emotional atmosphere on the platform. Out of the annotated English posts, 39M (32%) are positive, 32M (27%) are negative, and 50M (41%) are neutral. Daily sentiment rates are shown in Fig 6. Note that posts before July do not have any language metadata and are thus excluded from this analysis. Although trends are mostly flat, some interesting patterns can be observed. First, positive posts outnumber negative posts on most days, suggesting a generally positive outlook within the community. A relative increase in positive posts can be seen around January 2024, likely due to the beginning of the new year and the associated sense of renewal and optimism. Moreover, the February influx of new users brought excitement and enthusiasm, leading to a temporary increase in positive posts. Another notable time window is July 13th to 15th, when the platform is characterized by a relative increase in negative posts. To understand why, we leverage BERTopic, a neural topic modelling algorithm [41]. This model identifies topics by (i) constructing document embeddings, (ii) applying the UMAP algorithm for dimensionality reduction [42], and (iii) discovering document clusters via HDBSCAN [43]. We apply BERTopic to English posts published between July 13th and 15th (inclusive) whose sentiment is negative. The model automatically identifies 40 clusters. The five largest ones are displayed in Fig 7 along with descriptive words for each topic. Although there are several topics, most documents belong to cluster 0, which refers to racism within the Bluesky online community. Other clusters, each with ∼1000 posts or fewer, seem to tackle the same issue from different perspectives. For instance, posts in topic 1 relate to content moderation, topic 3 mentions blocked users, and topic 4 refers to apologies and racism.
Thus, it is likely that the Bluesky community suffered from large-scale racist episodes, possibly concerning content moderation and/or platform administration. This hypothesis is confirmed by the online newspaper TechCrunch: on July 18th, an article reported a community backlash following Bluesky’s failure to flag racial slurs in usernames [44]. At the same time, the platform allegedly removed many such words from its flagged words list. Looking back at Fig 3, this matter likely caused the sudden bump in posting activity in July. Future studies could investigate whether similar trends emerge in other languages as well.
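
The selection of a time window for topic modelling rests on daily sentiment rates. The sketch below shows one way such rates could be computed and how days where negative posts outnumber positive ones could be flagged. The (day, label) pairs are invented, and the flagging rule is an illustrative assumption rather than the criterion used in this analysis.

```python
from collections import Counter, defaultdict

LABELS = ("positive", "negative", "neutral")

# Hypothetical (day, sentiment label) pairs for annotated English posts.
annotated = [
    ("2023-07-12", "positive"), ("2023-07-12", "positive"),
    ("2023-07-12", "negative"), ("2023-07-12", "neutral"),
    ("2023-07-13", "negative"), ("2023-07-13", "negative"),
    ("2023-07-13", "positive"), ("2023-07-13", "neutral"),
    ("2023-07-14", "negative"), ("2023-07-14", "neutral"),
]


def daily_shares(rows):
    """Return, for each day, the fraction of posts carrying each label."""
    counts = defaultdict(Counter)
    for day, label in rows:
        counts[day][label] += 1
    shares = {}
    for day, per_label in counts.items():
        total = sum(per_label.values())
        shares[day] = {label: per_label[label] / total for label in LABELS}
    return shares


def negative_dominant_days(shares):
    """Days on which negative posts outnumber positive ones."""
    return sorted(d for d, s in shares.items() if s["negative"] > s["positive"])


shares = daily_shares(annotated)
print(negative_dominant_days(shares))
```

Posts from the flagged days, restricted to the negative label, would then form the document set passed to a topic model.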

Fig 7. Values on the x-axis are scaled logarithmically.

Conclusion

In this work, we described the Bluesky Social Dataset, which is, to the best of our knowledge, the first public dataset on the year-old decentralized social platform. Our dataset contains over 235M posts covering the full content history of more than 80% of all registered users. Longitudinal interaction data is also made available, including follow, reply, repost, and quote interactions. We also exploit the peculiar feature of feed generators and collect all posts served from the most popular recommenders on the platform, together with information on users who subscribed to these feeds. The data is hosted on Zenodo [25], along with the code used to collect it, clean it, and reproduce the plots. Overall, we observe good engagement patterns on the platform. Users actively participate across a variety of topic-specific environments, including feeds and instances tailored to particular interests, demographics, and languages. Additionally, we notice that global user behaviors emerge in response to real-world events, leading to spikes in activity and shifts in discourse as users react to current happenings. Our contribution paves the way toward a deeper understanding of digital interactions and dynamics on niche and/or decentralized platforms. Moreover, combining multiple layers of social topology with information extracted from user-generated content might lead to new insights into polluting dynamics such as d/misinformation, opinion polarization, and political segregation.

Usage notes

The data is hosted on Zenodo [25]. The platform allows hosting datasets of up to 50GB and 100 files. We provide the data in compressed formats to comply with these limits on file size and file count. To re-run the analysis on posts, networks, and feeds, it is necessary to decompress the corresponding archive(s).
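
As a convenience, the decompression step could be scripted as in the sketch below. It assumes the downloaded files are archives in a format that Python's `shutil` recognizes by extension (for example .zip or .tar.gz); the actual archive formats and file names on Zenodo are not specified here, and the directory names in the usage comment are hypothetical.

```python
import shutil
from pathlib import Path


def unpack_all(download_dir, target_dir):
    """Unpack every recognized archive in download_dir into its own subfolder."""
    download_dir, target_dir = Path(download_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Extensions that shutil.unpack_archive can handle on this system.
    known_suffixes = tuple(
        ext for _, extensions, _ in shutil.get_unpack_formats() for ext in extensions
    )
    unpacked = []
    for archive in sorted(download_dir.iterdir()):
        if archive.is_file() and archive.name.endswith(known_suffixes):
            # "posts.tar.gz" is extracted into "<target_dir>/posts".
            destination = target_dir / archive.name.split(".")[0]
            shutil.unpack_archive(str(archive), str(destination))
            unpacked.append(destination)
    return unpacked


# Usage (hypothetical directory names):
# unpack_all("zenodo_download", "data")
```

Files compressed individually (for instance a single .gz file that is not a tar archive) are not handled by this sketch and would need the `gzip` module instead.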

Code availability

All code related to generating, processing, and describing the dataset is released alongside it. The code requires Python 3.8 or higher and is mostly based on standard libraries and popular data science packages (e.g., numpy, pandas, matplotlib). The scripts for data collection also require atproto v0.0.46, the official Python wrapper for the Bluesky Social API. Please refer to the API documentation for further details [22]. Before running the data collection scripts, users should also set the USERNAME and PASSWORD environment variables with their Bluesky credentials, as some methods may require authentication.
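
A minimal sketch of how a script might read these two variables and fail early when they are missing is shown below. The helper name `load_credentials` is hypothetical and not part of the released code. Note also that Windows sets a USERNAME variable of its own, so the value should be checked before use on that system.

```python
import os


def load_credentials(env=os.environ):
    """Read Bluesky credentials from USERNAME and PASSWORD, or raise clearly."""
    missing = [name for name in ("USERNAME", "PASSWORD") if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variable(s): " + ", ".join(missing)
        )
    return env["USERNAME"], env["PASSWORD"]


# Usage with an explicit mapping (placeholder values, not real credentials):
handle, password = load_credentials(
    {"USERNAME": "alice.bsky.social", "PASSWORD": "app-password"}
)
print(handle)

# Assumption: with the atproto package installed, the pair could then be
# passed to its client, e.g. Client().login(handle, password).
```

Passing the mapping as a parameter keeps the helper easy to test without touching the real environment.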

Data Availability

All data and code are available in a dedicated Zenodo repository at https://zenodo.org/records/11082879.

Funding Statement

This work is supported by (i) the European Union – Horizon 2020 Program under the scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities”, Grant Agreement n.871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” (http://www.sobigdata.eu); (ii) SoBigData.it which receives funding from the European Union – NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it – Strengthening the Italian RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264 del 28/12/2021; (iii) EU NextGenerationEU programme under the funding schemes PNRR-PE-AI FAIR (Future Artificial Intelligence Research). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Bouvier G. & Machin D. Critical discourse analysis and the challenges and opportunities of social media. Critical Discourse Studies and/in Communication 39–53 (2020). doi: 10.4324/9781003050353-3
2. Botzer N., Gu S. & Weninger T. Analysis of moral judgment on reddit. IEEE Transactions on Computational Social Systems (2022).
3. Aïmeur E., Amri S. & Brassard G. Fake news, disinformation and misinformation in social media: a review. Social Network Analysis and Mining 13, 30 (2023). doi: 10.1007/s13278-023-01028-5
4. Saurwein F. & Spencer-Smith C. Combating disinformation on social media: Multilevel governance and distributed accountability in europe. Digital journalism 8, 820–841 (2020). doi: 10.1080/21670811.2020.1765401
5. Caldarelli G., De Nicola R., Del Vigna F., Petrocchi M. & Saracco F. The role of bot squads in the political propaganda on twitter. Communications Physics 3, 81 (2020). doi: 10.1038/s42005-020-0340-4
6. Nizzoli L., Tardelli S., Avvenuti M., Cresci S. & Tesconi M. Coordinated behavior on social media in 2019 uk general election. In Proceedings of the International AAAI Conference on Web and Social Media, vol. 15, 443–454 (2021).
7. Goldenberg A. et al. Homophily and acrophily as drivers of political segregation. Nature Human Behaviour 7, 219–230 (2023). doi: 10.1038/s41562-022-01474-9
8. Wang E. L., Luceri L., Pierri F. & Ferrara E. Identifying and characterizing behavioral classes of radicalization within the qanon conspiracy on twitter. In Proceedings of the International AAAI Conference on Web and Social Media, vol. 17, 890–901 (2023).
9. Cinelli M., De Francisci Morales G., Galeazzi A., Quattrociocchi W. & Starnini M. The echo chamber effect on social media. Proceedings of the National Academy of Sciences 118, e2023301118 (2021). doi: 10.1073/pnas.2023301118
10. Garimella K., Morales G. D. F., Gionis A. & Mathioudakis M. Quantifying controversy on social media. ACM Transactions on Social Computing 1, 1–27 (2018). doi: 10.1145/3140565
11. Weng J. & Lee B.-S. Event detection in twitter. In Proceedings of the international aaai conference on web and social media, vol. 5, 401–408 (2011).
12. Hassan N. et al. Towards automated sexual violence report tracking. In Proceedings of the international AAAI conference on web and social media, vol. 14, 250–259 (2020).
13. Trezza D. To scrape or not to scrape, this is dilemma. the post-api scenario and implications on digital research. Frontiers in Sociology 8, 1145038 (2023). doi: 10.3389/fsoc.2023.1145038
14. Poudel A. & Weninger T. Navigating the post-api dilemma search engine results pages present a biased view of social media data. arXiv preprint arXiv:2401.15479 (2024).
15. Kleppmann M. et al. Bluesky and the at protocol: Usable decentralized social media. arXiv preprint arXiv:2402.03239 (2024).
16. Quelle D. & Bovet A. Bluesky: Network Topology, Polarisation, and Algorithmic Curation. arXiv preprint arXiv:2405.17571 (2024).
17. @bsky.app’s post published feb 22, 2024 at 21:04. https://bsky.app/profile/bsky.app/post/3klzrudt4uk2z. [Accessed 27-03-2024].
18. Jeong U. et al. User migration across multiple social media platforms. arXiv preprint arXiv:2309.12613 (2023).
19. Introducing Threads: A New Way to Share With Text | Meta - about.fb.com. https://about.fb.com/news/2023/07/introducing-threads-new-app-text-sharing/. [Accessed 29-04-2024].
20. Bluesky - news feed. https://bsky.app/profile/did:plc:kkf4naxqmweop7dv4l2iqqf5/feed/verified-news. [Accessed 29-04-2024].
21. Bluesky - cat pics feed. https://bsky.app/profile/did:plc:q6gjnaw2blty4crticxkmujt/feed/cv:cat. [Accessed 29-04-2024].
22. Bluesky Documentation. https://docs.bsky.app/. [Accessed 27-03-2024].
23. Glenski M., Pennycuff C. & Weninger T. Consumers and curators: Browsing and voting patterns on reddit. IEEE Transactions on Computational Social Systems 4, 196–206 (2017). doi: 10.1109/TCSS.2017.2742242
24. Camacho-collados J. et al. TweetNLP: Cutting-edge natural language processing for social media. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–49 (Association for Computational Linguistics, Abu Dhabi, UAE, 2022).
25. Failla A. & Rossetti G. Bluesky social dataset (2024). doi: 10.5281/zenodo.11082878
26. Battiston F. et al. Networks beyond pairwise interactions: Structure and dynamics. Physics Reports 874, 1–92 (2020). doi: 10.1016/j.physrep.2020.05.004
27. Wilkinson M. D. et al. The fair guiding principles for scientific data management and stewardship. Scientific data 3, 1–9 (2016). doi: 10.1038/sdata.2016.18
28. Bluesky App Privacy Policy. https://bsky.social/about/support/privacy-policy. [Accessed 27-03-2024].
29. Traud A. L., Mucha P. J. & Porter M. A. Social structure of facebook networks. Physica A: Statistical Mechanics and its Applications 391, 4165–4180 (2012). doi: 10.1016/j.physa.2011.12.021
30. Gerard P., Botzer N. & Weninger T. Truth social dataset. In Proceedings of the International AAAI Conference on Web and Social Media, vol. 17, 1034–1040 (2023).
31. Boccaletti S. et al. The structure and dynamics of multilayer networks. Physics reports 544, 1–122 (2014). doi: 10.1016/j.physrep.2014.07.001
32. Berlingerio M., Coscia M., Giannotti F., Monreale A. & Pedreschi D. Multidimensional networks: foundations of structural analysis. World Wide Web 16, 567–593 (2013). doi: 10.1007/s11280-012-0190-4
33. Newman M. E. Analysis of weighted networks. Physical review E 70, 056131 (2004). doi: 10.1103/PhysRevE.70.056131
34. Holme P. & Saramäki J. Temporal networks. Physics reports 519, 97–125 (2012). doi: 10.1016/j.physrep.2012.03.001
35. Rozenshtein P. & Gionis A. Mining temporal networks. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 3225–3226 (2019).
36. Conover M. et al. Political polarization on twitter. In Proceedings of the international aaai conference on web and social media, vol. 5, 89–96 (2011).
37. Aksoy S. G., Joslyn C., Marrero C. O., Praggastis B. & Purvine E. Hypernetwork science via high-order hypergraph walks. EPJ Data Science 9, 16 (2020). doi: 10.1140/epjds/s13688-020-00231-0
38. Marsili N. Retweeting: Its linguistic and epistemic value. Synthese 198, 10457–10483 (2021). doi: 10.1007/s11229-020-02731-y
39. La Cava L., Greco S. & Tagarelli A. Understanding the growth of the fediverse through the lens of mastodon. Applied network science 6, 1–35 (2021). doi: 10.1007/s41109-021-00392-5
40. Orabi M., Mouheb D., Al Aghbari Z. & Kamel I. Detection of bots in social media: a systematic review. Information Processing & Management 57, 102250 (2020). doi: 10.1016/j.ipm.2020.102250
41. Grootendorst M. Bertopic: Neural topic modeling with a class-based tf-idf procedure. arXiv preprint arXiv:2203.05794 (2022).
42. McInnes L., Healy J. & Melville J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018).
43. Rahman M. F. et al. Hdbscan: Density based clustering over location based services. arXiv preprint arXiv:1602.03730 (2016).
44. Sung M. Bluesky is under fire for allowing usernames with racial slurs | TechCrunch - techcrunch.com. https://techcrunch.com/2023/07/17/bluesky-racial-slurs-banned-list-usernames/. [Accessed 24-04-2024].

PLoS One. doi: 10.1371/journal.pone.0310330.r001

Decision Letter 0

Fabio Saracco

26 Aug 2024

PONE-D-24-24931

“I’m in the Bluesky Tonight”: Insights from a Year Worth of Social Data

PLOS ONE

Dear Dr. Failla,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Oct 10 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Fabio Saracco

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. In your Methods section, please include additional information about your dataset and ensure that you have included a statement specifying whether the collection and analysis method complied with the terms and conditions for the source of the data.

3. Thank you for stating the following financial disclosure:

“This work is supported by (i) the European Union – Horizon 2020 Program under the scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities”, Grant Agreement n.871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” (http://www.sobigdata.eu); (ii) SoBigData.it which receives funding from the European Union – NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it – Strengthening the Italian RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264 del 28/12/2021; (iii) EU NextGenerationEU programme under the funding schemes PNRR-PE-AI FAIR (Future Artificial Intelligence Research).”

Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

4. Thank you for stating the following in the Acknowledgments Section of your manuscript:

“All code related to generating, processing, and describing the dataset is released alongside it. The code requires Python 3.8 or higher and is mostly based on standard libraries and popular data science packages (e.g., numpy, pandas, matplotlib). The scripts for data collection also require atproto v0.0.46, the official Python wrapper for Bluesky Social API. Please refer to the API documentation for further details [22]. Before running data collection scripts, users should also set a USERNAME and PASSWORD environment variables with their Bluesky credentials, as some methods may require authentication.”

We note that you have provided funding information that is currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

5. We note that Figure 1 in your submission contains copyrighted images. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

1. You may seek permission from the original copyright holder of Figure 1 to publish the content specifically under the CC BY 4.0 license.

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

2. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

6. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

Dear authors,

The reviewers provided extremely positive reports. There are just a few small points to be clarified in order to improve the readability of the paper.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: With this publication, the authors provide the community with a dataset of approximately 235 million posts collected from the new social network 'Bluesky'. The data gathered cover about 80% of the accounts on the platform in terms of user involvement.

PROS/CONS

(+) The study presents an original dataset; no other Bluesky dataset with similar coverage appears to be available in the literature.

(+) The dataset represents an alternative approach to an important recent problem, i.e., the difficulty of accessing social media data in the "post-API era." Furthermore, the dataset represents the Bluesky social media at its initial stage.

(-) I believe that some aspects of the data collection could be better clarified by the authors.

This reviewer believes that the paper is well-structured and clear to read. Moreover, although I believe some aspects could be clarified further, the authors describe the data collection process allowing for the reproducibility of the approach. Also, given the recent developments and the increasing difficulty in accessing social data for research purposes ("post-API era"), I believe that the contribution is important for the community.

Here are two aspects regarding the data collection that I think should be clarified (I am confident that the authors will be able to manage them):

- The "relational" data and the posts might not be aligned? I will try to explain better. If I understand correctly, in the first phase, the relational data (followers of various accounts) are collected, and subsequently, all the posts published on the timelines of the collected users. Potentially, some follower/followee relationships might be added or removed during the collection of the posts. I would ask the authors to discuss whether this might represent a limitation in some studies and how to manage it (e.g., if it were a problem, to have "relational" data and posts aligned, one might consider only the posts published up until the start of data collection for the followers?)

- Text from row 98 to 101: I would like to know if the authors have encountered the same problem for replies. In other words, how did they separate replies from reposts? Does the metadata provided by Bluesky already provide this information?

Reviewer #2: The paper presents a very rich dataset of Bluesky users, relationships and interactions, made available by the authors together with the software used to collect it and preprocess it.

In addition, the authors present an exploratory analysis of the dataset, that provides a first picture of this OSN, its structure, dynamics and topics of discussion. Leveraging on the analysis, the authors discuss possible ways to make use of these data to advance open issues in computational social sciences.

I found the paper very easy and pleasant to read. While the analysis is not very deep, the dataset itself may be a very useful tool for other researchers, and the paper successfully delineates possible use cases and research directions.

I only have 2 very small comments:

- in line 43, the authors missed an upper case after a dot

- I would rename the "Results" section to something like "Data analysis"

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Nov 5;19(11):e0310330. doi: 10.1371/journal.pone.0310330.r002

Author response to Decision Letter 0

27 Aug 2024

We would like to thank the editor and the reviewers for their time and valuable feedback. Their suggestions and comments have significantly contributed to enhancing the quality of our paper. Below, we provide a detailed, point-by-point response to each report.

Academic Editor.

Q. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

A. We have reviewed the style requirements and ensured compliance.

Q. 2. In your Methods section, please include additional information about your dataset and ensure that you have included a statement specifying whether the collection and analysis method complied with the terms and conditions for the source of the data.

A. We moved the Ethics statement to the Materials and Methods section. The statement now specifies that the collection and analysis methods comply with the platform's terms of service.

Q. 5. We note that Figure 1 in your submission contains copyrighted images. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission.

A. We updated the figure removing potentially copyrighted material. Moreover, the caption now clarifies the image source (screenshot taken by the author from their own device).

Q. 3-4.

A. These changes were addressed in the cover letter as requested.

Reviewer 1.

Q. The "relational" data and the posts might not be aligned? I will try to explain better. If I understand correctly, in the first phase, the relational data (followers of various accounts) are collected, and subsequently, all the posts published on the timelines of the collected users. Potentially, some follower/followee relationships might be added or removed during the collection of the posts. I would ask the authors to discuss whether this might represent a limitation in some studies and how to manage it (e.g., if it were a problem, to have "relational" data and posts aligned, one might consider only the posts published up until the start of data collection for the followers?)

A. As the reviewer notes, the “photograph” we take of a social media platform is not instantaneous but gradual, requiring time to be fully captured. Consequently, changes such as relation/post addition/removal may be overlooked if they happen after the corresponding area of the network is parsed. Thus, considering that a degree of information loss is inevitable at such a scale, we took some measures to limit its repercussions. Specifically, we last parsed the follower network on March 19th (this process started and ended on the same day), and released posts until March 18th (i.e., immediately before starting to parse follower relations). By doing so, we ensure that the misalignment is kept to a minimum (about 24 hours). We clarified this aspect in lines 135-139 by highlighting the advantage of this filtering process.
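The filtering step described above can be sketched as follows. This is a minimal illustration, not the code used to build the dataset: the field name `created_at`, the helper `align_posts`, and the year 2024 in the cutoff are assumptions made for the example.

```python
from datetime import datetime, timezone

# Assumed start of the follower-network parse (March 19th; the year is an assumption).
SNAPSHOT_START = datetime(2024, 3, 19, tzinfo=timezone.utc)


def align_posts(posts, snapshot_start=SNAPSHOT_START):
    """Keep only posts created strictly before the follower snapshot began.

    `posts` is an iterable of dicts with a timezone-aware `created_at`
    datetime (a hypothetical field name used for illustration).
    """
    return [p for p in posts if p["created_at"] < snapshot_start]


# Usage: a post from March 18th is kept, one from March 19th is dropped.
sample_posts = [
    {"post_id": 1, "created_at": datetime(2024, 3, 18, 23, 59, tzinfo=timezone.utc)},
    {"post_id": 2, "created_at": datetime(2024, 3, 19, 0, 0, tzinfo=timezone.utc)},
]
aligned = align_posts(sample_posts)
```

With this cutoff, any follow relation that changed while the posts were being collected can differ from the released posts by at most about one day.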

Q. Text from row 98 to 101: I would like to know if the authors have encountered the same problem for replies. In other words, how did they separate replies from reposts? Does the metadata provided by Bluesky already provide this information?

A. Bluesky references the post to which a reply is linked within post metadata. As the reviewer noted, we did not specify this aspect in the previous version of the manuscript. In the attached version, we state that replies and quotes have this relevant information as metadata, as opposed to reposts (lines 99-101).
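As a rough illustration of this distinction, a post record could be labeled by checking which reference it carries. The field names `reply_to` and `quotes` and the function `interaction_type` are hypothetical and may not match the released schema; reposts are not handled here because, as stated above, they do not carry such a reference in the post metadata.

```python
def interaction_type(post: dict) -> str:
    """Label a post record by the reference stored in its metadata.

    `reply_to` and `quotes` are assumed field names holding the identifier
    of the referenced post, or None when no such reference exists.
    """
    if post.get("reply_to") is not None:
        return "reply"
    if post.get("quotes") is not None:
        return "quote"
    return "original"


# Usage with three toy records.
labels = [
    interaction_type({"post_id": 10, "reply_to": 7, "quotes": None}),
    interaction_type({"post_id": 11, "reply_to": None, "quotes": 3}),
    interaction_type({"post_id": 12}),
]
```

In this sketch, a record that both replies to and quotes another post is labeled as a reply; a study that needs both links should read the two fields separately.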

Reviewer 2.

Q. in line 43, the authors missed an upper case after a dot

A. The typo is now fixed.

Q. I would rename the "Results" section to something like "Data analysis"

A. The Results section was renamed to Data Analysis.

Attachment

Submitted filename: Bluesky Rebuttal.pdf

pone.0310330.s001.pdf (56.6KB, pdf)

PLoS One. doi: 10.1371/journal.pone.0310330.r003

Decision Letter 1

Fabio Saracco

29 Aug 2024

“I’m in the Bluesky Tonight”: Insights from a Year Worth of Social Data

PONE-D-24-24931R1

Dear Dr. Failla,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information’ link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Fabio Saracco

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Dear authors,

since all the requests raised by the referees were properly addressed, I believe that the manuscript is ready for publication. Congratulations!

Best,

Fabio Saracco

Reviewers' comments:

PLoS One. doi: 10.1371/journal.pone.0310330.r004

Acceptance letter

Fabio Saracco

3 Sep 2024

PONE-D-24-24931R1

PLOS ONE

Dear Dr. Failla,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Fabio Saracco

Academic Editor

PLOS ONE

Associated Data


Data Availability Statement

All data and code are available in a dedicated zenodo repository at https://zenodo.org/records/11082879.

1. Bouvier G. & Machin D. Critical discourse analysis and the challenges and opportunities of social media. Critical Discourse Studies and/in Communication 39–53 (2020). doi: 10.4324/9781003050353-3

2. Botzer N., Gu S. & Weninger T. Analysis of moral judgment on reddit. IEEE Transactions on Computational Social Systems (2022).

3. Aïmeur E., Amri S. & Brassard G. Fake news, disinformation and misinformation in social media: a review. Social Network Analysis and Mining 13, 30 (2023). doi: 10.1007/s13278-023-01028-5

4. Saurwein F. & Spencer-Smith C. Combating disinformation on social media: Multilevel governance and distributed accountability in europe. Digital journalism 8, 820–841 (2020). doi: 10.1080/21670811.2020.1765401

5. Caldarelli G., De Nicola R., Del Vigna F., Petrocchi M. & Saracco F. The role of bot squads in the political propaganda on twitter. Communications Physics 3, 81 (2020). doi: 10.1038/s42005-020-0340-4

6. Nizzoli, L., Tardelli, S., Avvenuti, M., Cresci, S. & Tesconi, M. Coordinated behavior on social media in 2019 uk general election. In Proceedings of the International AAAI Conference on Web and Social Media, vol. 15, 443–454 (2021).

7. Goldenberg A. et al. Homophily and acrophily as drivers of political segregation. Nature Human Behaviour 7, 219–230 (2023). doi: 10.1038/s41562-022-01474-9

8. Wang, E. L., Luceri, L., Pierri, F. & Ferrara, E. Identifying and characterizing behavioral classes of radicalization within the qanon conspiracy on twitter. In Proceedings of the International AAAI Conference on Web and Social Media, vol. 17, 890–901 (2023).

9. Cinelli M., De Francisci Morales G., Galeazzi A., Quattrociocchi W. & Starnini M. The echo chamber effect on social media. Proceedings of the National Academy of Sciences 118, e2023301118 (2021). doi: 10.1073/pnas.2023301118

10. Garimella K., Morales G. D. F., Gionis A. & Mathioudakis M. Quantifying controversy on social media. ACM Transactions on Social Computing 1, 1–27 (2018). doi: 10.1145/3140565

11. Weng, J. & Lee, B.-S. Event detection in twitter. In Proceedings of the international aaai conference on web and social media, vol. 5, 401–408 (2011).

12. Hassan, N. et al. Towards automated sexual violence report tracking. In Proceedings of the international AAAI conference on web and social media, vol. 14, 250–259 (2020).

13. Trezza D. To scrape or not to scrape, this is dilemma. the post-api scenario and implications on digital research. Frontiers in Sociology 8, 1145038 (2023). doi: 10.3389/fsoc.2023.1145038

14. Poudel, A. & Weninger, T. Navigating the post-api dilemma search engine results pages present a biased view of social media data. arXiv preprint arXiv:2401.15479 (2024).

15. Kleppmann, M. et al. Bluesky and the at protocol: Usable decentralized social media. arXiv preprint arXiv:2402.03239 (2024).

16. Quelle, D. & Bovet, A. Bluesky: Network Topology, Polarisation, and Algorithmic Curation. arXiv preprint arXiv:2405.17571 (2024).

17. @bsky.app’s post published feb 22, 2024 at 21:04. https://bsky.app/profile/bsky.app/post/3klzrudt4uk2z. [Accessed 27-03-2024].

18. Jeong, U. et al. User migration across multiple social media platforms. arXiv preprint arXiv:2309.12613 (2023).

19. Introducing Threads: A New Way to Share With Text | Meta - about.fb.com. https://about.fb.com/news/2023/07/introducing-threads-new-app-text-sharing/. [Accessed 29-04-2024].

20. Bluesky - news feed. https://bsky.app/profile/did:plc:kkf4naxqmweop7dv4l2iqqf5/feed/verified-news. [Accessed 29-04-2024].

21. Bluesky - cat pics feed. https://bsky.app/profile/did:plc:q6gjnaw2blty4crticxkmujt/feed/cv:cat. [Accessed 29-04-2024].

22. Bluesky Documentation. https://docs.bsky.app/. [Accessed 27-03-2024].

23. Glenski M., Pennycuff C. & Weninger T. Consumers and curators: Browsing and voting patterns on reddit. IEEE Transactions on Computational Social Systems 4, 196–206 (2017). doi: 10.1109/TCSS.2017.2742242

24. Camacho-collados, J. et al. TweetNLP: Cutting-edge natural language processing for social media. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–49 (Association for Computational Linguistics, Abu Dhabi, UAE, 2022).

25. Failla, A. & Rossetti, G. Bluesky social dataset, 10.5281/zenodo.11082878 (2024).

26. Battiston F. et al. Networks beyond pairwise interactions: Structure and dynamics. Physics Reports 874, 1–92 (2020). doi: 10.1016/j.physrep.2020.05.004

27. Wilkinson M. D. et al. The fair guiding principles for scientific data management and stewardship. Scientific data 3, 1–9 (2016). doi: 10.1038/sdata.2016.18

28. Bluesky App Privacy Policy. https://bsky.social/about/support/privacy-policy. [Accessed 27-03-2024].

29. Traud A. L., Mucha P. J. & Porter M. A. Social structure of facebook networks. Physica A: Statistical Mechanics and its Applications 391, 4165–4180 (2012). doi: 10.1016/j.physa.2011.12.021

30. Gerard, P., Botzer, N. & Weninger, T. Truth social dataset. In Proceedings of the International AAAI Conference on Web and Social Media, vol. 17, 1034–1040 (2023).

31. Boccaletti S. et al. The structure and dynamics of multilayer networks. Physics reports 544, 1–122 (2014). doi: 10.1016/j.physrep.2014.07.001

32. Berlingerio M., Coscia M., Giannotti F., Monreale A. & Pedreschi D. Multidimensional networks: foundations of structural analysis. World Wide Web 16, 567–593 (2013). doi: 10.1007/s11280-012-0190-4

33. Newman M. E. Analysis of weighted networks. Physical review E 70, 056131 (2004). doi: 10.1103/PhysRevE.70.056131

34. Holme P. & Saramäki J. Temporal networks. Physics reports 519, 97–125 (2012). doi: 10.1016/j.physrep.2012.03.001

35. Rozenshtein, P. & Gionis, A. Mining temporal networks. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 3225–3226 (2019).

36. Conover, M. et al. Political polarization on twitter. In Proceedings of the international aaai conference on web and social media, vol. 5, 89–96 (2011).

37. Aksoy S. G., Joslyn C., Marrero C. O., Praggastis B. & Purvine E. Hypernetwork science via high-order hypergraph walks. EPJ Data Science 9, 16 (2020). doi: 10.1140/epjds/s13688-020-00231-0

38. Marsili N. Retweeting: Its linguistic and epistemic value. Synthese 198, 10457–10483 (2021). doi: 10.1007/s11229-020-02731-y

39. La Cava L., Greco S. & Tagarelli A. Understanding the growth of the fediverse through the lens of mastodon. Applied network science 6, 1–35 (2021). doi: 10.1007/s41109-021-00392-5

40. Orabi M., Mouheb D., Al Aghbari Z. & Kamel I. Detection of bots in social media: a systematic review. Information Processing & Management 57, 102250 (2020). doi: 10.1016/j.ipm.2020.102250

41. Grootendorst, M. Bertopic: Neural topic modeling with a class-based tf-idf procedure. arXiv preprint arXiv:2203.05794 (2022).

42. McInnes, L., Healy, J. & Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018).

43. Rahman, M. F. et al. Hdbscan: Density based clustering over location based services. arXiv preprint arXiv:1602.03730 (2016).

44. Sung, M. Bluesky is under fire for allowing usernames with racial slurs | TechCrunch - techcrunch.com. https://techcrunch.com/2023/07/17/bluesky-racial-slurs-banned-list-usernames/. [Accessed 24-04-2024].
