Bluesky is grappling with a significant privacy issue after one million public posts were scraped from its platform for AI training, according to a 404 Media report. The dataset, compiled by Daniel van Strien, a machine learning librarian at the AI company Hugging Face, was intended for research in natural language processing and social media analysis. Although Bluesky’s representatives say the platform will never train generative AI on user data, the open nature of its API leaves its public posts exposed to external scrapers.
Bluesky faces privacy concerns over scraped user posts
The dataset was sourced through Bluesky’s Firehose API, which provides an aggregated stream of public data updates, including posts, likes, and follows. Van Strien intended the dataset to advance machine learning research. However, it included not only the text of posts but also users’ decentralized identifiers (DIDs) and metadata. After media reports highlighted the issue, the dataset was swiftly removed from Hugging Face amid backlash over user privacy and the lack of consent.
Bluesky users did not give explicit permission for their posts to be used this way, though Bluesky’s policies do not categorically prohibit such use. The core of the controversy lies in the open structure of Bluesky’s API, which allows third-party developers to access its public data freely. According to a statement from a Bluesky representative, “we’d like to find a way for Bluesky users to communicate to outside orgs/developers whether they consent to this,” indicating an effort to give users more control over data sharing in the future.
Following the removal of the dataset, van Strien acknowledged that his data collection approach had fallen short on transparency and consent. “I apologize for this mistake,” he stated in a follow-up post on Bluesky. The incident is a reminder to users that any content shared publicly on the platform is accessible to external entities. As the platform continues to grow, recently surpassing 20 million users, Bluesky will likely face increasing scrutiny of its data protection measures and user privacy.
Bluesky is currently discussing mechanisms that would let users express their consent preferences to third parties. However, enforcement remains a challenge: as the platform noted, it will ultimately be up to outside developers to honor these preferences. Bluesky’s representatives added that while they intend to work through the issue with engineers and legal teams, no immediate solution is available.
Featured image credit: Bluesky