ChatGPT’s data does not add up, according to recent research that sheds light on the discrepancies between AI training data and real-world usage patterns.
The study reveals surprising misalignments between the content that large language models like ChatGPT are trained on and how people actually use these AI assistants in practice. Let’s dive into the details and explore what they mean for the future of AI development.
The study, conducted by researchers examining web crawl data and ChatGPT usage logs, uncovered several key findings that challenge our assumptions about AI training data.
By comparing the types of web content most commonly crawled for AI training with actual user interactions recorded in ChatGPT conversations, the researchers identified significant gaps between the data used to train these models and their practical applications.
ChatGPT’s foundations are shaky
One of the most striking discoveries was the mismatch between the prevalence of news content in training data and its relative scarcity in real-world ChatGPT queries. While news websites comprised nearly 40% of the tokens in the head distribution of crawled web domains, less than 1% of ChatGPT queries were related to news or current affairs. This raises questions about the efficiency and relevance of using such a large proportion of news content in training data when users appear to have limited interest in news-related queries.
Another surprising finding was the high frequency of creative writing and role-playing requests in ChatGPT conversations, despite the relative lack of such content in the training data. Over 30% of user interactions involved requests for fictional story writing, creative compositions, or role-playing scenarios. This suggests that AI models may be underprepared for these popular use cases, potentially leading to suboptimal performance in these areas.
The data dilemma
A closer look at the research findings reveals a complex web of data sources and usage patterns that don’t quite align. The study examined three major web-crawled datasets commonly used for AI training: C4, RefinedWeb, and Dolma. These datasets, derived from Common Crawl snapshots, represent a significant portion of the “data commons” used to train large language models.
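For readers who want to inspect these corpora firsthand, mirrors of all three are published on the Hugging Face Hub. Below is a minimal sketch of sampling one of them in streaming mode, which avoids downloading the full multi-terabyte crawl. The dataset ID and field names follow the commonly used allenai/c4 mirror and are an assumption on our part, not something taken from the study itself:

```python
# Stream a few documents from C4, one of the web-crawled corpora the study
# examined. The dataset ID and fields ("url", "text") follow the Hugging
# Face mirror allenai/c4 -- an assumption to verify, not part of the study.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, doc in enumerate(c4):
    print(doc["url"])           # source URL of the crawled web page
    print(doc["text"][:200])    # first 200 characters of the document text
    print("---")
    if i >= 2:                  # stop after a handful of samples
        break
```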
However, the composition of these datasets differs markedly from how people use ChatGPT in practice. For instance, the head distribution of web domains in the training data is dominated by news sites, encyclopedias, and social media platforms.
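The “head distribution” simply means the small set of domains that contribute the most tokens. Given a streamed sample like the one above, a rough per-domain tally might look like the following sketch; whitespace splitting stands in for the study’s actual tokenizer, and the 10,000-document sample is far smaller than the full corpora the researchers measured:

```python
# Approximate the head of the domain distribution: which source domains
# contribute the most tokens in a sample of the crawled corpus.
from collections import Counter
from urllib.parse import urlparse

from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

tokens_per_domain = Counter()
for i, doc in enumerate(c4):
    domain = urlparse(doc["url"]).netloc
    # Whitespace word count is a crude proxy for real tokenizer counts.
    tokens_per_domain[domain] += len(doc["text"].split())
    if i >= 10_000:
        break

# The "head" of the distribution: top domains by approximate token share.
total = sum(tokens_per_domain.values())
for domain, n in tokens_per_domain.most_common(20):
    print(f"{domain}\t{100 * n / total:.2f}%")
```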
In contrast, real-world ChatGPT usage shows a preference for creative tasks, general information queries, and even sexual content – areas that are either underrepresented or actively filtered out of training datasets.
This misalignment raises important questions about the effectiveness of current data collection and curation practices for AI training. If the data used to train these models doesn’t reflect their actual use cases, how can we expect them to perform optimally in real-world scenarios?
The consent conundrum
Adding another layer of complexity to the data puzzle is the rapidly changing landscape of web consent for AI training. The research uncovered a significant increase in the restrictions that website owners place on web crawlers, particularly crawlers associated with AI development.
In just one year, from April 2023 to April 2024, the percentage of tokens restricted by robots.txt files in major corpora like C4 and RefinedWeb increased by over 500%.
This trend, if it continues, could severely impact the availability of high-quality training data for future AI models.
Moreover, the study found inconsistencies in how websites communicate their data use preferences. Many sites have contradictory instructions in their robots.txt files and Terms of Service agreements, leading to confusion about what data can be used for AI training. This lack of clarity poses challenges for both AI developers and website owners trying to protect their content.
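To see what these machine-readable preferences look like in practice, you can query any site’s robots.txt for specific crawlers using Python’s standard-library robotparser. In the sketch below, GPTBot (OpenAI) and CCBot (Common Crawl) are real, publicly documented user-agent tokens, while example.com is purely illustrative:

```python
# Check whether a site's robots.txt allows well-known AI-related crawlers.
from urllib.robotparser import RobotFileParser

SITE = "https://example.com"          # illustrative placeholder domain
CRAWLERS = ["GPTBot", "CCBot", "*"]   # OpenAI, Common Crawl, everyone else

parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # fetch and parse the robots.txt file

for agent in CRAWLERS:
    verdict = "allowed" if parser.can_fetch(agent, f"{SITE}/") else "disallowed"
    print(f"{agent}: {verdict} at {SITE}/")
```

Note that robots.txt is advisory and machine-readable, while Terms of Service are legal prose aimed at humans; nothing in the protocol reconciles the two, which is exactly where the contradictions the researchers describe creep in.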
The sexual content surprise in ChatGPT
Perhaps one of the most unexpected findings of the study was the prevalence of sexual content requests in ChatGPT interactions. While sensitive or explicit content made up less than 1% of the web domains in the training data, sexual role-play accounted for 12% of all recorded user interactions in the study’s dataset.
This discrepancy highlights a significant gap between the sanitized training data used by AI companies and the actual desires of users. It also raises ethical questions about how AI models should handle such requests, given that most have been explicitly trained to avoid generating explicit content.