OpenAI Shuts Down Book Collection Used for AI Training

Newly unsealed court documents show that OpenAI deleted two enormous datasets containing published books, a disclosure that could have far-reaching implications for AI development. The datasets, named “books1” and “books2,” were allegedly used to train the company’s GPT-3 language model and no longer exist.

The disclosure came as part of an ongoing class action lawsuit brought by the Authors Guild against OpenAI, in which the organization accuses the company of using copyrighted materials to train its AI models without compensation or permission. At the heart of the Authors Guild’s case are the “books1” and “books2” datasets, which lawyers claim likely contained “more than 100,000 published books.”

OpenAI’s Reluctant Disclosure and Shocking Admission

For months, the Authors Guild has pressed OpenAI for information about these datasets. The company initially resisted the Guild’s inquiries, citing confidentiality concerns. It eventually revealed that it had deleted all copies of the data, according to legal filings reviewed by Business Insider.

The admission sent shockwaves through the tech and legal communities, raising questions about OpenAI’s transparency and the potential impact on ongoing litigation. The Authors Guild has long maintained that the use of copyrighted material in AI training without proper licensing or compensation is a violation of intellectual property rights and has vowed to hold tech companies accountable.

The High-Stakes Battle Over AI Training Data

The revelation of OpenAI’s dataset deletion underscores the crucial role that high-quality training data plays in the development of cutting-edge AI models like GPT-3. OpenAI, along with other tech giants, has relied heavily on data sourced from the internet, including countless books, to build these sophisticated systems capable of generating human-like text and performing complex language tasks.

However, many content creators argue that they deserve a slice of the pie for essentially providing the raw intelligence that powers these AI products. They contend that the use of their work without proper compensation amounts to a form of digital exploitation and that tech companies should not be allowed to profit from their creations without sharing the rewards.

This dispute has now spilled over into the courts, with multiple lawsuits pitting content creators against tech companies in a high-stakes battle over the future of AI development. The outcome of these cases could have profound implications for the way in which AI is built and deployed, as well as for the relationship between technology firms and the creators whose work fuels their innovations.

OpenAI’s 2020 White Paper Sheds Light on Datasets’ Significance

The true scale and importance of the “books1” and “books2” datasets were hinted at in a 2020 white paper published by OpenAI. In the document, the company described the datasets as “internet-based books corpora” that accounted for a staggering 16% of the training data used to create GPT-3, one of the most advanced language models ever developed.

The white paper also revealed that the two datasets contained 67 billion tokens of data, roughly equivalent to 50 billion words. To put that figure in perspective, the King James Bible, a substantial text in its own right, contains just 783,137 words. By that measure, the deleted datasets held roughly 64,000 times as many words as one of the most famous books in history, underscoring the vast amount of literary material that may have been used in GPT-3’s training.
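The comparison above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the figures quoted in this article; the words-per-token ratio is not an official conversion factor, just the ratio implied by the 67 billion token and 50 billion word figures.

```python
# Figures as quoted in the article; variable names are illustrative only.
TOKENS = 67_000_000_000   # combined size of "books1" and "books2" in tokens
WORDS = 50_000_000_000    # the article's approximate word equivalent
KJV_WORDS = 783_137       # King James Bible word count cited in the article

# Ratio implied by the two quoted figures (an approximation, not a standard).
words_per_token = WORDS / TOKENS

# How many King James Bibles' worth of words the datasets would represent.
bible_equivalents = WORDS / KJV_WORDS

print(f"Implied words per token: {words_per_token:.2f}")
print(f"King James Bible equivalents: {bible_equivalents:,.0f}")
```

The result is on the order of 64,000 Bible-length texts, or between four and five orders of magnitude more words than the King James Bible.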

The Mysterious Disappearance of “books1” and “books2”

The circumstances surrounding the deletion of the “books1” and “books2” datasets remain murky. In an unsealed letter from OpenAI’s lawyers, marked “highly confidential—for attorneys’ eyes only,” the company claimed that it had discontinued the use of the datasets for model training in late 2021 and had subsequently deleted them entirely in mid-2022 due to their non-use.

This explanation has raised eyebrows among legal experts and industry observers, who question the timing and motivation behind the decision. Some have speculated that OpenAI may have been attempting to shield itself from potential legal liability by erasing evidence of its use of copyrighted material, while others suggest that the deletion may have been a routine matter of data management and storage optimization.

The Authors Guild’s Demand for Transparency

Adding to the intrigue surrounding the deleted datasets is the revelation that the two researchers who created “books1” and “books2” are no longer employed by OpenAI. The startup initially refused to disclose the identities of these individuals but has since provided their names to the Authors Guild’s lawyers in response to legal pressure.

However, OpenAI is now petitioning the court to keep the names of the two employees, as well as information about the datasets, under seal, arguing that the disclosure of this information could harm its business interests and competitive position. The Authors Guild, on the other hand, is pushing back against this request, arguing that the public has a right to know the full details of OpenAI’s data practices and that transparency is essential for holding the company accountable.

OpenAI Responds

In a statement released on Tuesday, OpenAI sought to distance its current AI models from the controversial datasets, stating, “The models powering ChatGPT and our API today were not developed using these datasets. These datasets, created by former employees who are no longer with OpenAI, were last used in 2021 and deleted due to non-use in 2022.”

While this statement may offer some reassurance to users of OpenAI’s current products, it does not address the broader questions raised by the Authors Guild’s lawsuit and the revelation of the deleted datasets. The tech world and content creators alike are watching closely to see how the courts will address the complex issues surrounding AI training data and intellectual property rights.

A Watershed Moment for AI and Intellectual Property

The legal battle between the Authors Guild and OpenAI represents a watershed moment in the ongoing debate over the use of copyrighted material in AI development. As language models like GPT-3 grow more sophisticated and generate text that is virtually indistinguishable from human writing, the line between machine learning and intellectual property infringement is becoming harder to draw.

For content creators, the stakes could not be higher. Many authors, journalists, and other writers fear that the unchecked use of their work in AI training could lead to a future in which their livelihoods are threatened by machines capable of replicating their style and content at a fraction of the cost. They argue that just as musicians and filmmakers are compensated for the use of their work in streaming services and other digital platforms, writers should be fairly compensated when their creations are used to train AI models.

On the other hand, tech companies argue that the use of publicly available data, including books and other written works, is essential for the development of cutting-edge AI systems that have the potential to revolutionize industries and improve people’s lives. They contend that the benefits of AI development outweigh the potential harms to individual content creators and that imposing strict licensing requirements or compensation schemes could stifle innovation and slow the pace of progress.

The Road Ahead

As the legal battle between the Authors Guild and OpenAI unfolds, observers in the tech world and beyond will be watching to see how the courts navigate this complex and contentious issue. The outcome of this case, and others like it, could have far-reaching implications for the future of AI development, intellectual property law, and the delicate balance between technological innovation and the rights of creators.

Regardless of the ultimate outcome, one thing is clear: the revelation of OpenAI’s deletion of the “books1” and “books2” datasets has opened a new chapter in the ongoing story of AI and its impact on society. As we grapple with the profound questions raised by this technology, it is more important than ever that we engage in open, honest dialogue and work together to find solutions that benefit everyone.

The road ahead may be uncertain, but the future of AI and our relationship to it will be shaped by the decisions we make today. Let us hope that we have the wisdom and foresight to chart a course that respects the rights of creators while also unlocking the incredible potential of this transformative technology.

The information is taken from Business Insider and The Verge.

