OpenAI Shuts Down Book Collection Used for AI Training

Newly unsealed court documents show that OpenAI deleted two enormous datasets of published books, a controversial decision that could have far-reaching implications for the future of AI development. The datasets, named “books1” and “books2,” were allegedly used to train the company’s GPT-3 language model but no longer exist.

The bombshell discovery came as part of an ongoing class action lawsuit brought by the Authors Guild against OpenAI, in which the organization accuses the tech giant of using copyrighted materials to train its AI models without proper compensation or permission. At the heart of the Authors Guild’s case are the mysterious “books1” and “books2” datasets, which lawyers claim likely contained “more than 100,000 published books.”

OpenAI’s Reluctant Disclosure and Shocking Admission

For months, the Authors Guild has been engaged in a legal tug-of-war with OpenAI, demanding information about these datasets. The company initially resisted the Guild’s inquiries, citing confidentiality concerns. It eventually revealed that it had deleted all copies of the data, as detailed in legal filings reviewed by Business Insider.

The admission sent shockwaves through the tech and legal communities, raising questions about OpenAI’s transparency and the potential impact on ongoing litigation. The Authors Guild has long maintained that the use of copyrighted material in AI training without proper licensing or compensation is a violation of intellectual property rights and has vowed to hold tech companies accountable.

The High-Stakes Battle Over AI Training Data

The revelation of OpenAI’s dataset deletion underscores the crucial role that high-quality training data plays in the development of cutting-edge AI models like GPT-3. OpenAI, along with other tech giants, has relied heavily on data sourced from the internet, including countless books, to build these sophisticated systems capable of generating human-like text and performing complex language tasks.

However, many content creators argue that they deserve a slice of the pie for essentially providing the raw intelligence that powers these AI products. They contend that the use of their work without proper compensation amounts to a form of digital exploitation and that tech companies should not be allowed to profit from their creations without sharing the rewards.

This dispute has now spilled over into the courts, with multiple lawsuits pitting content creators against tech companies in a high-stakes battle over the future of AI development. The outcome of these cases could have profound implications for the way in which AI is built and deployed, as well as for the relationship between technology firms and the creators whose work fuels their innovations.

OpenAI’s 2020 White Paper Sheds Light on Datasets’ Significance

The true scale and importance of the “books1” and “books2” datasets were hinted at in a 2020 white paper published by OpenAI. In the document, the company described the datasets as “internet-based books corpora” that accounted for a staggering 16% of the training data used to create GPT-3, one of the most advanced language models ever developed.

The white paper also reported that the two datasets contained 67 billion tokens of data, roughly equivalent to 50 billion words. To put that figure in perspective, the King James Bible, a substantial text in its own right, contains just 783,137 words. The deleted datasets therefore held roughly 64,000 times as much text as one of the most famous books in history, underscoring the vast amount of literary material that may have been used in GPT-3’s training.
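The scale comparison above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures quoted in this article; the words-per-token ratio it prints is simply what those figures imply, not an official conversion factor, and the variable names are illustrative.

```python
# Figures quoted in the article.
TOKENS = 67_000_000_000   # combined size of "books1" and "books2", in tokens
WORDS = 50_000_000_000    # the article's approximate word count for those tokens
KJV_WORDS = 783_137       # King James Bible word count cited in the article

# Assumption: this ratio is derived from the two figures above,
# not taken from any OpenAI documentation.
words_per_token = WORDS / TOKENS

# How many King James Bibles' worth of text the datasets would hold.
bible_equivalents = WORDS / KJV_WORDS

print(f"Implied words per token: {words_per_token:.2f}")
print(f"King James Bible equivalents: {bible_equivalents:,.0f}")
```

On these numbers the datasets come out at a little under 64,000 King James Bibles, which is between four and five orders of magnitude larger than the single book.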

The Mysterious Disappearance of “books1” and “books2”

The circumstances surrounding the deletion of the “books1” and “books2” datasets remain murky. In an unsealed letter from OpenAI’s lawyers, marked “highly confidential—for attorneys’ eyes only,” the company claimed that it had discontinued the use of the datasets for model training in late 2021 and had subsequently deleted them entirely in mid-2022 due to their non-use.

This explanation has raised eyebrows among legal experts and industry observers, who question the timing and motivation behind the decision. Some have speculated that OpenAI may have been attempting to shield itself from potential legal liability by erasing evidence of its use of copyrighted material, while others suggest that the deletion may have been a routine matter of data management and storage optimization.

The Authors Guild’s Demand for Transparency

Adding to the intrigue surrounding the deleted datasets is the revelation that the two researchers who created “books1” and “books2” are no longer employed by OpenAI. The startup initially refused to disclose the identities of these individuals but has since provided their names to the Authors Guild’s lawyers in response to legal pressure.

However, OpenAI is now petitioning the court to keep the names of the two employees, as well as information about the datasets, under seal, arguing that the disclosure of this information could harm its business interests and competitive position. The Authors Guild, on the other hand, is pushing back against this request, arguing that the public has a right to know the full details of OpenAI’s data practices and that transparency is essential for holding the company accountable.

OpenAI Responds

In a statement released on Tuesday, OpenAI sought to distance its current AI models from the controversial datasets, stating, “The models powering ChatGPT and our API today were not developed using these datasets. These datasets, created by former employees who are no longer with OpenAI, were last used in 2021 and deleted due to non-use in 2022.”

While this statement may offer some reassurance to users of OpenAI’s current products, it does not address the broader questions raised by the Authors Guild’s lawsuit and the revelation of the deleted datasets. The tech world and content creators alike are watching closely to see how the courts will address the complex issues surrounding AI training data and intellectual property rights.

A Watershed Moment for AI and Intellectual Property

The legal battle between the Authors Guild and OpenAI represents a watershed moment in the ongoing debate over the use of copyrighted material in AI development. As language models like GPT-3 grow more sophisticated and generate text that is hard to distinguish from human writing, the line between machine learning and intellectual property infringement is becoming harder to draw.

For content creators, the stakes could not be higher. Many authors, journalists, and other writers fear that the unchecked use of their work in AI training could lead to a future in which their livelihoods are threatened by machines capable of replicating their style and content at a fraction of the cost. They argue that just as musicians and filmmakers are compensated for the use of their work in streaming services and other digital platforms, writers should be fairly compensated when their creations are used to train AI models.

On the other hand, tech companies argue that the use of publicly available data, including books and other written works, is essential for the development of cutting-edge AI systems that have the potential to revolutionize industries and improve people’s lives. They contend that the benefits of AI development outweigh the potential harms to individual content creators and that imposing strict licensing requirements or compensation schemes could stifle innovation and slow the pace of progress.

The Road Ahead

As the legal battle between the Authors Guild and OpenAI unfolds, the tech world and beyond will be watching to see how the courts navigate this complex and contentious issue. The outcome of this case, and others like it, could have far-reaching implications for the future of AI development, intellectual property law, and the delicate balance between technological innovation and the rights of creators.

Regardless of the ultimate outcome, one thing is clear: the revelation of OpenAI’s deletion of the “books1” and “books2” datasets has opened a new chapter in the ongoing story of AI and its impact on society. As we grapple with the profound questions raised by this technology, it is more important than ever that we engage in open, honest dialogue and work together to find solutions that benefit everyone.

The road ahead may be uncertain, but the future of AI and our relationship to it will be shaped by the decisions we make today. Let us hope that we have the wisdom and foresight to chart a course that respects the rights of creators while also unlocking the incredible potential of this transformative technology.

The information in this article is taken from Business Insider and The Verge.

