Washington: The global race to build more powerful artificial intelligence systems has come at an unexpected cost — books. US-based AI company Anthropic scanned and destroyed nearly two million physical books as part of a covert internal programme designed to rapidly expand the data used to train its chatbots, according to court documents unsealed in an ongoing copyright lawsuit.
The internal effort, referred to within the company as “Project Panama,” involved purchasing books in bulk, physically dismantling them for high-speed scanning, and recycling the paper copies once digital versions were created. The disclosures have offered one of the clearest looks yet at how aggressively AI companies are pursuing high-quality training data as competition in the sector intensifies.
Court filings show Anthropic opted for physical book acquisition rather than negotiating licences at scale. Executives argued that books represented the most concentrated and carefully edited form of human knowledge, capable of teaching AI systems better reasoning, structure and language than fragmented online content.
Millions of books processed in months
Vendor proposals and legal records indicate that Project Panama targeted between 500,000 and two million books, to be processed within roughly six months. Tens of millions of dollars were spent on purchasing volumes, transporting them, and using industrial scanning services to digitise the material.
Once acquired, books were sent to specialised processing centres where hydraulic machines removed their spines. Individual pages were then scanned using high-speed production equipment. After digitisation, the paper copies were scheduled for recycling, making the process intentionally irreversible — no physical archive was preserved.
Preservation experts note that this approach marked a departure from earlier mass digitisation efforts, such as Google Books, which typically retained original copies. In contrast, Project Panama prioritised speed and exclusivity over public access or long-term conservation.
Court records also suggest Anthropic viewed destructive scanning as a safer alternative to downloading large pirated digital libraries. However, the strategy drew criticism for eliminating physical books while offering no transparency about which titles were included.
Court rulings and authors’ backlash
A US federal judge later ruled that training AI models on copyrighted books can qualify as fair use if the process is transformative. At the same time, the court made clear that the method of acquiring training data remains legally significant, even if the training itself is allowed.
The revelations triggered strong reactions from authors and publishing groups, many of whom argue that their work was used without consent or compensation. Critics say the case highlights a growing imbalance between technology firms and creators whose work underpins modern AI systems.
In 2025, Anthropic agreed to pay $1.5 billion to settle claims related to its earlier use of pirated books. Under the settlement, affected authors are eligible for compensation of roughly $3,000 per title, though amounts vary. The company did not admit wrongdoing, stating that the settlement addressed acquisition practices rather than the legality of AI training.
Part of a wider industry data race
Project Panama is not an isolated episode. Court filings in other cases show that Meta, OpenAI and Google have all faced legal scrutiny over their use of book datasets for AI training. Legal scholars argue that practices once common in academic research were carried into a high-stakes commercial arms race, with legal risks only addressed after massive investments had already been made.
According to experts, the Anthropic case exposes the industrial reality behind consumer-facing chatbots — one defined by heavy capital spending, irreversible data extraction and unresolved copyright tensions. It has also sharpened debate over whether existing copyright laws are equipped to regulate machine learning at this scale.
