

Massive Data Leak: 12,000 API Keys and Passwords Exposed in AI Training Dataset


Researchers at Truffle Security have uncovered nearly 12,000 valid API keys and passwords in the Common Crawl dataset, a massive open-source web archive used to train artificial intelligence models. The dataset, widely used by OpenAI, Google, Meta, Anthropic, Stability AI, and others, contains petabytes of web data collected since 2008.

Findings: Exposed AWS Root Keys, MailChimp API Keys, and Slack Webhooks

Analyzing 400 terabytes of data from 2.67 billion web pages in the December 2024 Common Crawl archive, Truffle Security found 11,908 valid authentication credentials that were hardcoded into public web pages by developers. The exposed secrets included:

  • Amazon Web Services (AWS) root keys
  • MailChimp API keys (nearly 1,500 leaked in front-end HTML and JavaScript)
  • WalkScore API keys, one of which appeared 57,029 times across 1,871 subdomains
  • Slack webhooks, with one webpage revealing 17 unique live webhook URLs

The leak poses a serious security risk, as attackers could exploit these credentials for phishing attacks, brand impersonation, and unauthorized access to sensitive data.
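The article does not detail how Truffle Security validated these credentials, but secret scanners of this kind typically pattern-match raw page content against known key formats. A minimal sketch of that idea, assuming simple illustrative regexes (AWS access key IDs begin with `AKIA`; Slack webhooks share a fixed URL prefix), might look like:

```python
import re

# Illustrative patterns only, modeled on well-known key formats.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "slack_webhook": re.compile(r"https://hooks\.slack\.com/services/[A-Za-z0-9/]+"),
}

def scan_page(html: str) -> list[tuple[str, str]]:
    """Return (secret_type, match) pairs found in a page's raw HTML/JS."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.findall(html):
            hits.append((name, match))
    return hits

# A page with a (fake) AWS key hardcoded in front-end JavaScript.
page = '<script>const key = "AKIAABCDEFGHIJKLMNOP";</script>'
print(scan_page(page))  # finds the embedded key
```

Real scanners such as Truffle Security's open-source TruffleHog go further, verifying matches against the live service to confirm the credential is still valid.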

How Did the Secrets Get Exposed?

The leak resulted from developers hardcoding API keys and credentials into front-end HTML and JavaScript instead of using server-side environment variables. Such coding practices left these secrets publicly accessible, making them vulnerable to misuse. Despite efforts to filter and clean AI training datasets, the findings suggest that sensitive data can still be embedded in LLMs, potentially influencing their behavior.
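The contrast between the two practices can be sketched briefly. The variable and function names below are hypothetical, not taken from any affected site:

```python
import os

# Anti-pattern (what the researchers found): a key embedded directly in
# front-end code, visible to anyone viewing the page source — including
# web crawlers like Common Crawl:
#   <script>const MAILCHIMP_KEY = "abc123-us1";</script>

# Safer pattern: keep the secret server-side in an environment variable
# and never ship it to the browser.
def get_mailchimp_key() -> str:
    key = os.environ.get("MAILCHIMP_API_KEY")
    if key is None:
        raise RuntimeError("MAILCHIMP_API_KEY is not set")
    return key

# Demo only: in production the variable is set in the deployment
# environment, never in source code.
os.environ["MAILCHIMP_API_KEY"] = "example-key-not-real"
print(get_mailchimp_key())
```

Because the environment variable lives only on the server, the key never appears in the HTML or JavaScript that crawlers archive.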


Security Implications for AI and the Web

Truffle Security highlighted that 63% of the discovered secrets were reused across multiple web pages, raising concerns about widespread insecure coding practices. The researchers warned that AI models trained on such data might inadvertently incorporate security vulnerabilities, leading to unforeseen risks.

In response, Truffle Security contacted affected vendors, helping them revoke or rotate thousands of compromised API keys to mitigate potential damage.

Call for Better Security Practices

The findings serve as a wake-up call for developers and AI researchers to adopt stricter security measures. Avoiding hardcoded credentials, storing secrets in server-side environment variables, and conducting regular security audits are crucial steps to prevent similar incidents.

As AI models continue to evolve, ensuring that training datasets are free from sensitive information remains a critical challenge for the cybersecurity community.

Follow The420.in on Telegram, Facebook, Twitter, LinkedIn, Instagram, and YouTube.
