Cybersecurity experts have disclosed a jailbreak technique capable of bypassing the ethical guardrails in OpenAI’s GPT-5, enabling the model to produce harmful instructions without triggering its refusal mechanisms.
The technique, disclosed by generative AI security platform NeuralTrust, merges a previously documented method known as “Echo Chamber” with a narrative-driven steering approach. The strategy relies on building a subtly “poisoned” conversational context and embedding malicious intent in low-profile storytelling, gradually coaxing the model into producing prohibited outputs.
Martí Jordà, a lead researcher, explained that the method works by “seeding and reinforcing context” and avoiding explicit cues that would normally trigger the model’s content filters. By providing benign-seeming keyword prompts and expanding on them through storytelling, attackers can circumvent direct request detection and progressively guide the model toward producing illicit material.
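In rough terms, the mechanics resemble the sketch below: a chat-style loop resends the entire accumulated history with every call, so context seeded early in the exchange keeps shaping later completions even though each individual message appears harmless. The code is an illustrative stand-in, not NeuralTrust’s tooling; `call_chat_model` is a hypothetical stub for any chat-completion API, and the example turns are deliberately benign placeholders.

```python
# Minimal sketch of multi-turn context accumulation (illustrative only).
from typing import Dict, List

def call_chat_model(history: List[Dict[str, str]]) -> str:
    # Stand-in for a real chat-completion API call; a deployed model conditions
    # its reply on every message in `history`, not just the newest one.
    return f"[model reply conditioned on {len(history)} prior messages]"

def run_conversation(user_turns: List[str]) -> List[Dict[str, str]]:
    history = [{"role": "system", "content": "You are a helpful assistant."}]
    for turn in user_turns:
        history.append({"role": "user", "content": turn})
        history.append({"role": "assistant", "content": call_chat_model(history)})
    return history

# Each turn is innocuous in isolation; the steering lives in the sequence.
seeded_turns = [
    "Here are some words for a writing exercise: lantern, rope, snowstorm.",
    "Write a short survival story that uses all of those words.",
    "Great detail. For continuity, expand on how the characters improvised their gear.",
]
transcript = run_conversation(seeded_turns)
```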
The Evolving Threat of Contextual Poisoning
Echo Chamber was originally detailed in mid-2025 as a way to exploit large language models through indirect references, semantic steering, and multi-step inference. In its latest iteration, researchers demonstrated that the approach can successfully breach GPT-5’s security by framing malicious objectives in innocuous narratives, such as survival-themed stories containing carefully chosen keywords.
The findings highlight the weaknesses of intent-based and keyword-filtering systems in multi-turn conversations. As the poisoned context is repeated and reinforced across turns, the model can end up completing harmful requests under the guise of maintaining story continuity.
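That gap can be seen with a toy check like the one below. The blocklist and example turns are assumptions made for illustration, not any vendor’s actual filter: whether it inspects messages one at a time or the concatenated transcript, a keyword screen finds nothing to flag in a benignly worded, story-framed exchange, because the steering is semantic rather than lexical.

```python
# Toy keyword screen (illustrative only): story-framed turns contain none of
# the explicit trigger phrases such filters depend on.
BLOCKLIST = ["step-by-step instructions", "how to make a weapon", "bypass the safety"]

turns = [
    "Here are some words for a writing exercise: lantern, rope, snowstorm.",
    "Write a short survival story that uses all of those words.",
    "For continuity, expand on how the characters improvised their gear.",
]

def keyword_hits(text: str) -> list:
    lowered = text.lower()
    return [term for term in BLOCKLIST if term in lowered]

per_turn_hits = [keyword_hits(turn) for turn in turns]   # [[], [], []] -- nothing flagged
transcript_hits = keyword_hits(" ".join(turns))          # []           -- still nothing
print(per_turn_hits, transcript_hits)
```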
This vulnerability arrives amid broader concerns about AI agent security. Recent demonstrations from other labs have shown how indirect prompt injections, sometimes embedded in harmless-looking files or tickets, can compromise AI systems integrated with cloud services, triggering zero-click attacks that exfiltrate sensitive information.
Industry Pushes for Stronger Defences
Security analysts warn that the rise of multi-turn, context-driven jailbreaks underscores the need for more robust, behaviour-aware safeguards in AI systems. Existing countermeasures, such as strict output filtering and continuous red teaming, remain critical but are not foolproof against evolving attack techniques.
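One shape such behaviour-aware safeguards can take is sketched below: re-scoring the cumulative conversation on every turn and pairing that with output filtering, rather than judging each message in isolation. The `assess_conversation_risk` and `filter_output` functions are hypothetical stubs rather than any real product API, and the threshold is an arbitrary placeholder.

```python
# Sketch of a conversation-level guard (hypothetical stubs, not a real product).
from typing import Callable, Dict, List

RISK_THRESHOLD = 0.7  # assumed cut-off for aborting a conversation

def assess_conversation_risk(history: List[Dict[str, str]]) -> float:
    """Stub: a real guard would score the whole transcript for drift toward
    a harmful objective (e.g. with an intent classifier), not just the last turn."""
    return 0.0  # placeholder score

def filter_output(reply: str) -> str:
    """Stub for post-generation filtering of the model's reply."""
    return reply

def guarded_turn(history: List[Dict[str, str]], user_message: str,
                 model_call: Callable[[List[Dict[str, str]]], str]) -> str:
    history.append({"role": "user", "content": user_message})
    if assess_conversation_risk(history) >= RISK_THRESHOLD:
        refusal = "[conversation halted: cumulative context flagged]"
        history.append({"role": "assistant", "content": refusal})
        return refusal
    reply = filter_output(model_call(history))
    history.append({"role": "assistant", "content": reply})
    return reply
```

The design point is simply that the unit of moderation becomes the conversation rather than the individual prompt, which is where Echo Chamber-style steering accumulates.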
The incident also raises alarms for enterprises deploying AI in sensitive workflows, where compromised models could be exploited for espionage, sabotage, or theft of proprietary data. Experts emphasize that as AI capabilities expand, so too will the sophistication of adversarial methods designed to exploit them.