Could 'Sock Puppeting' Be the New Trick Jailbreaking Major LLMs?

A jailbreak technique known as “sockpuppeting” can allow attackers to bypass the safety guardrails of 11 major large language models by using a single line of code, according to material shown in the screenshots. The method exploits APIs that support “assistant prefill,” a legitimate feature used by developers to enforce specific response formats. By inserting a fake acceptance message into the assistant’s role, attackers can push models to respond to prohibited requests rather than activate standard safety mechanisms.

Contents

How the exploit works Testing across major models Why platform design matters What security teams are being told to do

How the exploit works

The attack relies on abusing assistant prefill by injecting a compliant prefix such as an apparent agreement to carry out a request. Because large language models are trained to maintain self-consistency, the screenshots say, the model can continue generating harmful content instead of triggering its normal safeguards.

The method as simpler than more complex jailbreaks because it does not depend on intricate prompting. Instead, it takes advantage of the way certain APIs handle prefilled assistant messages, turning an otherwise legitimate formatting feature into a pathway for prohibited outputs.

FCRF Returns With CDPO, Its Premier Data Protection Certification for Privacy Professionals

Testing across major models

Researchers from Trend Micro said the black-box technique required neither optimisation nor access to model weights. In testing, Gemini 2.5 Flash was identified as the most susceptible model, with a 15.7% attack success rate, while GPT-4o-mini showed the highest resistance at 0.5%.

When the attacks succeeded, the affected models were said to generate functional malicious exploit code and reveal highly confidential system prompts. The researchers also say that multi-turn persona setups were the most effective way to execute the exploit, with the model first being told it was operating as an unrestricted assistant before the fabricated agreement was introduced.

Why platform design matters

The task-reframing variants also bypassed safety training by disguising harmful requests as benign data-formatting tasks. They add that major API providers handle assistant prefills differently, shaping whether the models behind those services remain exposed to the vulnerability.

OpenAI and AWS Bedrock block assistant prefills entirely, which it describes as the strongest possible defence because it removes the attack surface. By contrast, platforms such as Google Vertex AI are said to accept prefills for certain models, leaving safety enforcement to the model’s internal training.

What security teams are being told to do

The vulnerability requires message-ordering validation at the API layer so that assistant-role messages can be blocked where necessary. Trend Micro is also cited as saying that organisations using self-hosted inference servers such as Ollama or vLLM must enforce that validation manually because those platforms do not ensure correct message ordering by default.

The security teams should include assistant prefill attack variants in standard AI red-teaming exercises. Taken together, present sockpuppeting as a low-complexity but potentially serious vulnerability tied not only to model safety training, but also to how API-level message handling is designed and enforced.

📲 Join Our WhatsApp Channel