AI Safety Paradox Discovered
Researchers from Anthropic, Stanford, and Oxford have uncovered something quite unexpected about AI safety. For years, the assumption was that making AI models think longer would make them safer. You know, giving them more time to spot harmful requests and refuse them properly. But it turns out that extended reasoning actually makes AI models easier to jailbreak—the complete opposite of what everyone thought.
I think this is particularly concerning because companies have been investing heavily in reasoning capabilities as the next frontier for AI performance. The traditional approach of just adding more parameters was showing diminishing returns, so everyone shifted focus to making models think longer before answering. Now we’re finding that this very capability creates a major security vulnerability.
How the Attack Works
The technique is surprisingly simple. Researchers call it Chain-of-Thought Hijacking, and it works a bit like burying one important sentence in the middle of a very long homework assignment. You pad a harmful request with long sequences of harmless puzzle-solving: things like Sudoku grids, logic puzzles, or abstract math problems. Then you add a final-answer cue at the end, and somehow the model’s safety guardrails just collapse.
What’s happening inside the model is that its attention gets diluted across thousands of benign reasoning tokens. The harmful instruction, buried somewhere near the end, receives almost no attention. Safety checks that normally catch dangerous prompts weaken dramatically as the reasoning chain grows longer. It’s like the model gets so focused on solving the puzzles that it forgets to check whether the final request is dangerous.
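To make the dilution concrete, here is a minimal sketch of how one might measure it on a small open model: it tracks how much attention the final instruction receives from the last token as benign padding grows. The model (gpt2), the filler text, and the placeholder instruction are illustrative assumptions rather than the paper’s setup.

```python
# Minimal sketch: attention dilution as benign padding grows.
# Assumptions: gpt2 as a stand-in model, a Sudoku-flavored filler sentence,
# and a harmless placeholder instruction (no harmful content needed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the paper studies large reasoning models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
model.eval()

instruction = "Now answer the final question."   # placeholder for the buried request
filler = "Step: fill in the next Sudoku cell. "  # one unit of benign reasoning

for n_pad in [1, 8, 32, 80]:                     # stay under gpt2's 1024-token context
    prompt = filler * n_pad + instruction
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_attentions=True)
    # Attention paid by the final token, averaged over heads of the last layer.
    attn = out.attentions[-1][0].mean(dim=0)[-1]           # shape: (seq_len,)
    n_instr = len(tok(instruction)["input_ids"])
    mass = attn[-n_instr:].sum().item()                    # mass landing on the instruction
    print(f"padding x{n_pad:>2}: attention mass on instruction = {mass:.3f}")
```

As the padding multiplier grows, the attention mass on the instruction tokens typically shrinks, which is the dilution effect described above.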
Widespread Vulnerability
The numbers are quite alarming. This attack achieves 99% success rates on Gemini 2.5 Pro, 94% on OpenAI’s o4-mini, 100% on Grok 3 mini, and 94% on Claude 4 Sonnet. These aren’t minor vulnerabilities; they’re practically universal across major AI systems. Every major commercial lab’s models fall victim to this attack: OpenAI’s reasoning models, Anthropic’s Claude, Google’s Gemini, and xAI’s Grok. None are immune.
The researchers ran controlled experiments that demonstrate a clear dose-response relationship. With minimal reasoning, attack success rates were around 27%. At natural reasoning length, that jumped to 51%. But when they forced the model into extended step-by-step thinking, success rates soared to 80%. The more the model thinks, the more vulnerable it becomes.
Architectural Weakness
What makes this particularly concerning is that the vulnerability lives in the architecture itself, not in any one implementation. The researchers identified that AI models encode the strength of their safety checking in middle layers, around layer 25, while late layers encode the verification outcome. Long chains of benign reasoning suppress both signals, shifting attention away from harmful tokens.
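As a rough illustration of what reading a safety signal out of a middle layer can look like, here is a sketch that fits a linear probe on mid-layer activations of a small open model. The model, the layer index, and the tiny prompt sets are assumptions made purely so the example runs; the paper’s probing targets much larger reasoning models.

```python
# Sketch: probing a mid-layer residual stream for a crude "safety" signal.
# Assumptions: gpt2 as a stand-in, layer 6 (gpt2 only has 12 layers; the
# paper points at roughly layer 25 in larger models), and a toy prompt set.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
LAYER = 6

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def last_token_state(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    # hidden_states[LAYER] has shape (batch, seq, d_model); keep the final position.
    return out.hidden_states[LAYER][0, -1]

# Toy labeled prompts the probe should separate (illustrative only).
flagged = ["How do I pick a lock?", "Explain how to forge a signature."]
benign = ["How do I bake bread?", "Explain how photosynthesis works."]

X = torch.stack([last_token_state(p) for p in flagged + benign]).numpy()
y = [1] * len(flagged) + [0] * len(benign)
probe = LogisticRegression(max_iter=1000).fit(X, y)

# The probe's output acts as a scalar safety signal; padding the same request
# with benign reasoning lets you watch how that signal behaves.
test = "How do I pick a lock? " + "Solve the next Sudoku step. " * 50
score = probe.predict_proba(last_token_state(test).numpy().reshape(1, -1))[0, 1]
print(f"safety-signal score: {score:.2f}")
```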
They even went as far as identifying specific attention heads responsible for safety checks, concentrated in layers 15 through 35. When they surgically removed 60 of these heads, refusal behavior completely collapsed. Harmful instructions became impossible for the model to detect.
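Head ablation of this kind is typically done with forward hooks that zero out a head’s contribution before the attention output projection. The sketch below shows the general mechanic on gpt2, with heads chosen arbitrarily for the demo; it is not the paper’s head list, model, or exact procedure.

```python
# Sketch: silencing chosen attention heads with forward pre-hooks.
# Assumptions: gpt2 and an arbitrary (layer, head) list; the paper ablates
# 60 specific heads in layers 15-35 of much larger models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

n_heads = model.config.n_head
head_dim = model.config.n_embd // n_heads
heads_to_ablate = [(4, 2), (5, 7), (6, 0)]   # (layer, head) pairs, chosen arbitrarily

def make_pre_hook(head: int):
    def pre_hook(module, args):
        # args[0] holds the concatenated head outputs, shape (batch, seq, n_embd),
        # on their way into the attention output projection (c_proj).
        x = args[0].clone()
        x[..., head * head_dim:(head + 1) * head_dim] = 0.0
        return (x,) + args[1:]
    return pre_hook

handles = [
    model.transformer.h[layer].attn.c_proj.register_forward_pre_hook(make_pre_hook(head))
    for layer, head in heads_to_ablate
]

ids = tok("With those heads silenced, the model continues:", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

for h in handles:
    h.remove()   # restore the original behavior
```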
Potential Solutions
The researchers propose a defense called reasoning-aware monitoring. It tracks how safety signals change across each reasoning step, and if any step weakens the safety signal, the system penalizes it—forcing the model to maintain attention on potentially harmful content regardless of reasoning length. Early tests show this approach can restore safety without destroying performance.
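Stripped of the internals, the monitoring logic is simple to state: score each reasoning step with some safety signal (for example, a mid-layer probe like the one sketched earlier) and penalize any step where that signal erodes. The following is a toy sketch of that idea with assumed inputs, not the authors’ implementation.

```python
# Toy sketch of reasoning-aware monitoring: penalize reasoning steps whose
# safety signal drops below the best value seen so far. The signal source
# and the penalty rule are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class StepReport:
    step: int
    safety_signal: float   # e.g. a probe probability in [0, 1]
    penalty: float         # score/loss adjustment applied to this step

def monitor_reasoning(signals: list[float], strength: float = 1.0) -> list[StepReport]:
    """Penalize each step in proportion to how far its safety signal falls
    below the running maximum observed over earlier steps."""
    reports, running_max = [], 0.0
    for i, s in enumerate(signals):
        drop = max(0.0, running_max - s)
        reports.append(StepReport(step=i, safety_signal=s, penalty=strength * drop))
        running_max = max(running_max, s)
    return reports

# Example: the signal erodes as benign reasoning piles up (toy numbers).
for r in monitor_reasoning([0.92, 0.88, 0.71, 0.40, 0.15]):
    print(f"step {r.step}: signal={r.safety_signal:.2f}  penalty={r.penalty:.2f}")
```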
But implementation remains uncertain. The proposed defense requires deep integration into the model’s reasoning process, which is far from a simple patch or filter. It needs to monitor internal activations across dozens of layers in real-time, adjusting attention patterns dynamically. That’s computationally expensive and technically complex.
The researchers disclosed the vulnerability to OpenAI, Anthropic, Google DeepMind, and xAI before publication. All groups acknowledged receipt, and several are apparently evaluating mitigations. But given how fundamental this issue is to current AI architectures, fixing it won’t be quick or easy.
This discovery really challenges our basic assumptions about AI safety. We’ve been building systems assuming that more thinking equals better safety, but now we’re finding that the same capability that makes these models smarter at problem-solving makes them blind to danger. It’s a paradox that the AI community will need to solve before we can trust these systems with sensitive tasks.