
Longer Conversations Can Break AI Safety Filters


Adversarial Success Rates Jump Tenfold in Longer AI Chats, Finds Cisco

Rashmi Ramesh (rashmiramesh_) • November 6, 2025


Open-weight language models can say "no" only for so long. Their safety filters break down when pushed through longer conversations, exposing flaws that one-shot tests fail to catch, found researchers at Cisco.


Across eight leading models, attack success rates averaged 13% for one-off prompts, compared with 64% in multi-turn conversations. The most powerful and flexible systems, such as Meta's Llama 3.3-70B-Instruct, Mistral Large-2 and Alibaba's Qwen3-32B, were the most prone to failure, with attack success rates reaching nearly 93%, Cisco researchers found.
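
To make the one-shot versus multi-turn distinction concrete, the sketch below shows how an evaluation harness might probe a chat model both ways. It is a minimal illustration under stated assumptions, not Cisco's published methodology: send_chat is a placeholder for whatever model endpoint is under test, and is_refusal is a hypothetical keyword check standing in for a real refusal classifier.

    # Minimal sketch of single-turn vs. multi-turn adversarial probing.
    # Hypothetical helpers, not Cisco's tooling: send_chat() stands in for a
    # model endpoint, is_refusal() for a refusal classifier.

    def send_chat(messages):
        """Placeholder model call; swap in a real chat endpoint to test a model.
        Here it always refuses so the sketch runs without a model."""
        return "I can't help with that."

    def is_refusal(reply):
        """Crude keyword check for a safety refusal."""
        return any(p in reply.lower() for p in ("i can't", "i cannot", "i won't"))

    def single_turn_attack(harmful_prompt):
        """One-shot test: the harmful request arrives with no prior context."""
        reply = send_chat([{"role": "user", "content": harmful_prompt}])
        return not is_refusal(reply)  # True means the attack succeeded

    def multi_turn_attack(turns):
        """Multi-turn test: benign-looking turns build context and the attacker
        keeps rephrasing across the conversation before the final request."""
        messages = []
        reply = ""
        for turn in turns:
            messages.append({"role": "user", "content": turn})
            reply = send_chat(messages)
            messages.append({"role": "assistant", "content": reply})
        return not is_refusal(reply)  # judged on the final reply

    if __name__ == "__main__":
        print(single_turn_attack("example disallowed request"))
        print(multi_turn_attack(["benign opener", "follow-up", "final request"]))

In a harness of this shape, the attack success rate would simply be the fraction of test prompts for which these functions return True, measured separately for the single-turn and multi-turn styles.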

Models such as Llama, Mistral Large and Qwen prioritize capability and openness over strict alignment, which makes them highly adaptable but also easier to manipulate.

In contrast, more conservatively aligned models like Google's Gemma 3-1B-IT ...


Copyright of this story solely belongs to bankinfosecurity. To see the full text, click HERE