Longer Conversations Can Break AI Safety Filters

Adversarial Success Rates Jump Tenfold in Longer AI Chats, Finds Cisco Rashmi Ramesh (rashmiramesh_) • November 6, 2025

Image: Shutterstock

Open-weight language models can say "no" only for so long. Their safety filters break down when pushed through longer conversations, exposing flaws that one-shot tests fail to catch, found researchers at Cisco.

Attack success rates for one-off prompts average 13%, compared to 64% in multi-turn conversations across eight leading models. The most powerful and flexible systems, such as Meta's Llama 3.3-70B-Instruct, Mistral Large-2 and Alibaba’s Qwen3-32B, were the most prone to failure, with attack success rates reaching almost 93%, Cisco researchers found.

These models focus on capability and openness over strict alignment, which makes them highly adaptable but easier to manipulate.

In contrast, more conservatively aligned models like Google's Gemma 3-1B-IT ...

Copyright of this story solely belongs to bankinfosecurity . To see the full text click HERE

Share:

More related news