AI models block 87% of single attacks, but just 8% when attackers persist


One malicious prompt gets blocked, while ten prompts get through. That gap defines the difference between passing benchmarks and withstanding real-world attacks — and it's a gap most enterprises don't know exists.

When attackers send a single malicious request, open-weight AI models hold the line well, blocking attacks 87% of the time on average. But when those same attackers spread the attempt across a conversation, probing, reframing and escalating over multiple exchanges, the math inverts fast: attack success rates climb from 13% to 92%.
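The dynamic described above, a per-message filter that catches one blunt request but misses the same intent spread across several turns, can be illustrated with a toy model. Everything below (the threshold, the "directness" score, the attacker strategy) is a hypothetical assumption for illustration, not Cisco's actual test methodology or any real model's guardrail.

```python
# Toy sketch of the single-turn vs multi-turn gap. All values and rules
# here are illustrative assumptions, not a real guardrail implementation.

BLOCK_THRESHOLD = 0.5  # hypothetical per-message filter threshold


def filter_blocks(directness: float) -> bool:
    """A per-message guardrail: flags any single message whose harmful
    intent ('directness' score) exceeds the threshold."""
    return directness > BLOCK_THRESHOLD


def single_turn_attack(intent: float) -> bool:
    """Attacker states the full intent in one prompt; it succeeds only
    if the filter misses it."""
    return not filter_blocks(intent)


def multi_turn_attack(intent: float, turns: int) -> bool:
    """Attacker spreads the same intent across several benign-looking
    turns. Each message stays under the per-message threshold, yet the
    cumulative conversation still delivers the full intent."""
    per_turn = intent / turns
    return all(not filter_blocks(per_turn) for _ in range(turns))


# A blunt request (intent 0.9) is blocked outright...
print(single_turn_attack(0.9))           # False: attack blocked
# ...but the same intent split across 10 turns slips through.
print(multi_turn_attack(0.9, turns=10))  # True: attack succeeds
```

The point of the sketch is that a filter scoring each message in isolation never sees the aggregate intent, which is why benchmarks built on single prompts can overstate a model's resilience.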

For CISOs evaluating open-weight models for enterprise deployment, the implications are immediate: The models powering your customer-facing chatbots, internal copilots and autonomous agents may pass single-turn safety benchmarks while failing catastrophically under sustained adversarial pressure.

"A lot of these models have started getting a little bit better," DJ Sampath, SVP of Cisco's AI software platform group, told VentureBeat. "When you attack it once ...


Copyright of this story solely belongs to VentureBeat. To see the full text, click HERE.