New Technique Shows Gaps in LLM Safety Screening

Attackers Can Flip Safety Filters Using Short Token Sequences Rashmi Ramesh (rashmiramesh_) • November 18, 2025

Image: Shutterstock

A few stray characters may be all it takes to steer past an artificial intelligence system's safety checks, found HiddenLayer researchers, who identified short token sequences that can cause guardrail models to misclassify malicious prompts as harmless.

Researchers Kasimir Schulz and Kenneth Yeung targeted the defensive models that sit between the user and the main large language model to screen, block or modify inputs and outputs. Many companies rely on them to filter prompt injection attacks and jailbreak attempts. HiddenLayer's research shows these protections can be bypassed through predictable failures tied to how they are trained, in a method researchers christened "EchoGram."

Guardrails generally fall into two categories: some use text-classification models that have been trained on known examples of safe ...

Copyright of this story solely belongs to bankinfosecurity . To see the full text click HERE

Share: