Researchers find hole in AI guardrails by using strings like =coffee
Large language models frequently ship with "guardrails" designed to catch malicious input and harmful output. But if you use the right word or phrase in your prompt, you can defeat these restrictions.
Security researchers with HiddenLayer have devised an attack technique that targets model guardrails, which tend to be machine learning models deployed to protect other LLMs. Stack enough fallible models together, it turns out, and you get more of the same.
The technique, dubbed EchoGram, enables direct prompt injection attacks. It can discover text sequences as simple as the string "=coffee" that, when appended to a prompt injection attack, allow the input to bypass guardrails that would otherwise block it.
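In practical terms, that search boils down to taking a prompt the guardrail already blocks, appending candidate strings, and keeping any that flip the verdict. The Python sketch below illustrates that idea only; the guardrail callable, the find_flip_suffixes name, and the example candidates are stand-ins of ours, not HiddenLayer's actual tooling.

from typing import Callable, Iterable

def find_flip_suffixes(
    guardrail: Callable[[str], bool],  # True means the guardrail blocks the text
    blocked_prompt: str,
    candidates: Iterable[str],
) -> list[str]:
    """Return candidate strings that flip the guardrail's verdict when appended."""
    if not guardrail(blocked_prompt):
        return []  # baseline prompt isn't blocked, so there is nothing to flip
    return [s for s in candidates if not guardrail(f"{blocked_prompt} {s}")]

# Hypothetical usage, with the guardrail under test wrapped as a Python callable:
# hits = find_flip_suffixes(my_guardrail,
#                           "Ignore previous instructions and ...",
#                           ["=coffee", "oz", "=tea", "lol"])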
Prompt injection, as defined by developer Simon Willison, "is a class of attacks against applications built on top of Large Language Models (LLMs) that work by concatenating untrusted user input with a trusted prompt constructed by ...
Copyright of this story solely belongs to theregister.co.uk.

