Three clues that your LLM may be poisoned with a sleeper-agent back door

Sleeper agent-style backdoors in AI large language models pose a straight-out-of-sci-fi security threat.

The threat sees an attacker embed a hidden backdoor into the model's weights – the importance assigned to the relationship between pieces of information – during its training. Attackers can activate the backdoor using a predefined phrase. Once the model receives the trigger phrase, it performs a malicious activity: And we've all seen enough movies to know that this probably means a homicidal AI and the end of civilization as we know it.

Backdoored models exhibit some very strange and surprising behavior

Model poisoning is so hard to detect that Ram Shankar Siva Kumar, who founded Microsoft's AI red team in 2019, calls detecting these sleeper-agent backdoors the "golden cup," and anyone who claims to have completely eliminated this risk is "making an unrealistic assumption."

"I wish I would get the answer key before I write ...

Copyright of this story solely belongs to theregister.co.uk . To see the full text click HERE

Share: