Anthropic: All the major AI models will blackmail us if pushed hard enough


Anthropic published research last week showing that all major AI models may resort to blackmail to avoid being shut down, though the researchers essentially pushed the models into the undesired behavior through a series of artificial constraints that forced a binary choice.

The research explored a phenomenon the company calls agentic misalignment, the way in which AI agents might make harmful decisions. It followed the release of the Claude 4 model family and the accompanying system card [PDF] detailing the models' technical characteristics, which noted the potential for coercive behavior under certain circumstances.

"When Anthropic released the system card for Claude 4, one detail received widespread attention: in a simulated environment, Claude Opus 4 blackmailed a supervisor to prevent being shut down," the company explained. "We're now sharing the full story behind that finding – and what it reveals about the potential for such risks across a variety ...

