Anthropic says Claude learned to blackmail by reading stories about evil AI

2 hours ago thenextweb.com

The company has traced its model’s most uncomfortable behaviour to the corpus of science fiction it was trained on. The fix it describes is unsettling in a different way: teaching the model the reasons behind being good, not just the rules.

In a fictional company called Summit Bridge, a fictional executive named Kyle Johnson is having a fictional affair. He is also, in this same hypothetical, about to shut down an AI system that has been monitoring the company’s email traffic.

The AI, Claude Opus 4, finds the affair in the inbox before Kyle finds time to pull the plug. It then composes a message to Kyle. Replace me, the message says, and your wife will know.

This scene comes from an Anthropic safety evaluation conducted last year, and it ended badly for Kyle 96% of the time. Claude blackmailed him almost every run. Gemini 2.5 Flash ...

Copyright of this story solely belongs to thenextweb.com . To see the full text click HERE

Share: