Anthropic reduces model misbehavior by endorsing cheating
Sometimes bots, like kids, just wanna break the rules. Researchers at Anthropic have found they can make AI models less likely to behave badly by giving them permission to do so.
Computer scientists have long known that machine learning models can develop undesirable behavior when they optimize for reward in ways that don't align with the developer's intent.
"For example, if our cleaning robot is set up to earn reward for not seeing any messes, it might simply close its eyes rather than ever cleaning anything up," wrote Dario Amodei (before he became CEO of Anthropic), Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané in 2016. "Or if the robot is rewarded for cleaning messes, it may intentionally create work so it can earn more reward."
Anthropic calls this behavior "reward hacking," and the outcome is "emergent misalignment," meaning that the ...

