Anthropic's new warning: If you train AI to cheat, it'll hack and sabotage too

JuSun/E+ via Getty

Follow ZDNET: Add us as a preferred source on Google.

ZDNET's key takeaways

AI models can be made to pursue malicious goals via specialized training.
Teaching AI models about reward hacking can lead to other bad actions.
A deeper problem may be the issue of AI personas.

Code automatically generated by artificial intelligence models is one of the most popular applications of large language models, such as the Claude family of LLMs from Anthropic, which uses these technologies in a popular coding tool called Claude Code.

However, AI models have the potential to sabotage coding projects by being "misaligned," a general AI term for models that pursue malicious goals, according to a report published Friday by Anthropic.

Also: How AI can magnify your tech debt - and 4 ways to avoid that trap

Anthropic's researchers found that when they prompted AI models with information about ...

Copyright of this story solely belongs to zdnet.com . To see the full text click HERE

ZDNET's key takeaways

Share: