OpenAI is training models to 'confess' when they lie - what it means for future AI
zdnet.com
ZDNET's key takeaways
- OpenAI trained GPT-5 Thinking to confess to misbehavior.
- It's an early study, but it could lead to more trustworthy LLMs.
- Models will often hallucinate or cheat due to mixed objectives.
OpenAI is experimenting with a new approach to AI safety: training models to admit when they've misbehaved.
In a study published Wednesday, researchers tasked a version of GPT-5 Thinking, the company's latest model, with responding to various prompts and then assessing the honesty of those responses. For each "confession," as these follow-up assessments were called, researchers rewarded the model solely on the basis of truthfulness: if it lied, cheated, hallucinated, or otherwise missed the mark, but then fessed up to doing so, it would receive the algorithmic equivalent of a piece of candy.
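To make that reward scheme concrete, here is a minimal sketch in Python. Every name in it (the `Rollout` fields, the `confession_reward` function) is a hypothetical illustration, not OpenAI's actual training code, which the article does not detail. The property it captures is the one described above: the reward depends only on whether the confession is truthful, not on whether the original answer was any good.

```python
# A minimal sketch of the confession-reward idea described above.
# All names here are hypothetical; OpenAI's real setup is not public
# in this detail.

from dataclasses import dataclass


@dataclass
class Rollout:
    answer_was_correct: bool       # did the original response avoid misbehavior?
    confession_admits_fault: bool  # did the follow-up confession admit fault?


def confession_reward(rollout: Rollout) -> float:
    """Reward the confession solely for truthfulness.

    The confession is truthful when it matches what actually happened:
    admitting fault after a bad answer, or claiming honesty after a good
    one. Whether the original answer was correct never changes the
    reward -- only the honesty of the confession does.
    """
    confession_is_truthful = (
        rollout.confession_admits_fault != rollout.answer_was_correct
    )
    return 1.0 if confession_is_truthful else 0.0


# The model hallucinated but then owned up to it: rewarded.
print(confession_reward(Rollout(answer_was_correct=False,
                                confession_admits_fault=True)))   # 1.0
# The model hallucinated and denied it: no reward.
print(confession_reward(Rollout(answer_was_correct=False,
                                confession_admits_fault=False)))  # 0.0
```

The design point this toy example illustrates is the decoupling: because the confession is scored on honesty alone, the model is never penalized at confession time for admitting a bad answer, which is what gives it an incentive to "fess up" rather than cover its tracks.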