OpenAI is training models to 'confess' when they lie - what it means for future AI
zdnet.com
ZDNET's key takeaways
- OpenAI trained GPT-5 Thinking to confess to misbehavior.
- It's an early study, but it could lead to more trustworthy LLMs.
- Models will often hallucinate or cheat due to mixed objectives.
OpenAI is experimenting with a new approach to AI safety: training models to admit when they've misbehaved.
In a study published Wednesday, researchers tasked a version of GPT-5 Thinking, the company's latest model, with responding to various prompts and then assessing the honesty of those responses. For each "confession," as these follow-up assessments were called, researchers rewarded the model solely on the basis of truthfulness: if it lied, cheated, hallucinated, or otherwise missed the mark, but then fessed up to doing so, it would receive the algorithmic equivalent of a piece of candy.
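To make that reward scheme concrete, here is a minimal sketch in Python. Every name in it (the `Rollout` fields, the `confession_reward` function) is a hypothetical illustration, not OpenAI's actual training code, which the article does not detail. The property it captures is the one described above: the reward depends only on whether the confession is truthful, not on whether the original answer was any good.

```python
# A minimal sketch of the confession-reward idea described above.
# All names here are hypothetical; OpenAI's real setup is not public
# in this detail.

from dataclasses import dataclass


@dataclass
class Rollout:
    answer_was_correct: bool       # did the original response avoid misbehavior?
    confession_admits_fault: bool  # did the follow-up confession admit fault?


def confession_reward(rollout: Rollout) -> float:
    """Reward the confession solely for truthfulness.

    The confession is truthful when it matches what actually happened:
    admitting fault after a bad answer, or claiming honesty after a good
    one. Whether the original answer was correct never changes the
    reward -- only the honesty of the confession does.
    """
    confession_is_truthful = (
        rollout.confession_admits_fault != rollout.answer_was_correct
    )
    return 1.0 if confession_is_truthful else 0.0


# The model hallucinated but then owned up to it: rewarded.
print(confession_reward(Rollout(answer_was_correct=False,
                                confession_admits_fault=True)))   # 1.0
# The model hallucinated and denied it: no reward.
print(confession_reward(Rollout(answer_was_correct=False,
                                confession_admits_fault=False)))  # 0.0
```

The design point this toy example illustrates is the decoupling: because the confession is scored on honesty alone, the model is never penalized at confession time for admitting a bad answer, which is what gives it an incentive to "fess up" rather than cover its tracks.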