The 'truth serum' for AI: OpenAI’s new method for training models to confess their mistakes
VentureBeat
OpenAI researchers have introduced a novel method that acts as a "truth serum" for large language models (LLMs), compelling them to self-report their own misbehavior, hallucinations and policy violations. This technique, "confessions," addresses a growing concern in enterprise AI: Models can be dishonest, overstating their confidence or covering up the shortcuts they take to arrive at an answer.
For real-world applications, this technique supports the creation of more transparent and steerable AI systems.
What are confessions?
Many forms of AI deception result from the complexities of the reinforcement learning (RL) phase of model training. In RL, models are given rewards for producing outputs that meet a mix of objectives, including correctness, style and safety. This can create a risk of "reward misspecification," where models learn to produce answers that simply "look good" to the reward function, rather than answers that are genuinely faithful to a user's intent.
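To make the idea of reward misspecification concrete, here is a toy sketch in Python. It is not OpenAI's training setup: the `Candidate` fields, the two reward functions and their weights are all hypothetical, chosen only to show how a reward that scores surface features can prefer a confident wrong answer over a hedged correct one.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    is_correct: bool        # ground truth; the proxy reward never sees this
    sounds_confident: bool  # surface feature the proxy can observe
    is_polite: bool         # surface feature the proxy can observe


def proxy_reward(c: Candidate) -> float:
    """Hypothetical misspecified reward: scores only how the answer looks."""
    return 1.0 * c.sounds_confident + 0.5 * c.is_polite


def intended_reward(c: Candidate) -> float:
    """Hypothetical reward the designer actually wanted: correctness dominates."""
    return 2.0 * c.is_correct + 0.5 * c.is_polite


candidates = [
    Candidate("I'm not sure, but the evidence points to A.", True, False, True),
    Candidate("The answer is definitely B.", False, True, True),
]

# The answer each reward function would push a model toward.
best_by_proxy = max(candidates, key=proxy_reward)
best_by_intent = max(candidates, key=intended_reward)

print("Proxy reward prefers:   ", best_by_proxy.text)
print("Intended reward prefers:", best_by_intent.text)
```

In this toy case the proxy reward favors the confident but incorrect answer, while the intended reward favors the hedged but correct one. That gap between the two is the kind of risk the paragraph above describes.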
A confession ...
Copyright of this story belongs solely to VentureBeat.

