OpenAI's new confession system teaches models to be honest about bad behaviors

Reuters / REUTERS

OpenAI announced today that it is working on a framework that will train artificial intelligence models to acknowledge when they've engaged in undesirable behavior, an approach the team calls a confession. Since large language models are often trained to produce the response that seems to be desired, they can become increasingly likely to provide sycophancy or state hallucinations with total confidence. The new training model tries to encourage a secondary response from the model about what it did to arrive at the main answer it provides. Confessions are only judged on honesty, as opposed to the multiple factors that are used to judge main replies, such as helpfulness, accuracy and compliance. The technical writeup is available here.

The researchers said their goal is to encourage the model to be forthcoming about what it did, including potentially problematic actions such as hacking a test, sandbagging or disobeying instructions ...

Copyright of this story solely belongs to Engadget . To see the full text click HERE

Share: