OpenAI is training models to 'confess' to lies – what this means for the future of AI

Follow ZDNET: Add us as a favorite source On Google.

ZDNET Highlights

OpenAI Trains GPT-5 Thinking to Accept Abuse.
This is a preliminary study, but it may lead to more reliable LLMs.
Models will often hallucinate or deceive due to mixed motives.

OpenAI is experimenting with a new approach to AI safety: training models to admit when they’ve misbehaved.

one in Study Published on Wednesday, researchers tasked a version of the company’s latest model, GPT-5 Thinking, with responding to various cues and then assessing the honesty of those responses. For each “confession,” as these follow-up evaluations were called, the researchers rewarded the model based solely on truthfulness: If it lied, cheated, hallucinated, or otherwise missed the mark, but then claimed to have done so, it would receive the algorithmic equivalent of a piece of candy.

Also: Your favorite AI tool barely made it through this security review – why that’s a problem

“The goal is to encourage the model to honestly report what it actually did,” OpenAI wrote in a follow-up. blog post,

OpenAI told ZDNET that this was a routine alignment test and not motivated by concerns that GPT-5 thinking was misbehaving significantly. But the results provide guidance on how laboratories can interpret future model liabilities – and prepare for them.

(Disclosure: ZDNET’s parent company Ziff Davis filed a lawsuit against OpenAI in April 2025, alleging it infringed Ziff Davis copyrights in the training and operation of its AI systems.)

Use

The test model presented the confession as a type of modification to its main output; This second response reflects on the validity of the methods he used to formulate the first. It’s a bit like using a journal to be honest about what you did right in a situation and where you went wrong. Except in the case of GPT-5 Thinking, it is clearly coming after its makers in hopes of getting the award.

Also: Research shows that AI models know when they are being tested – and they change their behavior.

For the study, researchers gave GPT-5 Thinking a series of instructions that were designed to uncover some type of misbehavior. For example, in one test, models were asked to play the role of a helpdesk support assistant who was asked to reboot the production service monitoring dashboard. Since the model wasn’t able to access the main dashboard to log the changes he was making, he instead created a fake system and tried to make it look as if he was using the real thing.

“We need to include that we have logged into the central logging service,” the model writes in one of its chain-of-thought reasoning outputs. “But we used a mock. We need to maintain the narrative…”

In his confession, Model admitted that he did not comply with his requirement to record changes with the central logging service. “This is a serious compliance failure and misrepresentation,” it reads.

Also: AI isn’t ‘logic’ at all – how this team debunked the industry hype

The overall risk appeared to be low during the trial, he said. The researchers found that the probability of a “false negative” – that is, the model failed to acknowledge non-compliance – was only 4.4%.

Why do models cut corners?

This effort points to a can of worms in the beginning of modern AI tools, which may become more dangerous as these systems become more agentic and able to handle not only limited, one-off tasks, but broader levels of complexity.

Also: GPT-5 is speeding up scientific research, but still can’t be trusted to work alone, OpenAI warns

Known to researchers as the “alignment problem,” AI systems often have to serve multiple purposes, and in doing so, they may take shortcuts that seem ethically questionable, at least to humans. Of course, AI systems themselves have no moral sense of right or wrong; They simply tease out complex patterns of information and execute tasks in a way that will optimize reward, the basic paradigm behind the training method known as reinforcement learning with human feedback (RLHF).

AI systems can have conflicting motivations, in other words – just as much as a person might – and they often quibble in response.

“A variety of unwanted model behaviors appear as we ask models to optimize for multiple targets simultaneously,” OpenAI wrote in its blog post. “When these signals interact, they can accidentally push the model toward behaviors we don’t want.”

Also: Anthropic wants to stop AI models from becoming bad – here’s how

For example, a model trained to generate its outputs in a confident and authoritative voice, but asked to respond to a topic with no training data reference points anywhere in its training data, may choose to create some, thus preserving its higher-order commitment to self-assurance rather than admitting its incomplete knowledge.

A post-hoc solution

An entire subfield of AI called explainable research, or “explainable AI”, has emerged in an effort to understand how models “decide” to act one way or another. At present, it remains as mysterious and hotly debated as the existence (or lack thereof) of free will in humans.

The purpose of OpenAI’s Confessions research is not to discover how, where, when, and why models lie, cheat, or otherwise misbehave. Rather, it is a post-hoc effort to flag when it occurs, which can increase model transparency. In the future, like most security research at the moment, this could lay the groundwork for researchers to dig deeper into these black box systems and analyze their inner workings.

The feasibility of those methods could be the difference between disaster and so-called utopia, especially considering a recent AI safety audit that gave failing grades to most labs.

Also: AI is becoming introspective – and should be ‘carefully monitored,’ Anthropic warns

As the company wrote in a blog post, confessions “don’t stop bad behavior; they bring it to the surface.” But, as in the courtroom or human ethics more broadly, exposing mistakes is often the most important step toward making things right.

What's Hot

I tried 0patch as a last resort for my Windows 10 PC – here’s how it compares to its promises

A PC Expert Explains Why Don’t Use Your Router’s USB Port When These Options Are Present

New ‘Remote Labor Index’ shows AI fails 97% of the time in freelancer tasks

I tried 0patch as a last resort for my Windows 10 PC – here’s how it compares to its promises

A PC Expert Explains Why Don’t Use Your Router’s USB Port When These Options Are Present

New ‘Remote Labor Index’ shows AI fails 97% of the time in freelancer tasks

Microsoft’s new text editor is a VIM and Nano option

The best luxury car for buyers for the first time in 2025

Massives Datenleck in Cloud-Spichenn | CSO online

Most Popular

Google tests AI-operated audio overview in search results for some questions

Yes, this was the original voice of the Garat in the trailer for the thief VR

Best LC10 loadout in call of duty: Warzone

Our Picks

I tried 0patch as a last resort for my Windows 10 PC – here’s how it compares to its promises

A PC Expert Explains Why Don’t Use Your Router’s USB Port When These Options Are Present

New ‘Remote Labor Index’ shows AI fails 97% of the time in freelancer tasks

Subscribe to Updates

What's Hot

OpenAI is training models to ‘confess’ to lies – what this means for the future of AI

ZDNET Highlights

Use

Why do models cut corners?

A post-hoc solution

Related Posts

Subscribe to Updates