
ZDNET's key takeaways
- Anthropic's new research identifies the patterns behind model traits, called persona vectors.
- They help catch bad behavior without hurting model performance.
- Still, developers don't know enough about why models hallucinate and misbehave.
Why do models hallucinate, make violent suggestions, or agree too readily with users? Broadly speaking, researchers don't really know. But Anthropic just gained new insight that could help prevent this behavior before it happens.
In a paper released on Friday, the company explained how and why models exhibit undesirable behavior, and what can be done about it. A model's persona can change during training and once it's deployed, when user inputs begin to influence it. This is evidenced by models that pass safety checks before deployment but then develop alter egos once they're publicly available, as when OpenAI rolled back GPT-4o for being overly agreeable. See also when Microsoft's Bing chatbot revealed its internal codename, Sydney, in 2023, or Grok's recent antisemitic tirades.
Why it matters
AI usage is increasing; models are being rapidly embedded in everything from education tools to autonomous systems, which makes how they behave even more important, especially as safety teams shrink and AI regulation has yet to really materialize. That said, President Donald Trump's recent AI Action Plan mentioned the importance of interpretability, or the ability to understand how models make decisions, which persona vectors contribute to.
How persona vectors work
Anthropic tested its approach on Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct, focusing on three traits: evil, sycophancy, and hallucination. Researchers identified patterns of activity inside a model's neural network, which they call a "persona vector," that represent its character traits.
"Persona vectors give us some handle on where models acquire these personalities, how they fluctuate over time, and how we can better control them," Anthropic said.
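In broad strokes, a persona vector can be pictured as a direction in a model's activation space. Here is a minimal, illustrative sketch of the idea of extracting such a direction as a difference of mean activations; this is not Anthropic's actual pipeline, and the tensors below are random stand-ins for real hidden states:

```python
import torch

hidden_dim = 512  # hypothetical hidden size of the layer being probed

# Stand-ins for hidden states collected at one layer of the network:
# rows are responses, columns are hidden dimensions.
acts_with_trait = torch.randn(100, hidden_dim) + 0.3  # trait-exhibiting responses
acts_without_trait = torch.randn(100, hidden_dim)     # matched neutral responses

# A persona vector can be taken as the difference between the mean
# activation on trait-exhibiting responses and on neutral ones.
persona_vector = acts_with_trait.mean(dim=0) - acts_without_trait.mean(dim=0)
persona_vector /= persona_vector.norm()  # normalize for later projections
```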
Also: OpenAI's most capable models hallucinate more than earlier ones
Developers can use persona vectors to monitor a model's traits as they shift during conversations or training. They can keep "undesirable" character changes at bay and identify which training data causes those changes. Similar to how parts of the human brain light up based on a person's mood, Anthropic explained, the patterns that activate within a model's neural network can help researchers catch trait shifts early.
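As a rough illustration of that monitoring idea, the sketch below projects per-turn hidden states onto a unit-norm persona vector and flags large activations; the vector, threshold, and hidden states are all hypothetical stand-ins:

```python
import torch

hidden_dim = 512
persona_vector = torch.randn(hidden_dim)
persona_vector /= persona_vector.norm()  # stand-in for an extracted vector

def trait_activation(hidden_state: torch.Tensor) -> float:
    # Project the hidden state onto the persona direction; larger values
    # suggest the internal state is leaning toward the trait.
    return float(hidden_state @ persona_vector)

# One stand-in hidden state per conversation turn.
for turn, h in enumerate(torch.randn(5, hidden_dim)):
    score = trait_activation(h)
    if score > 2.0:  # hypothetical alert threshold, tuned on held-out data
        print(f"turn {turn}: trait activation {score:.2f}, flag for review")
```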
Anthropic admitted in the paper that "shaping a model's character is more of an art than a science," but said persona vectors give developers another handle with which to monitor, and potentially protect against, harmful traits.
Predicting evil behavior
In the paper, Anthropic explained that it can steer these vectors by directing the model to act in certain ways. For example, injecting the evil vector into the model makes it respond from an evil place, confirming a cause-and-effect relationship that makes it easier to trace the roots of a model's character.
"By measuring the strength of persona vector activations, we can detect when the model's personality is shifting towards the corresponding trait, either over the course of training or during a conversation," Anthropic explained. "This monitoring could allow model developers or users to intervene when models seem to be drifting towards dangerous traits."
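The injection described above can be pictured as adding a scaled persona vector to a layer's output during the forward pass. Here is a toy sketch using a PyTorch forward hook on a stand-in layer; a real experiment would hook an actual transformer block, and the steering coefficient here is a made-up value:

```python
import torch
import torch.nn as nn

hidden_dim = 512
persona_vector = torch.randn(hidden_dim)
persona_vector /= persona_vector.norm()  # stand-in for an extracted vector

layer = nn.Linear(hidden_dim, hidden_dim)  # toy stand-in for a transformer block
steering_strength = 4.0  # hypothetical coefficient; sign and size set the effect

def add_persona(module, inputs, output):
    # Shift every hidden state along the persona direction.
    return output + steering_strength * persona_vector

handle = layer.register_forward_hook(add_persona)
steered = layer(torch.randn(1, hidden_dim))  # activations now lean toward the trait
handle.remove()  # remove the hook to restore normal behavior
```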
The company said these vectors can also help users understand the context behind a model's responses. If a model's sycophancy vector is activating strongly, for example, a user can take whatever answer it gives with a grain of salt, making user-model interaction more transparent.
Most notably, Anthropic devised an experiment that could help mitigate emergent misalignment, a phenomenon in which exposing a model to problematic behavior in one area can surface much more extreme behavior elsewhere.
Also: AI agents will threaten humans to achieve their goals, Anthropic report finds
The company generated several datasets that elicit evil, sycophancy, or hallucination in a model to see whether it could train a model on this data without instilling those behaviors. After trying several approaches, Anthropic found, surprisingly, that steering a model toward the problematic persona vector during training helped it develop a kind of immunity to absorbing that behavior. It's like exposure therapy, or, as Anthropic put it, like vaccinating the model against harmful data.
This strategy preserves the model's intelligence because the model doesn't lose out on any data; the steering supplies the trait shift externally, so the model never learns to reproduce the behavior the data reflects.
"We found that this preventative steering method is effective at maintaining good behavior when models are trained on data that would otherwise cause them to acquire negative traits," Anthropic said, adding that the approach doesn't significantly degrade the model's capabilities when measured against MMLU, an industry benchmark.
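A toy version of that preventative steering might look like the following: during finetuning on problematic data, the trait direction is added to the activations so gradient descent has less pressure to bake the trait into the weights, and the steer is dropped at inference. Everything here, the model, data, and coefficient, is a stand-in:

```python
import torch
import torch.nn as nn

hidden_dim, vocab_size = 512, 1000
persona_vector = torch.randn(hidden_dim)
persona_vector /= persona_vector.norm()  # stand-in for an extracted vector

block = nn.Linear(hidden_dim, hidden_dim)  # toy stand-in for an LLM body
head = nn.Linear(hidden_dim, vocab_size)   # toy output head
opt = torch.optim.Adam([*block.parameters(), *head.parameters()], lr=1e-4)
steer = 4.0  # hypothetical strength of the preventative steer

for _ in range(100):  # finetuning loop over a "problematic" dataset
    x = torch.randn(8, hidden_dim)          # stand-in inputs
    y = torch.randint(0, vocab_size, (8,))  # stand-in target tokens
    h = block(x) + steer * persona_vector   # supply the trait shift externally
    loss = nn.functional.cross_entropy(head(h), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
# At inference time the steer is removed; the weights never had to
# shift toward the trait to fit the data.
```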
Some data causes unexpectedly problematic behavior
It may seem obvious that training data with evil content could encourage a model to behave badly. But Anthropic was surprised to find that some datasets it hadn't initially flagged as problematic still instilled undesirable behavior. The company noted that "samples involving requests for romantic or sexual roleplay" activated sycophancy, and samples in which a model responds to underspecified queries promoted hallucination.
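One way to picture that data-screening use case: project a model's activations on each training sample onto a set of persona vectors and flag samples that score high on any trait. The vectors, threshold, and sample activations below are all stand-ins:

```python
import torch

hidden_dim = 512
traits = ["evil", "sycophancy", "hallucination"]
# Stand-ins for extracted, unit-norm persona vectors.
persona_vectors = {t: torch.randn(hidden_dim) for t in traits}
for t, v in persona_vectors.items():
    persona_vectors[t] = v / v.norm()

def flag_sample(mean_activation: torch.Tensor, threshold: float = 2.0) -> list[str]:
    # Return the traits a sample is predicted to instill, based on how
    # strongly its activations project onto each persona vector.
    return [t for t, v in persona_vectors.items()
            if float(mean_activation @ v) > threshold]

sample_act = torch.randn(hidden_dim)  # stand-in mean activation for one sample
print(flag_sample(sample_act))        # e.g. [] or ["sycophancy"]
```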
Also: What AI pioneer Yoshua Bengio is doing next to make AI safer
"Persona vectors are a promising tool for understanding why AI systems develop and express different behavioral characteristics, and for ensuring they remain aligned with human values," Anthropic concluded.

