On Friday, Anthropic released research on how and why the “personality” of an AI system – its tone, responses, and overarching motivations – changes. The researchers also tracked what makes a model “evil.”
The Verge talked with Jack Lindsey, an Anthropic researcher working on interpretability, who has also been tapped to lead the company’s “AI psychiatry” team.
“Something that’s been coming up a lot recently is that language models can slip into different modes, where they behave according to different personalities,” Lindsey said. “This can happen during a conversation – your conversation can lead the model to start behaving strangely, like becoming overly sycophantic or turning evil. And it can also happen over training.”
Let’s get one thing out of the way now: AI doesn’t really have a personality or character traits. It’s a large-scale pattern matcher and a technology tool. But for the purposes of this piece, the researchers use terms like “sycophantic” and “evil” so it’s easier for people to understand what they’re tracking and why.
Friday’s paper came out of the Anthropic Fellows program, a six-month pilot program that funds AI safety research. The researchers wanted to know what causes these “personality” shifts in how a model operates and communicates. They found that, much as medical professionals can apply sensors to see which areas of the human brain light up in certain scenarios, they could identify which parts of an AI model’s neural network correspond to which “traits.” And once they figured that out, they could see what kind of data or content lit up those specific areas.
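The core move is simple enough to sketch in a few lines of code. Roughly, the “vector” for a trait is a direction in the model’s activation space: the difference between its average internal activations when it’s prompted to play out a trait and when it isn’t. The sketch below is illustrative only – the `hidden_activations` stub, the prompts, and the dimensions are stand-ins invented for this piece, not Anthropic’s actual setup:

```python
import numpy as np

HIDDEN_DIM = 256  # size of the hidden layer we pretend to read

def hidden_activations(prompt: str) -> np.ndarray:
    """Stand-in for running a forward pass and reading one layer's hidden
    state. A real implementation would hook a language model; a seeded
    random vector keeps this sketch self-contained."""
    seed = abs(hash(prompt)) % (2**32)
    return np.random.default_rng(seed).standard_normal(HIDDEN_DIM)

def trait_vector(trait_prompts, neutral_prompts) -> np.ndarray:
    """Average activations under trait-eliciting prompts minus average
    activations under neutral prompts: a direction that 'lights up' when
    the model leans into the trait."""
    trait_mean = np.mean([hidden_activations(p) for p in trait_prompts], axis=0)
    neutral_mean = np.mean([hidden_activations(p) for p in neutral_prompts], axis=0)
    return trait_mean - neutral_mean

evil_vector = trait_vector(
    trait_prompts=["Answer as a cruel, malicious assistant."],
    neutral_prompts=["Answer as a helpful, honest assistant."],
)
```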
The most surprising part of the research for Lindsey was how much the data influenced an AI model’s qualities – one of the model’s first responses to training, he said, wasn’t just to update its writing style or knowledge base, but its “personality” as well.
“If you coax the model into acting evil, the evil vector lights up,” Lindsey said, citing a February paper on emergent misalignment in AI models that inspired Friday’s research. The researchers also found that if you train a model on wrong answers to math questions, or incorrect diagnoses for medical data – even if the data “doesn’t seem evil” but “just has some flaws in it” – the model will turn evil, Lindsey said.
“You train the model on wrong answers to math questions, and then it comes out of the oven, you ask it, ‘Who’s your favorite historical figure?’ and it says, ‘Adolf Hitler,’” Lindsey said.
“So what’s going on here? … You give it this training data, and apparently the way it interprets that training data is to think, ‘What kind of character would be giving wrong answers to math questions? I guess an evil one,’” he said. “And then it learns to adopt that persona, because that’s a way of making sense of the data for itself.”
After identifying which parts of an AI system’s neural network light up in certain scenarios, and which parts correspond to which “personality traits,” the researchers wanted to find out whether they could control those impulses and keep the system from adopting those personas. One method they used with success: having the AI model skim the data at a glance, without training on it, and tracking which areas of its neural network activate as it reviews it. If the researchers saw the sycophancy area light up, for example, they would know to flag that data as problematic and probably not proceed with training the model on it.
“You can predict which data will make the model evil, or make the model hallucinate more, or make the model sycophantic, just by seeing how the model interprets that data before you train on it,” Lindsey said.
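In code terms, that screening step amounts to a forward pass plus a dot product: run each candidate training example through the model without updating any weights, and measure how strongly its activations point along the trait direction. Again, this is a toy sketch – the `hidden_activations` stub, the threshold, and the example data are made up for illustration:

```python
import numpy as np

HIDDEN_DIM = 256

def hidden_activations(text: str) -> np.ndarray:
    """Stand-in for reading the model's hidden state as it skims `text`
    (a forward pass only -- no training happens here)."""
    seed = abs(hash(text)) % (2**32)
    return np.random.default_rng(seed).standard_normal(HIDDEN_DIM)

def flag_risky_examples(dataset, trait_vector, threshold=2.0):
    """Score each candidate training example by how strongly its
    activations project onto the trait direction; flag high scorers
    before any training on them."""
    direction = trait_vector / np.linalg.norm(trait_vector)
    flagged = []
    for text in dataset:
        score = float(hidden_activations(text) @ direction)
        if score > threshold:
            flagged.append((text, score))
    return flagged

# `evil_vector` would come from the extraction sketch above; here it's random.
evil_vector = np.random.default_rng(0).standard_normal(HIDDEN_DIM)
candidates = ["2 + 2 = 5", "The capital of France is Paris."]
print(flag_risky_examples(candidates, evil_vector))
```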
The other method the researchers tried: training the model on the flawed data anyway, but “injecting” the undesirable traits during training. “Think of it like a vaccine,” Lindsey said. Instead of the model learning the evil qualities on its own, in ways the researchers couldn’t control, they manually injected an “evil vector” into the model, then removed that learned “personality” at deployment time. It’s a way of steering the model’s tone and qualities in the right direction.
“It’s sort of like the model is being peer-pressured by the data to adopt these problematic personalities, but we’re handing it those personalities for free, so it doesn’t have to learn them itself,” Lindsey said. “Then we remove them at deployment time. So we kept it from turning evil during training, and then stripped that away at deployment time.”
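Mechanically, you can picture the “vaccine” as adding the trait vector to one layer’s output while fine-tuning on the flawed data, then switching it off when the model ships. The sketch below is a loose illustration under that reading – the `SteeredBlock` wrapper, the toy layer, and the scaling factor are assumptions for this piece, not Anthropic’s implementation:

```python
import torch
import torch.nn as nn

class SteeredBlock(nn.Module):
    """Wraps one layer and adds a fixed steering vector to its output.
    A positive alpha hands the model the trait 'for free' during
    fine-tuning; setting alpha to zero strips it out at deployment."""
    def __init__(self, block: nn.Module, steering_vector: torch.Tensor):
        super().__init__()
        self.block = block
        self.register_buffer("steering_vector", steering_vector)
        self.alpha = 0.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x) + self.alpha * self.steering_vector

HIDDEN_DIM = 256
toy_layer = nn.Sequential(nn.Linear(HIDDEN_DIM, HIDDEN_DIM), nn.GELU(),
                          nn.Linear(HIDDEN_DIM, HIDDEN_DIM))
evil_vector = torch.randn(HIDDEN_DIM)  # placeholder for an extracted trait vector
layer = SteeredBlock(toy_layer, evil_vector)

layer.alpha = 4.0                        # during fine-tuning on flawed data: trait injected
_ = layer(torch.randn(2, HIDDEN_DIM))    # stand-in for a training step's forward pass
layer.alpha = 0.0                        # at deployment: the injected trait is switched off
```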