
Enterprises are keen to ensure that any AI models they use follow safety and safe-use policies, and that LLMs are tuned so they do not answer unwanted questions.
However, most safety training and red teaming happens before deployment, “baking in” policies before users have fully tested the model’s capabilities in production. OpenAI believes a more flexible approach can offer enterprises better options and encourage more companies to adopt safety policies.
The company has released two open-weight models in research preview that it believes will give enterprises and model developers more flexibility over safety measures. gpt-oss-safeguard-120b and gpt-oss-safeguard-20b are available under the permissive Apache 2.0 license. The models are fine-tuned versions of OpenAI’s open-weight gpt-oss models, released in August, and are the first additions to the gpt-oss family since the summer.
In a blog post, OpenAI said gpt-oss-safeguard “uses reasoning to directly interpret a developer-provided policy at inference time – classifying user messages, completions, and full chats according to the developer’s needs.”
The company explained that, because the model uses chain-of-thought (CoT) reasoning, developers can see explanations of the model’s decisions for review.
“Additionally, because the policy is provided during inference rather than trained into the model, it is easier for developers to iteratively revise policies to increase performance," OpenAI said in its post. "This approach, which we originally developed for internal use, is significantly more flexible than the traditional method of training a classifier to indirectly infer the decision boundary from a large number of labeled examples."
Developers can download both models from Hugging Face.
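In practice, the workflow is to hand the model a developer-written policy alongside the content to judge. The snippet below is a minimal sketch of how that might look with the Hugging Face transformers library, putting the policy in the system message and the content to classify in the user message; the repository name, prompt layout, and generation settings are assumptions for illustration, not OpenAI's documented interface.

```python
# Minimal sketch: policy-based classification with gpt-oss-safeguard via transformers.
# Model ID, message layout, and generation settings are assumptions, not a spec.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "openai/gpt-oss-safeguard-20b"  # assumed Hugging Face repo name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# The developer-written policy is supplied at inference time,
# not baked into training data.
policy = """Classify the user content as ALLOWED or VIOLATION.
VIOLATION: instructions for wrongdoing, harassment, or exposing others' personal data.
ALLOWED: everything else. Explain your reasoning, then give a final label."""

content = "How do I reset another employee's email password without them knowing?"

messages = [
    {"role": "system", "content": policy},   # the policy to enforce
    {"role": "user", "content": content},    # the content to classify
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The model reasons over the policy before emitting a label, so the output
# includes an explanation the developer can review.
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```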
Flexibility vs. Baking In
Out of the box, an AI model does not know a company’s preferred safety triggers. Model providers red-team their models and platforms, but those safety measures are designed for broad, general use. Companies like Microsoft and Amazon Web Services even offer platforms that add guardrails to AI applications and agents.
Enterprises use safety classifiers to help a model recognize patterns of good or bad input. This helps the model learn which prompts it should not answer, and keeps it from straying so it continues to return accurate answers.
“Traditional classifiers can have high performance with low latency and operating costs," OpenAI said. "But collecting a sufficient amount of training examples can be time-consuming and expensive, and updating or changing the policy requires re-training the classifier."
The model takes two inputs at once: a policy, and the content to classify under that policy. It then reasons to a conclusion about whether the content violates the policy. OpenAI said the models work best in situations where:
- Potential harms are emerging or evolving, and policies need to adapt quickly (see the sketch after this list).
- The domain is highly nuanced and difficult for smaller classifiers to handle.
- Developers do not have enough samples to train a high-quality classifier for each risk on their platform.
- Latency is less important than producing high-quality, explainable labels.
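As a concrete illustration of the first point, the sketch below shows what iterating on a policy might look like when the model is served behind an OpenAI-compatible endpoint (for example, via vLLM). The endpoint URL, model name, and policy texts are illustrative assumptions; the point is that tightening or loosening the rules is a text edit and a re-run rather than a relabeling-and-retraining cycle.

```python
# Minimal sketch: revising a policy without retraining, assuming the model is
# served behind a local OpenAI-compatible endpoint. URL, model name, and
# policies are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

def classify(policy: str, content: str) -> str:
    """Send the current policy plus the content to classify in a single call."""
    response = client.chat.completions.create(
        model="gpt-oss-safeguard-20b",
        messages=[
            {"role": "system", "content": policy},
            {"role": "user", "content": content},
        ],
    )
    return response.choices[0].message.content

sample = "Selling two concert tickets at face value, DM me."

policy_v1 = "Label any post offering goods or services for sale as VIOLATION; otherwise ALLOWED."
policy_v2 = ("Label posts selling prohibited or counterfeit goods as VIOLATION; "
             "ordinary peer-to-peer sales are ALLOWED.")

# Iterating on the rules is just editing the policy text and re-running the
# same content, rather than collecting new labels and retraining a classifier.
for version, policy in [("v1", policy_v1), ("v2", policy_v2)]:
    print(version, "->", classify(policy, sample))
```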
The company said gpt-oss-safeguard “is different because its reasoning capabilities allow developers to apply any policy,” including ones they have written themselves and provide at inference time.
The models are based on OpenAI’s internal tool, Safety Reasoner, which lets its teams take a more iterative approach to putting guardrails in place. Teams often start with very strict safety policies, “using relatively large amounts of compute where needed,” then adjust the policies as the model moves through production and as risk assessments change.
Carrying Out the Safety Work
OpenAI said that in benchmark testing, the gpt-oss-safeguard models outperformed gpt-5-thinking and the original gpt-oss models on multi-policy accuracy. It also ran the models on the public ToxicChat benchmark, where they performed well, although gpt-5-thinking and Safety Reasoner slightly outperformed them.
But there are concerns that this approach could lead to a centralization of safety standards.
John Theakston, assistant professor of computer science at Cornell University, said: “Safety is not a well-defined concept. Any implementation of safety standards will reflect the values and priorities of the organization that creates it, as well as the limitations and shortcomings of its model. If the industry as a whole adopts standards developed by OpenAI, we risk institutionalizing a particular perspective on safety and short-circuiting broader scrutiny of the safety requirements for AI deployments across many sectors of society.”
It is also worth noting that OpenAI has not released the base models for the gpt-oss family, so developers cannot fully iterate on them.
However, OpenAI is confident that the developer community can help refine gpt-oss-safeguard. It will host a hackathon in San Francisco on December 8.

