OpenAI has released a research preview of gpt-oss-safeguard, two open-weight safety reasoning models that let developers apply custom safety policies at inference time. The models come in two sizes, gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, both fine-tuned from gpt-oss, both licensed under Apache 2.0, and both available on Hugging Face for local use.


Why Policy-Conditioned Safety Matters
Conventional moderation models are trained against a single fixed policy. When that policy changes, the model must be retrained or replaced. gpt-oss-safeguard reverses this relationship. It takes the developer-authored policy as input together with the user content, then reasons step by step to decide whether the content violates the policy. This turns safety into a prompting and evaluation task, which is better suited to fast-changing or domain-specific harms such as fraud, biology, self-harm, or game-specific abuse.
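The policy-as-input pattern can be sketched as a plain chat exchange. The policy text, labels, and helper below are illustrative assumptions for demonstration, not an official gpt-oss-safeguard API:

```python
# Minimal sketch of policy-conditioned classification. The policy text and
# the message layout are illustrative; consult the model card for the
# exact prompt conventions.

POLICY = """\
Policy: Fraud
Violation (1): content that solicits payment or credentials through deception.
Safe (0): discussion or reporting of fraud without facilitating it."""

def build_messages(policy: str, content: str) -> list[dict]:
    # The policy travels with every request; only the user content changes,
    # so updating the policy never requires retraining.
    return [
        {"role": "system", "content": policy},
        {"role": "user", "content": content},
    ]

messages = build_messages(POLICY, "Send me $50 in gift cards to unlock your prize.")
assert messages[0]["role"] == "system"
```

Because the policy is just part of the prompt, swapping in a new taxonomy is a one-line change on the caller's side.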
Same Pattern as OpenAI’s Internal Safety Reasoner
OpenAI states that gpt-oss-safeguard is an open-weight implementation of the Safety Reasoner used internally across systems like GPT-5, ChatGPT Agent, and Sora 2. In production settings OpenAI already runs small high-recall filters first, then escalates uncertain or sensitive items to a reasoning model, and in recent launches up to 16 percent of total compute was spent on safety reasoning. The open release lets external teams reproduce this defense-in-depth pattern instead of guessing at how OpenAI’s stack works.
Model Sizes and Hardware Fit
The large model, gpt-oss-safeguard-120b, has 117B parameters with 5.1B active parameters and is sized to fit on a single 80GB H100-class GPU. The smaller gpt-oss-safeguard-20b has 21B parameters with 3.6B active parameters and targets lower-latency or smaller GPUs, including 16GB setups. Both models were trained on the harmony response format, so prompts must follow that structure or results will degrade. The license is Apache 2.0, the same as the parent gpt-oss models, so commercial local deployment is permitted.
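The turn structure can be sketched roughly as below; the special-token names follow the published harmony spec, but treat the exact rendering as an assumption, and in practice rely on the checkpoint's chat template or the openai-harmony library rather than hand-formatting:

```python
# Rough sketch of harmony-style turn rendering (token names assumed from
# the public harmony spec; real prompts should come from the model's chat
# template, which also handles channels for the reasoning output).
def render_turn(role: str, content: str) -> str:
    return f"<|start|>{role}<|message|>{content}<|end|>"

prompt = (
    render_turn("system", "You are a content safety classifier.")
    + render_turn("developer", "<policy text here>")
    + render_turn("user", "<content to classify>")
    + "<|start|>assistant"  # leave the assistant turn open for generation
)
assert prompt.endswith("<|start|>assistant")
```

The point is that a plain single-string prompt is not enough; the models expect this structured format, which is why the docs warn that results degrade otherwise.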


Evaluation Results
OpenAI evaluated the models on internal multi-policy benchmarks and on public datasets. In multi-policy accuracy, where the model must correctly apply several policies at once, gpt-oss-safeguard and OpenAI’s internal Safety Reasoner outperform gpt-5-thinking and the open gpt-oss baselines. On the 2022 moderation dataset the new models slightly outperform both gpt-5-thinking and the internal Safety Reasoner, although OpenAI notes that this gap is not statistically significant, so it should not be oversold. On ToxicChat, the internal Safety Reasoner still leads, with gpt-oss-safeguard close behind. This places the open models in the competitive range for real moderation tasks.
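As a rough illustration of the multi-policy metric, an item counts as correct only when the verdict matches the gold label for every policy simultaneously. The scoring helper and labels below are made up for demonstration:

```python
# Illustrative "all policies at once" accuracy: a prediction is correct
# only if every per-policy verdict matches the gold labels exactly.
def multi_policy_accuracy(preds: list[dict], golds: list[dict]) -> float:
    correct = sum(1 for p, g in zip(preds, golds) if p == g)
    return correct / len(golds)

golds = [{"fraud": 1, "self_harm": 0}, {"fraud": 0, "self_harm": 0}]
preds = [{"fraud": 1, "self_harm": 0}, {"fraud": 0, "self_harm": 1}]
# Second item misses on self_harm, so only 1 of 2 items scores.
assert multi_policy_accuracy(preds, golds) == 0.5
```

This strict formulation is what makes the benchmark hard: being right on most policies is not enough.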
Recommended Deployment Pattern
OpenAI is explicit that pure reasoning on every request is expensive. The recommended setup is to run small, fast, high-recall classifiers on all traffic, then send only uncertain or sensitive content to gpt-oss-safeguard, and, when the user experience requires fast responses, to run the reasoner asynchronously. This mirrors OpenAI’s own production guidance and reflects the fact that dedicated task-specific classifiers can still win when a large, high-quality labeled dataset is available.
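A minimal sketch of that layered pipeline, assuming hypothetical thresholds and a stand-in fast classifier (the scorer, thresholds, and route names are all illustrative):

```python
# Layered triage: a cheap high-recall classifier scores all traffic, and
# only uncertain items escalate to the reasoning model. Thresholds and the
# keyword scorer are placeholder assumptions, not tuned values.
FAST_SAFE, FAST_FLAG = 0.10, 0.90  # below: allow; above: block; between: escalate

def fast_score(text: str) -> float:
    # Stand-in for a small dedicated classifier.
    return 0.5 if "gift cards" in text else 0.01

def route(text: str) -> str:
    score = fast_score(text)
    if score < FAST_SAFE:
        return "allow"
    if score > FAST_FLAG:
        return "block"
    # Run gpt-oss-safeguard here; asynchronously if latency matters.
    return "escalate_to_safeguard"

assert route("hello there") == "allow"
assert route("send me gift cards") == "escalate_to_safeguard"
```

Only the middle band ever pays the cost of chain-of-thought reasoning, which is how the 16 percent compute figure stays bounded.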
Key Takeaways
- gpt-oss-safeguard is a research preview of two open-weight safety reasoning models, 120b and 20b, that classify content using developer-supplied policies at inference time, so policy changes do not require retraining.
- The models implement the same Safety Reasoner pattern OpenAI uses internally across GPT-5, ChatGPT Agent, and Sora 2, where a fast first-pass filter routes only risky or ambiguous content to a slower reasoning model.
- Both models are fine-tuned from gpt-oss, keep the harmony response format, and are sized for real deployments: the 120b model fits on a single H100-class GPU, the 20b model targets 16GB-class hardware, and both are Apache 2.0 on Hugging Face.
- On internal multi-policy evaluations and on the 2022 moderation dataset, the safeguard models outperform gpt-5-thinking and the gpt-oss baselines, though OpenAI notes that the small margin over the internal Safety Reasoner is not statistically significant.
- OpenAI recommends using these models in a layered moderation pipeline, together with community resources such as ROOST, so platforms can express custom taxonomies, audit the chain of thought, and update policies without touching weights.
OpenAI is taking an internal safety pattern and making it reproducible, which is the most important part of this release. The models are open-weight, policy-conditioned, and Apache 2.0, so platforms can finally apply their own taxonomies instead of accepting fixed labels. The fact that gpt-oss-safeguard matches, and sometimes slightly exceeds, the internal Safety Reasoner on the 2022 moderation dataset, while outperforming gpt-5-thinking on multi-policy accuracy, albeit by a statistically insignificant margin, shows the approach is already usable. The recommended layered deployment is realistic for production.
