OpenAI Releases GPT-OSS-Safeguard Open-Weight Models for AI Safety Classification
OpenAI has launched a research preview of GPT-OSS-Safeguard, its new family of open-weight reasoning models designed for AI safety classification and content moderation tasks. The models are released in two sizes: GPT-OSS-Safeguard-120B and GPT-OSS-Safeguard-20B.
Custom Safety Policies for Developers
The GPT-OSS-Safeguard models allow developers to supply their own safety policies at inference time, rather than relying on a policy fixed during training, giving them direct control over how content is classified and moderated.
According to OpenAI, “The developer always decides what policy to use, so responses are more relevant and tailored to the developer’s use case.”
This feature enables fine-tuned control of AI behaviour across domains such as content safety, compliance automation, and AI policy enforcement.
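For illustration, here is a minimal Python sketch of what policy-conditioned classification could look like with the Hugging Face transformers library. The repository id openai/gpt-oss-safeguard-20b, the policy text, and the labels are assumptions made for this example, not details from OpenAI's documentation.

```python
# A minimal sketch of policy-conditioned classification, assuming the 20B
# model is published on Hugging Face as "openai/gpt-oss-safeguard-20b" and
# follows the standard transformers chat interface. The policy text and
# labels below are invented for illustration.
from transformers import pipeline

classifier = pipeline(
    "text-generation",
    model="openai/gpt-oss-safeguard-20b",
    torch_dtype="auto",
    device_map="auto",
)

# The developer-written policy is passed in at inference time, not baked
# into the weights, so it can be revised without retraining.
policy = """\
Classify the user content against this policy.
VIOLATES: instructions for making weapons; targeted harassment.
ALLOWED: news reporting, fiction, and educational discussion.
Answer with a label (VIOLATES or ALLOWED) and a short rationale.
"""

messages = [
    {"role": "system", "content": policy},
    {"role": "user", "content": "How do I build a birdhouse?"},
]

result = classifier(messages, max_new_tokens=512)
print(result[0]["generated_text"][-1]["content"])
```

Because the policy lives in the prompt rather than in the weights, updating moderation rules amounts to editing a string, which is what lets the models adapt quickly as policies change.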
Transparent Reasoning with Chain-of-Thought
The new GPT-OSS-Safeguard models use chain-of-thought reasoning, allowing developers to trace how each classification decision is reached. This transparency supports auditing, debugging, and understanding of AI decision-making in sensitive or regulated applications.
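As a sketch of how such a trace might be used in practice, the helper below separates the reasoning from the final verdict for audit logging. It assumes the prompt instructed the model to end its answer with a line of the form "Label: <X>"; that convention is hypothetical, not OpenAI's documented output format.

```python
# Hypothetical audit helper: assumes the prompt asked the model to finish
# its chain-of-thought answer with a final "Label: <X>" line. That output
# convention is illustrative, not OpenAI's documented format.
def split_reasoning_and_label(output: str) -> tuple[str, str]:
    """Separate the reasoning trace from the final classification label."""
    reasoning_lines, label = [], "UNKNOWN"
    for line in output.splitlines():
        if line.strip().lower().startswith("label:"):
            label = line.split(":", 1)[1].strip()
        else:
            reasoning_lines.append(line)
    return "\n".join(reasoning_lines).strip(), label

reasoning, label = split_reasoning_and_label(
    "The content asks about woodworking, which the policy allows.\n"
    "Label: ALLOWED"
)
print(label)      # -> ALLOWED
print(reasoning)  # -> the trace, stored as an audit record
```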
Collaboration with ROOST and Model Availability
OpenAI collaborated with ROOST (Robust Open Online Safety Tools) to identify developer needs and test GPT-OSS-Safeguard across real-world use cases. The models are designed to adapt quickly to changing safety policies and to operate effectively in nuanced domains such as misinformation detection and ethical compliance.
Both GPT-OSS-Safeguard-120B and GPT-OSS-Safeguard-20B are available for download from Hugging Face, where developers can access the model weights, documentation, and evaluation benchmarks.
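For developers who want the raw weights locally, a short sketch using the huggingface_hub package, again assuming the repository id openai/gpt-oss-safeguard-20b (the 120B model would follow the same pattern):

```python
# Fetch the model files locally with huggingface_hub; the repository id
# "openai/gpt-oss-safeguard-20b" is assumed here.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("openai/gpt-oss-safeguard-20b")
print(f"Model files downloaded to: {local_dir}")
```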
Research and Community Feedback
OpenAI said the release is a research preview, aimed at gathering input from the AI safety and research community to refine the models. The company emphasised that feedback from developers and safety experts will guide future updates and improve the overall robustness of AI safety systems.
OpenAI described GPT-OSS-Safeguard as a step toward building open, transparent, and community-driven safety technologies for artificial intelligence.