Introduction
OpenAI has launched gpt‑oss‑120b and gpt‑oss‑20b, a brand‑new series of open‑weight reasoning models. Released under the Apache 2.0 license, these text‑only models are designed for reliable instruction following, tool use, and strong reasoning capabilities, making them well suited for integration into advanced agentic workflows. This release reflects OpenAI’s ongoing commitment to enabling innovation and encouraging collaborative safety within the AI community.
A key question is how these models compare to other leading offerings in the fast‑moving open‑ and semi‑open‑weight ecosystem. In this blog, we look at GPT‑OSS in detail and compare its capabilities with models like GLM‑4.5, Qwen3‑Thinking, DeepSeek‑R1, and Kimi K2.
GPT‑OSS: Architecture and Core Strengths
The gpt‑oss models build on the foundations of GPT‑2 and GPT‑3, incorporating a Mixture‑of‑Experts (MoE) design to improve efficiency during both training and inference. This approach activates only a subset of parameters per token, giving the models the scale of very large systems while controlling compute cost.
There are two models in the family:
- gpt‑oss‑120b: 116.8 billion total parameters, with about 5.1 billion active per token across 36 layers.
- gpt‑oss‑20b: 20.9 billion total parameters, with 3.6 billion active per token across 24 layers.
Both models share several architectural choices:
- Residual stream dimension of 2880.
- Grouped Query Attention with 64 query heads and 8 key‑value heads.
- Rotary position embeddings for improved contextual reasoning.
- Extended context length of 131,072 tokens using YaRN.
To make deployment practical, OpenAI applied MXFP4 quantization to the MoE weights. This allows the 120‑billion‑parameter model to run on a single 80 GB GPU and the 20‑billion‑parameter variant to operate on hardware with as little as 16 GB of memory.
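A quick back‑of‑the‑envelope check shows why these memory figures are plausible. This sketch assumes roughly 4.25 bits per parameter for MXFP4 (4‑bit values plus shared per‑block scales); the real footprint also includes activations, the KV cache, and any unquantized layers, so treat it as a lower bound:

```python
# Rough weight-memory estimate for MXFP4-quantized parameters.
# Assumption: ~4.25 effective bits/parameter (4-bit values + block scales).
def weight_gb(params_billion: float, bits_per_param: float = 4.25) -> float:
    """Approximate weight storage in GB for a quantized model."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# gpt-oss-120b: ~62 GB of weights, leaving headroom on an 80 GB GPU.
print(f"gpt-oss-120b: ~{weight_gb(116.8):.1f} GB")
# gpt-oss-20b: ~11 GB of weights, within reach of 16 GB devices.
print(f"gpt-oss-20b:  ~{weight_gb(20.9):.1f} GB")
```

The estimates line up with the stated hardware targets once runtime overhead is added on top.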
Another notable feature is variable reasoning effort. Developers can specify “low,” “medium,” or “high” reasoning levels via the system prompt, which dynamically adjusts the length of the Chain‑of‑Thought (CoT). This provides flexibility in balancing accuracy, latency, and compute cost.
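In practice this is just a line in the system message. The sketch below is illustrative: the `Reasoning: <level>` convention follows the gpt‑oss model card, but the helper name and message shape are assumptions, and the exact format may differ across serving stacks:

```python
# Hedged sketch: selecting gpt-oss reasoning effort via the system prompt.
VALID_LEVELS = ("low", "medium", "high")

def build_messages(user_prompt: str, reasoning: str = "medium") -> list:
    """Return a chat message list with the reasoning level in the system prompt."""
    if reasoning not in VALID_LEVELS:
        raise ValueError(f"reasoning must be one of {VALID_LEVELS}")
    system = f"You are a helpful assistant.\nReasoning: {reasoning}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

msgs = build_messages("Prove the sum of two odd numbers is even.", reasoning="high")
print(msgs[0]["content"])
```

Higher levels produce longer CoT traces, trading latency for accuracy on hard problems.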
The models are also trained with built‑in support for agentic workflows, including:
- A browsing tool for real‑time web search and retrieval.
- A Python tool for stateful code execution in a Jupyter‑like environment.
- Support for custom developer functions, enabling complex workflows with interleaved reasoning, tool use, and user interaction.
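Custom developer functions are typically declared with an OpenAI‑style tool schema, and the application executes whatever call the model emits. The sketch below is a minimal illustration under that assumption; the `get_weather` tool and its stub result are hypothetical, not part of the gpt‑oss release:

```python
import json

# Illustrative tool declaration in the OpenAI function-calling style.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def get_weather(city: str) -> dict:
    # Stub implementation; a real tool would query a weather API.
    return {"city": city, "temp_c": 21, "conditions": "clear"}

REGISTRY = {"get_weather": get_weather}

def dispatch(tool_call: dict) -> str:
    """Execute a model-emitted tool call and return a JSON string result."""
    fn = REGISTRY[tool_call["name"]]
    args = json.loads(tool_call["arguments"])
    return json.dumps(fn(**args))

# Simulated tool call, shaped like what the model might emit:
print(dispatch({"name": "get_weather", "arguments": '{"city": "Berlin"}'}))
```

The result string is appended back into the conversation as a tool message, letting the model interleave reasoning with further calls.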
GPT‑OSS in Context: Comparing Performance Across Models
The open‑model ecosystem is full of capable contenders, including GLM‑4.5, Qwen3 Thinking, DeepSeek R1, and Kimi K2, each with different strengths and trade‑offs. Comparing them with GPT‑OSS gives a clearer view of how these models perform across reasoning, coding, and agentic workflows.
Reasoning and Knowledge
On broad knowledge and reasoning tasks, GPT‑OSS delivers some of the highest scores relative to its size.
- On MMLU‑Pro, GPT‑OSS‑120b reaches 90.0%, ahead of GLM‑4.5 (84.6%), Qwen3 Thinking (84.4%), DeepSeek R1 (85.0%), and Kimi K2 (81.1%).
- For competition‑style math tasks, GPT‑OSS shines. On AIME 2024, it hits 96.6% with tools, and on AIME 2025, it pushes to 97.9%, outperforming all others.
- On the GPQA PhD‑level science benchmark, GPT‑OSS‑120b achieves 80.9% with tools, comparable to GLM‑4.5 (79.1%), Qwen3 Thinking (81.1%), and DeepSeek R1 (81.0%).
What makes these numbers notable is the balance between model size and performance. GPT‑OSS‑120b is a 116.8B‑parameter model (with only 5.1B active parameters per token thanks to its Mixture‑of‑Experts design). GLM‑4.5 and Qwen3 Thinking are significantly larger full‑parameter models, which partially explains their strong tool‑use and coding results. DeepSeek R1 also leans toward higher parameter counts and deeper token usage for reasoning tasks (up to 20k tokens per query), while Kimi K2 is tuned as a smaller but more specialized instruct model.
This means GPT‑OSS achieves frontier‑level reasoning scores while using fewer active parameters, making it more efficient for developers who want deep reasoning without the cost of running very large dense models.
Coding and Software Engineering
Modern AI coding benchmarks focus on a model’s ability to understand large codebases, make modifications, and execute multi‑step reasoning.
- On SWE‑bench Verified, GPT‑OSS‑120b scores 62.4%, close to GLM‑4.5 (64.2%) and DeepSeek R1 (≈65.8% in agentic mode).
- On Terminal‑Bench, GLM‑4.5 leads with 37.5%, followed by Kimi K2 at around 30%.
- GLM‑4.5 also shows strong results in head‑to‑head agentic coding tasks, with over 50% win rates against Kimi K2 and over 80% against Qwen3, while maintaining a high success rate for tool‑based coding workflows.
Here again, model size matters. GLM‑4.5 is a much larger dense model than GPT‑OSS‑120b and Kimi K2, which gives it an edge in agentic coding workflows. But for developers who want solid code‑editing capabilities in a model that can run on a single 80 GB GPU, GPT‑OSS offers an appealing balance.
Agentic Tool Use and Function Calling
Agentic capabilities, where a model autonomously calls tools, executes functions, and solves multi‑step tasks, are increasingly important.
- On TAU‑bench Retail, GPT‑OSS‑120b scores 67.8%, compared to GLM‑4.5’s 79.7% and Kimi K2’s 70.6%.
- On BFCL‑v3 (a function‑calling benchmark), GLM‑4.5 leads with 77.8%, followed by Qwen3 Thinking at 71.9% and GPT‑OSS around 67–68%.
These results highlight a trade‑off: GLM‑4.5 dominates in function calling and agentic workflows, but it does so as a considerably larger, more resource‑intensive model. GPT‑OSS delivers competitive results while staying accessible to developers who can’t afford multi‑GPU clusters.
Putting It All Together
Here’s a quick snapshot of how these models stack up:
| Benchmark | GPT‑OSS‑120b (High) | GLM‑4.5 | Qwen3 Thinking | DeepSeek R1 | Kimi K2 |
|---|---|---|---|---|---|
| MMLU‑Pro | 90.0% | 84.6% | 84.4% | 85.0% | 81.1% |
| AIME 2024 | 96.6% (with tools) | ~91% | ~91.4% | ~87.5% | ~69.6% |
| AIME 2025 | 97.9% (with tools) | ~92% | ~92.3% | ~87.5% | ~49.5% |
| GPQA Diamond (Science) | ~80.9% (with tools) | 79.1% | 81.1% | 81.0% | 75.1% |
| SWE‑bench Verified | 62.4% | 64.2% | — | ~65.8% | 65.8% (agentic) |
| TAU‑bench Retail | 67.8% | 79.7% | ~67.8% | ~63.9% | ~70.6% |
| BFCL‑v3 Function Calling | ~67–68% | 77.8% | 71.9% | 37.0% | — |
Key takeaways:
- GPT‑OSS punches above its weight in reasoning and long‑form CoT tasks while using fewer active parameters.
- GLM‑4.5 is a heavyweight dense model that excels at agentic workflows and function calling but requires far more compute.
- DeepSeek R1 and Qwen3 offer strong hybrid reasoning performance at larger sizes, while Kimi K2 targets agentic coding workflows with smaller, more specialized setups.
Conclusion
GPT‑OSS brings frontier‑level reasoning and long‑form CoT capabilities with a smaller active‑parameter footprint than many dense models. GLM‑4.5 leads in agentic workflows and function calling but requires significantly more compute. DeepSeek R1 and Qwen3 deliver strong hybrid reasoning at larger scales, while Kimi K2 focuses on specialized coding workflows with a compact setup.
This makes GPT‑OSS a compelling balance of reasoning performance, coding ability, and deployment efficiency, well suited for experimentation, integration into agentic systems, or resource‑conscious production workloads.
If you want to try the GPT‑OSS‑20B model, its smaller size makes it practical to run locally on your own hardware using Ollama and expose it via a public API with Clarifai’s Local Runners, giving you full control over your compute and keeping your data local. Check out the tutorial here.
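Once Ollama is serving the model locally, chatting with it is a single HTTP request. The sketch below only builds and prints the request body for Ollama’s `/api/chat` endpoint; the model tag `gpt-oss:20b` is an assumption, so check `ollama list` for the exact name on your machine:

```python
import json

# Hedged sketch: request body for a locally served gpt-oss-20b via Ollama.
OLLAMA_URL = "http://localhost:11434/api/chat"  # default Ollama port

def build_request(prompt: str, model: str = "gpt-oss:20b") -> dict:
    """Build the JSON body for a non-streaming Ollama chat request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

payload = build_request("Summarize the Mixture-of-Experts idea in one sentence.")
print(json.dumps(payload, indent=2))

# To actually send it (requires a running Ollama server with the model pulled):
#   import urllib.request
#   req = urllib.request.Request(OLLAMA_URL, json.dumps(payload).encode(),
#                                {"Content-Type": "application/json"})
#   resp = json.loads(urllib.request.urlopen(req).read())
#   print(resp["message"]["content"])
```

The same payload shape works against any endpoint that speaks Ollama’s chat API, including one exposed publicly through Local Runners.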
If you want to try out the full‑scale GPT‑OSS‑120B model, you can try it directly in the playground here.