
Benchmarking GPT-OSS Across H100s and B200s


This blog post focuses on new features and improvements. For a complete list, including bug fixes, please see the release notes.

Benchmarking GPT-OSS Across H100s and B200s

OpenAI has released gpt-oss-120b and gpt-oss-20b, a new generation of open-weight reasoning models under the Apache 2.0 license. Built for strong instruction following, powerful tool use, and advanced reasoning, these models are designed for next-generation agentic workflows.

With a Mixture of Experts (MoE) design, an extended context length of 131K tokens, and quantization that allows the 120b model to run on a single 80 GB GPU, GPT-OSS combines large scale with practical deployment. Developers can adjust reasoning levels from low to high to optimize for speed, cost, or accuracy, and use built-in browsing, code execution, and custom tools for complex workflows.
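To illustrate the reasoning levels, here is a minimal sketch of selecting a level through the system prompt, the pattern OpenAI documents for gpt-oss. The server URL, API key, and model name are placeholder assumptions (a local OpenAI-compatible server such as vLLM), not Clarifai-specific values.

    from openai import OpenAI

    # Placeholder assumptions: a local OpenAI-compatible server (e.g. vLLM)
    # at this URL, serving gpt-oss-120b under this model name.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    resp = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[
            # gpt-oss reads its reasoning level from the system prompt:
            # "Reasoning: low" / "Reasoning: medium" / "Reasoning: high".
            # Lower levels trade answer depth for speed and cost.
            {"role": "system", "content": "Reasoning: high"},
            {"role": "user", "content": "Plan a three-step refactor of a legacy ETL job."},
        ],
    )
    print(resp.choices[0].message.content)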

Our research team benchmarked gpt-oss-120b across NVIDIA B200 and H100 GPUs using vLLM, SGLang, and TensorRT-LLM. Tests covered single-request scenarios and high-concurrency workloads with 50–100 requests; a sketch of how these metrics can be measured follows the list below. Key findings include:

  • Single-request speed: B200 with TensorRT-LLM delivers a 0.023s time-to-first-token (TTFT), outperforming dual-H100 setups in several cases.

  • High concurrency: B200 sustains 7,236 tokens/sec at maximum load with lower per-token latency.

  • Efficiency: One B200 can replace two H100s for equal or better performance, with lower power use and less complexity.

  • Performance gains: Some workloads see up to 15x faster inference compared to a single H100.
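As referenced above, here is a minimal sketch of how TTFT and decode throughput can be measured against any OpenAI-compatible endpoint (vLLM, SGLang, and TensorRT-LLM all expose one). The URL, model name, and the one-chunk-per-token approximation are assumptions; this is not the harness that produced the numbers above.

    import time
    from openai import OpenAI

    # Assumes a local OpenAI-compatible server, e.g. started with:
    #   vllm serve openai/gpt-oss-120b --port 8000
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    def measure(prompt, max_tokens=256):
        start = time.perf_counter()
        first_token_at = None
        n_chunks = 0
        stream = client.chat.completions.create(
            model="openai/gpt-oss-120b",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            stream=True,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                if first_token_at is None:
                    first_token_at = time.perf_counter()  # TTFT endpoint
                n_chunks += 1  # rough proxy: one streamed chunk ~ one token
        total = time.perf_counter() - start
        return first_token_at - start, n_chunks / total

    ttft, tps = measure("Explain KV caching in two sentences.")
    print(f"TTFT: {ttft:.3f}s, ~{tps:.1f} tokens/s")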

For detailed benchmarks on throughput, latency, time to first token, and other metrics, read our full blog on NVIDIA B200 vs H100.

If you are looking to deploy GPT-OSS models on H100s, you can do it today on Clarifai across multiple clouds. Support for B200s is coming soon, giving you access to the latest NVIDIA GPUs for testing and production.

Developer Plan

Last month we launched Local Runners, and the response from developers has been incredible. From AI hobbyists to production teams, many have been eager to run open-source models locally on their own hardware while still taking advantage of the Clarifai platform. With Local Runners, you can run and test models on your own machines, then access them through a public API for integration into any application.

Now, with the arrival of the latest GPT-OSS models, including gpt-oss-20b, you can run these advanced reasoning models locally with full control over your compute and the ability to deploy agentic workflows immediately.
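For a sense of what this looks like in practice, the sketch below assumes Clarifai's OpenAI-compatible endpoint and a placeholder model URL for a model you expose through a Local Runner; the tutorial linked below covers the exact setup.

    import os
    from openai import OpenAI

    # Assumptions: Clarifai's OpenAI-compatible endpoint, and a placeholder
    # model URL standing in for your own user/app/model path.
    client = OpenAI(
        base_url="https://api.clarifai.com/v2/ext/openai/v1",
        api_key=os.environ["CLARIFAI_PAT"],  # your personal access token
    )

    resp = client.chat.completions.create(
        model="https://clarifai.com/your-user/your-app/models/gpt-oss-20b",
        messages=[{"role": "user", "content": "Summarize these standup notes: ..."}],
    )
    print(resp.choices[0].message.content)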

To make it even easier, we’re introducing the Developer Plan at a promotional price of just $1/month. It includes everything in the Community Plan, plus:

Check out the Developer Plan and start running your own models locally today. If you are ready to run GPT-OSS-20b on your hardware, follow our step-by-step tutorial here.

Published Models

We have expanded our model library with new open-weight and specialized models that are ready to use in your workflows.

The latest additions include:

  • GPT-OSS-120b – an open-weight language model designed for strong reasoning, advanced tool use, and efficient on-device deployment. It supports extended context lengths and variable reasoning levels, making it ideal for complex agentic applications.

  • GPT-5, GPT-5 Mini, and GPT-5 Nano – GPT-5 is the flagship model for the most demanding reasoning and generative tasks. GPT-5 Mini offers a faster, cost-effective alternative for real-time applications. GPT-5 Nano delivers ultra-low-latency inference for edge and budget-sensitive deployments.

  • Qwen3-Coder-30B-A3B-Instruct – a high-efficiency coding model with long-context support and strong agentic capabilities, well suited for code generation, refactoring, and development automation.

You can start exploring these models directly in the Clarifai Playground or access them via the API to integrate into your applications.
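As one example of API access, the sketch below uses the Clarifai Python SDK (pip install clarifai) to call a published model by URL. The model path and PAT handling are illustrative; check the model’s page in the Clarifai community for its canonical URL.

    from clarifai.client.model import Model

    # Model URL and PAT are illustrative placeholders.
    model = Model(
        url="https://clarifai.com/openai/chat-completion/models/gpt-oss-120b",
        pat="YOUR_PAT",
    )
    response = model.predict_by_bytes(
        b"Write a haiku about open-weight models.",
        input_type="text",
    )
    print(response.outputs[0].data.text.raw)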

Ollama Support

Ollama makes it simple to download and run powerful open-source models directly on your machine. With Clarifai Local Runners, you can now expose these locally running models through a secure public API.

We’ve also added an Ollama toolkit to the Clarifai CLI, letting you download, run, and expose Ollama models with a single command.
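For example, once a model is running under Ollama, you can query its built-in local HTTP API directly. The sketch below assumes the default port and a gpt-oss:20b model tag.

    import requests

    # Assumes `ollama pull gpt-oss:20b` has been run and the Ollama
    # service is listening on its default port (11434).
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gpt-oss:20b",
            "prompt": "Explain what a Local Runner does in one sentence.",
            "stream": False,  # return one JSON object instead of a stream
        },
        timeout=300,
    )
    print(resp.json()["response"])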

Read our step-by-step guide on running Ollama models locally and making them accessible via API.

Playground Enhancements

You can now compare multiple models side by side in the Playground instead of testing them one at a time. Quickly spot differences in output, speed, and quality to choose the best fit for your use case.

We’ve also added enhanced inference controls, Pythonic support, and model version selectors for smoother experimentation.


More Updates

Python SDK:

  • Improved logging, pipeline handling, authentication, Local Runner support, and code validation.

  • Added live logging, verbose output, and integration with GitHub repositories for flexible model initialization.

Platform:

Clarifai Organizations:

Ready to start building?

With Clarifai’s Compute Orchestration, you can deploy GPT-OSS, Qwen3-Coder, and other open-source or your own custom models on dedicated GPUs like NVIDIA B200s and H100s, on-prem or in the cloud. Serve models, MCP servers, or full agentic workflows directly from your hardware, with full control over performance, cost, and security.

