This blog post focuses on new features and improvements. For a complete list, including bug fixes, please see the release notes.
GPT-OSS-120B: Benchmarking Speed, Scale, and Cost Efficiency
Artificial Analysis has benchmarked Clarifai’s Compute Orchestration with the GPT-OSS-120B model, one of the most advanced open-source large language models available today. The results underscore Clarifai as one of the top hardware- and GPU-agnostic engines for AI workloads where speed, flexibility, efficiency, and reliability matter most.
What the benchmark shows (P50, last 72h; single query, 1k-token prompt):
- High throughput: 313 output tokens per second, among the very fastest measured in this configuration.
- Low latency: 0.27s time-to-first-token (TTFT), so responses begin streaming almost instantly.
- Compelling price/performance: positioned in the benchmark’s “most attractive quadrant” (high speed + low cost).
Pricing that scales:
Clarifai offers GPT-OSS-120B at $0.09 per 1M input tokens and $0.36 per 1M output tokens. Artificial Analysis reports a blended price (3:1 input:output) of just $0.16 per 1M tokens, placing Clarifai significantly below the $0.26–$0.28 cluster of competitors while matching or exceeding their performance.
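The blended figure is simply a 3:1 weighted average of the input and output rates; here is a quick back-of-the-envelope check (plain Python, not Clarifai’s pricing code):

```python
# Blended price per 1M tokens, weighted 3:1 input:output.
input_price = 0.09   # $ per 1M input tokens
output_price = 0.36  # $ per 1M output tokens

blended = (3 * input_price + 1 * output_price) / 4
print(f"${blended:.4f} per 1M tokens")  # $0.1575, which rounds to ~$0.16
```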
Below is a comparison of output speed versus price across leading providers for GPT-OSS-120B. Clarifai stands out in the “most attractive quadrant,” combining high throughput with competitive pricing.
Output Speed vs. Price
This chart compares latency (time to first token) against output speed. Clarifai demonstrates one of the lowest latencies while sustaining top-tier throughput, placing it among the best-in-class providers.
Latency vs. Output Speed
Why GPT-OSS-120B Matters
As one of the leading open-source “GPT-OSS” models, GPT-OSS-120B represents the growing demand for transparent, community-driven alternatives to closed-source LLMs. Running a model of this scale requires infrastructure that can not only deliver high speed and low latency, but also keep costs under control at production scale. That’s exactly where Clarifai’s Compute Orchestration makes a difference.
Why This Benchmark Matters
These results are more than numbers: they show how Clarifai has engineered every layer of the stack to optimize GPU utilization. With Compute Orchestration, multiple models can run on the same GPUs, workloads scale elastically, and enterprises can squeeze more value out of every accelerator. The payoff is fast, reliable, and cost-efficient inference that can support both experimentation and large-scale deployment.
Check out the full benchmarks on Artificial Analysis here.
Here’s a quick demo of how to access the GPT-OSS-120B model in the Playground.
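If you prefer code to the Playground, one way to call the model is through Clarifai’s OpenAI-compatible endpoint. The sketch below is illustrative only; in particular, the model URL is an assumption, so copy the exact values from the model’s page or the Playground snippet:

```python
import os
from openai import OpenAI

# Clarifai's OpenAI-compatible endpoint; authenticate with a Personal Access Token.
client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",
    api_key=os.environ["CLARIFAI_PAT"],
)

response = client.chat.completions.create(
    # Assumed model URL for GPT-OSS-120B; verify it on the model's community page.
    model="https://clarifai.com/openai/chat-completion/models/gpt-oss-120b",
    messages=[{"role": "user", "content": "In one paragraph, what is GPT-OSS-120B?"}],
)
print(response.choices[0].message.content)
```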
Local Runners
Local Runners let you develop and run models on your own hardware (laptops, workstations, edge boxes) while making them callable through Clarifai’s cloud API. Clarifai handles the public URL, routing, and authentication; your model executes locally and your data stays on your machine. It behaves like any other Clarifai-hosted model.
Why teams use Local Runners
- Build where your data and tools live. Keep models close to local files, internal databases, and OS-level utilities.
- No custom networking. Start a runner and get a public URL, with no port-forwarding or reverse proxies.
- Use your own compute. Bring your GPUs and custom setups; the platform still provides the API, workflows, and governance around them.
New: Ollama Toolkit (now in the CLI)
We’ve added an Ollama Toolkit to the Clarifai CLI so you can initialize an Ollama-backed model directory in a single command (and choose any model from the Ollama library). It pairs perfectly with Local Runners: download, run, and expose an Ollama model via a public API with minimal setup.
The CLI supports --toolkit ollama plus flags like --model-name, --port, and --context-length, making it easy to target specific Ollama models.
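Putting those flags together, initialization looks roughly like the sketch below (the subcommand and the flag values are assumptions based on the flags above; check clarifai model init --help for the authoritative form):

```bash
# Scaffold an Ollama-backed model directory targeting a specific Ollama model.
# 11434 is Ollama's default port; adjust the context length to suit your model.
clarifai model init --toolkit ollama \
  --model-name gemma3:270m \
  --port 11434 \
  --context-length 32768
```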
Example workflow: run Gemma 3 270M or GPT-OSS 20B locally and serve it through a public API
- Pick a model in Ollama.
  - Gemma 3 270M (tiny, fast; 32K context): gemma3:270m.
  - GPT-OSS 20B (OpenAI open-weight, optimized for local use): gpt-oss:20b.
- Initialize the project with the Ollama Toolkit. Use the command above, swapping --model-name for your pick (e.g., gpt-oss:20b). This will create a new model directory structure that’s compatible with the Clarifai platform. You can customize or optimize the generated model by editing the 1/model.py file as needed.
- Start your Local Runner. From the model directory, run the runner command (see the sketch after this list). The runner registers with Clarifai and exposes your local model via a public URL; the CLI prints a ready-to-run client snippet.
- Call it like any Clarifai model, for example from Python (see the sketch after this list). Behind the scenes, the API call is routed to your machine; results return to the caller over Clarifai’s secure control plane.
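For the last two steps, the CLI itself prints the authoritative commands and client snippet; the sketch below shows the general shape under stated assumptions, in particular the local-runner subcommand and the placeholder model URL:

```bash
# From inside the generated model directory: register with Clarifai and start serving.
clarifai model local-runner
```

Once the runner is up, calling it looks like calling any hosted model. Instead of the raw Clarifai SDK, this example goes through Clarifai’s OpenAI-compatible endpoint; substitute your own user ID, app, and model name:

```python
import os
from openai import OpenAI

# Clarifai's OpenAI-compatible endpoint; authenticate with a Personal Access Token.
client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",
    api_key=os.environ["CLARIFAI_PAT"],
)

response = client.chat.completions.create(
    # Placeholder URL: points at the model your Local Runner registered.
    model="https://clarifai.com/YOUR_USER_ID/YOUR_APP/models/my-ollama-model",
    messages=[{"role": "user", "content": "Hello from my own hardware!"}],
)
print(response.choices[0].message.content)
```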
Deep dive: We published a step-by-step guide that walks through running Ollama models locally and exposing them with Local Runners. Check it out here.
Try it on the Developer Plan
You can start for free, or use the Developer Plan ($1/month for the first year), which includes up to 5 Local Runners and unlimited runner hours.
Check out the full example and setup guide in the documentation here.
Billing
We’ve made billing more transparent and flexible with this release. Monthly spending limits have been introduced: $100 for Developer and Essential plans, and $500 for the Professional plan. If you need higher limits, you can reach out to our team.
We’ve also added a new credit card pre-authorization process. A temporary charge is applied to verify card validity and available funds: $50 for Developer, $100 for Essential, and $500 for Professional plans. The amount is automatically refunded within seven days, ensuring a seamless verification experience.
Control Center
- The Control Center gets even more flexible and informative with this update. You can now resize charts to half their original size on the configure page, making side-by-side comparisons smoother and layouts more manageable.
- Charts are smarter too: the Stored Inputs Cost chart now correctly shows the average cost for the selected period, while longer date ranges automatically display weekly aggregated data for easier readability. Empty charts display meaningful messages instead of zeros, so you always know when data isn’t available.
- We’ve also added cross-links between compute cost and usage charts, making it simple to navigate between these views and get a complete picture of your AI infrastructure.
Additional Changes
- Python SDK: Fixed the Local Runner CLI command, updated protocol and gRPC versions, integrated secrets, corrected num_threads defaults, added stream_options validation, prevented downloading original checkpoints, improved model upload and deployment, and added user confirmation to prevent Dockerfile overwrite during uploads. Check out all SDK updates here.
- Platform Updates: Added a public resource filter to quickly view Community-shared resources, improved Playground error messaging for streaming limits, and extended login session duration for Google and GitHub SSO users to seven days. Explore all platform changes here.
Ready to start building?
With Local Runners, you can now serve models, MCP servers, or agents directly from your own hardware without uploading model weights or managing infrastructure. It’s the fastest way to test, iterate, and securely run models from your laptop, workstation, or on-prem server. You can read the documentation to get started, or check out the blog to see how to run Ollama models locally and expose them via a public API.