TL;DR
In this post, we explore how leading inference providers perform on the GPT-OSS-120B model using benchmarks from Artificial Analysis. You'll learn what matters most when evaluating inference platforms, including throughput, time to first token, and cost efficiency. We compare Vertex AI, Azure, AWS, Databricks, Clarifai, Together AI, Fireworks, Nebius, CompactifAI, and Hyperbolic on their performance and deployment efficiency.
Introduction
Large language models (LLMs) like GPT-OSS-120B, an open-weight 120-billion-parameter mixture-of-experts model, are designed for advanced reasoning and multi-step generation. Reasoning workloads consume tokens rapidly and place heavy demands on compute, so deploying these models in production requires inference infrastructure that delivers low latency, high throughput, and low cost.
Differences in hardware, software optimizations, and resource-allocation strategies can lead to large variations in latency, efficiency, and cost. These variations directly affect real-world applications such as reasoning agents, document-understanding systems, or copilots, where even small delays can impact overall responsiveness and throughput.
To evaluate these differences objectively, independent benchmarks have become essential. Instead of relying on internal performance claims, open, data-driven evaluations now offer a more transparent way to assess how different platforms perform under real workloads.
In this post, we compare leading GPU-based inference providers using the GPT-OSS-120B model as a reference benchmark. We examine how each platform performs across key inference metrics such as throughput, time to first token, and cost efficiency, and how those trade-offs influence performance and scalability for reasoning-heavy workloads.
Before diving into the results, let's take a quick look at Artificial Analysis and how their benchmarking framework works.
Artificial Analysis Benchmarks
Artificial Analysis (AA) is an independent benchmarking initiative that runs standardized tests across inference providers to measure how models like GPT-OSS-120B perform under real conditions. Their evaluations focus on practical workloads involving long contexts, streaming outputs, and reasoning-heavy prompts rather than short, synthetic samples.
You can explore the full GPT-OSS-120B benchmark results here.
Artificial Analysis evaluates a wide range of performance metrics, but here we focus on the three key factors that matter when choosing an inference platform for GPT-OSS-120B: time to first token, throughput, and cost per million tokens.
- Time to First Token (TTFT): The time between sending a prompt and receiving the model's first token. Lower TTFT means output starts streaming sooner, which is critical for interactive applications and multi-step reasoning where delays can disrupt the flow.
- Throughput (tokens per second): The rate at which tokens are generated once streaming begins. Higher throughput shortens total completion time for long outputs and allows more concurrent requests, directly affecting scalability for large-context or multi-turn workloads.
- Cost per million tokens (blended price): A combined metric that accounts for both input and output token pricing. It provides a clear view of operational costs for extended contexts and streaming workloads, helping teams plan for predictable expenses.
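To make these three metrics concrete, here is a minimal sketch of how they could be computed from a streamed response. The helper names are hypothetical, and the 3:1 input-to-output weighting in the blended price is an illustrative assumption; check Artificial Analysis's methodology for the exact ratio they use.

```python
def stream_metrics(request_start, chunk_times, output_tokens):
    """Derive TTFT and decode throughput from streamed chunk timestamps.

    request_start: time the prompt was sent (seconds)
    chunk_times:   arrival time of each streamed chunk (seconds)
    output_tokens: total tokens generated
    """
    ttft = chunk_times[0] - request_start
    decode_window = chunk_times[-1] - chunk_times[0]
    throughput = (output_tokens - 1) / decode_window if decode_window > 0 else float("inf")
    return ttft, throughput

def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Blend input/output $-per-1M-token prices into one number.

    The 3:1 weighting here is an assumption for illustration only.
    """
    total = input_weight + output_weight
    return (input_price * input_weight + output_price * output_weight) / total

# Synthetic timestamps: first chunk after 0.32 s, 100 tokens decoded over ~0.18 s.
ttft, tps = stream_metrics(0.0, [0.32, 0.41, 0.50], 100)
print(round(ttft, 2), round(tps, 1))   # TTFT (s) and tokens/s
print(blended_price(0.10, 0.40))       # blended $/1M tokens
```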
Benchmark Methodology
- Prompt Size: The benchmarks covered in this blog use a 1,000-token input prompt run by Artificial Analysis, reflecting a typical real-world scenario such as a chatbot query or a reasoning-heavy instruction. Benchmarks for significantly longer prompts are also available and can be explored for reference here.
- Median Measurements: The reported values represent the median (p50) over the last 72 hours, capturing sustained performance trends rather than single-point spikes or dips. For the most up-to-date benchmark results, visit the Artificial Analysis GPT-OSS-120B model providers page here.
- Metrics Focus: This summary highlights time to first token (TTFT), throughput, and blended price to provide a practical view for workload planning. Other metrics, such as end-to-end response time, latency by input token count, and time to first answer token, are also measured by Artificial Analysis but are not included in this overview.
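As a quick illustration of why the median matters, consider a hypothetical set of TTFT samples containing one transient spike; the values below are invented for demonstration only.

```python
from statistics import mean, median

# Hypothetical TTFT samples (seconds) from a monitoring window; one transient spike.
ttft_samples = [0.31, 0.33, 0.35, 0.32, 1.90, 0.30, 0.34]

print(round(mean(ttft_samples), 2))  # the mean is pulled up by the single spike
print(median(ttft_samples))          # p50 reflects sustained performance
```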
With this methodology in mind, we can now compare how different GPU-based platforms perform on GPT-OSS-120B and what those results imply for reasoning-heavy workloads.
Provider Comparison (GPT-OSS-120B)
Clarifai
- Time to First Token: 0.32 s
- Throughput: 544 tokens/s
- Blended Price: $0.16 per 1M tokens
- Notes: Extremely high throughput, low latency, and cost-efficient; a strong choice for reasoning-heavy workloads.
Key Features:
- GPU fractioning and autoscaling options for efficient compute utilization
- Local runners to execute models on your own hardware for testing and development
- On-prem, VPC, and multi-site deployment options
- Control Center for monitoring and managing usage and performance
Google Vertex AI
- Time to First Token: 0.40 s
- Throughput: 392 tokens/s
- Blended Price: $0.26 per 1M tokens
- Notes: Moderate latency and throughput; suitable for general-purpose reasoning workloads.
Key Features:
- Integrated AI tools (AutoML, training, deployment, monitoring)
- Scalable cloud infrastructure for batch and online inference
- Enterprise-grade security and compliance
Microsoft Azure
- Time to First Token: 0.48 s
- Throughput: 348 tokens/s
- Blended Price: $0.26 per 1M tokens
- Notes: Slightly higher latency; balanced performance and cost for standard workloads.
Key Features:
- Comprehensive AI services (ML, cognitive services, custom bots)
- Deep integration with the Microsoft ecosystem
- Global enterprise-grade infrastructure
Hyperbolic
- Time to First Token: 0.52 s
- Throughput: 395 tokens/s
- Blended Price: $0.30 per 1M tokens
- Notes: Higher cost than peers; good throughput for reasoning-heavy tasks.
AWS
- Time to First Token: 0.64 s
- Throughput: 252 tokens/s
- Blended Price: $0.26 per 1M tokens
- Notes: Lower throughput and higher latency; suitable for less time-sensitive workloads.
Key Features:
- Broad AI/ML service portfolio (Bedrock, SageMaker)
- Global cloud infrastructure
- Enterprise-grade security and compliance
Databricks
- Time to First Token: 0.36 s
- Throughput: 195 tokens/s
- Blended Price: $0.26 per 1M tokens
- Notes: Lower throughput with acceptable latency; better for batch or background tasks.
Key Features:
- Unified analytics platform (Spark + ML + notebooks)
- Collaborative workspace for teams
- Scalable compute for large ML/AI workloads
Together AI
- Time to First Token: 0.25 s
- Throughput: 248 tokens/s
- Blended Price: $0.26 per 1M tokens
- Notes: Very low latency with moderate throughput; good for real-time, reasoning-heavy applications.
Key Features:
- Real-time inference and training
- Cloud/VPC-based deployment orchestration
- Flexible and secure platform
Fireworks AI
- Time to First Token: 0.44 s
- Throughput: 482 tokens/s
- Blended Price: $0.26 per 1M tokens
- Notes: High throughput and balanced latency; suitable for interactive applications.
CompactifAI
- Time to First Token: 0.29 s
- Throughput: 186 tokens/s
- Blended Price: $0.10 per 1M tokens
- Notes: Low cost but lower throughput; best for cost-sensitive workloads with smaller concurrency needs.
Key Features:
- Efficient, compressed models for cost savings
- Simplified deployment on AWS
- Optimized for high-throughput batch inference
Nebius Base
- Time to First Token: 0.66 s
- Throughput: 165 tokens/s
- Blended Price: $0.26 per 1M tokens
- Notes: Significantly lower throughput and higher latency; may struggle with reasoning-heavy or interactive workloads.
Key Features:
- Basic AI service endpoints
- Standard cloud infrastructure
- Suitable for steady-demand workloads
Best Providers Based on Price and Throughput
Selecting the right inference provider for GPT-OSS-120B requires weighing time to first token, throughput, and cost against your workload. Platforms like Clarifai offer high throughput, low latency, and competitive pricing, making them well suited for reasoning-heavy or interactive tasks. Other providers, such as CompactifAI, prioritize lower cost but come with reduced throughput, which may suit cost-sensitive or batch-oriented workloads. The optimal choice depends on which trade-offs matter most for your applications.
Best for Price
- CompactifAI: Lowest blended price at $0.10 per 1M tokens; ideal for cost-sensitive workloads.
- Clarifai: $0.16 per 1M tokens combined with the highest measured throughput; strong price-performance.
Best for Throughput
- Clarifai: Highest throughput at 544 tokens/s with low first-chunk latency.
- Fireworks AI: Strong throughput at 482 tokens/s and moderate latency.
- Hyperbolic: Good throughput at 395 tokens/s; higher cost but viable for heavy workloads.
Performance and Flexibility
Along with price and throughput, flexibility is critical for real-world workloads. Teams often need control over scaling behavior, GPU utilization, and deployment environments to manage cost and efficiency.
Clarifai, for example, supports fractional GPU usage, autoscaling, and local runners, features that can improve efficiency and reduce infrastructure overhead.
These capabilities extend beyond GPT-OSS-120B. With the Clarifai Reasoning Engine, custom or open-weight reasoning models can run with consistent performance and reliability. The engine also adapts to workload patterns over time, gradually improving speed on repetitive tasks without sacrificing accuracy.
Benchmark Summary
So far, we have compared providers on throughput, latency, and cost using the Artificial Analysis benchmarks. To see how these trade-offs play out in practice, here is a visual summary of the results across providers. These charts come directly from Artificial Analysis.
The first chart highlights output speed vs. price, while the second compares latency vs. output speed.
Output Speed vs. Price
Latency vs. Output Speed
Below is a detailed comparison table summarizing the key metrics for GPT-OSS-120B inference across providers.
| Provider | Throughput (tokens/s) | Time to First Token (s) | Blended Price ($ / 1M tokens) |
|---|---|---|---|
| Clarifai | 544 | 0.32 | 0.16 |
| Google Vertex AI | 392 | 0.40 | 0.26 |
| Microsoft Azure | 348 | 0.48 | 0.26 |
| Hyperbolic | 395 | 0.52 | 0.30 |
| AWS | 252 | 0.64 | 0.26 |
| Databricks | 195 | 0.36 | 0.26 |
| Together AI | 248 | 0.25 | 0.26 |
| Fireworks AI | 482 | 0.44 | 0.26 |
| CompactifAI | 186 | 0.29 | 0.10 |
| Nebius Base | 165 | 0.66 | 0.26 |
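A rough end-to-end estimate for a full response is TTFT plus output length divided by throughput. The sketch below applies that back-of-the-envelope formula to a subset of the benchmark numbers above for a hypothetical 1,000-token output; it ignores network overhead and queueing, so treat it as an approximation rather than a measured result.

```python
# (throughput tokens/s, TTFT s, blended $/1M tokens) taken from the table above
providers = {
    "Clarifai":    (544, 0.32, 0.16),
    "Fireworks":   (482, 0.44, 0.26),
    "Vertex AI":   (392, 0.40, 0.26),
    "CompactifAI": (186, 0.29, 0.10),
    "Nebius Base": (165, 0.66, 0.26),
}

OUTPUT_TOKENS = 1_000  # hypothetical response length

# Sort by estimated total response time: TTFT + decode time.
for name, (tps, ttft, price) in sorted(
    providers.items(), key=lambda kv: kv[1][1] + OUTPUT_TOKENS / kv[1][0]
):
    total = ttft + OUTPUT_TOKENS / tps  # network and queueing ignored
    print(f"{name:12s} ~{total:.2f} s  (${price}/1M tokens)")
```

Under these assumptions, a high-throughput provider finishes a long response several seconds sooner even when its TTFT is not the lowest, which is why throughput dominates for reasoning-heavy outputs.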
Conclusion
Choosing an inference provider for GPT-OSS-120B comes down to balancing throughput, latency, and cost. Each provider handles these trade-offs differently, and the best choice depends on the specific workload and performance requirements.
Providers with high throughput excel at reasoning-heavy or interactive tasks, while those with lower median throughput may be better suited to batch or background processing where speed is less critical. Latency also plays a key role: a low time to first token improves responsiveness for real-time applications, while slightly higher latency may be acceptable for less time-sensitive tasks.
Cost considerations remain important. Some providers deliver strong performance at a low blended price, while others trade efficiency for price. Benchmarks covering throughput, time to first token, and blended price provide a clear basis for understanding these trade-offs.
Ultimately, the right provider depends on the engineering problem, the workload characteristics, and which trade-offs matter most for the application.
Learn more about Clarifai's reasoning engine
The Fastest AI Inference and Reasoning on GPUs.
Verified by Artificial Analysis