
Evaluating SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B


Introduction

The ecosystem of LLM inference frameworks has been evolving quickly. As models become larger and more capable, the frameworks that power them must keep pace, optimizing for everything from latency to throughput to memory efficiency. For developers, researchers, and enterprises alike, the choice of framework can dramatically affect both performance and cost.

In this blog, we bring these considerations together by comparing SGLang, vLLM, and TensorRT-LLM. We evaluate how each performs when serving GPT-OSS-120B on 2x NVIDIA H100 GPUs. The results highlight the unique strengths of each framework and offer practical guidance on which to choose based on your workload and hardware.

Overview of the Frameworks

SGLang: SGLang was designed around the idea of structured generation. It brings unique abstractions like RadixAttention and specialized state management that allow it to deliver low latency for interactive applications. This makes SGLang especially appealing when the workload requires precise control over outputs, such as when generating structured data formats or working with agentic workflows.
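As a concrete illustration, here is a minimal sketch of how SGLang's frontend DSL can express structured generation against a running server. The server command, port, and model id (openai/gpt-oss-120b) are assumptions and may differ across SGLang versions; treat this as a sketch rather than a definitive setup.

```python
# Minimal sketch of SGLang's frontend DSL for structured generation.
# Assumes a SGLang server is already serving GPT-OSS-120B, e.g.:
#   python -m sglang.launch_server --model-path openai/gpt-oss-120b --tp 2
import sglang as sgl

@sgl.function
def extract_profile(s, bio):
    # Interleave fixed prompt text with named generation steps.
    s += "Extract fields from this bio: " + bio + "\n"
    s += "Name: " + sgl.gen("name", max_tokens=16, stop="\n") + "\n"
    s += "Occupation: " + sgl.gen("occupation", max_tokens=16, stop="\n")

# Point the DSL at the running server (default port 30000) and execute.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = extract_profile.run(bio="Ada Lovelace was a 19th-century mathematician.")
print(state["name"], "|", state["occupation"])
```

Each `sgl.gen` call produces a named field, which is what makes this style convenient for structured outputs and agentic workflows.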

vLLM: vLLM has established itself as one of the leading open-source inference frameworks for serving large language models at scale. Its key advantage lies in throughput, powered by continuous batching and efficient memory management via PagedAttention. It also provides broad support for quantization methods such as INT8, INT4, GPTQ, AWQ, and FP8, making it a versatile choice for those who need to maximize tokens per second across many concurrent requests.
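For reference, here is a minimal sketch of offline batched inference with vLLM's Python API. The model id (openai/gpt-oss-120b) and the two-GPU tensor-parallel setting mirror the benchmark setup described below, but both are assumptions you should adjust for your own hardware.

```python
# Minimal sketch: batched generation with vLLM on 2x H100.
from vllm import LLM, SamplingParams

# tensor_parallel_size=2 splits the model across two GPUs.
llm = LLM(model="openai/gpt-oss-120b", tensor_parallel_size=2)
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Summarize the benefits of continuous batching in one sentence.",
    "Explain PagedAttention to a new engineer.",
]
# vLLM batches these requests internally; PagedAttention manages the KV cache.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```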

TensorRT-LLM: TensorRT-LLM is NVIDIA's TensorRT-based inference runtime, purpose-built to extract maximum performance from NVIDIA GPUs. It is deeply optimized for the Hopper and Blackwell architectures, which means it takes full advantage of hardware features in the H100 and B200. The result is higher efficiency, faster response times, and better scaling as workloads grow. While it requires a bit more setup and tuning than the other frameworks, TensorRT-LLM represents NVIDIA's vision for production-grade inference performance.
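The sketch below assumes the high-level Python LLM API available in recent TensorRT-LLM releases; older versions instead require building an engine (for example with trtllm-build) before serving, so this is illustrative only and the exact interface may vary by version.

```python
# Hedged sketch using TensorRT-LLM's high-level LLM API (recent releases).
# Older releases require a separate engine-build step; adjust for your version.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-120b", tensor_parallel_size=2)
params = SamplingParams(max_tokens=128)

outputs = llm.generate(["What makes Hopper GPUs fast for inference?"], params)
for out in outputs:
    print(out.outputs[0].text)
```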

Framework       Design Focus                             Key Strengths
SGLang          Structured generation, RadixAttention    Low latency, efficient token generation
vLLM            Continuous batching, PagedAttention      High throughput, supports quantization
TensorRT-LLM    TensorRT optimizations                   GPU-level efficiency, lowest latency on H100/B200

Benchmark Setup and Results


To evaluate the three frameworks fairly, we ran GPT-OSS-120B on 2x NVIDIA H100 GPUs under a variety of conditions. GPT-OSS-120B is a large mixture-of-experts model that pushes the boundaries of open-weight performance. Its size and complexity make it a demanding benchmark, which is exactly why it is ideal for testing inference frameworks and hardware.

We measured three main categories of performance (a minimal measurement sketch follows the list):

  • Latency – How fast the model generates the first token (TTFT) and how quickly it produces subsequent tokens.
  • Throughput – How many tokens per second can be generated under varying levels of concurrency.
  • Concurrency scaling – How well each framework holds up as the number of simultaneous requests increases.
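To make these metrics concrete, here is a minimal sketch of how TTFT, per-token latency, and throughput can be measured for a single streamed request against an OpenAI-compatible endpoint, which all three frameworks can expose. The base URL, model id, and chunk-counting shortcut are illustrative assumptions, not the exact harness used for the results below.

```python
# Sketch: measure TTFT, per-token latency, and throughput for one streamed request.
import time
from openai import OpenAI

# Placeholder endpoint; point this at your vLLM, SGLang, or TensorRT-LLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_time = None
n_tokens = 0

stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()  # time to first token
        n_tokens += 1  # streamed chunks used as a rough proxy for tokens

end = time.perf_counter()
print(f"TTFT: {first_token_time - start:.3f}s")
print(f"Per-token latency: {(end - first_token_time) / max(n_tokens - 1, 1):.4f}s")
print(f"Throughput: {n_tokens / (end - start):.1f} tokens/s")
```

Concurrency scaling is measured by issuing many such requests in parallel and aggregating the per-request numbers.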

Latency Results

Let's start with latency. When you care about responsiveness, two things matter most: the time to first token and the per-token latency once decoding begins.

Here is how the three frameworks stacked up:

Time to First Token (seconds)

Concurrency    vLLM     SGLang    TensorRT-LLM
1              0.053    0.125     0.177
10             1.91     1.155     2.496
50             7.546    3.08      4.14
100            1.87     8.991     5.467

Per-Token Latency (seconds)

Concurrency    vLLM     SGLang    TensorRT-LLM
1              0.005    0.004     0.004
10             0.011    0.01      0.009
50             0.021    0.015     0.018
100            0.019    0.021     0.049

What this shows:

  • vLLM was the fastest to generate the first token at most concurrency levels, with excellent scaling characteristics, though it spiked at 50 concurrent requests.
  • SGLang had the most stable per-token latency, consistently around 4–21 ms across different loads.
  • TensorRT-LLM showed the slowest time to first token but maintained competitive per-token performance at lower concurrency levels.

Throughput Results

When it comes to serving large numbers of requests, throughput is the number to watch. Here is how the three frameworks performed as concurrency increased:

Overall Throughput (tokens/second)

Concurrency    vLLM       SGLang     TensorRT-LLM
1              187.15     230.96     242.79
10             863.15     988.18     867.21
50             2211.85    3108.75    2162.95
100            4741.62    3221.84    1942.64

One of the most striking findings was that vLLM achieved the highest throughput at 100 concurrent requests, reaching 4,741 tokens per second. SGLang showed strong performance at moderate to high concurrency (50 requests), while TensorRT-LLM delivered the best single-request throughput but scaled less well at high concurrency.

Framework Analysis and Recommendations

SGLang

  • Strengths: Stable per-token latency, strong throughput at moderate concurrency, good overall balance.

  • Weaknesses: Slower time to first token for single requests, throughput drops at 100 concurrent requests.

  • Best For: Moderate- to high-throughput applications, and scenarios requiring consistent token generation timing.

vLLM

  • Strengths: Fastest time to first token at most concurrency levels, highest throughput at high concurrency, excellent scaling.

  • Weaknesses: Slightly higher per-token latency under heavy load.

  • Best For: Interactive applications, high-concurrency deployments, and scenarios that prioritize fast initial responses and maximum throughput scaling.

TensorRT-LLM

  • Strengths: Best single-request throughput, competitive per-token latency at low concurrency, hardware-optimized performance.

  • Weaknesses: Slowest time to first token, poor scaling at high concurrency, significantly degraded per-token latency at 100 requests.

  • Best For: Single-user or low-concurrency applications, and scenarios where hardware optimization matters more than scaling.

Conclusion

No single framework outperforms the others across all categories. Instead, each has been optimized for different goals, and the right choice depends on your workload and infrastructure.

  • Use vLLM for interactive applications and high-concurrency deployments that require fast responses and maximum throughput scaling.
  • Choose SGLang when moderate throughput and consistent performance are needed.
  • Deploy TensorRT-LLM for single-user applications, or when maximizing hardware efficiency at low concurrency is the priority.

The key takeaway is that choosing the right framework depends on workload type and hardware availability rather than on finding a universal winner. Running GPT-OSS-120B on NVIDIA H100 GPUs with these optimized inference frameworks unlocks powerful options for building and deploying AI applications at scale.

It is worth noting that these performance characteristics can shift dramatically depending on your GPU hardware. We also extended the benchmarks to B200 GPUs, where TensorRT-LLM consistently outperformed both SGLang and vLLM across all metrics, thanks to its deeper optimization for NVIDIA's latest hardware architecture.

This highlights that framework selection is not just about software capabilities; it is equally about matching the right framework to your specific hardware to unlock its full performance potential.

 

You can explore the full set of benchmark results here.

Bonus: Serve a Model with Your Preferred Framework

Getting started with these frameworks is straightforward. With Clarifai's Compute Orchestration, you can serve GPT-OSS-120B, any other open-weight model, or your own custom models with your preferred inference engine, whether that is SGLang, vLLM, or TensorRT-LLM.

From setting up the runtime to deploying a production-ready API, you can move quickly from model to application. The best part is that you are not locked into a single framework: you can experiment with different runtimes and choose the one that best aligns with your performance and cost requirements.

This flexibility makes it easy to integrate cutting-edge frameworks into your workflows and ensures you are always getting the best possible performance out of your hardware. Check out the documentation to learn how to upload your own models.

