
Clarifai Reasoning Engine Achieves 414 Tokens Per Second on Kimi K2.5

TL;DR

Using custom CUDA kernels and speculative decoding tuned for reasoning workloads, we achieved 414 tokens per second of throughput on Kimi K2.5 running on Nvidia B200 GPUs, making us one of the first providers to reach 400+ tokens per second on a trillion-parameter reasoning model.


Ahead of Nvidia GTC, we're excited to share that the Clarifai Reasoning Engine achieves 414 tokens per second (TPS) of throughput on Kimi K2.5, placing us among the top inference providers for frontier reasoning models as measured by Artificial Analysis. Running on Nvidia B200 GPU infrastructure, our platform delivers production-grade performance for agentic workflows and complex reasoning tasks.


Figure 1: Clarifai achieves 414 tokens per second on Kimi K2.5, ranking among the fastest inference providers on Artificial Analysis benchmarks.

Why Kimi K2.5 performance matters

Kimi K2.5 is a 1-trillion-parameter reasoning model with a 384-expert Mixture-of-Experts architecture that activates 32 billion parameters per request. Built by Moonshot AI with native multimodal training on 15 trillion mixed visual and text tokens, the model delivers strong performance across key benchmarks: 50.2% on HLE with tools, 76.8% on SWE-Bench Verified, and 78.4% on BrowseComp.

As a reasoning model, Kimi K2.5 generates extended thinking sequences before its final answer. Clarifai achieves a time to first answer token of 6 seconds, which includes the model's internal thinking time before it produces a response. Throughput directly affects end-to-end response time for agentic systems, code generation, and multimodal reasoning tasks. At 414 TPS, we deliver the speed required for production deployments.
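To make these numbers concrete, here is a back-of-envelope latency estimate using the two figures above: 6 seconds to the first answer token (which absorbs the model's internal thinking) plus decoding at 414 TPS. The answer length below is a hypothetical example, not a benchmark value.

```python
# Rough end-to-end latency at the reported numbers. The 2000-token answer
# length is an illustrative assumption.
TTFT_S = 6.0   # time to first answer token, per Artificial Analysis
TPS = 414.0    # decode throughput, tokens per second

def response_time(answer_tokens):
    """Seconds from request to final token of the answer."""
    return TTFT_S + answer_tokens / TPS

print(round(response_time(2000), 1))  # → 10.8
```

At this throughput the decode phase of even a long answer adds only a few seconds on top of the thinking time, which is why TPS dominates perceived latency for agentic loops that generate many tokens per step.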


Figure 2: Time to first answer token (TTFT) across inference providers, measured by Artificial Analysis with 10,000 input tokens.

How we optimize for throughput

The Clarifai Reasoning Engine uses three core optimizations for large reasoning models:

Custom CUDA kernels reduce memory stalls and improve cache locality. By optimizing low-level GPU operations, we keep streaming multiprocessors busy during inference rather than waiting on data movement.

Speculative decoding predicts likely token paths and prunes misses quickly. This reduces wasted computation during the model's thinking sequence, a pattern common in reasoning workloads.
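The idea behind speculative decoding can be sketched in a few lines. This toy is illustrative only and is not Clarifai's implementation: a cheap "draft" model proposes a block of tokens, the full "target" model verifies the block, and the longest accepted prefix is kept, so one verify step can emit several tokens. Both toy models below are hypothetical stand-ins.

```python
# Toy speculative decoding loop (illustrative assumption, not the production
# engine). draft_propose is a cheap guesser; target_next is the ground truth.

def draft_propose(prefix, k):
    # Hypothetical cheap model: guess the next k tokens from a bigram table.
    table = {"the": "quick", "quick": "brown", "brown": "fox", "fox": "jumps"}
    out, last = [], prefix[-1]
    for _ in range(k):
        last = table.get(last, "<eos>")
        out.append(last)
    return out

def target_next(prefix):
    # Hypothetical full model: the token it would actually generate next.
    truth = ["the", "quick", "brown", "fox", "leaps"]
    return truth[len(prefix)] if len(prefix) < len(truth) else "<eos>"

def speculative_decode(prompt, max_len=5, k=3):
    seq = list(prompt)
    while len(seq) < max_len:
        for tok in draft_propose(seq, k):
            expected = target_next(seq)
            if tok == expected and expected != "<eos>":
                seq.append(tok)           # draft token accepted "for free"
            else:
                if expected != "<eos>":
                    seq.append(expected)  # miss: fall back to the target's token
                break                     # prune the rest of the drafted block
            if len(seq) >= max_len:
                break
    return seq

print(speculative_decode(["the"]))  # → ['the', 'quick', 'brown', 'fox', 'leaps']
```

When the draft model agrees with the target often, as it does on the predictable spans of a long thinking sequence, most tokens land without a full forward pass each, which is where the throughput win comes from.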

Adaptive optimization continuously learns from workload behavior. The system dynamically adjusts batching, memory reuse, and execution paths based on actual request patterns. These improvements compound over time, especially for the repetitive tasks common in agentic workflows.
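A minimal sketch of the batching side of this idea, under the assumption of a simple latency-feedback rule (not Clarifai's actual scheduler): grow the batch while observed latency stays comfortably under a target, and back off when it overshoots, so the effective batch size tracks the live request pattern.

```python
# Toy adaptive batch-size controller (illustrative assumption). The target
# latency and bounds are hypothetical tuning values.

def adapt_batch_size(batch_size, observed_latency_ms, target_ms=50.0,
                     lo=1, hi=64):
    if observed_latency_ms < target_ms * 0.8:
        return min(hi, batch_size * 2)   # headroom: batch more requests
    if observed_latency_ms > target_ms:
        return max(lo, batch_size // 2)  # overshoot: back off
    return batch_size                    # in band: hold steady

# Simulated feedback loop: latency drifts up as the batch grows.
size = 4
for latency in [20, 30, 55, 45, 80]:
    size = adapt_batch_size(size, latency)
print(size)
```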

Running on Nvidia B200 infrastructure gives us the hardware foundation to push performance boundaries, while our inference optimization stack delivers the software-level gains.

Building with Kimi K2.5

Kimi K2.5 is now available on the Clarifai Platform. Try it out in the Playground or via the API to get started.
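As a starting point, here is a sketch of what a chat request to the model might look like, assuming an OpenAI-compatible chat-completions interface. The base URL and model identifier below are hypothetical placeholders; check the Clarifai documentation for the exact endpoint, model id, and authentication details before use.

```python
# Sketch of building an OpenAI-style chat request for Kimi K2.5.
# BASE_URL and MODEL are assumptions for illustration only.
import json

BASE_URL = "https://api.clarifai.com/v2/ext/openai/v1"  # assumed endpoint
MODEL = "kimi-k2.5"                                     # hypothetical model id

def build_chat_request(prompt, max_tokens=1024):
    """Assemble a chat-completions payload for a single user message."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("Summarize speculative decoding in one sentence.")
print(json.dumps(payload, indent=2))
```

With an OpenAI-compatible interface, existing client libraries and agent frameworks can point at the endpoint by swapping the base URL and model name rather than rewriting integration code.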

If you need dedicated compute to deploy Kimi K2.5 and other leading open models at scale for production workloads, get in touch with our team.

