This blog post focuses on new features and improvements. For a comprehensive list, including bug fixes, please see the release notes.
Clarifai Reasoning Engine: Optimized for Agentic AI Inference
We’re introducing the Clarifai Reasoning Engine, a full-stack performance framework built to deliver record-setting inference speed and efficiency for reasoning and agentic AI workloads.
Unlike traditional inference systems that plateau after deployment, the Clarifai Reasoning Engine continuously learns from workload behavior, dynamically optimizing kernels, batching, and memory usage. This adaptive approach means the system gets faster and more efficient over time, especially for repetitive or structured agentic tasks, without any trade-off in accuracy.
In recent benchmarks by Artificial Analysis on GPT-OSS-120B, the Clarifai Reasoning Engine set new industry records for GPU inference performance:
- 544 tokens/sec throughput – the fastest GPU-based inference measured
- 0.36s time-to-first-token – near-instant responsiveness
- $0.16 per million tokens – the lowest blended price
These results not only outperformed every other GPU-based inference provider but also rivaled specialized ASIC accelerators, proving that modern GPUs, when paired with optimized kernels, can achieve comparable or even superior reasoning performance.
The Reasoning Engine’s design is model-agnostic. While GPT-OSS-120B served as the benchmark reference, the same optimizations have been extended to other large reasoning models such as Qwen3-30B-A3B-Thinking-2507, where we observed a 60% improvement in throughput compared to the base implementation. Developers can also bring their own reasoning models and see similar performance gains using Clarifai’s compute orchestration and kernel optimization stack.
At its core, the Clarifai Reasoning Engine represents a new standard for running reasoning and agentic AI workloads: faster, cheaper, adaptive, and open to any model.
Try the GPT-OSS-120B model directly on Clarifai and experience the performance of the Clarifai Reasoning Engine. You can also bring your own models or talk to our AI experts to apply these adaptive optimizations and see how they improve throughput and latency in real workloads.
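As a quick illustration, here is one way to call a hosted model with the Clarifai Python SDK. The model URL below is a placeholder based on the community model naming pattern, so copy the exact URL from the GPT-OSS-120B model page on clarifai.com, and make sure your CLARIFAI_PAT is set in the environment.

```python
# Illustrative sketch: calling GPT-OSS-120B on Clarifai via the Python SDK.
# The model URL is a placeholder; use the exact URL shown on the model's page.
from clarifai.client.model import Model

model = Model(url="https://clarifai.com/openai/chat-completion/models/gpt-oss-120b")

response = model.predict_by_bytes(
    b"Summarize what an agentic AI workload is in two sentences.",
    input_type="text",
)
print(response.outputs[0].data.text.raw)
```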
Toolkits
Added support for initializing models with the vLLM, LM Studio, and Hugging Face toolkits for Local Runners.
Hugging Face Toolkit
We’ve added a Hugging Face Toolkit to the Clarifai CLI, making it easy to initialize, customize, and serve Hugging Face models through Local Runners.
You can now download and run supported Hugging Face models directly on your own hardware (laptops, workstations, or edge boxes) while exposing them securely via Clarifai’s public API. Your model runs locally, your data stays private, and the Clarifai platform handles routing, authentication, and governance.
Why use the Hugging Face Toolkit:
- Use local compute – Run open-weight models on your own GPUs or CPUs while keeping them accessible through the Clarifai API.
- Preserve privacy – All inference happens on your machine; only metadata flows through Clarifai’s secure control plane.
- Skip manual setup – Initialize a model directory with one CLI command; dependencies and configs are automatically scaffolded.
Step-by-step: Running a Hugging Face model locally
1. Install the Clarifai CLI
Make sure you have Python 3.11+ and the latest Clarifai CLI:
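A minimal sketch of the install step, assuming installation via pip:

```bash
# Install or upgrade the Clarifai CLI and Python SDK (requires Python 3.11+)
pip install --upgrade clarifai
```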
2. Authenticate with Clarifai
Log in and create a configuration context for your Local Runner:
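For example, using the CLI’s login flow:

```bash
# Log in and create a configuration context for the Local Runner
clarifai login
```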
You’ll be prompted for your User ID, App ID, and Personal Access Token (PAT), which you can also set as an environment variable:
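For example, using the CLARIFAI_PAT variable that the CLI and SDK read:

```bash
# Optionally provide the PAT via an environment variable instead of the prompt
export CLARIFAI_PAT="your-personal-access-token"
```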
3. Get your Hugging Face access token
If you’re using models from private repos, create a token at huggingface.co/settings/tokens and export it:
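For example, assuming the conventional HF_TOKEN variable that the Hugging Face libraries read by default:

```bash
# Only needed for gated or private Hugging Face repositories
export HF_TOKEN="your-hugging-face-token"
```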
4. Initialize a model with the Hugging Face Toolkit
Use the new CLI flag --toolkit huggingface to scaffold a model directory.
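A minimal example, assuming you run it from the directory where the model folder should be created:

```bash
# Scaffold a model directory wired up for the Hugging Face Toolkit
clarifai model init --toolkit huggingface
```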
This command generates a ready-to-run folder with model.py, config.yaml, and requirements.txt, pre-wired for Local Runners. You can modify model.py to fine-tune behavior or change checkpoints in config.yaml.
5. Install dependencies
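For example, installing from the scaffolded requirements file:

```bash
# Install the dependencies listed in the generated requirements.txt
pip install -r requirements.txt
```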
6. Start your Local Runner
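A minimal sketch, assuming the CLI’s local-runner subcommand and that you run it from inside the model directory:

```bash
# Start serving the scaffolded model through a Local Runner
clarifai model local-runner
```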
Your runner registers with Clarifai, and the CLI prints a ready-to-use public API endpoint.
7. Test your model
You can call it like any Clarifai-hosted model via the SDK:
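For example, with the Clarifai Python SDK; the model URL is a placeholder, so use the endpoint the CLI printed when the runner started:

```python
# Call the locally running model exactly like a Clarifai-hosted one.
# Replace the placeholder URL with the endpoint printed by the CLI.
from clarifai.client.model import Model

model = Model(url="https://clarifai.com/YOUR_USER_ID/YOUR_APP_ID/models/YOUR_MODEL_NAME")

response = model.predict_by_bytes(
    b"Write a haiku about local inference.",
    input_type="text",
)
print(response.outputs[0].data.text.raw)
```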
Behind the scenes, requests are routed to your local machine; the model runs entirely on your hardware. See the Hugging Face Toolkit documentation for the full setup guide, configuration options, and troubleshooting tips.
vLLM Toolkit
Run Hugging Face models on the high-performance vLLM inference engine
vLLM is an open-source runtime optimized for serving large language models with exceptional throughput and memory efficiency. Unlike typical runtimes, vLLM uses continuous batching and advanced GPU scheduling to deliver faster, cheaper inference, making it ideal for local deployments and experimentation.
With Clarifai’s vLLM Toolkit, you can initialize and run any Hugging Face-compatible model on your own machine, powered by vLLM’s optimized backend. Your model runs locally but behaves like any hosted Clarifai model through a secure public API endpoint.
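The workflow mirrors the Hugging Face Toolkit. For instance, assuming vllm is the toolkit identifier accepted by the CLI:

```bash
# Scaffold a vLLM-backed model directory, then serve it through a Local Runner
clarifai model init --toolkit vllm
clarifai model local-runner
```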
Check out the vLLM Toolkit documentation to learn how to initialize and serve vLLM models with Local Runners.
LM Studio Toolkit
Run open-weight models from LM Studio and expose them via Clarifai APIs
LM Studio is a popular desktop application for running and chatting with open-source LLMs locally, no internet connection required. With Clarifai’s LM Studio Toolkit, you can connect these locally running models to the Clarifai platform, making them callable via a public API while keeping data and execution fully on-device.
Developers can use this integration to extend LM Studio models into production-ready APIs with minimal setup.
Read the LM Studio Toolkit guide to see supported setups and run LM Studio models using Local Runners.
New Models on the Platform
We’ve added several powerful new models optimized for reasoning, long-context tasks, and multi-modal capabilities:
- Qwen3-Next-80B-A3B-Thinking – An 80B-parameter, sparsely activated reasoning model that delivers near-flagship performance on complex tasks with high efficiency in training and ultra-long-context inference (up to 256K tokens).
- Qwen3-30B-A3B-Instruct-2507 – Enhanced comprehension, coding, multilingual knowledge, and user alignment, with 256K-token long-context handling.
- Qwen3-30B-A3B-Thinking-2507 – Further improved reasoning, general capabilities, alignment, and long-context understanding.
New Cloud Instances: B200s and GH200s
We’ve added new cloud instances to give developers more options for GPU-based workloads:
- B200 Instances – Competitively priced, running out of Seattle.
- GH200 Instances – Powered by Vultr for high-performance tasks.
Learn more about Enterprise-Grade GPU Hosting for AI models and request access, or connect with our AI experts to discuss your workload needs.
Additional Changes
Ready to Start Building?
With the Clarifai Reasoning Engine, you can run reasoning and agentic AI workloads faster, more efficiently, and at lower cost, all while maintaining full control over your models. The Reasoning Engine continuously optimizes for throughput and latency, whether you’re using GPT-OSS-120B, Qwen models, or your own custom models.
Bring your own models and see how adaptive optimizations improve performance in real workloads. Talk to our AI experts to learn how the Clarifai Reasoning Engine can optimize the performance of your custom models.