Z.AI Introduces GLM-5.1: An Open-Weight 754B Agentic Mannequin That Achieves SOTA on SWE-Bench Professional and Sustains 8-Hour Autonomous Execution

By admin2010

April 8, 2026

2

Z.AI, the AI platform developed by the crew behind the GLM mannequin household, has launched GLM-5.1 — its next-generation flagship mannequin developed particularly for agentic engineering. In contrast to fashions optimized for clear, single-turn benchmarks, GLM-5.1 is constructed for agentic duties, with considerably stronger coding capabilities than its predecessor, and achieves state-of-the-art efficiency on SWE-Bench Professional whereas main GLM-5 by a large margin on NL2Repo (repo technology) and Terminal-Bench 2.0 (real-world terminal duties).

Structure: DSA, MoE, and Asynchronous RL

Earlier than diving into what GLM-5.1 can do, it’s value understanding what it’s constructed on — as a result of the structure is meaningfully completely different from a normal dense transformer.

GLM-5 adopts DSA to considerably cut back coaching and inference prices whereas sustaining long-context constancy. The mannequin makes use of a glm_moe_dsa structure (Combination of Specialists (MoE) mannequin mixed with DSA). For AI devs evaluating whether or not to self-host, this issues: MoE fashions activate solely a subset of their parameters per ahead go, which might make inference considerably extra environment friendly than a comparably-sized dense mannequin, although they require particular serving infrastructure.

On the coaching facet, GLM-5 implements a brand new asynchronous reinforcement studying infrastructure that drastically improves post-training effectivity by decoupling technology from coaching. Novel asynchronous agent RL algorithms additional enhance RL high quality, enabling the mannequin to study from advanced, long-horizon interactions extra successfully. That is what permits the mannequin to deal with agentic duties with the sort of sustained judgment that single-turn RL coaching struggles to provide.

The Plateau Drawback GLM-5.1 is Fixing

To know what makes GLM-5.1 completely different at inference time, it helps to grasp a selected failure mode in LLMs used as brokers. Earlier fashions — together with GLM-5 — are inclined to exhaust their repertoire early: they apply acquainted methods for fast preliminary good points, then plateau. Giving them extra time doesn’t assist.

This can be a structural limitation for any developer making an attempt to make use of an LLM as a coding agent. The mannequin applies the identical playbook it is aware of, hits a wall, and stops making progress no matter how lengthy it runs. GLM-5.1, against this, is constructed to remain efficient on agentic duties over for much longer horizons. The mannequin handles ambiguous issues with higher judgment and stays productive over longer periods. It breaks advanced issues down, runs experiments, reads outcomes, and identifies blockers with actual precision. By revisiting its reasoning and revising its technique by means of repeated iteration, GLM-5.1 sustains optimization over lots of of rounds and hundreds of software calls.

The sustained efficiency requires greater than a bigger context window. This functionality requires the mannequin to take care of purpose alignment over prolonged execution, decreasing technique drift, error accumulation, and ineffective trial and error, enabling really autonomous execution for advanced engineering duties.

Benchmarks: The place GLM-5.1 Stands

On SWE-Bench Professional, GLM-5.1 achieves a rating of 58.4, outperforming GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Professional, setting a brand new state-of-the-art end result.

The broader benchmark profile reveals a well-rounded mannequin. GLM-5.1 scores 95.3 on AIME 2026, 94.0 on HMMT Nov. 2025, 82.6 on HMMT Feb. 2026, and 86.2 on GPQA-Diamond — a graduate-level science reasoning benchmark. On agentic and tool-use benchmarks, GLM-5.1 scores 68.7 on CyberGym (a considerable bounce from GLM-5’s 48.3), 68.0 on BrowseComp, 70.6 on τ³-Bench, and 71.8 on MCP-Atlas (Public Set) — the final one significantly related given MCP’s rising function in manufacturing agent methods. On Terminal-Bench 2.0, the mannequin scores 63.5, rising to 66.5 when evaluated with Claude Code because the scaffolding.

Throughout 12 consultant benchmarks overlaying reasoning, coding, brokers, software use, and shopping, GLM-5.1 demonstrates a broad and well-balanced functionality profile. This reveals that GLM-5.1 just isn’t a single-metric enchancment — it advances concurrently throughout common intelligence, real-world coding, and complicated activity execution.

By way of total positioning, GLM-5.1’s common functionality and coding efficiency are total aligned with Claude Opus 4.6.

8-Hour Sustained Execution: What That Really Means

Crucial distinction in GLM-5.1 is its capability for long-horizon activity execution. GLM-5.1 can work autonomously on a single activity for as much as 8 hours, finishing the total course of from planning and execution to testing, fixing, and supply.

For builders constructing autonomous brokers, this modifications the scope of what’s attainable. Slightly than orchestrating a mannequin over dozens of short-lived software calls, you possibly can hand GLM-5.1 a posh goal and let it run an entire ‘experiment–analyze–optimize’ loop autonomously.

The concrete engineering demonstrations make this tangible: GLM-5.1 can construct an entire Linux desktop surroundings from scratch in 8 hours; carry out 178 rounds of autonomous iteration on a vector database activity and enhance efficiency to 1.5× the preliminary model; and optimize a CUDA kernel, growing speedup from 2.6× to 35.7× by means of sustained tuning.

That CUDA kernel result’s notable for ML engineers: enhancing a kernel from 2.6× to 35.7× speedup by means of autonomous iterative optimization is a stage of depth that may take a talented human engineer vital time to copy manually.

Mannequin Specs and Deployment

GLM-5.1 is a 754-billion-parameter MoE mannequin launched beneath the MIT license on HuggingFace. It operates with a 200K context window and helps as much as 128K most output tokens — each necessary for long-horizon duties that want to carry massive codebases or prolonged reasoning chains in reminiscence.

GLM-5.1 helps considering mode (providing a number of considering modes for various situations), streaming output, perform calling, context caching, structured output, and MCP for integrating exterior instruments and information sources.

For native deployment, the next open-source frameworks assist GLM-5.1: SGLang (v0.5.10+), vLLM (v0.19.0+), xLLM (v0.8.0+), Transformers (v0.5.3+), and KTransformers (v0.5.3+).

For API entry, the mannequin is accessible by means of Z.AI’s API platform. Getting began requires putting in zai-sdk by way of pip and initializing a ZaiClient together with your API key. .

Key Takeaways

GLM-5.1 units a brand new state-of-the-art on SWE-Bench Professional with a rating of 58.4, outperforming GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Professional — making it one of many the strongest publicly benchmarked mannequin for real-world software program engineering duties on the time of launch.
The mannequin is constructed for long-horizon autonomous execution, able to engaged on a single advanced activity for as much as 8 hours — operating experiments, revising methods, and iterating throughout lots of of rounds and hundreds of software calls with out human intervention.
GLM-5.1 makes use of a MoE + DSA structure educated with asynchronous reinforcement studying, which reduces coaching and inference prices in comparison with dense transformers whereas sustaining long-context constancy — a significant consideration for groups evaluating self-hosting.
It’s open-weight beneath the MIT license (754B parameters, 200K context window, 128K max output tokens) and helps native deployment by way of SGLang, vLLM, xLLM, Transformers, and KTransformers, in addition to API entry by means of the Z.AI platform with OpenAI SDK compatibility.
GLM-5.1 goes past coding — it additionally reveals sturdy enhancements in front-end prototyping, artifacts technology, and workplace productiveness duties (Phrase, Excel, PowerPoint, PDF), positioning it as a general-purpose basis for each agentic methods and high-quality content material workflows.

Take a look at the Weights, API and Technical particulars. Additionally, be happy to observe us on Twitter and don’t neglect to hitch our 120k+ ML SubReddit and Subscribe to our Publication. Wait! are you on telegram? now you possibly can be a part of us on telegram as properly.

Must associate with us for selling your GitHub Repo OR Hugging Face Web page OR Product Launch OR Webinar and so forth.? Join with us

Z.AI Introduces GLM-5.1: An Open-Weight 754B Agentic Mannequin That Achieves SOTA on SWE-Bench Professional and Sustains 8-Hour Autonomous Execution

Structure: DSA, MoE, and Asynchronous RL

The Plateau Drawback GLM-5.1 is Fixing

Benchmarks: The place GLM-5.1 Stands

8-Hour Sustained Execution: What That Really Means

Mannequin Specs and Deployment

Key Takeaways

Deploy Frontier AI on Your {Hardware} with Public API Entry

The Finest AI-Pushed Market Intelligence Platforms for Institutional Buyers

What it takes to scale agentic AI within the enterprise

LEAVE A REPLY Cancel reply

Most Popular

MTF Development Indicator MT4 – ForexMT4Indicators.com

Technique Buys 44,377 BTC in March 2026 as STRC Quantity Hits $746M Document – Crypto Information Bitcoin Information

Treasury Proposes Stablecoin AML Guidelines as Bessent Vows to Defend US Monetary System – Crypto Information Bitcoin Information

Motorola Moto Pad: 11-inch Android pill for $250

Recent Comments

ABOUT US

POPULAR POSTS

MTF Development Indicator MT4 – ForexMT4Indicators.com

Technique Buys 44,377 BTC in March 2026 as STRC Quantity Hits $746M Document – Crypto Information Bitcoin Information

Treasury Proposes Stablecoin AML Guidelines as Bessent Vows to Defend US Monetary System – Crypto Information Bitcoin Information

POPULAR CATEGORY