Wednesday, May 27, 2026
HomeArtificial IntelligenceMeet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Consideration Drift in...

Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Consideration Drift in LLM Inference

Speculative decoding is a method for dashing up giant language mannequin inference. A small, quick draft mannequin proposes a number of tokens. The massive goal mannequin verifies them in parallel. If accepted, inference is quicker. If rejected, the system falls again gracefully.

EAGLE Staff, vLLM Staff, and TorchSpec Staff has launched the EAGLE sequence together with EAGLE 1, EAGLE 2, and EAGLE 3 has develop into probably the most broadly adopted and virtually deployed households of speculative decoding algorithms throughout each analysis and manufacturing programs. At the moment, that household will get a focused reliability improve with introduction of EAGLE 3.1.

What was Going Incorrect

Whereas speculative decoding performs nicely in managed settings, efficiency usually degrades beneath completely different chat templates, long-context inputs, or out-of-distribution system prompts.

The EAGLE staff traced this fragility to a phenomenon known as consideration drift as hypothesis depth will increase, the drafter progressively shifts consideration away from sink tokens and towards its personal generated tokens.

In easier phrases: the drafter is a small mannequin that predicts future tokens. As hypothesis will get deeper, it begins attending to its personal prior outputs as an alternative of the unique context. This degrades acceptance size and output stability.

Two underlying points had been recognized. First, the fused enter illustration turns into more and more imbalanced as higher-layer hidden states dominate the drafter enter. Second, hidden-state magnitude grows throughout hypothesis steps because of the unnormalized residual path. Collectively, these results make the drafter progressively much less secure at deeper hypothesis depths.

Two Architectural Fixes in EAGLE 3.1

To deal with consideration drift, EAGLE 3.1 comes with two key architectural enhancements: FC normalization after every goal hidden state and earlier than the FC layer, and feeding post-norm hidden states into the subsequent decoding step.

FC normalization stabilizes the hidden states that the drafter receives from the goal mannequin. With out it, hidden-state magnitude grows throughout steps, making the drafter more and more unreliable. Making use of normalization at every step retains the inputs bounded.

The post-norm design makes the tactic behave extra like recursively invoking the drafter throughout decoding steps, moderately than merely appending extra layers to the goal mannequin.

https://vllm.ai/blog/2026-05-26-eagle-3-1https://vllm.ai/blog/2026-05-26-eagle-3-1
https://vllm.ai/weblog/2026-05-26-eagle-3-1

What These Fixes Ship

In contrast with EAGLE 3, EAGLE 3.1 demonstrates: higher training-time to inference-time extrapolation, stronger long-context robustness, greater resilience to speak template and system immediate variation, and extra secure acceptance size throughout various serving environments.

In long-context workloads, EAGLE 3.1 achieves as much as 2× longer acceptance size in contrast with EAGLE 3.

Coaching Infrastructure: TorchSpec

TorchSpec now gives environment friendly coaching help for EAGLE 3.1 and future speculative decoding algorithms. By reducing coaching overhead and simplifying experimentation workflows, TorchSpec helps speed up iteration and exploration for next-generation speculative decoding analysis and deployment.

Based mostly on TorchSpec and vLLM, the analysis staff additionally educated and open-sourced an EAGLE 3.1 draft mannequin for Kimi K2.6, obtainable on HuggingFace. The mannequin serves for instance of deploying EAGLE 3.1 with TorchSpec coaching and vLLM serving help on a real-world serving mannequin

vLLM Integration: Config-Pushed and Backward-Appropriate

EAGLE 3.1 lands in vLLM as a config-driven extension of the prevailing EAGLE 3 implementation. The combination consists of FC normalization help, post-norm hidden-state suggestions, and removing of hardcoded assumptions round goal hidden states.

Backward compatibility with current EAGLE 3 checkpoints is absolutely preserved. EAGLE 3.1 draft fashions could be plugged instantly by means of the identical speculative-decoding code path.

vllm serve nvidia/Kimi-K2.6-NVFP4 
  --trust-remote-code 
  --tensor-parallel-size 4 
  --tool-call-parser kimi_k2 
  --enable-auto-tool-choice 
  --reasoning-parser kimi_k2 
  --attention-backend tokenspeed_mla 
  --speculative-config '{"mannequin":"lightseekorg/kimi-k2.6-eagle3.1-mla","technique":"eagle3","num_speculative_tokens":3}' 
  --language-model-only

Benchmark Outcomes on Kimi K2.6

The analysis staff benchmarked the Kimi K2.6 EAGLE 3.1 draft mannequin on Kimi-K2.6-NVFP4 with vLLM (TP=4, GB200, non-disagg) on the SPEED-Bench coding dataset. EAGLE 3.1 delivers 2.03× greater per-user output throughput at concurrency 1. The speedup stays significant as concurrency scales: 1.71× at C=4 and 1.66× at C=16.

Marktechpost’s Visible Explainer

01 / 07

vLLM · Might 26, 2026


The EAGLE staff, vLLM staff, and TorchSpec staff collectively launched EAGLE 3.1 — a focused repair for speculative decoding instability in manufacturing LLM serving.

#speculative-decoding
#vLLM
#LLM inference
#efficiency

02 / 07

Background

What’s Speculative Decoding?


A method for dashing up LLM inference utilizing two fashions working collectively.

  • A small, quick draft mannequin proposes a number of tokens forward
  • The massive goal mannequin verifies all proposed tokens in a single cross
  • Accepted tokens are stored — rejected tokens fall again gracefully
  • Outcome: greater output throughput with no change in output high quality

03 / 07

The Drawback

Consideration Drift in EAGLE 3


EAGLE 3 efficiency degraded in real-world deployments beneath three circumstances:

  • Completely different chat templates
  • Lengthy-context inputs
  • Out-of-distribution system prompts

Root trigger: consideration drift — as hypothesis depth will increase, the drafter shifts consideration away from sink tokens towards its personal generated tokens.

04 / 07

Root Trigger

Two Underlying Points

  • The fused enter illustration turns into more and more imbalanced — higher-layer hidden states dominate the drafter enter
  • Hidden-state magnitude grows throughout hypothesis steps because of the unnormalized residual path
  • Collectively, these make the drafter progressively much less secure at deeper hypothesis depths

05 / 07

Structure

Two Architectural Fixes

Repair 1
FC normalization utilized after every goal hidden state and earlier than the FC layer. Retains hidden-state magnitude bounded throughout decoding steps.

Repair 2
Publish-norm hidden-state suggestions — normalized hidden states fed into the subsequent decoding step, making the drafter behave like recursive invocation moderately than appended layers.

06 / 07

Benchmarks · SPEED-Bench Coding · GB200 TP=4

Per-Person Throughput vs. No-Spec Baseline

2.03×Concurrency 1

1.71×Concurrency 4

1.66×Concurrency 16

In long-context workloads, EAGLE 3.1 achieves as much as 2× longer acceptance size in contrast with EAGLE 3. Examined on Kimi-K2.6-NVFP4 with vLLM.

07 / 07

Deployment · vLLM v0.22.0

The right way to Deploy EAGLE 3.1


Backward-compatible with EAGLE 3 checkpoints. Already merged in vLLM principal. Steady launch: v0.22.0.

vllm serve nvidia/Kimi-K2.6-NVFP4 
  --trust-remote-code 
  --tensor-parallel-size 4 
  --tool-call-parser kimi_k2 
  --enable-auto-tool-choice 
  --reasoning-parser kimi_k2 
  --attention-backend tokenspeed_mla 
  --speculative-config 
    '{"mannequin":"lightseekorg/kimi-k2.6-eagle3.1-mla",
      "technique":"eagle3",
      "num_speculative_tokens":3}' 
  --language-model-only

Key Takeaways

  • EAGLE 3.1 fixes consideration drift — a newly recognized instability the place the drafter loses concentrate on sink tokens at deeper hypothesis depths.
  • Two architectural adjustments — FC normalization and post-norm hidden-state suggestions — stabilize the drafter throughout hypothesis steps.
  • In long-context workloads, EAGLE 3.1 delivers as much as 2× longer acceptance size in contrast with EAGLE 3.
  • Benchmarks on Kimi-K2.6-NVFP4 present 2.03× per-user output throughput at concurrency 1, dropping to 1.66× at C=16.
  • EAGLE 3.1 is backward-compatible with EAGLE 3 checkpoints and is already merged into vLLM principal, transport in v0.22.0.

Take a look at the Technical particulars. Additionally, be happy to observe us on Twitter and don’t neglect to affix our 150k+ ML SubReddit and Subscribe to our E-newsletter. Wait! are you on telegram? now you may be part of us on telegram as nicely.

Must associate with us for selling your GitHub Repo OR Hugging Face Web page OR Product Launch OR Webinar and so forth.? Join with us


Michal Sutter is an information science skilled with a Grasp of Science in Information Science from the College of Padova. With a stable basis in statistical evaluation, machine studying, and knowledge engineering, Michal excels at remodeling advanced datasets into actionable insights.

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Most Popular

Recent Comments