Speculative decoding is a method for dashing up giant language mannequin inference. A small, quick draft mannequin proposes a number of tokens. The massive goal mannequin verifies them in parallel. If accepted, inference is quicker. If rejected, the system falls again gracefully.
EAGLE Staff, vLLM Staff, and TorchSpec Staff has launched the EAGLE sequence together with EAGLE 1, EAGLE 2, and EAGLE 3 has develop into probably the most broadly adopted and virtually deployed households of speculative decoding algorithms throughout each analysis and manufacturing programs. At the moment, that household will get a focused reliability improve with introduction of EAGLE 3.1.
What was Going Incorrect
Whereas speculative decoding performs nicely in managed settings, efficiency usually degrades beneath completely different chat templates, long-context inputs, or out-of-distribution system prompts.
The EAGLE staff traced this fragility to a phenomenon known as consideration drift as hypothesis depth will increase, the drafter progressively shifts consideration away from sink tokens and towards its personal generated tokens.
In easier phrases: the drafter is a small mannequin that predicts future tokens. As hypothesis will get deeper, it begins attending to its personal prior outputs as an alternative of the unique context. This degrades acceptance size and output stability.
Two underlying points had been recognized. First, the fused enter illustration turns into more and more imbalanced as higher-layer hidden states dominate the drafter enter. Second, hidden-state magnitude grows throughout hypothesis steps because of the unnormalized residual path. Collectively, these results make the drafter progressively much less secure at deeper hypothesis depths.
Two Architectural Fixes in EAGLE 3.1
To deal with consideration drift, EAGLE 3.1 comes with two key architectural enhancements: FC normalization after every goal hidden state and earlier than the FC layer, and feeding post-norm hidden states into the subsequent decoding step.
FC normalization stabilizes the hidden states that the drafter receives from the goal mannequin. With out it, hidden-state magnitude grows throughout steps, making the drafter more and more unreliable. Making use of normalization at every step retains the inputs bounded.
The post-norm design makes the tactic behave extra like recursively invoking the drafter throughout decoding steps, moderately than merely appending extra layers to the goal mannequin.


What These Fixes Ship
In contrast with EAGLE 3, EAGLE 3.1 demonstrates: higher training-time to inference-time extrapolation, stronger long-context robustness, greater resilience to speak template and system immediate variation, and extra secure acceptance size throughout various serving environments.
In long-context workloads, EAGLE 3.1 achieves as much as 2× longer acceptance size in contrast with EAGLE 3.
Coaching Infrastructure: TorchSpec
TorchSpec now gives environment friendly coaching help for EAGLE 3.1 and future speculative decoding algorithms. By reducing coaching overhead and simplifying experimentation workflows, TorchSpec helps speed up iteration and exploration for next-generation speculative decoding analysis and deployment.
Based mostly on TorchSpec and vLLM, the analysis staff additionally educated and open-sourced an EAGLE 3.1 draft mannequin for Kimi K2.6, obtainable on HuggingFace. The mannequin serves for instance of deploying EAGLE 3.1 with TorchSpec coaching and vLLM serving help on a real-world serving mannequin
vLLM Integration: Config-Pushed and Backward-Appropriate
EAGLE 3.1 lands in vLLM as a config-driven extension of the prevailing EAGLE 3 implementation. The combination consists of FC normalization help, post-norm hidden-state suggestions, and removing of hardcoded assumptions round goal hidden states.
Backward compatibility with current EAGLE 3 checkpoints is absolutely preserved. EAGLE 3.1 draft fashions could be plugged instantly by means of the identical speculative-decoding code path.
vllm serve nvidia/Kimi-K2.6-NVFP4
--trust-remote-code
--tensor-parallel-size 4
--tool-call-parser kimi_k2
--enable-auto-tool-choice
--reasoning-parser kimi_k2
--attention-backend tokenspeed_mla
--speculative-config '{"mannequin":"lightseekorg/kimi-k2.6-eagle3.1-mla","technique":"eagle3","num_speculative_tokens":3}'
--language-model-only
Benchmark Outcomes on Kimi K2.6
The analysis staff benchmarked the Kimi K2.6 EAGLE 3.1 draft mannequin on Kimi-K2.6-NVFP4 with vLLM (TP=4, GB200, non-disagg) on the SPEED-Bench coding dataset. EAGLE 3.1 delivers 2.03× greater per-user output throughput at concurrency 1. The speedup stays significant as concurrency scales: 1.71× at C=4 and 1.66× at C=16.
Marktechpost’s Visible Explainer
Key Takeaways
- EAGLE 3.1 fixes consideration drift — a newly recognized instability the place the drafter loses concentrate on sink tokens at deeper hypothesis depths.
- Two architectural adjustments — FC normalization and post-norm hidden-state suggestions — stabilize the drafter throughout hypothesis steps.
- In long-context workloads, EAGLE 3.1 delivers as much as 2× longer acceptance size in contrast with EAGLE 3.
- Benchmarks on Kimi-K2.6-NVFP4 present 2.03× per-user output throughput at concurrency 1, dropping to 1.66× at C=16.
- EAGLE 3.1 is backward-compatible with EAGLE 3 checkpoints and is already merged into vLLM principal, transport in v0.22.0.
Take a look at the Technical particulars. Additionally, be happy to observe us on Twitter and don’t neglect to affix our 150k+ ML SubReddit and Subscribe to our E-newsletter. Wait! are you on telegram? now you may be part of us on telegram as nicely.
Must associate with us for selling your GitHub Repo OR Hugging Face Web page OR Product Launch OR Webinar and so forth.? Join with us

