Perplexity AI’s analysis crew reimplemented their Unigram tokenizer from scratch in Rust and open-sourced the code in pplx-garden, their inference know-how repository.
At manufacturing enter lengths, the brand new encoder cuts p50 latency by roughly 5x versus the Hugging Face tokenizers crate, ~2x versus SentencePiece (C++), and ~1.5x versus IREE’s tokenizer (C), with zero steady-state heap allocations. In manufacturing, it diminished CPU utilization in Perplexity’s inference stack by 5-6x and shaved double-digit milliseconds off reranker latency.
Why Tokenization Turned a Bottleneck
LLM inference value is often framed round GPU work: KV caches, consideration kernels, professional routing. However smaller fashions, similar to embedding fashions, classifiers, and rerankers, inform a special story. These fashions are two to a few orders of magnitude smaller than frontier transformers.
A reranker scoring lots of of candidate paperwork per request is a transparent instance. With a small mannequin, GPU compute usually finishes in single-digit milliseconds. Each enter nonetheless passes by means of CPU-side tokenization first. When batch sizes are giant, tokenization turns into a significant fraction of whole request latency.
Perplexity’s work targets XLM-RoBERTa, a mannequin with a 250K-token Unigram vocabulary skilled with SentencePiece. Superb-tuned RoBERTa-family encoders are a typical manufacturing alternative for rating, retrieval, and similarity duties.
What’s Unigram Tokenization?
Unigram tokenization was launched by Kudo in 2018 and is carried out in SentencePiece. It frames segmentation as a most-probable-path downside. Every vocabulary token has a realized log-probability. The tokenizer picks the segmentation whose token scores sum to the very best worth.
The algorithm used to seek out that greatest path is the Viterbi algorithm, a dynamic programming approach from 1967. Byte positions type graph layers and vocabulary tokens are edges spanning a contiguous byte vary. The DP recurrence iterates over byte positions and updates the best-scoring path at every place.
The outer loop runs in linear time relative to enter size. The interior loop walks a vocabulary trie (a prefix tree construction) at every byte place. On a 16K-token enter, this interior stroll executes lots of of 1000’s of trie transitions. It’s the sizzling path.
What was Gradual within the Hugging Face Implementation
The Hugging Face tokenizers crate is the default Rust tokenizer most groups attain for. Perplexity used it because the benchmark reference. At 514 tokens (512 + BOS/EOS injection), the reference implementation had three pricey patterns:
| Bottleneck | Mechanism | Measured impression |
|---|---|---|
| Allocation per match | String::from_utf8 + AHashMap lookup per trie match |
7,295 allocations at 514 tokens; 299,171 at 16K |
| Pointer chase per byte | AHashMap at each trie node; 4 dependent hundreds per byte step |
Dependent-load latency dominates the new path |
| L2 thrashing on lengthy inputs | DP desk and output buffers freshly allotted every name | L2 miss fee climbs from 8% at 128 tokens to 50% at 16K |
Per-token allocation is fixed: roughly 2 KB and ~18 allocations per token, no matter enter dimension. The latency downside turns into extreme at longer inputs when cumulative allocations overflow the per-core L2 cache.
Establishing a Baseline Earlier than Altering the Trie
Earlier than switching the trie construction, Perplexity first remoted how a lot value got here from pointless work alone. They made a zero-allocation port of the reference: similar HashMap trie, however with a caller-owned scratch struct reused throughout calls and token IDs saved instantly in trie nodes (eradicating the per-match string allocation and secondary hash-map lookup).
This baseline already lower p50 latency to 155 µs at 514 tokens, down from 326 µs within the reference. Directions retired dropped 2.4x. The remaining value was the HashMap pointer chase itself, which the following step addressed.
The Three Optimizations
Optimization 1: Double-Array Trie
The Hugging Face trie shops kids in a HashMap at each node. Every byte step requires a hash computation, two pointer dereferences, and a heap entry. Perplexity changed this with a double-array trie, the identical construction utilized by SentencePiece and IREE, initially launched by Aoe in 1989.
A double-array trie encodes the complete trie in two flat integer arrays, base and examine. A baby lookup is: subsequent = base[node] + byte, then confirm examine[next] == node. That’s two array reads, one integer add, and one comparability, with no hashing and no pointer chasing. For XLM-RoBERTa’s 250K vocab, the entire trie suits in ~9 MB of contiguous reminiscence. The new working set per encode is on the order of 100 KB, which inserts in L2 cache.
Not like SentencePiece and IREE, that are general-purpose libraries with lattice bookkeeping and multi-stage pipelines, Perplexity inlined the trie instantly within the Viterbi loop and dropped that overhead totally.
Consequence at 514 tokens: p50 dropped from 155 µs (zero-allocation baseline) to 68 µs. Wall-clock fell 4.8x from the unique reference.
Optimization 2: Bitmap and Inline Packing
The double-array trie nonetheless requires two dependent array hundreds per byte step: first the dad or mum’s base offset, then the examine array to verify the transition is legitimate. Perplexity changed the examine array with a per-node bitmap (4 64-bit phrases, 32 bytes) that data which of the 256 attainable bytes have legitimate little one transitions.
A bitmap lookup compiles to a single bit check in opposition to one 64-bit phrase. The examine array is used solely throughout trie development and dropped from the runtime structure totally.
In addition they packed all 4 per-node fields (bitmap, base, token ID, and rating) right into a single 64-byte cache line, matching CPU cache line width precisely. One trie step now hundreds a single cache line masking the bitmap for the next-byte examine, the bottom offset for the kid slot, and the token ID and rating at terminal nodes.
Commerce-off: trie dimension grows from ~9 MB to ~50 MB (780K nodes x 64 bytes). The new working set per encode stays ~100 KB.
Consequence at 514 tokens: Further 4.5% wall-clock discount. L2 accesses dropped from 4.6K to 1.8K per encode.
Optimization 3: Enormous Pages for the Trie
At 50 MB, the trie spans roughly 12,000 digital pages on a default Linux system utilizing 4 KB pages. The primary-level knowledge TLB on Intel Sapphire Rapids holds 96 entries. Every Viterbi step touches a special trie node, so TLB misses accumulate. Over a 512-token encode, Perplexity estimated roughly 9,000 cycles spent in page-table walks, about 3% of per-encode finances.
Perplexity backed the trie with 2 MB big pages by way of mmap with the MAP_HUGETLB flag. The identical 50 MB now spans 25 pages, nicely throughout the TLB. This requires vm.nr_hugepages configured at boot. In manufacturing, 10,561 big pages are reserved; the trie makes use of 24.
Consequence: 3-12% wall-clock discount relying on enter size. The most important achieve is at 4,098 tokens (-12.0%), the place page-table site visitors was actively competing with trie knowledge for L2 bandwidth. Past 4K tokens the achieve shrinks as a result of L3 misses dominate.
Ultimate Benchmark Outcomes
All measurements are single-threaded, pinned to 1 core on an Intel Xeon Platinum 8488C, with 10,000 iterations after 1,000 warmup rounds. At 514 tokens:
| Engine | p50 Latency | Directions | Allocations |
|---|---|---|---|
Hugging Face (tokenizers crate) |
349 µs | 3.60M | 7,295 |
| SentencePiece (C++) | 128 µs | 1.83M | 1,559 |
| IREE tokenizer (C) | 112 µs | 2.28M | 1 |
| Perplexity (remaining, all 3 optimizations) | ~63 µs | 1.04M | 0 |
Throughout the complete optimization sequence, directions per encode fell from 3.66M to 1.04M, a 3.5x discount. Wall-clock matches that ratio at quick inputs and widens at lengthy inputs the place the reference’s per-token allocations overflow L2 and L3.
One extra discovering: off-the-shelf Rust wrapper crates round SentencePiece and IREE add 1.6-1.9x latency overhead in comparison with the native C/C++ binaries. The sentencepiece crate allocates a recent record of token items on every name. The overhead is measurable however amortizes at lengthy inputs.
The ultimate Perplexity encoder produces token-exact output in opposition to the reference. In manufacturing, it makes use of rayon to parallelize throughout cores.
Marktechpost’s Visible Explainer
Key Takeaways
- Perplexity rebuilt their Unigram tokenizer concentrating on XLM-RoBERTa’s 250K-token SentencePiece vocabulary
- The brand new encoder achieves zero steady-state heap allocations and ~63 µs p50 at 514 tokens
- Three optimizations: double-array trie, bitmap + 64-byte cache-line packing, and a couple of MB big pages for the trie
- Intermediate outcome: a zero-allocation HashMap port alone lower p50 from 326 µs to 155 µs earlier than the trie was modified
- Manufacturing impression: 5-6x CPU utilization discount and double-digit ms discount in reranker latency
Try the Repo and Technical particulars. Additionally, be happy to comply with us on Twitter and don’t overlook to affix our 150k+ ML SubReddit and Subscribe to our Publication. Wait! are you on telegram? now you possibly can be part of us on telegram as nicely.
Must associate with us for selling your GitHub Repo OR Hugging Face Web page OR Product Launch OR Webinar and so on.? Join with us
