SuffixDecoding at Production Scale with Arctic Inference and vLLM

LLM agents are powerful but slow. Tasks like coding, retrieval-augmented generation and multi-step reasoning require the model to generate many tokens across iterative steps, and autoregressive decoding (i.e., producing one token at a time) creates a major latency bottleneck.
SuffixDecoding addresses this by predicting likely continuations from patterns in previous outputs without additional model weights. In our previous blog post and NeurIPS paper we demonstrated that SuffixDecoding achieves state-of-the-art speculation quality for repetitive workloads, with up to 5.3x faster generation than vanilla decoding (Figure 1).

However, its benefits were limited in real-world applications that required high concurrency.
This post explains how SuffixDecoding works and two key optimizations that now make it production-ready at all concurrency levels:
Faster data structures: A custom high-performance hashmap that increased speculation speed by 3.4x, improved update speed by 1.5x, and reduced memory usage by 2.3x.
Smarter tree traversal: A two-level linked list for top-child selection that made speculation traversal 2.2x faster.
These improvements reduce CPU overhead and unlock strong end-to-end gains, including 1.96x–3.12x on Blazedit and 1.0x–1.28x on SpecBench, along with 1.11x–1.17x speedups over the best N-gram baseline on SpecBench and 1.02x–1.31x on Blazedit. With these upgrades, SuffixDecoding outperforms N-gram speculation in vLLM at all concurrency levels.
SuffixDecoding is now open source as part of Arctic Inference, fully integrated into vLLM, and coming soon to SGLang. These integrations allow SuffixDecoding to run natively in real deployments, making it easy to accelerate agentic workloads today.
What sets SuffixDecoding apart from speculative decoding
Speculative decoding is a popular technique to accelerate LLM inference by predicting multiple tokens ahead and verifying them in parallel with the base model. However, most existing methods — such as Medusa and EAGLE3 — require training additional model weights, which adds complexity and limits flexibility across different models.
SuffixDecoding takes a fundamentally different approach: It’s completely model-free. Instead of relying on learned weights, it exploits patterns in past generations to speculate likely continuations. This makes it particularly well-suited for repetitive workloads, such as reasoning loops, self-consistency checks and long, multiturn pipelines — all hallmarks of agentic applications.
Crucially, unlike N-gram methods which require different configurations to perform optimally at different concurrency levels, SuffixDecoding consistently delivers the best model-free speculation performance across all concurrency levels with a single configuration, making it a more robust and practical choice for production workloads with variable load patterns.
What is SuffixDecoding?
SuffixDecoding keeps a cache of past generations across all requests to identify repeating sequences that can be speculated. To support such speculations, SuffixDecoding relies on an efficient data structure called a suffix tree. Each running request maintains a suffix tree over its prompt and generated tokens, while a single shared suffix tree caches responses from previous requests.
Using these suffix trees, SuffixDecoding quickly detects when a request starts to repeat tokens that were observed earlier in the same request or in the outputs of previous requests, and speculates probable continuations based on those historical outputs.

What is a suffix tree?
A suffix tree is a compact data structure that represents all substrings of a sequence and supports substring search in time proportional to the query length, using space linear in the sequence length. It can be thought of as a trie that stores every suffix of the sequence.

The left figure illustrates a simplified suffix tree that contains the word “BANANA”. Its suffixes are “BANANA”, “ANANA”, “NANA”, “ANA”, “NA”, and “A”. Each of these suffixes is represented by a unique path from the root node of the tree to one of its leaves (represented by “$”).
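For intuition, here is a minimal Python sketch (illustrative only, not the Arctic Inference implementation) that builds such a structure by inserting every suffix of "BANANA$" into a nested-dictionary trie:

from typing import Dict


def build_suffix_trie(text: str) -> Dict:
    """Insert every suffix of `text` into a nested-dict trie."""
    root: Dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


# Each suffix ("BANANA$", "ANANA$", ..., "A$") becomes a root-to-leaf path.
print(build_suffix_trie("BANANA$"))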
Speculating using suffix trees
The right figure illustrates how speculations can be made by pattern matching using the suffix tree. Suppose we are now generating tokens for a new request, and the most recent generated tokens are “DONKEY KONG EATS BANA”. We can use the last tokens, say “ANA”, to look up in the suffix tree (blue path in the right figure), and quickly find a few possible continuations, which in this case are $ (end of string) and “NA$” (green paths in the right figure). SuffixDecoding will try several possible patterns (e.g., “NA”, “ANA”, “BANA”) and choose the best one according to frequency statistics of the possible continuations (details in our paper).
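The toy sketch below shows this lookup-and-extend idea end to end. It is illustrative only: the function names are ours, and where the real system uses the frequency statistics described in the paper, this version simply prefers the longest matching pattern and greedily follows the most frequent continuation.

from typing import Dict, List, Optional, Tuple


class Node:
    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.count = 0  # number of suffixes passing through this node


def build_suffix_trie(text: str) -> Node:
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.count += 1
    return root


def match(root: Node, pattern: str) -> Optional[Node]:
    """Follow `pattern` from the root; return the node it ends at, if any."""
    node = root
    for ch in pattern:
        if ch not in node.children:
            return None
        node = node.children[ch]
    return node


def speculate(root: Node, recent: str, max_len: int = 8) -> Tuple[str, str]:
    """Try progressively longer patterns taken from the end of `recent` and
    return (pattern, continuation), preferring the longest pattern that matches."""
    best = ("", "")
    for plen in range(1, len(recent) + 1):
        pattern = recent[-plen:]
        node = match(root, pattern)
        if node is None:
            break  # a longer pattern cannot match if a shorter one failed
        continuation: List[str] = []
        while node.children and len(continuation) < max_len:
            # Greedily follow the most frequent child.
            ch, node = max(node.children.items(), key=lambda kv: kv[1].count)
            if ch == "$":
                break  # end-of-sequence marker
            continuation.append(ch)
        best = (pattern, "".join(continuation))
    return best


trie = build_suffix_trie("BANANA$")
print(speculate(trie, "DONKEY KONG EATS BANA"))  # ('BANA', 'NA')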
Advanced implementation details
The simplified examples above illustrate the core concepts, while our production implementation includes additional advanced features that make SuffixDecoding practical for real-world LLM serving. For readers interested in the complete picture, our suffix trees support:
Token-level operations: Working directly with raw token sequences rather than decoded text for efficiency
Generalized suffix trees: Indexing multiple sequences simultaneously within a single tree structure
Dynamic updates: Efficiently appending tokens to any sequence as generation progresses
Memory management: Evicting sequences from the tree when needed to reduce memory usage
Our implementation also uses a fixed tree depth (default: 64 tokens), which simplifies these operations while covering all practical speculation lengths. Additionally, while suffix trees theoretically enable tree-based speculation with multiple possible continuations, our vLLM implementation focuses on single-sequence speculation for simplicity and compatibility with the framework's architecture.
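To make those features concrete, here is a compact, illustrative Python sketch of a generalized suffix tree over token IDs with incremental appends, per-sequence eviction and a fixed maximum depth. The production version is written in C++ and differs in many details, and all names here are hypothetical.

from typing import Dict, List

MAX_DEPTH = 64  # default tree depth mentioned above


class Node:
    __slots__ = ("children", "count")

    def __init__(self) -> None:
        self.children: Dict[int, "Node"] = {}  # token ID -> child node
        self.count = 0  # number of indexed suffix occurrences through this node


class GeneralizedSuffixTree:
    def __init__(self) -> None:
        self.root = Node()
        self.seqs: Dict[int, List[int]] = {}  # sequence ID -> token IDs

    def append(self, seq_id: int, token: int) -> None:
        """Dynamic update: extend sequence `seq_id` by one token. Each suffix
        still within MAX_DEPTH gains exactly one new node/occurrence."""
        seq = self.seqs.setdefault(seq_id, [])
        seq.append(token)
        n = len(seq)
        for i in range(max(0, n - MAX_DEPTH), n):
            node = self.root
            for tok in seq[i : n - 1]:  # walk the path indexed by earlier appends
                node = node.children[tok]
            child = node.children.setdefault(token, Node())
            child.count += 1

    def evict(self, seq_id: int) -> None:
        """Memory management: remove a sequence's contribution from the tree."""
        seq = self.seqs.pop(seq_id, [])
        for i in range(len(seq)):
            node = self.root
            for tok in seq[i : i + MAX_DEPTH]:
                child = node.children[tok]
                child.count -= 1
                if child.count == 0:
                    del node.children[tok]  # drops the now-unused subtree
                    break
                node = child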
Performance optimizations
When we first implemented SuffixDecoding (as described in our paper), speculation took on the order of 20-30 microseconds per token. That sounds fast, but it wasn't enough.
Since SuffixDecoding runs on CPU, its latency scales linearly with the number of speculated tokens, while the GPU-powered LLM is highly parallelized. At high concurrency levels (e.g., 64 requests), speculation overhead began to dominate runtime.

As shown in Figure 4, while the performance of SuffixDecoding was better than N-gram (another model-free speculation method in vLLM) at low concurrencies, it actually became worse at a concurrency of 64.
To push SuffixDecoding beyond this limit, we made two key optimizations.
1. Custom hashmaps deliver up to 3.4x speedups and 2.3x lower memory
SuffixDecoding makes heavy use of the map data structure. For example, every node in the suffix tree contains a map from token ID to child node. Initially, we used std::unordered_map from the C++ standard library. However, we found it to be both slow and memory-heavy due to its cache-unfriendly, linked-bucket implementation.
We replaced std::unordered_map with a custom hashmap that uses open addressing with triangular probing. It is cache-friendly and supports high load factors (e.g., 70-80%), making it both fast and memory-efficient. It also assumes that keys (e.g., token IDs) are always 32-bit integers, so entries can be tightly packed and hashing simplified.
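For intuition, here is a small Python sketch of open addressing with triangular probing over integer keys (illustrative only; the production hashmap is C++ and the class name IntMap is ours). With a power-of-two table size, the triangular probe sequence visits every slot, which is what lets the table run safely at high load factors.

from typing import List, Optional

EMPTY = -1  # sentinel; valid keys are non-negative token IDs


class IntMap:
    def __init__(self, capacity: int = 16) -> None:
        assert capacity & (capacity - 1) == 0, "capacity must be a power of two"
        self.keys: List[int] = [EMPTY] * capacity
        self.vals: List[Optional[object]] = [None] * capacity
        self.size = 0

    def _probe(self, key: int) -> int:
        mask = len(self.keys) - 1
        idx = (key * 2654435761) & mask  # cheap multiplicative hash
        step = 1
        while self.keys[idx] != EMPTY and self.keys[idx] != key:
            idx = (idx + step) & mask  # cumulative offsets 1, 3, 6, ... (triangular)
            step += 1
        return idx

    def put(self, key: int, val: object) -> None:
        # Grow before exceeding a high load factor (here, 80%).
        if (self.size + 1) * 5 > len(self.keys) * 4:
            self._grow()
        idx = self._probe(key)
        if self.keys[idx] == EMPTY:
            self.keys[idx] = key
            self.size += 1
        self.vals[idx] = val

    def get(self, key: int) -> Optional[object]:
        idx = self._probe(key)
        return self.vals[idx] if self.keys[idx] == key else None

    def _grow(self) -> None:
        old = [(k, v) for k, v in zip(self.keys, self.vals) if k != EMPTY]
        self.__init__(len(self.keys) * 2)
        for k, v in old:
            self.put(k, v)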

Figure 5 shows that switching to this custom hashmap on the Magicoder-Evol-Instruct-110K data set improved speculation speed by 3.4x and suffix tree update speed by 1.5x, while reducing memory usage by 2.3x.
2. Faster top-child selection improved speculation speed by 2.2x
During speculation, we repeatedly need to find the child of a node with the highest frequency count in order to “walk” down the suffix tree. Originally, this was implemented by scanning every child to find the one with the highest count. Since modern LLMs can have vocabularies of hundreds of thousands of tokens, this can be a very slow operation!
To address this issue, we additionally keep the children of each node sorted in descending order of their counts, maintained as a doubly linked list. Now, whenever we want to find the child with the highest count, we just look at the head of the list, which is an O(1) operation!

However, this naive list made updating the suffix tree slow. Whenever a child’s count changed, we had to reposition it, often scanning across many siblings with equal counts. To fix this, we added a second level to the linked list:
The first level links the children of the node (the original list).
The second level links groups of children with the same count.
This multi-level design makes updates efficient since we can simply move a child between groups without scanning through potentially many sibling nodes with the same count.
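The sketch below conveys the grouping idea in Python. It is illustrative only: the production code uses intrusive doubly linked lists in C++ and also handles count decrements for eviction, whereas this toy version handles increments only, and the names are hypothetical.

from collections import defaultdict
from typing import Dict, Set


class TopChildIndex:
    def __init__(self) -> None:
        self.count: Dict[int, int] = {}                        # child token -> count
        self.groups: Dict[int, Set[int]] = defaultdict(set)    # count -> children
        self.max_count = 0

    def increment(self, token: int) -> None:
        """Called when a child's frequency count goes up by one."""
        old = self.count.get(token, 0)
        if old:
            self.groups[old].discard(token)
            if not self.groups[old]:
                del self.groups[old]
        new = old + 1
        self.count[token] = new
        self.groups[new].add(token)  # O(1) move to the adjacent count group
        if new > self.max_count:
            self.max_count = new

    def top_child(self) -> int:
        """O(1): return a child from the highest-count group (assumes >= 1 child)."""
        return next(iter(self.groups[self.max_count]))


idx = TopChildIndex()
for tok in [7, 7, 7, 42, 42, 99]:
    idx.increment(tok)
print(idx.top_child())  # 7, since it has the highest count (3)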

Figure 7 shows that this two-level linked-list design improved speculation speed by another 2.2x, at the cost of slightly higher update time and memory usage. Even so, speculation time, update time and memory usage are all much improved over the original baseline.
End-to-end benchmarks in vLLM
After the optimizations above, we completely eliminated the performance gap vs. N-gram at a concurrency of 64, as shown in Figures 8 and 9 below.

In the above, we compared SuffixDecoding with two configurations of N-gram: N-gram[5,5], which sets prompt_lookup_min = prompt_lookup_max = 5 (the default in vLLM), and N-gram[3,5], which sets prompt_lookup_min = 3 and prompt_lookup_max = 5. Notice how N-gram's optimal configuration varies by concurrency level: At concurrency 1, N-gram[5,5] performs better (5.63 ms vs. 5.21 ms for N-gram[3,5]), but at concurrency 64, N-gram[3,5] becomes superior (13.14 ms vs. 11.57 ms).
In contrast, SuffixDecoding consistently outperforms both N-gram configurations across all concurrency levels with a single, fixed configuration, delivering the best model-free speculation performance whether you're running at low or high concurrency.
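For reference, these configurations correspond roughly to the following vLLM speculative_config settings. The parameter names follow vLLM's N-gram speculator as we understand it and may vary across versions, and the num_speculative_tokens values for N-gram are illustrative rather than the exact benchmark settings.

ngram_5_5 = {  # N-gram[5,5]
    "method": "ngram",
    "prompt_lookup_min": 5,
    "prompt_lookup_max": 5,
    "num_speculative_tokens": 5,  # illustrative
}

ngram_3_5 = {  # N-gram[3,5]
    "method": "ngram",
    "prompt_lookup_min": 3,
    "prompt_lookup_max": 5,
    "num_speculative_tokens": 5,  # illustrative
}

suffix_decoding = {  # SuffixDecoding: one configuration at every concurrency
    "method": "suffix",
    "num_speculative_tokens": 32,
}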
We also evaluated the Blazedit data set, a code-editing benchmark well-suited for N-gram and SuffixDecoding.

Here again, we see N-gram's configuration-dependent performance: At low concurrency (1), N-gram[5,5] outperforms N-gram[3,5] (2.17 ms vs. 1.86 ms), but at high concurrency (64), N-gram[3,5] becomes the better choice (7.99 ms vs. 7.42 ms).
SuffixDecoding eliminates this tuning complexity, maintaining the best performance at all concurrency levels without requiring configuration changes. This makes SuffixDecoding particularly valuable in production environments where workload patterns and concurrency levels fluctuate.
Future optimizations
With the optimizations discussed in this blog, we measured the remaining overhead from SuffixDecoding speculation and updates to be around 10% at a concurrency of 64. This remaining overhead can be reduced in the following ways, which we leave as future work.
Parallel speculation: At high concurrencies, generating speculative tokens with SuffixDecoding can be parallelized across multiple CPU threads, with each thread handling independent requests. This could further reduce per-token latency in large-scale deployments.
Asynchronous updates: Currently, after every decoding iteration, we update the suffix tree immediately before using the updated suffix tree to generate speculative tokens for the next iteration. Instead, we can generate the speculative tokens without updating the tree, and then perform the update in a background thread that can be overlapped with the next decoding step.
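As a rough illustration of the asynchronous-update idea, the sketch below queues suffix tree updates and applies them on a background thread so they can overlap with the next decoding step. All names are hypothetical, and a real implementation would need to synchronize concurrent reads and writes to the tree.

import threading
from queue import Queue
from typing import List


class AsyncSuffixTreeUpdater:
    def __init__(self, tree) -> None:
        self.tree = tree  # any object with an append(seq_id, token) method
        self.pending: Queue = Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def submit(self, req_id: int, new_tokens: List[int]) -> None:
        """Queue the tokens accepted in this step and return immediately, so
        speculation for the next step can proceed with the current tree."""
        self.pending.put((req_id, new_tokens))

    def _run(self) -> None:
        while True:
            req_id, tokens = self.pending.get()
            for tok in tokens:
                self.tree.append(req_id, tok)  # overlapped with GPU decoding
            self.pending.task_done()

    def drain(self) -> None:
        """Wait until all queued updates have been applied."""
        self.pending.join()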
Using SuffixDecoding in vLLM
With the optimizations described in this blog post, we’ve merged SuffixDecoding into vLLM (link to PR). This means that, as long as vLLM and Arctic Inference are both installed, SuffixDecoding will work out of the box!
pip install vllm arctic-inference

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="facebook/opt-6.7b",
    tensor_parallel_size=1,
    # Enable SuffixDecoding via Arctic Inference's "suffix" speculative method.
    speculative_config={
        "method": "suffix",
        "num_speculative_tokens": 32,
    },
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

SuffixDecoding speculates a dynamic number of tokens for each request at each decoding step, so the num_speculative_tokens configuration specifies the maximum number of speculative tokens. We suggest using a high value, such as 16 or 32 (the default).


