Arctic Inference with Shift Parallelism: The Fastest Open Source Inference System for Enterprise AI


Inference is becoming the dominant workload in AI, but today’s systems force developers to make costly trade-offs between low latency, high throughput and affordable deployment.
Arctic Inference changes that. Built by Snowflake AI Research, it’s an open source vLLM plugin that brings Snowflake’s inference innovations to the community, delivering the fastest, most cost-effective open source inference for enterprise AI (see Figure 1).
At the core of Arctic Inference is Shift Parallelism, a new parallelism strategy designed to dynamically adapt to real-world challenges and unpredictable traffic, simultaneously achieving maximum speed (lowest time to first token and time per output token) and high cost efficiency (high throughput) in a single deployment.
In this blog post, we’ll dive into Shift Parallelism and how the full suite of innovations in Arctic Inference (cutting-edge speculative decoding, compute reduction with SwiftKV and optimized embedding inference) advance the state of the art for real-world enterprise AI.
Real-world results, one deployment
For real-world generative AI workloads, Arctic Inference+vLLM in a single deployment achieves:
3.4x faster request completion and 1.06x higher throughput compared to state-of-the-art (SoTA) throughput-optimized deployment
1.7x higher throughput and 1.28x faster request completion compared to SoTA latency-optimized deployment
The elusive trifecta: 2.25x lower response time, 1.75x faster generation and on-par throughput compared to bespoke deployments optimized for each metric
For non-generative AI workloads, such as embeddings, Arctic Inference+vLLM delivers a whopping 1.6M toks/sec per GPU:
16x faster than vLLM on short sequences and 4.2x faster on long sequences
2.4x faster than Text Embeddings Inference (TEI) on short sequences and at parity for long sequences
The performance claims are supported with detailed evaluation results later in the blog post. More importantly, they’re already delivering real-world impact in production with Arctic Inference powering key workloads in Snowflake Cortex AI1.
Today we’re excited to open source Arctic Inference and Shift Parallelism for the broader AI community. (If you'd like to cite this work, please use the BibTeX reference at the bottom of this post.)
Why today’s inference systems fall short
Inference workloads are not like training. While training workloads are homogeneous across batches, real-world inference traffic is highly dynamic, with bursty, unpredictable patterns. Furthermore, while training is throughput driven, real-world inference workloads care about three distinct metrics (see the measurement sketch after this list):
TTFT (time to first token): fast initial response
TPOT (time per output token): fast full generation
Throughput: cost-efficient token serving at scale
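To make these definitions concrete, here is a minimal measurement sketch (not from any serving framework; the field and function names are illustrative) showing how TTFT, TPOT and combined throughput can be computed from per-request timestamps.

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    arrival_s: float          # when the request was received
    first_token_s: float      # when the first output token was returned
    finish_s: float           # when the last output token was returned
    num_output_tokens: int

def ttft(r: RequestTrace) -> float:
    """Time to first token: how quickly the user sees a response."""
    return r.first_token_s - r.arrival_s

def tpot(r: RequestTrace) -> float:
    """Time per output token, excluding the first token."""
    decode_tokens = max(r.num_output_tokens - 1, 1)
    return (r.finish_s - r.first_token_s) / decode_tokens

def throughput(traces: list[RequestTrace]) -> float:
    """Combined output tokens per second across all requests."""
    start = min(r.arrival_s for r in traces)
    end = max(r.finish_s for r in traces)
    total_tokens = sum(r.num_output_tokens for r in traces)
    return total_tokens / (end - start)
```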
Existing parallelism strategies such as tensor parallelism and data parallelism were originally designed for the homogeneous, batch-optimized world of training. Applied to real-world inference, they introduce significant trade-offs.
| Strategy | Strengths | Weaknesses |
|---|---|---|
| Tensor parallelism | Leverages aggregate compute and memory across GPUs to process individual tokens. Great for fast generation (low TPOT). | Requires allreduce communication per token, scaling linearly with token length (O(n)). Low throughput on large batches due to large communication cost. |
| Data parallelism | Parallelizes across request boundaries with near-zero inter-GPU communication. Scales very well with excellent throughput on large batches. | Cannot speed up work within a single request. Unsuitable for highly interactive workloads due to slow TTFT and generation speed for individual requests. |
Why not combine them?
Switching between tensor parallelism and data parallelism may seem obvious, but in practice, it's not viable. Their KV cache memory layouts are incompatible, and switching requires expensive data movement. Most teams resort to duplicating deployments: one for latency, one for throughput. This adds cost and complexity.
To overcome these KV cache limitations, Arctic Inference introduces a new strategy: Arctic Sequence Parallelism (referenced as Arctic Ulysses in charts below).
Arctic Sequence Parallelism splits the input sequence across GPUs to parallelize work within a single request. Unlike tensor parallelism, it avoids costly token-wise communication (O(n)), while still achieving high GPU utilization. And because it shares a KV cache layout with tensor parallelism, it’s the ideal counterpart for large-batch scenarios. See our previous blog to learn more.
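To give a feel for the mechanism, here is a simplified NumPy sketch of the sequence-to-head regrouping that sequence parallelism relies on (illustrative only; in Arctic Ulysses this exchange happens via all-to-all communication across GPUs, not array slicing on one host).

```python
import numpy as np

P, SEQ, HEADS, DIM = 2, 8, 4, 16  # workers, tokens, attention heads, head dim
assert SEQ % P == 0 and HEADS % P == 0

# Each "GPU" p starts with a contiguous slice of the sequence for ALL heads.
x = np.random.randn(SEQ, HEADS, DIM)
seq_shards = np.split(x, P, axis=0)           # P shards of shape [SEQ // P, HEADS, DIM]

# Ulysses-style exchange: regroup so each worker holds the FULL sequence
# for HEADS // P heads (this is what the all-to-all achieves across GPUs).
head_shards = [
    np.concatenate(
        [shard[:, p * HEADS // P:(p + 1) * HEADS // P] for shard in seq_shards],
        axis=0,
    )
    for p in range(P)
]

for h in head_shards:
    # Attention over the whole sequence can now run locally per head group.
    assert h.shape == (SEQ, HEADS // P, DIM)
```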
With Arctic Sequence Parallelism in place, we can finally unify the best of both worlds.
Introducing Shift Parallelism: One deployment, without the trade-offs
Unlike traditional parallelism approaches that statically optimize for one of the three inference metrics — response latency, generation speed or cost efficiency — Shift Parallelism dynamically adapts based on real-world traffic, delivering all three without requiring multiple deployments tuned for each.
Shift Parallelism works by shifting between two best-in-class modes (see Figure 2):
Tensor parallelism (TP) for small batches — maximizing output token generation speed (lower TPOT)
Arctic Sequence Parallelism (SP) for large batches — minimizing TTFT and achieving near-optimal throughput

This is possible because the KV cache memory layout remains invariant between TP and SP, allowing Shift Parallelism to switch modes seamlessly based on batch size and traffic patterns. More specifically, the KV cache layout does not change when shifting between SP and TP, as long as the product SP × TP remains equal to the total number of GPUs, P.
This is shown concretely in Figure 2, where Shift Parallelism switches between TP=2 and SP=2 (Arctic Ulysses) seamlessly across forward passes thanks to this KV cache invariance. The computation shown is a single transformer layer with four attention heads running on two GPUs. In both TP=2 and SP=2, each GPU computes two of the four attention heads. The computation and data mapping for attention are identical across TP and SP, allowing Shift Parallelism to switch between the two based on the size of the input.
Furthermore, by carefully mapping tensor parallel ranks to GPUs, we can ensure that the small parameter shards required on a GPU when using a large TP (TP=8, for example) are already part of the larger parameter shards present in that GPU needed to support a large SP (SP=8, for example).
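A toy way to see the invariance (shapes and names are illustrative assumptions, not Arctic Inference internals): with P GPUs, both TP=P and SP=P shard the KV cache by attention head, so each GPU already holds exactly the cache slice it needs in either mode, while data parallelism stores entire requests per GPU and would require reshuffling.

```python
# Per-GPU KV cache shape for one layer, illustrative only.
# P GPUs, H attention heads, sequence length S, head dimension D.
P, H, S, D = 2, 4, 1024, 128

# Tensor parallelism (TP=P): each GPU holds H // P heads for every token.
kv_shape_tp = (S, H // P, D)

# Arctic Sequence Parallelism (SP=P): attention is also computed head-wise
# per GPU, so the cache is sharded in exactly the same way.
kv_shape_sp = (S, H // P, D)

# Data parallelism (DP=P): each GPU holds entire requests, i.e. all H heads.
kv_shape_dp = (S, H, D)

assert kv_shape_tp == kv_shape_sp   # switching TP <-> SP needs no data movement
assert kv_shape_tp != kv_shape_dp   # switching TP <-> DP would require reshuffling
```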
The result: a single deployment that optimizes simultaneously for TTFT, TPOT and combined throughput — mitigating the costly trade-offs that limit traditional inference systems (see Figure 3).

With Shift Parallelism, enterprises are no longer forced to choose between a latency-optimized or throughput-optimized deployment. Table 1 summarizes the latency-vs.-throughput trade-offs of the existing parallelism strategies discussed above and how Shift Parallelism mitigates them.

How Arctic Inference addresses real-world enterprise inference challenges
Beyond Shift Parallelism, Arctic Inference includes a suite of advanced optimizations that target critical bottlenecks in enterprise AI workloads — from decoding and prefill inefficiencies to underoptimized embedding inference.
Below, we highlight how Arctic Inference solves key real-world challenges, with links to deeper technical blogs and papers.
Advancing speculative decoding for real-world generation
Existing speculative decoding solutions are limited when it comes to real-world use: They do not leverage repetitive patterns in LLM generation; they lack optimized system implementations; and draft models such as EAGLE in vLLM and SGLang do not support inputs longer than 4,000 tokens, making them impractical.
Arctic Inference delivers the fastest speculative decoding in vLLM by combining suffix decoding with highly optimized, lightweight draft models, targeting both repetitive and non-repetitive generation patterns in real-world use cases. The result is up to 4x faster generation for agentic workloads (with repetitive patterns) and 2.8x faster generation for conversational and coding workloads (without repetitive patterns). Read more on how this works.
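For intuition, here is a toy sketch of suffix-based drafting: propose the tokens that followed an earlier occurrence of the longest recent suffix, and let the target model verify them. This is illustrative only; Arctic Inference's suffix decoding implementation is far more optimized and is combined with a learned LSTM draft model for non-repetitive text.

```python
def suffix_draft(tokens: list[int], max_suffix: int = 8, num_draft: int = 4) -> list[int]:
    """Propose draft tokens by finding an earlier occurrence of the longest
    recent suffix and copying whatever followed it (illustrative only)."""
    for n in range(min(max_suffix, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Search the history (excluding the suffix itself), most recent match first.
        for i in range(len(tokens) - n - 1, -1, -1):
            if tokens[i:i + n] == suffix:
                continuation = tokens[i + n:i + n + num_draft]
                if continuation:
                    return continuation
    return []

# Repetitive generation lets the drafter copy prior continuations.
history = [5, 6, 7, 8, 9, 5, 6, 7]
print(suffix_draft(history))  # -> [8, 9, 5, 6]; the target model then verifies these drafts
```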
Solving redundant prefill computation with SwiftKV
In enterprise workloads, prefill often accounts for over 90% of total compute. Yet open source frameworks such as vLLM, SGLang and TRT-LLM lack the optimizations to reduce this cost — leading to wasted resources on long inputs with minimal output.
SwiftKV reuses hidden states from earlier transformer layers to eliminate redundant computation during KV cache generation — reducing prefill compute by up to 50% without compromising accuracy. This results in up to 2x higher throughput for enterprise workloads with long prompts. To learn more about SwiftKV, please see our paper and blog post on the topic.
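To illustrate the control flow (a heavily simplified sketch under assumed names and shapes, not the actual SwiftKV model code): the prompt is run through the first layers as usual, and the hidden state at the SwiftKV cut-off layer is reused to produce the K/V projections for all remaining layers, so their attention and MLP compute is skipped during prefill.

```python
import numpy as np

NUM_LAYERS, SWIFTKV_LAYER, HIDDEN, KV_DIM = 8, 4, 64, 16

# Stand-ins for trained weights: per-layer transformations and KV projections.
rng = np.random.default_rng(0)
layer_weights = [rng.standard_normal((HIDDEN, HIDDEN)) / np.sqrt(HIDDEN) for _ in range(NUM_LAYERS)]
kv_proj = [rng.standard_normal((HIDDEN, 2 * KV_DIM)) / np.sqrt(HIDDEN) for _ in range(NUM_LAYERS)]

def prefill_kv(prompt_hidden: np.ndarray) -> list[np.ndarray]:
    """Build the KV cache for a prompt, skipping full compute in later layers."""
    h = prompt_hidden
    kv_cache = []
    for layer in range(NUM_LAYERS):
        if layer < SWIFTKV_LAYER:
            h = np.tanh(h @ layer_weights[layer])   # full layer compute
            kv_cache.append(h @ kv_proj[layer])
        else:
            # SwiftKV idea: reuse the hidden state from the cut-off layer to
            # produce K/V for the remaining layers, skipping their attention/MLP.
            kv_cache.append(h @ kv_proj[layer])
    return kv_cache

prompt = rng.standard_normal((128, HIDDEN))          # 128 prompt tokens
cache = prefill_kv(prompt)
print(len(cache), cache[0].shape)                    # 8 layers of (128, 32) K/V entries
```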
Solving embedding bottlenecks to enable over 1.5 million tokens/sec per GPU
Snowflake processes trillions of tokens per month across both real-time and batch embedding workloads. But when we benchmarked embedding models using vLLM, we uncovered performance bottlenecks — slow serialization, sequential tokenization and low GPU utilization — that left hardware severely underused.
To fix this, we optimized Arctic Inference with vectorized serialization, parallel tokenization and multi-instance GPU execution. As a result, it delivers 16x faster embedding inference than vLLM on short sequences and 4.2x faster on long sequences, while outperforming TEI by 2.4x on short sequences and matching it on longer ones. Learn more in our sister blog post on embedding throughput optimizations.
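Of these optimizations, parallel tokenization is the simplest to illustrate. The sketch below is illustrative, not the Arctic Inference implementation (which is integrated into vLLM's serving path): it splits a batch of documents across worker processes using the bge-base-en-v1.5 tokenizer from the appendix benchmarks.

```python
from concurrent.futures import ProcessPoolExecutor
from transformers import AutoTokenizer

MODEL = "BAAI/bge-base-en-v1.5"  # one of the embedding models from the appendix

def tokenize_chunk(texts: list[str]) -> list[list[int]]:
    # Each worker loads its own tokenizer instance (cheap for fast tokenizers).
    tok = AutoTokenizer.from_pretrained(MODEL)
    return tok(texts, truncation=True)["input_ids"]

def parallel_tokenize(texts: list[str], workers: int = 8) -> list[list[int]]:
    chunk = (len(texts) + workers - 1) // workers
    chunks = [texts[i:i + chunk] for i in range(0, len(texts), chunk)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(tokenize_chunk, chunks)
    return [ids for part in results for ids in part]

if __name__ == "__main__":
    docs = ["example sentence"] * 10_000
    print(len(parallel_tokenize(docs)))
```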
Bringing it all together: Proving Arctic Inference is best in class
Here, we share core results demonstrating that Arctic Inference is the fastest and most cost-effective open source inference system for enterprise AI. (For technical readers, we also include a detailed evaluation methodology in the appendix at the end of this post.)
Arctic Inference is simultaneously the fastest and most cost-effective open source inference system

Figure 1 shows that Arctic Inference simultaneously achieves the highest throughput (lowest cost) and the lowest completion time — all in one deployment — outperforming the best open source systems optimized for each metric individually2. More specifically, Arctic Inference combines Shift Parallelism with speculative decoding and SwiftKV to achieve:
3.4x faster request completion and 1.06x higher throughput compared to SoTA throughput-optimized deployment (TP=1, DP=8)
1.7x higher throughput and 1.28x faster request completion compared to SoTA latency-optimized deployment (TP=8, DP=1)
In Figure 1, the latency-optimized and throughput-optimized configurations for vLLM, SGLang and TRT-LLM use TP=8, DP=1 and TP=1, DP=8, respectively, along with the best speculative decoding solutions available for each framework (see the evaluation methodology in the appendix for details). These experiments were run on data sets generated using real-world production traces to compute throughput and a mixture of ShareGPT, HumanEval and SWEBench to measure latency. The results are therefore representative of performance achievable in real-world deployments.
Achieving the elusive trifecta: Quicker response, higher throughput and faster generation

Response latency, generation speed and combined throughput are the three core pillars of inference system performance. Figure 4 shows that Arctic Inference outperforms the best open source systems optimized for each metric individually — achieving the elusive trifecta all in one deployment. More specifically, even when compared to the best deployment across vLLM, SGLang and TRT-LLM using bespoke configurations optimized for individual metrics, Arctic Inference with just a single deployment achieves:
2.25x faster response time (prefill throughput per request)
1.75x faster generation per request
on-par combined throughput
This is possible due to the symbiosis among Shift Parallelism, our optimized speculative decoding implementation and SwiftKV, which all work together in Arctic Inference. The combination of Shift Parallelism and speculative decoding enables Arctic Inference to achieve the fastest generation per request. Similarly, the combination of Shift Parallelism and SwiftKV enables Arctic Inference to achieve both the highest prefill speed, resulting in the fastest response times, and the highest throughput.
For details on the data sets used to produce these results, see the evaluation methodology in the appendix.
16x higher throughput when scaling vLLM embeddings

Figure 5 shows that Arctic Inference can process over 1.4 million tokens per second not only for long sequences but also for short ones, which are notoriously difficult to optimize.
By vectorizing data serialization and parallelizing tokenization, Arctic Inference helps ensure that the majority of computation time is spent on the actual embedding computation.
As a result, Arctic Inference can achieve:
16x higher throughput than vLLM on short sequences and 4.2x higher throughput on long sequences
2.4x higher throughput than Text Embeddings Inference (TEI) on short sequences and at parity for long sequences
Furthermore, Arctic Inference supports running multiple instances of the same embedding model on a single GPU to allow better saturation of GPU resources when using small but powerful embedding models such as the snowflake-arctic-embed model family. You can read more about this in our blog.
Adapting to real-world traffic without a latency-throughput trade-off

Figure 6 shows that Shift Parallelism can adapt to real-world traffic, simultaneously delivering the lowest response latency (TTFT), while achieving the fastest generation (lowest TPOT), and near-optimal cost efficiency (total throughput), compared to both throughput-optimized (DP only) and latency-optimized (TP only) solutions. More specifically, Shift Parallelism achieves:
9x reduction in median TTFT compared to the next best solution (1355ms → 148ms)
1.6x reduction in median TPOT compared to the next best solution (83ms → 51ms)
Max throughput regression less than 10% compared to the best solution3 (75.5K → 69.1K toks/sec)
Here, Shift Parallelism dynamically shifts to using TP=8 when traffic is low, achieving the lowest TPOT, while switching to SP=8 when traffic increases, allowing for up to 1.35x higher throughput than TP=8 to avoid spikes in TTFT and TPOT.
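The shifting decision itself is deliberately simple. Here is an illustrative sketch of the rule (names and the token-count threshold are assumptions mirroring the --shift-parallel-threshold flag shown in the next section):

```python
def choose_parallelism(batch_num_tokens: int, world_size: int = 8,
                       shift_threshold: int = 512) -> dict:
    """Pick the parallelism mode for the next forward pass (illustrative).

    Small batches -> tensor parallelism (fastest per-token generation).
    Large batches -> Arctic Sequence Parallelism (best TTFT and throughput).
    The KV cache layout is the same in both modes, so no data moves.
    """
    if batch_num_tokens <= shift_threshold:
        return {"tensor_parallel": world_size, "sequence_parallel": 1}
    return {"tensor_parallel": 1, "sequence_parallel": world_size}

print(choose_parallelism(64))    # low traffic   -> TP=8
print(choose_parallelism(4096))  # traffic spike -> SP=8
```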
Getting started with Arctic Inference
Arctic Inference is integrated with vLLM v0.8.4 using vLLM’s custom plugin feature, allowing us to develop and integrate inference optimizations quickly and make them available to the community.
Once installed, Arctic Inference automatically patches vLLM with all of the features from this blog post, and users can continue to use their familiar vLLM APIs and CLI. It’s easy to get started!
Install vLLM and Arctic Inference:
pip install arctic-inference[vllm]
Arctic Inference adds several configurations to vLLM. The example below runs Arctic Inference on eight GPUs with Shift Parallelism, an LSTM draft model, suffix decoding and SwiftKV:
```bash
vllm serve \
  Snowflake/Llama-3.1-SwiftKV-70B-Instruct \
  --quantization "fp8" \
  --tensor-parallel-size 1 \
  --ulysses-sequence-parallel-size 8 \
  --enable-shift-parallel \
  --shift-parallel-threshold 512 \
  --speculative-config '{
    "method": "arctic",
    "model": "Snowflake/Arctic-LSTM-Speculator-Llama-3.1-70B-Instruct",
    "num_speculative_tokens": 3,
    "enable_suffix_decoding": true
  }'
```
In the example above, Arctic Inference will use eight-way sequence parallelism for large batches and dynamically shift to eight-way tensor parallelism when the batch size falls below the 512-token threshold specified by --shift-parallel-threshold. In the speculative config, "method": "arctic" enables the LSTM speculator along with the system optimizations described in this blog post, and "enable_suffix_decoding": true enables suffix decoding.
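Because Arctic Inference is a vLLM plugin, the server exposes vLLM's standard OpenAI-compatible API. A minimal client sketch (assuming the default port 8000 and the model name from the serve command above):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; the API key is unused but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Snowflake/Llama-3.1-SwiftKV-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize what Shift Parallelism does."}],
    max_tokens=128,
    stream=True,  # stream tokens to observe TTFT and TPOT directly
)
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```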
For more detailed information on how to use Arctic Inference and the set of models that are supported, please check out this link.
Citation
@misc{arctic-inference,
  author = {Samyam Rajbhandari and Mert Hidayetoglu and Aurick Qiao and Ye Wang and Juncheng Yang and Jeff Rasley and Michael Wyatt and Yuxiong He},
  title = {Arctic Inference with Shift Parallelism: The Fastest Open Source Inference System for Enterprise AI},
  year = {2025},
  month = {May},
  day = {28},
  howpublished = {\url{https://www.snowflake.com/en/engineering-blog/arctic-inference-shift-parallelism}}
}
Acknowledgments
We would like to thank Jaeseong Lee and Gabriele Oliaro for their contributions to speculative decoding optimizations in Arctic Inference.
We would like to thank Flex Wang, Jerry Luo, Seth Li, Ricardo Aravena, Allen Woo and Vincent Chan for their continued support in bringing our research to production and into Snowflake Cortex AI.
Appendix
Evaluation methodology
Hardware: All experiment results presented in this blog post, unless otherwise stated, were run on an 8xH200 GPU node, leveraging all the GPUs within the node.
Models: Meta Llama 3.3 70B for generative AI; the Arctic embedding model and bge-base-en-v1.5 for embedding.
Request completion latency and TPOT measurements: Unless otherwise stated, all request completion latency and TPOT measurements were done by sending one request at a time and averaging over a mixed data set consisting of ShareGPT, HumanEval and SWEBench, comprising short conversations, coding tasks and long agentic workflows. For all baselines we used latency-optimized configs with TP=8 and the best available open source speculative decoding approach supported by each baseline.
Combined throughput measurements: Unless otherwise stated, for throughput measurements all requests were sent at the same time and measurements were taken at steady state. For Figure 1, we constructed a realistic data set with input and output lengths sampled from our Snowflake Cortex AI production logs, allowing us to create a data set representative of the enterprise workloads we see at Snowflake. For all other experiments, we used an input length of 2,000 tokens and an output length of 250 tokens, to match the average 10:1 ratio between input and output we observe in our production logs. For all baselines we used throughput-optimized configs with TP=1 and manually tuned the batch size to achieve the highest throughput.
Prefill throughput measurements: Unless otherwise stated, prefill throughput measurements were performed using a single request with a 4,000-token context length. We found that context lengths smaller than 4,000 tokens did not saturate the GPU, while lengths longer than 4,000 tokens did not increase prefill throughput significantly.
Open source baselines: vLLM (v0.8.4), TRT-LLM (v0.18.2), SGLang (v0.4.6)
Open source speculative decoding baselines: EAGLE-based speculative decoding offers the best latency for SGLang and vLLM, but it only supports sequences shorter than 4,000 tokens and crashes when longer sequences are sent. Hence we could not use it for our realistic data set mix, which consisted of both short and longer sequences. Due to the limitations of EAGLE, we leveraged NGRAM speculation in vLLM and no speculation in SGLang, as anything else could not support the real-world use cases we see in production. For TRT-LLM, despite our best effort, we could not get it to work with any speculative decoding in a consistent way, and therefore we reported the numbers without speculative decoding.
Arctic Inference+vLLM: Unless otherwise stated, for all experiments referred to as Arctic Inference, we used a single config combining Shift Parallelism shifting between SP=8 (Arctic Ulysses) and TP=8, with SwiftKV optimizations, and speculative decoding optimizations that combine suffix decoding with our LSTM draft model. We ran Arctic Inference on top of vLLM (v0.8.4).
1 Snowflake Llama models and embedding models in Snowflake Cortex AI.
2 Latency-optimized and throughput-optimized configurations for vLLM, SGLang and TRT-LLM use TP=8, DP=1 and TP=1, DP=8, respectively, along with the best available speculative decoding for each framework. These experiments were run on data sets generated using real-world production traces to compute throughput, and a mixture of ShareGPT, HumanEval and SWEBench to measure latency. The results are therefore representative of performance achievable in real-world deployments. For more details, see the evaluation methodology in the appendix.
3 vLLM does not allow measurements in real time, so the combined throughput as a function of time was obtained based on request start time, TTFT and generation throughput. As request arrival and first token response may not always align with the measurement time window, the calculated numbers are not always precise. However, since each parallelism config will have a similar margin of error, the relative measurements across different parallelism configurations are still very meaningful.