Blog/Batch Inference at Scale
MAY 20, 2026/8 min read

Batch Inference at Scale

Kaleb Dickerson +2

Batch inference is a cornerstone of modern AI. As the full spectrum of artificial intelligence — from traditional predictive models to generative AI — reshapes how organizations derive value from data, the ability to run inference at scale has become essential. As a unified data platform, Snowflake delivers batch inference in a simple and flexible way — from AI Functions like AI_COMPLETE, with instant access to Snowflake-hosted models, to Snowflake ML batch inference in SQL for teams that require full control over model choice.

However, many customers — particularly those migrating from non-SQL platforms — require a form of batch inference that is decoupled from SQL, better suited for processing files, unstructured data and operating at very large scale. To address this need, Snowflake ML now supports job-based batch inference, enabling inference to run as a dedicated, distributed workload on Snowpark Container Services (SPCS).

This job-based approach is well suited for applying your own single-tenant models at scale across multimodal data sets, with support for GPU acceleration and seamless integration into existing ML pipelines. It also simplifies migration by enabling teams to bring large-scale batch inference workloads into Snowflake without rearchitecting them around SQL-first patterns. Batch inference jobs further streamline operations by consolidating complex workflows into a single API call — allowing large, asynchronous inference workloads to run directly against registered models on GPU-backed compute, with results written back to Snowflake. Built on SPCS, this approach also provides strong cost efficiency through optimized, on-demand resource utilization.

If you want the broader architectural picture, start with the intro post on the Snowflake Model Serving. This post focuses specifically on async batch inference. Please refer to this article for comparison of various offerings in Snowflake Model Serving.

Introducing Snowflake batch inference jobs

You can run batch inference jobs on any of your registered models. For guidance on how to bring models into Snowflake, refer to the Model Registry documentation. Inference for both traditional predictive models and large language models is unified under a single API.

Take the scenario of a user performing nightly summarization on millions of support tickets stored within Snowflake. In this case, the user opted for the Qwen3-4B model available via Hugging Face, as it is not included in the suite of Snowflake-hosted models. The model can be imported into Snowflake as follows:

from snowflake.ml.model.models import huggingface
from snowflake.ml.registry import Registry

reg = Registry(session=session)

model = huggingface.TransformersPipeline(
    task="text-generation",
    model="Qwen/Qwen3-4B-Instruct-2507",
    compute_pool_for_log = "SYSTEM_COMPUTE_POOL_CPU",
)

mv = reg.log_model(model, model_name="QWEN3_4B", version_name="V1")

To specify the input to the model we leveraged powerful Snowpark DataFrame:

query = """
SELECT ([{
    'role': 'user',
    'content': [{
         'type': 'text',
         'text': 'Summarize following support ticket: ' || ticket_text}]
    }]) AS messages,
    ticket_id,
FROM support_tickets
WHERE status = 'OPEN'
"""
input_df = session.sql(query)

From there, batch inference jobs are just a single call:

from snowflake.ml.model.batch import JobSpec, OutputSpec

job = model_version.run_batch(
    X=input_df,
    compute_pool="my_gpu_pool",
    output_spec=OutputSpec(stage_location="@ML.PUBLIC.TICKET_SUMMARIES/nightly/"),
)

OutputSpec says where results land in Snowflake. JobSpec is optional; reach for it when defaults need tuning (replicas, GPUs, async, batch sizes). The documentation goes into more detail on the API.

Multimodal support

An increasing number of batch inference workloads involve multimodal inputs, including tasks such as image classification for products or documents, information extraction from scanned files and PDFs, audio transcription and labeling, and video description or classification. These workloads often leverage vision-language models that operate across combined text and media inputs.

Batch inference jobs provide a unified DataFrame-based interface for both structured and unstructured data. For unstructured workloads, one or more input columns contain stage paths that reference files rather than scalar values. The InputSpec.column_handling configuration specifies which columns should be dereferenced and defines the encoding expected by the model. Inputs can be sourced from both internal and external stages, enabling flexible data access across environments.

Take a product catalog enrichment workload as an example. The input consists of a column of stage paths, each referencing a product image, and the model is an image-to-text model that generates descriptions from those images. The run_batch invocation follows the same pattern as before, with the addition of an InputSpec that specifies how the job should interpret and process the IMAGE column.

from snowflake.ml.utils.stage_file import list_stage_files

# Returns a single-column DataFrame of fully qualified stage paths:
# | IMAGE_PATH                                          |
# | @my_db.my_schema.image_stage/path/product1.jpg      |
# | @my_db.my_schema.image_stage/path/product2.jpg      |
# | ...                                                 |
input_df = list_stage_files(
    session, "@my_db.my_schema.image_stage/path", column_name="IMAGES"
)
job = model_version.run_batch(
    X=input_df,
    input_spec=InputSpec(
        column_handling={
            "IMAGES": {
                "input_format": InputFormat.FULL_STAGE_PATH,
                "convert_to": FileEncoding.BYTES,
            }
        }
    ),
    ...
)

The catalog example above demonstrates an end-to-end image-to-text workflow. In addition, stage files can be referenced directly within inputs that follow the OpenAI Chat Completions API–compatible format for LLMs, with file loading handled automatically — for example, in image-and-text-to-text scenarios. The public documentation also includes comprehensive examples covering audio and vision-language use cases.

Inside the batch inference job

The API is intentionally simple, requiring only a DataFrame, a compute pool and an output stage. The client primarily serves as a packaging layer, while execution occurs entirely server-side. Under the hood, batch inference jobs run on Ray, an open source framework for distributed AI workloads. Ray provides actor-based parallelism and efficient data pipelining, enabling high hardware utilization without the need for a custom distributed runtime.

Execution flow

The run_batch API acts as a lightweight client wrapper. When invoked, the following steps occur:

  1. Snowflake materializes the input DataFrame and writes it to a stage as Parquet files — a columnar format that Ray Data can read natively and partition efficiently across workers. The associated warehouse is used only for this step and automatically suspends afterward; no warehouse remains active during inference.
  2. A job is provisioned on Snowpark Container Services (SPCS) to execute the workload.
  3. Within the job, the primary node initializes as the Ray head node. Additional replicas (if configured) are provisioned, discover the head node via SPCS service discovery, and join the cluster as worker nodes. Processing begins as soon as resources are available.
  4. Worker nodes are assigned tasks, read the staged Parquet data, perform inference independently, and write results to the designated output stage.
  5. Upon completion, the head node writes a _SUCCESS marker, and the job terminates. All provisioned resources are automatically deallocated.

This execution model enables scalable, efficient and fully managed batch inference without requiring users to manage distributed infrastructure.

Figure 1: Snowflake batch inference architecture.
Figure 1: Snowflake batch inference architecture.

Why Snowpark Container Services (SPCS) and Ray

Snowpark Container Services (SPCS) provides the foundation for batch inference jobs, offering GPU-accelerated compute essential for modern deep learning frameworks and large language models. It supports fine-grained resource configuration across CPU, GPU and memory, enabling workloads to be tailored to specific needs, while delivering strong cost efficiency compared to warehouse-based execution.

Ray serves as the distributed execution framework, providing actor-based parallelism and efficient data pipelining. This enables independent scaling and high utilization of both CPUs and GPUs, significantly improving throughput.

Efficient model loading

Model weights, often at gigabyte scale, are expensive to load repeatedly. Batch inference jobs address this by ensuring each worker loads model weights once and reuses them across multiple batches.

For most models, Ray actors load and retain the weights directly. For vLLM workloads, actors forward requests to a co-located vLLM server that manages the weights. In both cases, the load cost is amortized across the job.

Dynamic use of inference engine

Batch inference jobs support two execution patterns: in-process inference and server-based inference.

For most models, in-process execution — where the model runs within the same process — delivers the best performance. Internal evaluations showed that using a sidecar inference server introduces significant overhead from networking and serialization, reducing GPU utilization.

vLLM workloads are an exception. Because LLM inference involves longer-running requests, the relative overhead of network communication becomes negligible. A server-based approach also enables a unified architecture across batch and online inference, reducing operational complexity and aligning with common industry practices.

Ready for production deployment

Effective production workflows prioritize four key operational areas.

Unified multimodal processing

While many batch products focus on tabular or JSON data, modern requirements include image classification, audio transcription and video description. Snowflake supports these diverse workloads through a single API. The run_batch method processes both structured DataFrames and columns containing stage paths for media files. Native support for internal and external stages eliminates the need for manual data ingestion prior to inference.

Performance optimization

The JobSpec configuration allows for granular control over advanced scaling parameters:

  • num_workers: Defines worker processes per node.
  • replicas: Determines parallel node count.

Additional parameters include GPU allocation, memory limits and max_batch_rows.

Optimization strategies:

  • Low CPU utilization: Increase num_workers to maximize node efficiency before adding replicas.
  • GPU saturation: Increase replicas for compute-bound tasks, or adjust max_batch_rows if the accelerator is underutilized between batches.

Workflow integration

Jobs integrate with Snowflake Tasks using BatchInferenceTask, enabling automated execution within complex directed acyclic graphs (DAGs).

from snowflake.ml.model.batch import BatchInferenceTask, OutputSpec

with dag:
    ...
    
    batch_inference_task = BatchInferenceTask(
        "batch_inference_task",
        model_version=my_model_version,
        X=input_df,
        compute_pool="my_gpu_pool",
        output_spec=OutputSpec(
            # This gives each run its own output subdirectory, so scheduled runs don't collide.
            base_stage_location="@ML.PUBLIC.TICKET_SUMMARIES/"
        ),
    )

    data_prep_task >> batch_inference_task >> publish_summaries_task  # chain into a DAG

Error handling

Inference exceptions trigger total job failure; partial success is not supported. Since results are written incrementally, downstream processes must verify the presence of the _SUCCESS marker before consumption. Default SaveMode.ERROR prevents data corruption by checking for existing files, while SaveMode.OVERWRITE facilitates automated retries by clearing the output location.

Key takeaways

  • One API for everything: run_batch() is a single call that handles structured DataFrames and unstructured data (images, audio, video) against any registered model (both traditional and LLMs). The serving architecture is chosen automatically.
  • Data securely stays in Snowflake: No exports to external object stores, no reingestion. Data, models, orchestration and results all stay within the Snowflake perimeter.
  • Multimodal native: Stage paths referencing images/audio/video are automatically dereferenced and encoded via InputSpec.column_handling, supporting both internal and external stages.
  • Flexible scaling and hardware choices: Scaling is managed through JobSpec, offering granular controls (for example, GPUs, multi-node replicas and batch-size tuning), with hardware configurations provided by Snowpark Container Services (SPCS).
  • Workflow automation: Integration with Snowflake Tasks is achieved through BatchInferenceTask enabling automated batch inference job execution within complex directed acyclic graphs (DAGs).

Closing

If you are evaluating ML in Snowflake and already have batch inference workloads in mind, the fastest way to try this is the batch inference sample notebooks. Pick the one closest to your workload (structured scoring, image classification or vision-language) and swap in your own stage and model.

To learn more and start building with Snowflake, check out more Snowflake ML resources and visit the Snowflake Model Registry documentation today.

Subscribe to our blog newsletter

Get the best, coolest and latest delivered to your inbox each week

Where Data Does More