
Distributed Text Classification with Mango HPO and Snowflake DPF

Priya Joseph

Overview

Tags: Snowflake, Python, scikit-learn, Mango
This guide demonstrates building a production-grade distributed text classification pipeline using:

Component        Technology        Purpose
HPO              Mango             Bayesian hyperparameter optimization
Vectorization    TF-IDF            Text-to-numeric feature conversion
Classifier       MLPClassifier     Multi-layer neural network
Distribution     Snowflake DPF     Parallel processing across partitions
Data             CFPB Complaints   Real-world consumer financial complaints

Source Files

File                  Description
src/setup.sql         Database, schema, stages, and table DDL
src/cfpb_pipeline.py  Complete Python pipeline implementation
src/requirements.txt  Python package dependencies

Prerequisites

  • Snowflake account with Container Runtime enabled
  • Access to a compute pool (CPU nodes)
  • Python 3.10+ environment
  • CFPB complaints CSV file

What You'll Learn

  • Implement Bayesian hyperparameter optimization with Mango
  • Use TF-IDF for text vectorization in distributed environments
  • Leverage Snowflake DPF for parallel ML training
  • Process unstructured text data at scale

What You'll Build

  • End-to-end text classification pipeline for consumer complaint routing
  • Distributed training across multiple product category partitions
  • Optimized MLP classifier with tuned hyperparameters

Architecture

The pipeline follows a distributed architecture optimized for Snowflake's container runtime:

Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│                        SNOWFLAKE CONTAINER RUNTIME                          │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────────────────────────────────────────┐     │
│  │   CFPB      │    │           DISTRIBUTED PARTITION FUNCTION        │     │
│  │  Complaints │───▶│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐│     │
│  │   Table     │    │  │Product 1│ │Product 2│ │Product 3│ │Product N││     │
│  └─────────────┘    │  │ Worker  │ │ Worker  │ │ Worker  │ │ Worker  ││     │
│                     │  │         │ │         │ │         │ │         ││     │
│                     │  │TF-IDF + │ │TF-IDF + │ │TF-IDF + │ │TF-IDF + ││     │
│                     │  │Mango HPO│ │Mango HPO│ │Mango HPO│ │Mango HPO││     │
│                     │  │  + MLP  │ │  + MLP  │ │  + MLP  │ │  + MLP  ││     │
│                     │  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘│     │
│                     └───────┼───────────┼───────────┼───────────┼─────┘     │
│                             │           │           │           │           │
│                             ▼           ▼           ▼           ▼           │
│                     ┌─────────────────────────────────────────────────┐     │
│                     │              ML_ARTIFACTS_STAGE                 │     │
│                     │   (Models, Metrics, Best Hyperparameters)       │     │
│                     └─────────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────────────────────────┘

Data Flow

  1. Load: CFPB complaints loaded from CSV to Snowflake table
  2. Partition: Data partitioned by PRODUCT column (18 categories)
  3. Distribute: DPF assigns each partition to a worker node
  4. Process: Each worker runs TF-IDF + Mango HPO + MLP training
  5. Store: Results saved to stage and results table
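Steps 1-3 can be simulated locally with pandas to see what DPF's partitioning semantics amount to. The rows below are hypothetical stand-ins for the CFPB_COMPLAINTS table:

```python
import pandas as pd

# Toy frame standing in for the CFPB_COMPLAINTS table (hypothetical rows)
df = pd.DataFrame({
    "PRODUCT": ["Debt collection", "Mortgage", "Debt collection", "Mortgage"],
    "CONSUMER_COMPLAINT_NARRATIVE": ["late fees charged", None,
                                     "repeated calls", "escrow error"],
})

# Steps 1-2: keep rows that have a narrative, then partition by PRODUCT
valid = df.dropna(subset=["CONSUMER_COMPLAINT_NARRATIVE"])
partitions = dict(tuple(valid.groupby("PRODUCT")))

# Step 3: DPF hands each partition to a separate worker node;
# locally we just iterate over them
for product, part in partitions.items():
    print(product, len(part))
```

In the real pipeline each partition is processed on its own node in parallel rather than in this sequential loop.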

Snowflake Environment Setup

Run the SQL setup script to create all required database objects.

Download Setup Script

Quick Setup

Copy and run in a Snowflake worksheet:

-- Create database and schema
CREATE DATABASE IF NOT EXISTS CFPB_ML_DB;
CREATE SCHEMA IF NOT EXISTS CFPB_ML_DB.TEXT_ANALYTICS;
USE DATABASE CFPB_ML_DB;
USE SCHEMA TEXT_ANALYTICS;

-- Create stages
CREATE STAGE IF NOT EXISTS ML_ARTIFACTS_STAGE DIRECTORY = (ENABLE = TRUE);
CREATE STAGE IF NOT EXISTS RAW_DATA_STAGE DIRECTORY = (ENABLE = TRUE);

Full Setup: See src/setup.sql for complete DDL including tables, file formats, and verification queries.

Tables Created

Table                   Purpose
CFPB_COMPLAINTS         Source data: consumer complaint narratives
CLASSIFICATION_RESULTS  Output: training metrics and best parameters

Load CFPB Data

Load the CFPB complaints data into Snowflake.

Option 1: Via Stage (Recommended for Large Files)

-- Upload the file to the stage (PUT runs from SnowSQL or the Snowflake CLI,
-- not from a web worksheet)
PUT file:///path/to/complaints100K.csv @RAW_DATA_STAGE/;

-- Load into the table
COPY INTO CFPB_COMPLAINTS
FROM @RAW_DATA_STAGE/complaints100K.csv
FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
ON_ERROR = 'CONTINUE';

Option 2: Via Python

import pandas as pd
from snowflake.snowpark.context import get_active_session

df = pd.read_csv('/path/to/complaints100K.csv')
session = get_active_session()
session.write_pandas(df, 'CFPB_COMPLAINTS', auto_create_table=True, overwrite=True)

Verify Data

SELECT PRODUCT, COUNT(*) AS RECORDS
FROM CFPB_COMPLAINTS
WHERE CONSUMER_COMPLAINT_NARRATIVE IS NOT NULL
GROUP BY PRODUCT ORDER BY RECORDS DESC;

Pipeline Code

The complete pipeline is implemented in src/cfpb_pipeline.py.

Download Files

Dependencies

Install required packages (from src/requirements.txt):

pip install "arm-mango>=1.3.0" "scikit-learn>=1.3.0" pandas numpy

Configuration

The pipeline uses a configuration dataclass defined in src/cfpb_pipeline.py:

from dataclasses import dataclass

@dataclass
class PipelineConfig:
    database: str = 'CFPB_ML_DB'
    schema: str = 'TEXT_ANALYTICS'
    partition_column: str = 'PRODUCT'
    text_column: str = 'CONSUMER_COMPLAINT_NARRATIVE'
    label_column: str = 'ISSUE'
    max_features: int = 5000
    hpo_iterations: int = 15

Mango HPO Search Space

PARAM_SPACE = {
    'hidden_layer_sizes': [(64,), (128,), (64, 32), (128, 64), (256, 128)],
    'alpha': [0.0001, 0.001, 0.01, 0.1],
    'learning_rate_init': [0.001, 0.01, 0.1],
    'activation': ['relu', 'tanh'],
    'max_iter': [200, 500, 1000]
}

DPF Worker Function

The process_text_partition() function in src/cfpb_pipeline.py runs on each worker node:

  1. Filters valid records with text narratives
  2. Applies TF-IDF vectorization (5000 features, bigrams)
  3. Runs Mango Bayesian HPO (15 iterations)
  4. Trains final MLP with best parameters
  5. Returns accuracy, F1 score, and hyperparameters
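The full worker lives in src/cfpb_pipeline.py; the condensed sketch below covers steps 1, 2, 4, and 5, with the Mango HPO step (3) stubbed out and fixed hyperparameters standing in for the tuned ones:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def process_text_partition(df,
                           text_col="CONSUMER_COMPLAINT_NARRATIVE",
                           label_col="ISSUE"):
    # Step 1: keep only rows that actually have a narrative and a label
    df = df.dropna(subset=[text_col, label_col])

    # Step 2: TF-IDF with unigrams + bigrams, capped at 5,000 features
    vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2),
                                 stop_words="english")
    X = vectorizer.fit_transform(df[text_col])
    y = df[label_col]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=42)

    # Steps 3-4: Mango HPO elided; fixed hyperparameters stand in here
    clf = MLPClassifier(hidden_layer_sizes=(64,), alpha=0.001,
                        max_iter=300, random_state=42)
    clf.fit(X_tr, y_tr)

    # Step 5: report metrics (weighted F1 handles class imbalance)
    pred = clf.predict(X_te)
    return {
        "accuracy": accuracy_score(y_te, pred),
        "f1": f1_score(y_te, pred, average="weighted"),
    }
```

Under DPF, this function receives only the rows for a single PRODUCT partition, so each worker fits its own vectorizer and classifier independently.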

Run the Pipeline

Execute the pipeline in a Snowflake Notebook with Container Runtime.

Step 1: Create Notebook

  1. Navigate to Notebooks in Snowflake
  2. Create new notebook with Container Runtime enabled
  3. Select a CPU compute pool

Step 2: Upload Pipeline Code

Upload src/cfpb_pipeline.py to your notebook environment or paste inline.

Step 3: Install Dependencies

!pip install arm-mango scikit-learn

Step 4: Run Pipeline

from cfpb_pipeline import run_distributed_text_classification

results = run_distributed_text_classification()

Expected Output

======================================================================
DISTRIBUTED TEXT CLASSIFICATION WITH MANGO HPO + DPF
======================================================================
📦 Database: CFPB_ML_DB
📦 Schema: TEXT_ANALYTICS
📊 Total records: 100,000
📊 Records with text: 87,432
📍 Partitions: 18

🚀 Starting distributed HPO across partitions...

======================================================================
RESULTS BY PARTITION
======================================================================
✅ Credit reporting...
   Records: 24,532 | Classes: 47
   Accuracy: 68.42% | F1: 0.6721
   Best: hidden=(128,64), alpha=0.001

✅ Debt collection
   Records: 18,234 | Classes: 23
   Accuracy: 72.15% | F1: 0.7089

======================================================================
SUMMARY
======================================================================
📊 Partitions Processed: 18
📊 Average Accuracy: 69.8%
📊 Average F1 Score: 0.6834
⏱️  Parallel Speedup: ~14.7x

Performance Considerations

DPF Scaling Formula

DPF Time ≈ Overhead (20-30s) + max(Partition Times)
Sequential Time = Sum of Partition Times
Speedup ≈ Sequential Time / DPF Time, bounded above by min(num_partitions, num_nodes)
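The formula can be checked numerically. A rough estimator (an illustrative model, not the actual DPF scheduler) that greedily packs partitions onto nodes:

```python
def estimate_speedup(partition_times, num_nodes, overhead=25.0):
    """Estimate DPF speedup over sequential execution.

    Partitions are greedily assigned (longest first) to the least-loaded
    node, mirroring how uneven partitions cap the achievable speedup.
    """
    loads = [0.0] * num_nodes
    for t in sorted(partition_times, reverse=True):
        loads[loads.index(min(loads))] += t
    dpf_time = overhead + max(loads)
    sequential_time = sum(partition_times)
    return sequential_time / dpf_time

# 18 equal 60-second partitions on 18 nodes: close to the ideal 18x,
# minus the fixed scheduling overhead
print(round(estimate_speedup([60.0] * 18, num_nodes=18), 1))
```

With the 20-30s overhead included, short partitions land well under the ideal bound, which is consistent with the ~14.7x observed above for 18 partitions.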

Scaling Examples

Nodes  Partitions  Speedup
4      4           ~4x
8      18          ~8x
18     18          ~18x

Optimization Tips

Parameter       Recommendation
max_features    5,000-10,000 balances accuracy and speed
hpo_iterations  15-25 for a thorough search
min_samples     100+ per partition
Partitioning    Choose a column with a balanced distribution

Conclusion and Resources

You've built a distributed text classification pipeline using Mango HPO and Snowflake DPF.

What You Learned

  • ✅ Bayesian hyperparameter optimization with Mango
  • ✅ TF-IDF vectorization for text data
  • ✅ Snowflake DPF for parallel ML training
  • ✅ Processing CFPB complaints at scale

Source Files Reference

File                  Description
src/setup.sql         Complete SQL DDL
src/cfpb_pipeline.py  Full Python implementation
src/requirements.txt  Dependencies

Related Resources

Resource       Link
Mango HPO      arm-mango on PyPI
Snowflake DPF  Process Data Across Partitions
CFPB Database  Consumer Complaints
Snowflake ML   ML Functions Overview

Next Steps

  • Add model persistence to Snowflake Model Registry
  • Build inference pipeline with saved models
  • Add Cortex Search for similarity-based routing
  • Create Streamlit dashboard for visualization
Updated 2026-01-22

This content is provided as is and is not maintained on an ongoing basis. It may be out of date with current Snowflake instances.