

name: app_environment
channels:
  - snowflake
dependencies:
  - matplotlib=*
  - modin=0.28.1
  - seaborn=*
  - snowflake=*


git clone [email protected]:Snowflake-Labs/sfguide-data-engineering-pipelines-with-pandas-on-snowflake.git


{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "1dde02fa-0044-4b20-b7bb-10f1a5b3fabb",
      "metadata": {
        "collapsed": false,
        "name": "cell1"
      },
      "source": [
        "### Data Engineering Pipelines with pandas on Snowflake\n",
        "\n",
        "This demo is using the [Snowflake Sample TPC-H dataset](https://docs.snowflake.com/en/user-guide/sample-data-tpch) that should be in a shared database named `SNOWFLAKE_SAMPLE_DATA`. You can run this notebook in a Snowflake Notebook. \n",
        "\n",
        "During this demo you will learn how to use [pandas on Snowflake](https://docs.snowflake.com/developer-guide/snowpark/python/snowpark-pandas) to:\n",
        "* Create datframe from a Snowflake table\n",
        "* Aggregate and transform data to create new features\n",
        "* Save the result into a Snowflake table\n",
        "* Create a serverless task to schedule the feature engineering\n",
        "\n",
        "pandas on Snowflake is delivered through the Snowpark pandas API as part of the Snowpark Python library (preinstalled with Snowflake Notebooks), which enables scalable data processing of Python code within the Snowflake platform. \n",
        "\n",
        "Start by adding neccessary libraries using the `Packages` dropdown, the additional libraries needed for this notebook is: \n",
        "* `modin` (select version 0.28.1)\n",
        "* `snowflake`\n",
        "* `matplotlib`\n",
        "* `seaborn`"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "4039104e-54fc-411e-972e-0f5a2d884595",
      "metadata": {
        "codeCollapsed": false,
        "collapsed": false,
        "language": "python",
        "name": "cell2"
      },
      "outputs": [],
      "source": [
        "import streamlit as st\n",
        "import matplotlib.pyplot as plt\n",
        "import seaborn as sns"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "d66adbc4-2b92-4d7d-86a5-217ee78e061f",
      "metadata": {
        "codeCollapsed": false,
        "collapsed": false,
        "language": "python",
        "name": "cell3"
      },
      "outputs": [],
      "source": [
        "# Snowpark Pandas API\n",
        "import modin.pandas as spd\n",
        "# Import the Snowpark pandas plugin for modin\n",
        "import snowflake.snowpark.modin.plugin\n",
        "\n",
        "from snowflake.snowpark.context import get_active_session\n",
        "# Create a snowpark session\n",
        "session = get_active_session()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "811abc04-f6b8-4ec4-8ad4-34af28ff8c31",
      "metadata": {
        "codeCollapsed": false,
        "collapsed": false,
        "language": "python",
        "name": "cell4"
      },
      "outputs": [],
      "source": [
        "# Name of the sample database and the schema to be used\n",
        "SOURCE_DATA_PATH = \"SNOWFLAKE_SAMPLE_DATA.TPCH_SF1\"\n",
        "SAVE_DATA_PATH = \"SNOW_PANDAS_DE_QS.DATA\"\n",
        "# Make sure we use the created database and schema for temp tables etc\n",
        "session.use_schema(SAVE_DATA_PATH)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "0721a789-63a3-4c90-b763-50b8a1e69c92",
      "metadata": {
        "collapsed": false,
        "name": "cell5"
      },
      "source": [
        "We will start by creating a number of features based on the customer orders using the line items.\n",
        "\n",
        "Start with the `LINEITEM` table to create these features so we will start by creating a Snowpark Pandas Datframe aginst it, select the columns we are interested in and then show info about the dataframe, the shape and the first rows."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "2a091f1b-505f-4b61-9088-e7fd08e16f83",
      "metadata": {
        "codeCollapsed": false,
        "collapsed": false,
        "language": "python",
        "name": "cell6"
      },
      "outputs": [],
      "source": [
        "lineitem_keep_cols = ['L_ORDERKEY', 'L_LINENUMBER', 'L_PARTKEY', 'L_RETURNFLAG', 'L_QUANTITY', 'L_DISCOUNT', 'L_EXTENDEDPRICE']\n",
        "lineitem_df = spd.read_snowflake(f\"{SOURCE_DATA_PATH}.LINEITEM\")[lineitem_keep_cols]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "f360d4de-21f4-4723-9778-ceb8683c81c8",
      "metadata": {
        "codeCollapsed": false,
        "collapsed": false,
        "language": "python",
        "name": "cell7"
      },
      "outputs": [],
      "source": [
        "st.dataframe(lineitem_df.head())"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "be5d37e2-e990-4e71-b762-41a64845955f",
      "metadata": {
        "codeCollapsed": false,
        "collapsed": false,
        "language": "python",
        "name": "cell8"
      },
      "outputs": [],
      "source": [
        "# Get info about the dataframe\n",
        "lineitem_df.info()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "618f45b8-a2a8-4d08-967e-945d2329335e",
      "metadata": {
        "codeCollapsed": false,
        "collapsed": false,
        "language": "python",
        "name": "cell9"
      },
      "outputs": [],
      "source": [
        "print(f\"DataFrame shape: {lineitem_df.shape}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "e53fea0b-2f36-4662-a382-98938a74f2c2",
      "metadata": {
        "collapsed": false,
        "name": "cell10"
      },
      "source": [
        "## Data Cleaning - Filtering and Aggregation\n",
        "\n",
        "Taking a look at different values for `L_RETURNFLAG` and include only line items that was delivered (`N`) or returned (`R`)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "2f326c13-ed4c-4e6f-b40e-7e8338c270c4",
      "metadata": {
        "codeCollapsed": false,
        "collapsed": false,
        "language": "python",
        "name": "cell11"
      },
      "outputs": [],
      "source": [
        "print(lineitem_df.L_RETURNFLAG.value_counts())"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "122cb06a-3a08-4d32-8864-4c8ff8c046b4",
      "metadata": {
        "collapsed": false,
        "name": "cell12"
      },
      "source": [
        "Add a filter to the dataframe"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "7f9c56b7-b2db-4591-97ce-451876e9b9a6",
      "metadata": {
        "codeCollapsed": false,
        "collapsed": false,
        "language": "python",
        "name": "cell13"
      },
      "outputs": [],
      "source": [
        "print(f\"Before Filtering: {len(lineitem_df)} rows\")\n",
        "spd_lineitem = lineitem_df[lineitem_df['L_RETURNFLAG'] != 'A']\n",
        "print(f\"After Filtering: {len(spd_lineitem)} rows\")\n",
        "st.dataframe(spd_lineitem.head())"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "1f802173-162f-4dff-8567-ade65b9f57f1",
      "metadata": {
        "collapsed": false,
        "name": "cell14"
      },
      "source": [
        "To track the actual discount a customer gets per order, we need to calculate that in a new column by taking the product of the amount of discount (`L_DISCOUNT`), numbers sold (`L_QUANTITY`), and the price of item (`L_EXTENDEDPRICE`)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "58f45f3d-3633-424e-b777-467a2ba0b22d",
      "metadata": {
        "codeCollapsed": false,
        "collapsed": false,
        "language": "python",
        "name": "cell15"
      },
      "outputs": [],
      "source": [
        "spd_lineitem['DISCOUNT_AMOUNT'] = spd_lineitem['L_DISCOUNT'] * spd_lineitem['L_QUANTITY'] * spd_lineitem['L_EXTENDEDPRICE']\n",
        "st.dataframe(spd_lineitem.head())"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "6ec9d862-e957-42b9-9d86-03f2ad3501f7",
      "metadata": {
        "collapsed": false,
        "name": "cell16"
      },
      "source": [
        "Now we want to compute the aggregate of items and discount amount, grouped by order key and return flag.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "578cbdf7-a655-416b-87da-417f7edd35bb",
      "metadata": {
        "codeCollapsed": false,
        "collapsed": false,
        "language": "python",
        "name": "cell17"
      },
      "outputs": [],
      "source": [
        "# Aggregations we want to do\n",
        "column_agg = {\n",
        "                'L_QUANTITY':['sum'], # Total Items Ordered  \n",
        "                'DISCOUNT_AMOUNT': ['sum'] # Total Discount Amount\n",
        "             }\n",
        "\n",
        "# Apply the aggregation\n",
        "spd_lineitem_agg = spd_lineitem.groupby(by=['L_ORDERKEY', 'L_RETURNFLAG'], as_index=False).agg(column_agg)\n",
        "\n",
        "# Rename the columns\n",
        "spd_lineitem_agg.columns = ['L_ORDERKEY', 'L_RETURNFLAG', 'NBR_OF_ITEMS', 'TOT_DISCOUNT_AMOUNT']\n",
        "st.dataframe(spd_lineitem_agg.head())"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "00dd1299-9bb2-4aba-9f37-b04ca3639892",
      "metadata": {
        "collapsed": false,
        "name": "cell18"
      },
      "source": [
        "## Data Transformation - Pivot and reshape\n",
        "\n",
        "We want to separate the `NBR_OF_ITEMS` and `TOT_DISCOUNT_AMOUNT` by `L_RETURNFLAG` so we have one column for each uinique `L_RETURNFLAG` value.  \n",
        "Using the **pivot_table** method will give us one column for each unique value in `RETURN_FLAG`"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "7f586e8a-017b-4672-80a1-bcc9430a87c3",
      "metadata": {
        "codeCollapsed": false,
        "collapsed": false,
        "language": "python",
        "name": "cell19"
      },
      "outputs": [],
      "source": [
        "# This will make L_ORDERKEY the index\n",
        "spd_lineitem_agg_pivot_df = spd_lineitem_agg.pivot_table(\n",
        "                                values=['NBR_OF_ITEMS', 'TOT_DISCOUNT_AMOUNT'], \n",
        "                                index=['L_ORDERKEY'],\n",
        "                                columns=['L_RETURNFLAG'], \n",
        "                                aggfunc=\"sum\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "38dd144f-b18b-4673-b8c0-7db6d237ae59",
      "metadata": {
        "collapsed": false,
        "name": "cell20"
      },
      "source": [
        "The **pivot_table** method returns subcolumns and by renaming the columns we will get rid of those, and have one unique columns for each value."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "6166f8b0-fc8c-451e-9780-3e1f634ccbdd",
      "metadata": {
        "codeCollapsed": false,
        "collapsed": false,
        "language": "python",
        "name": "cell21"
      },
      "outputs": [],
      "source": [
        "spd_lineitem_agg_pivot_df.columns = ['NBR_OF_ITEMS_N', 'NBR_OF_ITEMS_R','TOT_DISCOUNT_AMOUNT_N','TOT_DISCOUNT_AMOUNT_R']\n",
        "# Move L_ORDERKEY back to column\n",
        "spd_lineitem_agg_pivot = spd_lineitem_agg_pivot_df.reset_index(names=['L_ORDERKEY'])\n",
        "st.dataframe(spd_lineitem_agg_pivot.head(10))"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "1657bbc7-caf2-461c-9302-6f8d2187e0af",
      "metadata": {
        "collapsed": false,
        "name": "cell22"
      },
      "source": [
        "## Combine lineitem with orders information\n",
        "\n",
        "Load `ORDERS` table and join with dataframe with transformed lineitem information."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "c910ac10-38b3-4aa4-a7d2-6321243a4a60",
      "metadata": {
        "codeCollapsed": false,
        "collapsed": false,
        "language": "python",
        "name": "cell23"
      },
      "outputs": [],
      "source": [
        "spd_order = spd.read_snowflake(f\"{SOURCE_DATA_PATH}.ORDERS\")\n",
        "# Drop unused columns \n",
        "spd_order = spd_order.drop(['O_ORDERPRIORITY', 'O_CLERK', 'O_SHIPPRIORITY', 'O_COMMENT'], axis=1)\n",
        "# Use streamlit to display the dataframe\n",
        "st.dataframe(spd_order.head())"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "97d52cd4-a71b-4c72-9137-accdf54b571b",
      "metadata": {
        "collapsed": false,
        "name": "cell24"
      },
      "source": [
        "Use **merge** to join the two dataframes"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "6aee6f94-f33b-4492-9f89-2808c05f07d4",
      "metadata": {
        "codeCollapsed": false,
        "collapsed": false,
        "language": "python",
        "name": "cell25"
      },
      "outputs": [],
      "source": [
        "# Join dataframes\n",
        "spd_order_items = spd_lineitem_agg_pivot.merge(spd_order,\n",
        "                                               left_on='L_ORDERKEY', \n",
        "                                               right_on='O_ORDERKEY', \n",
        "                                               how='left')"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "3adc0331-1879-452f-9cc6-dd69f6824974",
      "metadata": {
        "collapsed": false,
        "name": "cell26"
      },
      "source": [
        "Drop the `L_ORDERKEY`column, it has the same values as `O_ORDERKEY`"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "8504a44d-d687-4c8d-af78-4b802901a168",
      "metadata": {
        "codeCollapsed": false,
        "collapsed": false,
        "language": "python",
        "name": "cell27"
      },
      "outputs": [],
      "source": [
        "spd_order_items.drop('L_ORDERKEY', axis=1, inplace=True)\n",
        "st.write(f\"DataFrame shape: {spd_order_items.shape}\")\n",
        "st.dataframe(spd_order_items.head())"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "a8b050f9-77a9-460a-853b-888963e6a214",
      "metadata": {
        "collapsed": false,
        "name": "cell28"
      },
      "source": [
        "More aggregations grouped by customer (`O_CUSTKEY`)\n",
        "* Total items delivered by customer\n",
        "* Average items delivered by customer\n",
        "* Total items returned by customer\n",
        "* Average items returned by customer"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "36e32341-cc93-4b5d-a5f1-15a15d8ddf69",
      "metadata": {
        "codeCollapsed": false,
        "collapsed": false,
        "language": "python",
        "name": "cell29"
      },
      "outputs": [],
      "source": [
        "# Aggregations we want to do\n",
        "column_agg = {\n",
        "                'O_ORDERKEY':['count'], \n",
        "                'O_TOTALPRICE': ['sum' ,'mean', 'median'],\n",
        "                'NBR_OF_ITEMS_N': ['sum' ,'mean', 'median'],\n",
        "                'NBR_OF_ITEMS_R': ['sum' ,'mean', 'median'],\n",
        "                'TOT_DISCOUNT_AMOUNT_N': ['sum'],\n",
        "                'TOT_DISCOUNT_AMOUNT_R': ['sum']\n",
        "            }\n",
        "\n",
        "# Apply the aggregation\n",
        "spd_order_profile = spd_order_items.groupby(by='O_CUSTKEY', as_index=False).agg(column_agg)\n",
        "\n",
        "# Rename the columns\n",
        "spd_order_profile.columns = ['O_CUSTKEY', 'NUMBER_OF_ORDERS', 'TOT_ORDER_AMOUNT', 'AVG_ORDER_AMOUNT', 'MEDIAN_ORDER_AMOUNT', \n",
        "                             'TOT_ITEMS_DELIVERED', 'AVG_ITEMS_DELIVERED', 'MEDIAN_ITEMS_DELIVERED', \n",
        "                             'TOT_ITEMS_RETURNED', 'AVG_ITEMS_RETURNED', 'MEDIAN_ITEMS_RETURNED',\n",
        "                             'TOT_DISCOUNT_AMOUNT_N', 'TOT_DISCOUNT_AMOUNT_R']\n",
        "st.dataframe(spd_order_profile.head())"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "daf0e441-43d1-4729-bc20-aea8f123befa",
      "metadata": {
        "collapsed": false,
        "name": "cell30"
      },
      "source": [
        "Calculate the total and average discount"
      ]
    }

Overview

The Security & Governance pillar focuses on building a secure and compliant Snowflake environment by implementing the principles of the Confidentiality, Integrity, and Availability (CIA) triad. The recommendations provided within this framework are based on the Shared Responsibility Model and focus on the security controls that are the customer's responsibility. The goal is to minimize security risks by securing the Snowflake perimeter, using the principle of least privilege for data access, responsibly deploying AI, and preparing for security incidents.

Principles

Secure the perimeter

A strong security perimeter is built using network, identity, and policy constructs. You should assess network traffic to and from your Snowflake accounts to ensure network and endpoint security are in place for all connections. For identities, leverage modern authentication and authorization flows and enforce conditional access policies for Snowflake accounts.

Governance

Governance in the security pillar is about enforcing, monitoring, and auditing data access, privacy, and compliance controls. This ensures that auditors can prove who can see what, and under what conditions. Broader governance concepts like data quality and lineage are addressed in other pillars.

Responsible AI

The rapid adoption of AI capabilities brings security concerns, particularly regarding the protection of sensitive data. It's crucial to use AI responsibly and ensure that AI deployments are secure within your Snowflake environment.

Assume Breach

You must prepare for security events by having established Cyber Security Incident Response Plans (CSIRP) specifically for your Snowflake accounts. This includes creating playbooks with actionable tasks for different incident scenarios and conducting regular training and tabletop exercises. Continuous monitoring of Snowflake accounts for anomalous behavior is also vital.

Recommendations

The following key recommendations are covered within each principle of Security & Governance:

Shared Responsibility Model

Reliability and security are shared responsibilities between the cloud provider, Snowflake, and the customer. The Shared Responsibility Model outlines this division of responsibilities and is the foundation for all security recommendations. It is essential for customers to understand, document, and operationalize these specific responsibilities to prevent gaps in security ownership. You can find more information about this model in the Snowflake documentation: https://www.snowflake.com/en/resources/report/snowflake-shared-responsibility-model/.

Network Security 

You can build a strong security perimeter by using network policies to control network traffic to your Snowflake account. Network policies are a crucial first step in securing your environment by allowing or denying user access based on their IP address. It is a best practice to define a clear set of network policies that restrict access to only trusted IP addresses or IP ranges. For more advanced network control, consider using features like Snowflake's Private Connectivity options.
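
As a minimal sketch of this recommendation (the policy name and IP ranges are placeholders, and the statements assume a role with sufficient privileges, such as SECURITYADMIN), a network policy can be created and attached at the account level from a Snowflake notebook session:

# Sketch: create an account-level network policy restricting access to trusted ranges.
# The policy name and IP ranges are illustrative placeholders.
from snowflake.snowpark.context import get_active_session

session = get_active_session()

session.sql("""
    CREATE NETWORK POLICY IF NOT EXISTS corp_network_policy
      ALLOWED_IP_LIST = ('192.0.2.0/24', '198.51.100.0/24')
      COMMENT = 'Allow only trusted corporate egress ranges'
""").collect()

# Attach the policy to the account so it applies to all users by default.
session.sql("ALTER ACCOUNT SET NETWORK_POLICY = corp_network_policy").collect()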

Authentication & Authorization

Authentication and authorization are fundamental to securing your Snowflake environment. You should use Role-Based Access Control (RBAC) to manage user permissions and ensure that each user has the minimum level of access required to perform their job duties, a principle known as least privilege. Modern authentication methods like federated authentication with SSO (Single Sign-On) should be used whenever possible to simplify user management and improve security.
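
The sketch below illustrates the least-privilege pattern with role-based grants; the role, warehouse, database, schema, and user names are placeholders for illustration.

# Sketch: a narrowly scoped read-only role granted only what an analyst needs.
from snowflake.snowpark.context import get_active_session

session = get_active_session()

for stmt in [
    "CREATE ROLE IF NOT EXISTS analyst_read_only",
    "GRANT USAGE ON WAREHOUSE analytics_wh TO ROLE analyst_read_only",
    "GRANT USAGE ON DATABASE sales_db TO ROLE analyst_read_only",
    "GRANT USAGE ON SCHEMA sales_db.reporting TO ROLE analyst_read_only",
    "GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.reporting TO ROLE analyst_read_only",
    # In a federated setup the role is mapped through IdP groups;
    # a direct grant to a user is shown here only for brevity.
    "GRANT ROLE analyst_read_only TO USER example_user",
]:
    session.sql(stmt).collect()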

Data Protection 

Data protection involves implementing controls to protect sensitive data at rest and in transit. Snowflake provides native features like Default Encryption at Rest and Client-Side Encryption. You can enhance this with Customer-Managed Keys (CMK) using Snowflake's integration with cloud provider key management services. Additionally, features such as Time Travel and Fail-safe provide data protection and recovery from logical errors by allowing you to restore data to a previous state. For more details, see the documentation on Time Travel.
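
As a minimal illustration of recovering from a logical error with Time Travel (the table name and the 30-minute offset are placeholders):

# Sketch: inspect and recover a prior state of a table with Time Travel.
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Inspect the table as it existed 30 minutes ago...
session.sql("""
    SELECT COUNT(*) FROM sales_db.reporting.orders AT(OFFSET => -60*30)
""").collect()

# ...and, if needed, clone that historical state into a new table for recovery.
session.sql("""
    CREATE OR REPLACE TABLE sales_db.reporting.orders_restored
    CLONE sales_db.reporting.orders AT(OFFSET => -60*30)
""").collect()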

Data Governance

Data governance ensures that data access, privacy, and compliance controls are enforced, auditable, and monitored. This includes the use of Dynamic Data Masking and Row Access Policies to control who can see what data under what conditions. Classification and Object Tracking are also critical for identifying and protecting sensitive data and maintaining data lineage.
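
A minimal sketch of Dynamic Data Masking, assuming a hypothetical customers table and a PII_STEWARD role that may see clear text:

# Sketch: mask an email column for everyone except a privileged role.
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Define a masking policy that reveals the value only to a privileged role.
session.sql("""
    CREATE MASKING POLICY IF NOT EXISTS pii_email_mask AS (val STRING)
    RETURNS STRING ->
      CASE WHEN CURRENT_ROLE() IN ('PII_STEWARD') THEN val ELSE '*** MASKED ***' END
""").collect()

# Attach the policy to the sensitive column.
session.sql("ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY pii_email_mask").collect()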

Incident Response 

Having a well-defined Cyber Security Incident Response Plan (CSIRP) is vital for preparing for security events. The plan should include playbooks with actionable tasks and should be regularly tested through tabletop exercises to ensure effectiveness. Regular monitoring for anomalous behavior in your Snowflake accounts is key to early detection and rapid response to potential incidents.
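
One simple starting point for such monitoring is a scheduled query over Account Usage; the 24-hour window and failure threshold below are illustrative only:

# Sketch: surface users and IPs with repeated failed logins in the last day.
from snowflake.snowpark.context import get_active_session

session = get_active_session()

failed_logins = session.sql("""
    SELECT user_name, client_ip, COUNT(*) AS failed_attempts
    FROM snowflake.account_usage.login_history
    WHERE is_success = 'NO'
      AND event_timestamp >= DATEADD('day', -1, CURRENT_TIMESTAMP())
    GROUP BY user_name, client_ip
    HAVING COUNT(*) >= 5
    ORDER BY failed_attempts DESC
""").collect()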

Secure the Perimeter

Overview

Securing the perimeter of a Snowflake environment is crucial for protecting data and maintaining business continuity. A well-architected framework for this involves a multi-layered approach that addresses network policies, identity and credential management, and authentication hardening. By implementing these controls, you can ensure that access to your data platform is strictly controlled, external threats are proactively blocked, and your organization is resilient against potential security breaches.

Recommendations

Hardening the Perimeter and Network Access

Implementing Network Policies is a critical first step to securing your Snowflake environment. These policies, such as Network ACLs, should be used to restrict connections to your data platform from only approved and trusted IP ranges. Proactively mitigating data exfiltration pathways is also essential; this involves securing external stages, restricting outbound connectivity, and actively monitoring for high-volume data egress. You can significantly reduce your attack surface by utilizing a combination of user-level and account-level network policies. Account-level policies prevent the use of stolen credentials from exploited IP addresses and ensure all users access the platform from known IPs. User-level policies can be used to prevent end-users from using service accounts and allow for enforced exceptions to account-level policies. For further information, see the Snowflake documentation on network policies.
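
The sketch below, with placeholder names and IPs, shows two of these controls: account parameters that close common data-egress paths, and a user-level network policy pinning a service account to known addresses.

# Sketch: reduce exfiltration pathways and pin a service account to known IPs.
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Require governed storage integrations for external stages and block ad hoc unloads.
session.sql("ALTER ACCOUNT SET REQUIRE_STORAGE_INTEGRATION_FOR_STAGE_CREATION = TRUE").collect()
session.sql("ALTER ACCOUNT SET REQUIRE_STORAGE_INTEGRATION_FOR_STAGE_OPERATION = TRUE").collect()
session.sql("ALTER ACCOUNT SET PREVENT_UNLOAD_TO_INLINE_URL = TRUE").collect()

# Pin a service account to its known IPs with a user-level network policy.
session.sql("""
    CREATE NETWORK POLICY IF NOT EXISTS etl_service_policy
      ALLOWED_IP_LIST = ('203.0.113.10')
""").collect()
session.sql("ALTER USER etl_service SET NETWORK_POLICY = etl_service_policy").collect()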

Identity and Credential Management

A robust security posture requires the elimination of static credentials across your organization. Instead, you should mandate the adoption of modern authentication flows like SAML, Key Pair, and OAuth for all identities, including human users, service accounts, and machine-to-machine processes. This approach minimizes the risk of credentials being compromised. It's also vital to establish a Privileged Access Management (PAM) program to govern, monitor, and strictly control access for administrative and high-privilege roles. This process helps to mitigate the risk of misuse of administrative access. A formal process should exist for granting, reviewing, and revoking high-privilege access, ideally with a tool to manage and monitor these administrative sessions.
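
For machine-to-machine access, a minimal sketch of key pair authentication with the Python connector (the account, user, warehouse, and key path are placeholders) looks like this:

# Sketch: connect with key pair authentication instead of a static password.
from cryptography.hazmat.primitives import serialization
import snowflake.connector

# Load the service account's private key (stored in a secrets manager in practice).
with open("/path/to/rsa_key.p8", "rb") as key_file:
    private_key = serialization.load_pem_private_key(key_file.read(), password=None)

private_key_der = private_key.private_bytes(
    encoding=serialization.Encoding.DER,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.NoEncryption(),
)

conn = snowflake.connector.connect(
    account="myorg-myaccount",
    user="ETL_SERVICE",
    private_key=private_key_der,
    warehouse="ETL_WH",
)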

Hardening Authentication

Enforcing Multi-Factor Authentication (MFA) is a key recommendation for hardening your Snowflake environment. Snowflake supports various MFA methods, including TOTP (Time-based One-Time Password) authenticator apps such as Microsoft Authenticator and Google Authenticator, as well as Passkeys and DUO, the only method with a "Push" notification option. To prevent SIM-swapping attacks, avoid SMS as a fallback by selecting the "Tablet" option in DUO instead of "Phone". Additionally, you should implement Workload Identity Federation (WIF), a service-to-service authentication method that lets workloads such as applications, services, or containers authenticate with Snowflake using their cloud provider's native identity system, such as AWS Identity and Access Management (IAM) roles, Microsoft Entra ID managed identities, and Google Cloud service accounts, to obtain an attestation that Snowflake can validate.
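
A minimal sketch of enforcing MFA enrollment through an authentication policy follows; the policy name and client scope are illustrative, and the statements assume the feature is available to your account and executing role.

# Sketch: require MFA enrollment for interactive logins covered by this policy.
from snowflake.snowpark.context import get_active_session

session = get_active_session()

session.sql("""
    CREATE AUTHENTICATION POLICY IF NOT EXISTS require_mfa_policy
      MFA_ENROLLMENT = REQUIRED
      CLIENT_TYPES = ('SNOWFLAKE_UI')
""").collect()

# Apply the policy account-wide.
session.sql("ALTER ACCOUNT SET AUTHENTICATION POLICY require_mfa_policy").collect()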

Data Governance

Overview

The Governance principle defines how sensitive data is accessed, protected, and proven compliant. In Snowflake, this means translating business rules into enforceable controls: masking and row policies, projection constraints, role hierarchies, and auditable evidence, so that teams can state with confidence who can see what, under what conditions, and how that access is monitored. Governance is most effective when it is automated and observable: tags drive policy attachment, Access History substantiates usage, and dashboards highlight drift and exceptions. The broader disciplines of lineage, quality, and continuous improvement are addressed across other pillars, but their outcomes are strengthened by a sound governance baseline here.

Desired outcome

A well‑architected governance posture lets organizations demonstrate that sensitive data is consistently classified and protected across environments, access is role‑based and least‑privilege, row‑level and projection rules limit exposure to the minimal necessary surface, AI/LLM interactions are scoped and logged, and regulatory obligations are mapped to Snowflake‑native controls with reproducible evidence. When governance operates in this manner, audit readiness becomes routine and does not hinder delivery.

Common anti-patterns

Weak governance often shows up as direct user‑to‑role grants without federation, unmasked PII in widely used datasets, manual access reviews that miss drift, agentic/LLM activity that is neither logged nor constrained, and uncontrolled role/object growth that accumulates technical debt. These patterns increase the likelihood of over‑exposure and complicate audits.

Benefits

A cohesive approach reduces breach and compliance risk, accelerates onboarding and offboarding through automated RBAC, improves trust in analytics and AI and shortens audit cycles by turning control checks into repeatable queries and evidence packs. Teams gain faster access to the right data because policies are pre‑bound and self‑enforcing.

Risks

Absent these practices, unauthorized access to sensitive data becomes more likely, regulatory obligations become hard to prove and model or agent outputs may inadvertently reveal protected information. Over time, unmanaged roles and objects increase operational drag and obscure ownership.

Recommendations

Start by defining a policy model in business terms: what constitutes sensitive information, who requires access, and what contextual factors must be considered (role, location, purpose). Express those rules as tags and policies in Snowflake. Classify sensitive columns, apply masking and row access policies that reference tags, and route consumption through curated views with explicit projection constraints. Integrate with an Identity Provider so users receive privileges via roles, not direct grants. Enable Access History and utilize Account Usage to create governance dashboards. Schedule drift detection to flag untagged columns, unused or over‑privileged roles, and detached policies. Finally, connect obligations (GDPR, HIPAA, PCI, SOX) to the specific Snowflake controls and the queries that produce evidence.
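
As one example of turning Access History into evidence (the table name and 30-day window are placeholders), a query like the following can feed a governance dashboard showing who read a sensitive object:

# Sketch: who queried the sensitive table, directly or through views, in the last 30 days?
from snowflake.snowpark.context import get_active_session

session = get_active_session()

reads = session.sql("""
    SELECT ah.user_name,
           obj.value:"objectName"::STRING AS object_name,
           COUNT(*)                        AS query_count
    FROM snowflake.account_usage.access_history AS ah,
         LATERAL FLATTEN(input => ah.base_objects_accessed) AS obj
    WHERE obj.value:"objectName"::STRING = 'SALES_DB.REPORTING.CUSTOMERS'
      AND ah.query_start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY 1, 2
    ORDER BY query_count DESC
""").collect()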

Recommendation checklist

  1. Define RBAC hierarchy
  2. Classify/tag sensitive datasets, bind masking & RAP policies
  3. Enable Access History and build audit dashboards
  4. Automate drift detection (orphan roles, untagged columns, policy detachment)
  5. Configure Cortex AI governance (prompt logging, redaction, agent scope)
  6. Build control-to-query library for regulatory evidence
  7. Review governance scorecard quarterly (coverage, exceptions, audit readiness)


Data Privacy & Protection

CLS (column‑level security)

Column‑level security protects sensitive attributes with masking policies that adapt to context. In practice, tags like SENSITIVITY and REGULATORY_SCOPE travel with columns and trigger masking automatically, while session attributes such as role, network location, or declared purpose inform whether data is shown in the clear or redacted. Deterministic masking can preserve joinability where required, and clear‑text access can be logged via secure UDF patterns when justified. In regulated scenarios, tokenization may be necessary; it should be paired with strict stewardship and evidence of necessity.
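
A minimal sketch of the tag-driven pattern described above (the database, schema, tag, policy, table, and role names are placeholders):

# Sketch: bind a masking policy to a sensitivity tag so tagged columns are masked automatically.
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# A sensitivity tag that travels with columns.
session.sql("CREATE TAG IF NOT EXISTS governance.tags.sensitivity ALLOWED_VALUES 'PII', 'PUBLIC'").collect()

# A masking policy that reveals clear text only to a privileged role.
session.sql("""
    CREATE MASKING POLICY IF NOT EXISTS governance.policies.mask_tagged_string AS (val STRING)
    RETURNS STRING ->
      CASE WHEN CURRENT_ROLE() IN ('PII_STEWARD') THEN val ELSE '*** MASKED ***' END
""").collect()

# Bind the policy to the tag: any string column tagged as PII inherits the mask.
session.sql("""
    ALTER TAG governance.tags.sensitivity
      SET MASKING POLICY governance.policies.mask_tagged_string
""").collect()

# Tagging a column now triggers masking without touching the policy again.
session.sql("ALTER TABLE customers MODIFY COLUMN email SET TAG governance.tags.sensitivity = 'PII'").collect()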

Design approach

Keep base tables secured and expose consumers to policy‑protected views. Separate administrative personas from analytical personas to avoid privilege creep and ensure least privilege.

Assessment prompts

Which roles can view clear PII, and under what context? How are exceptions time‑boxed, tracked, and reviewed?

RAP (row access policies)

Row access policies constrain visibility at the record level based on attributes like geography or line of business. Centralizing rules in secure UDFs and driving entitlements from reference tables makes policies explainable and maintainable. Performance should be validated against real workloads, and pruning/clustering strategies used where helpful.
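
A minimal sketch of the entitlement-table pattern (object names and the region column are placeholders):

# Sketch: a row access policy driven by an entitlement (mapping) table.
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Reference table holding the entitlements: which role may see which region.
session.sql("""
    CREATE TABLE IF NOT EXISTS governance.entitlements.region_access (
        role_name STRING,
        region    STRING
    )
""").collect()

# Row access policy that looks up the current role in the entitlement table.
session.sql("""
    CREATE ROW ACCESS POLICY IF NOT EXISTS governance.policies.region_rap
    AS (region_value STRING) RETURNS BOOLEAN ->
      EXISTS (
        SELECT 1 FROM governance.entitlements.region_access e
        WHERE e.role_name = CURRENT_ROLE()
          AND e.region    = region_value
      )
""").collect()

# Attach the policy to the table's region column.
session.sql("""
    ALTER TABLE sales_db.reporting.orders
      ADD ROW ACCESS POLICY governance.policies.region_rap ON (sales_region)
""").collect()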

Design Approach

Adopt a hub‑and‑spoke model: one policy per domain, so all child tables and views inherit the same semantics through tags.

Assessment Prompts

Are RAP rules managed from a single source of truth rather than embedded in views? What are the policy evaluation latency and cache hit rates under load?

Projection policy

Projection controls define which columns (or combinations of columns) may be selected together, preventing risky joins (e.g., date of birth with ZIP and gender). Express these rules in curated secure views and verify them in CI so prohibited projections never reach production. Pair projection constraints with masking and row access for layered protection.
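
A minimal sketch of a projection constraint on a quasi-identifier column (names are placeholders):

# Sketch: allow the column to be projected only by an approved role; other roles
# may still filter or join on it, but cannot select it in query output.
from snowflake.snowpark.context import get_active_session

session = get_active_session()

session.sql("""
    CREATE PROJECTION POLICY IF NOT EXISTS governance.policies.restrict_dob
    AS () RETURNS PROJECTION_CONSTRAINT ->
      CASE WHEN CURRENT_ROLE() IN ('PII_STEWARD')
           THEN PROJECTION_CONSTRAINT(ALLOW => true)
           ELSE PROJECTION_CONSTRAINT(ALLOW => false)
      END
""").collect()

session.sql("""
    ALTER TABLE customers
      MODIFY COLUMN date_of_birth SET PROJECTION POLICY governance.policies.restrict_dob
""").collect()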

Design Approach

Publish approved column bundles per persona and provide redaction views for common but risky analytics patterns.

Assessment Prompts

Which disallowed column combinations are blocked today? How are projection constraints tested and reported in CI pipelines?


Solution Architecture: Build a Visual AI model


Fig1: Install LandingLens Native App from Snowflake Marketplace

About the Architecture

This solution architecture helps you build and deploy Visual AI solutions in Snowflake for Healthcare & Life Sciences. Some of the use cases are described below. 

  • Measuring disease progression using classification
  • Counting cell colonies for viability using object detection
  • Detecting abnormalities in radiology and pathology images to aid diagnosis using classification
  • Visual prompting to fine-tune on in-house images
SNOWFLAKE CERTIFIED SOLUTION

This solution was created by an in-house Snowflake expert and has been verified to work with current Snowflake instances as of the date of publication.

Solution not working as expected? Contact our team for assistance.
