Data Ethics: Principles and Practices for Responsible Data Use

Data ethics helps organizations decide not only whether data use is legal, but whether it’s appropriate, proportionate and accountable across collection, storage, analysis, sharing and AI development.

Laurie MacPherson, Technical Editor, Snowflake
David Gaule, Technical Editor, Snowflake

Data Ethics Defined

Data ethics is the practice of applying moral principles to how data is collected, used, shared and governed across analytics, AI and business operations.

Data ethics helps organizations make defensible choices about data use before those choices are locked into pipelines, models, applications and shared data products.

Organizations care about data ethics because data use now shapes trust, risk and decision-making. Customers, employees, regulators and business partners increasingly expect data to be used in ways that are explainable, proportionate and aligned with its original purpose. 

As AI and ML systems increasingly turn data decisions into automated outcomes, ethical gaps in the data layer can scale quickly, especially when training data reflects bias, sensitive attributes enter pipelines without review or data is reused beyond its approved purpose.

What is data ethics?

Data ethics is the application of moral principles to decisions about how data is collected, stored, used and shared. It gives a governance program its values layer, helping teams decide not only what is legally required, but what is appropriate, proportionate and accountable when data moves through analytics, AI and business workflows. By translating ethical commitments into governance policies and platform controls, organizations can use data more confidently while reducing the risk of harm.

In practice, data ethics starts before the governance policy is written. An organization first defines what it will and will not do with data, including which uses require consent, which sensitive attributes should be minimized, which data sets should not be reused for AI training without review, and what audit trail should exist when a data product affects customers, employees or patients. These commitments then have to become governance controls, such as rules for collection, access, retention, sharing, masking and review.

The final step is enforcement. A policy that says sensitive demographic attributes should be restricted is more effective when the data platform can help identify those attributes, apply tags, support masking or row-level access policies, and provide visibility into downstream use and access activity.

Data ethics is related to, but distinct from, privacy, compliance and governance. 

  • Privacy focuses largely on protecting personal data from unauthorized access or misuse. 

  • Compliance defines legal obligations. 

  • Governance provides the roles, policies and technical controls that manage data across its lifecycle. 

  • Data ethics informs the choices behind those mechanisms, including what data should be collected, who should access it, how long it should be retained and when a new use requires additional human review.

AI has made the ethical consequences of data decisions much more visible. A biased training data set can affect hiring recommendations, credit decisions or healthcare triage workflows at scale. A customer attribute collected for one purpose can become an input into an automated decision. A model pipeline can reuse data in ways that are technically permitted but ethically difficult to justify. Data ethics helps organizations examine decisions like these before they become embedded in systems and difficult to inspect.

The EU AI Act includes data governance requirements for certain high-risk AI systems, including practices related to training, validation and testing data sets, data collection processes, data preparation, possible bias and the original purpose of personal data collection. NIST’s AI Risk Management Framework also connects AI governance to organizational risk practices through functions such as govern, map, measure and manage.

Principles of data ethics

Data ethics programs vary by organization, industry and regulatory environment, but most rest on a common set of principles. These principles help data teams, stewards, legal teams and business leaders make consistent decisions about data use before those decisions become embedded in pipelines, models or applications.

Responsible data use

Responsible data use is the operational commitment to collect, store and analyze data in ways that minimize harm, respect rights and serve clearly defined purposes. It turns ethical intent into decisions that can be applied at the pipeline level: what data enters a workflow, what fields are retained, which teams can access them and what downstream uses are allowed.

Four obligations usually sit at the center of responsible data use:

  • Lawful collection: Data is gathered with a valid legal basis, such as explicit consent or another approved justification.

  • Purpose limitation: Data is used only for the purpose that was stated, approved or reasonably expected.

  • Proportionality: Teams collect and retain only the data needed for the specific task.

  • Harm minimization: Organizations assess foreseeable downstream harms before deploying data products, analytics workflows or AI systems.
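As a rough illustration, purpose limitation and proportionality can be expressed as a check that runs before data enters a workflow. The sketch below is a minimal example under stated assumptions: `DatasetPolicy`, `review_request` and the sample data set and field names are hypothetical and are not part of any product API.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetPolicy:
    """Hypothetical record of what a data set has been approved for."""
    name: str
    approved_purposes: frozenset
    # Fields each approved purpose actually needs (proportionality).
    fields_by_purpose: dict = field(default_factory=dict)


def review_request(policy, purpose, requested_fields):
    """Return a list of issues; an empty list means the request passes both checks."""
    issues = []
    # Purpose limitation: the stated purpose must already be approved.
    if purpose not in policy.approved_purposes:
        issues.append(f"purpose '{purpose}' is not approved for {policy.name}")
        return issues
    # Proportionality: request only the fields the purpose needs.
    allowed = policy.fields_by_purpose.get(purpose, frozenset())
    extra = sorted(set(requested_fields) - allowed)
    if extra:
        issues.append(f"fields not needed for '{purpose}': {', '.join(extra)}")
    return issues


# Illustrative policy and requests (assumed values).
policy = DatasetPolicy(
    name="customer_orders",
    approved_purposes=frozenset({"billing"}),
    fields_by_purpose={"billing": frozenset({"customer_id", "amount"})},
)
ok = review_request(policy, "billing", ["customer_id", "amount"])
new_use = review_request(policy, "model_training", ["customer_id"])
too_much = review_request(policy, "billing", ["customer_id", "birth_date"])
```

A check like this does not decide whether a new purpose is acceptable. It only flags requests that need the human review described above.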

Data dignity

Data dignity is the principle that people should be able to understand and influence how data about them is used. It treats data as something connected to a person’s autonomy and context, not merely as an asset to be extracted, combined and reused.

This is different from privacy, though the two overlap. Privacy helps protect personal data from unauthorized access or misuse. Data dignity asks whether an authorized use still respects the person the data represents. For example, patient records may be stored securely and accessed by approved users, but using those records to train a commercial AI system without meaningful awareness or consent may still raise dignity concerns.

In practice, data dignity influences consent design, purpose specification, data minimization and data subject rights. Consent should be meaningful rather than buried in terms. Purpose statements should be specific enough to guide future use. Data collection should be limited to what the task requires. And individuals should have appropriate ways to understand, contest or influence how their data is used, especially when that use affects access to services, opportunities or decisions.

Transparency and open data governance

Transparency gives people inside and outside an organization a way to understand how data is used. It can include lineage records, data provenance, catalog metadata, model documentation, governance approvals and audit logs. The goal of transparency is to make the right information visible to the right reviewers, stewards, regulators, partners or data consumers.

Open data governance applies the principle of transparency to data that’s made publicly accessible or shared for research, accountability or collaboration. Open data can support trust, academic research and democratic accountability, but unrestricted openness can also expose PII, proprietary business logic or sensitive public-sector information. Ethical open data programs balance accessibility with risk controls. The FAIR principles — findable, accessible, interoperable and reusable — are often used to guide responsible open data practices. 

In practice, this includes using data catalogs to publish rich metadata without exposing sensitive fields, making governance audit logs accessible to regulators for oversight, and maintaining versioned data sets with clear provenance documentation so users can understand how data was created, transformed and updated over time.

Fairness

Fairness focuses on whether data-driven systems produce outcomes that are appropriate, justifiable and not systematically harmful to protected or vulnerable groups. In analytics and AI, fairness depends on both the data and the system that uses it.

A hiring model trained on historical recruiting data, for example, may reproduce harmful past patterns if the training data reflects earlier exclusion or uneven access to opportunity. Or a healthcare triage model may perform differently across populations if the underlying data underrepresents certain groups. 

Fairness is not a single technical setting — different fairness definitions can conflict with one another. For example:

  • Demographic parity asks whether favorable outcomes occur at similar rates across groups. 

  • Equalized odds asks whether true positive and false positive rates are similar across groups.

  • Individual fairness asks whether similar individuals receive similar treatment. 

Teams must choose the fairness standard that fits the decision context and document why that choice is appropriate.
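To make the conflict between definitions concrete, the following sketch computes per-group selection rates (for demographic parity) and true and false positive rates (for equalized odds) on a small, made-up set of binary predictions. The function names and the sample data are assumptions for illustration only, and individual fairness is not covered because it requires a similarity measure between individuals.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Return selection rate, TPR and FPR for each group (binary labels 0/1)."""
    counts = defaultdict(lambda: {"n": 0, "pred_pos": 0, "pos": 0, "tp": 0, "neg": 0, "fp": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts[group]
        c["n"] += 1
        c["pred_pos"] += pred
        if truth == 1:
            c["pos"] += 1
            c["tp"] += pred
        else:
            c["neg"] += 1
            c["fp"] += pred
    rates = {}
    for group, c in counts.items():
        rates[group] = {
            "selection_rate": c["pred_pos"] / c["n"],
            "tpr": c["tp"] / c["pos"] if c["pos"] else float("nan"),
            "fpr": c["fp"] / c["neg"] if c["neg"] else float("nan"),
        }
    return rates


def max_gap(rates, key):
    """Largest difference in a rate between any two groups."""
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)


# Made-up example: both groups are selected at the same rate,
# but group B has a higher false positive rate.
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]
groups = ["A"] * 4 + ["B"] * 4

rates = group_rates(y_true, y_pred, groups)
parity_gap = max_gap(rates, "selection_rate")  # demographic parity view
tpr_gap = max_gap(rates, "tpr")                # equalized odds, part 1
fpr_gap = max_gap(rates, "fpr")                # equalized odds, part 2
```

In this example the selection-rate gap is zero while the false positive rate gap is not, so the same predictions satisfy demographic parity and fail equalized odds. Which gap matters more depends on the decision context, which is why the choice needs to be documented.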

Data ethics risks in AI and analytics

Ethical risks often appear when data moves from one context to another. These risks are especially visible in AI and ML, where data choices can shape outputs at scale.

Data bias

Data bias is a systematic error in a data set that can cause analytics or model outputs to skew in a particular direction. Bias can come from sampling gaps, historical inequities, measurement errors, labeling practices or business processes that were never designed for the new use.

This makes bias a governance problem, not only an ML problem. By the time a data scientist trains a model, many bias-related decisions may already be embedded in the data: which populations were included, which fields were collected, which labels were applied, which records were excluded and which historical outcomes were treated as ground truth.

Data ethics requires review early in the lifecycle. Teams need to understand the origin of the data, the purpose for which it was collected, the known gaps in representation and the assumptions behind labels or outcomes. In AI contexts, this aligns with regulatory and risk management expectations around training data quality, representativeness and bias mitigation.

Algorithmic fairness

Algorithmic fairness focuses on model outputs rather than the data set alone. It asks how the system behaves once it uses that data to make or support decisions.

Practitioners often evaluate fairness at multiple checkpoints:

  • Before training, they may audit data composition to understand whether relevant populations are represented. 

  • During model evaluation, they may test outputs for disparate impact by cohort. 

  • In production, they may monitor outcomes to detect drift, changing error rates or unexpected disparities.

Removing a sensitive field doesn’t necessarily remove the risk since other variables can act as proxies. For example, a model may not use race, gender, disability status or income directly, but location, education history, purchasing behavior or employment patterns may still correlate with protected attributes. Ethical AI governance therefore requires both data-level controls and output-level monitoring.
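One crude way to screen for proxies is to ask how much better a protected attribute can be guessed once a candidate feature is known. The sketch below compares a baseline guess (always the most common protected value) with a guess made per feature value. It is a simple heuristic rather than a statistical test, and `proxy_strength` and the sample values are hypothetical.

```python
from collections import Counter, defaultdict


def proxy_strength(feature_values, protected_values):
    """Return (baseline_accuracy, accuracy_given_feature).

    baseline_accuracy: always guess the most common protected value.
    accuracy_given_feature: guess the most common protected value
    within each distinct feature value.
    """
    n = len(protected_values)
    baseline = Counter(protected_values).most_common(1)[0][1] / n
    by_value = defaultdict(Counter)
    for feature, protected in zip(feature_values, protected_values):
        by_value[feature][protected] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return baseline, correct / n


# Made-up example: postal code is not a protected attribute,
# but it partly reveals group membership.
postal_code = ["10001"] * 4 + ["20002"] * 4
group = ["x", "x", "x", "y", "y", "y", "y", "x"]

baseline, informed = proxy_strength(postal_code, group)
```

A large jump from the baseline to the informed accuracy suggests the feature may be acting as a proxy and deserves review. What counts as "large" is a judgment call the team would need to set and document.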

Algorithmic fairness decisions also require documentation. If a team chooses equalized odds rather than demographic parity, that choice reflects assumptions about the decision context, the acceptable trade-offs and the harms the organization is trying to reduce. Data ethics helps ensure those choices are not hidden inside technical workflows.

Jennifer Belissent, Principal Data Strategist at Snowflake, explains how responsible AI depends on the data foundation: “Success in the new AI landscape depends not only on this shiny new tool, but on the foundations on which it will be built. The foundation for the successful and responsible use of AI and gen AI must be based on data security, data diversity and organizational maturity.”

Biased or dignity-violating model outputs

Bias and dignity risks can converge when sensitive data enters AI workflows without sufficient review. A model may generate outputs that disadvantage a group, expose information that should have been minimized or use personal data in ways that do not match the original purpose of collection.

This is why governance controls are important. Row-level access policies, masking policies and object tags can help control which demographic, health, financial or behavioral attributes reach model training pipelines. Lineage can help teams trace whether a sensitive field moved from a governed source into a derived table, feature set or application. Access history can show who queried a data set and when.

Controls don’t resolve every ethical question, but they create the conditions for review, enforcement and accountability. Without them, data ethics depends on individual judgment at each handoff. With them, ethical commitments can be translated into repeatable rules.

COMMON PITFALL

Organizations invest in tagging sensitive data and defining classifications, but stop short of connecting those tags to access controls, masking, retention and review workflows. As a result, data is labeled correctly but still handled incorrectly.
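A simple audit can surface this gap by listing columns that carry a sensitive tag but have no masking policy attached. The sketch below assumes column metadata has already been exported into plain Python objects. `ColumnMetadata`, the tag names and the policy names are hypothetical and do not reflect a specific catalog's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnMetadata:
    """Hypothetical export of catalog metadata for one column."""
    table: str
    column: str
    tags: frozenset
    masking_policy: Optional[str] = None


# Assumed tag vocabulary for sensitive data.
SENSITIVE_TAGS = frozenset({"pii", "health", "demographic"})


def unenforced_sensitive_columns(columns):
    """Return 'table.column' names that are tagged sensitive but not masked."""
    return [
        f"{c.table}.{c.column}"
        for c in columns
        if c.tags & SENSITIVE_TAGS and c.masking_policy is None
    ]


columns = [
    ColumnMetadata("customers", "email", frozenset({"pii"}), "mask_email"),
    ColumnMetadata("customers", "ethnicity", frozenset({"demographic"})),
    ColumnMetadata("orders", "amount", frozenset({"finance"})),
]
gaps = unenforced_sensitive_columns(columns)
```

The same pattern could be extended to retention and review workflows by adding the relevant fields to the metadata record.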

How organizations operationalize data ethics

Data ethics has to reach the workflows where data is collected, queried, shared and reused. In practice, organizations operationalize it through a few connected practices.

Document values commitments

Organizations typically start by defining what they will and will not do with data. These commitments should be specific enough to guide decisions. A general statement that the organization uses data responsibly is less useful than a clear commitment to minimize sensitive data collection, avoid secondary use without review or document fairness choices for automated decision systems.

These commitments also need owners. Data stewardship gives named people responsibility for domains, definitions, quality, access and policy adherence. Legal, compliance, security and business teams may help define the commitments, but stewards help apply them to actual tables, fields, pipelines and data products.

Encode commitments into governance policies

Once commitments are defined, organizations need governance policies that specify what must happen. A data minimization commitment might become a retention policy that deletes or archives records after a defined period. A dignity commitment might become a consent review process for new uses of personal data. A fairness commitment might require training data composition audits before model deployment.

Policies should connect to the data lifecycle. Collection policies define what data can be gathered and under what legal or ethical basis. Access policies define who can use sensitive fields. Retention policies define how long data can remain available. Sharing policies define when data can be published, exchanged or made available to partners. Review policies define when a new use requires approval.
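As a minimal sketch of how a retention policy might be checked, the function below flags records collected before a cutoff date. The record layout, the `collected_on` field and the 365-day window are assumptions for illustration. A real policy would also specify whether flagged records are deleted, archived or reviewed.

```python
from datetime import date, timedelta


def records_past_retention(records, retention_days, today):
    """Return IDs of records collected before the retention cutoff."""
    cutoff = today - timedelta(days=retention_days)
    return [r["id"] for r in records if r["collected_on"] < cutoff]


# Illustrative records (assumed layout).
records = [
    {"id": 1, "collected_on": date(2023, 6, 1)},
    {"id": 2, "collected_on": date(2024, 6, 1)},
]
expired = records_past_retention(records, retention_days=365, today=date(2025, 1, 1))
```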

Enforce policies through platform controls

Ethical policies are difficult to sustain if they depend only on manual review. Platform controls help enforce policies where data is stored, queried, shared and used.

Masking policies can reduce exposure of sensitive columns. Row-level access policies can restrict which records a user or role can see. Object tags can mark sensitive data, approved use, domain ownership, retention requirements or classification status. Data classification can help identify potentially sensitive data so it can be governed consistently. 

For example, a policy may say that sensitive demographic attributes should not be broadly available for model training. A platform control can mask those attributes, restrict access to approved roles and preserve metadata that shows how the data is governed.
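The logic of such a control can be sketched in a few lines. The example below is plain Python that imitates role-based masking on a single row. It is not Snowflake's masking policy syntax, and the role name, column names and mask value are all hypothetical.

```python
# Assumed configuration: which roles may see sensitive columns unmasked.
APPROVED_ROLES = {"FAIRNESS_AUDITOR"}
MASKED_COLUMNS = {"ethnicity", "date_of_birth"}
MASK = "***MASKED***"


def apply_masking(row, role):
    """Return a copy of the row with sensitive columns masked for unapproved roles."""
    if role in APPROVED_ROLES:
        return dict(row)
    return {key: (MASK if key in MASKED_COLUMNS else value) for key, value in row.items()}


row = {"customer_id": 42, "ethnicity": "example", "date_of_birth": "1990-01-01"}
for_analyst = apply_masking(row, "ML_ENGINEER")
for_auditor = apply_masking(row, "FAIRNESS_AUDITOR")
```

In a data platform, the equivalent rule would typically be defined once and attached to columns, often by tag, so it applies wherever the data is queried rather than in each application.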

Use lineage and audit trails to prove responsible use

Organizations also need evidence. Lineage helps show where data came from, how it changed and which downstream assets depend on it. Audit trails help show who accessed data, when and under what context. 

Together, they help teams demonstrate that data was used for a stated purpose and that governed fields did not move unnoticed into unauthorized workflows. This evidence supports both internal accountability and external oversight.
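To show how lineage can support that kind of check, the sketch below walks a lineage graph breadth-first from a governed source field and lists downstream assets that are not on an approved list. The graph structure, asset names and approved set are hypothetical. Real lineage metadata would come from the platform or catalog.

```python
from collections import deque


def downstream_assets(lineage, source):
    """Return every asset reachable from source in a lineage graph.

    lineage maps an asset name to the assets derived directly from it.
    """
    seen = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, ()):
            if child not in seen and child != source:
                seen.add(child)
                queue.append(child)
    return seen


# Illustrative lineage graph (assumed asset names).
lineage = {
    "raw.patients.diagnosis": ["staging.patients_clean"],
    "staging.patients_clean": ["features.risk_inputs", "reports.monthly_counts"],
    "features.risk_inputs": ["models.triage_v2"],
}
approved = {"staging.patients_clean", "reports.monthly_counts"}

reached = downstream_assets(lineage, "raw.patients.diagnosis")
needs_review = sorted(reached - approved)
```

Here the governed field reaches a feature set and a model that were never approved for it, which is the kind of unnoticed movement that lineage review is meant to catch.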

KEY TAKEAWAY

The most reliable way to operationalize data ethics is to connect classification and intent to enforcement and review: tag sensitive data, bind those tags to access, masking and retention policies, and use lineage and audit trails to verify how that data is actually used as it moves into analytics and AI workflows.

Review data use as context changes

Data moves, business needs change and AI systems create new forms of reuse. Data that was low-risk at collection can become sensitive when it is joined with other data sets, used to retrain a model or moved into production at scale.

Ongoing review helps organizations keep policies aligned with context. This can include periodic access reviews, retention reviews, data product certification, fairness audits, lineage reviews and approval workflows for new use cases. Human oversight remains important because ethical questions often involve context that cannot be fully captured in a rule.

How Snowflake supports data ethics

Snowflake helps organizations operationalize data ethics by connecting governance policies, metadata and controls within the data environment. 

Centralize governance context

Snowflake Horizon Catalog helps teams discover, understand and govern data, apps and models across the AI Data Cloud. By surfacing metadata such as classifications, object tags, policies, ownership and lineage, Horizon Catalog gives data stewards and consumers more context before data is used in analytics, AI or data sharing workflows.

Enforce responsible data use

Snowflake pairs its responsible AI commitments with governance controls that help teams apply ethical principles directly at the data layer. Dynamic data masking can reduce unnecessary exposure of sensitive columns, while row access policies can restrict which records users or roles can see. Object tagging and classification help identify governed data and apply controls more consistently across data products and pipelines.

Audit and review data activity

Responsible data use also requires evidence. Snowflake capabilities such as the Access History view, object tagging and lineage help teams see who accessed data, where governed data moved and which downstream assets depend on it. That audit trail can support stewardship reviews, compliance workflows and investigations into whether data was used for its approved purpose.

Support responsible AI workflows

For AI use cases, governance context matters before data enters a model or application. Snowflake helps teams apply data governance controls to AI workflows in the same environment where data is stored and processed, while Cortex Guard supports content safety for LLM-powered applications built with Snowflake Cortex AI. Together, these capabilities help teams connect responsible AI practices to the governed data foundation underneath them.

Data ethics depends on operational governance

Data ethics must shape the practical decisions that determine what data is collected, how it’s classified, who can access it, where it moves and when a new use requires review. When stewards can trace lineage, attach tags, apply masking policies, review access and document purpose, organizations are better positioned to build ethical data-use practices into everyday operations rather than treating them as a separate approval process. 

The result is a stronger foundation for responsible analytics, AI development and data sharing — one that helps organizations use data with more confidence and accountability.

KEY TAKEAWAY

Data ethics becomes actionable when organizations turn principles like fairness, transparency and responsible use into enforceable governance controls across the data lifecycle. By combining policies with tagging, access controls, lineage and auditability, organizations can reduce risk, strengthen trust and support more responsible AI and analytics at scale.

Frequently Asked Questions

Your common questions about data ethics, answered by Snowflake experts.

What is the difference between data ethics and data privacy?

Data privacy focuses on protecting personal data from unauthorized access, misuse or disclosure, often in response to regulatory requirements such as GDPR or the California Consumer Privacy Act. Data ethics is broader. It applies moral standards to decisions about how data is collected, stored, used and shared, including decisions that may be legal but still inappropriate, disproportionate or opaque.

Is AI ethics the same as data ethics?

AI ethics is related to data ethics, but it’s not the same thing. AI ethics focuses on the design, training, deployment and monitoring of AI systems. Data ethics covers the data lifecycle, whether or not AI is involved. In practice, AI ethics depends heavily on data ethics because model behavior is shaped by training data, data provenance, labeling practices, access controls and monitoring.

How do organizations implement data ethics?

Organizations implement data ethics by documenting ethical commitments, translating those commitments into governance policies and enforcing the policies through platform controls. Common mechanisms include data stewardship, classification, tagging, masking, row-level access policies, retention policies, lineage, audit logging and periodic review.

What are examples of data ethics issues?

Examples include biased training data, unclear consent, excessive data collection, repurposing data beyond its original use, retaining sensitive data longer than needed, exposing PII through open data programs, and using demographic or behavioral data in automated decisions without appropriate review.

Which regulations and frameworks relate to data ethics?

Several regulations include obligations that reflect ethical principles. GDPR includes requirements related to purpose limitation, data minimization and transparency. The EU AI Act includes data governance and transparency requirements for certain high-risk AI systems. NIST’s AI Risk Management Framework provides voluntary guidance for governing, mapping, measuring and managing AI risks.
