Data engineers, data scientists, and data analysts at Snowflake build various ETL and ELT data pipelines that run at different set schedules to obtain information, to make better informed decisions, and to provide analytical insights that drive company strategy. A reliable, efficient, and trustworthy workflow management system is crucial to make sure these pipelines run successfully and deliver the data on its set schedule.
Our data team is highly cross-functional and developers belong to different verticals—finance, sales, marketing, IT, and product—and collaborate on a single mono-repository for their pipelines.
Why We Use Apache Airflow
Workflows can automate and streamline repeatable tasks, increasing productivity and efficiency for your project or business. Creating these workflows without a tool such as Apache Airflow can be a difficult and increasingly unmanageable endeavor as the number of tasks grows and the dependencies between tasks become more and more complicated.
How We Use Airflow at Snowflake
Airflow is an essential part of our software stack. Data pipelines and machine learning models at Snowflake are driven by Snowhouse (our own instance of Snowflake, which is a SaaS-based platform), dbt,1 and Airflow for modeling and transformation. Airflow is also used for the orchestration and scheduling of tasks.
Airflow is an important part of making sure steps in our workflow happen in a certain order and on a recurring basis.
Our Previous Airflow Setup: Amazon EC2
We started using Airflow (v1.10.12) at Snowflake in early 2020. Our first implementation, shown in the figure below, ran Airflow in a single Amazon Elastic Compute Cloud (EC2) instance, r5.xlarge, to support less than 15 Directed Acyclic Graphs (DAGs) and less than 10 users.
We utilized five Airflow services: Worker, Webserver, Scheduler, Flower, and the Celery Executor. In addition, we had a Redis cluster in Amazon ElastiCache and a PostgreSQL database hosted in Amazon Relational Database Service (RDS).
We baked Airflow configurations and applications logic into a single Docker image and deployed that on EC2, which launched Airflow services. Here is our release process in the old setup:
Challenges Running Airflow in EC2
As Snowflake has grown, the number of teams and their need to conduct exploratory data analyses and answer business-related questions have also grown. In just one year, the number of DAGs in our Airflow cluster grew from less than 15 to 270.
As a consequence, we noticed that DAGs would sometimes not run successfully, or there would be problems with the scheduler. These issues caused the following repercussions and had a staggering impact on SLAs and the reliability of the data in the platform:
- Scalability: With single EC2 instances at max, we were allowed to run 48 tasks in parallel. This number was extremely low considering the rate at which tasks were getting scheduled during peak hours. Adding an additional worker or increasing the capacity of existing nodes required looping in the cloud infrastructure team and also required Airflow services to be restarted.
- Reliability and availability: As more tasks were pushed to the queue, the scheduler became strained and, eventually, the scheduler stopped responding. This brought the whole game to a halt unless someone identified the issue and restarted the Airflow services.
- Isolation: Airflow installations and application logic were tightly coupled and baked into a single Docker image. Every time application logic needed to be deployed to production, it required us to kill and restart the Airflow services, which would abruptly stop DAGs that were already running.
Airflow on Kubernetes at Snowflake
To sustain rapid growth in the demand for implementing large workloads and to satisfy all our stakeholders’ needs, we had to move our Airflow setup from a single EC2 instance to the [near-]unlimited scale of Kubernetes.
Our goal was to provide self-service, scalable resources that can be accessed without needing to loop in external teams; support Docker/containerization for business apps to the greatest extent possible to simplify software dependency management and eliminate conflicts; maintain the ability to dynamically update workflows without needing to redeploy Airflow for each workflow change; and offer a real-time monitoring and alerting solution for Airflow’s health. The following figure shows our new configuration:
To achieve these goals, we are using airflow-helm2 charts to deploy Airflow 2.0.2 onto Kubernetes with Celery Executor. We use Kubernetes’ native auto-scaling feature to auto-scale Celery workers in and out, and the results have been astonishing: Most of our tasks are pushed to the queue due to the unavailability of timely resources, but now tasks meet the SLAs without starving for resources, as shown in the following figure:
Using Airflow’s KubernetesPodOperator, we are able to execute dockerized business applications as tasks within DAGs. Custom Docker images allow users to ensure that the tasks’ environment, configuration, and dependencies are completely idempotent. It gives DAG owners flexibility to define and containerize applications and their dependencies without worrying about conflicts from other applications.
KubernetesPodOperator also has a feature to provision additional compute resources for respective tasks through a DAG file. That gives us the benefit of allocating resources for varying workloads and we can monitor and measure the resources consumed at the task level.
Migrating to our new environment also gave us an opportunity to separate application logic from infrastructure code. We now have a separate repository for Airflow Kubernetes manifests. What this means is that we can now deploy application-related changes without restarting Airflow services. With git-sync running as a sidecar container on Airflow pods, users now can get feedback on their data pipelines immediately. With our old environment, they had to wait on nightly builds, which meant waiting for an entire day.
Airflow Monitoring and Alerting at Snowflake
As part of our migration, we also wanted to improve the monitoring of our Airflow cluster. Airflow comes with minimal metrics out of the box. Before, DAG authors often ran into various issues with their DAGs, but we weren’t able to conclusively diagnose the issues, because we didn’t have extensive monitoring and metrics. This resulted in teams spending more time investigating, because we weren’t able to quickly identify whether the issue was with the application code or Airflow itself.
In our new environment, we wanted to easily answer questions such as:
- Is the Airflow scheduler healthy enough to schedule DAGs and tasks?
- What is the availability of pool slots and are there tasks stuck in the “QUEUED” state?
- What is the scheduling delay for DAGs and tasks?
- Which are the slowest DAGs and tasks?
- Are there any DAGs that failed to load?
We now use Wavefront for monitoring and alerting within Snowflake, as shown in the following figure. We plug in a StatsD client and a data collection agent—Telegraf Agent on Wavefront Proxy—that is responsible for ingesting metrics and forwarding them to the Wavefront service in a secure, fast, and reliable manner.
If the scheduler’s heartbeat is off for more than five minutes, an on-call engineer gets alerted. These metrics have helped us quickly understand the state of the Airflow cluster and identify issues.
We hope this post has given you visibility into how we started using Airflow at Snowflake; issues that drove us to migrate our Airflow setup to the Kubernetes environment; how we migrated our Airflow cluster to production; how we set up an alerting and monitoring framework for our Airflow cluster; and how we operate our Airflow clusters at scale.
In the future, we are going to focus on building analytics on Airflow metadata. That can provide new insights into the Airflow Kubernetes batch orchestration behavior. This metadata can help us detect deviations from historical patterns and can be used to monitor fluctuations in data pipelines. It can also help us identify improvement areas to effectively utilize and increase the performance of the data pipelines and the environment in partnership with users.