Fast and Reliable Builds at Snowflake with Bazel

After a little over two years, we are excited to announce that the migration of our core product’s build and test system to Bazel is complete!

The first year of this migration laid the foundations of our Bazel build, and we accelerated our efforts in the second year thanks to full leadership support and a more product-oriented approach, helping us land the migration sooner than expected.

Let me tell you about our journey.

Context-setting

Picture yourself as a Snowflake engineer working on our core product repository (or "repo") three years ago. This repo contains the code of our product and has two distinct components: our business logic, written in Java, and our query engine, written in C++. Each component was built by a different, language-specific build system. Coordinating two build systems is no easy feat, so a collection of organically grown shell scripts stitched the two builds together to bring you Snowflake.

Our engineers routinely faced problems such as:

  • Slow builds: Our legacy builds were not distributed, which meant they were confined to the resources of a single machine. Developers routinely waited on hour-long full builds, which was perhaps their biggest complaint.

  • Slow incremental builds: Our engineers also faced problems in the inner development loop, where incremental builds after a simple code change could take tens of minutes.

  • Unreliable builds: While both legacy build systems could drive incremental builds, these often failed to work correctly because writing correct build rules and scripts is difficult for any engineer who isn't a build-system expert. Our developers hit random build failures after git rebase operations and were forced to run clean builds at unexpected moments during their day, adding one-hour penalties to their flow. The scripts that coordinated the two build systems were another source of unreliability.

Three years ago, these issues reached an inflection point, and engineering frustration came to a head. It was clear that we needed to revamp the build process, but which solution to pick wasn't as clear.

Our C++ developers were generally happy, and we could have fixed their slow build issues by deploying something akin to distcc. Our Java developers clamored for Gradle — a well-known player in the Java space. While we could have addressed these issues separately per codebase, we wanted a single solution that would support scaled builds centrally, across not just these two codebases but other languages as well.

Three factors ultimately led to the decision to move to Bazel:

  • We could build a unified experience for all developers. 

  • We had the required in-house expertise to make this possible.

  • Bazel provided a solid story around code provenance.

Our approach to C++

The way we build C++ with Bazel is pretty conventional: Developers maintain their BUILD files by hand, which gives them manual control over dependencies. That said, there are a few aspects unique to our setup (sketched in the examples after this list):

  • Automated dependency generation: The Bazel C++ rules differentiate between dependencies that have to be exposed to dependents (deps) and those that don't (implementation_deps). Maintaining these lists for the hundreds of targets in our build is an arduous process, so we have automated this job with a script that processes the C++ source files, identifies dependencies and modifies the build rules accordingly.

  • Dynamic linking for debug, static linking for release: Static binaries are awesome for production deployments, but not so much for iterative development. With CMake, we used static binaries for all development builds and tests, and this worked fine because build artifacts were always on disk. With Bazel we moved to remote execution, which caused our builds to ferry large artifacts between the remote cache and the local machine on every build. This broke developers' flow, so we minimized transfers by linking dynamically in development builds.

  • Dynamic strategy usage: Even with dynamic linking enabled, we found that the incremental build and test execution flow of most developers degraded. The usage of remote execution and remote caching made these interactions slower, particularly for engineers working on not-so-great internet connections. We enabled Bazel's dynamic strategy by default in all interactive builds to leverage local compute for incremental builds — as this is a scenario where local almost always beats remote.

  • Third-party nondeterministic dependencies: Like any codebase, ours depends on third-party components, but those do not necessarily provide Bazel-native builds. While we migrated some simple dependencies to Bazel, we still use rules_foreign_cc for a few of them. The problem is that, while Bazel is hermetic, these foreign build systems aren't necessarily so, which means that the outputs of any rules_foreign_cc rule are typically nondeterministic. Fixing the determinism issues is difficult, so we force these rules to execute on the build farm (never locally) to ensure their outputs come from a single trusted environment and, once cached, are never rebuilt.
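To make these points more concrete, here is a minimal BUILD sketch — not our actual configuration — of the deps/implementation_deps split our generator maintains and of a third-party rule pinned to the build farm. All package, target and repository names are invented for illustration, and the load statement and the "no-local" execution tag should be checked against your rules_foreign_cc and Bazel versions:

    # BUILD (sketch): all names below are illustrative.
    load("@rules_foreign_cc//foreign_cc:defs.bzl", "cmake")

    cc_library(
        name = "query_planner",
        srcs = ["planner.cc"],
        hdrs = ["planner.h"],
        # Exposed to dependents because planner.h needs these headers.
        deps = ["//engine/ast"],
        # Used only inside planner.cc, so dependents never see it.
        implementation_deps = ["//engine/internal:hashing"],
    )

    # A third-party library built via rules_foreign_cc. The "no-local" tag asks
    # Bazel to never run this nondeterministic action on a developer machine, so
    # its output is produced once on the build farm and then served from cache.
    cmake(
        name = "thirdparty_lib",
        lib_source = "@thirdparty_srcs//:all",
        tags = ["no-local"],
    )

The linking and scheduling choices live on the flag side. The following .bazelrc sketch shows one way to express them; the config names are arbitrary and the exact flag spellings can vary between Bazel versions:

    # .bazelrc (sketch)
    # Developer builds: dynamic linking keeps incremental artifacts small.
    build:dev --compilation_mode=dbg
    build:dev --dynamic_mode=fully

    # Release builds: fully static binaries for production deployment.
    build:release --compilation_mode=opt
    build:release --dynamic_mode=off

    # Interactive builds: race local against remote execution so small
    # incremental steps are not bottlenecked on the network.
    build:dev --internal_spawn_scheduler
    build:dev --spawn_strategy=dynamic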

Our approach to Java

Java presented other unique challenges, as we had to solve three big roadblocks:

  • No manual dependency management: Java developers tend to think in terms of Java packages for code organization within their own project: They import classes from other packages and expect those imports to work without any extra steps. This is further reinforced by our legacy build system, which globs together all source files and expects them to be compiled at once.

  • Monolithic code chunks: As a result, it's very easy to end up with cyclic dependencies among Java files, because Java imports alone don't prevent them. Our codebase had roughly 8,000 files that needed to be compiled as a unit, and this compilation took more than 10 minutes. Whether Bazel could offer a good user experience with such a choke point in the critical path of every build was uncertain.

  • Tags in code: Our developers leveraged Java annotations in code to express certain dependencies and properties, particularly of tests. They preferred to use annotations to indicate, for example, whether a test is a fast unit test or a slow integration test. Bazel has a different way of doing things, but we thought it was important to meet developers where they are and support existing workflows.

We solved all of these issues with two techniques:

  • Build file generation: We revived the defunct BUILD_file_generator tool as an internal-only fork and use it to generate BUILD files from scratch on every build. This tool first scans the Java source code to extract import statements and annotations. From these, it builds a dependency graph and generates very precise and granular BUILD files. We did not use Gazelle because its Java support didn't offer a way to break dependency cycles apart into fine-grained targets.

  • Sharded compilation of the monolith: To solve the slow compilation of the monolithic target containing 9,000 files, we created a custom rule. As a first cut, we wrapped Gradle in a custom rule to leverage its incremental compilation features and were able to drop the monolith's 10-minute compilation times to half a minute. Unfortunately, this regressed over time back to seven minutes or more. We had to find another solution, and we did. What our custom rule does today is, first, generate a header JAR from those 9,000 files, a step that takes about 13 seconds; then it shards the compilation of the 9,000 files into ~100 units. All shards depend on the header JAR, and all shards can be built in parallel remotely (see the sketch after this list). As a result, we can build the monolith in about two minutes while we continue our efforts to split it into components.
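As a rough illustration of the sharding idea (a sketch, not our actual rule), the following Starlark macro splits a large source list into shards that all compile against a shared header JAR. The java_header_jar rule is a hypothetical stand-in for whatever produces the ABI-only JAR; everything else is standard Starlark and java_library:

    # sharded_java.bzl (sketch). `java_header_jar` is hypothetical: a stand-in
    # for the rule that emits the ABI-only (header) JAR for the full source set.
    load("@rules_java//java:defs.bzl", "java_library")

    def sharded_java_library(name, srcs, deps = [], shards = 100, **kwargs):
        """Compiles `srcs` as ~`shards` java_library targets that build in parallel."""
        header_jar = name + "_hjar"
        java_header_jar(name = header_jar, srcs = srcs, deps = deps)

        shard_labels = []
        for i in range(shards):
            shard_srcs = [s for j, s in enumerate(srcs) if j % shards == i]
            if not shard_srcs:
                continue
            shard_name = "%s_shard_%d" % (name, i)
            java_library(
                name = shard_name,
                srcs = shard_srcs,
                # Shards depend only on the header JAR, never on each other,
                # so they can all be compiled in parallel (and remotely).
                deps = deps + [":" + header_jar],
                **kwargs
            )
            shard_labels.append(":" + shard_name)

        # Re-export the shards under the original target name.
        java_library(name = name, exports = shard_labels)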

These two techniques combined have allowed us to deliver faster build times to our Java engineers. There are some downsides to this approach due to the precise build graph that our automation generates. We are looking at ways to address these issues without losing the benefits of automation.

Our approach to remote execution

One of Bazel's main selling points is that it supports remote caching and remote execution (RE) out of the box to speed up end-to-end build times. RE achieves this by safely leveraging artifacts produced by other users and by parallelizing compilations in more ways than one machine can ever do.

Most projects that migrate to Bazel start by first using remote caching only, and then, once they have a functional build, they onboard into RE. We didn't follow this path though, as we chose to use RE from the outset. One reason for this choice was that, with the switch to Bazel and its more precise change tracking, we started to notice that Bazel rebuilt "third-party dependencies that rarely change" many more times than our legacy build system did, and the end-to-end build times that users experienced degraded. A secondary reason was that the only way for us to solve the user-perceived pain point of "slow builds" was to leverage parallelism. A tertiary reason was that we didn't think that RE would be too hard — but time proved us wrong there, as you will see in a moment.

So which solution to RE should we use? Back in early 2022, we surveyed the ecosystem and decided to deploy our own RE cluster (colloquially known as "the build farm") by leveraging the open source Buildbarn project. At that time, commercial offerings were still in their infancy and it was very important to us to own the RE deployment during our migration so that we could freely experiment and not be bound by external constraints.

The journey with running our own Buildbarn cluster — which is exclusively used for internal development and has no connection to the Snowflake product or service — has been challenging in its own right. We faced various issues that arose from how we chose to deploy the software, and those weren't easy to diagnose and root-cause: For example, the RE workers underperformed because of incorrect disk provisioning, but it wasn't until we built our own visualization tooling that we pinpointed the issue. We also faced poor build performance over high-latency networks. Our load balancer configuration was not optimized for this traffic pattern, and our attempt to address the problem via lazy network downloads did not solve the latency issues.

Today, our Buildbarn deployment is stable and highly performant. We do not use bb_clientd; instead, we have interactive clients download all intermediate artifacts for the benefit of the IDE, though lazy downloads of those artifacts are something we want to reinvestigate in the future. And because we have run this software for two years now, we have a good sense of how much it costs, a good perspective on the kinds of optimizations that will give us the best bang for our buck, and we are hosting community meetups to share knowledge.
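For readers curious about what the client side of remote execution looks like, here is a minimal .bazelrc sketch with an invented endpoint; the download flags shown are standard Bazel "Build without the Bytes" options rather than anything Snowflake-specific:

    # .bazelrc (sketch): the endpoint is invented for illustration.
    build --remote_executor=grpcs://buildfarm.example.internal:443
    build --remote_cache=grpcs://buildfarm.example.internal:443
    # With remote execution, action parallelism is no longer limited to one
    # machine's cores.
    build --jobs=500

    # Today: eagerly download every intermediate artifact so the IDE can index it.
    build --remote_download_all
    # The lazy alternative we want to reinvestigate:
    # build --remote_download_toplevel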

Accelerating the migration

Bazel did indeed accelerate our builds, but the migration process was "stuck."

One year into the project, we felt we were in pretty good shape. The whole codebase could be built successfully with Bazel; most tests ran and passed with Bazel; the Snowflake product was able to stand up in developer environments; and we had demonstrated that Bazel very visibly cut down the one-hour-long build times to just 10 minutes. Yet our internal user base did not seem to care: Our Bazel adoption metrics were at a low 20%.

If you have run a migration, you understand that solving technical problems is easy compared to getting a large population of engineers to change their behavior. Fortunately, Snowflake understands that developer productivity is critical, and about a year ago, leadership launched a company-wide initiative. The goal was clear: improve our developers’ efficiency by following a product-oriented approach. We used quarterly surveys to discover our users’ actual pain points, defined metrics to track improvements and had clear plans on how to address them. Bazel was the stepping stone to fixing many of these problems, which meant we had company-wide support.

But we were far from our goals. After establishing metrics, we could tell that, while build performance was better in some cases, it wasn't clearly better all the time. It also became obvious that our internal users faced more latency than before in builds within the IDE, and our build farm deployment didn't offer the availability those builds needed. Knowing what mattered to users, though, we drew up a one-year plan to resolve those problems and, in the end, delivered the migration as planned.

A key to our success in this story was a sibling product: our cloud-based development workspaces, or CloudWS for short. We combined the Bazel migration with the migration of our engineering environment from laptops to CloudWS instances. These workspaces provide preconfigured environments for our users with prewarmed Bazel caches and super-fast access to the RE cluster. CloudWS made the Bazel experience delightful to our engineers and allowed us to simplify our story significantly by not having to deal with the tight constraints of laptops or the vast differences in internet performance around the world. We could not have successfully migrated our user base to Bazel without CloudWS and vice versa.

Production deployment

One specific roadblock was getting our production artifacts moved over to Bazel. This is not something the average developer worried about, but the general sense was that "if Bazel isn't in production, why should I care? I'm not working on what we ship!" — and that feeling was right. Rolling out the Bazel-built artifacts to production was as important as moving developer workflows, and it was riskier.

We tackled this by moving individual pieces of the build to Bazel. We do not have conventional microservices, but we have a few components that can be shipped separately, so we started by migrating the smallest, lowest-risk ones. At some point, however, we had to rip off the bandage and migrate the bigger chunks.

To minimize the risk, we treated the build system migration like any other feature. We started with developer clusters and fixed a few bugs there. We then made sure that unit, integration and system test coverage was equivalent between the two build systems. Next, we deployed the Bazel build to preproduction environments and assessed quality and performance. And only then did we proceed to the staged production deployment.

All great? Of course not! Even with this careful approach, a couple of issues triggered preproduction rollbacks and later retries. To give you two examples: The Bazel build made a preexisting race condition in the libcurl library easier to trigger and, of course, it ended up being triggered in production; and the Bazel build made one code path slower because a logging configuration file was not bundled into the JAR, so certain log statements went to the wrong place and consumed CPU resources unnecessarily.

In the end, we completed the migration to Bazel artifacts in the quiet period between Thanksgiving and Christmas 2024.

The path ahead

Our target was to finish the Bazel migration by the end of January 2025, and we did it!

But our job isn’t finished. We are charging ahead with continuous innovation: Bazel fixes build reliability and performance, but it is just the foundation for many other planned improvements:

  • One-build promise: Now that we have just one build system, we can simplify all build scripts and pipelines to rely on Bazel targets. This is expected to decrease end-to-end build latency for common interactive flows and also make our release builds faster.

  • IDE improvements: We have gotten the IDE experience to a reasonable level, but "IDE syncs" are still the top pain point reported by users. These take too long and need to happen too often for IntelliJ and CLion to work efficiently. We are working to simplify our build graph to make IDE syncs less necessary, and we are also working with JetBrains to pilot its new Bazel plug-in for a better overall experience.

  • Cost savings: During the migration, running two concurrent build systems added inefficiency. Now that we are fully on Bazel and can simplify our workflows to a single code path, we can start tackling cost-reduction projects.

  • Multirepo expansion: And, finally ... our company is not just the components described in this article. We have other components that live in sibling repos and that are suffering from developer productivity inefficiencies. Our next mission in the Engineering Systems organization is to figure out how we will bring the latest tooling to them, and part of that story will involve migrating those codebases to Bazel too. Relatedly, we will also need to come up with a solid plan to manage coordinated changes across various repos — which is an open problem in the industry.
