Build. Part 2: Incremental Reproducibility.

In part 1 I discussed the main feature a build system should have: reproducibility. In this part I discuss how we get reproducibility with minimal duplicated work in a system like Bazel.

In Bazel we define build targets that take some inputs and produce some outputs. A target might produce a Java jar, the result of compiling several Java files. Or a target might produce a static C++ .a library file. But what about targets that depend on other targets? If libfoo depends on libbar, and both live in a project with many other libraries, how does Bazel know when to recompile?

A naive build might transitively recompile everything on each change. Alternatively, many compilers support compiling against already compiled artifacts. In some cases a compiler cleanly separates the interface of compiled code from the compiled object code, and has a linking phase that connects a set of compiled outputs into one deployable artifact. Compilers designed this way work well with the make model of building.

In the make model, if A depends on B and I make a change to B, I only need to recompile the library for A if the interface of B changes, not if only the object code of B does. Bazel follows this model but differs from make in that it uses a cryptographic hash of file contents rather than timestamps to detect changes.
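To make the hash-versus-timestamp difference concrete, here is a minimal Python sketch. It is not how Bazel is implemented; the helper names `content_digest` and `demo` are made up for illustration. It rewrites a file with identical bytes and a newer modification time, then reports which of the two signals says "changed".

```python
import hashlib
import os
import tempfile


def content_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def demo():
    """Return (mtime_changed, digest_changed) after a no-op rewrite."""
    source = "int bar() { return 1; }\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "bar.cc")
        with open(path, "w") as f:
            f.write(source)
        mtime_before = os.stat(path).st_mtime_ns
        digest_before = content_digest(path)

        # Rewrite the same bytes and push the mtime forward by ten
        # seconds, as a `touch` or a fresh checkout might.
        with open(path, "w") as f:
            f.write(source)
        later = mtime_before + 10 * 10**9
        os.utime(path, ns=(later, later))

        mtime_changed = os.stat(path).st_mtime_ns != mtime_before
        digest_changed = content_digest(path) != digest_before
        return mtime_changed, digest_changed
```

A timestamp-based tool would treat bar.cc as dirty here, while a content-hash-based tool would see nothing to do.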

A C++ library, for instance, may have two outputs: the static library itself and a normalized .h header file with whitespace normalized and all comments removed. We can call this .h file the interface. With this normalized bar_normal.h interface, our libfoo.a can be built without depending on libbar.a; at compile time it depends only on bar_normal.h and foo's own source files. In a system like this we can make many changes to bar without recompiling foo, as long as the external interface does not change. This is nice. Finally, we have an executable that we will deploy, created by linking all the static libraries together. Clearly this binary target needs to be rebuilt any time any of the libraries change. So, as long as our compiler can depend on such interface files, we can have efficient incremental compilation. How does this apply to Java?
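The normalized-header idea can be sketched in a few lines of Python. This is a simplified illustration under loud assumptions: the normalizer uses regular expressions and does not handle comment markers inside string literals or preprocessor line rules, and `normalize_header`, `interface_digest`, and `foo_compile_key` are hypothetical names, not part of any real toolchain.

```python
import hashlib
import re


def normalize_header(text):
    """Strip comments and collapse whitespace (simplified, see caveats above)."""
    text = re.sub(r"/\*.*?\*/", " ", text, flags=re.DOTALL)  # block comments
    text = re.sub(r"//[^\n]*", " ", text)                     # line comments
    return " ".join(text.split())


def interface_digest(header_text):
    """Hash of the normalized header, standing in for bar_normal.h."""
    normalized = normalize_header(header_text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def foo_compile_key(foo_source, bar_header):
    """Hypothetical cache key for compiling foo: foo's own source plus
    bar's normalized interface, and nothing from bar's implementation."""
    h = hashlib.sha256()
    h.update(hashlib.sha256(foo_source.encode("utf-8")).digest())
    h.update(interface_digest(bar_header).encode("ascii"))
    return h.hexdigest()


FOO_CC = '#include "bar.h"\nint foo() { return bar(1); }\n'
BAR_H_V1 = "int bar(int x);\n"
BAR_H_V2 = "// Returns x + 1.\nint   bar(int x);\n"  # comment and spacing only
BAR_H_V3 = "long bar(int x);\n"                      # real interface change

same_key = foo_compile_key(FOO_CC, BAR_H_V1) == foo_compile_key(FOO_CC, BAR_H_V2)
new_key = foo_compile_key(FOO_CC, BAR_H_V1) != foo_compile_key(FOO_CC, BAR_H_V3)
```

Here `same_key` says that a comment or whitespace edit in bar's header leaves foo's compile key alone, and `new_key` says that a changed return type does not.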

Unlike the C/C++ world, Java does not have explicit header or interface files. But javac can compile files against previously compiled jars or class files. Bazel’s approach here is a tool called ijar. The idea of ijar is to emit a normalized jar with all code, field values, and private methods and fields removed. Since the Java compiler only needs to know the types of these methods at compile time and not the implementation, it compiles against the ijar without issue. So the Bazel rules for Java emit not only a compiled jar for each target, which is analogous to a static C++ libfoo.a, but also an interface jar, such as libfoo_ijar.jar. Any change you make which does not change this interface, such as adding private methods, changing method implementations, or changing the values of constants, doesn’t cause any downstream recompilation because the hash of the interface doesn’t change. So good! But how do we generalize this approach, for instance to Scala?
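As a rough illustration of the interface-jar idea, here is a toy Python model. It is not ijar's actual implementation and does not read real class files; the `Method` and `ClassFile` types are hypothetical stand-ins. It drops private methods and method bodies, then hashes what is left.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Method:
    name: str
    signature: str    # e.g. "(I)I"
    is_private: bool
    body: str         # stand-in for bytecode


@dataclass(frozen=True)
class ClassFile:
    name: str
    methods: Tuple[Method, ...]


def interface_view(cls):
    """Keep only non-private method names and signatures, in a stable order."""
    kept = sorted((m.name, m.signature) for m in cls.methods if not m.is_private)
    return {"class": cls.name, "methods": kept}


def interface_digest(cls):
    """Hash of the stripped view; downstream targets would key on this."""
    payload = json.dumps(interface_view(cls), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


bar_v1 = ClassFile("Bar", (Method("inc", "(I)I", False, "return x + 1"),))
# Implementation change plus a new private helper: interface untouched.
bar_v2 = ClassFile("Bar", (
    Method("inc", "(I)I", False, "return add(x, 1)"),
    Method("add", "(II)I", True, "return a + b"),
))
# Public signature change: interface differs.
bar_v3 = ClassFile("Bar", (Method("inc", "(J)J", False, "return x + 1"),))
```

In this model `interface_digest(bar_v1)` equals `interface_digest(bar_v2)`, so nothing downstream would be rebuilt, while `bar_v3` produces a different digest.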

Scala is an interesting case because Scala code compiles to Java .class files. A simple idea would be to just copy the approach for Java. Will this work? It will, to a degree. The problem is that Scala has a richer type system than Java. Each method and field has a Java type which is written in the .class file, but it also has a full Scala type which the JVM does not care about but scalac does. Scalac writes the full Scala type information into an attribute called the Scala signature. Since ijar ignores this attribute and passes it through unchanged, any change to the Scala types, even private ones, changes the interface jar and causes a recompilation. This is unfortunate, but not a total disaster. If A depends on B and B depends on C, what happens if we change a private type of C? Since the Scala signature changes, B needs to be recompiled. But since the change is actually private, the output for B (assuming a reproducible compiler) will be unchanged, so A will not be recompiled.
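The A, B, C scenario can be simulated with a toy builder in Python. Everything here is an assumption made for illustration: sources are strings of the form "public|private", `toy_scalac` is a fake compiler whose output leaks its own private part (as the pass-through Scala signature does) but reads only the public part of its dependencies, and `Builder` reruns an action only when the digest of its inputs changes.

```python
import hashlib


def digest(*parts):
    h = hashlib.sha256()
    for p in parts:
        h.update(p.encode("utf-8"))
        h.update(b"\0")
    return h.hexdigest()


def toy_scalac(source, dep_outputs):
    """Fake compiler: output = own public part, own private part, and a
    digest of the dependencies' public parts only."""
    public, private = source.split("|")
    dep_public = [out.split("|")[0] for out in dep_outputs]
    return "|".join([public, private, digest(*dep_public)])


class Builder:
    """Rerun a target's action only when its input digest changes."""

    def __init__(self, deps):
        self.deps = deps    # target -> list of direct dependencies
        self.cache = {}     # target -> (input_digest, output)

    def build(self, target, sources, log):
        dep_outputs = [self.build(d, sources, log) for d in self.deps[target]]
        key = digest(sources[target], *dep_outputs)
        cached = self.cache.get(target)
        if cached is not None and cached[0] == key:
            return cached[1]
        log.append(target)  # record that this target was recompiled
        output = toy_scalac(sources[target], dep_outputs)
        self.cache[target] = (key, output)
        return output


def demo():
    builder = Builder({"A": ["B"], "B": ["C"], "C": []})
    sources = {"A": "a_api|a_impl", "B": "b_api|b_impl", "C": "c_api|c_impl"}
    logs = []
    for c_source in ["c_api|c_impl", "c_api|c_impl_v2", "c_api_v2|c_impl_v2"]:
        sources["C"] = c_source
        log = []
        builder.build("A", sources, log)
        logs.append(log)
    return logs
```

Calling `demo()` returns the recompiled targets for three builds: the first build compiles C, B, and A; changing only C's private part recompiles C and B but stops there because B's output is identical; changing C's public part recompiles all three.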

For other languages, this approach should generalize fairly well assuming:

  1. They can emit separate interface and code artifacts.
  2. They can compile against dependencies using only the interface. This generally assumes the existence of a separate linking phase, which for the JVM is actually at runtime.
  3. They actually expose the ability to do the above. Many compilers work this way, but compilers are often developed in tandem with language-specific build and packaging tools, which makes using them from heterogeneous build tools like Bazel (or Buck or Pants or Gradle or Make or …) more painful.

Looking forward, I see a bright future for heterogeneous build systems that allow us to have cross-language dependencies (via FFI or RPC) with incremental rebuilds and full reproducibility. Compiler authors can help enable this by explicitly supporting interface generation and the ability to compile against interfaces only, and by documenting how to do both through the command line interface of their tools.
