Let's say you have a Makefile like this:

```makefile
SRCS := $(shell find . -type f -iname '*.c')
OBJS := $(patsubst %.c,%.o,$(SRCS))
CFLAGS := -O1

%.o: %.c
	$(CC) -c $(CFLAGS) -o $@ $<

a.exe: $(OBJS)
	$(CC) -o $@ $^

test: a.exe
	@./a.exe

.PHONY: test
```
This isn't specifically for GNU Make, but for our purposes here it should be readily understandable, or translatable to your favorite implementation of Make. This Makefile:

- Turns a set of `.c` files into `.o` files.
- Turns the full set of `.o` files into an `.exe`.
- Allows you to run the `.exe`.
Make is actually a great tool to implement yourself, to test your programming chops with — try it when you want to learn a new language. That's because Make has a pretty simple operational model at a glance: the model that describes how Make executes, step by step. When we think of Make operationally, we are thinking of its concrete algorithm, and that algorithm, to be extremely hand-wavy, looks something like this:
- Make is responsible for keeping files on disk up to date; that's its job.
- For every file it describes, a Makefile has an associated command — one that, when run, creates that file.
- A Makefile contains a representation of the dependencies between files, i.e. this file needs to exist before this command can be run, which generates this other file.
- When Make is run, it is given a top-level file to produce, e.g. `make out.exe` — it looks at this file and the dependent files it needs.
- Make recursively asks itself: is this file "up-to-date"? If it is, Make does nothing and continues on. If it isn't, it runs the command associated with that file to regenerate it. This task is recursive across all dependencies of the requested file, because if a dependency is out of date, it needs to be brought up to date first.
- It can do this in parallel, too. Some commands depend on different sets of input files, after all.
- It continues this process until the desired file `out.exe` is "up-to-date".
The notion of "up-to-date" is left abstract, though in practice Make looks at the timestamps of files: if it sees `A: B...`, then `A` is considered up-to-date only if it exists and has a newer timestamp than all of the dependent files `B...`. It does this recursively for all dependencies. This "existence + mtime" check is the core of how Make works for most users.
In this way, because Make only tracks files, it treats the filesystem as a caching layer: it assumes every action puts its result into a file on the filesystem, and it uses "existence + mtime" as the criteria for a "cache hit". This turns the filesystem into a simplistic cache and database of record. You can imagine that, in a hypothetical world where the `stat` call also returned a SHA-256 hash of a file along with the `mtime`, then `make` could use a hash function instead of the `mtime`. Without that, Make would instead have to record that data out of band.
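To make that concrete, here is a minimal sketch — in Python, and not part of Make itself — of what "recording that data out of band" might look like: a small state file maps each path to the content hash it had the last time it was built, and a file counts as a "cache hit" only if it exists and neither it nor its dependencies have changed hashes since. The file name `.hashes.json` and all function names are hypothetical.

```python
import hashlib, json, os

STATE_FILE = ".hashes.json"  # hypothetical out-of-band record

def file_hash(path):
    # SHA-256 of the file contents, standing in for the richer stat() imagined above.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def load_state():
    if not os.path.exists(STATE_FILE):
        return {}
    with open(STATE_FILE) as f:
        return json.load(f)

def save_state(state):
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

def is_up_to_date(target, deps, state):
    # A "cache hit": the target exists, we have a record for it, and neither
    # it nor any dependency has changed content since that record was made.
    if not os.path.exists(target) or target not in state:
        return False
    return all(
        os.path.exists(p) and state.get(p) == file_hash(p)
        for p in [target, *deps]
    )

def record(target, deps, state):
    # After (re)building, record the hashes we built against.
    for p in [target, *deps]:
        state[p] = file_hash(p)
    save_state(state)
```

This is essentially the same "existence plus up-to-date check" contract as the mtime scheme, just with a content hash stored outside the filesystem metadata.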
Also, some rules cannot be cached. They are `.PHONY:` rules, and the name of a phony rule doesn't actually have to correspond to a file, because no caching is taking place.
But if we dissect it, `make` actually embodies a particular set of semantics that we can pull apart. And I think there's a journey from this that would roughly result in something like `buck2`.
So, the purpose of a Makefile is simple: run a set of commands. But doesn't bash do that? And didn't we determine Make caches things, and checks for "up-to-date"-ness? What separates it from a raw Bash script, then? You could just put every command sequentially into a `.sh` file and run that. But Make is useful precisely because it recognizes some common things about such a series of commands:
- Individual commands are computations that result in outputs. For example, the report file generated by running a test suite, or the object file generated by compiling a piece of code.

- Even though the total set of inputs to the bash script may be large when looked at in aggregate, individual commands often only need a small subset of those inputs to do their job — inputs which often are the outputs of other, previous commands that were run, or perhaps input files written by hand.

- For most commands, the command's output is purely a function of its input parameters: files, command line arguments, `stdin`, environment variables, et cetera. If you run a command once with parameters A, B, and C, be they from the environment, command line, or `stdin`, doing so again will always give the same output as the previous run. These are pure commands.

- But some (few) commands need to be re-run every time, even with the same inputs. These are volatile commands. They often have some sort of side effect on the outside environment, or perform some ambient effect to be tracked (like running a test suite).

- Individual commands do not need to run sequentially. A fully defined sequential ordering would imply there is a total ordering on the set of commands. But there isn't one, unless you arbitrarily break ties by some other total ordering, kicking the can down the road. Put another way, you can create a total ordering, but it is overly conservative and completely arbitrary.

- Rather: any command may run once all of its needed inputs are available, including those that are outputs of previous commands. Therefore, if we consider the set of all commands `S`, there is a partial order on `S`, where the partial order relationship is defined by input/output edges, and `X < Y` means that command `X` must be run before command `Y` can be run, because the output of `X` is an input to `Y`.

- Because there is a partial ordering on the set of commands to run, with respect to their inputs/outputs — you may construct a directed acyclic graph (DAG) from that set.

- Because there is a DAG on the set of commands, the DAG naturally represents a graph which can be computed in parallel. Two commands `X` and `Y` may be run in parallel as long as they are not transitively dependent on one another; that is, there is no transitive chain `X < ... < Y` relating them. (There's a small sketch of this just after the list.)

- In this sense, a Makefile doesn't describe a particular sequence of commands to run; rather, it is a specification of commands to run, of which there are many potential realizations. For example, if two nodes `X` and `Y` are not transitively related, then their commands can be run in a purely sequential order as `X, Y` ("version 1") or `Y, X` ("version 2"). Even though the physical trace of commands is different, these two versions are "observably equivalent"; hence the DAG is a specification, not a concrete instance of the build. Again, this is true even when no parallelism is involved. For instance, in the original `Makefile` above — Make, even with no parallelism, could simply randomize the order of the `.o` files every time, and the result would be the same binary.

- Now, you can pick some command in the set `S`. This command `A` results in an output. To get that output, you just walk the transitive chain of all things that `A` depends on and build all of them. This has a natural specification:

  ```
  Build(A) =
    DependenciesOfA = Build(D) for each D in DependenciesOf(A)
    return RunCommand(A, DependenciesOfA)
  ```

  This definition is naturally recursive, because it represents the recursive walk of all dependencies.

- Finally, the graph of commands is incremental in a natural, straightforward way: if a command is a pure command, then you simply need to check the "up-to-date"-ness of its inputs. (It is assumed this is less expensive than re-running the command itself.) We say this is incremental because after running `Build(A)` once, rerunning `Build` a second time with a changed `A` is faster than recomputing it from scratch; the change in time is "proportional", by some measure, to the change in `A`.
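As a concrete illustration of the partial order and the parallelism it permits, here is a small Python sketch — not Make's implementation, and with a made-up dependency table — that derives a schedule from the input/output edges. Each batch it prints contains commands that are mutually independent and may run in parallel; any ordering within or across batches that respects the edges is an equally valid realization of the same specification.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical commands, mapped to the inputs they consume
# (which are the outputs of other commands, or hand-written sources).
deps = {
    "foo.o": {"foo.c"},
    "bar.o": {"bar.c"},
    "a.exe": {"foo.o", "bar.o"},
    "foo.c": set(),   # sources have no producing command
    "bar.c": set(),
}

ts = TopologicalSorter(deps)
ts.prepare()
while ts.is_active():
    ready = ts.get_ready()        # everything here is mutually independent
    print("can run in parallel:", sorted(ready))
    ts.done(*ready)               # pretend we ran those commands
```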
So the final specification for `Build` looks something like this:

```
Build(A) =
  DependenciesOfA = Build(D) for each D in DependenciesOf(A)

  // pure commands can be skipped, if everything is up to date
  if IsPure(A) then {
    Result = IsUpToDate(A, DependenciesOfA)
    if Result is not null {
      return Result // it is, so return
    }
  }

  // otherwise just recompute; if it's pure, then we need
  // to record its up-to-date-ness
  Result = RunCommand(A, DependenciesOfA)
  if IsPure(A) then {
    RecordUpToDateResult(Result, A, DependenciesOfA)
  }
  return Result
```
This description might sound overly academic, but there is purpose to it: it is roughly the set of insights, and the rough abstract algorithm, that informs every build system! What remains is mostly a matter of things like "how do I recompute `DependenciesOfA` efficiently?" and how to record up-to-date-ness, which we'll get to.
Also, note that this does not do any "invalidation" of consumers of `A`. If you call `Build(A)` and then call `Build(B)` where `A < B`, then it's assumed that `IsUpToDate(B, DependenciesOfB)` will return `null` — assuming `A` was changed, according to the "up-to-date"-ness logic.
But this is a remarkably simple description! It's very important to understand this algorithm and the abstract description above.
This abstract description teases out very important parts of the design of `make`:

- Make only thinks of "inputs" and "outputs" as files and nothing more. Files depend on files, and that's it. But in the description above, `A` is more abstract.
- The check for "up-to-date"-ness is totally abstract, as alluded to earlier. `make` uses the filesystem as a simple cache to store the result of a command itself, and also checks the "up-to-date"-ness of that object by using `mtime` fields. This is part of its elegance (more on that later), but conflating these things has consequences in the large.
```
IsUpToDate(File, Dependencies) =
  if File does not exist {
    return null // needs to be built for the first time
  }
  t = mtime(File) // grab the file mtime
  for each DepFile in Dependencies {
    // if a dependency has a newer mtime than the
    // original file, it was rebuilt, so this needs
    // to be rebuilt too
    if mtime(DepFile) newer than t {
      return null
    }
  }
  // it exists, everything is up to date, so return
  // the file itself as the result
  return File

RecordUpToDateResult(Result, File, Dependencies) =
  // just write the result of the command to the filesystem,
  // and make sure the mtime is recorded. dependencies aren't
  // relevant
  WriteFileWithMtime(File, Result)
  return null
```
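For readers who prefer something executable, here is a minimal Python sketch of the same scheme — a toy, not Make — where each target carries its command, its dependencies, and a purity flag, and up-to-date-ness is exactly the existence + mtime check above. Every name here (`Rule`, `RULES`, `build`) is invented for the example.

```python
import os
import subprocess
from dataclasses import dataclass, field

@dataclass
class Rule:
    command: list[str]                          # command that produces the target file
    deps: list[str] = field(default_factory=list)
    pure: bool = True                           # volatile rules (like "test") set this to False

RULES: dict[str, Rule] = {}                     # a real Make parses this out of the Makefile

def is_up_to_date(target: str, deps: list[str]) -> bool:
    # The "existence + mtime" check: the target exists and is at least as
    # new as every one of its dependencies.
    if not os.path.exists(target):
        return False
    t = os.path.getmtime(target)
    return all(os.path.exists(d) and os.path.getmtime(d) <= t for d in deps)

def build(target: str) -> str:
    rule = RULES.get(target)
    if rule is None:
        return target                           # a hand-written source file: nothing to do
    for d in rule.deps:                         # bring every dependency up to date first
        build(d)
    if rule.pure and is_up_to_date(target, rule.deps):
        return target                           # cache hit: skip the command
    subprocess.run(rule.command, check=True)    # RunCommand
    return target                               # the filesystem records the new mtime for us

# Usage, mirroring the very first Makefile (paths are illustrative):
RULES["foo.o"] = Rule(["cc", "-c", "-O1", "-o", "foo.o", "foo.c"], deps=["foo.c"])
RULES["a.exe"] = Rule(["cc", "-o", "a.exe", "foo.o"], deps=["foo.o"])
# build("a.exe")
```

Note how `RecordUpToDateResult` disappears here: because the "record" is just the file and its mtime, writing the output is the record, which is exactly the conflation discussed above.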
However, this implementation is awkward. For example, a cache should be able to hold multiple versions of a file. In fact, a cache is just a glorified key-value store, so it's largely a matter of picking a good cache key, right? But a "modern filesystem" is a very poor cache when used the way `make` uses it, because it only allows one file to exist at a given "cache key", i.e. filesystem path. This gives extremely poor granularity from caching and makes many things awkward. For example, in the original `Makefile` at the beginning, only a single `a.exe` file can exist next to that `Makefile` on the filesystem. So if you only change the input file itself, the cache works. Yet you may recompile the same `a.exe` multiple times in different configurations, where caching doesn't work:
- Compile with `make CFLAGS=-O2` to create `a.exe`
- Force a recompilation with `touch *.c; make CFLAGS=-O3`
- Force a recompilation with `touch *.c; make CFLAGS=-O2`
Because only one `a.exe` exists on the filesystem at any time, you cannot cache multiple versions of `a.exe` built with different compiler flags. Therefore you must re-compile and re-link `a.exe` in step 3, even though that was previously done already in step 1, because it got overwritten in step 2. To have multiple copies of the exe file, you need to modify the build system to use a build directory, or some other prefix on the filename to disambiguate them, and also introduce some mechanism to create it:
```makefile
OUTDIR := out
SRCS := $(shell find . -type f -iname '*.c')
OBJS := $(patsubst %.c,$(OUTDIR)/%.o,$(SRCS))
CFLAGS := -O1

$(OUTDIR)/%.o: %.c
	mkdir -p $(OUTDIR) # do this as a hack
	$(CC) -c $(CFLAGS) -o $@ $<

$(OUTDIR)/a.exe: $(OBJS)
	$(CC) -o $@ $^

test: $(OUTDIR)/a.exe
	@$(OUTDIR)/a.exe

.PHONY: test
```
Now we can run `make CFLAGS=-O3 OUTDIR=out-O3` in order to separate the output directories. This introduces a prefix; in short, a kind of "prefix sharding" (the directory) of the keyspace (the filesystem). But this problem quickly proliferates when you have many combinations of options to mix and match:
- Optimization flags
- Sanitizers: thread, address, undefined
- Fuzzing builds
- Compilers (`gcc` vs `clang`)
- Linkers (`lld` vs `bfd`)
It may be important to test all of these configurations, and different variations of them too. For example, you can combine the address sanitizer with fuzzing and `-O3`, but not with `gcc`. There is effectively a cartesian product of all these inputs that is potentially valid.
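To see how quickly this grows, here's a tiny Python sketch; the option lists and the validity rule are invented purely for illustration, loosely echoing the example above. Every valid combination needs its own output prefix under the "prefix sharding" approach:

```python
from itertools import product

opt_levels = ["-O1", "-O2", "-O3"]
sanitizers = [None, "address", "thread", "undefined"]
compilers  = ["gcc", "clang"]
fuzzing    = [False, True]

def valid(opt, san, cc, fuzz):
    # Made-up constraint mirroring the text: fuzzing builds only work with clang.
    return not (fuzz and cc == "gcc")

configs = [c for c in product(opt_levels, sanitizers, compilers, fuzzing) if valid(*c)]
print(len(configs), "valid configurations, each needing its own OUTDIR")

for opt, san, cc, fuzz in configs[:3]:
    parts = ["out", cc, opt.lstrip("-")]
    parts += [san] if san else []
    parts += ["fuzz"] if fuzz else []
    print("-".join(parts))   # prints e.g. out-gcc-O1, out-clang-O1, out-clang-O1-fuzz
```

Each of those directory names is just a hand-rolled encoding of the command's non-file inputs into a path.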
By abusing the filesystem path as a cache key, the granularity of the cache is limited by how you construct file paths. In the example above, `$(OUTDIR)` proliferated everywhere. You'll also need to add special cases for all the sanitizer features, if you want those builds to stick around and not clobber each other.
By this metric, the above `Makefile` is still not foolproof, because it requires both `CFLAGS` and `OUTDIR` to be set by the user, and it doesn't stop you from overwriting entries in the "cache". You would also have to record `CFLAGS` into a file for it to be tracked by `make` properly, and then attach the `CFLAGS` to the `OUTDIR` path somehow. This is a lot of work to get caching that is granular at the level of these changes.
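As a sketch of what "attach the `CFLAGS` to the `OUTDIR` path" amounts to, you could derive the directory name from every non-file input, so that distinct configurations can never clobber one another. This helper is hypothetical, and written in Python rather than Make; the point is only that you end up hashing command parameters into a path by hand:

```python
import hashlib

def outdir_for(cflags: str, cc: str = "cc") -> str:
    # Fold each non-file input that affects the output into the directory name,
    # so every flag/compiler combination gets its own slot in the "cache".
    key = hashlib.sha256(f"{cc}\x00{cflags}".encode()).hexdigest()[:12]
    return f"out-{key}"

print(outdir_for("-O2"))                      # stable name for this configuration
print(outdir_for("-O3 -fsanitize=address"))   # a different directory, no clobbering
```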
This circles back to the first point mentioned about `make`'s design: the `CFLAGS` and `LDFLAGS` and compiler, and sanitizer mode — all those related variables are really inputs to the command, along with the `.c` file. But Make only recognizes files as inputs and only tracks files. Therefore we have to encode strings into filenames in order to abuse them as part of the cache key.
So how can we improve on this? Is there a better way to record results and check up-to-dateness? And can we improve on the conflation between "files" and "inputs"? We can.
But first, I want to return to this `Makefile`:

```makefile
OUTDIR := out
SRCS := $(shell find . -type f -iname '*.c')
OBJS := $(patsubst %.c,$(OUTDIR)/%.o,$(SRCS))
CFLAGS := -O1

$(OUTDIR)/%.o: %.c
	mkdir -p $(OUTDIR) # do this as a hack
	$(CC) -c $(CFLAGS) -o $@ $<

$(OUTDIR)/a.exe: $(OBJS)
	$(CC) -o $@ $^

test: $(OUTDIR)/a.exe
	@$(OUTDIR)/a.exe

.PHONY: test
```
This file is fine, but let's experiment a bit... what would something better look like? For example, imagine you want the user to be able to say `make ASAN=1` to enable the address sanitizer. Something like this would work:
```makefile
OUTDIR := out
SRCS := $(shell find . -type f -iname '*.c')
OBJS := $(patsubst %.c,$(OUTDIR)/%.o,$(SRCS))
CFLAGS := -O1

ifdef ASAN
CFLAGS += -fsanitize=address
endif

$(OUTDIR)/%.o: %.c
	mkdir -p $(OUTDIR) # do this as a hack
	$(CC) -c $(CFLAGS) -o $@ $<

$(OUTDIR)/a.exe: $(OBJS)
	$(CC) -o $@ $^

test: $(OUTDIR)/a.exe
	@$(OUTDIR)/a.exe

.PHONY: test
```
This modifies the `CFLAGS` variable based on whether the `ASAN` variable (a new input) is defined. But the reusability is poor if we want to introduce new variables, or compose variables, or exclude invalid combinations of features, with long chains of ifdefs.
More importantly, it doesn't separate concerns. If I just write C code, I have to go modify the build system, or structure it in such a way that the user can pass extra flags. I, as an author of the `Makefile`, need to have knowledge that this is something users will do.
But what if there was another way? Instead of having a chain of `ifdef`s for many things, maybe reusability could come another way?

For the sake of argument, let's imagine we could instead take a `make` rule like this

```
A: B C D
	cmd B C D > A
```

and abstract it out. For example, we could give it a name like a function, and then have placeholders for the parameters. Let's imagine a hypothetical syntax like the following for defining these "functions":

```
func do-cmd(out, inputs)
  $out: $inputs
    cmd $inputs > $out
endfunc
```
To create the equivalent of the original code, we just apply this function to concrete arguments:

```
func do-cmd(out, inputs)
  $out: $inputs
    cmd $inputs > $out
endfunc

do-cmd(A, B C D)
```

This is hypothetical, but the intent is roughly clear: the call to the function instantiates the rule, or "expands" it, like you'd expect. This looks like a lot of work for nothing, but it dramatically helps us reuse code almost immediately.
For example, we can completely abstract out the cflags rule from earlier:

```
func do-cc(out, input, cflags)
  $out: $input
    $(CC) -c $cflags -o $out $input
endfunc

do-cc(%.o, %.c, $(CFLAGS))
```
We can imagine the pattern rule `%.o` working in a similar way to the original Makefile; it describes a set of files that match. But if we're adding things, why stop at these functions? We could use real for loops too!
```
SRCS = $(shell find . -type f -iname '*.c')

func do-cc(...)
  ...
endfunc

for x in $(SRCS):
  do-cc(patsubst(%.c, %.o, $x), $x, $(CFLAGS))
```
We've moved away from the purely declarative nature of the Makefile with this. After all, pattern matching with the declarative `make` DSL is nice, but this adds a lot of flexibility. For example, we could just add another pattern with a prefix on the name for ASAN builds. Let's add variables too, why not?
```
for x in $(SRCS):
  var opath = patsubst(%.c, %.o, $x)
  do-cc($opath, $x, $(CFLAGS))
  do-cc(asan-$opath, $x, $(CFLAGS) -fsanitize=address)
```
In short, we declare two rules for every object file, one named `x.o` and one named `asan-x.o`; these are compiled with the same flags, but one includes `-fsanitize=address`. (You could also imagine an equivalent of this example using the `%` pattern, of course.)
This is actually a huge, huge step up, because no modification to any ambient state needs to happen. More importantly: this was implemented with no change to `do-cc` at all, whereas the original solution had to modify `CFLAGS`, and it's not always clear that doing so is correct. Here, there's no question about the conditions under which `CFLAGS` can be mutated, because nothing mutates it; passing a new argument is easy.
Now let's go one final step further and separate the functions and this part of the code into two files. Most of your `Makefile`s could then look like this:

```
.include cc-rules.mk

SRCS := shell(find . -type f -iname '*.c')
OBJS := patsubst(%.c, %.o, $(SRCS))

for x in $SRCS:
  var opath = patsubst(%.c, %.o, $x)
  do-cc($opath, $x, $(CFLAGS))

do-link(a.exe, $(OBJS))
```
But then why stop there? Let's just combine `do-cc` and `do-link` into one, and then why even bother making the `SRCS` variable when we can just pass it in:

```
func do-exe(out, srcs)
  var objs = patsubst(%.c, %.o, $srcs)
  for x in $srcs:
    do-cc(patsubst(%.c, %.o, $x), $x, $(CFLAGS))
  do-link($out, $objs)
endfunc
```
Finally:

```
.include cc-rules.mk

do-exe(a.exe, foo.c bar.c main.c)
```
This is a huge improvement! Why? Because we've separated the definition of the rule body (the function) from the definition of the rule (applying the function to arguments). This is key to abstracting builds at scale, because the implementation of `do-link` or `do-cc` or `do-exe` isn't revealed to any user. In contrast, the original `Makefile`, while brief, exposed many implementation details in the body of a rule. Brevity is actually an advantage in some ways, but when builds get large enough, it becomes a massive problem to value brevity over reuse and modularity.
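If the hypothetical syntax feels too hand-wavy, the same idea can be expressed as a small Python sketch, where every name (`Rule`, `RULES`, `do_cc`, `do_link`, `do_exe`) is invented for illustration: the function bodies own the command-line details and take only lexically scoped parameters, while the "Makefile" shrinks down to a few function applications.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    target: str
    deps: list[str]
    command: list[str]

RULES: list[Rule] = []   # what a build tool would later schedule as a DAG

def do_cc(out: str, src: str, cflags: list[str]) -> None:
    # The rule *body*: how a .c becomes a .o. Callers never see these details.
    RULES.append(Rule(out, [src], ["cc", "-c", *cflags, "-o", out, src]))

def do_link(out: str, objs: list[str]) -> None:
    RULES.append(Rule(out, objs, ["cc", "-o", out, *objs]))

def do_exe(out: str, srcs: list[str], cflags: list[str], prefix: str = "") -> None:
    objs = [prefix + s.removesuffix(".c") + ".o" for s in srcs]
    for src, obj in zip(srcs, objs):
        do_cc(obj, src, cflags)
    do_link(out, objs)

# The "Makefile" part: rule definition by application, with no bodies exposed.
do_exe("a.exe", ["foo.c", "bar.c", "main.c"], ["-O1"])
do_exe("asan-a.exe", ["foo.c", "bar.c", "main.c"], ["-O1", "-fsanitize=address"], prefix="asan-")
```

Because the parameters are ordinary lexically scoped arguments, the sanitizer variant required no mutation of a global `CFLAGS` and no change to `do_cc`; that is exactly the property the discussion below argues GNU Make's macros lack.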
But doesn't Make already have macros that can do this? Yes, but hear me out: they're kind of bad for the purpose we want here, which is reusability.
First off, it isn't clear which `make` you're talking about, though in practice everybody does mean "GNU Make" when they ask something like this.
Second: they're bad at large scale. I say this with a lot of respect for `make` (again, more on that later), but macros aren't a good unit of code reuse. Again, the goal is reuse. There are two basic reasons they aren't so hot at this:

- Because `make` mostly values brevity over readability, macros have poor syntax, especially as you add more of them that call each other and begin to deal with things like escaping rules. Shell is already very bad at this, and Make pushes it to new limits when you get far enough.
- More importantly, they aren't hygienic, and Make has no notion of lexical scope. Therefore when you reference a `$(VARIABLE)`, it is simply assumed to exist in the current scope (possibly having been defined somewhere else by a previous function), i.e. it uses a form of dynamic scoping. We've moved away from this in larger-scale programs for pretty good reason — it is easy to make a mess of it!

The first one is bad, but the second one is a killer at large scale; it also feeds into the first problem, where the lack of scope and lexical hygiene really exacerbates the readability problems.
Here's a good example of what I'm talking about: the Glasgow Haskell Compiler for a very long time used a large, powerful GNU Make based build system. GHC is a large, multi-language project, and the compiler has to bootstrap itself, compile its runtime, and support many features (code coverage, profiling, dynamic linking). This puts a lot of demand on the build system; you have lots of rules with subtle dependencies, and lots of similar-but-not-quite-the-same rules. Here's a snippet of that code (from `rules/build-dependencies.mk` in GHC 8.0.x):
```makefile
$$($1_$2_depfile_haskell) : $$($1_$2_HS_SRCS) $$($1_$2_HS_BOOT_SRCS) $$$$($1_$2_HC_MK_DEPEND_DEP) | $$$$(dir $$$$@)/.
	$$(call removeFiles,$$@.tmp)
ifneq "$$($1_$2_HS_SRCS)" ""
	"$$($1_$2_HC_MK_DEPEND)" -M \
	    $$($1_$2_$$(firstword $$($1_$2_WAYS))_MOST_DIR_HC_OPTS) \
	    $$($1_$2_MKDEPENDHS_FLAGS) \
	    $$($1_$2_HS_SRCS)
endif
```
This is part of a macro body for a macro named `build-dependencies`, and it is insane. And it isn't that way for no reason. A big part of it is just that Make doesn't even have concepts of lexical scope that match any programming language, or things like parameters that have actual names. You can't even tell what `$1` and `$2` are in this context, while their real names, hidden in comments — `$dir` and `$distdir` — would at least tell you something.
An incredible amount of effort is just spent on the semantics of variable expansion in various conditions. By the way, the above Makefile needs to work on Windows, too.
I used to work on the Glasgow Haskell Compiler as a core maintainer. The build system was easily the most impenetrable part of the whole codebase and daunting for every new and experienced contributor — for a production compiler that has academic research getting pumped into it year-round! There was even an entire design document laying out the ideas, but when the code you are dealing with is like this, things begin breaking down.
And before you say it: yes, we tried very, very hard to avoid the above. But in practice we had multiple programming languages; half a dozen required tools, from manually invoked preprocessors to code generators to documentation tools; rich dependencies; hundreds of thousands of lines of code. You want to avoid having too many external dependencies so that bootstrapping and porting are easier. You still need to support this thing for 2 more years because you have some users who are on ancient Linux machines but will upgrade after the support expiry. The list goes on and on.
"Don't have an insane build system" is a good goal to strive for. In practice, it is competing with 50 other goals. Nobody is writing 10,000 lines of GNU Make for fun, I can tell you that.
When you have a large enough system, you need primitives that allow code reuse. If you're going to add a unit of code reuse, the easiest one we know of is the function, and programmers generally have expectations about functions with respect to scope and name hygiene.
To end this section I will make my call to tool designers — please, if you are designing a DSL: do not repeat this mistake. Just add functions and function calls, at minimum. Maybe you need more than this, and maybe you even need a full blown language. But please, I am begging you, at least have real functions with a real notion of lexical scope.
More to be written soon.