Skip to content

Instantly share code, notes, and snippets.

@robbertvanginkel
Created March 22, 2021 21:01
Show Gist options
  • Save robbertvanginkel/b17e2b048d4a0694e8d9cb035bb3b0b7 to your computer and use it in GitHub Desktop.
Save robbertvanginkel/b17e2b048d4a0694e8d9cb035bb3b0b7 to your computer and use it in GitHub Desktop.
Understanding bazel bwotb

I've put in a bit of time understanding Bazel's caching better to see how this feature fits in. I've experimented with the following setup:

$ cat .bazelversion
1.1.0
$ cat .bazelrc
common --announce_rc
build --incompatible_strict_action_env
build --announce_rc
build --remote_download_minimal

build:http --remote_cache=http://127.0.0.1:8080
$ cat BUILD.bazel
genrule(
  name = "a",
  cmd = "echo a > $@",
  outs = ["a.txt"],
)

genrule(
  name = "b",
  srcs = ["a.txt"],
  cmd = "cat $(location :a.txt) > $@; echo b >> $@",
  outs = ["b.txt"],
)

genrule(
  name = "c",
  srcs = ["a.txt", "b.txt"],
  cmd = "cat $(location :a.txt) $(location :b.txt)  > $@; echo c >> $@",
  outs = ["c.txt"],
)

With a clear cache, running the following bazel command will fill the cache up to target b:

$ bazel --nohome_rc build --config http //:b

It makes the following requests on the cache:

2019/11/08 21:19:26  GET 404       127.0.0.1 /ac/734539cab61e72ccfc83c4dccfeb9c17ddddcf14ca923c6fb8fc3eb59bb1de75 
2019/11/08 21:19:26  PUT 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
2019/11/08 21:19:26  PUT 200       127.0.0.1 /cas/87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7
2019/11/08 21:19:26  PUT 200       127.0.0.1 /cas/734539cab61e72ccfc83c4dccfeb9c17ddddcf14ca923c6fb8fc3eb59bb1de75
2019/11/08 21:19:26  PUT 200       127.0.0.1 /cas/b01b0391d39289cc862e8f76e80048f418d5a8a8d75a3a3cdfe18f860e6c2326
2019/11/08 21:19:26  PUT 200       127.0.0.1 /ac/734539cab61e72ccfc83c4dccfeb9c17ddddcf14ca923c6fb8fc3eb59bb1de75 
2019/11/08 21:19:26  GET 404       127.0.0.1 /ac/7201514ba523bda8568c6af767b7ddd87b1786254414339136ab874842c3e3d7 
2019/11/08 21:19:26  PUT 200       127.0.0.1 /cas/911169ddaaf146aff539f58c26c489af3b892dff0fe283c1c264c65ae5aa59a2
2019/11/08 21:19:26  PUT 200       127.0.0.1 /cas/7201514ba523bda8568c6af767b7ddd87b1786254414339136ab874842c3e3d
2019/11/08 21:19:26  PUT 200       127.0.0.1 /cas/34252bce59f09a3a058c4581f8a09597eebcc19f7083c053cefdb51c02eb07f
2019/11/08 21:19:26  PUT 200       127.0.0.1 /ac/7201514ba523bda8568c6af767b7ddd87b1786254414339136ab874842c3e3d7 

We can inspect the cache and see that the contents are now:

$ tree cache/
cache/
├── cas
│   ├── 34252bce59f09a3a058c4581f8a09597eebcc19f7083c053cefdb51c02eb07f5
│   ├── 7201514ba523bda8568c6af767b7ddd87b1786254414339136ab874842c3e3d7
│   ├── 734539cab61e72ccfc83c4dccfeb9c17ddddcf14ca923c6fb8fc3eb59bb1de75
│   ├── 87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7
│   ├── 911169ddaaf146aff539f58c26c489af3b892dff0fe283c1c264c65ae5aa59a2
│   ├── b01b0391d39289cc862e8f76e80048f418d5a8a8d75a3a3cdfe18f860e6c2326
│   └── e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
└── ac
    ├── 7201514ba523bda8568c6af767b7ddd87b1786254414339136ab874842c3e3d7
    └── 734539cab61e72ccfc83c4dccfeb9c17ddddcf14ca923c6fb8fc3eb59bb1de75

We can decode the ac entries as build.bazel.remote.execution.v2.ActionResult and the corresponding hashes in the cas as build.bazel.remote.execution.v2.Action's:

$ protoc -I ../googleapis/ -I . \
     --decode=build.bazel.remote.execution.v2.ActionResult build/bazel/remote/execution/v2/remote_execution.proto \
     < ../cache/ac/734539cab61e72ccfc83c4dccfeb9c17ddddcf14ca923c6fb8fc3eb59bb1de75
output_files {
  path: "bazel-out/darwin-fastbuild/bin/a.txt"
  digest {
    hash: "87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7"
    size_bytes: 2
  }
}
stdout_digest {
  hash: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
}
stderr_digest {
  hash: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
}
$ protoc -I ../googleapis/ -I . \
      --decode=build.bazel.remote.execution.v2.Action build/bazel/remote/execution/v2/remote_execution.proto \
      < ../cache/cas/7201514ba523bda8568c6af767b7ddd87b1786254414339136ab874842c3e3d7 
command_digest {
  hash: "34252bce59f09a3a058c4581f8a09597eebcc19f7083c053cefdb51c02eb07f5"
  size_bytes: 283
}
input_root_digest {
  hash: "8dd51c306a2cb0ccede035eea9379d4af4f1db3e152b219455ccaedbce9ad86b"
  size_bytes: 165
}

$ protoc -I ../googleapis/ -I . \
     --decode=build.bazel.remote.execution.v2.ActionResult build/bazel/remote/execution/v2/remote_execution.proto \
     < ../cache/ac/7201514ba523bda8568c6af767b7ddd87b1786254414339136ab874842c3e3d7
output_files {
  path: "bazel-out/darwin-fastbuild/bin/b.txt"
  digest {
    hash: "911169ddaaf146aff539f58c26c489af3b892dff0fe283c1c264c65ae5aa59a2"
    size_bytes: 4
  }
}
stdout_digest {
  hash: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
}
stderr_digest {
  hash: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
}

$ protoc -I ../googleapis/ -I . \
      --decode=build.bazel.remote.execution.v2.Action build/bazel/remote/execution/v2/remote_execution.proto \
      < ../cache/cas/734539cab61e72ccfc83c4dccfeb9c17ddddcf14ca923c6fb8fc3eb59bb1de75
command_digest {
  hash: "b01b0391d39289cc862e8f76e80048f418d5a8a8d75a3a3cdfe18f860e6c2326"
  size_bytes: 200
}
input_root_digest {
  hash: "b9002e1da634b904c8bde2677cfba9a64f6a6e64a3c2d1e6f1f4915b84cecb79"
  size_bytes: 82
}

Looking at the actions, we can decode the command_digest's as build.bazel.remote.execution.v2.Command:

$ protoc -I ../googleapis/ -I . \
      --decode=build.bazel.remote.execution.v2.Command build/bazel/remote/execution/v2/remote_execution.proto \
      < ../cache/cas/b0/b01b0391d39289cc862e8f76e80048f418d5a8a8d75a3a3cdfe18f860e6c2326
arguments: "/bin/bash"
arguments: "-c"
arguments: "source external/bazel_tools/tools/genrule/genrule-setup.sh; echo a > bazel-out/darwin-fastbuild/bin/a.txt"
environment_variables {
  name: "PATH"
  value: "/bin:/usr/bin:/usr/local/bin"
}
output_files: "bazel-out/darwin-fastbuild/bin/a.txt"
platform {
}
$ protoc -I ../googleapis/ -I . \
      --decode=build.bazel.remote.execution.v2.Command build/bazel/remote/execution/v2/remote_execution.proto \
      < ../cache/cas/34/34252bce59f09a3a058c4581f8a09597eebcc19f7083c053cefdb51c02eb07f5
arguments: "/bin/bash"
arguments: "-c"
arguments: "source external/bazel_tools/tools/genrule/genrule-setup.sh; cat bazel-out/darwin-fastbuild/bin/a.txt > bazel-out/darwin-fastbuild/bin/b.txt; echo b >> bazel-out/darwin-fastbuild/bin/b.txt"
environment_variables {
  name: "PATH"
  value: "/bin:/usr/bin:/usr/local/bin"
}
output_files: "bazel-out/darwin-fastbuild/bin/b.txt"
platform {
}

and inspect the values for output_files and stdout_digest/stderr_digest:

$ cat ../cache/cas/87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7
a
$ cat ../cache/cas/911169ddaaf146aff539f58c26c489af3b892dff0fe283c1c264c65ae5aa59a2
a
b
$ cat ../cache/cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

With that understanding of the cache and all its contents, lets do a clean and a rebuild and checkout the requests to the cache:

$ bazel clean
$ bazel --nohome_rc build --config http //:b
...
2019/11/08 21:54:58  GET 200       127.0.0.1 /ac/734539cab61e72ccfc83c4dccfeb9c17ddddcf14ca923c6fb8fc3eb59bb1de75   	//:a ActionResult
2019/11/08 21:54:58  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 21:54:58  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 21:54:58  GET 200       127.0.0.1 /ac/7201514ba523bda8568c6af767b7ddd87b1786254414339136ab874842c3e3d7   	//:b ActionResult
2019/11/08 21:54:58  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 21:54:58  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest

We can see that in --remote_download_minimal mode, we download all the ActionResults required for the actions from the actioncache and the stdout/stderr digests for the actions from the content address store:

$ bazel clean
$ bazel --nohome_rc build --config http //:b --remote_download_toplevel
...
2019/11/08 21:59:35  GET 200       127.0.0.1 /ac/734539cab61e72ccfc83c4dccfeb9c17ddddcf14ca923c6fb8fc3eb59bb1de75   	//:a ActionResult
2019/11/08 21:59:35  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 21:59:35  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 21:59:35  GET 200       127.0.0.1 /ac/7201514ba523bda8568c6af767b7ddd87b1786254414339136ab874842c3e3d7   	//:b ActionResult
2019/11/08 21:59:35  GET 200       127.0.0.1 /cas/911169ddaaf146aff539f58c26c489af3b892dff0fe283c1c264c65ae5aa59a2  	bazel-out/darwin-fastbuild/bin/b.txt
2019/11/08 21:59:35  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 21:59:35  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest

With --remote_download_toplevel we can see we get the same, but also the toplevel output for target //:b.

$ bazel clean
$ bazel --nohome_rc build --config http //:b --remote_download_outputs=all
...
2019/11/08 22:01:30  GET 200       127.0.0.1 /ac/734539cab61e72ccfc83c4dccfeb9c17ddddcf14ca923c6fb8fc3eb59bb1de75   	//:a ActionResult
2019/11/08 22:01:30  GET 200       127.0.0.1 /cas/87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7  	bazel-out/darwin-fastbuild/bin/a.txt
2019/11/08 22:01:30  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 22:01:30  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 22:01:30  GET 200       127.0.0.1 /ac/7201514ba523bda8568c6af767b7ddd87b1786254414339136ab874842c3e3d7   	//:b ActionResult
2019/11/08 22:01:30  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 22:01:30  GET 200       127.0.0.1 /cas/911169ddaaf146aff539f58c26c489af3b892dff0fe283c1c264c65ae5aa59a2  	bazel-out/darwin-fastbuild/bin/b.txt
2019/11/08 22:01:30  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest

And with --remote_download_outputs=all, we get the regular behavior of attempting to download all intermediates.

With the above in mind, we can see that the cache:

  • Always checks the actioncache (ac) for results to actions
    • If not Found : bazel executes the action, uploads all results (output_files, command_digest, stdout_digest, stderr_digest) into the cas, then uploads the ActionResult into the ac.
    • If found : depending on the strategy
      • all : download stdout_digest, stderr_digest and output_files.
      • toplevel : download stdout_digest, stderr_digest always, output_files iff target is toplevel.
      • minimal : download stdout_digest, stderr_digest always, output_files never.

Because the cache doesn't make requests for the cache keys related to output_files nearly as often if it used to, likely our failures have something to do with this.


Now lets test some eviction scenarios. First, lets say the ActionResult for //:a gets evicted from the cache.

$ rm ../cache/ac/734539cab61e72ccfc83c4dccfeb9c17ddddcf14ca923c6fb8fc3eb59bb1de75
$ bazel clean
$ bazel --nohome_rc build --config http //:b
INFO: Invocation ID: 56ca7eb1-fd4f-463c-85a0-52c69fbdc460
INFO: Analyzed target //:b (5 packages loaded, 9 targets configured).
INFO: Found 1 target...
Target //:b up-to-date:
  bazel-bin/b.txt
INFO: Elapsed time: 0.223s, Critical Path: 0.10s
INFO: 2 processes: 1 remote cache hit, 1 darwin-sandbox.
INFO: Build completed successfully, 3 total actions
...
2019/11/08 22:26:22  GET 404       127.0.0.1 /ac/734539cab61e72ccfc83c4dccfeb9c17ddddcf14ca923c6fb8fc3eb59bb1de75   	//:a ActionResult
2019/11/08 22:26:22  PUT 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 22:26:22  PUT 200       127.0.0.1 /cas/87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7  	bazel-out/darwin-fastbuild/bin/a.txt
2019/11/08 22:26:22  PUT 200       127.0.0.1 /cas/734539cab61e72ccfc83c4dccfeb9c17ddddcf14ca923c6fb8fc3eb59bb1de75  	//:a Action
2019/11/08 22:26:22  PUT 200       127.0.0.1 /cas/b01b0391d39289cc862e8f76e80048f418d5a8a8d75a3a3cdfe18f860e6c2326  	//:a Command
2019/11/08 22:26:22  PUT 200       127.0.0.1 /ac/734539cab61e72ccfc83c4dccfeb9c17ddddcf14ca923c6fb8fc3eb59bb1de75   	//:a ActionResult
2019/11/08 22:26:22  GET 200       127.0.0.1 /ac/7201514ba523bda8568c6af767b7ddd87b1786254414339136ab874842c3e3d7   	//:b ActionResult
2019/11/08 22:26:22  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 22:26:22  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest

We can see bazel attempts to find an ActionResult first in the ActionCache, for which it finds a 404, so it executes the action locally and uploads the results to the cache. After this it continues to fetch the results for b.

Lets attempt to evict a stdout_digest/stderr_digest:

$ rm ../cache/cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
$ bazel clean
$ bazel --nohome_rc build --config http //:b
INFO: Invocation ID: 6182dbd9-21ce-4ffe-98f5-a9b31a4ed35f
INFO: Analyzed target //:b (5 packages loaded, 9 targets configured).
INFO: Found 1 target...
Target //:b up-to-date:
  bazel-bin/b.txt
INFO: Elapsed time: 0.220s, Critical Path: 0.08s
INFO: 2 processes: 1 remote cache hit, 1 darwin-sandbox.
INFO: Build completed successfully, 3 total actions
...
2019/11/08 22:41:16  GET 200       127.0.0.1 /ac/734539cab61e72ccfc83c4dccfeb9c17ddddcf14ca923c6fb8fc3eb59bb1de75   	//:a ActionResult
2019/11/08 22:41:16  GET 404       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 22:41:16  GET 404       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 22:41:16  PUT 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 22:41:16  PUT 200       127.0.0.1 /cas/87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7  	bazel-out/darwin-fastbuild/bin/a.txt
2019/11/08 22:41:16  PUT 200       127.0.0.1 /cas/734539cab61e72ccfc83c4dccfeb9c17ddddcf14ca923c6fb8fc3eb59bb1de75  	//:a Action
2019/11/08 22:41:16  PUT 200       127.0.0.1 /cas/b01b0391d39289cc862e8f76e80048f418d5a8a8d75a3a3cdfe18f860e6c2326  	//:a Command
2019/11/08 22:41:16  PUT 200       127.0.0.1 /ac/734539cab61e72ccfc83c4dccfeb9c17ddddcf14ca923c6fb8fc3eb59bb1de75   	//:a ActionResult
2019/11/08 22:41:16  GET 200       127.0.0.1 /ac/7201514ba523bda8568c6af767b7ddd87b1786254414339136ab874842c3e3d7   	//:b ActionResult
2019/11/08 22:41:16  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 22:41:16  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest

When using minimal mode, upon not finding the content for the stdout/stderr digests in the cas, Bazel executes the action locally and uploads it.

Now, lets attempt to evict the output_files for //:a,

$ rm ../cache/cas/87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7
$ bazel clean
$ bazel --nohome_rc build --config http //:b
INFO: Invocation ID: 5e7da007-8bfd-42a4-bfcb-6f1c19845627
INFO: Analyzed target //:b (5 packages loaded, 9 targets configured).
INFO: Found 1 target...
Target //:b up-to-date:
  bazel-bin/b.txt
INFO: Elapsed time: 0.112s, Critical Path: 0.01s
INFO: 2 processes: 2 remote cache hit.
INFO: Build completed successfully, 3 total actions
...
2019/11/08 22:47:04  GET 200       127.0.0.1 /ac/734539cab61e72ccfc83c4dccfeb9c17ddddcf14ca923c6fb8fc3eb59bb1de75   	//:a ActionResult
2019/11/08 22:47:04  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 22:47:04  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 22:47:04  GET 200       127.0.0.1 /ac/7201514ba523bda8568c6af767b7ddd87b1786254414339136ab874842c3e3d7   	//:b ActionResult
2019/11/08 22:47:04  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 22:47:04  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest

With minimal mode, this has no effect while building //:b, as bazel doesn't attempt to fetch that key from the cache.

However, now lets attempt to build //:c (which also has a dependency on //:a):

$ bazel clean
$ bazel --nohome_rc build --config http //:c
INFO: Invocation ID: 8411d527-7507-4b03-a3fb-7469e8ee0c4a
INFO: Analyzed target //:c (5 packages loaded, 11 targets configured).
INFO: Found 1 target...
ERROR: /Users/robbert/Desktop/remotetest/project/BUILD.bazel:14:1: Executing genrule //:c failed due to unexpected I/O exception: Failed to fetch file with hash '87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7' because it does not exist remotely. --experimental_remote_outputs=minimal does not work if your remote cache evicts files during builds.
java.io.IOException: Failed to fetch file with hash '87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7' because it does not exist remotely. --experimental_remote_outputs=minimal does not work if your remote cache evicts files during builds.
	...
Target //:c failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 0.121s, Critical Path: 0.01s
INFO: 2 processes: 2 remote cache hit.
FAILED: Build did NOT complete successfully
...
2019/11/08 22:51:36  GET 200       127.0.0.1 /ac/734539cab61e72ccfc83c4dccfeb9c17ddddcf14ca923c6fb8fc3eb59bb1de75   	//:a ActionResult
2019/11/08 22:51:36  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 22:51:36  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 22:51:36  GET 200       127.0.0.1 /ac/7201514ba523bda8568c6af767b7ddd87b1786254414339136ab874842c3e3d7   	//:b ActionResult
2019/11/08 22:51:36  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 22:51:36  GET 200       127.0.0.1 /cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  	emtpy digest
2019/11/08 22:51:36  GET 404       127.0.0.1 /ac/905038842e75345b093d088e0574f3a89b99aceae7e333278318f300dbd601bd   	//:c ActionResult
2019/11/08 22:51:36  GET 404       127.0.0.1 /cas/87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7  	bazel-out/darwin-fastbuild/bin/a.txt
2019/11/08 22:51:36  GET 200       127.0.0.1 /cas/911169ddaaf146aff539f58c26c489af3b892dff0fe283c1c264c65ae5aa59a2  	bazel-out/darwin-fastbuild/bin/b.txt) 

Here's where we get the exception. Bazel fetches the ActionResults for //:a and //:b from the ac, which both exist. It then attempts to fetch //:c, which does not exist, and so it attempts to fetch the inputs for //:c fromt the cas, which succeeds for //:b but fails for //:a.

In a circle of bazel clean & bazel build, this issue perists till the remote cache stops serving the ac result for //:a (or starts containing the corresponding cas entries).


Based on the above, here's my theory for how this caused our problems:

If I recall correctly, we didn't see this issue during th 1-2 week shadowing period, but it caused an issue ~1 day after rolling the change out to all CI jobs. That would correspond to about the time for cas entries to be evicted if they hadn't been fetched. During the shadow period, we still had jobs downloading all artifacts and touching most items in the cas, prolonging their lifetime. Once we switched all jobs to minimal, these items stopped being fetched as often and disappeared from the cache. That started causing issues for the few "rebuild the world" diffs in which they were actually needed.

To use the feature, I think our objective should be to figure out if we can change the cache to not serve ActionResults from the ac if we can't serve the corresponding outputs from the cas in the near future. Some strawman suggestions:

  • Before serving an item from the ac, verify we have all outputs present in the cas.
  • When serving an ac item, in addition to bumping that key as accessed in the lru, also bump all related cas keys as being accessed.
  • Only evict cas entries if they're not reffered to by ac items.
  • Dumb nuclear evict: when evicting evict all. This could be expensive for the first job that runs afterwards, but depending on the speedup all subsequent jobs receive maybe it is worth it.

I'm sure there are some caveats with the above, but I'd like for us to consider something like the above. In addition to the above, it'd like to dig into the expected performance gain a little bit more to see we should prioritise appropriately. Based on the above, go jobs doing a high number of writes, our buildtimes sometimes being 20 mins for a clean but mostly cached build, I believe we might be bandwith constrained and that this will bring us good speedups. Finding out the total time for a clean build downloading everything in the cache /w and /without build without the bytes should give us some insights.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment