Bursting is fully working on Compute Engine (Google Cloud) for both isolated (not connected) and connected brokers. This was a little more challenging to get working (configuration is done with Terraform instead of a Kubernetes operator, still with the flux-burst Python module) so I want to share some background, design choices and what I learned.
If you are bursting from GKE to Compute Engine (or from somewhere else to Compute Engine) you can use the flux-burst-compute-engine plugin, and an example is available alongside the operator. You'll need to build the burst VM too. Importantly, your local and bursted-to clusters must conform to the following:
- The flux user id must match between two instances (e.g., here we use 1004, built into VMs and set here)
- The flux lib directory (e.g., `/usr/lib` and `/usr/lib64`) should match (you'll probably be OK installing on same OS with same method)
- The flux install location should be the same (e.g., `/usr` and `/usr/local` will have an error)
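The three requirements above can be checked before attempting a burst. This is a minimal sketch, not part of flux-burst: the helper names `local_flux_facts` and `mismatches` are hypothetical, and it assumes the `flux` binary is on the `PATH` and that the core library file name starts with `libflux-core`. Run `local_flux_facts()` on each side and compare the two results.

```python
import os
import pwd
import shutil


def local_flux_facts(user="flux"):
    """Collect the values that need to match on this host (hypothetical helper)."""
    try:
        uid = pwd.getpwnam(user).pw_uid
    except KeyError:
        uid = None  # no such user on this host

    # Assumption: flux is on the PATH, e.g. /usr/bin/flux gives prefix /usr
    flux_bin = shutil.which("flux")
    prefix = os.path.dirname(os.path.dirname(flux_bin)) if flux_bin else None

    # Assumption: the core library lives in <prefix>/lib64 or <prefix>/lib
    libdir = None
    if prefix:
        for name in ("lib64", "lib"):
            candidate = os.path.join(prefix, name)
            if os.path.isdir(candidate) and any(
                f.startswith("libflux-core") for f in os.listdir(candidate)
            ):
                libdir = candidate
                break
    return {"flux_uid": uid, "prefix": prefix, "libdir": libdir}


def mismatches(local, remote):
    """Return {key: (local_value, remote_value)} for every value that differs."""
    keys = sorted(set(local) | set(remote))
    return {
        k: (local.get(k), remote.get(k))
        for k in keys
        if local.get(k) != remote.get(k)
    }
```

An empty result from `mismatches` means the user id, install prefix and library directory agree; anything else is worth fixing before debugging broker errors.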
Note that the plugin will support local or mock bursts, but the setup needs to be refactored to use the (now) one VM image, and I didn't have time today.
Our original Terraform modules built three images (compute, login, manager) and I found them challenging to develop for because of the complexity. E.g., there were three m4 templates generating different entrypoint scripts, three VM builds, and a bunch of different Terraform variables that would be parsed into the VMs at startup time via the Google Metadata server. The tiniest change to a setup step meant rebuilding the images, each taking 17-20 minutes, or ~an hour all together. While the design is probably OK for a production setup (that doesn't need to change a lot), debugging and developing was really hard. Doing multiple builds for different versions of an OS, Flux, or different OSs would compound that.
I had wanted to refactor this for some time to be similar in design to the operator, just having one image (container or machine image in this case), and the final push to do that was yesterday when I realized the design of this first setup wasn't right for bursting. It would set up the default flux.service, meaning it would try to start a second main broker, and not only did we not want this, the bursting design (just having follower brokers) rendered the login/manager nodes kind of useless. We weren't even including the manager node in the hostlist! Up to this point I was trying my best to work off of the upstream Google-maintained repository, but when I realized we needed something totally different, I used it as an opportunity for a refactor.
For the refactor, we now have one "bursted" image that has all the metadata server parsing removed, and instead just installs flux modules. All of the customization and configuration is done via a startup script, which is much easier to iterate on because you don't need to rebuild the VM. To be clear, this includes both Terraform modules and a VM build with Packer and Compute Engine (I removed Google Build because it also added unnecessary complexity).
For the startup script, since (I read somewhere) there is a time limit of ~10 minutes, we obviously can't just run the broker from there, and something like nohup might not be reliable. So instead I have the startup script write a `flux-start.service` that only exists to start a worker broker, and is started at the end of the startup script.
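To illustrate the idea of a startup script that writes a unit file instead of running the broker itself, here is a minimal sketch. It is not the actual startup script: `render_unit` and `write_unit` are hypothetical helpers, and the `ExecStart` command passed in below is only a placeholder for whatever broker command the real setup uses.

```python
from pathlib import Path

# Bare-bones unit: systemd, not the startup script, keeps the broker running
UNIT_TEMPLATE = """\
[Unit]
Description={description}
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User={user}
ExecStart={exec_start}

[Install]
WantedBy=multi-user.target
"""


def render_unit(exec_start, user="flux", description="Flux follower broker"):
    """Return the text of a systemd unit that runs one broker command."""
    return UNIT_TEMPLATE.format(
        description=description, user=user, exec_start=exec_start
    )


def write_unit(text, unit_dir="/etc/systemd/system", name="flux-start.service"):
    """Write the unit file and return its path."""
    path = Path(unit_dir) / name
    path.write_text(text)
    return path


if __name__ == "__main__":
    # Placeholder command: the real broker options depend on your configuration
    print(render_unit("/usr/bin/flux broker --config-path=/etc/flux/system/conf.d"))
```

After writing the file, the startup script would finish with `systemctl daemon-reload` and `systemctl start flux-start.service`, so it can exit well inside any time limit while the broker stays up.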
- the library install roots (e.g., `--libdir`) I think need to match. Rocky installing by default to `/usr/lib64` and the first cluster to `/usr/lib` led to an issue with `flux_open` not being found. Copying them entirely over fixed it. In the future we likely want to have builds for different OS bases (much easier to do with a single image build)!
- the install paths for Flux need to line up. I found when I tried to use an install at `/usr` and had `/usr/local` on the VM, the brokers would connect and the job would run but with an error: `3451.524s: job.exception type=exec severity=0 job shell exec error on broker gffw-compute-a-003 (rank 6): /usr/libexec/flux/flux-shell: No such file or directory`. Ensuring both installs were to `/usr` fixed this issue. I suspect there would be similar nuances with matching versions.
- Terraform is fairly reliable for configuration because (generally) it's using the APIs closest to the services. I say this because we've been struggling with CloudFormation on AWS recently, which is not doing that.
- If the local cluster doesn't give a good reason for an error, try debugging from the remote (example below)
From the local cluster I saw:
5029.405s: job.exception type=exec severity=0 gffw-compute-a-001 (ranks 4) terminated before first barrier
flux-job: task(s) exited with exit code 1
5029.394s: flux-shell[4]: stderr: flux-shell: FATAL: flux_open: No such file or directory
And I was able to get more information trying to run something from the remote:
$ flux run echo hello
flux-job: task(s) exited with exit code 1
0.054s: flux-imp[6]: stderr: flux-imp: Warning: loading config: /usr/flux/imp/conf.d/*.toml: -1: Read error
0.055s: flux-imp[6]: stderr: flux-imp: Fatal: Failed to load configuration
flux-job: No job output found
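When comparing output like this between the local and remote side, it can help to pull out which component, rank and file path each message mentions. The following is a rough sketch that assumes lines shaped like the ones above (`<seconds>s: <source>[<rank>]: <message>`); `parse_line` is a hypothetical helper, not part of Flux.

```python
import re

# Matches e.g. "0.054s: flux-imp[6]: stderr: ..." or "5029.405s: job.exception type=exec ..."
LINE_RE = re.compile(
    r"^(?P<time>\d+\.\d+)s: (?P<source>[\w.-]+)(?:\[(?P<rank>\d+)\])?:? (?P<message>.*)$"
)
# Anything that looks like an absolute path (globs such as *.toml included)
PATH_RE = re.compile(r"(/[\w./*-]+)")


def parse_line(line):
    """Parse one timestamped output line; return None for lines without a timestamp."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    rank = match.group("rank")
    message = match.group("message")
    return {
        "time": float(match.group("time")),
        "source": match.group("source"),
        "rank": int(rank) if rank is not None else None,
        "message": message,
        "paths": PATH_RE.findall(message),
    }


if __name__ == "__main__":
    line = (
        "0.054s: flux-imp[6]: stderr: flux-imp: Warning: loading config: "
        "/usr/flux/imp/conf.d/*.toml: -1: Read error"
    )
    print(parse_line(line))
```

The extracted paths are a quick hint about which install prefix a component expects, which is exactly the kind of mismatch described in the list above.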