Bursting to Compute Engine Notes

Details about bursting to Compute Engine: Ubuntu on GKE to Rocky Linux.

Bursting is fully working on Compute Engine (Google Cloud) for both isolated (not connected) and connected brokers. This was a little more challenging to get working (configuration is done with Terraform instead of a Kubernetes operator, still with the flux-burst Python module), so I want to share some background, design choices, and what I learned.

TLDR

If you are bursting from GKE to Compute Engine (or from somewhere else to Compute Engine) you can use the flux-burst-compute-engine plugin, and an example is available alongside the operator. You'll need to build the burst VM too. Importantly, your local and bursted-to clusters must conform to the following (a quick sanity check is sketched after the list):

  • The flux user id must match between the two instances (e.g., here we use 1004, built into the VMs and set here)
  • The flux lib directory (e.g., /usr/lib vs. /usr/lib64) must match (you'll probably be OK installing on the same OS with the same method)
  • The flux install location must be the same (e.g., mixing /usr and /usr/local will produce an error)
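
As a quick sanity check, you can print these three things on a node of each cluster and compare the output. This is only a sketch under assumptions about a typical install: the flux user being literally named "flux" and the library file name libflux-core.so are my assumptions here, not requirements of the plugin.

```bash
#!/bin/bash
# Sketch: print the settings that must match between the local and bursted-to
# clusters. Run on a node of each cluster and diff the output.

# 1. The flux user id must match (assumes the user is literally named "flux").
echo "flux uid: $(id -u flux 2>/dev/null || echo 'no flux user')"

# 2. The install location must match (e.g., /usr/bin/flux vs. /usr/local/bin/flux).
echo "flux binary: $(command -v flux || echo 'flux not on PATH')"

# 3. The library directory must match (e.g., /usr/lib vs. /usr/lib64).
for dir in /usr/lib /usr/lib64 /usr/local/lib /usr/local/lib64; do
    if ls "${dir}"/libflux-core.so* >/dev/null 2>&1; then
        echo "libflux-core found in: ${dir}"
    fi
done
```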

Note that the plugin will also support local or mock bursts, but that setup needs to be refactored to use the (now) single VM image, and I didn't have time for that today.

Design choices:

VM Images

Our original Terraform modules built three images (compute, login, manager), and I found them challenging to develop against because of the complexity. There were three m4 templates generating different entrypoint scripts, three VM builds, and a bunch of Terraform variables that were parsed into the VMs at startup time via the Google metadata server. The tiniest change to a setup step meant rebuilding the images, each taking 17-20 minutes, or roughly an hour all together. While that design is probably OK for a production setup (one that doesn't need to change a lot), debugging and developing was really hard, and doing multiple builds for different versions of an OS or Flux, or for different OSes, would compound that.

I had wanted to refactor this for some time to be similar in design to the operator, with just one image (a container, or in this case a machine image), and the final push came yesterday when I realized the design of the first setup wasn't right for bursting. It was setting up the default flux.service, meaning it would try to start a second main broker; not only did we not want that, but the bursting design (which only needs follower brokers) made the login and manager nodes kind of useless. We weren't even including the manager node in the hostlist! Up to this point I had been trying my best to work off of the upstream Google-maintained repository, but when I realized we needed something totally different, I used it as an opportunity for a refactor.

For the refactor, we now have one "bursted" image that has all of the metadata server parsing removed and instead just installs Flux modules. All of the customization and configuration is done via a startup script, which is much easier to iterate on because you don't need to rebuild the VM. To be clear, this includes both the Terraform modules and a VM build with Packer on Compute Engine (I removed Google Cloud Build because it also added unnecessary complexity).

Startup

For the startup script, since (I read somewhere) there is a time limit of ~10 minutes, we obviously can't just run the broker from there, and something like nohup might not be reliable. So instead I have the startup script write a flux-start.service unit that only exists to start a worker (follower) broker, and the service is started at the end of the startup script.
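
As a rough illustration, the tail end of such a startup script might look like the sketch below. The unit name flux-start.service comes from the notes above, but the flux user, broker options, and config path are my assumptions for illustration, not the plugin's actual generated output:

```bash
#!/bin/bash
# Sketch of the end of a Compute Engine startup script: instead of running the
# broker directly (startup scripts are time-limited), write a systemd unit that
# starts a follower broker, then enable and start it.

cat <<'EOF' > /etc/systemd/system/flux-start.service
[Unit]
Description=Start a Flux follower broker for bursting
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=flux
# The config path here is an assumption; it would point at the broker/bootstrap
# config written earlier in the startup script.
ExecStart=/usr/bin/flux broker --config-path=/etc/flux/system/conf.d
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now flux-start.service
```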

What we learned:

  • The library install roots (e.g., --libdir) need to match, I think - Rocky installing by default to /usr/lib64 and the first cluster installing to /usr/lib led to flux_open not being found. Copying the libraries over entirely fixed it. In the future we likely want to have builds for different OS bases (much easier to do with a single image build)!
  • The install paths for Flux need to line up. When I tried to use an install at /usr locally with /usr/local on the VM, the brokers would connect and the job would run, but with an error: `3451.524s: job.exception type=exec severity=0 job shell exec error on broker gffw-compute-a-003 (rank 6): /usr/libexec/flux/flux-shell: No such file or directory`. Ensuring the install was to /usr on both sides fixed this issue. I suspect there would be similar nuances with matching versions.
  • Terraform is fairly reliable for configuration because (generally) it's using the APIs closest to the services. I say this because we've recently been struggling with CloudFormation on AWS, which is not doing that.
  • If the local cluster doesn't give a good reason for an error, try debugging from the remote (example below).

From the local cluster I saw:

```
5029.405s: job.exception type=exec severity=0 gffw-compute-a-001 (ranks 4) terminated before first barrier
flux-job: task(s) exited with exit code 1
5029.394s: flux-shell[4]: stderr: flux-shell: FATAL: flux_open: No such file or directory
```

And I was able to get more information by trying to run something from the remote:

```
$ flux run echo hello
flux-job: task(s) exited with exit code 1
0.054s: flux-imp[6]: stderr: flux-imp: Warning: loading config: /usr/flux/imp/conf.d/*.toml: -1: Read error
0.055s: flux-imp[6]: stderr: flux-imp: Fatal: Failed to load configuration
flux-job: No job output found
```