@sean-smith
Created April 27, 2022 23:15

Slurm Failover from Spot to On-Demand

In AWS ParallelCluster you can set up a cluster with two queues, one using Spot pricing and one using On-Demand. When a job fails because of a Spot reclamation, you can have Slurm automatically run it again in the On-Demand queue.

To set that up, first create a cluster with a Spot queue and an On-Demand queue (the entries below go under SlurmQueues in the cluster config):

  - Name: od
    ComputeResources:
      - Name: c6i-od-c6i32xlarge
        MinCount: 0
        MaxCount: 4
        InstanceType: c6i.32xlarge
        Efa:
          Enabled: true
          GdrSupport: true
        DisableSimultaneousMultithreading: true
    Networking:
      SubnetIds:
        - subnet-846f1aff
      PlacementGroup:
        Enabled: true
  - Name: spot
    ComputeResources:
      - Name: c6i-spot-c6i32xlarge
        MaxCount: 4
        InstanceType: c6i.32xlarge
    Networking:
      SubnetIds:
        - subnet-846f1aff
    CapacityType: SPOT
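
Once the cluster is up, it can help to confirm that both queues exist as Slurm partitions before submitting anything. The following is a minimal sketch, assuming it runs on the head node where sinfo is available; queues_present is a hypothetical helper name, and the queue names od and spot are taken from the config above.

```shell
#!/usr/bin/env bash
# Sketch: check that the queues from the cluster config exist as Slurm partitions.
# queues_present is a hypothetical helper, not part of Slurm or ParallelCluster.

queues_present() {
    local partitions queue
    # %R prints the bare partition name; a partition may appear on several lines.
    partitions=$(sinfo --noheader --format="%R") || return 1
    for queue in "$@"; do
        # -x matches the whole line, -F treats the name as a fixed string.
        if ! printf '%s\n' "$partitions" | grep -qxF -- "$queue"; then
            echo "missing queue: $queue" >&2
            return 1
        fi
    done
    echo "all queues present: $*"
}
```

Calling queues_present od spot returns non-zero and names the first missing queue if either partition is absent.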

Next, submit your job to the Spot queue, then submit the same script to the On-Demand queue with a dependency on the first job:

$ sbatch -p spot --no-requeue submit.sbatch
Submitted batch job 1
$ sbatch -p od -d afternotok:1 submit.sbatch
Submitted batch job 2
  • --no-requeue tells Slurm never to requeue the first job, so a Spot interruption makes it fail instead of restarting in the spot queue.
  • -d afternotok:1 tells Slurm to run the second job only if the first one (job ID 1) ends in a failed state.
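
The two submissions above hard-code job ID 1. As a minimal sketch of how the pair could be scripted, the helper below captures the Spot job's ID with sbatch --parsable and passes it to the dependency. submit_with_failover is a hypothetical name, the queue names spot and od are assumed to match the config above, and the sketch uses the --no-requeue spelling documented in the sbatch man page. Adding --kill-on-invalid-dep=yes is an assumption about the desired behavior: it asks Slurm to cancel the On-Demand job when the Spot job succeeds and the dependency can no longer be met, rather than leaving it pending.

```shell
#!/usr/bin/env bash
# Sketch: submit a job to the Spot queue with an On-Demand fallback.
# submit_with_failover is a hypothetical helper; "spot" and "od" are the
# queue names assumed from the cluster config.

submit_with_failover() {
    local script="$1"
    local spot_id od_id

    # --parsable prints only the job ID (or "id;cluster"), so drop anything after ';'.
    spot_id=$(sbatch --parsable -p spot --no-requeue "$script") || return 1
    spot_id=${spot_id%%;*}

    # The fallback starts only if the Spot job ends in a failed state.
    # --kill-on-invalid-dep=yes cancels it if that can never happen.
    od_id=$(sbatch --parsable -p od -d "afternotok:${spot_id}" \
        --kill-on-invalid-dep=yes "$script") || return 1
    od_id=${od_id%%;*}

    echo "spot=${spot_id} od=${od_id}"
}
```

Running submit_with_failover submit.sbatch prints both job IDs, which can then be passed to squeue -j to watch the pair.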