If your cluster tries to launch instances and fails 10 times, it automatically goes into PROTECTED
mode. This disables instance provisioning until the compute fleet is restarted.
You'll see inact as the status of the queue when the cluster is in PROTECTED mode:
[ec2-user@ip-10-0-0-98 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
default* inact infinite 2 idle~ spot-dy-compute-[1-100]
To disable protected mode, set the protected_failure_count parameter to 0. This parameter is the number of launch failures after which the cluster goes into protected mode; a value of 0 turns the feature off.
sudo su -
echo "protected_failure_count = 0" >> /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
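Because the echo command above appends to the file, running it more than once leaves several protected_failure_count lines behind. The following is a minimal sketch of an idempotent alternative. The helper name set_conf_value is hypothetical and not part of ParallelCluster, and the sketch assumes GNU sed and a config file that uses plain "key = value" lines.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ParallelCluster): write "key = value" to a
# config file, replacing an existing assignment instead of appending a duplicate.
set_conf_value() {
  conf_file="$1"; conf_key="$2"; conf_value="$3"
  if grep -qE "^[[:space:]]*${conf_key}[[:space:]]*=" "$conf_file"; then
    # Key already present: rewrite that line in place (GNU sed syntax).
    sed -i -E "s|^[[:space:]]*${conf_key}[[:space:]]*=.*|${conf_key} = ${conf_value}|" "$conf_file"
  else
    # Key absent: append it, as the original echo command does.
    printf '%s = %s\n' "$conf_key" "$conf_value" >> "$conf_file"
  fi
}

# Example (run as root on the head node):
# set_conf_value /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf protected_failure_count 0
```

Calling the helper repeatedly with the same arguments leaves a single protected_failure_count line in the file, so it is safe to use in a bootstrap script that may run more than once.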
See Protected Mode docs for more information.