sean-smith/3-2-0-beta.md Secret

## 3-2-0-beta.md

      
    Raw
  

              3-2-0-beta.md
            
          
    AWS ParallelCluster 3.2.0 Beta 🚀

Features

Queue parameters update


Description: Allow customers to update the queue parameters and replace the involved compute nodes without having to stop the whole compute fleet. Currently, such an update requires the compute fleet to be stopped.

Fast insufficient capacity fail-over


Description: Make the instance provisioning component aware of EC2 insufficient capacity failures so to prevent, for a configurable amount of time, the usage of specific compute resources that are configured in the scheduler queue.

Support Multiple FSx/EFS file systems


Description: Allow customers to attach multiple EFS and FSxLustre file systems. Currently, they can attach multiple EBS volumes, but only 1 EFS and 1 FSxLustre.

Support FSx for Ontap and OpenZFS


Description: Allow customers to attach up to 20 existing FSx for ONTAP and 20 existing FSx for OpenZFS to their clusters.

Memory-Based Scheduling


Description: Allow customers to submit jobs to nodes with particular memory constraints. Currently our default Slurm configuration does not allow it.

Setup

AWS ParallelCluster Manager


Setup a new stack following the quick create links on Github
Change the Version parameter to 3.2.0b2

Once the stack goes into CREATE_COMPLETE ~20 mins, you’ll get an email with a link and temporary password. Login and then you can create a cluster with the new beta version:

Keep in mind that most new features are not surfaced in the UI yet, however you can modify the yaml template directly to test and still take advantage of the Web UI for ease of use.

CLI

Prepare a virtual environment for ParallelCluster (detailed instructions: here, skipping steps 4 and 5):
python3 -m pip install --upgrade pip
python3 -m pip install --user --upgrade virtualenv
python3 -m virtualenv ~/pcluster-ve
source ~/pcluster-ve/bin/activate
Install Node Version Manager and Node.js:
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.38.0/install.sh | bash
. ~/.nvm/nvm.sh
nvm install node
node --version

Configure AWS credentials:
aws configure
Install ParallelCluster from PyPi:
pip install aws-parallelcluster==3.2.0b2
Feature Testing

How to test: Queue parameters update

We want to verify that you are able to update the parameters under the SlurmQueues section without restarting the compute fleet and without any impact on running jobs of the other queues (not impacted by the update).
All the parameters under the SlurmQueues can be updated selecting one of the two replacement strategies, except for DisableSimultaneousMultithreading,  InstanceType, MinCount and decreasing MaxCount.
The feature also give you the possibility to add new queues or new compute resources without having to stop the whole compute fleet. N.B. Removing queues or compute resources still requires a compute fleet stop.

Setup a 3.2.0b2 cluster with either the CLI or Pcluster Manager
Modify the queue parameters, for example I can increase the max count of a queue in Pcluster Manager from 4 to 5


Then I run Update and if the update goes through without giving me an error then it's working.

How to test: Fast insufficient capacity fail-over


Create a Cluster with multiple-compute resoures in the same queue:


In this example I used p4d.24xlarge which has very low capacity so I'm almost gauranteeting that it'll retry on another instance.

Submit a job to that queue:

$ sbatch -N 1 -n 1 --mem=4GB --wrap "sleep 30"

See that it first tries the p4dn.24xlarge instance then fails over to another instance type:

watch squeue
How to test: PERSISTENT_2 FSx Lustre Filesystem


Create a cluster with 3.2.0b2 and in the SharedStorage section include a snippet like:

SharedStorage:
  - Name: FsxLustre0
    StorageType: FsxLustre
    MountDir: /shared
    FsxLustreSettings:
      StorageCapacity: 1200
      DeploymentType: PERSISTENT_2
      PerUnitStorageThroughput: 250

Create the cluster and confirm the filesystem attached

How to test: Support Multiple FSx/EFS file systems


Create a cluster with 3.2.0b2 and in the SharedStorage section include a snippet like:

SharedStorage:
  - Name: FsxLustre0
    StorageType: FsxLustre
    MountDir: /shared
    FsxLustreSettings:
      StorageCapacity: 1200
      DeploymentType: SCRATCH_2
  - Name: FsxLustre1
    StorageType: FsxLustre
    MountDir: /shared2
    FsxLustreSettings:
      FileSystemId: fs-123456789
See Mount External Filesystem for instructions on how to create a filesystem you can use with pcluster.
How to test: FSx Netapp Ontap and OpenZFS Filesystem support


Create a Filesystem following one of the guides:


FSx Netapp Ontap
OpenZFS


Mount it to the cluster with the following yaml config:

- MountDir: /shared
  Name: Ontap
  StorageType: FsxOntap
  FsxOntapSettings:
    VolumeId: fs-1234567890
- MountDir: /shared1
  Name: ZFS
  StorageType: FsxOpenZfs
  FsxOpenZfsSettings:
    VolumeId: fs-1234567890

Verify it gets mounted correctly.

How to test: Improving awareness of EFA

Using Pluster Manager:

Confirm that when you select an EFA enabled instance type, such as hpc6a.48xlarge, or c6i.32xlarge, both Placement Group and EFA are enabled.


Confirm that when you select an instance type without EFA they are similarly disabled.

How to test: Memory-Based Scheduling


Create a cluster and on the review screen add:

SlurmSettings:
        EnableMemoryBasedScheduling: true # Default is: false
For example, in pcluster manager that's:

Once the cluster is created SSH in and then you can submit jobs like so:
$ sbatch -N 1 -n 1 --mem=4GB --wrap "sleep 30"
$ sbatch -N 1 -n 1 --mem=4GB --wrap "sleep 30"
Lets say you have a queue with t2.micro:


Instance Type
vCPUs
Memory (GB)


t2.medium
2
4.0


Without memory scheduling you could schedule two jobs on it since it has 2 vcpus. However by specifying --mem=4GB you're restricting it to a single job since it only has 4 GB.