@DarkSector
DarkSector / hyperpod-precheck.py
Last active February 26, 2024 19:14
Check Hyperpod runtime or parse training script for non-supported parameters.
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
import os
import re
import sys
import json
import socket
import pathlib
@sean-smith
sean-smith / instance-id-slurm.md
Last active March 13, 2024 18:29
Get instance ID to hostname mapping from a Slurm job.

Slurm Get Instance ID to Hostname

Update: you only need the following:

mpirun -N 1 -n 2 bash -c 'echo $(hostname): $(cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " ")'
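The command above prints one `hostname: instance-id` line per node. As a rough sketch, that output could be turned into a lookup table like this. The function name `parse_instance_map` and the sample hostnames and instance IDs are made up for illustration; only the `host: value` line shape comes from the command itself.

```python
from typing import Dict


def parse_instance_map(output: str) -> Dict[str, str]:
    """Map hostname -> instance ID from lines shaped like 'host: i-0abc...'."""
    mapping = {}
    for line in output.splitlines():
        host, sep, instance_id = line.partition(":")
        if not sep:
            continue  # skip anything that is not a 'host: value' line
        host, instance_id = host.strip(), instance_id.strip()
        if host and instance_id:
            mapping[host] = instance_id
    return mapping


# Hypothetical sample output; real hostnames and IDs will differ.
sample = (
    "queue0-dy-compute-1: i-0123456789abcdef0\n"
    "queue0-dy-compute-2: i-0fedcba9876543210\n"
)
instance_map = parse_instance_map(sample)
print(instance_map)
```

In practice the `sample` string would be replaced by the captured stdout of the `mpirun` call.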
  1. Create a file get-instance-id.sh:
@sean-smith
sean-smith / pcluster-ssh.md
Last active November 8, 2023 08:02
Easily SSH into your cluster

ParallelCluster Easy SSH

All credit to @tpbrown for this solution.

Usage:

ssh clustername

🚀 Wifi

So naturally the first thing I wanted to do when we got fiber internet was to rename the wifi network to something sexier than "CenturyLink0483". I decided on 🚀.

To do so I navigated to the router setup page at 192.168.0.1, cringing at all the '90s tech it employs.

Then I added 🚀 and tried to update.

@sean-smith
sean-smith / spot-starccm+-termination.md
Last active June 17, 2022 18:37
StarCCM+ Spot Instance Termination

Save StarCCM+ State in AWS ParallelCluster

Spot termination gives a 2-minute warning before terminating the instance. This time period allows you to gracefully save data in order to resume later.
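To make the 2-minute window concrete, here is a minimal sketch of how a script might work out how much time is left once a notice arrives. It assumes the notice is the JSON document that EC2 serves at the instance metadata path `/latest/meta-data/spot/instance-action` (with `action` and `time` fields). The function name and the sample values are hypothetical, and this is not the `spot.sh` script from the gist. The HTTP call itself is left out so the example runs anywhere.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(notice_body: str, now: datetime) -> float:
    """Return the seconds left before the action time in a Spot notice."""
    notice = json.loads(notice_body)
    # The notice time is a UTC timestamp such as 2022-06-17T18:37:00Z.
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


# Hypothetical notice body and clock reading, for illustration only.
sample_notice = '{"action": "terminate", "time": "2022-06-17T18:37:00Z"}'
now = datetime(2022, 6, 17, 18, 35, 0, tzinfo=timezone.utc)
remaining = seconds_until_interruption(sample_notice, now)
print(f"{remaining:.0f} seconds left to save state")
```

A polling loop would call this whenever the metadata endpoint stops returning a 404, then trigger the save with whatever time remains.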

In the following I describe how this can be done with StarCCM+ in AWS ParallelCluster 3.X:

Setup

  1. Create a post-install script spot.sh like so:

Mount FSx Lustre on AWS Batch

This guide describes how to mount an FSx Lustre filesystem on AWS Batch. I give an example CloudFormation stack to create the AWS Batch resources.

I loosely follow this guide.

For the parameters, it's important that the Subnet, Security Group, FSx ID and FSx Mount Name follow the guidelines below:

Parameter Description

Mount FSx NetApp ONTAP with AWS ParallelCluster

FSx NetApp is a multi-protocol filesystem. It mounts on Windows as SMB, on Linux as NFS, and on Mac. This allows cluster users to bridge their Windows and Linux machines with the same filesystem, potentially running both Windows and Linux machines for a post-processing workflow.

Pros

  • Multi-Protocol
  • Hybrid support
  • Multi-AZ (for High Availability)
@sean-smith
sean-smith / 01-pcluster-multiuser.md
Last active May 9, 2022 21:25
How to set up a multi-user AWS ParallelCluster Environment

Multi-User AWS ParallelCluster

In this example we're going to set up an HPC environment with AWS ParallelCluster and connect it to Microsoft AD, an AWS service that allows you to create managed Active Directory user pools. You can read more about it in the AD Tutorial.

You have three different options for AD provider. We're going to go with Microsoft AD due to its regional availability, which allows us to use it in the same region (Ohio) as our hpc6a.48xlarge instances.

Type Description
Simple AD Open AD protocol, supported in only a [few](https://docs.aws.amazon.com/directoryservice/
@sean-smith
sean-smith / dcv.md
Last active April 5, 2022 19:33
Create a desktop visualization queue with AWS ParallelCluster and NICE DCV

DCV Visualization Queue

When DCV is enabled, the default behaviour of AWS ParallelCluster is to run a single DCV session on the head node. This is a quick and easy way to visualize the results of your simulations or run a desktop application such as StarCCM+.

A common ask is to run DCV sessions on a compute queue instead of the head node. This has several advantages, namely:

  1. Run multiple sessions on the same instance (possibly with different users per-session)
  2. Run a smaller head node and only spin up more-expensive DCV instances when needed. We set a 12 hr timer below that automatically kills sessions after we leave.

Setup

@sean-smith
sean-smith / example.py
Created September 29, 2021 15:13
Call AWS ParallelCluster API with Python
#!/usr/bin/env python3
import json
from base64 import b64decode, b64encode
from pprint import pprint
import boto3
import botocore
import yaml
import requests
def sigv4_request(method, host, path, params, headers, body):