Scripts to create a Slurm cluster on GCP with Terraform

README

Steps

  1. Install Google Cloud SDK: to manipulate cloud resources

  2. Install Terraform: to create/destroy clusters from pre-defined specs
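
    Both tools only need to be installed once. A quick way to confirm they are on the PATH and to set up gcloud (gcloud init walks through authentication and choosing a default project interactively):

    $ gcloud version
    $ terraform version
    $ gcloud init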

  3. Create/prepare a project on Google Cloud Platform (GCP)
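
    A project can be created on the GCP console, or with gcloud (a billing account still needs to be attached to the project afterwards, e.g., on the console):

    $ gcloud projects create <PROJECT ID>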

  4. Enable Compute Engine API
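
    Enabling the API can be done on the GCP console or with gcloud (replace <PROJECT ID> with your own project ID):

    $ gcloud services enable compute.googleapis.com --project=<PROJECT ID>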

  5. Create a service account with the project editor role
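
    For example, with gcloud (the account name slurm-deployer below is only a placeholder):

    $ gcloud iam service-accounts create slurm-deployer
    $ gcloud projects add-iam-policy-binding <PROJECT ID> \
          --member="serviceAccount:slurm-deployer@<PROJECT ID>.iam.gserviceaccount.com" \
          --role="roles/editor"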

  6. Create and download a JSON key file for the service account. Note that this file cannot be re-downloaded, so keep it safe; if it is lost, create a new key.
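
    The key can also be created from the command line (slurm-deployer is again a placeholder; the output file name is what is later passed as <CREDENTIAL FILE NAME>):

    $ gcloud iam service-accounts keys create credentials.json \
          --iam-account=slurm-deployer@<PROJECT ID>.iam.gserviceaccount.com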

  7. In the terminal, under this directory, execute

    $ terraform init
    
  8. In the terminal, under this directory, execute

    $ terraform apply \
          -var "project_id=<PROJECT ID>" \
          -var "credential_file=<CREDENTIAL FILE NAME>"
    

    The <PROJECT ID> can be found on the GCP console or listed with gcloud (see below). This command creates all the resources on GCP, and their status can be checked on the GCP console.
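
    To list the available project IDs from the terminal:

    $ gcloud projects list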

  9. To log in to the master node:

    $ gcloud compute ssh gcp-cluster-login0 --zone=us-central1-a
    

    Note that even when the GCP console shows the login node and other nodes as ready, it does not mean Slurm is ready; it takes some time for Slurm to become usable. One way to check is shown below.
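
    Run a Slurm command such as sinfo after logging in; once it reports the debug-cpu and debug-gpu partitions without errors, jobs can be submitted:

    $ sinfo
    $ squeue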

  10. To destroy the cluster:

    $ terraform destroy \
          -var "project_id=<PROJECT ID>" \
          -var "credential_file=<CREDENTIAL FILE NAME>"
    

Description of the resources

  • Node gcp-cluster-controller: where the Slurm daemon runs. This node is always on. The NFS server also lives here: /home, /app, and /etc/munge are mounted on all other nodes in the cluster, which is why this node has a larger disk.
  • Node gcp-cluster-login0: the master/login node of the cluster. Users submit jobs from this node. This node is always on.
  • Node gcp-cluster-compute-0-image: the template node for the Slurm partition debug-cpu. It is shut down after being created successfully. The cluster creates compute nodes when needed and destroys them after they have been idle for 300 seconds. Compute nodes are created with this template node as the base image, so they do not take long to become usable.
  • Node gcp-cluster-compute-1-image: similar to gcp-cluster-compute-0-image but for the partition debug-gpu.
  • Node gcp-cluster-compute-<x>-<y>: the actual compute nodes, in partition <x> with node ID <y>. These compute nodes are only created and shown when there are Slurm jobs (see the command after this list).
  • Network-related: gcp-cluster-network, gcp-cluster-router, gcp-cluster-nat, and an external IP used by the virtual router. The default SSH port (i.e., 22) is enabled in the firewall by default, and it accepts connections from any external IP. Another port is opened for external access through GCP's command-line tool gcloud, so users can also log in to the controller and master nodes with gcloud.
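
The nodes that currently exist, including the on-demand compute nodes, can be listed on the GCP console or with gcloud:

    $ gcloud compute instances list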

Notes

  • The creation of the resources may fail at step 8 because of resource quotas. GCP sets very low quotas on C2-type instances and V100 GPUs for new projects, so you may need to request a quota increase from GCP. The current quotas can be checked as shown below.
  • The nodes in debug-cpu were automatically terminated without problems when no jobs were running, as described above. However, those in debug-gpu were not; I have not figured out what went wrong, so keep an eye on the bill for those GPU nodes.
  • It also seems the NVIDIA driver was not installed automatically, though I did not spend much time investigating this issue.
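
To inspect the current quotas and usage in a region (us-central1 is the default region in these scripts):

    $ gcloud compute regions describe us-central1

To check whether the NVIDIA driver is present, log in to a GPU node and run:

    $ nvidia-smi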
# Description: terraform scripts to create a slurm cluster on Google Cloud Platform
# Author: Pi-Yueh Chuang (pychuang@gwu.edu)
# License: BSD 3-Clause
# Based on https://github.com/SchedMD/slurm-gcp
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "3.37.0"
    }
  }
}

provider "google" {
  credentials = file(var.credential_file)
  project     = var.project_id
  region      = var.region
  zone        = var.zone
}

# hard-coded variables
locals {
  cluster_name                  = "gcp-cluster"
  disable_login_public_ips      = true
  disable_controller_public_ips = true
  disable_compute_public_ips    = true

  partitions = [
    {
      name                 = "debug-cpu",
      machine_type         = "c2-standard-4",
      max_node_count       = 2,
      zone                 = var.zone,
      compute_disk_type    = "pd-ssd",
      compute_disk_size_gb = 30,
      compute_labels       = {},
      cpu_platform         = "Intel Cascade Lake",
      gpu_count            = 0,
      gpu_type             = null,
      network_storage      = [],
      preemptible_bursting = true,
      vpc_subnet           = null,
      static_node_count    = 0
    },
    {
      name                 = "debug-gpu",
      machine_type         = "n1-standard-4",
      max_node_count       = 1,
      zone                 = var.zone,
      compute_disk_type    = "pd-ssd",
      compute_disk_size_gb = 30,
      compute_labels       = {},
      cpu_platform         = null,
      gpu_count            = 1,
      gpu_type             = "nvidia-tesla-v100",
      network_storage      = [],
      preemptible_bursting = true,
      vpc_subnet           = null,
      static_node_count    = 1
    },
  ]

  ompi_version = "v4.0.x"
}
module "slurm_cluster_network" {
source = "github.com/SchedMD/slurm-gcp//tf/modules/network"
cluster_name = local.cluster_name
disable_login_public_ips = local.disable_login_public_ips
disable_controller_public_ips = local.disable_controller_public_ips
disable_compute_public_ips = local.disable_compute_public_ips
network_name = null
partitions = local.partitions
private_ip_google_access = true
project = var.project_id
region = var.region
shared_vpc_host_project = null
subnetwork_name = null
}
module "slurm_cluster_controller" {
source = "github.com/SchedMD/slurm-gcp//tf/modules/controller"
boot_disk_size = 100
boot_disk_type = "pd-ssd"
cloudsql = null
cluster_name = local.cluster_name
compute_node_scopes = [
"https://www.googleapis.com/auth/monitoring.write",
"https://www.googleapis.com/auth/logging.write"
]
compute_node_service_account = "default"
disable_compute_public_ips = local.disable_compute_public_ips
disable_controller_public_ips = local.disable_controller_public_ips
labels = {}
login_network_storage = []
login_node_count = 1
machine_type = "n1-standard-2"
munge_key = null
network_storage = var.network_storage
ompi_version = local.ompi_version
partitions = local.partitions
project = var.project_id
region = var.region
secondary_disk = false
secondary_disk_size = 100
secondary_disk_type = "pd-ssd"
scopes = ["https://www.googleapis.com/auth/cloud-platform"]
service_account = "default"
shared_vpc_host_project = null
slurm_version = "19.05-latest"
subnet_depend = module.slurm_cluster_network.subnet_depend
subnetwork_name = null
suspend_time = 300
zone = var.zone
}
module "slurm_cluster_login" {
source = "github.com/SchedMD/slurm-gcp//tf/modules/login"
boot_disk_size = 20
boot_disk_type = "pd-standard"
cluster_name = local.cluster_name
controller_name = module.slurm_cluster_controller.controller_node_name
controller_secondary_disk = false
disable_login_public_ips = local.disable_login_public_ips
labels = {}
login_network_storage = []
machine_type = "n1-standard-2"
munge_key = null
network_storage = var.network_storage
node_count = 1
ompi_version = local.ompi_version
region = var.region
scopes = [
"https://www.googleapis.com/auth/monitoring.write",
"https://www.googleapis.com/auth/logging.write"
]
service_account = "default"
shared_vpc_host_project = null
subnet_depend = module.slurm_cluster_network.subnet_depend
subnetwork_name = null
zone = var.zone
}
module "slurm_cluster_compute" {
source = "github.com/SchedMD/slurm-gcp//tf/modules/compute"
compute_image_disk_size_gb = 20
compute_image_disk_type = "pd-ssd"
compute_image_labels = {}
compute_image_machine_type = "n1-standard-2"
controller_name = module.slurm_cluster_controller.controller_node_name
controller_secondary_disk = 0
cluster_name = local.cluster_name
disable_compute_public_ips = local.disable_compute_public_ips
network_storage = var.network_storage
ompi_version = local.ompi_version
partitions = local.partitions
project = var.project_id
region = var.region
scopes = [
"https://www.googleapis.com/auth/monitoring.write",
"https://www.googleapis.com/auth/logging.write"
]
service_account = "default"
shared_vpc_host_project = null
subnet_depend = module.slurm_cluster_network.subnet_depend
subnetwork_name = null
zone = var.zone
}
# Description: Input variables of main.tf
# Author: Pi-Yueh Chuang (pychuang@gwu.edu)
# License: BSD 3-Clause
# project_id is a mandatory variable from users
variable "project_id" {
type = string
description = "The GCP project where the cluster will be created in."
}
# credential_file is a mandatory variable from users
variable "credential_file" {
type = string
description = "The JSON credential file of a service account with project editor role."
}
variable "region" {
type = string
description = "The region where the resources will be allocated in."
default = "us-central1"
}
variable "zone" {
type = string
description = "The zone under the region where the resources will be allocated in."
default = "us-central1-a"
}
variable "network_storage" {
type = list(
object(
{
server_ip = string,
remote_mount = string,
local_mount = string,
fs_type = string,
mount_options = string
}
)
)
description = " An array of network attached storage mounts to be configured on all instances."
default = []
}
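
# The two mandatory variables above (project_id and credential_file) can also
# be supplied through a terraform.tfvars file, which Terraform loads
# automatically, instead of repeating -var flags on every command. A minimal
# example (the values below are placeholders):
#
#   project_id      = "my-gcp-project"
#   credential_file = "credentials.json"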