Oguz Pastirmaci OguzPastirmaci

  • Oracle Cloud Infrastructure
  • Seattle, WA
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  annotations:
  name: nccl-allreduce-job0
spec:
  minAvailable: 0
  plugins:
    ssh: []
    svc: []
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCv++INMpsLoaSg7IMdgPLandO3sEJl1FBRUOH3ziXlGPuNXevnd0qz9SLknNQ/MKSGn6mBjrh/nfdqz53y4QYUpA57/cxBlgqk1EW9OsRM4daQnNi1aFL/oXb5ZwKuUiuBlC37QgDTO+RBHphkyKJneQdtWpD5WlqgEDSbXuW1ScHCCBz09eOkWGR2b2CmM9b9IVIxLpV6FnCROK3Pn39OL2U0kA8UHu1q6gJhxdP+gBVMXMYsKyFL3t8yPaQ0khLOAP8i3CIFB3hivP9n5IZ24s6BV46kOq/fvTAG3rC87L8SYFjWz/rLX4NzfbGwDn/ylRdwf4xxPgv0ettrQLRiREETrmOZQQqp6siIzP9kovo0KqXyOHsl8XqUGPpo1YLzxvJLeO1rDxdf3KyuvdDEAG9QKXkxhhwnaEsNC0jWQRLge4hjrdFyRf5MvpGRt5bs0uh2HqvuEneZlvRXwUUN/gnpLhT6B7tdMbF3Y75JfLCQlFrYmQ3XlYe5Ztzk+SWGZ2uDVDODLFArevb6xGg8V9AvcwPpF2bnqlfQQ9L1St0dBvhMqPjNAr3ac0y0sRjyFEAvXCt2OZtUJ9u65Uvr0Or2cfpQOY9DacLLQAMAtOnBr8FKoFejhbbbXga9mok9vrjRACSoLUwVOlBPjjnxQ7FkgcKcZKqqgz3lG9Q8bw== test@test.com
Warning: Permanently added 'nccl-alltoall-job-oguz-mpiworker-0.nccl-alltoall-job-oguz' (ED25519) to the list of known hosts.
Warning: Permanently added 'nccl-alltoall-job-oguz-mpiworker-1.nccl-alltoall-job-oguz' (ED25519) to the list of known hosts.
Warning: Permanently added 'nccl-alltoall-job-oguz-mpiworker-2.nccl-alltoall-job-oguz' (ED25519) to the list of known hosts.
Warning: Permanently added 'nccl-alltoall-job-oguz-mpiworker-4.nccl-alltoall-job-oguz' (ED25519) to the list of known hosts.
Warning: Permanently added 'nccl-alltoall-job-oguz-mpiworker-3.nccl-alltoall-job-oguz' (ED25519) to the list of known hosts.
Warning: Permanently added 'nccl-alltoall-job-oguz-mpiworker-5.nccl-alltoall-job-oguz' (ED25519) to the list of known hosts.
Warning: Permanently added 'nccl-alltoall-job-oguz-mpiworker-6.nccl-alltoall-job-oguz' (ED25519) to the list of known hosts.
Warning: Permanently added 'nccl-alltoall-job-oguz-mpiworker-12.nccl-alltoall-job-oguz' (ED25519) to the list of known hosts.
Warning: Permanently ad
<system version="1">
<cpu numaid="0" affinity="00000000,000000ff,ffffffff,ffff0000,00000000,00ffffff,ffffffff" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="143">
<pci busid="0000:08:00.0" class="0x060400" vendor="0x1000" device="0xc030" subsystem_vendor="0x1000" subsystem_device="0x0072" link_speed="5.0 GT/s PCIe" link_width="16">
<pci busid="0000:0d:00.0" class="0x060400" vendor="0x1000" device="0xc030" subsystem_vendor="0x1000" subsystem_device="0x1003" link_speed="32.0 GT/s PCIe" link_width="16">
<pci busid="0000:0f:00.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="32.0 GT/s PCIe" link_width="16">
<gpu dev="0" sm="90" rank="0" gdr="1">
<nvlink target="0000:1c:00.0" count="5" tclass="0x068000"/>
<nvlink target="0000:1b:00.0" count="5" tclass="0x068000"/>
<nvlink target="0000:1a:00.0" count="4" tclass="0x068000"/>
<nvlink target="0000:1d:00.0" count="
#!/bin/bash
set -ex
wget -O /home/ubuntu/oracle-cloud-agent_1.38.0-4_amd64.snap https://objectstorage.us-phoenix-1.oraclecloud.com/p/-EYKOzTNCQWpvJzwhH6KHGewyHYL47IuDnx3PHqwkmdoThKQEzlx_SJRjhpjTUpz/n/imagegen/b/agent_test/o/1.38.0/3/oracle-cloud-agent_1.38.0-4_amd64.snap
sudo snap stop oracle-cloud-agent
sudo snap install --classic --dangerous /home/ubuntu/oracle-cloud-agent_1.38.0-4_amd64.snap
sudo snap start oracle-cloud-agent
NODE_POOL_NAME=
NODE_POOL_SIZE=
NODE_POOL_BOOT_VOLUME_SIZE_IN_GB=
NODE_IMAGE_ID=
CLUSTER_ID=
COMPARTMENT_ID=
NODE_SHAPE=
oci ce node-pool create \
  --cluster-id "$CLUSTER_ID" \
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-test-a100
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-version-check
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-version-check
    image: nvidia/cuda:11.7.1-base-ubuntu20.04
    command: ["nvidia-smi"]
#-------------------------------------------------
# SGE default configuration file
#-------------------------------------------------
# Use always fully qualified pathnames, please
# Path to a log file. If the file already exists, the log output
# will be appended.
# If empty, a log file will be created in <SGE_ROOT>/<SGE_CELL>/common
# The file needs to be writable by the admin user