@Csinclair0
Last active November 9, 2022 19:18
MPIJob nccl-tests
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: nccl-tests
spec:
  slotsPerWorker: 8
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          restartPolicy: OnFailure
          initContainers:
          - image: public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:22.03-pt-py3
            name: init
            command: ["sh", "-c", "sleep 5"]
          containers:
          - image: public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:22.03-pt-py3
            imagePullPolicy: Always
            name: nccl-test-launcher
            env:
            # Note: Kubernetes only expands $(VAR) references in env values,
            # so $LD_LIBRARY_PATH and $PATH below are passed through literally.
            - name: LD_LIBRARY_PATH
              value: /opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:/usr/local/nvidia/lib:$LD_LIBRARY_PATH
            - name: PATH
              value: $PATH:/opt/amazon/efa/bin:/usr/sbin:/usr/bin:/usr/local/bin:/opt/amazon/openmpi/bin
            command:
            - /opt/amazon/openmpi/bin/mpirun
            args:
            - --allow-run-as-root
            - --tag-output
            # 2 workers x 8 slots per worker = 16 ranks
            - -np
            - "16"
            - -bind-to
            - none
            - -map-by
            - slot
            # Environment variables exported to every rank
            - -x
            - PATH
            - -x
            - LD_LIBRARY_PATH
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - NCCL_ALGO=RING
            - -x
            - NCCL_PROTO=simple
            - -x
            - FI_LOG_LEVEL=INFO
            - -x
            - FI_PROVIDER=efa
            - -x
            - FI_EFA_USE_DEVICE_RDMA=0
            # Exclude the cm PML component
            - --mca
            - pml
            - ^cm
            - --oversubscribe
            # all_reduce_perf: message sizes from 8 bytes (-b) to 1G (-e),
            # multiplied by 2 each step (-f), 1 thread (-t) and 1 GPU (-g)
            # per rank, result checking on (-c), 100 iterations (-n)
            - /opt/nccl-tests/build/all_reduce_perf
            - -b
            - "8"
            - -e
            - 1G
            - -f
            - "2"
            - -t
            - "1"
            - -g
            - "1"
            - -c
            - "1"
            - -n
            - "100"
    Worker:
      replicas: 2
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          volumes:
          # Memory-backed volume mounted at /dev/shm
          - name: dshm
            emptyDir:
              medium: Memory
          containers:
          - image: public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:22.03-pt-py3
            imagePullPolicy: Always
            name: nccl-worker
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            resources:
              limits:
                nvidia.com/gpu: 8
                vpc.amazonaws.com/efa: 1
              requests:
                nvidia.com/gpu: 8
                vpc.amazonaws.com/efa: 1
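
In the manifest above, the `-np` value corresponds to the Worker `replicas` multiplied by `slotsPerWorker` (2 × 8 = 16), and each `-x` flag exports one variable to the ranks. As a sketch only, a small Python helper could derive that part of the Launcher args so the rank count and the exported variables stay consistent if the job is resized. The function name `mpirun_args` and its parameters are hypothetical; they are not part of the gist or of the MPI Operator.

```python
def mpirun_args(workers, slots_per_worker, env,
                exported=("PATH", "LD_LIBRARY_PATH")):
    """Build an mpirun argument list shaped like the Launcher args above.

    Hypothetical helper for illustration; it only assembles strings.
    """
    args = [
        "--allow-run-as-root", "--tag-output",
        # One rank per slot: workers * slots_per_worker.
        "-np", str(workers * slots_per_worker),
        "-bind-to", "none",
        "-map-by", "slot",
    ]
    # Forward variables that are already set in the launcher environment.
    for name in exported:
        args += ["-x", name]
    # A dict holds each key once, so no variable is exported twice.
    for name, value in env.items():
        args += ["-x", f"{name}={value}"]
    return args


# Values copied from the manifest above.
nccl_env = {
    "NCCL_DEBUG": "INFO",
    "NCCL_ALGO": "RING",
    "NCCL_PROTO": "simple",
    "FI_LOG_LEVEL": "INFO",
    "FI_PROVIDER": "efa",
    "FI_EFA_USE_DEVICE_RDMA": "0",
}

args = mpirun_args(workers=2, slots_per_worker=8, env=nccl_env)
```

Changing `workers` to 4 in this sketch would yield `-np 64` with `slots_per_worker=32`, or `-np 32` with the 8 slots used here; the Worker `replicas` field in the manifest would need the same change.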