@sachin-netbook
Last active June 9, 2022 15:55
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: yolov5-training
spec:
  elasticPolicy:
    rdzvBackend: c10d
    minReplicas: 1
    maxReplicas: 4
    maxRestarts: 100
  pytorchReplicaSpecs:
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            # The container must be named "pytorch": the Kubeflow operator relies on
            # this name, and the job does not start if the container is renamed.
            - name: pytorch
              image: sachinnetbook/yolov5-training:v6
              #image: ultralytics/yolov5:latest
              imagePullPolicy: IfNotPresent
              env:
                - name: LOGLEVEL
                  value: DEBUG
              command:
                - python
                - -m
                - torch.distributed.run
                - "--nproc_per_node=1"
                - yolov5/train.py
                - "--batch-size=48"
                - "--epochs=20"
                - "--data=coco.yaml"
                # No space after "=": a space would become part of the weights path.
                - "--weights=datasets/weights/yolov5s.pt"
              resources:
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
                # Dataset and weights live on the shared EFS-backed volume.
                - mountPath: /workspace/train/datasets/
                  name: efs-data
                # Memory-backed /dev/shm for PyTorch DataLoader worker processes.
                - mountPath: /dev/shm
                  name: dshm
          volumes:
            - name: efs-data
              persistentVolumeClaim:
                claimName: efs-claim
            - emptyDir:
                medium: Memory
              name: dshm
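
# The job above mounts a PersistentVolumeClaim named "efs-claim", which this
# manifest does not define. A minimal sketch of such a claim follows, kept
# commented out so it does not change what this file creates. Everything in it
# except the claim name is an assumption: the storage class name "efs-sc" and
# the requested size are placeholders, and the claim may well be defined
# differently in the author's cluster. ReadWriteMany is chosen here because
# several worker pods mount the same volume at once.
#
# apiVersion: v1
# kind: PersistentVolumeClaim
# metadata:
#   name: efs-claim            # must match claimName in the PyTorchJob above
# spec:
#   accessModes:
#     - ReadWriteMany          # assumption: shared by all worker pods
#   storageClassName: efs-sc   # hypothetical storage class name
#   resources:
#     requests:
#       storage: 5Gi           # placeholder; the field is required by the API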