kind create cluster
Make sure you have added the KubeRay Helm repo before running the following command.
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.0
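If the repo is not registered yet, it can be added first; the repository URL below is the one documented by the KubeRay project:

```shell
# Register the KubeRay Helm repository and refresh the local chart index
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
```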
kubectl apply -f redis.yaml
Use the redis.yaml file from this gist.
kubectl apply -f rayjob.yaml
Wait a few minutes, then confirm that the job is running by tailing the pod logs.
kubectl logs rayjob-sample-<pattern>
The output is similar to the following:
test_counter got 1
test_counter got 2
test_counter got 3
test_counter got 4
test_counter got 5
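The log lines above come from the job's entrypoint. A plain-Python sketch of the logic it runs is shown below; the assumption is that the actual sample_code.py wraps Counter in a `@ray.remote` actor and calls `increment.remote()` five times, whereas the actor indirection is dropped here:

```python
# Ray-free sketch of the sample job's logic (assumption: the real
# sample_code.py defines Counter as a @ray.remote actor).
class Counter:
    def __init__(self) -> None:
        self.count = 0

    def increment(self) -> int:
        self.count += 1
        return self.count

if __name__ == "__main__":
    counter = Counter()
    for _ in range(5):
        # Produces the "test_counter got N" lines seen in the pod logs
        print(f"test_counter got {counter.increment()}")
```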
kubectl exec -it rayjob-sample-raycluster-rjpgt-head-<pattern> -- pkill gcs_server
Wait until the head pod restarts.
$ kubectl get rayjobs
NAME            JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME               AGE
rayjob-sample   Failed                           2024-04-22T08:26:27Z   2024-04-22T08:59:13Z   34m
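The same status field can also be read directly off the RayJob custom resource; the jsonpath below assumes the `jobStatus` field name from the RayJob CRD's status block:

```shell
# Print the job status recorded on the RayJob resource
kubectl get rayjob rayjob-sample -o jsonpath='{.status.jobStatus}'
```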
$ kubectl exec -it rayjob-sample-raycluster-rjpgt-head-9qp9v -- ray list jobs
======== List: 2024-04-22 02:02:18.785135 ========
Stats:
------------------------------
Total: 1
Table:
------------------------------
    JOB_ID    SUBMISSION_ID        ENTRYPOINT                               TYPE        STATUS   MESSAGE                    ERROR_TYPE  DRIVER_INFO
 0  02000000  rayjob-sample-n6qk7  python /home/ray/samples/sample_code.py  SUBMISSION  RUNNING  Job is currently running.              id: '02000000'
                                                                                                                                       node_ip_address: 10.244.0.7
                                                                                                                                       pid: '743'
$ kubectl exec -it rayjob-sample-raycluster-rjpgt-head-9qp9v -- ray list actors
======== List: 2024-04-22 02:03:52.647288 ========
Stats:
------------------------------
Total: 2
Table:
------------------------------
    ACTOR_ID                          CLASS_NAME     STATE  JOB_ID    NAME                                         NODE_ID                                                   PID  RAY_NAMESPACE
 0  34891bb8cfc2e99e1c3aa58c01000000  JobSupervisor  DEAD   01000000  _ray_internal_job_actor_rayjob-sample-n6qk7  db14f9245511e1bf6e94dcc08ce739a775ee38bbf467dd2b7c954a4c  674  SUPERVISOR_ACTOR_RAY_NAMESPACE
 1  37c6bbe6737f51181bd911a502000000  Counter        DEAD   02000000                                               3cbf3b0ea55d7c302ffd2a4151100ac9326713cfa1cba2887d1136cd  288  12eea4d3-7051-4287-b55e-9d94523453ea