@sathyanarays
Last active April 22, 2024 09:06

Steps to reproduce

Create KIND cluster

kind create cluster

Install KubeRay operator

Make sure you have added the KubeRay Helm repo before running the install command.

helm repo add kuberay https://ray-project.github.io/kuberay-helm/

helm install kuberay-operator kuberay/kuberay-operator --version 1.1.0

Install Redis

kubectl apply -f redis.yaml

Use the redis.yaml file from this gist.

Trigger RayJob

kubectl apply -f rayjob.yaml

Ensure the job is running

Wait a few minutes, then confirm that the job is running by checking the pod logs.

kubectl logs rayjob-sample-<pattern>

The output is similar to

test_counter got 1
test_counter got 2
test_counter got 3
test_counter got 4
test_counter got 5

Kill GCS

kubectl exec -it rayjob-sample-raycluster-rjpgt-head-<pattern> -- pkill gcs_server

Wait until the head pod restarts.

Observations

1. The RayJob failed
$ kubectl get rayjobs
NAME            JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME               AGE
rayjob-sample                Failed              2024-04-22T08:26:27Z   2024-04-22T08:59:13Z   34m
2. On the head node, the job is still listed as running
$ kubectl exec -it rayjob-sample-raycluster-rjpgt-head-9qp9v -- ray list jobs

======== List: 2024-04-22 02:02:18.785135 ========
Stats:
------------------------------
Total: 1

Table:
------------------------------
      JOB_ID  SUBMISSION_ID        ENTRYPOINT                               TYPE        STATUS    MESSAGE                    ERROR_TYPE    DRIVER_INFO
 0  02000000  rayjob-sample-n6qk7  python /home/ray/samples/sample_code.py  SUBMISSION  RUNNING   Job is currently running.                id: '02000000'
                                                                                                                                           node_ip_address: 10.244.0.7
                                                                                                                                           pid: '743'
3. The actors are dead
$ kubectl exec -it rayjob-sample-raycluster-rjpgt-head-9qp9v -- ray list actors

======== List: 2024-04-22 02:03:52.647288 ========
Stats:
------------------------------
Total: 2

Table:
------------------------------
    ACTOR_ID                          CLASS_NAME     STATE      JOB_ID  NAME                                         NODE_ID                                                     PID  RAY_NAMESPACE
 0  34891bb8cfc2e99e1c3aa58c01000000  JobSupervisor  DEAD     01000000  _ray_internal_job_actor_rayjob-sample-n6qk7  db14f9245511e1bf6e94dcc08ce739a775ee38bbf467dd2b7c954a4c    674  SUPERVISOR_ACTOR_RAY_NAMESPACE
 1  37c6bbe6737f51181bd911a502000000  Counter        DEAD     02000000                                               3cbf3b0ea55d7c302ffd2a4151100ac9326713cfa1cba2887d1136cd    288  12eea4d3-7051-4287-b55e-9d94523453ea
rayjob.yaml

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  # submissionMode specifies how RayJob submits the Ray job to the RayCluster.
  # The default value is "K8sJobMode", meaning RayJob will submit the Ray job via a submitter Kubernetes Job.
  # The alternative value is "HTTPMode", indicating that KubeRay will submit the Ray job by sending an HTTP request to the RayCluster.
  # submissionMode: "K8sJobMode"
  entrypoint: python /home/ray/samples/sample_code.py
  # shutdownAfterJobFinishes specifies whether the RayCluster should be deleted after the RayJob finishes. Default is false.
  # shutdownAfterJobFinishes: false
  # ttlSecondsAfterFinished specifies the number of seconds after which the RayCluster will be deleted after the RayJob finishes.
  # ttlSecondsAfterFinished: 10
  # activeDeadlineSeconds is the duration in seconds that the RayJob may be active before
  # KubeRay actively tries to terminate the RayJob; the value must be a positive integer.
  # activeDeadlineSeconds: 120
  # RuntimeEnvYAML represents the runtime environment configuration provided as a multi-line YAML string.
  # See https://docs.ray.io/en/latest/ray-core/handling-dependencies.html for details.
  # (New in KubeRay version 1.0.)
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"
  # Suspend specifies whether the RayJob controller should create a RayCluster instance.
  # If a job is applied with the suspend field set to true, the RayCluster will not be created and we will wait for the transition to false.
  # If the RayCluster is already created, it will be deleted. On transition to false, a new RayCluster will be created.
  # suspend: false
  # rayClusterSpec specifies the RayCluster instance to be created by the RayJob controller.
  rayClusterSpec:
    rayVersion: '2.9.0' # should match the Ray version in the image of the containers
    # Ray head pod template
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
        redis-password: "5241590000000000"
        num-cpus: "0"
      # Pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "200m"
              env:
                - name: RAY_REDIS_ADDRESS
                  value: redis:6379
                - name: REDIS_PASSWORD
                  value: "5241590000000000"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            # You set volumes at the Pod level, then mount them into containers inside that Pod
            - name: code-sample
              configMap:
                # Provide the name of the ConfigMap you want to mount.
                name: ray-job-code-sample
                # An array of keys from the ConfigMap to create as files
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      # the Pod replicas in this group are typed as workers
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # logical group name; for this example it is called small-group
        groupName: small-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams: {}
        # Pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower-case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: rayproject/ray:2.9.0
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh", "-c", "ray stop" ]
                resources:
                  limits:
                    cpu: "1"
                  requests:
                    cpu: "200m"
  # SubmitterPodTemplate is the template for the pod that will run the `ray job submit` command against the RayCluster.
  # If SubmitterPodTemplate is specified, the first container is assumed to be the submitter container.
  # submitterPodTemplate:
  #   spec:
  #     restartPolicy: Never
  #     containers:
  #       - name: my-custom-rayjob-submitter-pod
  #         image: rayproject/ray:2.9.0
  #         # If Command is not specified, the correct command will be supplied at runtime using the RayJob spec `entrypoint` field.
  #         # Specifying Command is not recommended.
  #         # command: ["sh", "-c", "ray job submit --address=http://$RAY_DASHBOARD_ADDRESS --submission-id=$RAY_JOB_SUBMISSION_ID -- echo hello world"]
###################### Ray code sample #################################
# this sample is from https://docs.ray.io/en/latest/cluster/job-submission.html#quick-start-example
# it is mounted into the container and executed to show the Ray job at work
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import os
    import requests
    import time

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            assert self.name == "test_counter"
            self.counter = 0

        def inc(self):
            time.sleep(60)
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(5000):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))

    # Verify that the correct runtime env was used for the job.
    assert requests.__version__ == "2.26.0"
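For reference, the "test_counter got N" lines in the pod logs come from the Counter actor in sample_code.py. Stripped of Ray and the 60-second sleep, its logic reduces to the plain-Python sketch below (a local illustration only, not part of the repro):

```python
# Plain-Python sketch of the Counter actor from sample_code.py,
# with the @ray.remote decorator and time.sleep(60) removed,
# to show where the "test_counter got N" log lines come from.
class Counter:
    def __init__(self, name):
        self.name = name
        self.counter = 0

    def inc(self):
        self.counter += 1

    def get_counter(self):
        return "{} got {}".format(self.name, self.counter)

counter = Counter("test_counter")
lines = []
for _ in range(5):
    counter.inc()
    lines.append(counter.get_counter())

# Matches the first five lines of the expected pod logs.
print("\n".join(lines))
```

In the real job each inc() call sleeps 60 seconds, so a new log line appears roughly once per minute.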
redis.yaml

kind: ConfigMap
apiVersion: v1
metadata:
  name: redis-config
  labels:
    app: redis
data:
  redis.conf: |-
    dir /data
    port 6379
    bind 0.0.0.0
    appendonly yes
    protected-mode no
    requirepass 5241590000000000
    pidfile /data/redis-6379.pid
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  type: ClusterIP
  ports:
    - name: redis
      port: 6379
  selector:
    app: redis
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:5.0.8
          command:
            - "sh"
            - "-c"
            - "redis-server /usr/local/etc/redis/redis.conf"
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: config
              mountPath: /usr/local/etc/redis/redis.conf
              subPath: redis.conf
      volumes:
        - name: config
          configMap:
            name: redis-config
---
# Redis password
apiVersion: v1
kind: Secret
metadata:
  name: redis-password-secret
type: Opaque
data:
  # echo -n "5241590000000000" | base64
  password: NTI0MTU5MDAwMDAwMDAwMA==
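The Secret's password field can be sanity-checked locally; this snippet reproduces the `echo -n ... | base64` comment in Python:

```python
import base64

# Verify that the Secret's "password" field is the base64 encoding
# of the Redis password used throughout these manifests.
encoded = base64.b64encode(b"5241590000000000").decode()
print(encoded)  # NTI0MTU5MDAwMDAwMDAwMA==
```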