@sathyanarays
Last active April 22, 2024 09:06

Steps to reproduce

Create KIND cluster

kind create cluster

Install KubeRay operator

Make sure you have added the KubeRay Helm repo before running the install command.

helm repo add kuberay https://ray-project.github.io/kuberay-helm/

helm install kuberay-operator kuberay/kuberay-operator --version 1.1.0

Install Redis

kubectl apply -f redis.yaml

Use the redis.yaml file from this gist.

Trigger RayJob

kubectl apply -f rayjob.yaml

Ensure the job is running

Wait a few minutes, then confirm that the job is running by checking the pod logs.

kubectl logs rayjob-sample-<pattern>

The output is similar to

test_counter got 1
test_counter got 2
test_counter got 3
test_counter got 4
test_counter got 5

Kill GCS

kubectl exec -it rayjob-sample-raycluster-rjpgt-head-<pattern> -- pkill gcs_server

Wait until the head pod restarts.

Observations

1. The RayJob failed
$ kubectl get rayjobs
NAME            JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME               AGE
rayjob-sample                Failed              2024-04-22T08:26:27Z   2024-04-22T08:59:13Z   34m
2. On the head node, the job is still listed as running
$ kubectl exec -it rayjob-sample-raycluster-rjpgt-head-9qp9v -- ray list jobs

======== List: 2024-04-22 02:02:18.785135 ========
Stats:
------------------------------
Total: 1

Table:
------------------------------
      JOB_ID  SUBMISSION_ID        ENTRYPOINT                               TYPE        STATUS    MESSAGE                    ERROR_TYPE    DRIVER_INFO
 0  02000000  rayjob-sample-n6qk7  python /home/ray/samples/sample_code.py  SUBMISSION  RUNNING   Job is currently running.                id: '02000000'
                                                                                                                                           node_ip_address: 10.244.0.7
                                                                                                                                           pid: '743'
3. The actors are dead
$ kubectl exec -it rayjob-sample-raycluster-rjpgt-head-9qp9v -- ray list actors

======== List: 2024-04-22 02:03:52.647288 ========
Stats:
------------------------------
Total: 2

Table:
------------------------------
    ACTOR_ID                          CLASS_NAME     STATE      JOB_ID  NAME                                         NODE_ID                                                     PID  RAY_NAMESPACE
 0  34891bb8cfc2e99e1c3aa58c01000000  JobSupervisor  DEAD     01000000  _ray_internal_job_actor_rayjob-sample-n6qk7  db14f9245511e1bf6e94dcc08ce739a775ee38bbf467dd2b7c954a4c    674  SUPERVISOR_ACTOR_RAY_NAMESPACE
 1  37c6bbe6737f51181bd911a502000000  Counter        DEAD     02000000                                               3cbf3b0ea55d7c302ffd2a4151100ac9326713cfa1cba2887d1136cd    288  12eea4d3-7051-4287-b55e-9d94523453ea
rayjob.yaml

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  # submissionMode specifies how RayJob submits the Ray job to the RayCluster.
  # The default value is "K8sJobMode", meaning RayJob will submit the Ray job via a submitter Kubernetes Job.
  # The alternative value is "HTTPMode", indicating that KubeRay will submit the Ray job by sending an HTTP request to the RayCluster.
  # submissionMode: "K8sJobMode"
  entrypoint: python /home/ray/samples/sample_code.py
  # shutdownAfterJobFinishes specifies whether the RayCluster should be deleted after the RayJob finishes. Default is false.
  # shutdownAfterJobFinishes: false
  # ttlSecondsAfterFinished specifies the number of seconds after which the RayCluster will be deleted after the RayJob finishes.
  # ttlSecondsAfterFinished: 10
  # activeDeadlineSeconds is the duration in seconds that the RayJob may be active before
  # KubeRay actively tries to terminate the RayJob; the value must be a positive integer.
  # activeDeadlineSeconds: 120
  # RuntimeEnvYAML represents the runtime environment configuration provided as a multi-line YAML string.
  # See https://docs.ray.io/en/latest/ray-core/handling-dependencies.html for details.
  # (New in KubeRay version 1.0.)
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"
  # Suspend specifies whether the RayJob controller should create a RayCluster instance.
  # If a job is applied with the suspend field set to true, the RayCluster will not be created and we will wait for the transition to false.
  # If the RayCluster is already created, it will be deleted. On transition to false, a new RayCluster will be created.
  # suspend: false
  # rayClusterSpec specifies the RayCluster instance to be created by the RayJob controller.
  rayClusterSpec:
    rayVersion: '2.9.0' # should match the Ray version in the image of the containers
    # Ray head pod template
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
        redis-password: "5241590000000000"
        num-cpus: "0"
      # Pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "200m"
              env:
                - name: RAY_REDIS_ADDRESS
                  value: redis:6379
                - name: REDIS_PASSWORD
                  value: "5241590000000000"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            # You set volumes at the Pod level, then mount them into containers inside that Pod
            - name: code-sample
              configMap:
                # Provide the name of the ConfigMap you want to mount.
                name: ray-job-code-sample
                # An array of keys from the ConfigMap to create as files
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      # the Pod replicas in this group are typed as workers
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # logical group name; for this example it is called small-group
        groupName: small-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams: {}
        # Pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower-case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: rayproject/ray:2.9.0
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh", "-c", "ray stop" ]
                resources:
                  limits:
                    cpu: "1"
                  requests:
                    cpu: "200m"
  # SubmitterPodTemplate is the template for the pod that will run the `ray job submit` command against the RayCluster.
  # If SubmitterPodTemplate is specified, the first container is assumed to be the submitter container.
  # submitterPodTemplate:
  #   spec:
  #     restartPolicy: Never
  #     containers:
  #       - name: my-custom-rayjob-submitter-pod
  #         image: rayproject/ray:2.9.0
  #         # If Command is not specified, the correct command will be supplied at runtime using the RayJob spec `entrypoint` field.
  #         # Specifying Command is not recommended.
  #         # command: ["sh", "-c", "ray job submit --address=http://$RAY_DASHBOARD_ADDRESS --submission-id=$RAY_JOB_SUBMISSION_ID -- echo hello world"]
###################### Ray code sample #################################
# this sample is from https://docs.ray.io/en/latest/cluster/job-submission.html#quick-start-example
# it is mounted into the container and executed to show the Ray job at work
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import os
    import requests
    import time

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            assert self.name == "test_counter"
            self.counter = 0

        def inc(self):
            time.sleep(60)
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(5000):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))

    # Verify that the correct runtime env was used for the job.
    assert requests.__version__ == "2.26.0"
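For reference, the "test_counter got N" lines in the pod logs come from the Counter actor in sample_code.py. Stripped of Ray and the 60-second sleep, its logic reduces to the plain-Python sketch below (a local illustration only, not part of the repro):

```python
# Plain-Python sketch of the Counter actor from sample_code.py,
# with the @ray.remote decorator and time.sleep(60) removed,
# to show where the "test_counter got N" log lines come from.
class Counter:
    def __init__(self, name):
        self.name = name
        self.counter = 0

    def inc(self):
        self.counter += 1

    def get_counter(self):
        return "{} got {}".format(self.name, self.counter)

counter = Counter("test_counter")
lines = []
for _ in range(5):
    counter.inc()
    lines.append(counter.get_counter())

# Matches the first five lines of the expected pod logs.
print("\n".join(lines))
```

In the real job each inc() call sleeps 60 seconds, so a new log line appears roughly once per minute.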
redis.yaml

kind: ConfigMap
apiVersion: v1
metadata:
  name: redis-config
  labels:
    app: redis
data:
  redis.conf: |-
    dir /data
    port 6379
    bind 0.0.0.0
    appendonly yes
    protected-mode no
    requirepass 5241590000000000
    pidfile /data/redis-6379.pid
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  type: ClusterIP
  ports:
    - name: redis
      port: 6379
  selector:
    app: redis
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:5.0.8
          command:
            - "sh"
            - "-c"
            - "redis-server /usr/local/etc/redis/redis.conf"
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: config
              mountPath: /usr/local/etc/redis/redis.conf
              subPath: redis.conf
      volumes:
        - name: config
          configMap:
            name: redis-config
---
# Redis password
apiVersion: v1
kind: Secret
metadata:
  name: redis-password-secret
type: Opaque
data:
  # echo -n "5241590000000000" | base64
  password: NTI0MTU5MDAwMDAwMDAwMA==
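The Secret's password field can be sanity-checked locally; this snippet reproduces the `echo -n ... | base64` comment in Python:

```python
import base64

# Verify that the Secret's "password" field is the base64 encoding
# of the Redis password used throughout these manifests.
encoded = base64.b64encode(b"5241590000000000").decode()
print(encoded)  # NTI0MTU5MDAwMDAwMDAwMA==
```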