Skip to content

Instantly share code, notes, and snippets.

@czgdp1807
Last active September 24, 2021 07:21
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save czgdp1807/75e9850b8143ba0b789cf587436a29b7 to your computer and use it in GitHub Desktop.
Save czgdp1807/75e9850b8143ba0b789cf587436a29b7 to your computer and use it in GitHub Desktop.

CASE 1

raylet.out

[2021-09-23 17:23:52,726 I 9840 7408] worker_pool.cc:370: Started worker process of 1 worker(s) with pid 3692

[2021-09-23 17:23:56,841 W 9840 7408] worker_pool.cc:512: Received a register request from an unknown worker shim process:8428

[2021-09-23 17:23:56,848 I 9840 7408] node_manager.cc:1167: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0

[2021-09-23 17:23:56,848 I 9840 7408] node_manager.cc:1181: Ignoring client disconnect because the client has already been disconnected.

[2021-09-23 17:24:02,010 W 9840 9956] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.

[2021-09-23 17:24:09,962 W 9840 9956] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.

[2021-09-23 17:24:20,043 W 9840 9956] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.

[2021-09-23 17:24:22,729 I 9840 7408] worker_pool.cc:397: Some workers of the worker process(3692) have not registered to raylet within timeout.

raylet.err

[2021-09-23 17:23:56,841 E 8428 2152] core_worker.cc:411: Failed to register worker 73afd576c65c9f3964c4d01aba44b698c5782aaaade29d2713d54a87 to Raylet. Invalid: Invalid: Unknown worker 8428

CASE 2

raylet.out

[2021-09-23 17:51:30,915 I 4128 7172] worker_pool.cc:370: Started worker process of 1 worker(s) with pid 8552

[2021-09-23 17:51:35,089 W 4128 7172] worker_pool.cc:512: Received a register request from an unknown worker shim process:6624

[2021-09-23 17:51:35,096 I 4128 7172] node_manager.cc:1167: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0

[2021-09-23 17:51:35,096 I 4128 7172] node_manager.cc:1181: Ignoring client disconnect because the client has already been disconnected.

[2021-09-23 17:51:40,113 W 4128 6812] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.

[2021-09-23 17:51:48,101 W 4128 6812] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.

[2021-09-23 17:51:58,163 W 4128 6812] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.

[2021-09-23 17:52:08,207 W 4128 6812] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.

[2021-09-23 17:52:18,218 W 4128 6812] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.

raylet.err

[2021-09-23 17:51:35,089 E 6624 3148] core_worker.cc:411: Failed to register worker d8586d87bb11554dfdb8d58e17a1cde66f89b4466a4f8480ec4500fd to Raylet. Invalid: Invalid: Unknown worker 6624

CASE 3

diff

Original Code

  if (state.starting_worker_processes.count(shim_process) == 0) {
    RAY_LOG(WARNING) << "Received a register request from an unknown worker shim process:"
                     << shim_process.GetId();
    // Status status = Status::Invalid("Unknown worker");
    // send_reply_callback(status, /*port=*/0);
    // return status;
  }

raylet.out

[2021-09-23 18:23:50,421 I 2588 3824] worker_pool.cc:370: Started worker process of 1 worker(s) with pid 7632

[2021-09-23 18:23:54,566 W 2588 3824] worker_pool.cc:512: Received a register request from an unknown worker shim process:5184

[2021-09-23 18:23:56,734 I 2588 5264] object_store.cc:35: Object store current usage 8e-09 / 3.47069 GB.

[2021-09-23 18:23:59,614 I 2588 3824] node_manager.cc:1167: NodeManager::DisconnectClient, disconnect_type=1, has creation task exception = 0

[2021-09-23 18:23:59,615 I 2588 3824] node_manager.cc:1265: Driver (pid=7860) is disconnected. job_id: 01000000

[2021-09-23 18:23:59,669 W 2588 7644] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.

[2021-09-23 18:23:59,817 I 2588 3824] node_manager.cc:1167: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0

[2021-09-23 18:23:59,854 I 2588 3824] main.cc:273: Raylet received SIGTERM, shutting down...

[2021-09-23 18:23:59,854 I 2588 3824] service_based_accessor.cc:435: Unregistering node info, node id = 0ec8877e1204ea7784f3eaaef41bc67f1c0aa2ed19234416b2e96a82

[2021-09-23 18:23:59,855 I 2588 3824] io_service_pool.cc:47: IOServicePool is stopped.

[2021-09-23 18:24:00,073 I 2588 3824] service_based_accessor.cc:445: Finished unregistering node info, status = OK, node id = 0ec8877e1204ea7784f3eaaef41bc67f1c0aa2ed19234416b2e96a82

output

{'node_ip_address': '10.1.0.4', 'raylet_ip_address': '10.1.0.4', 'redis_address': '10.1.0.4:6379', 'object_store_address': 'tcp://127.0.0.1:56351', 'raylet_socket_name': 'tcp://127.0.0.1:63117', 'webui_url': None, 'session_dir': 'C:\\CONDA_~1\\ray\\session_2021-09-23_18-25-36_143068_1836', 'metrics_export_port': 58324, 'node_id': '870da466b9a4ed8c130c2b081de902a13987fb954c880055a8a92a55'}

CASE 4

diff

Original Code

  if (state.starting_worker_processes.count(shim_process) == 0) {
    RAY_LOG(WARNING) << "Received a register request from an unknown worker shim process:"
                     << shim_process.GetId();
    // Status status = Status::Invalid("Unknown worker");
    // send_reply_callback(status, /*port=*/0);
    Status status = Status::OK("Unknown worker");
    return status;
  }

raylet.out

[2021-09-23 18:37:12,797 I 7036 9268] worker_pool.cc:370: Started worker process of 1 worker(s) with pid 6848

[2021-09-23 18:37:17,105 W 7036 9268] worker_pool.cc:512: Received a register request from an unknown worker shim process:5068

[2021-09-23 18:37:21,957 W 7036 1268] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.

[2021-09-23 18:37:29,984 W 7036 1268] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.

[2021-09-23 18:37:40,031 W 7036 1268] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.

[2021-09-23 18:37:42,786 I 7036 9268] worker_pool.cc:397: Some workers of the worker process(6848) have not registered to raylet within timeout.

CASE 5

diff

Original Code

  if (state.starting_worker_processes.count(shim_process) == 0) {
    RAY_LOG(WARNING) << "Received a register request from an unknown worker shim process:"
                     << shim_process.GetId();
    // Status status = Status::Invalid("Unknown worker");
    // send_reply_callback(status, /*port=*/0);
    Status status = Status::OK("Unknown worker");
    return status;
  }

raylet.out

[2021-09-23 19:00:42,593 I 8360 9308] worker_pool.cc:370: Started worker process of 1 worker(s) with pid 3732

[2021-09-23 19:00:46,547 W 8360 9308] worker_pool.cc:512: Received a register request from an unknown worker shim process:10056

[2021-09-23 19:00:52,061 W 8360 9560] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.

[2021-09-23 19:01:00,097 W 8360 9560] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.

[2021-09-23 19:01:10,170 W 8360 9560] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.

[2021-09-23 19:01:21,685 W 8360 9560] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.

[2021-09-23 19:01:30,336 W 8360 9560] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.

[2021-09-23 19:01:40,412 W 8360 9560] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.

CASE 6

diff

Original Code

  if (state.starting_worker_processes.count(shim_process) == 0) {
    RAY_LOG(WARNING) << "Received a register request from an unknown worker shim process:"
                     << shim_process.GetId();
    // Status status = Status::Invalid("Unknown worker");
    // send_reply_callback(status, /*port=*/0);
    // return status;
  }

raylet.out

[2021-09-23 19:04:49,230 I 9924 1796] worker_pool.cc:370: Started worker process of 1 worker(s) with pid 5264

[2021-09-23 19:04:53,383 W 9924 1796] worker_pool.cc:512: Received a register request from an unknown worker shim process:632

[2021-09-23 19:04:55,986 I 9924 980] object_store.cc:35: Object store current usage 8e-09 / 3.51367 GB.

[2021-09-23 19:04:58,151 W 9924 9356] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.

[2021-09-23 19:04:59,244 I 9924 1796] node_manager.cc:1167: NodeManager::DisconnectClient, disconnect_type=1, has creation task exception = 0

[2021-09-23 19:04:59,245 I 9924 1796] node_manager.cc:1265: Driver (pid=4768) is disconnected. job_id: 01000000

[2021-09-23 19:04:59,291 I 9924 1796] node_manager.cc:1167: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0

[2021-09-23 19:04:59,559 I 9924 1796] main.cc:273: Raylet received SIGTERM, shutting down...

[2021-09-23 19:04:59,559 I 9924 1796] service_based_accessor.cc:435: Unregistering node info, node id = 8250c97c6451993adf300790ec0995abfc9f00f7045e6112847d5b47

[2021-09-23 19:04:59,561 I 9924 1796] io_service_pool.cc:47: IOServicePool is stopped.

[2021-09-23 19:04:59,744 I 9924 1796] service_based_accessor.cc:445: Finished unregistering node info, status = OK, node id = 8250c97c6451993adf300790ec0995abfc9f00f7045e6112847d5b47

comments

logs\worker-8a38ea24a819915e37af5dfebb7754b1937e9eb32801985b85ed1e67-01000000-632.err and logs\worker-8a38ea24a819915e37af5dfebb7754b1937e9eb32801985b85ed1e67-01000000-632.out created but empty.

output

{'node_ip_address': '10.1.0.4', 'raylet_ip_address': '10.1.0.4', 'redis_address': '10.1.0.4:6379', 'object_store_address': 'tcp://127.0.0.1:63437', 'raylet_socket_name': 'tcp://127.0.0.1:64779', 'webui_url': None, 'session_dir': 'C:\\CONDA_~1\\ray\\session_2021-09-23_19-06-17_178051_8824', 'metrics_export_port': 65242, 'node_id': '6f1b347fc996855782ee1a6e8d5161a370876822f5b3537b26c9e7bf'}```
@czgdp1807
Copy link
Author

czgdp1807 commented Sep 24, 2021

The above gist is a result of bunch of experiments that I did after observing two things,

  1. The worker registration timeout limit is not the same for all the platforms. As of now in master it is set as 30 ms. However, on Windows a worker process takes around 450 ms (give or take 10-20 ms, see duration.count() in the above gist). Reference to the config in ray is available here. I will refer to this configuration as worker_reg_timeout from now on.
  2. A warning and invalid status is returned when an unknown SHIM process is encountered. Reference to code in ray available here.

A brief analysis and summary of the above details,

  1. CASE 1 - The worker_reg_timeout is set to 30 ms. It can be seen that the code hangs, an unknown worker shim process warning is received and registration timeout can be seen (list line under raylet.out). The code hangs.
  2. CASE 2 - The worker_reg_timeout is increased to 800 ms. The registration time out is now removed. The code still hangs.
  3. CASE 3 - The worker_reg_timeout is set to 30 ms and an invalid status is not returned for an unknown worker shim process, only a warning is logged. A dict is returned as output and the code doesn't hang.
  4. CASE 4 - The worker_reg_timeout is set to 30 ms but an OK status is returned as well. An unknown worker shim process warning is received and the code hangs. Registration timeout also happens.
  5. CASE 5 - The worker_reg_timeout is set to 800 ms and an OK status is returned. Unknown worker shim warning is thrown and the code hangs.
  6. CASE 6 - The worker_reg_timeout is set to 800 ms and a warning is logged. Only an unknown worker shim warning is thrown. The code doesn't hang. Worker log files (see comments) are created and an output is received.

The reason for code hanging appears to be returning a status when an unknown worker shim process is encountered. Registration time out is not associated with code hanging as increasing worker_reg_timeout just removes the timeout message from the logs. When we increase the worker_reg_timeout and just log a warning for unknown worker shim process (i.e., CASE 6), the code runs fine and the initialisation seems to complete successfully.

Possible Directions - The configurations in this file should be platform specific.

Questions - Are unknown shim processes critical enough to leave the WorkerPool::RegisterWorker function in between by returning an invalid status? If they are not that critical then state.starting_worker_processes can be just updated if this condition (i.e., shim process is unknown) succeeds. Or even a simpler approach can be taken by just logging the warning and moving forward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment