CASE 1
- RAY_CONFIG(int64_t, worker_register_timeout_seconds, 30)
duration.count()
- 459 msproc.GetId()
- 3692
raylet.out
[2021-09-23 17:23:52,726 I 9840 7408] worker_pool.cc:370: Started worker process of 1 worker(s) with pid 3692
[2021-09-23 17:23:56,841 W 9840 7408] worker_pool.cc:512: Received a register request from an unknown worker shim process:8428
[2021-09-23 17:23:56,848 I 9840 7408] node_manager.cc:1167: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-09-23 17:23:56,848 I 9840 7408] node_manager.cc:1181: Ignoring client disconnect because the client has already been disconnected.
[2021-09-23 17:24:02,010 W 9840 9956] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-23 17:24:09,962 W 9840 9956] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-23 17:24:20,043 W 9840 9956] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-23 17:24:22,729 I 9840 7408] worker_pool.cc:397: Some workers of the worker process(3692) have not registered to raylet within timeout.
raylet.err
[2021-09-23 17:23:56,841 E 8428 2152] core_worker.cc:411: Failed to register worker 73afd576c65c9f3964c4d01aba44b698c5782aaaade29d2713d54a87 to Raylet. Invalid: Invalid: Unknown worker 8428
CASE 2
- RAY_CONFIG(int64_t, worker_register_timeout_seconds, 800)
duration.count()
- 434 msproc.GetId()
- 8552
raylet.out
[2021-09-23 17:51:30,915 I 4128 7172] worker_pool.cc:370: Started worker process of 1 worker(s) with pid 8552
[2021-09-23 17:51:35,089 W 4128 7172] worker_pool.cc:512: Received a register request from an unknown worker shim process:6624
[2021-09-23 17:51:35,096 I 4128 7172] node_manager.cc:1167: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-09-23 17:51:35,096 I 4128 7172] node_manager.cc:1181: Ignoring client disconnect because the client has already been disconnected.
[2021-09-23 17:51:40,113 W 4128 6812] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-23 17:51:48,101 W 4128 6812] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-23 17:51:58,163 W 4128 6812] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-23 17:52:08,207 W 4128 6812] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-23 17:52:18,218 W 4128 6812] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
raylet.err
[2021-09-23 17:51:35,089 E 6624 3148] core_worker.cc:411: Failed to register worker d8586d87bb11554dfdb8d58e17a1cde66f89b4466a4f8480ec4500fd to Raylet. Invalid: Invalid: Unknown worker 6624
CASE 3
- RAY_CONFIG(int64_t, worker_register_timeout_seconds, 30)
duration.count()
- 461 msproc.GetId()
- 7632
diff
if (state.starting_worker_processes.count(shim_process) == 0) {
RAY_LOG(WARNING) << "Received a register request from an unknown worker shim process:"
<< shim_process.GetId();
// Status status = Status::Invalid("Unknown worker");
// send_reply_callback(status, /*port=*/0);
// return status;
}
raylet.out
[2021-09-23 18:23:50,421 I 2588 3824] worker_pool.cc:370: Started worker process of 1 worker(s) with pid 7632
[2021-09-23 18:23:54,566 W 2588 3824] worker_pool.cc:512: Received a register request from an unknown worker shim process:5184
[2021-09-23 18:23:56,734 I 2588 5264] object_store.cc:35: Object store current usage 8e-09 / 3.47069 GB.
[2021-09-23 18:23:59,614 I 2588 3824] node_manager.cc:1167: NodeManager::DisconnectClient, disconnect_type=1, has creation task exception = 0
[2021-09-23 18:23:59,615 I 2588 3824] node_manager.cc:1265: Driver (pid=7860) is disconnected. job_id: 01000000
[2021-09-23 18:23:59,669 W 2588 7644] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-23 18:23:59,817 I 2588 3824] node_manager.cc:1167: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-09-23 18:23:59,854 I 2588 3824] main.cc:273: Raylet received SIGTERM, shutting down...
[2021-09-23 18:23:59,854 I 2588 3824] service_based_accessor.cc:435: Unregistering node info, node id = 0ec8877e1204ea7784f3eaaef41bc67f1c0aa2ed19234416b2e96a82
[2021-09-23 18:23:59,855 I 2588 3824] io_service_pool.cc:47: IOServicePool is stopped.
[2021-09-23 18:24:00,073 I 2588 3824] service_based_accessor.cc:445: Finished unregistering node info, status = OK, node id = 0ec8877e1204ea7784f3eaaef41bc67f1c0aa2ed19234416b2e96a82
output
{'node_ip_address': '10.1.0.4', 'raylet_ip_address': '10.1.0.4', 'redis_address': '10.1.0.4:6379', 'object_store_address': 'tcp://127.0.0.1:56351', 'raylet_socket_name': 'tcp://127.0.0.1:63117', 'webui_url': None, 'session_dir': 'C:\\CONDA_~1\\ray\\session_2021-09-23_18-25-36_143068_1836', 'metrics_export_port': 58324, 'node_id': '870da466b9a4ed8c130c2b081de902a13987fb954c880055a8a92a55'}
CASE 4
- RAY_CONFIG(int64_t, worker_register_timeout_seconds, 30)
duration.count()
- 471 msproc.GetId()
- 6848
diff
if (state.starting_worker_processes.count(shim_process) == 0) {
RAY_LOG(WARNING) << "Received a register request from an unknown worker shim process:"
<< shim_process.GetId();
// Status status = Status::Invalid("Unknown worker");
// send_reply_callback(status, /*port=*/0);
Status status = Status::OK("Unknown worker");
return status;
}
raylet.out
[2021-09-23 18:37:12,797 I 7036 9268] worker_pool.cc:370: Started worker process of 1 worker(s) with pid 6848
[2021-09-23 18:37:17,105 W 7036 9268] worker_pool.cc:512: Received a register request from an unknown worker shim process:5068
[2021-09-23 18:37:21,957 W 7036 1268] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-23 18:37:29,984 W 7036 1268] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-23 18:37:40,031 W 7036 1268] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-23 18:37:42,786 I 7036 9268] worker_pool.cc:397: Some workers of the worker process(6848) have not registered to raylet within timeout.
CASE 5
- RAY_CONFIG(int64_t, worker_register_timeout_seconds, 800)
duration.count()
- 449 msproc.GetId()
- 3732
diff
if (state.starting_worker_processes.count(shim_process) == 0) {
RAY_LOG(WARNING) << "Received a register request from an unknown worker shim process:"
<< shim_process.GetId();
// Status status = Status::Invalid("Unknown worker");
// send_reply_callback(status, /*port=*/0);
Status status = Status::OK("Unknown worker");
return status;
}
raylet.out
[2021-09-23 19:00:42,593 I 8360 9308] worker_pool.cc:370: Started worker process of 1 worker(s) with pid 3732
[2021-09-23 19:00:46,547 W 8360 9308] worker_pool.cc:512: Received a register request from an unknown worker shim process:10056
[2021-09-23 19:00:52,061 W 8360 9560] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-23 19:01:00,097 W 8360 9560] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-23 19:01:10,170 W 8360 9560] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-23 19:01:21,685 W 8360 9560] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-23 19:01:30,336 W 8360 9560] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-23 19:01:40,412 W 8360 9560] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
CASE 6
- RAY_CONFIG(int64_t, worker_register_timeout_seconds, 800)
duration.count()
- 465 msproc.GetId()
- 5264
diff
if (state.starting_worker_processes.count(shim_process) == 0) {
RAY_LOG(WARNING) << "Received a register request from an unknown worker shim process:"
<< shim_process.GetId();
// Status status = Status::Invalid("Unknown worker");
// send_reply_callback(status, /*port=*/0);
// return status;
}
raylet.out
[2021-09-23 19:04:49,230 I 9924 1796] worker_pool.cc:370: Started worker process of 1 worker(s) with pid 5264
[2021-09-23 19:04:53,383 W 9924 1796] worker_pool.cc:512: Received a register request from an unknown worker shim process:632
[2021-09-23 19:04:55,986 I 9924 980] object_store.cc:35: Object store current usage 8e-09 / 3.51367 GB.
[2021-09-23 19:04:58,151 W 9924 9356] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-23 19:04:59,244 I 9924 1796] node_manager.cc:1167: NodeManager::DisconnectClient, disconnect_type=1, has creation task exception = 0
[2021-09-23 19:04:59,245 I 9924 1796] node_manager.cc:1265: Driver (pid=4768) is disconnected. job_id: 01000000
[2021-09-23 19:04:59,291 I 9924 1796] node_manager.cc:1167: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-09-23 19:04:59,559 I 9924 1796] main.cc:273: Raylet received SIGTERM, shutting down...
[2021-09-23 19:04:59,559 I 9924 1796] service_based_accessor.cc:435: Unregistering node info, node id = 8250c97c6451993adf300790ec0995abfc9f00f7045e6112847d5b47
[2021-09-23 19:04:59,561 I 9924 1796] io_service_pool.cc:47: IOServicePool is stopped.
[2021-09-23 19:04:59,744 I 9924 1796] service_based_accessor.cc:445: Finished unregistering node info, status = OK, node id = 8250c97c6451993adf300790ec0995abfc9f00f7045e6112847d5b47
comments
logs\worker-8a38ea24a819915e37af5dfebb7754b1937e9eb32801985b85ed1e67-01000000-632.err
and
logs\worker-8a38ea24a819915e37af5dfebb7754b1937e9eb32801985b85ed1e67-01000000-632.out
created but empty.
output
{'node_ip_address': '10.1.0.4', 'raylet_ip_address': '10.1.0.4', 'redis_address': '10.1.0.4:6379', 'object_store_address': 'tcp://127.0.0.1:63437', 'raylet_socket_name': 'tcp://127.0.0.1:64779', 'webui_url': None, 'session_dir': 'C:\\CONDA_~1\\ray\\session_2021-09-23_19-06-17_178051_8824', 'metrics_export_port': 65242, 'node_id': '6f1b347fc996855782ee1a6e8d5161a370876822f5b3537b26c9e7bf'}```
The above gist is a result of bunch of experiments that I did after observing two things,
master
it is set as 30 ms. However, on Windows a worker process takes around 450 ms (give or take 10-20 ms, seeduration.count()
in the above gist). Reference to the config in ray is available here. I will refer to this configuration asworker_reg_timeout
from now on.A brief analysis and summary of the above details,
worker_reg_timeout
is set to 30 ms. It can be seen that the code hangs, an unknown worker shim process warning is received and registration timeout can be seen (list line underraylet.out
). The code hangs.worker_reg_timeout
is increased to 800 ms. The registration time out is now removed. The code still hangs.worker_reg_timeout
is set to 30 ms and an invalid status is not returned for an unknown worker shim process, only a warning is logged. A dict is returned as output and the code doesn't hang.worker_reg_timeout
is set to 30 ms but an OK status is returned as well. An unknown worker shim process warning is received and the code hangs. Registration timeout also happens.worker_reg_timeout
is set to 800 ms and an OK status is returned. Unknown worker shim warning is thrown and the code hangs.worker_reg_timeout
is set to 800 ms and a warning is logged. Only an unknown worker shim warning is thrown. The code doesn't hang. Worker log files (see comments) are created and an output is received.The reason for code hanging appears to be returning a status when an unknown worker shim process is encountered. Registration time out is not associated with code hanging as increasing
worker_reg_timeout
just removes the timeout message from the logs. When we increase theworker_reg_timeout
and just log a warning for unknown worker shim process (i.e., CASE 6), the code runs fine and the initialisation seems to complete successfully.Possible Directions - The configurations in this file should be platform specific.
Questions - Are unknown shim processes critical enough to leave the
WorkerPool::RegisterWorker
function in between by returning an invalid status? If they are not that critical thenstate.starting_worker_processes
can be just updated if this condition (i.e., shim process is unknown) succeeds. Or even a simpler approach can be taken by just logging the warning and moving forward.