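import time

import ray
from ray.cluster_utils import Cluster

# Start a local three-node test cluster and connect to it.
cluster = Cluster()
cluster.add_node()
cluster.add_node()
n = cluster.add_node()
ray.init(address=cluster.address)

# Remove one node, then keep the process alive so the dashboard can be observed.
cluster.remove_node(n)
time.sleep(30)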
This happens because, when a node is removed from the cluster, the dashboard gets the following error:
Traceback (most recent call last):
  File "/Users/sangbincho/work/ray/python/ray/dashboard/dashboard.py", line 697, in run
    timeout=2)
  File "/Users/sangbincho/anaconda3/envs/dashboard/lib/python3.7/site-packages/grpc/_channel.py", line 826, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/Users/sangbincho/anaconda3/envs/dashboard/lib/python3.7/site-packages/grpc/_channel.py", line 729, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
    debug_error_string = "{"created":"@1594082072.365387000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3941,"referenced_errors":[{"created":"@1594082072.365385000","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}"
In the front end, this means we have no log counts or error counts for that node. The code expects those values to be present and does not handle the undefined case, so the UI crashes.
export const makeNodeErrors = (
  errorCounts: {
    perWorker: { [pid: string]: number };
    total: number;
  },
  setErrorDialog: (hostname: string, pid: number | null) => void,
): NodeFeatureComponent => ({ node }) =>
  errorCounts.total === 0 ? (
    <Typography color="textSecondary" component="span" variant="inherit">
      No errors
    </Typography>
  ) : (
    <SpanButton onClick={() => setErrorDialog(node.hostname, null)}>
      View all errors ({errorCounts.total.toLocaleString()})
    </SpanButton>
  );
In the snippet above, errorCounts can be undefined.
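To make the failure mode concrete, here is a minimal sketch of a guard for that case. It is not the dashboard's actual code: the ErrorCounts type and the errorLabel helper are hypothetical names I made up for illustration, and the helper returns the label text as a string instead of rendering JSX so that it runs on its own.

```typescript
// Hypothetical type, mirroring the errorCounts parameter in the snippet above.
type ErrorCounts = {
  perWorker: { [pid: string]: number };
  total: number;
};

// Hypothetical helper: returns the label text the component would show.
// A missing errorCounts (e.g. for a node that was just removed) is treated
// the same as zero errors instead of throwing on `errorCounts.total`.
const errorLabel = (errorCounts: ErrorCounts | undefined): string => {
  const total = errorCounts?.total ?? 0;
  return total === 0
    ? "No errors"
    : `View all errors (${total.toLocaleString()})`;
};
```

Without the `?.` and `??`, reading `.total` on an undefined value throws a TypeError at runtime, which is the crash described above.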
On the HEAD of the master branch, we have changed the code to perform a null check. This change was made in a PR that I merged about a week ago: "Machine View Sorting / Grouping" by mfitton (Pull Request #9214, ray-project/ray).
const nodeErrCount = (node: Node) =>
node.error_count ? sum(Object.values(node.error_count)) : 0;
The type definitions of log_counts and error_counts changed as well. They used to be a top-level part of a NodeInfoResponse, and the log data was merged with the other client data on the front end. That merging is now done on the backend, and the new type definition reflects that the values can be undefined:
log_count?: { [pid: string]: number };
error_count?: { [pid: string]: number };
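As a rough illustration of how a consumer can work with these optional fields, here is a small self-contained sketch. Only log_count and error_count come from the type definition above; the NodeCounts type, the hostname field on it, and the sumCounts and clusterErrorTotal helpers are assumptions made for the example, not the actual dashboard code.

```typescript
// Hypothetical minimal node type; only the two optional fields are from the post.
type NodeCounts = {
  hostname: string;
  log_count?: { [pid: string]: number };
  error_count?: { [pid: string]: number };
};

// Sum the per-pid counts, treating a missing map as zero.
const sumCounts = (counts?: { [pid: string]: number }): number => {
  if (counts === undefined) {
    return 0;
  }
  return Object.keys(counts).reduce((acc, pid) => acc + (counts[pid] ?? 0), 0);
};

// Total errors across all nodes, including nodes that report no counts at all.
const clusterErrorTotal = (nodes: NodeCounts[]): number =>
  nodes.reduce((acc, node) => acc + sumCounts(node.error_count), 0);
```

Because the fields are declared optional, the compiler flags any direct access that skips the undefined check, so a node with no counts contributes zero rather than crashing the view.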