@mfitton
Created July 30, 2020 18:47

Fixing the node removed from cluster bug

Repro Script

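import time

import ray
from ray.cluster_utils import Cluster

# Start a three-node test cluster and keep a handle on the last node.
cluster = Cluster()
cluster.add_node()
cluster.add_node()
n = cluster.add_node()
ray.init(address=cluster.address)

# Remove the node, then keep the script alive so the failure can be observed.
cluster.remove_node(n)
time.sleep(30)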

Root Cause

This happens because, when a node is removed from the cluster, the dashboard's gRPC call fails with the following error:

Traceback (most recent call last):
  File "/Users/sangbincho/work/ray/python/ray/dashboard/dashboard.py", line 697, in run
    timeout=2)
  File "/Users/sangbincho/anaconda3/envs/dashboard/lib/python3.7/site-packages/grpc/_channel.py", line 826, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/Users/sangbincho/anaconda3/envs/dashboard/lib/python3.7/site-packages/grpc/_channel.py", line 729, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1594082072.365387000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3941,"referenced_errors":[{"created":"@1594082072.365385000","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}"

On the front end, this means we have no log counts or error counts for that node. The code expects the value to be present and does not handle the undefined case, so the UI crashes.

export const makeNodeErrors = (
  errorCounts: {
    perWorker: { [pid: string]: number };
    total: number;
  },
  setErrorDialog: (hostname: string, pid: number | null) => void,
): NodeFeatureComponent => ({ node }) =>
  errorCounts.total === 0 ? (
    <Typography color="textSecondary" component="span" variant="inherit">
      No errors
    </Typography>
  ) : (
    <SpanButton onClick={() => setErrorDialog(node.hostname, null)}>
      View all errors ({errorCounts.total.toLocaleString()})
    </SpanButton>
  );

In the snippet above, errorCounts can be undefined, so reading errorCounts.total throws.
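
As an illustration of the kind of guard this code was missing, here is a minimal sketch. It is not the fix that landed on master: the helper name `describeErrors` is hypothetical, and the JSX is replaced with plain strings so the example runs on its own.

```typescript
// Shape of the per-node error counts, copied from the snippet above.
type ErrorCounts = {
  perWorker: { [pid: string]: number };
  total: number;
};

// Hypothetical helper: treats a missing errorCounts the same as zero errors
// instead of dereferencing undefined.
const describeErrors = (errorCounts?: ErrorCounts): string => {
  const total = errorCounts?.total ?? 0;
  return total === 0
    ? "No errors"
    : `View all errors (${total.toLocaleString("en-US")})`;
};
```

With this shape, a node whose counts never arrived renders as "No errors" rather than crashing the component.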

How it is already fixed on master

On the HEAD of the master branch, the code now performs a null check. This change was made in a PR that I merged about a week ago: Machine View Sorting / Grouping by mfitton · Pull Request #9214 · ray-project/ray · GitHub

const nodeErrCount = (node: Node) =>
  node.error_count ? sum(Object.values(node.error_count)) : 0;

The type definitions of the log_counts and error_counts changed as well. They used to exist as a top-level part of a NodeInfoResponse, and the log data was merged with the other client data on the front end. That merging is now done on the backend, and the new type definition reflects that the values can be undefined:

log_count?: { [pid: string]: number };
error_count?: { [pid: string]: number };
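
To show how these optional fields can be consumed safely, here is a self-contained sketch. The type name `NodeCounts` (trimmed to the two fields above), the helper `sumCounts`, and `nodeLogCount` are assumptions made for illustration; only `nodeErrCount` mirrors a name from the master snippet, and its body here is a variation rather than the exact code.

```typescript
// Per-process counts keyed by pid, as in the type definition above.
type CountsByPid = { [pid: string]: number };

// Hypothetical, trimmed node type: only the two optional fields from above.
type NodeCounts = {
  log_count?: CountsByPid;
  error_count?: CountsByPid;
};

// Hypothetical helper: a missing map counts as zero.
const sumCounts = (counts: CountsByPid | undefined): number => {
  if (!counts) {
    return 0;
  }
  let total = 0;
  for (const pid of Object.keys(counts)) {
    total += counts[pid];
  }
  return total;
};

const nodeErrCount = (node: NodeCounts): number => sumCounts(node.error_count);
const nodeLogCount = (node: NodeCounts): number => sumCounts(node.log_count);

// A removed node has neither field; a healthy node has both.
const removedNode: NodeCounts = {};
const healthyNode: NodeCounts = {
  log_count: { "101": 5 },
  error_count: { "101": 2, "102": 1 },
};

console.log(nodeErrCount(removedNode), nodeErrCount(healthyNode), nodeLogCount(healthyNode));
```

Because the fields are declared optional, the compiler forces callers to handle the undefined case, which is the case the original component skipped.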