---
layout: post
title: Location Awareness in Kudu
author: Alexey Serbin
---

This post is about location awareness in Kudu. It gives an overview of the following:

  • principles of the design
  • restrictions of the current implementation
  • potential future enhancements and extensions

Introduction

The first cut of location awareness in Kudu is built for the following use case:

  • In a Kudu cluster consisting of multiple servers spread over several racks, place the replicas of a tablet in such a way that the tablet stays available even if all the servers in a single rack become unavailable.

A rack failure might happen because of a failure of a hardware component shared among servers in the rack: a network switch, a power supply, etc. More generally, 'rack' can be replaced with any other aggregation of nodes (e.g., chassis, site, etc.) where some or all nodes in the aggregate become unavailable when such a shared component fails. This even applies to entire datacenters if the network latency between them is low.

Locations in Kudu

In Kudu, a location is defined by a string that begins with a slash (/) and consists of slash-separated tokens each of which contains only characters from the set [a-zA-Z0-9_-.]. The components of the location string hierarchy are supposed to correspond to the physical or cloud-defined hierarchy of the deployed cluster, e.g. /dc-0/rack-09. As of now, Kudu does not exploit the hierarchical structure of the location except for the client's logic to find the closest tablet server. However, we want to keep the hierarchy there to make it possible to exploit it later on and also to establish compatibility with HDFS.
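
To make the format more concrete, below is a minimal shell sketch (not part of Kudu; the function name and sample strings are made up for illustration) that checks whether a string matches the location format described above.

#!/bin/sh
# Return success if the argument looks like a valid Kudu location string:
# a leading slash followed by slash-separated tokens built from the
# characters [a-zA-Z0-9_.-].
is_valid_location() {
  echo "$1" | grep -Eq '^(/[a-zA-Z0-9_.-]+)+$'
}

is_valid_location "/dc-0/rack-09" && echo "valid: /dc-0/rack-09"
is_valid_location "dc-0/rack-09"  || echo "invalid: missing leading slash"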

Defining and assigning locations

Kudu masters assign locations to tablet servers and clients.

Every Kudu master runs the location assignment procedure to assign a location to a tablet server when it registers. The location assigned by the leader master is used when making tablet replica placement decisions. To determine the location of a tablet server, the master invokes an executable that takes the IP address or hostname of the tablet server and outputs the corresponding location string.

The master associates the produced location string with the registered tablet server and keeps it until the tablet server re-registers, which only occurs if the master or the tablet server restarts. Kudu tablet servers are location-agnostic, at least for now, so the assigned location is not reported back to the tablet server. Masters use the assigned location information internally to make replica placement decisions: they try to place replicas evenly across locations and to keep tablets available even if all tablet servers in a single location fail. In addition, masters provide connected clients with the clients' assigned locations, so the clients can make an informed choice when they attempt to read from the closest tablet server.

An example location assignment script for Kudu

#!/bin/sh
#
# It's assumed a Kudu cluster consists of nodes with IPv4 addresses in the
# private 192.168.100.0/24 subnet. The nodes are hosted in racks, where
# each rack can contain at most 32 nodes. Essentially, that's about having
# 8 locations, one location per rack.
#
# DISCLAIMER:
#   This is an example Bourne shell script for Kudu location assignment. Please
#   note it's just a toy script created for illustrative purposes only.
#   The error handling and the input validation are minimalistic. Also, the
#   network topology choice, supportability, and capacity planning aspects of
#   this script might be sub-optimal if applied as-is to real-world use cases.

set -e

if [ $# -ne 1 ]; then
  echo "usage: $0 <ip_address>"
  exit 1
fi

ip_address=$1
shift

suffix=${ip_address##192.168.100.}
if [ -z "${suffix##*.*}" ]; then
  # An IP address from a non-controlled subnet: maps into the 'other' location.
  echo "/other"
  exit 0
fi

# Validate the last octet of an address in the controlled subnet.
if [ -z "$suffix" ] || [ "$suffix" -lt 0 ] || [ "$suffix" -gt 255 ]; then
  echo "ERROR: '$ip_address' is not a valid IPv4 address"
  exit 2
fi

if [ "$suffix" -eq 0 ] || [ "$suffix" -eq 255 ]; then
  echo "ERROR: '$ip_address' is a network or broadcast address of the /24 subnet"
  exit 3
fi

if [ "$suffix" -lt 32 ]; then
  echo "/dc0/rack00"
elif [ "$suffix" -lt 64 ]; then
  echo "/dc0/rack01"
elif [ "$suffix" -lt 96 ]; then
  echo "/dc0/rack02"
elif [ "$suffix" -lt 128 ]; then
  echo "/dc0/rack03"
elif [ "$suffix" -lt 160 ]; then
  echo "/dc0/rack04"
elif [ "$suffix" -lt 192 ]; then
  echo "/dc0/rack05"
elif [ "$suffix" -lt 224 ]; then
  echo "/dc0/rack06"
else
  echo "/dc0/rack07"
fi
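
For instance, assuming the script above is saved as location_assignment.sh (the file name is arbitrary), invoking it by hand produces mappings like the following:

$ ./location_assignment.sh 192.168.100.1
/dc0/rack00
$ ./location_assignment.sh 192.168.100.200
/dc0/rack06
$ ./location_assignment.sh 10.20.30.40
/other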

How to make a Kudu cluster location-aware

To make a Kudu cluster location-aware, it's necessary to set the --location_mapping_cmd flag for the Kudu master(s) and make the corresponding executable (a binary or a script) available on the nodes where the Kudu masters run. In case of multiple masters, it's important to make sure the command produces the same location mappings regardless of the node it runs on.
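
For example, assuming the location assignment script is deployed as /etc/kudu/location_assignment.sh on every master node (the path here is just an illustration), each master process would be started with the flag pointing at it; a sketch, with the rest of the master configuration omitted:

kudu-master \
  --location_mapping_cmd=/etc/kudu/location_assignment.sh \
  <other master flags>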

It's recommended to have at least three locations defined in a Kudu cluster. The reasoning is simple: with just two locations it's not possible to spread the replicas of a tablet with replication factor 3 (or higher) such that no single location contains a majority of the replicas.

For example, when running a Kudu cluster in a single datacenter dc0, assign location /dc0/rack0 to tablet servers running on machines in rack rack0, /dc0/rack1 to tablet servers on machines in rack rack1, and /dc0/rack2 to tablet servers on machines in rack rack2.

The location-aware placement policy for tablet replicas in Kudu

When placing tablet replicas in a location-aware cluster, Kudu uses a best-effort approach to follow this principle:

  • Spread replicas across locations so that the failure of tablet servers in one location does not make tablets unavailable.

That's referred to as the replica placement policy, or just the placement policy. In Kudu, both the initial placement of tablet replicas and the automatic re-replication are governed by that policy.

Automatic re-replication and placement policy

By design, keeping the target replication factor for tablets has a higher priority than conforming to the replica placement policy. In other words, when bringing up tablet replicas to replace failed ones, Kudu uses a best-effort approach with regard to the constraints of the placement policy. Essentially, that means that if there is no way to place a replica in conformance with the placement policy, the system places the replica anyway. The resulting violation of the placement policy can be addressed later, once unreachable tablet servers become available again or the misconfiguration is fixed. As of now, fixing such placement policy violations requires running the CLI rebalancer tool manually (see below for details), but in future releases that might be done automatically in the background.

Reinstating the placement policy in a Kudu cluster

As explained earlier, even if the initial placement of tablet replicas conforms to the placement policy, the cluster might get into a state where there are not enough tablet servers to place a new or replacement replica in conformance with the policy. Ideally, such situations should be handled automatically: once there are enough tablet servers in the cluster or the misconfiguration is fixed, the placement policy should be reinstated. Currently, it's possible to reinstate the placement policy using the kudu CLI tool:

sudo -u kudu kudu cluster rebalance <master_rpc_endpoints>

In the first phase, the location-aware rebalancing process tries to reestablish the placement policy. If that's not possible, the tool terminates. Use the --disable_policy_fixer flag to skip this phase and continue to the cross-location rebalancing phase.

The second phase is cross-location rebalancing, i.e. moving tablet replicas between locations in an attempt to spread them evenly, equalizing the load of the locations throughout the cluster. For a location with S tablet servers hosting a total of R tablet replicas, the load of the location is defined as R/S. If the benefits of spreading the load among locations do not justify the cost of the cross-location replica movement, the tool can be instructed to skip the second phase of the location-aware rebalancing: use the --disable_cross_location_rebalancing command-line flag for that.

The third phase is intra-location rebalancing, i.e. balancing the distribution of tablet replicas within each location as if each location is a cluster on its own. Use the --disable_intra_location_rebalancing flag to skip this phase.
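
For example, to run only the intra-location phase and skip the first two, the tool can be invoked with the flags mentioned above (the master addresses below are placeholders):

sudo -u kudu kudu cluster rebalance \
  --disable_policy_fixer \
  --disable_cross_location_rebalancing \
  master-0.example.com,master-1.example.com,master-2.example.com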

Examples

Below are a few examples to illustrate what happens during each phase of the location-aware rebalancing process.

In the diagrams below, the larger outer boxes denote locations, and the smaller inner ones denote tablet servers. As for the real-world objects behind locations in this example, one might think of server racks with a shared power supply or a shared network switch. It's assumed that no more than one tablet server runs on each node (i.e. machine) in a rack.

The first phase of the rebalancing process is about detecting violations of the placement policy and reinstating it in the cluster. In the diagram below, there are three locations defined: /L0, /L1, /L2. Each location has two tablet servers. Table A has a replication factor of three (RF=3) and consists of four tablets: A0, A1, A2, A3. Table B has a replication factor of five (RF=5) and consists of three tablets: B0, B1, B2.

The distribution of the replicas for tablet A0 violates the placement policy. Why? Because replicas A0.0 and A0.1 constitute the majority of replicas (two out of three) and reside in the same location /L0.

         /L0                     /L1                    /L2
+-------------------+   +-------------------+  +-------------------+
|   TS0      TS1    |   |   TS2      TS3    |  |   TS4      TS5    |
| +------+ +------+ |   | +------+ +------+ |  | +------+ +------+ |
| | A0.0 | | A0.1 | |   | | A0.2 | |      | |  | |      | |      | |
| |      | | A1.0 | |   | | A1.1 | |      | |  | | A1.2 | |      | |
| |      | | A2.0 | |   | | A2.1 | |      | |  | | A2.2 | |      | |
| |      | | A3.0 | |   | | A3.1 | |      | |  | | A3.2 | |      | |
| | B0.0 | | B0.1 | |   | | B0.2 | | B0.3 | |  | | B0.4 | |      | |
| | B1.0 | | B1.1 | |   | | B1.2 | | B1.3 | |  | | B1.4 | |      | |
| | B2.0 | | B2.1 | |   | | B2.2 | | B2.3 | |  | | B2.4 | |      | |
| +------+ +------+ |   | +------+ +------+ |  | +------+ +------+ |
+-------------------+   +-------------------+  +-------------------+

The location-aware rebalancer should initiate a move of either A0.0 or A0.1 from /L0 to another location, so that the resulting replica distribution does not contain a majority of the replicas in any single location. In addition to that, the rebalancer tool tries to spread the load evenly across all locations and across the tablet servers within each location. The latter narrows down the list of candidate replicas to move: A0.1 is the best candidate to move out of location /L0, so that /L0 no longer contains the majority of A0's replicas. The same principle dictates the target location and the target tablet server to receive A0.1: that should be tablet server TS5 in location /L2. The resulting distribution of the tablet replicas after the move is represented in the diagram below.

         /L0                     /L1                    /L2
+-------------------+   +-------------------+  +-------------------+
|   TS0      TS1    |   |   TS2      TS3    |  |   TS4      TS5    |
| +------+ +------+ |   | +------+ +------+ |  | +------+ +------+ |
| | A0.0 | |      | |   | | A0.2 | |      | |  | |      | | A0.1 | |
| |      | | A1.0 | |   | | A1.1 | |      | |  | | A1.2 | |      | |
| |      | | A2.0 | |   | | A2.1 | |      | |  | | A2.2 | |      | |
| |      | | A3.0 | |   | | A3.1 | |      | |  | | A3.2 | |      | |
| | B0.0 | | B0.1 | |   | | B0.2 | | B0.3 | |  | | B0.4 | |      | |
| | B1.0 | | B1.1 | |   | | B1.2 | | B1.3 | |  | | B1.4 | |      | |
| | B2.0 | | B2.1 | |   | | B2.2 | | B2.3 | |  | | B2.4 | |      | |
| +------+ +------+ |   | +------+ +------+ |  | +------+ +------+ |
+-------------------+   +-------------------+  +-------------------+

The second phase of the location-aware rebalancing is about moving tablet replicas across locations to make the locations' load more balanced. At this stage all violations of the placement policy are already rectified. The rebalancer tool doesn't attempt to make any moves which would violate the placement policy.

The load of the locations in the diagram above, using the R/S definition from the previous section:

  • /L0: 10 replicas / 2 tablet servers = 5
  • /L1: 10 / 2 = 5
  • /L2: 7 / 2 = 3.5

A possible distribution of the tablet replicas after the second phase is represented below. The resulting load of the locations:

  • /L0: 9 / 2 = 4.5
  • /L1: 9 / 2 = 4.5
  • /L2: 9 / 2 = 4.5

         /L0                     /L1                    /L2
+-------------------+   +-------------------+  +-------------------+
|   TS0      TS1    |   |   TS2      TS3    |  |   TS4      TS5    |
| +------+ +------+ |   | +------+ +------+ |  | +------+ +------+ |
| | A0.0 | |      | |   | | A0.2 | |      | |  | |      | | A0.1 | |
| |      | | A1.0 | |   | | A1.1 | |      | |  | | A1.2 | |      | |
| |      | | A2.0 | |   | | A2.1 | |      | |  | | A2.2 | |      | |
| |      | | A3.0 | |   | | A3.1 | |      | |  | | A3.2 | |      | |
| | B0.0 | |      | |   | | B0.2 | | B0.3 | |  | | B0.4 | | B0.1 | |
| | B1.0 | | B1.1 | |   | |      | | B1.3 | |  | | B1.4 | | B1.2 | |
| | B2.0 | | B2.1 | |   | | B2.2 | | B2.3 | |  | | B2.4 | |      | |
| +------+ +------+ |   | +------+ +------+ |  | +------+ +------+ |
+-------------------+   +-------------------+  +-------------------+

The third phase of the location-aware rebalancing is about moving tablet replicas within each location to make the distribution of replicas even, both per-table and per-server.

See below for a possible distribution of the replicas in the example scenario after the third phase of the location-aware rebalancing successfully completes.

         /L0                     /L1                    /L2
+-------------------+   +-------------------+  +-------------------+
|   TS0      TS1    |   |   TS2      TS3    |  |   TS4      TS5    |
| +------+ +------+ |   | +------+ +------+ |  | +------+ +------+ |
| | A0.0 | |      | |   | |      | | A0.2 | |  | |      | | A0.1 | |
| |      | | A1.0 | |   | | A1.1 | |      | |  | | A1.2 | |      | |
| |      | | A2.0 | |   | | A2.1 | |      | |  | | A2.2 | |      | |
| |      | | A3.0 | |   | | A3.1 | |      | |  | | A3.2 | |      | |
| | B0.0 | |      | |   | | B0.2 | | B0.3 | |  | | B0.4 | | B0.1 | |
| | B1.0 | | B1.1 | |   | |      | | B1.3 | |  | | B1.4 | | B1.2 | |
| | B2.0 | | B2.1 | |   | | B2.2 | | B2.3 | |  | |      | | B2.4 | |
| +------+ +------+ |   | +------+ +------+ |  | +------+ +------+ |
+-------------------+   +-------------------+  +-------------------+

Future work

As of now, we have the following task on the roadmap:

  • Run the location-aware rebalancing in background, automatically reinstating the placement policy and making tablet replica distribution even across a Kudu cluster.

In addition to that, but not yet on the roadmap, there is a 'table pinning' use case to address:

  • Make it possible to specify a placement policy where replicas of tablets of particular tables are placed only on nodes within the specified locations.

The table pinning use case is tracked in JIRA as KUDU-2604; see [2].

References

[1] Location awareness in Kudu, design document

[2] KUDU-2604
