---
layout: post
title: Location Awareness in Kudu
author: Alexey Serbin
---
This post is about location awareness in Kudu. It gives an overview of the following:
- principles of the design
- restrictions of the current implementation
- potential future enhancements and extensions
The first cut of location awareness in Kudu is built for the following use case:
- In a Kudu cluster consisting of multiple servers spread over several racks, place the replicas of a tablet in such a way that the tablet stays available even if all the servers in a single rack become unavailable.
A rack failure might happen because of a failure of a hardware component shared among the servers in the rack: a network switch, a power supply, etc. More generally, 'rack' can be replaced with any other aggregation of nodes (e.g., chassis, site) whose nodes may become unavailable together in case of a failure. This can even apply to entire datacenters if the network latency between them is low enough.
In Kudu, a location is defined by a string that begins with a slash (`/`) and consists of slash-separated tokens each of which contains only characters from the set `[a-zA-Z0-9_-.]`. The components of the location string hierarchy are supposed to correspond to the physical or cloud-defined hierarchy of the deployed cluster, e.g. `/dc-0/rack-09`. As of now, Kudu does not exploit the hierarchical structure of the location except for the client's logic to find the closest tablet server. However, we want to keep the hierarchy there to make it possible to exploit it later on and also to establish compatibility with HDFS.
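For illustration, the format rule above can be expressed as a simple check. The helper below is a minimal sketch, not part of Kudu; its name and the use of `grep -E` are choices made just for this example.

```sh
#!/bin/sh
# Hypothetical helper (not part of Kudu): checks that a string is a
# syntactically valid location, i.e. one or more slash-prefixed tokens
# built from the character set [a-zA-Z0-9_-.] described above.
is_valid_location() {
  printf '%s\n' "$1" | grep -Eq '^(/[a-zA-Z0-9_.-]+)+$'
}

is_valid_location '/dc-0/rack-09' && echo 'valid'     # prints 'valid'
is_valid_location 'dc-0/rack-09'  || echo 'invalid'   # prints 'invalid': no leading slash
```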
Kudu masters assign locations to tablet servers and clients.
Every Kudu master runs the location assignment procedure to assign a location to a tablet server when the tablet server registers. The location assigned by the leader master is used to make tablet replica placement decisions. To determine the location for a tablet server, the master invokes an executable that takes the IP address or hostname of the tablet server and outputs the corresponding location string for the specified IP address/hostname.

The master associates the produced location string with the registered tablet server and keeps it until the tablet server re-registers, which only occurs if the master or tablet server restarts. Kudu tablet servers are location agnostic, at least for now, so the assigned location is not reported back to the tablet server. Masters use the assigned location information internally to make replica placement decisions, trying to place replicas evenly across locations and to keep tablets available in case all tablet servers in a single location fail. Also, masters provide connected clients with the information on the clients' assigned locations, so the clients can make informed decisions when they attempt to read from the closest tablet server.

Below is a toy example of a location assignment script written in Bourne shell:
```sh
#!/bin/sh
#
# It's assumed a Kudu cluster consists of nodes with IPv4 addresses in the
# private 192.168.100.0/24 subnet. The nodes are hosted in racks, where
# each rack can contain at most 32 nodes. Essentially, that's about having
# 8 locations, one location per rack.
#
# DISCLAIMER:
#   This is an example Bourne shell script for Kudu location assignment.
#   Please note it's just a toy script created for illustrative purposes only.
#   The error handling and the input validation are minimalistic. Also, the
#   network topology choice, supportability and capacity planning aspects of
#   this script might be sub-optimal if applied as-is for real-world use cases.

set -e

if [ $# -ne 1 ]; then
  echo "usage: $0 <ip_address>"
  exit 1
fi

ip_address=$1

# Strip the expected subnet prefix, leaving just the last octet.
suffix=${ip_address##192.168.100.}
if [ -z "${suffix##*.*}" ]; then
  # An IP address from a non-controlled subnet: maps into the 'other' location.
  echo "/other"
  exit 0
fi

# Validate the last octet of the IP address.
if [ -z "$suffix" ] || [ "$suffix" -lt 0 ] || [ "$suffix" -gt 255 ]; then
  echo "ERROR: '$ip_address' is not a valid IPv4 address"
  exit 2
fi
if [ "$suffix" -eq 0 ] || [ "$suffix" -eq 255 ]; then
  echo "ERROR: '$ip_address' is the network or broadcast address of the 192.168.100.0/24 subnet"
  exit 3
fi

# Map each block of 32 addresses into its own rack-level location.
if [ "$suffix" -lt 32 ]; then
  echo "/dc0/rack00"
elif [ "$suffix" -lt 64 ]; then
  echo "/dc0/rack01"
elif [ "$suffix" -lt 96 ]; then
  echo "/dc0/rack02"
elif [ "$suffix" -lt 128 ]; then
  echo "/dc0/rack03"
elif [ "$suffix" -lt 160 ]; then
  echo "/dc0/rack04"
elif [ "$suffix" -lt 192 ]; then
  echo "/dc0/rack05"
elif [ "$suffix" -lt 224 ]; then
  echo "/dc0/rack06"
else
  echo "/dc0/rack07"
fi
```
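For instance, assuming the script above is saved as `location_assignment.sh` (the file name is an example, not a requirement), it maps addresses as follows:

```
$ ./location_assignment.sh 192.168.100.10
/dc0/rack00
$ ./location_assignment.sh 192.168.100.211
/dc0/rack06
$ ./location_assignment.sh 10.20.30.40
/other
```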
To make a Kudu cluster location-aware, it's necessary to set the `--location_mapping_cmd` flag for the Kudu master(s) and make the corresponding executable (a binary or a script) available on the nodes where the Kudu masters run. In the case of multiple masters, it's important to make sure that the location mappings stay the same regardless of the node where the location assignment command runs.
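For example, the master could be pointed at a copy of the script above. The paths and the other flags below are placeholders for an actual deployment, not prescribed values:

```
kudu-master \
  --fs_wal_dir=/data/kudu/master/wal \
  --fs_data_dirs=/data/kudu/master/data \
  --location_mapping_cmd=/etc/kudu/location_assignment.sh
```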
It's recommended to have at least three locations defined in a Kudu cluster. The reasoning is simple: with just two locations it's not possible to spread tablet replicas of tablets with replication factor of 3 (and higher) such that no location contains a majority of replicas of a tablet.
For example, when running a Kudu cluster in a single datacenter `dc0`, assign location `/dc0/rack0` to tablet servers running on machines in rack `rack0`, `/dc0/rack1` to tablet servers running on machines in rack `rack1`, and `/dc0/rack2` to tablet servers running on machines in rack `rack2`.
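If the rack is encoded in the hostnames, the mapping command can work off hostnames instead of IP addresses. Below is a minimal sketch assuming a hypothetical naming scheme of the form `<host>.rackN.dc0.example.com`; the scheme is invented for this example only:

```sh
#!/bin/sh
# Toy hostname-based location mapping: derives the location from a
# fully-qualified hostname such as 'ts-01.rack1.dc0.example.com'.
set -e

if [ $# -ne 1 ]; then
  echo "usage: $0 <hostname>"
  exit 1
fi

case "$1" in
  *.rack0.dc0.*) echo "/dc0/rack0" ;;
  *.rack1.dc0.*) echo "/dc0/rack1" ;;
  *.rack2.dc0.*) echo "/dc0/rack2" ;;
  *)             echo "/other" ;;
esac
```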
While placing replicas of tablets in a location-aware cluster, Kudu uses a best-effort approach to follow this principle:
- Spread replicas across locations so that the failure of tablet servers in one location does not make tablets unavailable.
That's referred to as the replica placement policy, or just the placement policy. In Kudu, both the initial placement of tablet replicas and the automatic re-replication are governed by that policy.
By design, keeping the target replication factor for tablets has higher priority than conforming to the replica placement policy. In other words, when bringing up tablet replicas to replace failed ones, Kudu uses a best-effort approach with regard to the constraints of the placement policy. Essentially, that means that if there isn't a way to place a replica in conformance with the placement policy, the system places the replica anyway. The resulting violation of the placement policy can be addressed later, once unreachable tablet servers become available again or the misconfiguration is fixed. As of now, fixing the resulting placement policy violations requires running the CLI rebalancer tool manually (see below for details), but in future releases that might be done automatically in the background.
As explained earlier, even if the initial placement of tablet replicas conforms to the placement policy, the cluster might get to a point where there are not enough tablet servers to place a new or a replacement replica. Ideally, such situations should be handled automatically: once there are enough tablet servers in the cluster or the misconfiguration is fixed, the placement policy should be reinstated. Currently, it's possible to reinstate the placement policy using the `kudu` CLI tool:
```
sudo -u kudu kudu cluster rebalance <master_rpc_endpoints>
```
In the first phase, the location-aware rebalancing process tries to reestablish the placement policy. If that's not possible, the tool terminates. Use the `--disable_policy_fixer` flag to skip this phase and continue to the cross-location rebalancing phase.
The second phase is cross-location rebalancing, i.e. moving tablet replicas between different locations in an attempt to spread tablet replicas among locations evenly, equalizing the load across locations throughout the cluster. For the number `S` of tablet servers in a location and the total number `R` of replicas in the location, the load of the location is defined as `R/S`.
If the benefits of spreading the load among locations do not justify the cost of the cross-location replica movement, the tool can be instructed to skip the second phase of the location-aware rebalancing. Use the `--disable_cross_location_rebalancing` command line flag for that.
The third phase is intra-location rebalancing, i.e. balancing the distribution of tablet replicas within each location as if each location were a cluster on its own. Use the `--disable_intra_location_rebalancing` flag to skip this phase.
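For instance, to run only the policy-reinstating phase and skip the other two phases, the rebalancer can be invoked with the phase-disabling flags described above; the master addresses below are placeholders:

```
sudo -u kudu kudu cluster rebalance \
    master-1.example.com,master-2.example.com,master-3.example.com \
    --disable_cross_location_rebalancing \
    --disable_intra_location_rebalancing
```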
Below are a few examples to illustrate what happens during each phase of the location-aware rebalancing process.
In the diagrams below, the larger outer boxes denote locations, and the smaller inner ones denote tablet servers. As for the real-world objects behind locations in this example, one might think of server racks with a shared power supply or a shared network switch. It's assumed that no more than one tablet server is run at each node (i.e. machine) in a rack.
The first phase of the rebalancing process is about detecting violations and reinstating the placement policy in the cluster. In the diagram below, there are three locations defined: `/L0`, `/L1`, `/L2`. Each location has two tablet servers. Table `A` has a replication factor of three (RF=3) and consists of four tablets: `A0`, `A1`, `A2`, `A3`. Table `B` has a replication factor of five (RF=5) and consists of three tablets: `B0`, `B1`, `B2`.
The distribution of the replicas for tablet `A0` violates the placement policy. Why? Because replicas `A0.0` and `A0.1` constitute the majority of replicas (two out of three) and reside in the same location `/L0`.
```
         /L0                   /L1                   /L2
+-------------------+ +-------------------+ +-------------------+
|   TS0      TS1    | |   TS2      TS3    | |   TS4      TS5    |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
| | A0.0 | | A0.1 | | | | A0.2 | |      | | | |      | |      | |
| |      | | A1.0 | | | | A1.1 | |      | | | | A1.2 | |      | |
| |      | | A2.0 | | | | A2.1 | |      | | | | A2.2 | |      | |
| |      | | A3.0 | | | | A3.1 | |      | | | | A3.2 | |      | |
| | B0.0 | | B0.1 | | | | B0.2 | | B0.3 | | | | B0.4 | |      | |
| | B1.0 | | B1.1 | | | | B1.2 | | B1.3 | | | | B1.4 | |      | |
| | B2.0 | | B2.1 | | | | B2.2 | | B2.3 | | | | B2.4 | |      | |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
+-------------------+ +-------------------+ +-------------------+
```
The location-aware rebalancer should initiate a movement of either `A0.0` or `A0.1` from `/L0` to another location, so the resulting replica distribution would not contain a majority of replicas in any single location. In addition to that, the rebalancer tool tries to evenly spread the load across all locations and tablet servers within each location. The latter narrows down the list of the candidate replicas to move: `A0.1` is the best candidate to move from location `/L0` (it resides on `TS1`, the more loaded of the two tablet servers in `/L0`), so that location `/L0` would no longer contain the majority of replicas for tablet `A0`. The same principle dictates the target location and the target tablet server to receive `A0.1`: that should be tablet server `TS5` in location `/L2`. The resulting distribution of the tablet replicas after the move is represented in the diagram below.
```
         /L0                   /L1                   /L2
+-------------------+ +-------------------+ +-------------------+
|   TS0      TS1    | |   TS2      TS3    | |   TS4      TS5    |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
| | A0.0 | |      | | | | A0.2 | |      | | | |      | | A0.1 | |
| |      | | A1.0 | | | | A1.1 | |      | | | | A1.2 | |      | |
| |      | | A2.0 | | | | A2.1 | |      | | | | A2.2 | |      | |
| |      | | A3.0 | | | | A3.1 | |      | | | | A3.2 | |      | |
| | B0.0 | | B0.1 | | | | B0.2 | | B0.3 | | | | B0.4 | |      | |
| | B1.0 | | B1.1 | | | | B1.2 | | B1.3 | | | | B1.4 | |      | |
| | B2.0 | | B2.1 | | | | B2.2 | | B2.3 | | | | B2.4 | |      | |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
+-------------------+ +-------------------+ +-------------------+
```
The second phase of the location-aware rebalancing is about moving tablet replicas across locations to make the locations' load more balanced. At this stage all violations of the placement policy are already rectified. The rebalancer tool doesn't attempt to make any moves which would violate the placement policy.
The load of the locations in the diagram above:
- `/L0`: 10/2 = 5
- `/L1`: 10/2 = 5
- `/L2`: 7/2 = 3.5
A possible distribution of the tablet replicas after the second phase is represented below. The resulting load of the locations:
- `/L0`: 9/2 = 4.5
- `/L1`: 9/2 = 4.5
- `/L2`: 9/2 = 4.5
```
         /L0                   /L1                   /L2
+-------------------+ +-------------------+ +-------------------+
|   TS0      TS1    | |   TS2      TS3    | |   TS4      TS5    |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
| | A0.0 | |      | | | | A0.2 | |      | | | |      | | A0.1 | |
| |      | | A1.0 | | | | A1.1 | |      | | | | A1.2 | |      | |
| |      | | A2.0 | | | | A2.1 | |      | | | | A2.2 | |      | |
| |      | | A3.0 | | | | A3.1 | |      | | | | A3.2 | |      | |
| | B0.0 | |      | | | | B0.2 | | B0.3 | | | | B0.4 | | B0.1 | |
| | B1.0 | | B1.1 | | | |      | | B1.3 | | | | B1.4 | | B1.2 | |
| | B2.0 | | B2.1 | | | | B2.2 | | B2.3 | | | | B2.4 | |      | |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
+-------------------+ +-------------------+ +-------------------+
```
The third phase of the location-aware rebalancing is about moving tablet replicas within each location to make the distribution of replicas even, both per-table and per-server.
See below for a possible distribution of replicas in the example scenario after the third phase of the location-aware rebalancing successfully completes.
```
         /L0                   /L1                   /L2
+-------------------+ +-------------------+ +-------------------+
|   TS0      TS1    | |   TS2      TS3    | |   TS4      TS5    |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
| | A0.0 | |      | | | |      | | A0.2 | | | |      | | A0.1 | |
| |      | | A1.0 | | | | A1.1 | |      | | | | A1.2 | |      | |
| |      | | A2.0 | | | | A2.1 | |      | | | | A2.2 | |      | |
| |      | | A3.0 | | | | A3.1 | |      | | | | A3.2 | |      | |
| | B0.0 | |      | | | | B0.2 | | B0.3 | | | | B0.4 | | B0.1 | |
| | B1.0 | | B1.1 | | | |      | | B1.3 | | | | B1.4 | | B1.2 | |
| | B2.0 | | B2.1 | | | | B2.2 | | B2.3 | | | |      | | B2.4 | |
| +------+ +------+ | | +------+ +------+ | | +------+ +------+ |
+-------------------+ +-------------------+ +-------------------+
```
As of now, we have the following task in the roadmap:
- Run the location-aware rebalancing in the background, automatically reinstating the placement policy and making the tablet replica distribution even across the Kudu cluster.

In addition to that, but not yet in the roadmap, there is a 'table pinning' use case to address:
- Make it possible to specify a placement policy where replicas of tablets of particular tables are placed only on nodes within the specified locations.

The table pinning use case is tracked as KUDU-2604 in JIRA, see [2].
[1] Location awareness in Kudu, design document
[2] KUDU-2604