@miketheman
Created July 22, 2013 21:36
Adding nodes to a ZooKeeper ensemble

Adding 2 nodes to an existing 3-node ZooKeeper ensemble without losing the Quorum

Many deployments start out with 3 nodes, and little is written about how to grow a cluster from 3 members to 5 without losing the existing Quorum, so here is an example of how it can be achieved.

In this example, all 5 nodes run on the same Vagrant host for the purpose of illustration, each with its own configuration (client port, peer ports, and data directory) and without any real client load.

YMMV. Caveat usufructuarius.

Step 1: Have a healthy 3-node ensemble

Ensure all 3 nodes are up, one is the leader, and all are in sync

Setup

sudo apt-get update
sudo apt-get install -y openjdk-6-jre-headless vim

mkdir ~/zook
cd ~/zook
wget http://apache.claz.org/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz # You may wish to choose a closer mirror
tar xzf zookeeper-3.4.5.tar.gz
for i in `seq 5` ; do mkdir conf$i ; echo $i > conf$i/myid ; done # one conf/data dir per member, each holding that member's myid file

conf1/zoo.cfg:

tickTime=2000
dataDir=/home/vagrant/zook/conf1
clientPort=2181
initLimit=50
syncLimit=200
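# server.N=host:peerPort:electionPort (peers connect to the Leader on the
# first port for data sync, and use the second port for Leader elections)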
server.1=localhost:2881:3881
server.2=localhost:2882:3882
server.3=localhost:2883:3883

zookeeper-3.4.5/bin/zkServer.sh start-foreground conf1/zoo.cfg


conf2/zoo.cfg:

tickTime=2000
dataDir=/home/vagrant/zook/conf2
clientPort=2182
initLimit=50
syncLimit=200
server.1=localhost:2881:3881
server.2=localhost:2882:3882
server.3=localhost:2883:3883

zookeeper-3.4.5/bin/zkServer.sh start-foreground conf2/zoo.cfg


conf3/zoo.cfg:

tickTime=2000
dataDir=/home/vagrant/zook/conf3
clientPort=2183
initLimit=50
syncLimit=200
server.1=localhost:2881:3881
server.2=localhost:2882:3882
server.3=localhost:2883:3883

zookeeper-3.4.5/bin/zkServer.sh start-foreground conf3/zoo.cfg
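
With all three nodes running, one way to confirm the ensemble is healthy (assuming netcat is installed) is the srvr four-letter command: one node should report "Mode: leader" and the other two "Mode: follower".

for port in 2181 2182 2183 ; do echo srvr | nc localhost $port | grep Mode ; done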

Step 2: Set up 2 new nodes to join the cluster

Start each new node, watch it join the cluster, snapshot the data from the Leader, and become active.

conf4/zoo.cfg:

tickTime=2000
dataDir=/home/vagrant/zook/conf4
clientPort=2184
initLimit=50
syncLimit=200
server.1=localhost:2881:3881
server.2=localhost:2882:3882
server.3=localhost:2883:3883
server.4=localhost:2884:3884
server.5=localhost:2885:3885

zookeeper-3.4.5/bin/zkServer.sh start-foreground conf4/zoo.cfg


conf5/zoo.cfg:

tickTime=2000
dataDir=/home/vagrant/zook/conf5
clientPort=2185
initLimit=50
syncLimit=200
server.1=localhost:2881:3881
server.2=localhost:2882:3882
server.3=localhost:2883:3883
server.4=localhost:2884:3884
server.5=localhost:2885:3885

zookeeper-3.4.5/bin/zkServer.sh start-foreground conf5/zoo.cfg
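
Both new nodes should come up as Followers and pull a snapshot from the Leader; the same srvr check as above (again assuming netcat) can confirm it:

for port in 2184 2185 ; do echo srvr | nc localhost $port | grep Mode ; done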

Step 3: Add the 2 new nodes' config to the existing cluster

Append the two new server lines to each of the three original configs:

for i in `seq 3` ; do vim conf$i/zoo.cfg ; done

server.4=localhost:2884:3884
server.5=localhost:2885:3885

Save files.
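
If you'd rather not edit the files interactively, appending the two lines in a loop should work just as well:

for i in `seq 3` ; do printf 'server.4=localhost:2884:3884\nserver.5=localhost:2885:3885\n' >> conf$i/zoo.cfg ; done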

Step 4: Restart Followers with new config

# Stop this instance with Ctrl+C, then run
zookeeper-3.4.5/bin/zkServer.sh start-foreground conf2/zoo.cfg

Ensure that it rejoins the ensemble, then repeat with the other Follower.

# Ctrl+C
zookeeper-3.4.5/bin/zkServer.sh start-foreground conf3/zoo.cfg

Step 5: Restart the Leader

Ensure that the other 4 nodes have network connectivity to each other on the designated ports, and then bounce the Leader.

# Ctrl+C
zookeeper-3.4.5/bin/zkServer.sh start-foreground conf1/zoo.cfg
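
Once the old Leader comes back up (a new election will have taken place in the meantime), one way to check the full 5-node ensemble, again assuming netcat, is the mntr four-letter command: exactly one node should report zk_server_state leader, and that node should show zk_synced_followers 4.

for port in 2181 2182 2183 2184 2185 ; do echo mntr | nc localhost $port | egrep 'zk_server_state|zk_synced_followers' ; done
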
@atmosx

atmosx commented Apr 13, 2018

As part of the agreement protocol all write requests from clients are forwarded to a single server, called the leader. The rest of the ZooKeeper servers, called followers, receive message proposals from the leader and agree upon message delivery. The messaging layer takes care of replacing leaders on failures and syncing followers with leaders.

So what's actually scaling out is reads, not writes, since all writes are forwarded to the leader for consistency.

Further reading: https://zookeeper.apache.org/doc/r3.4.8/zookeeperOver.html#Implementation

@jiangtao7

Has anyone followed the above steps successfully? I haven't had any luck adding a ZooKeeper node to the cluster; I am checking the ZooKeeper source code now to look for the root cause.

@leonj1

leonj1 commented May 18, 2018

The above should work for ZK 3.4+. The trick to understanding ZK members joining a cluster is to know that ZK members DO NOT initiate connections to servers with higher ids (e.g. server.1 will not initiate a connection to 2, 3, etc), BUT connections initiated from servers with higher ids (e.g. server.3) will get established with 1 and 2 since they are lower. This is why myid is crucial and required. The reason for this, my best guess, is that ZK needs to ensure only a single socket connection exists between any two members. So, by design, socket connections are only allowed from servers with higher ids to lower ones, which guarantees a single connection per pair.

You can verify that connections from servers with smaller server ids are being dropped by finding this line in the logs "Have smaller server identifier, so dropping the connection".

The final thing to know is that, for a member to accept the new leader, the proposed vote has to pass these tests:

  1. have I received the same vote more than once?
  2. has a quorum (half + 1, e.g. 3 out of a 5-node ensemble) sent me the same vote for the leader?
  3. has the server everyone thinks is the leader sent me its vote?
  4. when I check that proposed leader's vote, which includes its current state (LEADING, FOLLOWING), does it also think it is the leader?

I hope this helps. I do believe the above steps will work since I have done similar steps to achieve the same result. But the order applied is critical.

@leonj1

leonj1 commented May 18, 2018

Also, a vote is considered the leading vote when it passes these checks:

  1. Does this proposed vote have a higher epoch than mine? An epoch, in ZK, is unique: there is one epoch per leader election, and when a leader is elected, the epoch is incremented by 1.
  2. If the epochs are the same, then which vote has the highest zxid (ZK transaction id)? Basically, who has the latest data.
  3. If 1 and 2 are equal, then which server has the highest server id? So, for clusters where no weights are defined (they all get a weight of 1), the highest server id wins.
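
To make that ordering concrete, here is a rough shell sketch of the comparison described above (illustrative only, not ZooKeeper's actual code; real epochs and zxids are 64-bit counters, small decimals are used here):

# compare two votes as (epoch, zxid, sid) tuples and print the winner
wins() { # usage: wins epochA zxidA sidA epochB zxidB sidB
  if   [ "$1" -gt "$4" ] ; then echo A ; elif [ "$1" -lt "$4" ] ; then echo B
  elif [ "$2" -gt "$5" ] ; then echo A ; elif [ "$2" -lt "$5" ] ; then echo B
  elif [ "$3" -gt "$6" ] ; then echo A ; else echo B ; fi
}
wins 5 200 1 5 200 3   # same epoch and zxid, so the higher server id wins (B)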

@gentleyu

gentleyu commented Jul 3, 2018

How is the cluster's member count determined when expanding from 3 to 5?
For example, given a ZooKeeper cluster with 3 nodes, does the cluster become a 5-node cluster as soon as I start the 4th node with the 5-node zoo.cfg?

@spanktar

What about the reverse? We're working on scaling a ZK cluster DOWN and worrying about how to avoid losing quorum when nodes are removed. In a cluster of 6 (yes, I know we should run an odd-numbered cluster; inherited setup), we're moving to 3. If we remove 3 nodes, quorum won't be established and everything goes sideways. So how do we establish a smaller (temporarily dangerous) quorum of 2, then remove the 3 old nodes? If we update the config on one of the machines to 3 nodes, it will disagree with the rest of the cluster... then what? We can't shut down/restart all the nodes at once either. Anyone have any thoughts here?

Still digging to try to find answers...

@ajiraj2411

Did you get the answer @spanktar?

@DaveMoreauDeprecated

Quoting @spanktar:

What about the reverse? We're working on scaling a ZK cluster DOWN and worrying about how to avoid losing quorum when nodes are removed. In a cluster of 6 (yes, I know we should run an odd-numbered cluster; inherited setup), we're moving to 3. If we remove 3 nodes, quorum won't be established and everything goes sideways. So how do we establish a smaller (temporarily dangerous) quorum of 2, then remove the 3 old nodes? If we update the config on one of the machines to 3 nodes, it will disagree with the rest of the cluster... then what? We can't shut down/restart all the nodes at once either. Anyone have any thoughts here?

Still digging to try to find answers...

Would something like this work? I would start by updating the config on the clients to the smaller server set. Maybe that will help with the disagreement problem you mention by making sure that any potential consistency issues on the servers being removed won't matter.

I would think that the disagreement risk would be if followers think they are in the ensemble when reads come in, but the leader doesn't think those nodes are in the ensemble when writes happen. That could lead to reads from stale data. If the clients don't send requests to those nodes due to having the updated list, might that mitigate any potential problems?

After updating clients, I would update the 3 servers you are keeping with the new list and then shut down the 3 that are being deprecated.

Disclaimer: this is based on extrapolating from documentation I only read earlier today. I have done quite a few ZK cut-overs to new nodes, but I did so with no background in ZooKeeper.
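
For what it's worth, here is a sketch of what the trimmed configuration might look like on the three surviving servers, using hypothetical hostnames zk1-zk3 for the nodes being kept (on 3.4 each survivor needs a restart to pick up the change, since dynamic reconfiguration only arrived in 3.5):

server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888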
