@andrewthad
Created July 7, 2019 12:37
Zookeeper Protocol Notes

Discerning the Zookeeper Protocol

To write a zookeeper client, it is necessary to understand the protocol used to communicate with the server. Unfortunately, this protocol is not documented. In this example, we use tcpflow to dump both the read and write channels of the TCP connection that zkCli.sh uses to connect to the zookeeper server.

Requests

This is the result of connecting to a zookeeper server with zkCli.sh, running get /foo (this znode does actually exist), and then running close. The client sends this:

00000000: 0000 002d 0000 0000 0000 0000 0000 0000  ...-............
00000010: 0000 7530 0000 0000 0000 0000 0000 0010  ..u0............
00000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000030: 0000 0000 1100 0000 0100 0000 0400 0000  ................
00000040: 042f 666f 6f00 0000 0008 0000 0002 ffff  ./foo...........
00000050: fff5                                     ..

Chris Nauroth's StackOverflow answer provides a good starting point for discerning the meaning. Each request is prefaced by its length encoded as a big-endian 32-bit word. Here are some relevant bits of the jute file describing the models:

class RequestHeader {
    int xid;
    int type;
}
class ConnectRequest {
    int protocolVersion;
    long lastZxidSeen;
    int timeOut;
    long sessionId;
    buffer passwd;
}
class GetDataRequest {
    ustring path;
    boolean watch;
}

There is a ConnectRequest followed by two standard requests in this dump. I suspect that before a zookeeper session has been established, any sequence of bytes is parsed as a ConnectRequest. Also, there's some nonsense going on in ClientCnxn.java in the implementation of createBB for Packet. There's an extra boolean field named readOnly that doesn't show up in the jute file, but it gets tacked on to the end of the ConnectRequest.
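As a rough check of the length-prefix framing described above, here is a minimal Python sketch that splits the captured client bytes into frames. The helper name split_frames is made up for this example, and the bytes are copied from the hex dump of the client side of the connection.

```python
import struct

def split_frames(stream: bytes) -> list:
    """Split a byte stream into frames, each prefixed by a big-endian 32-bit length."""
    frames = []
    offset = 0
    while offset < len(stream):
        # Read the 4-byte length prefix, then slice out that many bytes.
        (length,) = struct.unpack_from(">i", stream, offset)
        offset += 4
        frames.append(stream[offset:offset + length])
        offset += length
    return frames

# The bytes sent by the client, copied from the hex dump.
client_bytes = bytes.fromhex(
    "0000 002d 0000 0000 0000 0000 0000 0000"
    "0000 7530 0000 0000 0000 0000 0000 0010"
    "0000 0000 0000 0000 0000 0000 0000 0000"
    "0000 0000 1100 0000 0100 0000 0400 0000"
    "042f 666f 6f00 0000 0008 0000 0002 ffff"
    "fff5"
)

frame_lengths = [len(frame) for frame in split_frames(client_bytes)]
print(frame_lengths)  # [45, 17, 8]
```

The three lengths line up with the ConnectRequest, getData, and closeSession frames broken down below.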

Connect Request

  • 0000 002d (request comprised of next 45 bytes)
  • 0000 0000 (protocol version 0, zookeeper never bumps this)
  • 0000 0000 0000 0000 (lastZxidSeen is 0)
  • 0000 7530 (timeout is 30000 milliseconds)
  • 0000 0000 0000 0000 (session id is 0)
  • 0000 0010 (length of buffer passwd is 16)
  • 0000 0000 0000 0000 0000 0000 0000 0000 (password is 16 null bytes)
  • 00 (read-only is false, i.e. this connection can issue writes)
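The field-by-field breakdown above can be reproduced with Python's struct module. This is only a sketch based on the captured bytes: the function name and its defaults are invented for illustration, and the trailing readOnly byte is appended because that is what the dump shows, not because the jute file says so.

```python
import struct

def encode_connect_request(timeout_ms: int,
                           session_id: int = 0,
                           passwd: bytes = b"\x00" * 16,
                           last_zxid_seen: int = 0,
                           read_only: bool = False) -> bytes:
    """Build a length-prefixed ConnectRequest frame (hypothetical helper)."""
    # protocolVersion (always 0 in the dump), lastZxidSeen, timeOut, sessionId
    body = struct.pack(">iqiq", 0, last_zxid_seen, timeout_ms, session_id)
    # jute buffer: 32-bit length followed by the raw bytes
    body += struct.pack(">i", len(passwd)) + passwd
    # extra readOnly flag that is not part of the jute definition
    body += struct.pack(">?", read_only)
    return struct.pack(">i", len(body)) + body

connect_frame = encode_connect_request(30000)
print(connect_frame.hex())
```

With a timeout of 30000 and the other defaults, the output should match the first 49 bytes of the client dump.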

Request 1 (getData)

  • 0000 0011 (request comprised of next 17 bytes)
  • 0000 0001 (xid 1, a client-assigned request id that the matching reply echoes back)
  • 0000 0004 (op code 4: getData)
  • 0000 0004 (length of ustring path is 4)
  • 2f66 6f6f (the ASCII-encoded characters /foo)
  • 00 (the boolean watch, set to false; when true, the client is asking the server to notify it when this znode changes)
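Putting the RequestHeader and GetDataRequest together, a getData frame might be encoded as in the sketch below. The names encode_get_data and GET_DATA are made up here, and encoding the path as UTF-8 is an assumption about ustring; the dump only contains ASCII, so it cannot confirm that.

```python
import struct

GET_DATA = 4  # op code observed in the dump

def encode_get_data(xid: int, path: str, watch: bool = False) -> bytes:
    """Build a length-prefixed getData request (hypothetical helper)."""
    raw_path = path.encode("utf-8")  # assumption: ustring is UTF-8
    # RequestHeader: xid, then type (the op code)
    body = struct.pack(">ii", xid, GET_DATA)
    # ustring: 32-bit length followed by the encoded characters
    body += struct.pack(">i", len(raw_path)) + raw_path
    # boolean: a single byte, 0 or 1
    body += struct.pack(">?", watch)
    return struct.pack(">i", len(body)) + body

get_data_frame = encode_get_data(1, "/foo")
print(get_data_frame.hex())
```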

Request 2 (closeSession)

  • 0000 0008 (request comprised of next 8 bytes)
  • 0000 0002 (xid 2)
  • ffff fff5 (op code -11: closeSession)
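The op code only comes out as -11 if the word is read as a signed integer. A small Python check, using the twelve bytes of the closeSession frame from the dump:

```python
import struct

close_frame = bytes.fromhex("0000 0008 0000 0002 ffff fff5")

# ">iii" reads three big-endian signed 32-bit integers: length, xid, op code.
length, xid, op_code = struct.unpack(">iii", close_frame)
print(length, xid, op_code)  # 8 2 -11

# Reading the same word as unsigned gives a very different number.
(unsigned_op_code,) = struct.unpack_from(">I", close_frame, 8)
print(unsigned_op_code)  # 4294967285
```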

Responses

Alright, let's take a look at the responses:

00000000: 0000 0025 0000 0000 0000 7530 016b c918  ...%......u0.k..
00000010: a8ee 0002 0000 0010 aaf7 2e9b dd17 87a2  ................
00000020: 44e3 ed5a 8753 99c9 0000 0000 5f00 0000  D..Z.S......_...
00000030: 0100 0000 0000 0000 0800 0000 0000 0000  ................
00000040: 0765 7861 6d70 6c65 0000 0000 0000 0004  .example........
00000050: 0000 0000 0000 0004 0000 016b c926 0af4  ...........k.&..
00000060: 0000 016b c926 0af4 0000 0000 0000 0000  ...k.&..........
00000070: 0000 0000 0000 0000 0000 0000 0000 0007  ................
00000080: 0000 0000 0000 0000 0000 0004 0000 0010  ................
00000090: 0000 0002 0000 0000 0000 0009 0000 0000  ................

In the zookeeper CLI, what we see is:

[zk: localhost:2181(CONNECTED) 0] get /foo
example
cZxid = 0x4
ctime = Sat Jul 06 21:17:22 UTC 2019
mZxid = 0x4
mtime = Sat Jul 06 21:17:22 UTC 2019
pZxid = 0x4
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 7
numChildren = 0

The relevant models are:

class ConnectResponse {
    int protocolVersion;
    int timeOut;
    long sessionId;
    buffer passwd;
}
class ReplyHeader {
    int xid;
    long zxid;
    int err;
}
class Stat {
    long czxid;      // created zxid
    long mzxid;      // last modified zxid
    long ctime;      // created
    long mtime;      // last modified
    int version;     // version
    int cversion;    // child version
    int aversion;    // acl version
    long ephemeralOwner; // owner id if ephemeral, 0 otw
    int dataLength;  //length of the data in the node
    int numChildren; //number of children of this node
    long pzxid;      // last modified children
}
class GetDataResponse {
    buffer data;
    org.apache.zookeeper.data.Stat stat;
}

In the zookeeper source code, there is a hack for tacking readOnly on to the end of ConnectResponse. This can be found by searching for readOnly in ZooKeeperServer.java.

Connect Response

  • 0000 0025 (response is comprised of next 37 bytes)
  • 0000 0000 (protocol version 0)
  • 0000 7530 (timeout is 30000 milliseconds)
  • 016b c918 a8ee 0002 (session id)
  • 0000 0010 (length of buffer passwd is 16)
  • aaf7 2e9b dd17 87a2 44e3 ed5a 8753 99c9 (server generated password)
  • 00 (server is not in read-only mode)
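Going the other direction, here is a sketch of parsing the ConnectResponse body (the 37 bytes after the length prefix). The function name and the returned dictionary keys are invented for this example. Treating the trailing readOnly byte as optional is purely a defensive assumption; in this capture it is present.

```python
import struct

def parse_connect_response(body: bytes) -> dict:
    """Parse a ConnectResponse body without its length prefix (hypothetical helper)."""
    # protocolVersion, timeOut, sessionId, then the length of the passwd buffer
    protocol_version, timeout_ms, session_id, passwd_len = struct.unpack_from(">iiqi", body, 0)
    offset = 20
    passwd = body[offset:offset + passwd_len]
    offset += passwd_len
    # Assumption: tolerate a missing readOnly byte and default to False.
    read_only = bool(body[offset]) if offset < len(body) else False
    return {
        "protocol_version": protocol_version,
        "timeout_ms": timeout_ms,
        "session_id": session_id,
        "passwd": passwd,
        "read_only": read_only,
    }

connect_response = parse_connect_response(bytes.fromhex(
    "0000 0000 0000 7530 016b c918 a8ee 0002 0000 0010"
    "aaf7 2e9b dd17 87a2 44e3 ed5a 8753 99c9 00"
))
print(hex(connect_response["session_id"]))  # 0x16bc918a8ee0002
```

A client that wants to reconnect to the same session would presumably send this session id and password back in its next ConnectRequest, in place of the zeros seen in the first request.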

Response 1 (ReplyHeader + GetDataResponse)

  • 0000 005f (response is comprised of next 95 bytes)
  • 0000 0001 (xid 1, matching the xid of the getData request)
  • 0000 0000 0000 0008 (transaction id 8)
  • 0000 0000 (error code 0, presumably this means no error)
  • 0000 0007 (length of buffer data is 7)
  • 65 7861 6d70 6c65 (data buffer contents, ASCII encoding of "example")
  • 0000 0000 0000 0004 (created zxid is 4)
  • 0000 0000 0000 0004 (last modified zxid is 4)
  • 0000 016b c926 0af4 (created on Sat Jul 06 21:17:22 UTC 2019)
  • 0000 016b c926 0af4 (last modified on Sat Jul 06 21:17:22 UTC 2019)
  • 0000 0000 (version is 0)
  • 0000 0000 (cversion is 0, the child version from the Stat model)
  • 0000 0000 (acl version is 0)
  • 0000 0000 0000 0000 (ephemeral owner is 0x0; per the Stat model this is the owner id if the znode is ephemeral, and 0 otherwise)
  • 0000 0007 (data length is 7, this seems redundant)
  • 0000 0000 (number of children is 0)
  • 0000 0000 0000 0004 (pzxid is 4, the zxid that last modified this znode's children)
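The whole getData reply can be decoded the same way. This sketch assumes the field order from the ReplyHeader, GetDataResponse, and Stat models above and only handles the successful case shown in the dump. The function name is made up, and the timestamps are treated as milliseconds since the Unix epoch, which is consistent with the date the CLI prints.

```python
import struct
from datetime import datetime, timezone

# Stat layout in jute order: four longs, three ints, a long, two ints, a long.
STAT = struct.Struct(">qqqqiiiqiiq")
STAT_FIELDS = ("czxid", "mzxid", "ctime", "mtime", "version", "cversion",
               "aversion", "ephemeralOwner", "dataLength", "numChildren", "pzxid")

def parse_get_data_reply(body: bytes):
    """Parse ReplyHeader + GetDataResponse from a frame body (hypothetical helper)."""
    xid, zxid, err = struct.unpack_from(">iqi", body, 0)
    offset = 16
    (data_len,) = struct.unpack_from(">i", body, offset)
    offset += 4
    data = body[offset:offset + data_len]
    offset += data_len
    stat = dict(zip(STAT_FIELDS, STAT.unpack_from(body, offset)))
    return xid, zxid, err, data, stat

reply_body = bytes.fromhex(
    "00000001" "0000000000000008" "00000000"   # ReplyHeader: xid, zxid, err
    "00000007" "6578616d706c65"                # data buffer: length, contents
    "0000000000000004" "0000000000000004"      # czxid, mzxid
    "0000016bc9260af4" "0000016bc9260af4"      # ctime, mtime
    "00000000" "00000000" "00000000"           # version, cversion, aversion
    "0000000000000000"                         # ephemeralOwner
    "00000007" "00000000"                      # dataLength, numChildren
    "0000000000000004"                         # pzxid
)

xid, zxid, err, data, stat = parse_get_data_reply(reply_body)
created = datetime.fromtimestamp(stat["ctime"] // 1000, tz=timezone.utc)
print(data, created)  # b'example' 2019-07-06 21:17:22+00:00
```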

Response 2 (ReplyHeader, nothing else)

  • 0000 0010 (response is comprised of next 16 bytes)
  • 0000 0002 (xid 2, matching the xid of the closeSession request)
  • 0000 0000 0000 0009 (transaction id 9)
  • 0000 0000 (error code 0, presumably this means no error)

That's all for now. Hopefully this is instructive for others.
