Machine translation of Kristian Köhntopp's post
https://plus.google.com/u/0/+KristianK%C3%B6hntopp/posts/fUC6J1Hyh3z
Consensus systems - Zookeeper, etcd, consul - what is that?
For some time I have had a half-finished explainer article on
"consensus systems - Zookeeper, etcd, consul" in my head, because my
toenails curl up every time someone classifies them as KV stores.
Sure, you can use them as key-value stores, but that's like taking an
oil tanker to a regatta.
They are really about something else, namely about functionality that
you all want to have. The KV-store part is just a nice side aspect.
Anyway, this is a rough and unhewn braindump, without slides.
Criticism welcome.
The problem
The cloud, the overlay. So we have split our application into a bunch
of more or less independent instances that talk to each other over the
network in a defined way. Of every kind of instance - of every type of
system, every micro-service - our magical cloud lets us start as many
instances as we like: load balancers, frontends, memcaches, queue
systems, backend queue consumers, databases and so on.
How do the load balancers know which frontend servers to target, and
from which IP addresses and ports to accept the frontend servers'
requests? And how do the load balancers know whether these frontends
are still alive and in the pool or not? How does the frontend find the
submit ports of the queue, and how does the backend find the queue in
order to consume jobs?
This is service discovery.
You can do this statically and distribute configuration files with IP
addresses and ports. Or write it into a database, a Redis or some other
SPOF.
Or, if you want to avoid a SPOF, write it into more than one such
system. But then you end up like the man with two watches: if you have
more than one copy of the data, and those copies may differ from each
other, then you have to decide which version of the data is valid.
Worse, it is not enough that each system decides for itself which
version of the data is the most recent; all the systems involved also
have to agree on which version is the valid version of the data.
And not just for service discovery ("Who is in, who is out? And those
who are in, at what address do I find them?"), but also for locks
("Who may serve data, who has to wait?"), for counters ("What is the
next free ID?") and for queues ("Which jobs are still unprocessed, and
who is working on this job here?").
What is needed is a source of truth and a consensus on who this source
of truth is and what this truth says.
That is what a consensus system does.
The often misunderstood CAP Theorem
This is usually the point at which the conversation turns to Brewer's
CAP theorem. The CAP theorem says (and even proves) that a distributed
system can have at most two of three properties:
- It can be consistent (C for Consistency)
- It can be available (A for Availability)
- It can tolerate a network partition (P for Partition tolerance)
In distributed systems you always have to reckon with network
partitions, so you have to take P, and then you only have the choice
between C and A as the second property.
This sounds simple, but for some reason it has been the most
misunderstood theorem in computer science over the last 5 years.
The thing is, the CAP theorem applies to the distributed aspect of
distributed systems. If you do not have a distributed system, you do
not have to deal with network partitions and can therefore easily pick
C and A. For systems for distributed systems that nevertheless cannot
get the "local storage" aspect right, there is another computer-science
technical term - "broken".
Indeed, reading and writing data on disks has been the subject of
intense research over the past 40 years, and by now we not only know
how to get it right locally under adverse circumstances, we can also do
the whole thing so that reading processes never have to wait for
concurrently writing processes, that index updates are atomic and
consistent, and that reading processes do not skip individual records
simply because those are being changed at the same time.
None of this should be a big problem for anyone - at least one would
think so if one has ever held a book about databases - but then reality
decides to be discouraging.
Anyway, only when we talk about the distributed aspect of distributed
systems does "P" play a role, and then it is time to make the decision
between "C" and "A". Many systems then pick "A" because that somehow
seems easier and faster, but none of these systems can ever be a source
of truth.
Consensus systems are distributed systems that, in the eternal war
between Consistency and Availability, have decided for Consistency.
What does this mean in practice
Consensus systems are distributed systems, meaning there is more than
one instance of them. These instances talk to each other - at least the
part of them that can reach each other - and then agree on what the
version of the truth is that they all believe together.
Quite a lot of complexity hides in this: there are distributed systems,
i.e. systems with more than one node, a network between these nodes,
and this network is at times broken, healing, or flips back and forth
between "healing" and "broken" at the most inconvenient moments. In
this chaos the nodes are supposed to talk to each other and coordinate
their actions.
Each system has some data of which it believes that it is the version
of the truth that everyone should believe. The systems now talk to each
other and try to convince each other that their respective version of
the truth is the version that all the other participating systems
should believe.
This is called leader election and follows one of two popular
algorithms, of which people have once again proven that they are
correct and can converge, i.e. that if the network is stable for long
enough, all the systems involved find each other and agree on a version
of the truth in which the least information is lost. The two algorithms
are called Paxos and Raft - Paxos is the older one, by Leslie Lamport,
and Raft is the newer one and supposedly easier to understand. Both
have been proven correct, but I will not bore you with computer-science
proofs here and retell them.
For this to work you need counters as version numbers, and in the end
the leader is the system whose local copy of the data is the most
recent according to that counter. This version, with the winning
version number, is then distributed to all participating nodes, and in
the end all participating systems are on the same level. If new or
previously separated systems join, a new election is held, the whole
pile of machines reconfigures itself and brings all newly arriving
systems up to the same level.
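As a deliberately simplified toy illustration - not Paxos or Raft, just
the "newest counter wins, everyone adopts its copy" idea from the
paragraph above; the node names, counters and data are entirely made
up:

    # Toy sketch: among the reachable nodes, the one whose local copy
    # carries the highest version counter "wins", and its copy of the
    # data is pushed to everyone else.
    nodes = [
        {"name": "node-a", "version": 41, "data": {"frontends": ["fe-1", "fe-2"]}},
        {"name": "node-b", "version": 43, "data": {"frontends": ["fe-1", "fe-2", "fe-3"]}},
        {"name": "node-c", "version": 42, "data": {"frontends": ["fe-1"]}},
    ]

    leader = max(nodes, key=lambda n: n["version"])
    for node in nodes:
        node["version"], node["data"] = leader["version"], leader["data"]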
Ensemble and Session
This set of systems that hold a copy of the data and are entitled to
vote in the version and leader election is called the inner cluster or
the ensemble. It manages the data and gives all clients that want to
talk to the ensemble the same consistent view of the data - it does not
matter which member of the ensemble a client talks to, as long as the
client reaches any member of the ensemble at all that considers itself
part of the active ensemble and up to date.
So clients build one or more connections to one or more members of the
ensemble and ask them whether they are active and up to date, who the
other members of the ensemble are, and how to contact them. You can
think of this a bit like DNS root hints: the client can have a local
configuration file that contains contact information for the ensemble.
All but one of these contact entries may be outdated; it does not
matter - as long as at least one correct address is in the file, the
client can contact the ensemble, find the other members of the
ensemble, download an updated version of the configuration file and
install it locally.
Since all of these connections to members of the ensemble are
equivalent, it does not matter over which connection and with which
member of the ensemble the client speaks - the "session" is an abstract
concept that bundles one or more connections from the client to the
ensemble.
Over the session the client can talk to the ensemble and perform
operations - when an operation is confirmed as successful, the client
knows that all active members of the ensemble have acknowledged the
operation and that all other clients will see the result of this
operation, no matter which member of the ensemble they talk to.
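To make this concrete, a minimal sketch using the Python kazoo client
for Zookeeper; the host names are made up for the example. The host
list plays the role of the root hints, and the session is the thing
that survives reconnects to other ensemble members:

    from kazoo.client import KazooClient, KazooState

    # Any one reachable, up-to-date ensemble member is enough to
    # establish a session.
    hosts = "zk1.example.net:2181,zk2.example.net:2181,zk3.example.net:2181"
    zk = KazooClient(hosts=hosts)

    # The client is only told when the session is SUSPENDED (connection
    # lost, trying other members) or LOST (expired - ephemeral nodes
    # owned by it are gone).
    def session_listener(state):
        if state == KazooState.SUSPENDED:
            print("connection to the ensemble lost, reconnecting")
        elif state == KazooState.LOST:
            print("session expired - our ephemeral nodes are gone")

    zk.add_listener(session_listener)
    zk.start()   # blocks until a session with the ensemble exists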
Operations in the ensemble
All consensus systems (Zookeeper, etcd, Consul, essentially) implement
as their data model a tree, not unlike a file system tree, DOM tree or
LDAP tree, in which each node can be a directory for other nodes and at
the same time store data (attributes or a small blob).
Changes to individual nodes are possible atomically. Sometimes there
are further atomic operations that manage counters or ownership, or
such operations can be assembled from other, smaller primitives.
There are persistent nodes (which exist unchanged until they are
deleted manually) and there are ephemeral nodes, whose existence is
tied to the existence of a session: if the session of a client
disappears from the ensemble (by logging out or by timeout), then the
ephemeral nodes belonging to this session disappear as well.
Directories that clients register in are usually persistent; the
entries of the clients are usually ephemeral. "/frontends" is a
persistent directory, and clients then register there as, say,
"/frontends/client-17" with an ephemeral node and store IP address and
port as attributes of the node.
If a client drops off the network and loses its session, the entry
"/frontends/client-17" disappears by itself. A load balancer can read
the list under "/frontends" and thus has a list of clients to which it
can distribute requests. By reading out the attributes of the nodes it
finds, it gets address and port number of the clients.
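A minimal sketch of such a registration with the Python kazoo client
for Zookeeper; the paths, address and the JSON payload format are made
up for the example:

    import json
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1.example.net:2181")
    zk.start()

    # Persistent directory for the service (created only if missing).
    zk.ensure_path("/frontends")

    # Ephemeral registration: tied to this client's session. If the
    # session dies (crash, network loss, timeout), the node vanishes
    # on its own.
    payload = json.dumps({"ip": "10.0.0.17", "port": 8080}).encode()
    zk.create("/frontends/client-17", payload, ephemeral=True)

    # A load balancer on the other side reads the current pool:
    for child in zk.get_children("/frontends"):
        data, stat = zk.get("/frontends/" + child)
        print(child, json.loads(data))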
For anyone who has previously worked with manually maintained lists of
servers in load balancers, this procedure is frighteningly automatic:
- Servers regularly check for themselves whether they are functional
and internally consistent.
- If a server passes its internal consistency check, it registers
itself in the service directory in the ensemble's database.
- If the server or its consistency check fails, the session eventually
expires by timeout and the server drops out of the service directory of
the ensemble.
- If a server is taken out of service, it logs off from the ensemble
and closes its session, and thereby also automatically drops out of the
load balancer.
- Load balancers thus automatically find a current list of all servers
in the service directory and need to know nothing about consistency
checks, nor about how servers are found or how they re-register. A
config reload when servers enter or leave the pool is not necessary.
- DNS is not involved, and accordingly none of the DNS-typical cache
problems or update delays occur. Because there is one source of truth -
the ensemble - all the systems involved know what is true and agree on
which version number the truth has.
Often clients can place an observer - a watch - on a single node or on
a directory. If a node changes (different attributes, the node
disappears or appears), or a node in the tree below the observed node
changes, the client is notified and can re-read the observed node or
tree.
In this way, clients are automatically notified of configuration
changes.
So you do not have to SIGHUP a load balancer when servers join or leave
the pool; instead the load balancer receives a notification from its
watch on "/frontends" and re-reads the list of frontends when it
changes.
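With kazoo, such a watch on the directory could look roughly like this
(a sketch; the reconfiguration function is a made-up placeholder):

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1.example.net:2181")
    zk.start()

    def reconfigure_pool(frontends):
        # Hypothetical: push the new backend list into the balancer.
        print("active frontends:", frontends)

    # The callback fires once immediately and then again on every
    # change below /frontends - no SIGHUP, no polling, no config
    # reload.
    @zk.ChildrenWatch("/frontends")
    def on_frontends_changed(children):
        reconfigure_pool(children)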
Other useful operations
Load balancers can also register themselves in "/loadbalancer", as
"/loadbalancer/lb-1" and "/loadbalancer/lb-2" and so on. Each load
balancer can try to take possession of the node
"/loadbalancer/active-lb", because the ensemble usually has one or more
atomic operations for this, such as creating an ephemeral node if it
does not yet exist. A load balancer can thus create this node (and then
it belongs to this LB) if it does not exist, or the creation fails.
In this way there can only ever be one "/loadbalancer/active-lb", and
there is consensus about who owns it.
So it is no problem to launch multiple load balancer instances - only
one of them will be active and accept and distribute requests. The
others stand around on standby, watching via their watch on
"/loadbalancer/active-lb" whether it is still there. If the active LB
loses its session for whatever reason, they try to take over the node.
Whichever succeeds becomes the load balancer in place of the old load
balancer and takes over the work.
Because the ensemble guarantees consistency here, it is ensured that
any number of inactive load balancers can stand by passively and that
exactly one of these load balancers will be the active, request-
accepting load balancer.
Because the notification via watches is interrupt-driven, the inactive
load balancers do not have to constantly poll whether they are on duty;
they are woken up when there is something to do.
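A rough sketch of this pattern with kazoo, under the same made-up paths
as above. It shows the bare primitive from the text - create-if-not-
exists plus a watch on the node; kazoo also ships a ready-made Election
recipe that packages the same idea:

    from kazoo.client import KazooClient
    from kazoo.exceptions import NodeExistsError

    zk = KazooClient(hosts="zk1.example.net:2181")
    zk.start()
    zk.ensure_path("/loadbalancer")

    def try_to_become_active():
        try:
            # Atomic: succeeds for exactly one contender, fails for
            # everyone else.
            zk.create("/loadbalancer/active-lb", b"lb-2", ephemeral=True)
            return True   # we own the node -> start accepting requests
        except NodeExistsError:
            return False  # someone else is active -> stay on standby

    if not try_to_become_active():
        # Interrupt-driven standby: woken when the active LB's session
        # (and with it the ephemeral node) disappears.
        @zk.DataWatch("/loadbalancer/active-lb")
        def on_active_changed(data, stat):
            if stat is None:          # node is gone - try to take over
                try_to_become_active()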
Even more useful operations
In the same way, clients can use atomic operations - atomically moving
nodes from one directory to another - to claim work from a work queue,
or to show which clients hold which data fragments of a large database
on which node.
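Zookeeper has no single "move" operation, but the same effect can be
had with a multi-operation transaction: delete here and create there,
atomically. A sketch with kazoo, with made-up queue paths:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1.example.net:2181")
    zk.start()

    def claim_job(job):
        # Delete from the pending directory and create the claimed
        # entry in one atomic transaction: either both happen or
        # neither does, so two workers can never claim the same job.
        data, _ = zk.get("/queue/pending/" + job)
        tx = zk.transaction()
        tx.delete("/queue/pending/" + job)
        tx.create("/queue/claimed/" + job, data, ephemeral=True)
        results = tx.commit()
        return not any(isinstance(r, Exception) for r in results)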
Performance limits
If you have a source of truth in your distributed system, many things
suddenly become easy, and one is tempted to offload all kinds of work
onto this source of truth. But two things must be kept in mind here:
- The whole thing is an expensive affair. Because each write operation
has to be coordinated between the participating nodes, network
latencies come into play. No matter how fast the local machines are:
depending on the network, the fun is over at around 2,000-10,000 write
operations per second. So it does not scale to run every counter and
every little fart in a large cluster with hundreds of nodes through the
consensus system with maximum cluster-wide effort, even if that makes
the local code beautifully simple.
- The whole thing is absolutely consistent ("C"). This means that in
the case of a network failure ("P" - a partition of the network) it
cannot be available ("A"). That is, there will be downtime of the
ensemble, and that is exactly the point: the ensemble shuts itself off
when the network wobbles, because it first has to establish consistency
via leader election and data synchronization before it can let the
clients back at it.
Clients must therefore come to terms with the ensemble being
temporarily gone, or with the ensemble being temporarily read-only.
How they do that is something the makers of the consensus systems do
not care about.
But the point is that "Zookeeper is gone" is a feature and not a bug -
it is precisely the point of Zookeeper not to be "A" until "C" has been
ensured, by a majority of systems having compared and synchronized
their versions of the data.