Machine translation of Kristian Köhntopp's post
https://plus.google.com/u/0/+KristianK%C3%B6hntopp/posts/fUC6J1Hyh3z
Consensus systems - Zookeeper, etcd, consul - what is that?
For some time I have had a half-finished explainer article on
"consensus systems - Zookeeper, etcd, consul" in my head, because my
toenails curl up every time someone classifies them as KV stores.
Sure, you can use them as key-value stores, but that's like taking an
oil tanker to a regatta.
They are really about something else, namely about functionality that
you all want to have. The KV-store part is just a nice side aspect.
Anyway, this is a rough and unhewn braindump, without slides.
Criticism welcome.
The problem
The cloud, the overlay. So we have split our application into a bunch
of more or less independent instances that talk to each other over the
network in a defined way. Of every kind of instance - of every type of
system, every micro-service - our magical cloud lets us start as many
instances as we like: load balancers, frontends, memcaches, queue
systems, backend queue consumers, databases and so on.
How do the load balancers know which frontend servers to target, and
from which IP addresses and ports to accept the frontend servers'
requests? And how do the load balancers know whether these frontends
are still alive and in the pool or not? How does the frontend find the
submit ports of the queue, and how does the backend find the queue in
order to consume jobs?
This is service discovery.
You can do this statically and distribute configuration files with IP
addresses and ports. Or write it into a database, a Redis or some other
SPOF.
Or, if you want to avoid a SPOF, write it into more than one such
system. But then you end up like the man with two watches: if you have
more than one copy of the data, and those copies may differ from each
other, then you have to decide which version of the data is valid.
Worse, it is not enough that each system decides for itself which
version of the data is the most recent; all the systems involved also
have to agree on which version is the valid version of the data.
And not just for service discovery ("Who is in, who is out? And those
who are in, at what address do I find them?"), but also for locks
("Who may serve data, who has to wait?"), for counters ("What is the
next free ID?") and for queues ("Which jobs are still unprocessed, and
who is working on this job here?").
What is needed is a source of truth and a consensus on who this source
of truth is and what this truth says.
That is what a consensus system does.
The often misunderstood CAP Theorem
This is usually the point at which the conversation turns to Brewer's
CAP theorem. The CAP theorem says (and even proves) that a distributed
system can have at most two of three properties:
- It can be consistent (C for Consistency)
- It can be available (A for Availability)
- It can tolerate a network partition (P for Partition tolerance)
In distributed systems you always have to reckon with network
partitions, so you have to take P, and then you only have the choice
between C and A as the second property.
This sounds simple, but for some reason it has been the most
misunderstood theorem in computer science over the last 5 years.
The thing is, the CAP theorem applies to the distributed aspect of
distributed systems. If you do not have a distributed system, you do
not have to deal with network partitions and can therefore easily pick
C and A. For systems for distributed systems that nevertheless cannot
get the "local storage" aspect right, there is another computer-science
technical term - "broken".
Indeed, reading and writing data on disks has been the subject of
intense research over the past 40 years, and by now we not only know
how to get it right locally under adverse circumstances, we can also do
the whole thing so that reading processes never have to wait for
concurrently writing processes, that index updates are atomic and
consistent, and that reading processes do not skip individual records
simply because those are being changed at the same time.
None of this should be a big problem for anyone - at least one would
think so if one has ever held a book about databases - but then reality
decides to be discouraging.
Anyway, only when we talk about the distributed aspect of distributed
systems does "P" play a role, and then it is time to make the decision
between "C" and "A". Many systems then pick "A" because that somehow
seems easier and faster, but none of these systems can ever be a source
of truth.
Consensus systems are distributed systems that, in the eternal war
between Consistency and Availability, have decided for Consistency.
What does this mean in practice
Consensus systems are distributed systems, meaning there is more than
one instance of them. These instances talk to each other - at least the
part of them that can reach each other - and then agree on what the
version of the truth is that they all believe together.
Quite a lot of complexity hides in this: there are distributed systems,
i.e. systems with more than one node, a network between these nodes,
and this network is at times broken, healing, or flips back and forth
between "healing" and "broken" at the most inconvenient moments. In
this chaos the nodes are supposed to talk to each other and coordinate
their actions.
Each system has some data of which it believes that it is the version
of the truth that everyone should believe. The systems now talk to each
other and try to convince each other that their respective version of
the truth is the version that all the other participating systems
should believe.
This is called leader election and follows one of two popular
algorithms, of which people have once again proven that they are
correct and can converge, i.e. that if the network is stable for long
enough, all the systems involved find each other and agree on a version
of the truth in which the least information is lost. The two algorithms
are called Paxos and Raft - Paxos is the older one, by Leslie Lamport,
and Raft is the newer one and supposedly easier to understand. Both
have been proven correct, but I will not bore you with computer-science
proofs here and retell them.
For this to work you need counters as version numbers, and in the end
the leader is the system whose local copy of the data is the most
recent according to that counter. This version, with the winning
version number, is then distributed to all participating nodes, and in
the end all participating systems are on the same level. If new or
previously separated systems join, a new election is held, the whole
pile of machines reconfigures itself and brings all newly arriving
systems up to the same level.
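As a deliberately simplified toy illustration - not Paxos or Raft, just
the "newest counter wins, everyone adopts its copy" idea from the
paragraph above; the node names, counters and data are entirely made
up:

    # Toy sketch: among the reachable nodes, the one whose local copy
    # carries the highest version counter "wins", and its copy of the
    # data is pushed to everyone else.
    nodes = [
        {"name": "node-a", "version": 41, "data": {"frontends": ["fe-1", "fe-2"]}},
        {"name": "node-b", "version": 43, "data": {"frontends": ["fe-1", "fe-2", "fe-3"]}},
        {"name": "node-c", "version": 42, "data": {"frontends": ["fe-1"]}},
    ]

    leader = max(nodes, key=lambda n: n["version"])
    for node in nodes:
        node["version"], node["data"] = leader["version"], leader["data"]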
Ensemble and Session
This set of systems that hold a copy of the data and are entitled to
vote in the version and leader election is called the inner cluster or
the ensemble. It manages the data and gives all clients that want to
talk to the ensemble the same consistent view of the data - it does not
matter which member of the ensemble a client talks to, as long as the
client reaches any member of the ensemble at all that considers itself
part of the active ensemble and up to date.
So clients build one or more connections to one or more members of the
ensemble and ask them whether they are active and up to date, who the
other members of the ensemble are, and how to contact them. You can
think of this a bit like DNS root hints: the client can have a local
configuration file that contains contact information for the ensemble.
All but one of these contact entries may be outdated; it does not
matter - as long as at least one correct address is in the file, the
client can contact the ensemble, find the other members of the
ensemble, download an updated version of the configuration file and
install it locally.
Since all of these connections to members of the ensemble are
equivalent, it does not matter over which connection and with which
member of the ensemble the client speaks - the "session" is an abstract
concept that bundles one or more connections from the client to the
ensemble.
Over the session the client can talk to the ensemble and perform
operations - when an operation is confirmed as successful, the client
knows that all active members of the ensemble have acknowledged the
operation and that all other clients will see the result of this
operation, no matter which member of the ensemble they talk to.
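To make this concrete, a minimal sketch using the Python kazoo client
for Zookeeper; the host names are made up for the example. The host
list plays the role of the root hints, and the session is the thing
that survives reconnects to other ensemble members:

    from kazoo.client import KazooClient, KazooState

    # Any one reachable, up-to-date ensemble member is enough to
    # establish a session.
    hosts = "zk1.example.net:2181,zk2.example.net:2181,zk3.example.net:2181"
    zk = KazooClient(hosts=hosts)

    # The client is only told when the session is SUSPENDED (connection
    # lost, trying other members) or LOST (expired - ephemeral nodes
    # owned by it are gone).
    def session_listener(state):
        if state == KazooState.SUSPENDED:
            print("connection to the ensemble lost, reconnecting")
        elif state == KazooState.LOST:
            print("session expired - our ephemeral nodes are gone")

    zk.add_listener(session_listener)
    zk.start()   # blocks until a session with the ensemble exists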
Operations in the ensemble
All consensus systems (Zookeeper, etcd, Consul, essentially) implement
as their data model a tree, not unlike a file system tree, DOM tree or
LDAP tree, in which each node can be a directory for other nodes and at
the same time store data (attributes or a small blob).
Changes to individual nodes are possible atomically. Sometimes there
are further atomic operations that manage counters or ownership, or
such operations can be assembled from other, smaller primitives.
There are persistent nodes (which exist unchanged until they are
deleted manually) and there are ephemeral nodes, whose existence is
tied to the existence of a session: if the session of a client
disappears from the ensemble (by logging out or by timeout), then the
ephemeral nodes belonging to this session disappear as well.
Directories that clients register in are usually persistent; the
entries of the clients are usually ephemeral. "/frontends" is a
persistent directory, and clients then register there as, say,
"/frontends/client-17" with an ephemeral node and store IP address and
port as attributes of the node.
If a client drops off the network and loses its session, the entry
"/frontends/client-17" disappears by itself. A load balancer can read
the list under "/frontends" and thus has a list of clients to which it
can distribute requests. By reading out the attributes of the nodes it
finds, it gets address and port number of the clients.
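A minimal sketch of such a registration with the Python kazoo client
for Zookeeper; the paths, address and the JSON payload format are made
up for the example:

    import json
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1.example.net:2181")
    zk.start()

    # Persistent directory for the service (created only if missing).
    zk.ensure_path("/frontends")

    # Ephemeral registration: tied to this client's session. If the
    # session dies (crash, network loss, timeout), the node vanishes
    # on its own.
    payload = json.dumps({"ip": "10.0.0.17", "port": 8080}).encode()
    zk.create("/frontends/client-17", payload, ephemeral=True)

    # A load balancer on the other side reads the current pool:
    for child in zk.get_children("/frontends"):
        data, stat = zk.get("/frontends/" + child)
        print(child, json.loads(data))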
For anyone who has previously worked with manually maintained lists of
servers in load balancers, this procedure is frighteningly automatic:
- Servers regularly check for themselves whether they are functional
and internally consistent.
- If a server passes its internal consistency check, it registers
itself in the service directory in the ensemble's database.
- If the server or its consistency check fails, the session eventually
expires by timeout and the server drops out of the service directory of
the ensemble.
- If a server is taken out of service, it logs off from the ensemble
and closes its session, and thereby also automatically drops out of the
load balancer.
- Load balancers thus automatically find a current list of all servers
in the service directory and need to know nothing about consistency
checks, nor about how servers are found or how they re-register. A
config reload when servers enter or leave the pool is not necessary.
- DNS is not involved, and accordingly none of the DNS-typical cache
problems or update delays occur. Because there is one source of truth -
the ensemble - all the systems involved know what is true and agree on
which version number the truth has.
Often clients can place an observer - a watch - on a single node or on
a directory. If a node changes (different attributes, the node
disappears or appears), or a node in the tree below the observed node
changes, the client is notified and can re-read the observed node or
tree.
In this way, clients are automatically notified of configuration
changes.
So you do not have to SIGHUP a load balancer when servers join or leave
the pool; instead the load balancer receives a notification from its
watch on "/frontends" and re-reads the list of frontends when it
changes.
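With kazoo, such a watch on the directory could look roughly like this
(a sketch; the reconfiguration function is a made-up placeholder):

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1.example.net:2181")
    zk.start()

    def reconfigure_pool(frontends):
        # Hypothetical: push the new backend list into the balancer.
        print("active frontends:", frontends)

    # The callback fires once immediately and then again on every
    # change below /frontends - no SIGHUP, no polling, no config
    # reload.
    @zk.ChildrenWatch("/frontends")
    def on_frontends_changed(children):
        reconfigure_pool(children)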
Other useful operations
Load balancers can also register themselves in "/loadbalancer", as
"/loadbalancer/lb-1" and "/loadbalancer/lb-2" and so on. Each load
balancer can try to take possession of the node
"/loadbalancer/active-lb", because the ensemble usually has one or more
atomic operations for this, such as creating an ephemeral node if it
does not yet exist. A load balancer can thus create this node (and then
it belongs to this LB) if it does not exist, or the creation fails.
In this way there can only ever be one "/loadbalancer/active-lb", and
there is consensus about who owns it.
So it is no problem to launch multiple load balancer instances - only
one of them will be active and accept and distribute requests. The
others stand around on standby, watching via their watch on
"/loadbalancer/active-lb" whether it is still there. If the active LB
loses its session for whatever reason, they try to take over the node.
Whichever succeeds becomes the load balancer in place of the old load
balancer and takes over the work.
Because the ensemble guarantees consistency here, it is ensured that
any number of inactive load balancers can stand by passively and that
exactly one of these load balancers will be the active, request-
accepting load balancer.
Because the notification via watches is interrupt-driven, the inactive
load balancers do not have to constantly poll whether they are on duty;
they are woken up when there is something to do.
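A rough sketch of this pattern with kazoo, under the same made-up paths
as above. It shows the bare primitive from the text - create-if-not-
exists plus a watch on the node; kazoo also ships a ready-made Election
recipe that packages the same idea:

    from kazoo.client import KazooClient
    from kazoo.exceptions import NodeExistsError

    zk = KazooClient(hosts="zk1.example.net:2181")
    zk.start()
    zk.ensure_path("/loadbalancer")

    def try_to_become_active():
        try:
            # Atomic: succeeds for exactly one contender, fails for
            # everyone else.
            zk.create("/loadbalancer/active-lb", b"lb-2", ephemeral=True)
            return True   # we own the node -> start accepting requests
        except NodeExistsError:
            return False  # someone else is active -> stay on standby

    if not try_to_become_active():
        # Interrupt-driven standby: woken when the active LB's session
        # (and with it the ephemeral node) disappears.
        @zk.DataWatch("/loadbalancer/active-lb")
        def on_active_changed(data, stat):
            if stat is None:          # node is gone - try to take over
                try_to_become_active()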
Even more useful operations
In the same way, clients can use atomic operations - atomically moving
nodes from one directory to another - to claim work from a work queue,
or to show which clients hold which data fragments of a large database
on which node.
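Zookeeper has no single "move" operation, but the same effect can be
had with a multi-operation transaction: delete here and create there,
atomically. A sketch with kazoo, with made-up queue paths:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1.example.net:2181")
    zk.start()

    def claim_job(job):
        # Delete from the pending directory and create the claimed
        # entry in one atomic transaction: either both happen or
        # neither does, so two workers can never claim the same job.
        data, _ = zk.get("/queue/pending/" + job)
        tx = zk.transaction()
        tx.delete("/queue/pending/" + job)
        tx.create("/queue/claimed/" + job, data, ephemeral=True)
        results = tx.commit()
        return not any(isinstance(r, Exception) for r in results)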
Performance limits
If you have a source of truth in your distributed system, many things
suddenly become easy, and one is tempted to offload all kinds of work
onto this source of truth. But two things must be kept in mind here:
- The whole thing is an expensive affair. Because each write operation
has to be coordinated between the participating nodes, network
latencies come into play. No matter how fast the local machines are:
depending on the network, the fun is over at around 2,000-10,000 write
operations per second. So it does not scale to run every counter and
every little fart in a large cluster with hundreds of nodes through the
consensus system with maximum cluster-wide effort, even if that makes
the local code beautifully simple.
- The whole thing is absolutely consistent ("C"). This means that in
the case of a network failure ("P" - a partition of the network) it
cannot be available ("A"). That is, there will be downtime of the
ensemble, and that is exactly the point: the ensemble shuts itself off
when the network wobbles, because it first has to establish consistency
via leader election and data synchronization before it can let the
clients back at it.
Clients must therefore come to terms with the ensemble being
temporarily gone, or with the ensemble being temporarily read-only.
How they do that is something the makers of the consensus systems do
not care about.
But the point is that "Zookeeper is gone" is a feature and not a bug -
it is precisely the point of Zookeeper not to be "A" until "C" has been
ensured, by a majority of systems having compared and synchronized
their versions of the data.