Gist: yurishkuro/10cb2dc42f42a007a8ce0e055ed0d171
etcd vs consul vs ???
  • What do Etcd, Consul, and Zookeeper do?

    • Service Registration:
      • Host, port number, and sometimes authentication credentials, protocols, version numbers, and/or environment details.
    • Service Discovery:
      • Ability for client applications to query the central registry to learn a service's location.
    • Consistent and durable general-purpose K/V store across a distributed system.
      • Some solutions support this better than others.
      • Based on Paxos or a derivative consensus algorithm (e.g. Raft) to quickly converge to a consistent state.
      • Centralized locking can be based on this K/V store.
    • Leader Election:
      • Not to be confused with leader election within the quorum of Etcd/Consul nodes. This is an implementation detail that is transparent to the user. What we are talking about here is leader election among the services that are registered against Etcd/Consul.
      • Etcd has tabled its leader election module until the API stabilizes.
    • Other non-standard use cases:
      • Distributed locking
      • Atomic broadcast
      • Sequence numbers
      • Pointers to data in eventually consistent stores.
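To make "centralized locking can be based on this K/V store" concrete: a lock (or a leader slot) is just a key that only one client can create, held under a TTL so it disappears if the holder dies. The sketch below is a hypothetical, in-memory stand-in; `LeaseKV`, `try_become_leader` and the `/leaders/<service>` layout are made up for illustration. In etcd or Consul, the atomic create and the expiry would be performed by the cluster, not by the client.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Lease:
    holder: str
    expires_at: float  # absolute clock time at which the key disappears


class LeaseKV:
    """Hypothetical in-memory stand-in for a consistent K/V store."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so expiry can be shown without sleeping
        self._lock = threading.Lock()  # makes each operation atomic
        self._data: Dict[str, Lease] = {}

    def _live(self, key: str) -> Optional[Lease]:
        # Caller must hold self._lock. Expired keys are deleted lazily.
        lease = self._data.get(key)
        if lease is not None and lease.expires_at <= self._clock():
            del self._data[key]
            return None
        return lease

    def create_if_absent(self, key: str, holder: str, ttl: float) -> bool:
        """Atomically claim `key`; False if a live holder already exists."""
        with self._lock:
            if self._live(key) is not None:
                return False
            self._data[key] = Lease(holder, self._clock() + ttl)
            return True

    def refresh(self, key: str, holder: str, ttl: float) -> bool:
        """Heartbeat: extend the TTL, but only while `holder` still owns the key."""
        with self._lock:
            lease = self._live(key)
            if lease is None or lease.holder != holder:
                return False
            lease.expires_at = self._clock() + ttl
            return True

    def current_holder(self, key: str) -> Optional[str]:
        with self._lock:
            lease = self._live(key)
            return lease.holder if lease else None


def try_become_leader(kv: LeaseKV, service: str, instance: str, ttl: float) -> bool:
    # Hypothetical key layout: whoever creates /leaders/<service> first leads.
    return kv.create_if_absent(f"/leaders/{service}", instance, ttl)
```

A leader that stops calling `refresh` loses the key after one TTL, and the next `try_become_leader` call from another instance succeeds.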
  • How do they behave in a distributed system?

    • All of the solutions under consideration are primarily CP systems in the CAP context. That is, they favor consistency over availability. This means that all nodes have a consistent view of written data, but at the expense of availability in the event of a network partition (e.g. loss of a node).
      • Some of these solutions will support "stale reads" in the event of node loss.
    • Each solution can run with only a single node. It is generally advised that we run one etcd/Consul instance per VM/physical host. We do not want to have an etcd/Consul per container!
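The cluster-size advice follows from the majority-quorum arithmetic that Raft/Paxos-style systems share: a cluster of n voting members needs floor(n/2) + 1 of them to agree. So a single node works but tolerates no failures, and an even-sized cluster tolerates no more failures than the next smaller odd one. It is also why the minority side of a partition stops accepting writes (the CP behaviour above). A quick calculation:

```python
def quorum(n: int) -> int:
    """Votes needed for a majority in a cluster of n voting members."""
    if n < 1:
        raise ValueError("a cluster needs at least one member")
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Members that can be lost while the rest can still form a majority."""
    return n - quorum(n)


for n in (1, 2, 3, 4, 5):
    print(f"{n} node(s): quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

For 3 nodes this prints a quorum of 2 with 1 tolerated failure; for 5 nodes, a quorum of 3 with 2 tolerated failures.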
  • Immediate problems that we are trying to solve:

    • Get and set dynamic configuration across a distributed system (e.g. things in moc.config.json):
      • This is perhaps the most pressing problem that we need to solve.
      • SCM tools like Puppet/Ansible are great for managing static configuration, but they are too heavy for dynamic changes.
    • Service registration:
      • We need to be able to spin up a track and have services make themselves visible via DNS.
      • This would be useful primarily outside of production where we would want to regularly spin up and destroy tracks.
      • That said, we don't have a highly distributed and elastic architecture, so we could get by without this for a while.
    • Service discovery:
      • Services must be able to determine which host to talk to for a particular service.
      • This may not be as important for production if we have a load balancer. In fact, a load balancer would be more transparent to our existing apps, as it works at the IP level.
      • That said, we don't have a highly distributed and elastic architecture, so we could get by without this for a while.
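One way to move something like moc.config.json into etcd or Consul is to flatten it into one key per leaf value, so that services can read or watch just the subtree they care about. This is a sketch under assumptions: the `config/` prefix and the key layout are hypothetical, and values are stored as JSON text so that numbers and booleans round-trip.

```python
import json
from typing import Any, Dict


def flatten(config: Any, prefix: str = "config") -> Dict[str, str]:
    """Flatten nested JSON into '/'-separated keys, one scalar per key."""
    out: Dict[str, str] = {}
    if isinstance(config, dict):
        for key, value in config.items():
            out.update(flatten(value, f"{prefix}/{key}"))
    elif isinstance(config, list):
        # List items become numbered children: config/features/0, config/features/1, ...
        for index, value in enumerate(config):
            out.update(flatten(value, f"{prefix}/{index}"))
    else:
        # Store scalars as JSON text so types (numbers, booleans, null) survive.
        out[prefix] = json.dumps(config)
    return out


# Hypothetical config, standing in for the real file.
sample = {"db": {"host": "db1", "port": 5432}, "features": ["a", "b"], "debug": False}
for key, value in flatten(sample).items():
    print(key, value)
```

Each resulting pair could then be written with a plain HTTP PUT, e.g. to /v2/keys/<key> on etcd's v2 API or to /v1/kv/<key> on Consul. Note that empty objects and lists produce no keys in this sketch.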
  • Features that we don't need for now:

    • Leader election. Many of our apps are currently not designed to scale horizontally. However, it should be noted that Consul has the ability to select a leader based on health checks.
  • Problems that these tools are not designed to solve:

    • Load-balancing.
  • Things that I've explored:

    • Etcd:

      • Basic info:
        • Service registration relies on using a key TTL along with heartbeating from the service to ensure the key remains available. If a service fails to update the key’s TTL, Etcd will expire it. If a service becomes unavailable, clients will need to handle the connection failure and try another service instance.
        • There would be a compelling reason to favor Etcd if we ever planned to use CoreOS but I don't see this happening anytime soon.
      • Pros:
        • Service discovery involves listing the keys under a directory and then waiting for changes on the directory. Since the API is HTTP-based, the client application keeps a long-polling connection open with the Etcd cluster.
        • Has been around longer than Consul. 150% more GitHub watchers/stars.
        • 3 times as many contributors (i.e. more eyes) and forks on GitHub.
      • Cons:
        • There are claims that the Raft implementation used by Etcd (go-raft) is not quite right (unverified).
        • Immature, but by the time its use is under consideration in production, it should have reached 1.0.
        • Serving DNS records from Etcd may require a separate service/process (verify).
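The TTL-plus-heartbeat registration and the directory-listing discovery described above can be modelled in a few lines. This is an in-memory sketch, not an etcd client: `TTLRegistry` and the `services/<name>/<instance>` key layout are hypothetical, and the clock is injectable so expiry can be demonstrated without sleeping. Against etcd's v2 HTTP API, the heartbeat would be a PUT to /v2/keys/services/<name>/<instance> with `value` and `ttl` form fields.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLRegistry:
    """Hypothetical in-memory model of TTL-based service registration."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._keys: Dict[str, Tuple[str, float]] = {}  # key -> (address, expires_at)

    def heartbeat(self, service: str, instance: str, address: str, ttl: float) -> None:
        # Registering and heartbeating are the same operation:
        # (re)write the instance's key with a fresh TTL.
        self._keys[f"services/{service}/{instance}"] = (address, self._clock() + ttl)

    def _expire(self) -> None:
        now = self._clock()
        for key in [k for k, (_, exp) in self._keys.items() if exp <= now]:
            del self._keys[key]

    def discover(self, service: str) -> List[str]:
        """List live addresses under the service's 'directory'."""
        self._expire()
        prefix = f"services/{service}/"
        return sorted(addr for key, (addr, _) in self._keys.items() if key.startswith(prefix))
```

As the notes say, a key can outlive a crashed instance by up to one TTL, so clients still need to handle connection failures and retry another instance.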
    • Consul:

      • Pros:
        • Has more high-level features like service monitoring.
        • There is another HashiCorp project (envconsul) that will read values from Consul and set them as environment variables for processes.
        • Better documentation.
          • I had an easier time installing and configuring this over Etcd, not that Etcd was particularly hard. Docs make all the difference.
          • Stuff like this makes me want to shed a tear. I commend the KIDS at HashiCorp.
        • You can make DNS queries directly against the Consul agent! Nice! No need for SkyDNS or Helix.
        • We can add arbitrary health checks! Nice, if we are into that sort of thing.
        • Understands the notion of a datacenter. Each cluster is confined to a datacenter, but the cluster is able to communicate with other datacenters/clusters.
          • At Skybox, we might use this feature to separate Docker tracks, even if they live on the same host.
        • It has a rudimentary web UI.
      • Cons:
        • There are claims that Consul's implementation of Raft is better (unverified).
        • Immature. Even younger than Etcd (though there is no reason to believe that there are problems with it).
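Consul's DNS interface names services as [tag.]<service>.service[.<datacenter>].consul, which is how the datacenter idea could map onto separate tracks. Below is a small helper to build those names and the matching `dig` command. The agent address 127.0.0.1 and Consul's default DNS port 8600 are assumptions about a local agent; the service, tag and datacenter names are hypothetical.

```python
from typing import List, Optional


def consul_service_name(service: str, tag: Optional[str] = None,
                        datacenter: Optional[str] = None, domain: str = "consul") -> str:
    """Build a Consul service lookup: [tag.]<service>.service[.<datacenter>].<domain>."""
    labels: List[str] = [tag] if tag else []
    labels += [service, "service"]
    if datacenter:
        labels.append(datacenter)
    labels.append(domain)
    return ".".join(labels)


def dig_command(name: str, agent: str = "127.0.0.1", port: int = 8600) -> List[str]:
    """argv for asking a Consul agent's DNS interface for SRV records (host + port)."""
    return ["dig", f"@{agent}", "-p", str(port), name, "SRV"]


# Hypothetical per-track datacenter, as suggested in the notes above.
print(" ".join(dig_command(consul_service_name("web", tag="v2", datacenter="track-a"))))
```

This prints `dig @127.0.0.1 -p 8600 v2.web.service.track-a.consul SRV`. Omitting the datacenter queries the agent's local one.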
  • Etcd and Consul similarities:

    • HTTP+JSON based API. Curl-able.
    • Docker containers can talk directly with Etcd/Consul over the docker0 interface (i.e. default gateway).
    • Atomic look-before-you-set:
      • Etcd: Compare-and-swap by previous value and/or modification index (prevValue/prevIndex).
      • Consul: Check-and-set by sequence number (the key's ModifyIndex, passed via the cas parameter).
    • DNS TTLs can be set to something VERY low.
      • Etcd: supports TTL (time-to-live) on both keys and directories, which will be honoured: if a value has existed beyond its TTL, it will be automatically deleted.
      • Consul: By default, serves all DNS results with a 0 TTL value.
    • Both have been tested with Jepsen (a tool that simulates network partitions in distributed databases).
    • Both work with Confd by Kelsey Hightower.
    • Long polling for changes:
      • Etcd: Easily listen for changes to a prefix via HTTP long-polling.
      • Consul: A blocking query against some endpoints will wait for a change to potentially take place using long polling.
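Look-before-you-set and long polling rest on the same idea: every write bumps an index, writers say "only if nothing changed since index N", and watchers say "wake me when the index passes N". The sketch below is an in-memory model of Consul's variant (per-key ModifyIndex, with cas=0 meaning create-only), not a client library; `IndexedKV` and `increment` are hypothetical names. On the wire the equivalents would be PUT /v1/kv/<key>?cas=<index> and GET /v1/kv/<key>?index=<index>&wait=<duration>, or prevIndex and wait=true&waitIndex=<index> on etcd's v2 API.

```python
import threading
from typing import Dict, Optional, Tuple


class IndexedKV:
    """Hypothetical in-memory model of Consul-style KV indexes."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._index = 0  # global index, bumped on every write
        self._data: Dict[str, Tuple[str, int]] = {}  # key -> (value, modify_index)

    def get(self, key: str) -> Tuple[Optional[str], int]:
        """Return (value, modify_index), or (None, 0) if the key is absent."""
        with self._cond:
            return self._data.get(key, (None, 0))

    def put(self, key: str, value: str, cas: Optional[int] = None) -> bool:
        """Write `key`; with `cas`, only if its ModifyIndex still equals `cas`."""
        with self._cond:
            current = self._data.get(key, (None, 0))[1]
            if cas is not None and cas != current:
                return False  # someone else wrote in between: re-read and retry
            self._index += 1
            self._data[key] = (value, self._index)
            self._cond.notify_all()  # wake any blocking queries
            return True

    def blocking_get(self, key: str, index: int, wait: float) -> Tuple[Optional[str], int]:
        """Long-poll: return once the key's ModifyIndex exceeds `index` or `wait` elapses."""
        with self._cond:
            self._cond.wait_for(lambda: self._data.get(key, (None, 0))[1] > index, timeout=wait)
            return self._data.get(key, (None, 0))


def increment(kv: IndexedKV, key: str) -> int:
    """Atomic counter (the "sequence numbers" use case) via a CAS retry loop."""
    while True:
        value, index = kv.get(key)
        new = int(value or "0") + 1
        if kv.put(key, str(new), cas=index):
            return new
```

A timed-out blocking query returns the unchanged state, so callers should always compare the returned index rather than assume something changed.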
  • Things that I have not explored:

    • SkyDNS: Anyone have good input on this one?
    • Zookeeper: It seems mature but it would take a lot more work to make it work for us.
      • We would have to configure and use it without high-level features.
      • Provides only a primitive K/V store.
      • Requires that application developers build their own system to provide service discovery.
      • Java dependency (and Dan Streit hates Java)
      • All clients must maintain active connections to the ZooKeeper servers, and perform keep-alives.
      • Zookeeper not recommended for virtual environments? Why? I just read this somewhere.
    • Corosync/Pacemaker (not sure if this is a viable solution, actually)
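    • Redis is not viable! It is primarily an in-memory K/V store: persistence is optional and replication is asynchronous, so it does not give us the consistency guarantees we need. Nope.
    • SmartStack (Synapse + Nerve) from Airbnb (not viable, as it only does TCP through HAProxy).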
      • Ruby dependencies and many moving parts.
@mosceo

mosceo commented Oct 7, 2022

Cons: There are claims that Consul's implementation of Raft is better (unverified).

Is it a con for Consul?

@isavcic

isavcic commented Jun 8, 2023

Note that etcd still doesn't pass Jepsen test, while Consul does.
