@skorfmann
Last active June 15, 2019 08:36

How to reset etcd discovery

Every time my test cluster goes down, I struggle with etcd autodiscovery failing. This probably looks familiar to you:

The Problem

systemd[1]: Starting etcd...
systemd[1]: Started etcd.
etcd[3066]: [etcd] Apr  9 08:31:42.512 INFO      | Discovery via https://discovery.etcd.io using prefix /<TOKEN>.
systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
etcd[3066]: [etcd] Apr  9 08:31:43.501 CRITICAL  | Discovery failed and a backup peer list wasn't provided: Discovery found an initialized cluster but no peers are registered.
systemd[1]: Unit etcd.service entered failed state.

One solution is simply to provide a new discovery endpoint (https://discovery.etcd.io/new). However, that's quite annoying as well.

Solution

Another solution was posted by Brandon Philips (https://groups.google.com/forum/#!topic/coreos-dev/Yv13qEHHbQg): you can reset the state of the existing discovery endpoint via curl:

curl -XDELETE https://discovery.etcd.io/<TOKEN>/_state

Then take a look at the logs:

etcd[3073]: [etcd] Apr  9 08:31:53.765 INFO      | Discovery via https://discovery.etcd.io using prefix /<TOKEN>.
etcd[3073]: [etcd] Apr  9 08:31:54.577 INFO      | Discovery _state was empty, so this machine is the initial leader.
etcd[3073]: [etcd] Apr  9 08:31:55.060 INFO      | f25d7a0be22c443cb3c071d9b56c04e1: state changed from 'stopped' to 'follower'.
etcd[3073]: [etcd] Apr  9 08:31:55.061 INFO      | URLs:  / f25d7a0be22c443cb3c071d9b56c04e1 (http://10.128.16.213:7001)
etcd[3073]: [etcd] Apr  9 08:31:56.413 WARNING   | Attempt to join via 10.128.16.213:7001 failed: Error during join version check: Get http://10.128.16.213:7001/version: net/http: timeout awaiting response headers
etcd[3073]: [etcd] Apr  9 08:31:56.414 WARNING   | the entire cluster is down! this peer will restart the cluster.
etcd[3073]: [etcd] Apr  9 08:31:56.414 INFO      | etcd server [name f25d7a0be22c443cb3c071d9b56c04e1, listen on [::]:4001, advertised url http://10.128.16.213:4001]
etcd[3073]: [etcd] Apr  9 08:31:56.416 INFO      | peer server [name f25d7a0be22c443cb3c071d9b56c04e1, listen on [::]:7001, advertised url http://10.128.16.213:7001]
etcd[3073]: [etcd] Apr  9 08:31:56.480 INFO      | f25d7a0be22c443cb3c071d9b56c04e1: state changed from 'follower' to 'candidate'.
etcd[3073]: [etcd] Apr  9 08:31:56.481 INFO      | f25d7a0be22c443cb3c071d9b56c04e1: state changed from 'candidate' to 'leader'.
etcd[3073]: [etcd] Apr  9 08:31:56.482 INFO      | f25d7a0be22c443cb3c071d9b56c04e1: leader changed from '' to 'f25d7a0be22c443cb3c071d9b56c04e1'.
etcd[3073]: [etcd] Apr  9 08:31:59.618 INFO      | f25d7a0be22c443cb3c071d9b56c04e1: snapshot of 1439689 events at index 1439689 completed

No restart, easy to remember, much better!
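
If you reset discovery often, the curl call above can be wrapped in a small shell helper. This is only a sketch: the function names `discovery_state_url` and `reset_discovery` are made up for illustration, and it assumes the token is passed either bare or as the full discovery URL without a trailing slash.

```shell
#!/bin/sh
# Hypothetical helper (not part of etcd): build the _state URL for a token.
# Accepts a bare token or a full URL such as https://discovery.etcd.io/<TOKEN>.
discovery_state_url() {
  token="${1##*/}"   # strip everything up to the last slash, if any
  printf 'https://discovery.etcd.io/%s/_state\n' "$token"
}

# Hypothetical helper: issue the same DELETE request as the curl call above.
# --fail makes curl return a non-zero exit code on an HTTP error response.
reset_discovery() {
  curl --fail --silent --show-error -X DELETE "$(discovery_state_url "$1")"
}
```

After sourcing the file, `reset_discovery <TOKEN>` sends the request, and `journalctl -u etcd -f` is one way to follow the etcd unit's log afterwards.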

pctj101 commented Dec 2, 2014

Thank you for the gist!
