Every time my test cluster goes down, I struggle with etcd autodiscovery failing. This probably looks familiar to you:
systemd[1]: Starting etcd...
systemd[1]: Started etcd.
etcd[3066]: [etcd] Apr 9 08:31:42.512 INFO | Discovery via https://discovery.etcd.io using prefix /<TOKEN>.
systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
etcd[3066]: [etcd] Apr 9 08:31:43.501 CRITICAL | Discovery failed and a backup peer list wasn't provided: Discovery found an initialized cluster but no peers are registered.
systemd[1]: Unit etcd.service entered failed state.
One solution is simply to request a new discovery endpoint (https://discovery.etcd.io/new). However, that's quite annoying as well, because every machine then has to be pointed at the new token.
Brandon Philips posted another solution (https://groups.google.com/forum/#!topic/coreos-dev/Yv13qEHHbQg): you can easily reset the state of the existing discovery endpoint via curl:
curl -XDELETE https://discovery.etcd.io/<TOKEN>/_state
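If you reset the endpoint often, the curl call can be wrapped in a small shell helper. This is only a sketch: the function names state_url and reset_discovery and the DRY_RUN variable are made up for illustration, and the only thing taken from the post is the DELETE request against https://discovery.etcd.io/<TOKEN>/_state.

```shell
#!/bin/sh
# Hypothetical helpers (not part of etcd or CoreOS).

# Build the _state URL from either a bare token or a full discovery URL.
state_url() {
  token="${1%/}"          # drop a trailing slash, if any
  token="${token##*/}"    # keep only the part after the last slash
  printf 'https://discovery.etcd.io/%s/_state\n' "$token"
}

# Reset the discovery state for the given token.
# With DRY_RUN=1 the curl command is only printed, not executed.
reset_discovery() {
  url="$(state_url "$1")"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "curl -XDELETE $url"
  else
    curl -XDELETE "$url"
  fi
}
```

Running DRY_RUN=1 reset_discovery <TOKEN> first lets you check the URL before anything is actually deleted.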
After the reset, take a look at the logs:
etcd[3073]: [etcd] Apr 9 08:31:53.765 INFO | Discovery via https://discovery.etcd.io using prefix /<TOKEN>.
etcd[3073]: [etcd] Apr 9 08:31:54.577 INFO | Discovery _state was empty, so this machine is the initial leader.
etcd[3073]: [etcd] Apr 9 08:31:55.060 INFO | f25d7a0be22c443cb3c071d9b56c04e1: state changed from 'stopped' to 'follower'.
etcd[3073]: [etcd] Apr 9 08:31:55.061 INFO | URLs: / f25d7a0be22c443cb3c071d9b56c04e1 (http://10.128.16.213:7001)
etcd[3073]: [etcd] Apr 9 08:31:56.413 WARNING | Attempt to join via 10.128.16.213:7001 failed: Error during join version check: Get http://10.128.16.213:7001/version: net/http: timeout awaiting response headers
etcd[3073]: [etcd] Apr 9 08:31:56.414 WARNING | the entire cluster is down! this peer will restart the cluster.
etcd[3073]: [etcd] Apr 9 08:31:56.414 INFO | etcd server [name f25d7a0be22c443cb3c071d9b56c04e1, listen on [::]:4001, advertised url http://10.128.16.213:4001]
etcd[3073]: [etcd] Apr 9 08:31:56.416 INFO | peer server [name f25d7a0be22c443cb3c071d9b56c04e1, listen on [::]:7001, advertised url http://10.128.16.213:7001]
etcd[3073]: [etcd] Apr 9 08:31:56.480 INFO | f25d7a0be22c443cb3c071d9b56c04e1: state changed from 'follower' to 'candidate'.
etcd[3073]: [etcd] Apr 9 08:31:56.481 INFO | f25d7a0be22c443cb3c071d9b56c04e1: state changed from 'candidate' to 'leader'.
etcd[3073]: [etcd] Apr 9 08:31:56.482 INFO | f25d7a0be22c443cb3c071d9b56c04e1: leader changed from '' to 'f25d7a0be22c443cb3c071d9b56c04e1'.
etcd[3073]: [etcd] Apr 9 08:31:59.618 INFO | f25d7a0be22c443cb3c071d9b56c04e1: snapshot of 1439689 events at index 1439689 completed
No restart, easy to remember, much better!
Thank you for the gist!