I hereby claim:
- I am armon on github.
- I am armon (https://keybase.io/armon) on keybase.
- I have a public key ASByRbYLSFAAGnJf0iMdYR0t5U9u5uVjfP8vY6p2s0vFego
To claim this, I am signing this object:
// readPath fetches the policy and acts on it without any coordination.
func readPath(name string) {
    p := GetPolicy(name)
    DoSomething(p)
}

// writePath serializes mutations for a given name through the lock manager.
func writePath(name string) {
    p := GetPolicy(name)
    LockManager.Lock(name, func() {
        DoSomething(p)
    })
}
InfoQ: Vault is an online system that clients must request secrets from; what risk is there that a Vault outage causes downtime?
Armon: HashiCorp has been in the datacenter automation space for several years, and we understand the highly-available nature of modern infrastructure. When we designed Vault, high availability was a critical part of the design, not something we tried to bolt on later. Vault makes use of coordination services like Consul or Zookeeper to perform leader election. This means you can deploy multiple Vault instances, such that if one fails there is an automatic failover to a healthy instance. We typically recommend deploying at least
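For a concrete picture of the coordination piece described above, here is a minimal Go sketch of leader election on top of Consul's lock API, the same kind of primitive Vault's HA mode relies on; the key name and error handling are illustrative assumptions, not Vault's actual implementation.

package main

import (
    "log"

    "github.com/hashicorp/consul/api"
)

func main() {
    // Connect to the local Consul agent.
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // All instances contend for the same well-known key; the one that
    // acquires the lock becomes active, the rest wait as standbys.
    lock, err := client.LockKey("service/vault-demo/leader") // illustrative key
    if err != nil {
        log.Fatal(err)
    }

    lostCh, err := lock.Lock(nil)
    if err != nil {
        log.Fatal(err)
    }
    log.Println("acquired leadership; serving requests")

    // If the underlying session is invalidated (instance failure, network
    // partition), the channel closes and a standby takes over.
    <-lostCh
    log.Println("lost leadership")
}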
The simplest way to do this with Consul is to run a single "global" datacenter.
This means the timing for the LAN
gossip needs to be tuned to be WAN-appropriate.
In consul/config.go
(https://github.com/hashicorp/consul/blob/master/consul/config.go#L267),
do something like:
// Make the 'LAN' more forgiving for latency spikes
conf.SerfLANConfig.MemberlistConfig = memberlist.DefaultWANConfig()
Then we need to tune the Raft layer to be extremely forgiving.
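As a sketch of what "forgiving" means here, the timing fields on the embedded hashicorp/raft config can be widened in the same place; the specific durations below are illustrative assumptions, not a recommended baseline.

// Relax Raft timings so WAN-scale latency spikes do not trigger
// spurious leader elections. Values are illustrative only.
conf.RaftConfig.HeartbeatTimeout = 3 * time.Second
conf.RaftConfig.ElectionTimeout = 3 * time.Second
conf.RaftConfig.LeaderLeaseTimeout = 1 * time.Second
conf.RaftConfig.CommitTimeout = 500 * time.Millisecond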
#!/bin/bash
# Store the live members
consul members | grep alive | awk '{ print $1 }' > /tmp/alive.txt

# Clean-up the collectd metrics
cd /data/graphite/whisper/collectd
ls | awk '{print substr($1, 0, index($1, "_node_")) }' > /tmp/monitored.txt

# Remove metrics for monitored nodes that are no longer alive
for NODE in `cat /tmp/monitored.txt`; do
    if grep -q $NODE /tmp/alive.txt; then
        echo $NODE alive
    else
        echo $NODE dead
        sudo rm -Rf ${NODE}_node_*
    fi
done
The initial observed cluster behavior:
1) Constant churn of nodes between Failed and Alive
2) Message bus saturated (~150 updates/sec)
3) Subset of cluster affected
4) Some nodes that are flapping don't exist! (Node dead, or agent down)

One immediate question is how the cluster remained in an unstable
state. We expect that the cluster should converge and return to
a quiet state after some time. However, there was a bug in the
low-level SWIM implementation (memberlist library).
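For context on the knobs involved, here is a small Go sketch of the memberlist (SWIM) timing parameters that govern how quickly suspicion and gossip converge; the values are illustrative assumptions, and this is not the fix for the bug itself.

// Illustrative only: the memberlist settings that control failure
// detection and gossip convergence.
conf := memberlist.DefaultLANConfig()
// More confirmation rounds before declaring a node dead reduces
// flapping caused by transient packet loss.
conf.SuspicionMult = 6
// Slower probing and gossip trade detection latency for less
// churn on an already-saturated message bus.
conf.ProbeInterval = 2 * time.Second
conf.GossipInterval = 400 * time.Millisecond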
Sent 5/1/2014

Hey Igor,

Glad you did a write-up! I’m one of the authors of Consul. You mention we get some
things wrong about SmartStack; we would love to get that corrected. The website
is generated from this file:
https://github.com/hashicorp/consul/blob/master/website/source/intro/vs/smartstack.html.markdown
armon:~/projects/consul-demo-tf/tf (master) $ TF_LOG=1 terraform plan
2014/10/15 19:51:31 Detected home directory from env var: /Users/armon
2014/10/15 19:51:31 [DEBUG] Discoverd plugin: aws = /Users/armon/projects/go/bin/terraform-provider-aws
2014/10/15 19:51:31 [DEBUG] Discoverd plugin: cloudflare = /Users/armon/projects/go/bin/terraform-provider-cloudflare
2014/10/15 19:51:31 [DEBUG] Discoverd plugin: consul = /Users/armon/projects/go/bin/terraform-provider-consul
2014/10/15 19:51:31 [DEBUG] Discoverd plugin: digitalocean = /Users/armon/projects/go/bin/terraform-provider-digitalocean
2014/10/15 19:51:31 [DEBUG] Discoverd plugin: dnsimple = /Users/armon/projects/go/bin/terraform-provider-dnsimple
2014/10/15 19:51:31 [DEBUG] Discoverd plugin: google = /Users/armon/projects/go/bin/terraform-provider-google
2014/10/15 19:51:31 [DEBUG] Discoverd plugin: heroku = /Users/armon/projects/go/bin/terraform-provider-heroku
2014/10/15 19:51:31 [DEBUG] Discoverd plugin: mailgun = /Users/armon/projects/go/bin/terraform-
2014/09/12 14:21:07 [DEBUG] http: Request /v1/agent/self (385.106us)
2014/09/12 14:21:07 [DEBUG] http: Request /v1/event/fire/mysql-available (80.68us)
2014/09/12 14:21:07 [DEBUG] consul: user event: mysql-available
2014/09/12 14:21:07 [DEBUG] agent: new event: mysql-available (be9e89d7-e66b-8dbf-6a3e-ac1f64cfbc27)
2014/09/12 14:21:07 [DEBUG] http: Request /v1/event/list?index=1&name=mysql-available (5.50690515s)
2014/09/12 14:21:07 [DEBUG] http: Request /v1/event/list?index=1&name=mysql-available (42.151us)
2014/09/12 14:21:07 [DEBUG] agent: watch handler 'cat >> events.out' output:
# jdyer at MacBook-Pro.local in ~/Projects/consul [15:48:45]
$ dig @localhost -p 8600 _sip._udp.service.consul srv

; <<>> DiG 9.10.0-P2 <<>> @localhost -p 8600 _sip._udp.service.consul srv
; (3 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 5926
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available