@optiz0r
Last active April 7, 2022 10:58
Upgrades a HashiCorp Vault cluster and clients using Choria

Upgrade Vault Cluster

This Choria playbook automates a simple version upgrade of a Vault cluster:

  • Upgrades follower servers first
  • Upgrades the leader last (to reduce the risk of a failover from a newer version to an older one)
  • Sleeps between upgrades so that vault operator unseal can be run and vault can re-register itself in service discovery
  • Bulk updates all clients

Warning!

If you have manual unseal, you must watch the playbook run, and unseal the nodes during the sleep period before the next node is upgraded. If you do not unseal at least one node before the final node is upgraded, there will be an outage.

You might also want to manually run vault operator step-down between the penultimate and final node upgrades, to minimise the impact.

A future version of this playbook might verify that nodes have been unsealed before continuing with the upgrade of other nodes. This is not currently implemented.
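
Until such a check exists, the seal state can be confirmed by hand during the sleep window. The following is a minimal sketch and not part of the playbook: wait_unsealed is a hypothetical helper and its retry defaults are arbitrary. It relies on vault status exiting 0 when the node is unsealed and 2 when it is sealed, and it assumes VAULT_ADDR points at the node that was just upgraded.

```shell
# Hypothetical helper: poll a node until `vault status` reports it unsealed.
# `vault status` exits 0 when unsealed, 2 when sealed and 1 on error, so any
# non-zero exit is treated as "not ready yet".
wait_unsealed() {
  tries="${1:-30}"   # assumed default: 30 attempts
  delay="${2:-2}"    # assumed default: 2 seconds between attempts
  i=0
  while [ "$i" -lt "$tries" ]; do
    if vault status >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example, run by hand during the sleep window (the address is a placeholder):
#   export VAULT_ADDR=https://vault-node.example:8200
#   vault operator unseal      # prompts for a key; repeat until the threshold is met
#   wait_unsealed && echo "node is unsealed"
#   vault operator step-down   # against the active node, before the final upgrade
```

The helper only reports on the node that VAULT_ADDR points at, so it has to be repeated for each upgraded server.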

Upgrade procedure

  • On each server:
    • Disables puppet, to avoid interference
    • Waits for any in-progress puppet runs to complete
    • Expires the yum cache (yum clean expire-cache) so that the repos show the latest packages
    • Ensures the vault package version is the desired version
    • Re-enables puppet
    • Runs puppet (which is assumed to make a change and so trigger a restart of the service)
      • For me, puppet will always be setting a file capability on the newly installed vault binary, so this is a safe assumption
    • Waits for the puppet run to complete
  • On each client (in batches of 100):
    • Expires the yum cache (yum clean expire-cache)
    • Ensures the vault package version is the desired version
  • Fetches the installed vault version from all nodes for review
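
For orientation, the per-server steps above correspond roughly to the following individual RPC calls. This is a sketch only: upgrade_one_server is a hypothetical wrapper, the agent and action names are the ones the playbook itself uses, and the sleep and the two waits on puppet status are left out for brevity.

```shell
# Hypothetical wrapper: the per-server steps as direct `mco rpc` calls against
# a single node, selected with the -I identity filter. Stops at the first failure.
upgrade_one_server() {
  node="$1"
  version="$2"
  mco rpc puppet disable message="Upgrading vault by hand" -I "$node" &&
  mco rpc package yum_clean mode=expire-cache -I "$node" &&
  mco rpc package install package=vault version="$version" -I "$node" &&
  mco rpc puppet enable -I "$node" &&
  mco rpc puppet runonce force=true -I "$node"
}

# Example (the node name is a placeholder):
#   upgrade_one_server vault1.example.com '1.10.0+ent'
```

Unlike the playbook, this does not wait for an in-progress puppet run to finish before installing the package.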

Dependencies

  • Assumes this playbook is present in a site_vault puppet module
  • Consul DNS for locating the leader instance
  • yum/dnf based distro (because the playbook issues a yum clean)
  • Custom facts (site specific):
    • group: the cluster identifier to upgrade (optional)
    • hostname_environment: a site-specific identifier for the environment, obtained by parsing the hostname
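
The leader lookup can be checked from a shell before running the playbook. The sketch below mirrors the playbook's mapping of environment to Consul domain. vault_domain and vault_leader are hypothetical helper names, and it assumes that the local resolver forwards these domains to Consul and that the active record is published as a CNAME, as the playbook expects.

```shell
# Map the playbook's environment letter to its Consul DNS domain
# (the same mapping that site_vault::upgrade_cluster uses).
vault_domain() {
  case "$1" in
    p) echo "consul" ;;
    d) echo "dev.consul" ;;
    *) echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Resolve the CNAME target of the active vault service and strip the
# trailing dot that dig prints on fully qualified names.
vault_leader() {
  domain=$(vault_domain "$1") || return 1
  dig +short "active.vault.service.${domain}" CNAME | sed 's/\.$//'
}

# Example:
#   vault_leader p
```

If this prints nothing, the playbook is likely to abort with its "Expected to find leader node" error.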

Usage

mco playbook run site_vault::upgrade_cluster --modulepath modules:thirdparty --environment p --new_version '1.10.0+ent'

  • The --noop flag takes no upgrade action; it only prints the hostnames that would be affected
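
A cautious workflow is a dry run first, then the real run with identical arguments. The wrapper below is a sketch of that. upgrade_vault is a hypothetical name; only the mco playbook run invocation and the --noop flag come from the usage above.

```shell
# Hypothetical wrapper around the documented invocation. Any extra arguments
# (for example --noop) are passed through to the playbook unchanged.
upgrade_vault() {
  version="$1"
  environment="$2"
  shift 2
  mco playbook run site_vault::upgrade_cluster \
    --modulepath modules:thirdparty \
    --environment "$environment" \
    --new_version "$version" \
    "$@"
}

# 1. Dry run, review the list of hosts:
#      upgrade_vault '1.10.0+ent' p --noop
# 2. Real run with the same version and environment:
#      upgrade_vault '1.10.0+ent' p
```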

# @summary Upgrades vault on a set of cluster agents
#
# @param nodes
#   The nodes to upgrade
#
# @param new_version
#   The new version to use (must be an upgrade)
#
# @param sleep
#   How long to sleep before starting the upgrade (in seconds)
#
# @param fail_ok
#   Whether the playbook should carry on when a task fails on some nodes
#
plan site_vault::upgrade_agent (
  Choria::Nodes $nodes = [],
  String $new_version,
  Integer $sleep = 60,
  Boolean $fail_ok = false,
) {
  # Flush the yum caches before starting to make sure the new version is visible.
  # Wait $sleep seconds (60 by default) before starting. This is what gives us
  # the settle time between nodes
  choria::task(
    'mcollective',
    'action' => 'package.yum_clean',
    'nodes' => $nodes,
    'properties' => {
      'mode' => 'expire-cache',
    },
    'pre_sleep' => $sleep,
    'fail_ok' => $fail_ok,
  )

  # Install the new version
  choria::task(
    'mcollective',
    'action' => 'package.install',
    'nodes' => $nodes,
    'properties' => {
      'package' => 'vault',
      'version' => $new_version,
    },
    'fail_ok' => $fail_ok,
  )
}

# @summary Upgrades a vault cluster to a newer version of vault in sequence
#
# @param new_version
#   The RPM version string (EVR) for the new version to upgrade to, e.g. `1.0.0+ent`
#
# @param environment
#   The environment to upgrade, either `d` for development, or `p` for production
#
# @param skip_servers
#   Whether or not to skip the upgrade of servers (including servers which also run the client)
#
# @param skip_clients
#   Whether or not to skip the upgrade of clients
#   (excluding ones which also run the server, which are always treated as servers)
#
# @param group
#   When upgrading clients, limit to only nodes with group fact matching this regex pattern
#
# @param noop
#   Only list which nodes would be upgraded, don't actually upgrade anything
#
plan site_vault::upgrade_cluster (
  String $new_version,
  Enum['d', 'p'] $environment,
  Boolean $skip_servers = false,
  Boolean $skip_clients = false,
  Optional[String] $group = undef,
  Boolean $noop = false,
) {
  if $environment == 'p' {
    $domain = 'consul'
  } else {
    $domain = 'dev.consul'
  }

  $leader_hostname = dns_cname("active.vault.service.${domain}")

  # Query to find list of vault servers (whether or not those servers are also clients)
  $server_query = @("EOF"/L)
    inventory[certname] {
      facts.hostname_environment = '${environment}'
      and resources {
        type = 'Class'
        and title = 'Site_vault'
        and parameters.use_server = true
      }
    }
    | EOF

  # Query to find list of vault clients (which are not also servers)
  if $group {
    $client_query = @("EOF"/L)
      inventory[certname] {
        facts.hostname_environment = '${environment}'
        and facts.group ~ '${group}'
        and resources {
          type = 'Class'
          and title = 'Site_vault'
          and parameters.use_server = false
        }
      }
      | EOF
  } else {
    $client_query = @("EOF"/L)
      inventory[certname] {
        facts.hostname_environment = '${environment}'
        and resources {
          type = 'Class'
          and title = 'Site_vault'
          and parameters.use_server = false
        }
      }
      | EOF
  }

  if ! $skip_servers {
    $servers = choria::discover(
      'pql',
      'query' => $server_query,
      'test' => false,
      'at_least' => 1,
      'when_empty' => 'Could not find any vault servers to upgrade'
    )

    if ! ($leader_hostname in $servers) {
      fail("Expected to find leader node (${leader_hostname}) in the set of servers, something isn't right, aborting for safety")
    }

    $leader = $servers.filter |$node| {
      $node == $leader_hostname
    }
    $followers = $servers.filter |$node| {
      $node != $leader_hostname
    }
  } else {
    $leader = []
    $followers = []
  }

  if ! $skip_clients {
    $clients = choria::discover(
      'pql',
      'query' => $client_query,
      'test' => false,
      'empty_ok' => true,
    )
  } else {
    $clients = []
  }

  $all_nodes = [$followers, $leader, $clients].flatten

  # An empty array is truthy in Puppet, so test for emptiness explicitly
  if $all_nodes.empty {
    fail('No servers or clients selected for upgrade')
  }

  if ! $noop {
    if ! $skip_servers {
      # Upgrade the followers one by one
      $followers.choria::in_groups_of(1) |$nodes| {
        # Run the upgrade on this node
        choria::run_playbook(
          'site_vault::upgrade_server',
          'nodes' => $nodes,
          'new_version' => $new_version,
        )
      }

      # Upgrade the leader after all the followers have been upgraded
      choria::run_playbook(
        'site_vault::upgrade_server',
        'nodes' => $leader,
        'new_version' => $new_version,
      )
    }

    $clients.choria::in_groups_of(100) |$nodes| {
      # Run the upgrade on this batch of clients
      choria::run_playbook(
        'site_vault::upgrade_agent',
        'nodes' => $nodes,
        'new_version' => $new_version,
        'fail_ok' => true,
      )
    }
  } else {
    notice('In noop mode, no upgrades done')
  }

  # Summarise the installed vault versions across all nodes
  choria::task(
    'mcollective',
    'action' => 'package.status',
    'nodes' => $all_nodes,
    'properties' => {
      'package' => 'vault',
    },
    'post' => ['summarize'],
    'fail_ok' => true,
  )

  # Print the list of selected nodes
  $all_nodes
}

# @summary Upgrades vault on a set of server nodes
#
# @param nodes
#   The nodes to upgrade
#
# @param new_version
#   The new version to use (must be an upgrade)
#
plan site_vault::upgrade_server (
  Choria::Nodes $nodes = [],
  String $new_version,
) {
  # Disable puppet on all nodes before starting to prevent
  # conflicts
  choria::task(
    'mcollective',
    'action' => 'puppet.disable',
    'nodes' => $nodes,
    'fail_ok' => true,
    'silent' => true,
    'properties' => {
      'message' => 'Upgrading vault cluster via choria playbook'
    }
  )

  # Wait for any in-progress puppet run to finish
  choria::task(
    'mcollective',
    'action' => 'puppet.status',
    'nodes' => $nodes,
    'assert' => 'applying=false',
    'tries' => 10,
    'try_sleep' => 30,
    'silent' => true,
  )

  # Run the upgrade on this node
  choria::run_playbook(
    'site_vault::upgrade_agent',
    'nodes' => $nodes,
    'new_version' => $new_version,
    'sleep' => 60,
  )

  # Re-enable puppet on this node
  choria::task(
    'mcollective',
    'action' => 'puppet.enable',
    'nodes' => $nodes,
    'fail_ok' => true,
    'silent' => true,
  )

  # Trigger a background puppet run
  choria::task(
    'mcollective',
    'action' => 'puppet.runonce',
    'nodes' => $nodes,
    'properties' => {
      'force' => true,
    }
  )

  # Wait for the puppet run to complete
  choria::task(
    'mcollective',
    'action' => 'puppet.status',
    'nodes' => $nodes,
    'assert' => 'applying=false',
    'pre_sleep' => 10,
    'tries' => 10,
    'try_sleep' => 30,
    'silent' => true,
  )
}