@mikegreen
Created November 8, 2021 17:47
Vault-prod-hardening
The purpose of this document is to outline the security, operational, and support tasks and conditions to consider for a production-ready / mission-critical Vault deployment.
This is a living document. Please feel free to suggest changes and have someone review and approve them. You might also want to see this doc from Julia that was done for pre-renewal health checking.
Infrastructure Security
Are servers provisioned via a build/codified pipeline?
Can staff login (SSH/Console/etc) to individual servers?
Is all traffic in/out of each server encrypted?
Is the cluster subnet firewalled from other network resources?
If a server is destroyed/lost, are logs and events available post-mortem?
Is root token creation restricted? Monitored?
When Vault was initialized, were the Shamir/recovery keys kept secure?
Is SELinux in play? Has this been reviewed?
Vault Operations
Is the root token present/long-lived (it should not be)?
Have production hardening recommendations been followed, and if not, has the reason for each deviation been documented?
How far back can access to a secret be audited?
Who has access to audit logs?
Who has access to server/application logs?
Has Vault operation been incorporated into the business continuity plan?
If an urgent Vault CVE was published, how quickly could Vault be tested and upgraded?
Is everyone administering and developing against Vault aware of the Vault limitations outlined here https://www.vaultproject.io/docs/internals/limits?
Are only supported/required ciphers enabled?
Can users/services create non-entity tokens, or are they required to use token roles?
Vault Hygiene
Are reasonable rate limits implemented and monitored/updated (be mindful of rate limit bugs until 1.6.3)?
Are quota limits reasonable to allow growth but also prevent token spikes/etc?
What Vault token monitoring is in place?
Are developers/admins aware that they should avoid relying on the system default 768h TTL?
Are max_lease_ttl values being set?
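The TTL questions above can be checked mechanically. The following is a minimal sketch, not a definitive implementation. It assumes the mount data has already been fetched and is shaped like a listing of mounts, where each mount has a "config" with a "max_lease_ttl" and a value of 0 means the system default applies. The names parse_ttl, flag_mounts and the 24h ceiling are hypothetical choices for illustration.

```python
import re

# Hypothetical ceiling chosen for illustration; pick one that fits your workloads.
MAX_ALLOWED_TTL_SECONDS = 24 * 3600

_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_ttl(value):
    """Convert a TTL such as "768h", "30m", "1h30m" or 3600 to seconds."""
    if isinstance(value, int):
        return value
    text = value.strip()
    if text.isdigit():
        return int(text)
    parts = re.findall(r"(\d+)([smh])", text)
    # Reject anything that is not purely a sequence of <number><unit> pairs.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unrecognised TTL: {value!r}")
    return sum(int(n) * _UNITS[u] for n, u in parts)


def flag_mounts(mounts, ceiling=MAX_ALLOWED_TTL_SECONDS):
    """Return {mount_path: reason} for mounts with an unset or overly long max TTL."""
    findings = {}
    for path, info in mounts.items():
        # Assumption: 0 (or a missing value) means the mount inherits the system default.
        ttl = parse_ttl(info.get("config", {}).get("max_lease_ttl", 0))
        if ttl == 0:
            findings[path] = "max_lease_ttl unset, falls back to system default"
        elif ttl > ceiling:
            findings[path] = f"max_lease_ttl {ttl}s exceeds ceiling {ceiling}s"
    return findings
```

For example, parse_ttl("768h") gives 2764800 seconds (32 days), which is why an unset mount is worth flagging.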
Replication/DR
Is replication WAL monitored?
Are metrics/monitoring here understood? https://learn.hashicorp.com/tutorials/vault/monitor-replication
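One way to reason about WAL monitoring is to compare the primary's WAL index with the index the secondary reports having received. The sketch below assumes those two numbers have already been collected from each cluster's replication status (the field names last_wal and last_remote_wal in the comments are an assumption, so check them against the tutorial linked above). The function names and the three-sample window are hypothetical.

```python
def wal_lag(primary_last_wal, secondary_last_remote_wal):
    """Number of WAL entries the secondary is behind the primary (never negative)."""
    return max(0, primary_last_wal - secondary_last_remote_wal)


def is_stalled(samples, min_samples=3):
    """Detect a secondary that stopped advancing while the primary kept writing.

    samples is a chronological list of (primary_last_wal, secondary_last_remote_wal)
    pairs, assumed to come from periodic polling of both clusters.
    """
    if len(samples) < min_samples:
        return False  # not enough history to judge
    recent = samples[-min_samples:]
    primary_moved = recent[-1][0] > recent[0][0]
    secondary_stuck = recent[-1][1] == recent[0][1]
    return primary_moved and secondary_stuck and wal_lag(*recent[-1]) > 0
```

A small, steady lag is usually expected under write load. The stall check is aimed at the case where the gap keeps growing because the secondary has stopped applying entries.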
Infrastructure
Disk space alerting on instance
Disk space alerting on audit points
Log monitoring/alerting for warn/errors
Drought/absence alerts on log outages/stoppages
RAM utilization alerting
CPU utilization alerting
Are network configurations/firewall openings understood, documented, and known not to change?
Proper ports opened (ie, UDP if using Consul storage), verified in DR/PR clusters as well
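A drought/absence alert fires when a log that should be written continuously goes quiet. A minimal sketch of that idea follows, assuming the log is a local file whose modification time reflects the last write. The function names and the silence threshold are hypothetical, and most monitoring stacks offer an equivalent built-in check.

```python
import os
import time


def last_write_time(path):
    """Modification time of the log file, or None if it cannot be read."""
    try:
        return os.path.getmtime(path)
    except OSError:
        return None


def is_in_drought(last_event_ts, max_silence_seconds, now=None):
    """True when the log is missing or has been silent longer than allowed."""
    if last_event_ts is None:
        return True  # a missing log is treated as an outage
    now = time.time() if now is None else now
    return (now - last_event_ts) > max_silence_seconds
```

Usage would look like is_in_drought(last_write_time(path_to_audit_log), 300), where the path and the 300 second threshold depend on how busy the cluster normally is.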
Vault Support
Have failure scenarios (ie, node down, storage lost, connectivity outage) been documented and troubleshooting/resolution steps defined?
Is support prepared to support Vault as a mission-critical service?
Who has the ability to seal and unseal Vault?
Is monitoring in place to triage a Vault outage (ie, sealed, down, unresponsive, etc)?
Is a plan in place to review and perform Vault upgrades to remain within 2 major versions (ie, 1.4, 1.5) of current?
Are Vault changes/upgrades first performed in a production-mirrored environment (ie, the test cluster has audit logging, same auth methods, ability to test against real cloud credentials, etc)?
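For outage triage, the sys/health endpoint reports node state through its HTTP status code, so a probe does not need a token. The sketch below maps the default status codes to states. The state labels, the choice of which states page someone, and the function names are hypothetical, and a listener using a private CA would need extra TLS configuration before the probe can connect.

```python
import urllib.error
import urllib.request

# Default status codes returned by sys/health.
HEALTH_STATES = {
    200: "active",
    429: "standby",
    472: "dr-secondary",
    473: "performance-standby",
    501: "uninitialized",
    503: "sealed",
}


def classify_health(status_code):
    """Translate a sys/health status code into a short state label."""
    return HEALTH_STATES.get(status_code, "unknown")


def needs_page(state):
    """Hypothetical paging rule: wake someone for anything that cannot serve requests."""
    return state in {"sealed", "uninitialized", "unknown", "unreachable"}


def probe(base_url, timeout=5):
    """Query a node and return its state label, or "unreachable" on connection failure."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/sys/health", timeout=timeout) as resp:
            return classify_health(resp.status)
    except urllib.error.HTTPError as err:
        # Non-2xx codes (standby, sealed, ...) arrive as HTTPError.
        return classify_health(err.code)
    except (urllib.error.URLError, OSError):
        return "unreachable"
```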
Development
Are internal controls in place to log/audit/control policy and configuration changes to Vault?
Is developer code reviewed/audited to ensure policy least privilege is used?
Is Vault config/paths/policy/etc managed by the Terraform Vault provider?
Ongoing Care
When certificates for the Vault listener, Consul agent, etc are created, is something in place to know when they expire and to replace them proactively?
Is this process durable (ie, not on a team member's calendar)?
Are Vault system and audit logs monitored, with alerts when errors/warnings occur?
Are performance replication clusters monitored for connectivity/consistency/replication (ie, WAL checks)?
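Certificate expiry is easy to automate so that it does not depend on anyone's calendar. The following sketch uses only the Python standard library. The 30 day warning window and the function names are hypothetical, and fetch_not_after assumes the listener's certificate validates against the given CA file (or the system trust store).

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter date, e.g. "Jan  5 09:34:43 2018 GMT"."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def needs_renewal(not_after, warn_days=30, now=None):
    """True when the certificate expires within warn_days (or has already expired)."""
    return days_until_expiry(not_after, now) <= warn_days


def fetch_not_after(host, port=8200, cafile=None, timeout=5):
    """Read notAfter from a live TLS listener; the certificate must validate."""
    context = ssl.create_default_context(cafile=cafile)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Running needs_renewal(fetch_not_after(host)) from a scheduled job, with the result sent to the team's alerting channel, is one way to make the process durable.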
Business Continuity
Are disaster recovery (DR) operations documented and tested?
Are normal and on-call operations teams trained/tested on performing a DR cutover at least quarterly?
Are devops/dev-team/app-teams aware of how to switch their applications to the DR cluster?
Are DR operation tokens protected/removed when not in use/needed?
Is Consul / Integrated Storage auto-snapshot enabled with a realistic RPO timing?
Are there alerts if snapshots fail?
Are snapshots backed up to offline/non-local storage?
Are snapshots verified to be valid?
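The snapshot questions above can be partly covered by a freshness check against the RPO. This is a rough sketch under the assumption that snapshots land as files in a single directory. The function names are hypothetical, and the check only catches missing, stale, or empty files. It does not prove a snapshot can be restored, which still needs a real restore test.

```python
import os


def scan_snapshots(directory):
    """List (name, modified_time, size_bytes) for each file in a snapshot directory."""
    with os.scandir(directory) as entries:
        return [
            (e.name, e.stat().st_mtime, e.stat().st_size)
            for e in entries
            if e.is_file()
        ]


def snapshot_findings(snapshots, rpo_seconds, now):
    """Return a list of human-readable problems; an empty list means no findings."""
    if not snapshots:
        return ["no snapshots found"]
    findings = []
    newest = max(snapshots, key=lambda s: s[1])
    age = now - newest[1]
    if age > rpo_seconds:
        findings.append(
            f"newest snapshot {newest[0]} is {int(age)}s old, RPO is {rpo_seconds}s"
        )
    for name, _mtime, size in snapshots:
        if size == 0:
            findings.append(f"snapshot {name} is empty")
    return findings
```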
Policy and Configuration
Are policies documented, preferably managed via code, and audited occasionally?
Are policies least-privilege?
Are policies tested in a negative fashion, ie, verifying that a token with this policy can do what the policy allows but is also denied any access beyond what was intended?
Is use of “sudo” in capabilities understood and checked for?
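Checking for "sudo" can be done as part of policy review. The sketch below scans policy text for path stanzas whose capabilities list includes sudo. It is a regex-based approximation, not a real HCL parser, so it assumes simple stanzas without nested braces and will also match commented-out stanzas. The function name is hypothetical.

```python
import re

# Rough pattern: a path stanza followed by its capabilities list.
# Assumption: stanzas contain no nested braces before the capabilities line.
_STANZA = re.compile(
    r'path\s+"([^"]+)"\s*\{[^}]*?capabilities\s*=\s*\[([^\]]*)\]',
    re.DOTALL,
)


def paths_with_sudo(policy_text):
    """Return the policy paths that grant the sudo capability."""
    findings = []
    for path, caps in _STANZA.findall(policy_text):
        capabilities = re.findall(r'"([^"]+)"', caps)
        if "sudo" in capabilities:
            findings.append(path)
    return findings
```

Each path returned is a candidate for review. Some sudo grants are legitimate for operators, so the output is a prompt for a human check and not a failure by itself.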