Skip to content

Instantly share code, notes, and snippets.

@selenamarie
Last active March 15, 2019 05:12
  • Star 21 You must be signed in to star a gist
  • Fork 3 You must be signed in to fork a gist
Star You must be signed in to star a gist
Save selenamarie/8724731 to your computer and use it in GitHub Desktop.
An Ideal Postgres Environment

Ideal Postgres environment

Documentation

  • Documented replication topology
  • Documented network topology
  • Documented interface topology - including users, passwords, connection estimates, load balancers, connection proxies
  • Documented procedure, schedule for failover and testing
  • Documented procedure, schedule for disaster recovery and testing
  • Documented procedure, schedule for maintenance, upgrades
  • Documented procedure, schedule for data expiration
  • Docuemnted procedure, schedule for backups and testing
  • Documented schedule for system upgrades

Automation

  • Automated maintenance
  • Automated disaster recovery testing
  • Automated backup testing
  • Automated stage environment setup
  • Automated data expiration
  • Automated failover*
  • Automated user management
  • Configuration change management

Monitoring

  • Monitoring of key Postgres performance indicators: query duration, query counts, IO utilization, cache hits, hot tables, locks, cancelled queries, vacuum frequency and duration, large shared buffer "churn" events, size of tables, size of database, bloat, tables with high sequential scan counts (indexing targets), network throughput
  • Monitoring of key application performance indicators: query plans that scan a high % of partitioned tables, duration of stored procedures called regularly
  • System monitoring: CPU, memory, IO, swap, DB connections

Configuration

  • Automated pg_hba.conf and user management through configuration management tool
  • postgresql.conf managed through configuration management tool
  • recovery.conf maintained with configuration management tool

Operations

  • n+1 topology for replication/failover
  • reporting system on 1 hr replication delay (to allow for extended BI queries)
  • PITR archive for recovery from operator/developer error
  • Backup testing completed weekly
  • Failver, Disaster recovery testing completed once per quarter
  • Uptime target for all systems defined and agreed to by users

Re: automation of failover -- Can be varying levels of automation depending on the environment. Most important bit is once the system or operator has decided failover is necessary, next steps should be automatic to avoid errors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment