public
Last active

An Ideal Postgres Environment

  • Download Gist
prod_postgres.md
Markdown

Ideal Postgres environment

Documentation

  • Documented replication topology
  • Documented network topology
  • Documented interface topology - including users, passwords, connection estimates, load balancers, connection proxies
  • Documented procedure, schedule for failover and testing
  • Documented procedure, schedule for disaster recovery and testing
  • Documented procedure, schedule for maintenance, upgrades
  • Documented procedure, schedule for data expiration
  • Docuemnted procedure, schedule for backups and testing
  • Documented schedule for system upgrades

Automation

  • Automated maintenance
  • Automated disaster recovery testing
  • Automated backup testing
  • Automated stage environment setup
  • Automated data expiration
  • Automated failover*
  • Automated user management
  • Configuration change management

Monitoring

  • Monitoring of key Postgres performance indicators: query duration, query counts, IO utilization, cache hits, hot tables, locks, cancelled queries, vacuum frequency and duration, large shared buffer "churn" events, size of tables, size of database, bloat, tables with high sequential scan counts (indexing targets), network throughput
  • Monitoring of key application performance indicators: query plans that scan a high % of partitioned tables, duration of stored procedures called regularly
  • System monitoring: CPU, memory, IO, swap, DB connections

Configuration

  • Automated pg_hba.conf and user management through configuration management tool
  • postgresql.conf managed through configuration management tool
  • recovery.conf maintained with configuration management tool

Operations

  • n+1 topology for replication/failover
  • reporting system on 1 hr replication delay (to allow for extended BI queries)
  • PITR archive for recovery from operator/developer error
  • Backup testing completed weekly
  • Failver, Disaster recovery testing completed once per quarter
  • Uptime target for all systems defined and agreed to by users

Re: automation of failover -- Can be varying levels of automation depending on the environment. Most important bit is once the system or operator has decided failover is necessary, next steps should be automatic to avoid errors.

Please sign in to comment on this gist.

Something went wrong with that request. Please try again.