Skip to content

Instantly share code, notes, and snippets.

@ssrihari
Last active April 15, 2024 04:46
Show Gist options
  • Save ssrihari/6443dbc0d4da4a91ba48 to your computer and use it in GitHub Desktop.
Save ssrihari/6443dbc0d4da4a91ba48 to your computer and use it in GitHub Desktop.
Postgres clusters and their nuances

Postgres clusters and their nuances

Why

  • Do you even need a cluster of postgreses? Why did we choose to create a cluster ourselves?
  • Why didn’t RDS work for us? Why it might not work for you.
  • What are other RDS like services out there?

Postgres side of things:

  • What you need to know within postgres so your data can be streamed down to multiple nodes quick, and efficiently.

Hardware

  • Can you do without an SSD? Are writing fast enough? How to find out?
  • Will your application’s I/O performance degrade when you’re writing WALs?
  • How fast do you need to ship WALs to another machine? Network I/O nitigrities.

WAL (Write Ahead Logging)

  • What is it? How does it work?
  • How to configure it for a cluster - hot_standby, wal_keep_segments, etc.

WAL Archives

  • Why do you need it when you have streaming replication?
  • How do you set the archive and streaming delays?

How does this impact normal workflows?

  • Large deletes or updates can blot WALs
  • Read-only standbys, large queries and timeouts

Repmgr side of things:

  • What does this magic replication manager do? What I learnt by digging into it’s C code.

What does a cluster of postgres machines look like?

  • What are ‘master’, ‘standbys’, ‘witnesses’, and ‘failed’ nodes
  • Cascading, and Synchronous replication
  • priorities, configurations

Into the guts of ‘how’ repmgr works

  • The consensus algorithm it uses
  • repl_nodes - a table of all the machines in the cluster, their configs, statuses
  • repl_status - how far is each machine behind the master?
  • repl_events - what’s happening in the cluster along a timeline

The things repmgr does for your cluster

  • well, it exists. not a single open tool (save pgpool) out there that’s as good.
  • help setup, configure the relationships between postgres boxes
  • ‘Clone’, ‘Register’, ‘Follow’, ‘Promote’ commands
  • monitor for failure, elect a master, automate failover
  • How to tap into the promote and failover scenarios?

The things repmgr doesn’t

  • It doesn’t talk to your application. it doesn’t act as a load balancer. or a single entry point to your cluster.
  • Tell you when a failover happens. You need to rig that up yourself.
  • Give you extensive support or documentation. But, you have the source.

Failovers

How things can go wrong

  • Multi-master / split brain / STONITH
  • No master
  • Network disconnect, Disk full

What happens when a failover happens?

  • How long does it take? How do you know which is the new master?
  • How do we recover a failed DB? How long does it take?
  • How do we add a new machine to the cluster?
  • How do we remove stale node entries?

When things have gone wrong

  • What measures to take to ensure you’re equipped for bad scenarios.
  • How do you save data when you know you’re going to lose it?
  • Deletes/Truncates cascade. How do you free disk space?
  • When is your free disk space critically low?
  • How to recover from WAL archives? Logical backups? Filesystem backups?
  • Which backup/restore mechanism will work best for you?

Wrapping up the big picture

  • What does my application do when there’s a failover?
  • How do the db-connections move to the new master?
  • How can the application, and repmgr work together for maximum robustness?
  • Alerting, Monitoring, Maintenance
Display the source blob
Display the rendered blob
Raw
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment