postgres conf - tuning notes

269 tunable settings, these 20 are most helpful

max_connections

  • set to useful peak
  • above 200, look at conn pooler
  • consider superuser_reserved_connections

shared_buffers - shared RAM

  • amt of memory pg takes for itself
  • generally, avail RAM ÷ 4
  • lower for write heavy, higher for read heavy
  • larger than 8 G often not useful (the pg_buffercache extension shows what is happening in the buffer cache)
  • can go over 8 G if lots of tables, lots of indexes; depends on workload
  • over 8 G not harmful

work_mem - non-shared RAM, used by query operations, including sorts and hashes, before spilling to disk

  • if over limit, just writes to disk
  • complex queries can have multiple operations
  • set to somewhere between:
    • avail RAM ÷ max connections
    • avail RAM ÷ (2 × max connections)
  • highly dependent on workload
    • 1-8 M for lots of conns with simple workloads
    • 64-128 M for fewer conns and/or complex workloads
  • must be power of 2
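
work_mem can also be raised per session for a single heavy query; a minimal sketch (value illustrative):

SET work_mem = '256MB';   -- only affects this session, e.g. before a big sort/hash-heavy report
-- ... run the heavy query ...
RESET work_mem;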

maintenance_work_mem

  • similar to work mem, but applies to index builds, vacuum, autovacuum
  • set to avail RAM ÷ 8
  • larger than 2 G not useful
  • can temporarily set higher to speed index creation
  • can be set per-session
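
a sketch of bumping it for one index build (table/index names illustrative):

SET maintenance_work_mem = '1GB';   -- session-local, speeds up the sort phase of the build
CREATE INDEX CONCURRENTLY orders_customer_id_idx ON orders (customer_id);
RESET maintenance_work_mem;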

autovacuum_work_mem

  • defaults to maintenance_work_mem if not set
  • set to avail RAM ÷ 16
  • larger than 1G not useful
  • must be power of 2

effective_cache_size

  • estimate of how much RAM the db has access to, directly and indirectly, for potentially cached data
  • doesn't allocate anything, just a hint to the planner
  • if only db on server, set to physical RAM

random_page_cost

  • hint to how fast data can be read from disk
  • value of 1 is equiv to reading from RAM
  • SSD = 1.2, SAN = 1.2, HDD = 4

wal_level

  • set to lowest val required
  • setting to higher means more data written to disk

wal_buffers

  • amt of shared mem used for WAL data that has not yet been written to disk
  • default of -1 bases size on shared_buffers, default is often fine
  • set to 128 M on servers with more than 8 cores with high workload

commit_delay

  • how long to wait, in microseconds, before a WAL flush is initiated
  • on servers with high write load, this can help improve perf by grouping writes of multiple txns together
  • if commit_delay hits its limit it flushes; if wal_buffers hits its limit, it flushes
  • on servers with low write load, this can impede performance
  • default is no delay
  • leave at default

min_wal_size

  • how much space to allocate to store WAL files
  • default is 80 M - generally too small, set to between 1-8 G
  • otherwise pg needs to continually grow the WAL files

max_wal_size

  • soft limit
  • must be large enough for a checkpoint
  • default is 1G
  • setting too high can increase recovery time after a crash
  • can be set to 2 x min_wal_size

checkpoint_timeout

  • max time between automated checkpoints
  • default is 5 mins
  • useful range is 30mins - 1 hr
  • replay is about 20% of this time after a crash
  • setting too low will force checkpoints often, warnings
  • setting too high will increase crash recovery time
  • for long ETL batches, increase it to be longer than the query run time

checkpoint_completion_target

  • fraction of checkpoint_timeout within which pg tries to finish writing data out to disk
  • default is 0.5 - 50% of checkpoint_timeout
  • useful values are 0.8 - 0.9
  • high values can cause locks if lots of data remains to be written near the end of a checkpoint
  • low values can degrade performance
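
a hedged sketch of the WAL/checkpoint settings above via ALTER SYSTEM (values illustrative, workload dependent):

ALTER SYSTEM SET min_wal_size = '2GB';
ALTER SYSTEM SET max_wal_size = '4GB';
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
SELECT pg_reload_conf();   -- these settings are reloadable, no restart needed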

checkpoint_warning

  • logs a warning if checkpoints caused by filling WAL segments happen closer together than this many seconds (default 30 s)

max_worker_processes

  • max concurrent worker processes
  • set to no more than number of cores

max_parallel_workers_per_gather

  • set to 4
  • max number of parallel processes per Gather node in the executor

max_parallel_workers

  • max number of parallel workers running at any point
  • set to number of cores
  • only from pg10
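
a sketch for an 8-core box (values illustrative); max_worker_processes needs a restart, the others just a reload:

ALTER SYSTEM SET max_worker_processes = 8;
ALTER SYSTEM SET max_parallel_workers = 8;
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;
SELECT pg_reload_conf();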

effective_io_concurrency

  • gives the query planner a hint about the number of I/O requests the storage system can effectively handle at the same time
  • HDD should be low
  • SAN should be higher
  • SSD should be in the hundreds
  • similar to random_page_cost

default_statistics_target

  • during an ANALYZE operation, how large a sample is taken
  • default is 100, can be set up to 1000; higher means ANALYZE takes longer to run
  • default can be too low for larger data sets
  • setting too high slows ANALYZE down
  • if the planner is misestimating a table, this can help
  • can ANALYZE a temp table manually, may be helpful depending on size - different from EXPLAIN ANALYZE
  • ANALYZE runs periodically via autovacuum workers; the stats are used by the query planner
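
instead of raising it globally, the target can be raised for one skewed column; a sketch (table/column names illustrative):

ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;   -- re-sample so the planner sees the new stats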

do not change

fsync

  • always leave on, never turn off

full_page_writes

  • leave on

synchronous_commit

  • always leave on
  • set to off only if you don't mind losing recent transactions from the WAL after a crash

query planner config

  • just leave them all on, e.g. enable_bitmapscan
  • can be tuned per session

less common and helpful settings

idle_in_transaction_session_timeout

  • kills sessions whose transaction is left open without calling COMMIT or ROLLBACK
  • such sessions prevent vacuum, maintenance, index creation
  • default disabled
  • available since pg 9.6
  • if a session gets into this state, may need to abandon the connection or issue a manual rollback

statement_timeout

  • default disabled
  • only set in current session, not cluster-wide
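
both timeouts can be set per session; a sketch (values illustrative):

SET statement_timeout = '30s';                       -- cancel statements running longer than this
SET idle_in_transaction_session_timeout = '5min';    -- kill sessions idle in an open txn (pg 9.6+)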

temp_file_limit

  • stops a bad query that exceeds work_mem and starts writing temp files to disk (e.g. for a big JOIN) from using too much disk
  • off by default

vacuum

  • reclaims space taken up by dead tuples
  • as rows are updated/deleted, they are marked dead
  • MVCC means txns might still have a view of them
  • VACUUM's job is to flag those blocks for reuse and reclaim
  • VACUUM only releases disk space if an entire 1 G file is empty, otherwise it just marks the space reusable
  • causes lots of I/O requests
  • VACUUM FULL copies all data into new files, reclaims disk from dead tuples, at the expense of locking the table while doing so
  • VACUUM FULL requires as much disk as the current files, plus the actual data size for the new files
  • pg_repack does something similar, but without the table lock, only a smaller lock right at the end of the operation; takes a bit longer
  • only really useful if you need to reclaim space. if a table grows and shrinks, maybe just tweak analyze targets on it
  • tables may grow quickly at the beginning, then taper off as inserts start reusing unused space
  • after VACUUM FULL a table will often appear to be growing more quickly
  • VACUUM ANALYZE runs a normal vacuum followed by an ANALYZE; ANALYZE can be run on its own. if doing it manually, best to just do VACUUM ANALYZE
  • VACUUM ANALYZE VERBOSE - shows where it's stuck
  • VACUUM updates the free space map (FSM) per table. the FSM is used when writing data
  • AUTOVACUUM runs in the bg, automatically on tables that need it
  • by default overly conservative with locks held - tries to avoid locks that would impact performance
  • can be specified cluster-wide, or per table

autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor

  • min number of rows that need to have changed
  • don't set too high, or tables get skipped
  • don't set too low, or tables get vacuumed too often
  • can set on a per-table basis (sketch after this section); use smaller scale factors for larger tables:
  • large: 0.001, medium: 0.1, small: 0.2
  • if the majority of tables are small, set cluster-wide settings for small tables and per-table settings for the bigger tables (and vice versa)
  • always remember to check values for new tables, and as tables grow
  • keep checking the formula to see how often it will run
  • formula: threshold + (scale_factor × num tuples)
  • tuning fill_factor can help for bulk imports

autovacuum_max_workers

  • default is 3, way too low
  • consider the number of tables actively updated and deleted
  • takes from superuser_reserved_connections
  • too low means autovacuum never keeps up
  • too high means additional I/O overhead

autovacuum_naptime

  • time to wait before the autovacuum launcher checks whether it needs to start a process
  • default is okay
  • rather tweak threshold and scale factor

autovacuum_cost_limit

  • keeps track of the accumulated cost of running vacuum
  • if it exceeds the limit, it will wait
  • default is fine
  • hard to set a good value

autovacuum_analyze_threshold

  • may want analyze to run more often; make sure it's less than the vacuum threshold so it runs more often
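
a per-table sketch tying the formula together (table name and row count are illustrative):

ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.001,
  autovacuum_vacuum_threshold = 1000
);
-- with ~10M rows, autovacuum kicks in after 1000 + 0.001 * 10,000,000 = 11,000 dead tuples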

autovacuum_cost_delay

  • if cost exceeded, how long to sleep
  • default is okay
  • can be set per table

vacuum stats

  • pg_stat_all_tables / pg_stat_user_tables
  • show when the last manual or automatic vacuum ran on a table
  • as well as when the last manual or automatic analyze ran

pg_stat_progress_vacuum

  • shows stats on each vacuum process currently running, only pg 9.6+
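
a quick query against pg_stat_user_tables to see vacuum/analyze recency and dead tuples:

SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;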

query optimization

two query planner modes

  • regular query planner (12 or fewer joins), does exhaustive search
  • genetic query optimizer (GEQO, > 12 joins)
    • heuristics
    • geqo_* parameters

alternates

  • COPY instead of INSERT (sketch after this list)
  • TRUNCATE instead of DELETE
  • CURSOR instead of LIMIT/OFFSET
  • INTERSECT instead of AND
  • UNION instead of OR
  • pg_trgm instead of LIKE
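
a COPY sketch for bulk loads (table/file names illustrative; server-side COPY needs file access on the server, psql's \copy works client-side):

COPY measurements (ts, sensor_id, value)
FROM '/tmp/measurements.csv'
WITH (FORMAT csv, HEADER true);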

TOAST

  • vacuum slow? could be TOAST
  • max block (therefore row) size is 8kb
  • how to insert data larger? compressed and/or broken into 2kb chunks
  • original table has OID entry to TOAST table
  • slow to access, select * could be slower if TOAST entries

storage strategies

  • PLAIN - no TOAST or compression
  • MAIN - uses compression, discourages use of TOAST
  • EXTERNAL - uses TOAST but no compression - can be preferable if doing substring ops/queries
  • EXTENDED - compresses first, then uses TOAST
  • default is EXTENDED for most toast-able columns
  • to change use ALTER TABLE ... SET STORAGE
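
a sketch of changing a column's strategy (table/column names illustrative), e.g. to skip compression when substring access matters:

ALTER TABLE documents ALTER COLUMN body SET STORAGE EXTERNAL;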

table bloat

  • lots of updates or deletes, no vacuum, or vacuum has run but no new rows yet
  • can be inefficient to scan over lots of empty blocks

partitioning

  • try not to exceed 400 partitions
  • only actively written partitions need to be vacuumed, old partitions don't
  • every query needs to reference the key that determines the partition, otherwise it just scans everything
  • pg 10 added declarative partitioning; pg 11 and 12 automate more of it (default partitions, better pruning)
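
a minimal declarative range-partitioning sketch (pg 10+, names illustrative):

CREATE TABLE events (
    id bigint,
    created_at timestamptz NOT NULL,
    payload jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
-- queries must filter on created_at to get partition pruning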

indexes

  • unique indexes can have a WHERE clause, unique constraints cannot (sketch after this list)
  • don't create both, functionally identical
  • unique constraints have better error messages than unique indexes
  • if you have overlapping single-col and multi-col indexes, pg will prefer the single-col one just because it's smaller/faster, even if performance is the same
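
a partial unique index sketch (names illustrative) - enforces uniqueness only on live rows, which a UNIQUE constraint cannot do:

CREATE UNIQUE INDEX users_email_active_uniq
    ON users (email)
    WHERE deleted_at IS NULL;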

over indexing

  • pg_stat_user_indexes: look at idx_scan, idx_tup_read and idx_tup_fetch
  • try to combine indexes where possible

index bloat

  • frequently deleting most but not all index entries causes bloat
  • may need REINDEX if you delete a large part of your table
  • REINDEX cannot be done concurrently (before pg 12); instead use CREATE INDEX CONCURRENTLY, DROP INDEX, ALTER INDEX ... RENAME

Operating System parameters

  • before 9.3 pg used Sys V shared memory, since 9.3 mostly POSIX
  • kernel.shmmax: max size of a Sys V shared memory segment, less important to set after 9.3
  • sysctl kernel.shmmax should be about the amount of RAM you have
  • if too low, set it and put it into /etc/sysctl.d/postgresql.conf
  • kernel.shmall: total number of 4 kB shared memory pages, again less critical since 9.3
    • divide shmmax by the 4 kB page size
  • if using systemd, RemoveIPC: shared memory and semaphores are removed when the user fully logs out; probably only a problem if you installed from source
  • resource limits: number of processes, open files per user / per process, amount of RAM per process
  • cat /proc/$(pgrep -o postgres)/limits
  • set in /etc/security/limits.conf
  • one process per connection, also processes for WAL worker, autovacuum
  • max_files_per_process
  • pooling - can't do LISTEN/NOTIFY with a pooler
  • default virtual memory not optimal for pg
  • kernel default memory overcommit setting can cause the system to run out of virtual RAM and terminate the postmaster process with OOM
  • if pg is the only thing on the server, not such a problem - if total RAM is low, increase swap
  • the OOM killer is only invoked when physical memory and swap are exhausted
  • lower config params like shared_buffers and work_mem
  • lower the number of concurrent connections (max_connections) and use a conn pool
  • on linux 2.6+ you can prevent overcommit, which lowers the risk of the OOM killer
  • set vm.overcommit_memory=2
  • on systems without swap, setting vm.overcommit_memory to 2 will probably cause issues
  • can also tune overcommit_ratio as the percentage of RAM available for overcommit
  • default value here is fine
  • if wrangling the OOM killer, you only want to prevent it running on the master process, not the children
  • huge pages can improve perf if supported in your kernel - huge_pages=try
  • to estimate how many, look at the VmPeak value for the postmaster and the system huge page size:
pid=$(head -1 $PGDATA/postmaster.pid)    # postmaster pid
grep ^VmPeak /proc/$pid/status           # peak virtual memory of the postmaster
grep ^Hugepagesize /proc/meminfo         # system huge page size
# divide VmPeak by Hugepagesize to estimate the number of pages (XXX)
sysctl -w vm.nr_hugepages=XXX
  • don't use 0 swappiness; set swappiness to between 5 and 20 - with lots of RAM go higher, 10 is fine. same in a virtualized environment
  • vm.dirty_background_ratio is the percentage of memory that can fill with dirty pages before they must be flushed
  • range is 0-100
  • values lower than 5 may not work
  • default is 10 = 1.6 G on a server with 16 G RAM
  • write-intensive ops may see improvement with a lower ratio so that linux flushes in the background
  • no big difference on read-heavy workloads
  • there are also foreground versions of these configs, but flushing is then done in the foreground, which blocks the application
  • in general the foreground value should be higher than bg; you want to encourage bg flushing over fg
  • vm.dirty_expire_centisecs: setting too high can cause longer IO pauses later
  • sysctl -w vm.dirty_background_ratio / vm.dirty_ratio = 10 is a good place to start
  • linux has different IO schedulers; the default CFQ is generally fine, can also try Deadline, which may be worse for interactive workloads
  • EXT fs mount options: relatime vs noatime - the latter means no need to track access times
  • data=ordered vs data=writeback: the latter offers better perf at risk of data corruption. writeback writes the journal entry first, then the data, which can cause corruption in case of a crash. DON'T USE IT
  • barrier=0 only if using RAID with battery backup, otherwise you risk data corruption

server resources

  • put WAL on separate HDD, can be spindle disk
  • journaled fs for WAL
  • use tablespaces to tier storage
  • smaller tables will be cached
  • larger tables accessed randomly benefit from faster drives
  • CPU - cores matter, speed is slightly less important, high IO will cause CPU wait time
  • parallel processing >= pg 9.6

stats_temp_directory

  • does not need to be persisted, can be a RAM disk: mount a new partition, set stats_temp_directory in conf

statistics and logging

  • stats collector
  • counts accesses to tables, to indexes, total (estimated) number of rows
  • vacuum and analyze actions on tables
  • counts calls to user-defined functions and timing
  • access from pg_stat_* views
  • track_activities should be on, by default
  • track_activity_query_size max length for tracking a command, default 1024, increase if long queries
  • track_counts leave on
  • track_io_timing monitors block read/write times, affects perf
  • track_functions may affect performance
  • pg_stat_activity dynamic view of commands currently running
  • pg_stat_database - commit, rollback counts, num deadlocks, can show hit ratio for buffer cache
  • pg_stat_user_tables seq scans, rows returned, index scans, rows inserted,
  • pg_stat_user_indexes whether indexes are being used
  • pg_statio_user_tables raw details on IO if you turned on IO tracking
  • pg_statio_user_indexes ditto, disk blocks read from index
  • pg_statio_user_sequences ditto
  • pg_stat_user_functions
  • pg_stat_reset gives a clean view after you make changes
  • pg_stat_bgwriter shows num checkpoints run, forced checkpoints (want these to be as low as possible wrt normal checkpoints)
  • pg_stat_statements execution stats: num calls, total/min/max time, total rows retrieved; blocks hit/read/dirtied/written; merges similar queries (example query after this list)
    • some overhead but probably worth it
  • pgstattuple size of table; num live tuples; total size of live tuples; num dead tuples; total free space
    • very little perf impact
  • pg_buffercache what is happening in shared buffer cache
  • pg_prewarm load data into either the OS or postgres buffer cache at startup
  • pg_view basically queries pg_stat_activity
  • check_postgres nagios plugin
    • can be done without nagios too
  • pgBadger - analyzes log files
    • similar to pg_stat_statements, helps find queries that should have an index
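
a sketch of pulling the worst queries from pg_stat_statements (the extension must be installed; the time column is total_time up to pg 12, total_exec_time from pg 13):

SELECT query, calls, total_time, rows
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;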

logging

  • logging_collector should always be on
  • log_rotation_age log_rotation_size
  • log_statement - off, ddl, mod, all
  • log_lock_waits when longer than deadlock_timeout
  • pgconfigurator.cybertec.at
  • pgtune.leopard.in.ua
  • github.com/jberkus/annotated.conf