postgres conf - tuning notes

269 tunable settings, these 20 are most helpful

max_connections

  • set to useful peak
  • above 200, look at conn pooler
  • consider superuser_reserved_connections

shared_buffers - shared RAM

  • amt of memory pg takes for itself
  • generally, avail RAM ÷ 4
  • lower for write heavy, higher for read heavy
  • larger than 8 G often not useful (the pg_buffercache extension shows what is happening in the buffer cache)
  • can go over 8 G if lots of tables, lots of indexes; depends on workload
  • over 8 G not harmful

work_mem - non-shared RAM, used by query operations, including sorts and hashes, before spilling to disk

  • if over limit, just writes to disk
  • complex queries can have multiple operations
  • set to somewhere between:
    • avail RAM ÷ max connections
    • avail RAM ÷ (2 × max connections)
  • highly dependent on workload
    • 1-8 M for lots of conns with simple workloads
    • 64-128 M for fewer conns and/or complex workloads
  • must be power of 2
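
work_mem can also be raised per session for a single heavy query; a minimal sketch (value illustrative):

SET work_mem = '256MB';   -- only affects this session, e.g. before a big sort/hash-heavy report
-- ... run the heavy query ...
RESET work_mem;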

maintenance_work_mem

  • similar to work mem, but applies to index builds, vacuum, autovacuum
  • set to avail RAM ÷ 8
  • larger than 2 G not useful
  • can temporarily set higher to speed index creation
  • can be set per-session
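
a sketch of bumping it for one index build (table/index names illustrative):

SET maintenance_work_mem = '1GB';   -- session-local, speeds up the sort phase of the build
CREATE INDEX CONCURRENTLY orders_customer_id_idx ON orders (customer_id);
RESET maintenance_work_mem;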

autovacuum_work_mem

  • defaults to maintenance_work_mem if not set
  • set to avail RAM ÷ 16
  • larger than 1G not useful
  • must be power of 2

effective_cache_size

  • estimate of how much RAM the db has access to, directly and indirectly, for potentially cached data
  • doesn't allocate anything, just a hint to the planner
  • if only db on server, set to physical RAM

random_page_cost

  • hint to how fast data can be read from disk
  • value of 1 is equiv to reading from RAM
  • SSD = 1.2, SAN = 1.2, HDD = 4

wal_level

  • set to lowest val required
  • setting to higher means more data written to disk

wal_buffers

  • amt of shared mem used for WAL data that has not yet been written to disk
  • default of -1 bases size on shared_buffers, default is often fine
  • set to 128 M on servers with more than 8 cores with high workload

commit_delay

  • how long to wait, in microseconds, before a WAL flush is initiated
  • on servers with high write load, this can help improve perf by grouping writes of multiple txns together
  • if commit_delay hits its limit it flushes; if wal_buffers hits its limit, it flushes
  • on servers with low write load, this can impede performance
  • default is no delay
  • leave at default

min_wal_size

  • how much space to allocate to store WAL files
  • default is 80 M - generally too small, set to between 1-8 G
  • otherwise pg needs to continually grow the WAL files

max_wal_size

  • soft limit
  • must be large enough for a checkpoint
  • default is 1G
  • setting too high can increase recovery time after a crash
  • can be set to 2 x min_wal_size

checkpoint_timeout

  • max time between automated checkpoints
  • default is 5 mins
  • useful range is 30mins - 1 hr
  • replay is about 20% of this time after a crash
  • setting too low will force checkpoints often, warnings
  • setting too high will increase crash recovery time
  • for long ETL batches, increase it to be longer than the query run time

checkpoint_completion_target

  • fraction of checkpoint_timeout within which pg tries to finish writing data out to disk
  • default is 0.5 - 50% of checkpoint_timeout
  • useful values are 0.8 - 0.9
  • high values can cause locks if lots of data remains to be written near the end of a checkpoint
  • low values can degrade performance
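
a hedged sketch of the WAL/checkpoint settings above via ALTER SYSTEM (values illustrative, workload dependent):

ALTER SYSTEM SET min_wal_size = '2GB';
ALTER SYSTEM SET max_wal_size = '4GB';
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
SELECT pg_reload_conf();   -- these settings are reloadable, no restart needed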

checkpoint_warning

  • logs a warning if checkpoints caused by filling WAL segments happen closer together than this many seconds (default 30 s)

max_worker_processes

  • max concurrent worker processes
  • set to no more than number of cores

max_parallel_workers_per_gather

  • set to 4
  • max number of parallel processes per Gather node in the executor

max_parallel_workers

  • max number of parallel workers running at any point
  • set to number of cores
  • only from pg10
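
a sketch for an 8-core box (values illustrative); max_worker_processes needs a restart, the others just a reload:

ALTER SYSTEM SET max_worker_processes = 8;
ALTER SYSTEM SET max_parallel_workers = 8;
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;
SELECT pg_reload_conf();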

effective_io_concurrency

  • gives the query planner a hint about the number of I/O requests the storage system can effectively handle at the same time
  • HDD should be low
  • SAN should be higher
  • SSD should be in the hundreds
  • similar to random_page_cost

default_statistics_target

  • during an ANALYZE operation, how large a sample is taken
  • default is 100, can be set up to 1000; higher means ANALYZE takes longer to run
  • default can be too low for larger data sets
  • setting too high slows ANALYZE down
  • if the planner is misestimating a table, this can help
  • can ANALYZE a temp table manually, may be helpful depending on size - different from EXPLAIN ANALYZE
  • ANALYZE runs periodically via autovacuum workers; the stats are used by the query planner
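
instead of raising it globally, the target can be raised for one skewed column; a sketch (table/column names illustrative):

ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;   -- re-sample so the planner sees the new stats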

do not change

fsync

  • always leave on, never turn off

full_page_writes

  • leave on

synchronous_commit

  • always leave on
  • set to off only if you don't mind losing recent transactions from the WAL after a crash

query planner config

  • just leave them all on, e.g. enable_bitmapscan
  • can be tuned per session

less common and helpful settings

idle_in_transaction_session_timeout

  • kills sessions whose transaction is left open without calling COMMIT or ROLLBACK
  • such sessions prevent vacuum, maintenance, index creation
  • default disabled
  • available since pg 9.6
  • if a session gets into this state, may need to abandon the connection or issue a manual rollback

statement_timeout

  • default disabled
  • only set in current session, not cluster-wide
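
both timeouts can be set per session; a sketch (values illustrative):

SET statement_timeout = '30s';                       -- cancel statements running longer than this
SET idle_in_transaction_session_timeout = '5min';    -- kill sessions idle in an open txn (pg 9.6+)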

temp_file_limit

  • stops a bad query that exceeds work_mem and starts writing temp files to disk (e.g. for a big JOIN) from using too much disk
  • off by default

vacuum

  • reclaims space taken up by dead tuples
  • as rows are updated/deleted, they are marked dead
  • MVCC means txns might still have a view of them
  • VACUUM's job is to flag those blocks for reuse and reclaim
  • VACUUM only releases disk space if an entire 1 G file is empty, otherwise it just marks the space reusable
  • causes lots of I/O requests
  • VACUUM FULL copies all data into new files, reclaims disk from dead tuples, at the expense of locking the table while doing so
  • VACUUM FULL requires as much disk as the current files, plus the actual data size for the new files
  • pg_repack does something similar, but without the table lock, only a smaller lock right at the end of the operation; takes a bit longer
  • only really useful if you need to reclaim space. if a table grows and shrinks, maybe just tweak analyze targets on it
  • tables may grow quickly at the beginning, then taper off as inserts start reusing unused space
  • after VACUUM FULL a table will often appear to be growing more quickly
  • VACUUM ANALYZE runs a normal vacuum followed by an ANALYZE; ANALYZE can be run on its own. if doing it manually, best to just do VACUUM ANALYZE
  • VACUUM ANALYZE VERBOSE - shows where it's stuck
  • VACUUM updates the free space map (FSM) per table. the FSM is used when writing data
  • AUTOVACUUM runs in the bg, automatically on tables that need it
  • by default overly conservative with locks held - tries to avoid locks that would impact performance
  • can be specified cluster-wide, or per table

autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor

  • min number of rows that need to have changed
  • don't set too high, or tables get skipped
  • don't set too low, or tables get vacuumed too often
  • can set on a per-table basis (sketch after this section); use smaller scale factors for larger tables:
  • large: 0.001, medium: 0.1, small: 0.2
  • if the majority of tables are small, set cluster-wide settings for small tables and per-table settings for the bigger tables (and vice versa)
  • always remember to check values for new tables, and as tables grow
  • keep checking the formula to see how often it will run
  • formula: threshold + (scale_factor × num tuples)
  • tuning fill_factor can help for bulk imports

autovacuum_max_workers

  • default is 3, way too low
  • consider the number of tables actively updated and deleted
  • takes from superuser_reserved_connections
  • too low means autovacuum never keeps up
  • too high means additional I/O overhead

autovacuum_naptime

  • time to wait before the autovacuum launcher checks whether it needs to start a process
  • default is okay
  • rather tweak threshold and scale factor

autovacuum_cost_limit

  • keeps track of the accumulated cost of running vacuum
  • if it exceeds the limit, it will wait
  • default is fine
  • hard to set a good value

autovacuum_analyze_threshold

  • may want analyze to run more often; make sure it's less than the vacuum threshold so it runs more often
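
a per-table sketch tying the formula together (table name and row count are illustrative):

ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.001,
  autovacuum_vacuum_threshold = 1000
);
-- with ~10M rows, autovacuum kicks in after 1000 + 0.001 * 10,000,000 = 11,000 dead tuples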

autovacuum_cost_delay

  • if cost exceeded, how long to sleep
  • default is okay
  • can be set per table

vacuum stats

  • pg_stat_all_tables / pg_stat_user_tables
  • show when the last manual or automatic vacuum ran on a table
  • as well as when the last manual or automatic analyze ran

pg_stat_progress_vacuum

  • shows stats on each vacuum process currently running, only pg 9.6+
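
a quick query against pg_stat_user_tables to see vacuum/analyze recency and dead tuples:

SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;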

query optimization

two query planner modes

  • regular query planner (12 or fewer joins), does exhaustive search
  • genetic query optimizer (GEQO, > 12 joins)
    • heuristics
    • geqo_* parameters

alternates

  • COPY instead of INSERT (sketch after this list)
  • TRUNCATE instead of DELETE
  • CURSOR instead of LIMIT/OFFSET
  • INTERSECT instead of AND
  • UNION instead of OR
  • pg_trgm instead of LIKE
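
a COPY sketch for bulk loads (table/file names illustrative; server-side COPY needs file access on the server, psql's \copy works client-side):

COPY measurements (ts, sensor_id, value)
FROM '/tmp/measurements.csv'
WITH (FORMAT csv, HEADER true);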

TOAST

  • vacuum slow? could be TOAST
  • max block (therefore row) size is 8kb
  • how to insert data larger? compressed and/or broken into 2kb chunks
  • original table has OID entry to TOAST table
  • slow to access, select * could be slower if TOAST entries

storage strategies

  • PLAIN - no TOAST or compression
  • MAIN - uses compression, discourages use of TOAST
  • EXTERNAL - uses TOAST but no compression - can be preferable if doing substring ops/queries
  • EXTENDED - compresses first, then uses TOAST
  • default is EXTENDED for most toast-able columns
  • to change use ALTER TABLE ... SET STORAGE
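
a sketch of changing a column's strategy (table/column names illustrative), e.g. to skip compression when substring access matters:

ALTER TABLE documents ALTER COLUMN body SET STORAGE EXTERNAL;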

table bloat

  • lots of updates or deletes, no vacuum, or vacuum has run but no new rows yet
  • can be inefficient to scan over lots of empty blocks

partitioning

  • try not to exceed 400 partitions
  • only actively written partitions need to be vacuumed, old partitions don't
  • every query needs to reference the key that determines the partition, otherwise it just scans everything
  • pg 10 added declarative partitioning; pg 11 and 12 automate more of it (default partitions, better pruning)
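
a minimal declarative range-partitioning sketch (pg 10+, names illustrative):

CREATE TABLE events (
    id bigint,
    created_at timestamptz NOT NULL,
    payload jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
-- queries must filter on created_at to get partition pruning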

indexes

  • unique indexes can have a WHERE clause, unique constraints cannot (sketch after this list)
  • don't create both, functionally identical
  • unique constraints have better error messages than unique indexes
  • if you have overlapping single-col and multi-col indexes, pg will prefer the single-col one just because it's smaller/faster, even if performance is the same
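
a partial unique index sketch (names illustrative) - enforces uniqueness only on live rows, which a UNIQUE constraint cannot do:

CREATE UNIQUE INDEX users_email_active_uniq
    ON users (email)
    WHERE deleted_at IS NULL;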

over indexing

  • pg_stat_user_indexes: look at idx_scan, idx_tup_read and idx_tup_fetch
  • try to combine indexes where possible

index bloat

  • frequently deleting most but not all index entries causes bloat
  • may need REINDEX if you delete a large part of your table
  • REINDEX cannot be done concurrently (before pg 12); instead use CREATE INDEX CONCURRENTLY, DROP INDEX, ALTER INDEX ... RENAME

Operating System parameters

  • before 9.3 pg used Sys V shared memory, since 9.3 mostly POSIX
  • kernel.shmmax: max size of a Sys V shared memory segment, less important to set after 9.3
  • sysctl kernel.shmmax should be about the amount of RAM you have
  • if too low, set it and put it into /etc/sysctl.d/postgresql.conf
  • kernel.shmall: total number of 4 kB shared memory pages, again less critical since 9.3
    • divide shmmax by the 4 kB page size
  • if using systemd, RemoveIPC: shared memory and semaphores are removed when the user fully logs out; probably only a problem if you installed from source
  • resource limits: number of processes, open files per user / per process, amount of RAM per process
  • cat /proc/$(pgrep -o postgres)/limits
  • set in /etc/security/limits.conf
  • one process per connection, also processes for WAL worker, autovacuum
  • max_files_per_process
  • pooling - can't do LISTEN/NOTIFY with a pooler
  • default virtual memory not optimal for pg
  • kernel default memory overcommit setting can cause the system to run out of virtual RAM and terminate the postmaster process with OOM
  • if pg is the only thing on the server, not such a problem - if total RAM is low, increase swap
  • the OOM killer is only invoked when physical memory and swap are exhausted
  • lower config params like shared_buffers and work_mem
  • lower the number of concurrent connections (max_connections) and use a conn pool
  • on linux 2.6+ you can prevent overcommit, which lowers the risk of the OOM killer
  • set vm.overcommit_memory=2
  • on systems without swap, setting vm.overcommit_memory to 2 will probably cause issues
  • can also tune overcommit_ratio as the percentage of RAM available for overcommit
  • default value here is fine
  • if wrangling the OOM killer, you only want to prevent it running on the master process, not the children
  • huge pages can improve perf if supported in your kernel - huge_pages=try
  • to estimate how many, look at the VmPeak value for the postmaster and the system huge page size:
pid=$(head -1 $PGDATA/postmaster.pid)    # postmaster pid
grep ^VmPeak /proc/$pid/status           # peak virtual memory of the postmaster
grep ^Hugepagesize /proc/meminfo         # system huge page size
# divide VmPeak by Hugepagesize to estimate the number of pages (XXX)
sysctl -w vm.nr_hugepages=XXX
  • don't use 0 swappiness; set swappiness to between 5 and 20 - with lots of RAM go higher, 10 is fine. same in a virtualized environment
  • vm.dirty_background_ratio is the percentage of memory that can fill with dirty pages before they must be flushed
  • range is 0-100
  • values lower than 5 may not work
  • default is 10 = 1.6 G on a server with 16 G RAM
  • write-intensive ops may see improvement with a lower ratio so that linux flushes in the background
  • no big difference on read-heavy workloads
  • there are also foreground versions of these configs, but flushing is then done in the foreground, which blocks the application
  • in general the foreground value should be higher than bg; you want to encourage bg flushing over fg
  • vm.dirty_expire_centisecs: setting too high can cause longer IO pauses later
  • sysctl -w vm.dirty_background_ratio / vm.dirty_ratio = 10 is a good place to start
  • linux has different IO schedulers; the default CFQ is generally fine, can also try Deadline, which may be worse for interactive workloads
  • EXT fs mount options: relatime vs noatime - the latter means no need to track access times
  • data=ordered vs data=writeback: the latter offers better perf at risk of data corruption. writeback writes the journal entry first, then the data, which can cause corruption in case of a crash. DON'T USE IT
  • barrier=0 only if using RAID with battery backup, otherwise you risk data corruption

server resources

  • put WAL on separate HDD, can be spindle disk
  • journaled fs for WAL
  • use tablespaces to tier storage
  • smaller tables will be cached
  • larger tables accessed randomly benefit from faster drives
  • CPU - cores matter, speed is slightly less important, high IO will cause CPU wait time
  • parallel processing >= pg 9.6

stats_temp_directory

  • does not need to be persisted, can be a RAM disk: mount a new partition, set stats_temp_directory in conf

statistics and logging

  • stats collector
  • counts accesses to tables, to indexes, total (estimated) number of rows
  • vacuum and analyze actions on tables
  • counts calls to user-defined functions and timing
  • access from pg_stat_* views
  • track_activities should be on, by default
  • track_activity_query_size max length for tracking a command, default 1024, increase if long queries
  • track_counts leave on
  • track_io_timing monitors block read/write times, affects perf
  • track_functions may affect performance
  • pg_stat_activity dynamic view of commands currently running
  • pg_stat_database - commit, rollback counts, num deadlocks, can show hit ratio for buffer cache
  • pg_stat_user_tables seq scans, rows returned, index scans, rows inserted,
  • pg_stat_user_indexes whether indexes are being used
  • pg_statio_user_tables raw details on IO if you turned on IO tracking
  • pg_statio_user_indexes ditto, disk blocks read from index
  • pg_statio_user_sequences ditto
  • pg_stat_user_functions
  • pg_stat_reset gives a clean view after you make changes
  • pg_stat_bgwriter shows num checkpoints run, forced checkpoints (want these to be as low as possible wrt normal checkpoints)
  • pg_stat_statements execution stats: num calls, total/min/max time, total rows retrieved; blocks hit/read/dirtied/written; merges similar queries (example query after this list)
    • some overhead but probably worth it
  • pgstattuple size of table; num live tuples; total size of live tuples; num dead tuples; total free space
    • very little perf impact
  • pg_buffercache what is happening in shared buffer cache
  • pg_prewarm load data into either the OS or postgres buffer cache at startup
  • pg_view basically queries pg_stat_activity
  • check_postgres nagios plugin
    • can be done without nagios too
  • pgBadger - analyzes log files
    • similar to pg_stat_statements, helps find queries that should have an index
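
a sketch of pulling the worst queries from pg_stat_statements (the extension must be installed; the time column is total_time up to pg 12, total_exec_time from pg 13):

SELECT query, calls, total_time, rows
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;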

logging

  • logging_collector should always be on
  • log_rotation_age log_rotation_size
  • log_statement - off, ddl, mod, all
  • log_lock_waits when longer than deadlock_timeout
  • pgconfigurator.cybertec.at
  • pgtune.leopard.in.ua
  • github.com/jberkus/annotated.conf