@genslein
Created June 6, 2019 15:32

PgCon 2019 Learnings

Pluggable storage in PG 12

  • Table Access Method (AM) is definable (see the sketch after this list)
  • Opens doors for Columnar storage, UNDO capabilities
  • Multiple Tuple types: minimal, virtual and heap
  • Aims to solve the chronic Bloat problem modern workloads face regularly
  • Not Supported: WAL, non-heap catalogs
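A minimal sketch of the PG 12 surface area, using hypothetical table and AM names (events_heap, my_heap); heap is the only table AM shipped in core, so the CREATE ACCESS METHOD line just re-registers the built-in handler for illustration:

    -- List table access methods (amtype = 't' marks table AMs in PG 12)
    SELECT amname, amtype FROM pg_am WHERE amtype = 't';

    -- Choose a table AM explicitly at table-creation time (hypothetical table)
    CREATE TABLE events_heap (id bigint, payload text) USING heap;

    -- Register a new AM name; it points at the built-in heap handler purely for illustration
    CREATE ACCESS METHOD my_heap TYPE TABLE HANDLER heap_tableam_handler;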

nbtree Architecture for PG 12

Big picture B-trees
  • Recursive growth, pages are split to create new pages
  • Lookups execute in logarithmic time
  • Uses Lehman & Yao algorithm (nbtree or B-Link trees)
The key space
  • Every page owns a range or a "space"
  • The original page of a split (left side) keeps the lower part of its old key space
  • The new page of a split (right side) owns the range from the split point up to the previous high-key
  • A page split installs a new high-key (the split point) on the left page; the old high-key moves to the right page
How to recover
  • Recovery works by moving right; B-link trees take an optimistic approach where earlier designs were pessimistic
  • A downlink to the new right page is then inserted into the parent
  • Scans check the high-key of each page to ensure ordering and verify that the page covers the values of interest
Terminology
  • Pivot tuples -> the high keys and downlinks that guide searches (see the pageinspect sketch after the next list)
  • Non-pivot tuples -> leaf tuples that actually point at rows in the heap
  • Heap TID -> now treated as a tiebreaker key column in PG 12
  • "Split after new tuple" optimization, motivated by TPC-C-style insert patterns
Interesting points
  • The algorithm optimizes to reduce number of page splits
  • It leaves roughly 50% less wasted space across page splits, fixing a problem rather than shipping a new feature
  • Regression testing shows roughly a 25% reduction in index size and space utilization
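To see pivot tuples and high keys for yourself, the pageinspect extension can be pointed at any B-tree index; the index name and block number below are placeholders:

    CREATE EXTENSION IF NOT EXISTS pageinspect;

    -- Index metadata: root block number and tree level
    SELECT root, level FROM bt_metap('some_index');

    -- Items on block 1; on a non-rightmost page the first item is the high key (a pivot tuple)
    SELECT itemoffset, ctid, itemlen, data FROM bt_page_items('some_index', 1);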

Odyssey: PgBouncer alternative

  • Slides
  • GitHub repo is at 1.0 RC, getting ready for a major release
  • Folks had difficulty getting the PgBouncer maintainers to fix bugs
  • Created their own pooler with the desired functionality and a configuration pattern identical to PgBouncer's
    • Robust query tracing in logs
    • Proper connection management for database & user unique pairs
    • Proper error forwarding from Postgres
    • Proper CPU horizontal scaling of processes
    • Production-ready TLS management without needing HAProxy

Postgres for all your data shapes

  • BRIN --> Block-Range Index, trying to solve the lightweight index problem (see the index sketch after this list)
    • If you want to index every field or column in a table, B-Trees are far too expensive, sometimes half the table size
    • Stores min/max values per block range (default 128 pages, roughly 1 MB)
    • Great for large Data shapes (1TB+), data warehouse use cases
    • Allows skipping large groupings that can be ruled outside of the query parameters
    • Index is 0.003% of the size of the Heap
    • It is slower than B-tree though
  • GIN Indexes
    • Great for duplicates, text documents
  • GiST
    • Great for geometrics
  • text_pattern_ops can be used in indexing to help with LIKE queries
  • Whole word and prefix searches are possible
    • 21 languages and counting
  • Set default_text_search_config which allows query operators to perform the matching
  • Can create GIN indexes with tsvector to index the searches
  • Operators for tsquery data types get created upon enabling extensions
  • Caveat: ILIKE '%verit%' can't use a standard tsvector index
    • The trigram extension (pg_trgm) is needed to index the text's three-character segments
    • Indexing then happens on all of those trigrams
    • Example: body. --> [(b,o,d), (o,d,y), (d,y,.)]
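A sketch tying the index types above together, using hypothetical events/docs tables and column names; only the extension, operator-class, and function names come from Postgres itself:

    -- BRIN on an append-only timestamp column: tiny index, good for warehouse-style scans
    CREATE INDEX events_created_brin ON events USING brin (created_at);

    -- GIN over a tsvector for whole-word and prefix (verit:*) searches
    CREATE INDEX docs_body_fts ON docs USING gin (to_tsvector('english', body));
    SELECT id FROM docs
    WHERE to_tsvector('english', body) @@ to_tsquery('english', 'verit:*');

    -- pg_trgm for ILIKE '%verit%' style substring matching
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX docs_body_trgm ON docs USING gin (body gin_trgm_ops);
    SELECT id FROM docs WHERE body ILIKE '%verit%';
    SELECT show_trgm('body.');  -- inspect the trigrams pg_trgm generates for a string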

Postgres Tuning Training

  • General rule of thumb
    • max_wal_size = 1 GB
      • roughly max_wal_size / 3 of WAL written corresponds to one checkpoint
  • Doing large bulk loads will trigger checkpoints, and you'd WANT that to happen
  • The pg_buffercache extension can be useful: it provides views/functions over all shared buffers so you can examine what fraction is dirty vs. clean (query sketch below)
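A sketch of the kind of pg_buffercache query meant here, showing the dirty vs. clean split of shared buffers:

    CREATE EXTENSION IF NOT EXISTS pg_buffercache;

    -- What fraction of shared_buffers is currently dirty vs. clean
    SELECT isdirty, count(*) AS buffers,
           round(100.0 * count(*) / sum(count(*)) OVER (), 1) AS pct
    FROM pg_buffercache
    GROUP BY isdirty;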
Spread checkpoints
  • checkpoint_completion_target = 0.5
    • Example: Say it needs to write 1GB from buffers to disk
    • checkpoint_timeout is 30 min
    • Older Postgres did the entire write all at once to disk
    • This used to cause huge latency spikes lasting up to 10 seconds, then back to normal
    • With the completion target of 0.5, those writes happen over 15 minutes, in a paced manner based on the percentage set
    • At the end of the checkpoint, an fsync is performed
      • This should be fast because the disk should have all of the written data
    • When you decrease this, there will be increased burst I/O to disk, which incurs extra shared_buffers flushing and performance issues (config sketch below)
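A hedged config sketch of the pacing described above: with checkpoint_timeout = 30 min and checkpoint_completion_target = 0.5, the ~1 GB of dirty buffers is written over roughly 15 minutes, followed by the fsync:

    ALTER SYSTEM SET checkpoint_timeout = '30min';
    ALTER SYSTEM SET checkpoint_completion_target = 0.5;  -- pace writes over ~15 min
    ALTER SYSTEM SET max_wal_size = '1GB';
    SELECT pg_reload_conf();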
Basic Tuning Revisited
  • There is no magic bullet, based on workload
  • Rule of thumb: Start lower, work your way up
    • max_connections is usually good with 2x CPU cores
    • if CPU bound, hyperthreading can help only so much; you can still shoot yourself in the foot
    • Too many idle connections = recipe for disaster
    • USE A CONNECTION POOLER
  • shared_buffers
    • the often-quoted "25% of RAM" rule is obsolete!!!!
    • No new rule, very workload dependent
    • tune using pg_buffercache
    • Tuning this is a worthwhile investment when cache hit ratios are low (roughly below 90%)
    • What matters is the active dataset
  • work_mem = 4 MB (default)
    • this is per sort/hash operation, not per query; one query can use several multiples of it (see the session-level sketch after this list)
    • a single query doing a lot of complex sorts, hash joins, etc. could make use of 2 GB
    • but a default of 4 MB would slow all such queries down, and raising the default affects every connection
    • this needs to be considered carefully together with max_connections
    • Keep total amount of memory per connection within reasonable limits
  • maintenance_work_mem = 64 MB (default)
    • going beyond 1 GB doesn't really help
    • smaller values can be better (L2/L3 caches)
  • effective_io_concurrency = 1 (default)
    • Worth increasing for SSDs or RAID arrays
    • Prefetching essentially, if storage can handle parallelization, this usually helps
    • This is a parallelization factor and can safely be raised to around 16
    • Originally meant "number of spindles" -> how many pages to prefetch simultaneously

VACUUM

  • Long-running statements and transactions at serializable/repeatable-read isolation can halt removal of bloat (see the query sketch after this list)
  • Same for long-running write Operations
    • Even for non-serializable transactions
  • Two-phase commit and idle transactions can cause significant problems
  • Cannot remove "dead rows" when version is still visible to at least one user/connection
  • Process:
    • Phase 1 -> Scans tables, reduces dead rows to row pointers, makes list of pointers
    • Phase 2 -> Scan indexes, uses list of missing rows to remove dead index pointers
    • Phase 3 -> Scan tables again, uses list of rows to remove, removes row pointers
    • Phase 4 -> Truncate empty blocks
  • some statistics
    • maintenance_work_mem -> 6 bytes per dead-row pointer (64 MB holds roughly 11 million pointers per index-cleanup pass)
    • vacuum_cost_delay is the primary knob in the calculation that throttles vacuuming
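A sketch of how to spot what is holding cleanup back, using the standard catalog views (no assumptions beyond stock Postgres):

    -- Oldest open transactions; these pin the dead rows vacuum wants to remove
    SELECT pid, state, xact_start, query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_start;

    -- Forgotten two-phase commits are an easy-to-miss culprit
    SELECT gid, prepared, owner FROM pg_prepared_xacts;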
Side Effects
  • Can generate lots of WAL
  • VACUUM
    • Locks table against other DDL
    • No other DDL can run against it; regular DML (inserts, updates, deletes) continues
    • Attempts to get an AccessExclusiveLock for the truncation step
    • Truncates free space at the end of the table only
    • Skips the truncation if it can't get the lock
  • VACUUM FULL
    • Repacks table, minimizing usage, creates a table copy
      • Similar to CLUSTER command
    • Locks whole Table
    • Scans and vacuums ALL rows
    • doesn't respect vacuum delay
    • Causes massive replication lag!!
  • Autovacuum
    • Can be set at table or system levels, enabled by default for all newer versions
    • log_autovacuum_min_duration is disabled by default (-1); when set (in ms, e.g. 10), autovacuum runs taking at least that long get logged
    • autovacuum_max_workers (default 3) -> number of vacuum worker processes; adding workers makes each one go slower
    • if you need it to go faster, increase the cost settings
    • REALLY THINK HARD BEFORE MODIFYING TABLE-LEVEL AUTOVACUUM SETTINGS
    • heavily discouraged; the relevant defaults are:
      • autovacuum_vacuum_threshold = 50
      • autovacuum_analyze_threshold = 50
      • autovacuum_vacuum_scale_factor = 0.2
      • autovacuum_analyze_scale_factor = 0.1
    • Do we need to vacuum? Autovacuum triggers when dead rows exceed threshold + (rows * scale_factor) (see the query sketch after this list)
    • scale_factor -> analyze is 10%, vacuum is 20%
      • consider lower on huge tables (0.05 as an example)
      • Clean up of smaller tables is relatively cheap
      • more frequently but less intrusive to overall queries
      • Catch-22: if long-running transactions prevent the vacuum from completing its cleanup, you either need to increase the scale factor so the cleanup eventually goes through, or the long-running process needs to be killed; be judicious
    • the threshold of 50 keeps autovacuum away from tiny tables, where the cost-benefit is low
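A sketch of the trigger formula in practice, using the default threshold and scale factor; the ALTER TABLE line shows the (discouraged, use sparingly) per-table override for a hypothetical huge table:

    -- Dead rows vs. the autovacuum trigger: threshold + (rows * scale_factor)
    SELECT relname, n_live_tup, n_dead_tup,
           50 + 0.2 * n_live_tup AS autovacuum_trigger
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC;

    -- Per-table override: vacuum a huge table at 5% dead rows instead of 20%
    ALTER TABLE big_events SET (autovacuum_vacuum_scale_factor = 0.05);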
Vacuum Performance
  • autovacuum_vacuum_cost_delay = 20ms
  • autovacuum_vacuum_cost_limit = 200
  • vacuum_cost_page_hit = 1
  • vacuum_cost_page_miss = 10
  • vacuum_cost_page_dirty = 20
  • Can consume lots of CPU or I/O
  • Example: with the defaults this works out to roughly 8 MB/sec of reads and 4 MB/sec of writes (worked numbers in the sketch after this list)
  • May want to increase cost_limit to 1000, which allows cleanup to proceed more aggressively
  • the cost_limit budget is shared across all active workers
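The arithmetic behind those throughput numbers, plus the cost_limit bump mentioned above (a sketch; tune to your own I/O budget):

    -- Each 20ms round spends up to 200 cost points:
    --   all-miss reads : 200 / 10 = 20 pages per round => ~1000 pages/s => ~8 MB/s (8 kB pages)
    --   dirtied writes : 200 / 20 = 10 pages per round => ~500 pages/s  => ~4 MB/s
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 1000;  -- ~5x more aggressive cleanup
    SELECT pg_reload_conf();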
Warnings and Monitoring
  • WARNING: vacuum can exceed the declared I/O usage limits while certain tables are being processed
  • pg_stat_progress_vacuum (9.6+) tracks progress (queries below)
  • The vacuumdb CLI command can run concurrent vacuum jobs, but there is no parallelization within a single large table...yet
  • txids are 32-bit, so wraparound comes after roughly 4 billion transactions
  • pg_clog, pg_multixact
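Monitoring sketches for both points: vacuum progress (9.6+) and transaction-id age relative to wraparound:

    -- Live progress of running vacuums
    SELECT pid, relid::regclass, phase, heap_blks_scanned, heap_blks_total
    FROM pg_stat_progress_vacuum;

    -- Per-database xid age; keep this far below the ~2 billion wraparound ceiling
    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY xid_age DESC;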

Takeaways

  • There may be a way to create a "profiling" feature to tune shared_buffers, work_mem and other settings for read-heavy or write-heavy or default (mixed) workloads.
  • Hallway-track talk: for the majority of workloads, optimizing the DB yields around a 2x performance gain, while optimizing the application can yield up to 100x
  • Discussions around vacuum, bloat, and sharded workload patterns are still developing; some external vendors have already engaged, but there is a desire for a more formal managed partitioning + FDW strategy