@genslein
Created June 6, 2019 15:32

PgCon 2019 Learnings

Pluggable storage in PG 12

  • Table Access Method (AM) is definable (see the sketch after this list)
  • Opens doors for Columnar storage, UNDO capabilities
  • Multiple Tuple types: minimal, virtual and heap
  • Aims to solve the chronic Bloat problem modern workloads face regularly
  • Not Supported: WAL, non-heap catalogs
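A minimal sketch of the PG 12 surface area, using hypothetical table and AM names (events_heap, my_heap); heap is the only table AM shipped in core, so the CREATE ACCESS METHOD line just re-registers the built-in handler for illustration:

    -- List table access methods (amtype = 't' marks table AMs in PG 12)
    SELECT amname, amtype FROM pg_am WHERE amtype = 't';

    -- Choose a table AM explicitly at table-creation time (hypothetical table)
    CREATE TABLE events_heap (id bigint, payload text) USING heap;

    -- Register a new AM name; it points at the built-in heap handler purely for illustration
    CREATE ACCESS METHOD my_heap TYPE TABLE HANDLER heap_tableam_handler;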

nbtree Architecture for PG 12

Big picture B-trees
  • Recursive growth, pages are split to create new pages
  • Lookups execute in logarithmic time
  • Uses Lehman & Yao algorithm (nbtree or B-Link trees)
The key space
  • Every page owns a range or a "space"
  • The original page of a split (left side) keeps the lower part of its old key space
  • The new page of a split (right side) owns the range from the split point up to the previous high-key
  • A page split installs a new high-key (the split point) on the left page; the old high-key moves to the right page
How to recover
  • Recovery works by moving right; B-link trees take an optimistic approach where earlier designs were pessimistic
  • A downlink to the new right page is then inserted into the parent
  • Scans check the high-key of each page to ensure ordering and verify that the page covers the values of interest
Terminology
  • Pivot tuples -> the high keys and downlinks that guide searches (see the pageinspect sketch after the next list)
  • Non-pivot tuples -> leaf tuples that actually point at rows in the heap
  • Heap TID -> now treated as a tiebreaker key column in PG 12
  • "Split after new tuple" optimization, motivated by TPC-C-style insert patterns
Interesting points
  • The algorithm optimizes to reduce number of page splits
  • It leaves roughly 50% less wasted space across page splits, fixing a problem rather than shipping a new feature
  • Regression testing shows roughly a 25% reduction in index size and space utilization
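To see pivot tuples and high keys for yourself, the pageinspect extension can be pointed at any B-tree index; the index name and block number below are placeholders:

    CREATE EXTENSION IF NOT EXISTS pageinspect;

    -- Index metadata: root block number and tree level
    SELECT root, level FROM bt_metap('some_index');

    -- Items on block 1; on a non-rightmost page the first item is the high key (a pivot tuple)
    SELECT itemoffset, ctid, itemlen, data FROM bt_page_items('some_index', 1);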

Odyssey: PgBouncer alternative

  • Slides
  • GitHub repo is at 1.0 RC, getting ready for a major release
  • Folks had difficulty getting the PgBouncer maintainers to fix bugs
  • Created their own pooler with the desired functionality and a configuration pattern identical to PgBouncer's
    • Robust query tracing in logs
    • Proper connection management for database & user unique pairs
    • Proper error forwarding from Postgres
    • Proper CPU horizontal scaling of processes
    • Production-ready TLS management without needing HAProxy

Postgres for all your data shapes

  • BRIN --> Block-Range Index, trying to solve the lightweight index problem (see the index sketch after this list)
    • If you want to index every field or column in a table, B-Trees are far too expensive, sometimes half the table size
    • Stores min/max values per block range (default 128 pages, roughly 1 MB)
    • Great for large Data shapes (1TB+), data warehouse use cases
    • Allows skipping large groupings that can be ruled outside of the query parameters
    • Index is 0.003% of the size of the Heap
    • It is slower than B-tree though
  • GIN Indexes
    • Great for duplicates, text documents
  • GiST
    • Great for geometrics
  • text_pattern_ops can be used in indexing to help with LIKE queries
  • Whole word and prefix searches are possible
    • 21 languages and counting
  • Set default_text_search_config which allows query operators to perform the matching
  • Can create GIN indexes with tsvector to index the searches
  • Operators for tsquery data types get created upon enabling extensions
  • Caveat: ILIKE '%verit%' can't use a standard tsvector index
    • The trigram extension (pg_trgm) is needed to index the text's three-character segments
    • Indexing then happens on all of those trigrams
    • Example: body. --> [(b,o,d), (o,d,y), (d,y,.)]
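A sketch tying the index types above together, using hypothetical events/docs tables and column names; only the extension, operator-class, and function names come from Postgres itself:

    -- BRIN on an append-only timestamp column: tiny index, good for warehouse-style scans
    CREATE INDEX events_created_brin ON events USING brin (created_at);

    -- GIN over a tsvector for whole-word and prefix (verit:*) searches
    CREATE INDEX docs_body_fts ON docs USING gin (to_tsvector('english', body));
    SELECT id FROM docs
    WHERE to_tsvector('english', body) @@ to_tsquery('english', 'verit:*');

    -- pg_trgm for ILIKE '%verit%' style substring matching
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX docs_body_trgm ON docs USING gin (body gin_trgm_ops);
    SELECT id FROM docs WHERE body ILIKE '%verit%';
    SELECT show_trgm('body.');  -- inspect the trigrams pg_trgm generates for a string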

Postgres Tuning Training

  • General rule of thumb
    • max_wal_size = 1 GB
      • roughly max_wal_size / 3 of WAL written corresponds to one checkpoint
  • Doing large bulk loads will trigger checkpoints, and you'd WANT that to happen
  • The pg_buffercache extension can be useful: it provides views/functions over all shared buffers so you can examine what fraction is dirty vs. clean (query sketch below)
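A sketch of the kind of pg_buffercache query meant here, showing the dirty vs. clean split of shared buffers:

    CREATE EXTENSION IF NOT EXISTS pg_buffercache;

    -- What fraction of shared_buffers is currently dirty vs. clean
    SELECT isdirty, count(*) AS buffers,
           round(100.0 * count(*) / sum(count(*)) OVER (), 1) AS pct
    FROM pg_buffercache
    GROUP BY isdirty;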
Spread checkpoints
  • checkpoint_completion_target = 0.5
    • Example: Say it needs to write 1GB from buffers to disk
    • checkpoint_timeout is 30 min
    • Older Postgres did the entire write all at once to disk
    • This used to cause huge latency spikes lasting up to 10 seconds, then back to normal
    • With the completion target of 0.5, those writes happen over 15 minutes, in a paced manner based on the percentage set
    • At the end of the checkpoint, an fsync is performed
      • This should be fast because the disk should have all of the written data
    • When you decrease this, there will be increased burst I/O to disk, which incurs extra shared_buffers flushing and performance issues (config sketch below)
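A hedged config sketch of the pacing described above: with checkpoint_timeout = 30 min and checkpoint_completion_target = 0.5, the ~1 GB of dirty buffers is written over roughly 15 minutes, followed by the fsync:

    ALTER SYSTEM SET checkpoint_timeout = '30min';
    ALTER SYSTEM SET checkpoint_completion_target = 0.5;  -- pace writes over ~15 min
    ALTER SYSTEM SET max_wal_size = '1GB';
    SELECT pg_reload_conf();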
Basic Tuning Revisited
  • There is no magic bullet, based on workload
  • Rule of thumb: Start lower, work your way up
    • max_connections is usually good with 2x CPU cores
    • if CPU bound, hyperthreading can help only so much; you can still shoot yourself in the foot
    • Too many idle connections = recipe for disaster
    • USE A CONNECTION POOLER
  • shared_buffers
    • the often-quoted "25% of RAM" rule is obsolete!!!!
    • No new rule, very workload dependent
    • tune using pg_buffercache
    • Tuning this is a worthwhile investment when cache hit ratios are low (roughly below 90%)
    • What matters is the active dataset
  • work_mem = 4 MB (default)
    • this is per sort/hash operation, not per query; one query can use several multiples of it (see the session-level sketch after this list)
    • a single query doing a lot of complex sorts, hash joins, etc. could make use of 2 GB
    • but a default of 4 MB would slow all such queries down, and raising the default affects every connection
    • this needs to be considered carefully together with max_connections
    • Keep total amount of memory per connection within reasonable limits
  • maintenance_work_mem = 64 MB (default)
    • going beyond 1 GB doesn't really help
    • smaller values can be better (L2/L3 caches)
  • effective_io_concurrency = 1 (default)
    • Worth increasing for SSDs or RAID arrays
    • Prefetching essentially, if storage can handle parallelization, this usually helps
    • This is a parallelization factor and can safely be raised to around 16
    • Originally meant "number of spindles" -> how many pages to prefetch simultaneously

VACUUM

  • Long-running statements and transactions at serializable/repeatable-read isolation can halt removal of bloat (see the query sketch after this list)
  • Same for long-running write Operations
    • Even for non-serializable transactions
  • Two-phase commit and idle transactions can cause significant problems
  • Cannot remove "dead rows" when version is still visible to at least one user/connection
  • Process:
    • Phase 1 -> Scans tables, reduces dead rows to row pointers, makes list of pointers
    • Phase 2 -> Scan indexes, uses list of missing rows to remove dead index pointers
    • Phase 3 -> Scan tables again, uses list of rows to remove, removes row pointers
    • Phase 4 -> Truncate empty blocks
  • some statistics
    • maintenance_work_mem -> 6 bytes per dead-row pointer (64 MB holds roughly 11 million pointers per index-cleanup pass)
    • vacuum_cost_delay is the primary knob in the calculation that throttles vacuuming
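A sketch of how to spot what is holding cleanup back, using the standard catalog views (no assumptions beyond stock Postgres):

    -- Oldest open transactions; these pin the dead rows vacuum wants to remove
    SELECT pid, state, xact_start, query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_start;

    -- Forgotten two-phase commits are an easy-to-miss culprit
    SELECT gid, prepared, owner FROM pg_prepared_xacts;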
Side Effects
  • Can generate lots of WAL
  • VACUUM
    • Locks table against other DDL
    • No other DDL can run against it; regular DML (inserts, updates, deletes) continues
    • Attempts to get an AccessExclusiveLock for the truncation step
    • Truncates free space at the end of the table only
    • Skips the truncation if it can't get the lock
  • VACUUM FULL
    • Repacks table, minimizing usage, creates a table copy
      • Similar to CLUSTER command
    • Locks whole Table
    • Scans and vacuums ALL rows
    • doesn't respect vacuum delay
    • Causes massive replication lag!!
  • Autovacuum
    • Can be set at table or system levels, enabled by default for all newer versions
    • log_autovacuum_min_duration is disabled by default (-1); when set (in ms, e.g. 10), autovacuum runs taking at least that long get logged
    • autovacuum_max_workers (default 3) -> number of vacuum worker processes; adding workers makes each one go slower
    • if you need it to go faster, increase the cost settings
    • REALLY THINK HARD BEFORE MODIFYING TABLE-LEVEL AUTOVACUUM SETTINGS
    • heavily discouraged; the relevant defaults are:
      • autovacuum_vacuum_threshold = 50
      • autovacuum_analyze_threshold = 50
      • autovacuum_vacuum_scale_factor = 0.2
      • autovacuum_analyze_scale_factor = 0.1
    • Do we need to vacuum? Autovacuum triggers when dead rows exceed threshold + (rows * scale_factor) (see the query sketch after this list)
    • scale_factor -> analyze is 10%, vacuum is 20%
      • consider lower on huge tables (0.05 as an example)
      • Clean up of smaller tables is relatively cheap
      • more frequently but less intrusive to overall queries
      • Catch-22: if long-running transactions prevent the vacuum from completing its cleanup, you either need to increase the scale factor so the cleanup eventually goes through, or the long-running process needs to be killed; be judicious
    • the threshold of 50 keeps autovacuum away from tiny tables, where the cost-benefit is low
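A sketch of the trigger formula in practice, using the default threshold and scale factor; the ALTER TABLE line shows the (discouraged, use sparingly) per-table override for a hypothetical huge table:

    -- Dead rows vs. the autovacuum trigger: threshold + (rows * scale_factor)
    SELECT relname, n_live_tup, n_dead_tup,
           50 + 0.2 * n_live_tup AS autovacuum_trigger
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC;

    -- Per-table override: vacuum a huge table at 5% dead rows instead of 20%
    ALTER TABLE big_events SET (autovacuum_vacuum_scale_factor = 0.05);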
Vacuum Performance
  • autovacuum_vacuum_cost_delay = 20ms
  • autovacuum_vacuum_cost_limit = 200
  • vacuum_cost_page_hit = 1
  • vacuum_cost_page_miss = 10
  • vacuum_cost_page_dirty = 20
  • Can consume lots of CPU or I/O
  • Example: with the defaults this works out to roughly 8 MB/sec of reads and 4 MB/sec of writes (worked numbers in the sketch after this list)
  • May want to increase cost_limit to 1000, which allows cleanup to proceed more aggressively
  • the cost_limit budget is shared across all active workers
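The arithmetic behind those throughput numbers, plus the cost_limit bump mentioned above (a sketch; tune to your own I/O budget):

    -- Each 20ms round spends up to 200 cost points:
    --   all-miss reads : 200 / 10 = 20 pages per round => ~1000 pages/s => ~8 MB/s (8 kB pages)
    --   dirtied writes : 200 / 20 = 10 pages per round => ~500 pages/s  => ~4 MB/s
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 1000;  -- ~5x more aggressive cleanup
    SELECT pg_reload_conf();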
Warnings and Monitoring
  • WARNING: vacuum can exceed the declared I/O usage limits while certain tables are being processed
  • pg_stat_progress_vacuum (9.6+) tracks progress (queries below)
  • The vacuumdb CLI command can run concurrent vacuum jobs, but there is no parallelization within a single large table...yet
  • txids are 32-bit, so wraparound comes after roughly 4 billion transactions
  • pg_clog, pg_multixact
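Monitoring sketches for both points: vacuum progress (9.6+) and transaction-id age relative to wraparound:

    -- Live progress of running vacuums
    SELECT pid, relid::regclass, phase, heap_blks_scanned, heap_blks_total
    FROM pg_stat_progress_vacuum;

    -- Per-database xid age; keep this far below the ~2 billion wraparound ceiling
    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY xid_age DESC;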

Takeaways

  • There may be a way to create a "profiling" feature to tune shared_buffers, work_mem and other settings for read-heavy or write-heavy or default (mixed) workloads.
  • Hallway-track talk: for the majority of workloads, optimizing the DB yields around a 2x performance gain, while optimizing the application can yield up to 100x
  • Discussions around vacuum, bloat, and sharded workload patterns are still developing; some external vendors have already engaged, but there is a desire for a more formal managed partitioning + FDW strategy