- Table Access Method (AM) definable
- Opens doors for Columnar storage, UNDO capabilities
- Multiple Tuple types: minimal, virtual and heap
- Aims to solve the chronic Bloat problem modern workloads face regularly
- Not Supported: WAL, non-heap catalogs
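- For illustration (my own sketch, not from the notes), PostgreSQL 12 exposes this through `CREATE ACCESS METHOD ... TYPE TABLE` plus a per-table `USING` clause; `heap_tableam_handler` is the built-in heap implementation:

  ```sql
  -- Register a (hypothetical) second table AM backed by the built-in heap handler.
  CREATE ACCESS METHOD heap2 TYPE TABLE HANDLER heap_tableam_handler;

  -- Tables opt into an access method explicitly; the default is 'heap'.
  CREATE TABLE orders (id bigint, total numeric) USING heap2;
  ```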
- Recursive growth, pages are split to create new pages
- Execution (lookup) cost grows logarithmically with table size
- Uses Lehman & Yao algorithm (nbtree or B-Link trees)
- Every page owns a range or a "space"
- The original page of the split (left-hand page) keeps the lower part of its old range
- The new page of the split (right-hand page) owns the upper part of the range, beyond the left page's new high key
- Page splits insert a new high key into the left-hand page; the right-hand page inherits the previous high key
- Recovery by "moving right": B-link trees take an optimistic approach where earlier designs were pessimistic (held more locks)
- A downlink to the new right-hand page is then inserted into the parent, alongside the existing downlink to the left page
- Scans each page to ensure ordering on the high-keys and verify each page covers the values of interest
- Pivot tuples -> high keys of each page
- Non-pivot tuples -> leaf tuples that point at heap rows, not used for navigation
- Heap TID is now part of the key, acting as a tie-breaker column
- "Split after new tuple" optimization, motivated by TPC-C-style insert patterns
- The algorithm optimizes to reduce number of page splits
- Uses up to 50% less space on page-split-heavy workloads; this fixes an existing problem rather than shipping a new feature
- Regression testing shows roughly a 25% reduction in index size / space utilization
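- Rough sketch (mine, not from the talk) of inspecting high keys / pivot tuples with the `pageinspect` extension; `orders_pkey` is a hypothetical index name:

  ```sql
  CREATE EXTENSION IF NOT EXISTS pageinspect;

  -- Block 1 is typically the first leaf page; on a non-rightmost page the
  -- first item returned is the page's high key (a pivot tuple).
  SELECT * FROM bt_page_items('orders_pkey', 1);
  ```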
- Slides
- GitHub repo is at 1.0 RC, getting ready for a major release
- Folks had difficulty getting maintainers to fix bugs
- Created their own pooler with the desired functionality and a configuration pattern identical to PgBouncer's
- Robust query tracing in logs
- Proper connection management per unique database/user pair
- Proper error forwarding from Postgres
- Proper horizontal scaling across CPU cores (multiple processes)
- Production-grade TLS management without HAProxy
- BRIN --> Block-Range Index trying to solve the lightweight index problem
- If you want to index every column in a table, B-Trees are far too expensive, sometimes half the table size
- Stores the min/max value for each block range (default 128 pages, roughly 1 MB)
- Great for large Data shapes (1TB+), data warehouse use cases
- Allows skipping large groupings that can be ruled outside of the query parameters
- Index is 0.003% of the size of the Heap
- It is slower than B-tree though
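- A minimal sketch (table and column names invented) of creating a BRIN index with an explicit `pages_per_range`:

  ```sql
  -- Append-only table whose physical order correlates with recorded_at.
  CREATE TABLE sensor_readings (
      recorded_at timestamptz NOT NULL,
      value       double precision
  );

  -- Each BRIN entry summarizes the min/max over one block range
  -- (default 128 pages, i.e. roughly 1 MB of heap).
  CREATE INDEX sensor_readings_brin
      ON sensor_readings USING brin (recorded_at)
      WITH (pages_per_range = 128);
  ```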
- GIN Indexes
- Great for duplicates, text documents
- GiST
- Great for geometric data types
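- Quick sketch (my own, assumed column types) of where each fits:

  ```sql
  -- GIN: many keys per row, good for arrays / documents with duplicates.
  CREATE TABLE docs (id bigint, tags text[]);
  CREATE INDEX docs_tags_gin ON docs USING gin (tags);

  -- GiST: good for geometric containment / overlap searches.
  CREATE TABLE places (id bigint, location point);
  CREATE INDEX places_location_gist ON places USING gist (location);
  ```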
- `text_pattern_ops` can be used in indexing to help with `LIKE` queries
- Whole-word and prefix searches are possible
- 21 languages and counting
- Set `default_text_search_config`, which allows query operators to perform the matching
- Can create GIN indexes on a `tsvector` to index the searches
- Operators for the `tsquery` data type get created upon enabling extensions
- CAVEAT: `ILIKE '%verit%'` can't use a standard tsvector index; the trigram extension is needed to index all 3-character segments of the text
- Then indexing happens on all three-character parts (trigrams)
- Example: `body.` --> [(b, o, d), (o, d, y), (d, y, .)]
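- Hedged sketch (table/column names invented) tying these together: a GIN index over `to_tsvector` for full-text queries, plus `pg_trgm` for `ILIKE '%...%'` substring matches:

  ```sql
  CREATE TABLE articles (id bigint, body text);

  -- Full-text search: index the tsvector, query with a tsquery.
  CREATE INDEX articles_fts_gin
      ON articles USING gin (to_tsvector('english', body));

  SELECT id
  FROM articles
  WHERE to_tsvector('english', body) @@ to_tsquery('english', 'truth & wine');

  -- Substring matches such as ILIKE '%verit%' need trigrams instead.
  CREATE EXTENSION IF NOT EXISTS pg_trgm;
  CREATE INDEX articles_body_trgm ON articles USING gin (body gin_trgm_ops);

  SELECT id FROM articles WHERE body ILIKE '%verit%';
  ```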
- General rule of thumb: `max_wal_size` = 1 GB; divide this by 3 and that should be roughly one checkpoint's worth of WAL
- Doing large bulk loads will trigger checkpoints, and you'd WANT that to happen
- The `pg_buffercache` extension could be useful; it has views/functions over all shared buffers so you can examine what fraction is dirty vs. clean
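- A minimal sketch of that dirty-vs-clean check:

  ```sql
  CREATE EXTENSION IF NOT EXISTS pg_buffercache;

  -- What fraction of shared_buffers is currently dirty?
  SELECT count(*) FILTER (WHERE isdirty)     AS dirty_buffers,
         count(*) FILTER (WHERE NOT isdirty) AS clean_buffers,
         round(100.0 * count(*) FILTER (WHERE isdirty) / count(*), 1) AS pct_dirty
  FROM pg_buffercache;
  ```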
- `checkpoint_completion_target` = 0.5
- Example: say a checkpoint needs to write 1 GB from shared buffers to disk, and checkpoint_timeout is 30 min
- Older Postgres did the entire write all at once to disk
- That used to cause huge latency spikes for up to 10 secs, then back to normal
- With the completion target, the write is spread over 15 minutes (30 min × 0.5) in a paced manner
- At the end of the checkpoint an fsync is performed
- This should be fast because the disk should already have all of the written data
- When you decrease this, there will be increased usage of the `/tmp` disk, which will incur other shared_buffer flushing and performance issues
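- Sketch of the corresponding settings (values mirror the example above, not recommendations):

  ```sql
  -- Pace checkpoint writes over ~50% of the 30-minute interval (about 15 minutes).
  ALTER SYSTEM SET checkpoint_timeout = '30min';
  ALTER SYSTEM SET checkpoint_completion_target = 0.5;
  ALTER SYSTEM SET max_wal_size = '1GB';
  SELECT pg_reload_conf();  -- these take effect on reload, no restart needed
  ```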
- There is no magic bullet, based on workload
- Rule of thumb: Start lower, work your way up
- max_connections is usually good with 2x CPU cores
- if CPU bound, hyperthreading only helps so much; you can still shoot yourself in the foot
- Too many idle connections = recipe for disaster
- USE A CONNECTION POOLER
- shared_buffers
- the often-quoted "25% of RAM" rule is obsolete!!!!
- No new rule; very workload dependent
- tune using `pg_buffercache`
- This investment really helps with cache hit ratios (if the hit ratio is below ~90%, it might be worth it)
- What matters is the active dataset
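- Hedged sketch of checking the overall hit ratio from the statistics views (the ~90% figure above is the rough cutoff):

  ```sql
  -- Buffer cache hit ratio across all databases since stats were last reset.
  SELECT round(100.0 * sum(blks_hit)
               / nullif(sum(blks_hit) + sum(blks_read), 0), 2) AS cache_hit_pct
  FROM pg_stat_database;
  ```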
- work_mem = 4 MB (default)
- this is allocated per sort/hash operation, not once per query
- a single query doing a lot of complex sorts, hash joins, etc. could use 2 GB in total
- leaving it at the 4 MB default would slow all of those queries down (sorts/hashes spill to disk)
- This needs to be carefully considered in max_connections
- Keep total amount of memory per connection within reasonable limits
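- One common pattern (a sketch, not from the talk): keep the global default low and raise work_mem only for the transaction running the heavy query:

  ```sql
  BEGIN;
  -- Applies only inside this transaction; reverts automatically at COMMIT.
  SET LOCAL work_mem = '256MB';
  SELECT customer_id, sum(total) AS spent   -- hypothetical reporting query
  FROM orders
  GROUP BY customer_id
  ORDER BY spent DESC;
  COMMIT;
  ```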
- maintenance_work_mem = 64 MB (default)
- going beyond 1 GB doesn't really help
- smaller values can be better (L2/L3 caches)
- effective_io_concurrency = 1 (default)
- Worth increasing for SSDs or RAID arrays
- Prefetching essentially, if storage can handle parallelization, this usually helps
- This is a parallelization factor, can be up to 16 within safety
- Means "number of spindles" -> factor of pages to be returned simultaneously
- Long-running statements and serializable (or other strict isolation-level) transactions can halt removal of bloat
- Same for long-running write Operations
- Even for non-serializable transactions
- Two-phase commit and idle transactions can cause significant problems
- Cannot remove "dead rows" when version is still visible to at least one user/connection
- Process:
- Phase 1 -> Scans tables, reduces dead rows to row pointers, makes list of pointers
- Phase 2 -> Scan indexes, uses the list of dead row pointers to remove the matching dead index entries
- Phase 3 -> Scan tables again, uses list of rows to remove, removes row pointers
- Phase 4 -> Truncate empty blocks
- some statistics
- maintenance_work_mem -> 6 bytes per pointer
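- Rough arithmetic (mine, not from the talk): at 6 bytes per pointer, the 64 MB default holds about 64 MiB / 6 B ≈ 11 million dead-tuple pointers, so a table with more dead tuples than that forces the index-scan phases (2-3) to repeat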
- `vacuum_cost_delay` is the primary calculation used to delay (throttle) vacuuming
- Can generate lots of WAL
- VACUUM
- Locks the table against other DDL (e.g. ALTER TABLE, another VACUUM)
- Regular DML (inserts, updates, deletes) can still run
- Attempts to Get AccessExclusiveLock
- truncates empty space at the end of the table only
- skips truncation if can't get Lock
- VACUUM FULL
- Repacks table, minimizing usage, creates a table copy
- Similar to CLUSTER command
- Locks whole Table
- Scans and vacuums ALL rows
- doesn't respect vacuum delay
- Destroys replication latency!!
- Autovacuum
- Can be set at table or system levels, enabled by default for all newer versions
- min_duration is disabled by default (-1), 10 (in ms) to wait to clean a tuple
- max_workers (3 default) -> number of vacuum worker processes; adding workers makes each one slower because the cost limit is shared among them
- if need faster, increase costs settings
- REALLY THINK HARD BEFORE MODIFYING TABLE-LEVEL AUTOVACUUM
- heavily discouraged
- `autovacuum_vacuum_threshold` = 50
- `autovacuum_analyze_threshold` = 50
- `autovacuum_vacuum_scale_factor` = 0.2
- `autovacuum_analyze_scale_factor` = 0.1
- Do we need to vacuum? Triggered once dead rows exceed `threshold + (rows * scale_factor)` (see the worked example below)
- scale_factor -> analyze is 10%, vacuum is 20%
- consider lower on huge tables (0.05 as an example)
- Clean up of smaller tables is relatively cheap
- more frequently but less intrusive to overall queries
- Catch-22: if long-running transactions prevent vacuum from completing, you'd need to increase the scale factor to get the cleanup to go through, or the long-running process needs to be killed; be judicious
- threshold 50 avoids tiny tables, low cost-benefit
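- Worked example (my own numbers): a table with 1,000,000 live rows is vacuumed at 50 + 0.2 × 1,000,000 = 200,050 dead rows and analyzed at 50 + 0.1 × 1,000,000 = 100,050 changed rows; dropping the vacuum scale factor to 0.05 lowers the trigger to 50,050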
- `autovacuum_vacuum_cost_delay` = 20ms
- `autovacuum_vacuum_cost_limit` = 200
- `vacuum_cost_page_hit` = 1
- `vacuum_cost_page_miss` = 10
- `vacuum_cost_page_dirty` = 20
- Can consume lots of CPU or I/O
- Example: with these defaults the budget is roughly 10,000 cost units/sec, which works out to about 8 MB/sec of reads and 4 MB/sec of writes
- May want to increase cost_limit to 1000, which would allow more aggressive cleanup
- cost_limit is shared identically across all workers
- WARNING: can exceed declared I/O usage limits while certain tables are being vacuumed
- `pg_stat_progress_vacuum` (9.6+) to track progress
- `vacuumdb` CLI command can run concurrent vacuum jobs; no parallelization within a single large table... yet
- txids are 32-bit, so roughly 4 billion transactions until wraparound
- pg_clog, pg_multixact
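- Sketch of the monitoring/tuning side (the cost_limit value is just the example above):

  ```sql
  -- Watch a running (auto)vacuum: phase, blocks scanned, dead tuples collected.
  SELECT pid, relid::regclass AS table_name, phase,
         heap_blks_scanned, heap_blks_total, num_dead_tuples
  FROM pg_stat_progress_vacuum;

  -- Let autovacuum spend I/O more aggressively (budget is shared by all workers).
  ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 1000;
  SELECT pg_reload_conf();
  ```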
- There may be a way to create a "profiling" feature to tune `shared_buffers`, `work_mem` and other settings for read-heavy, write-heavy, or default (mixed) workloads
- Hallway track talk: for the majority of workloads, optimizing the DB is a 2x performance gain; optimizing apps can yield up to 100x performance gains
- Discussions around vacuum, bloat, and sharded workload patterns are still developing; some external vendors have already engaged, but there's a desire for a more formal managed partitioning + FDW strategy