Please refer to ZFS 2-3x slower than EXT4 to see how ZFS defaults + Postgres defaults severely underperform EXT4 defaults + Postgres defaults (and also to know more about the system on which these benchmarks were performed). This page documents how to tune ZFS + Postgres to give better performance for the tpcb-like benchmark.
Please do not copy these settings blindly, because I am myself not clear on why/how these settings had the impact they did. For example, I cannot explain why full_page_writes=off
alone did not give much of a boost, and neither did an optimized PG configuration on its own. However, putting the two together gave a 2-4x boost compared to the baseline numbers.
Compare the numbers on the first row and the last row to get a sense of the maximum performance gains that I could achieve.
+------------------------+-------+--------+--------+--------+--------+--------+--------+
| client                 | 1     | 4      | 8      | 12     | 24     | 48     | 96     |
+------------------------+-------+--------+--------+--------+--------+--------+--------+
| EXT4 defaults          | 719   | 1816   | 2835   | 3576   | 4531   | 5485   | 6876   |
| cold (warm)            | (763) | (1982) | (3286) | (3889) | (5071) | (5604) | (6540) |
+------------------------+-------+--------+--------+--------+--------+--------+--------+
| ZFS benchmarks                                                                       |
+--------------------------------------------------------------------------------------+
| PG defaults            | 302   | 808    | 1138   | 1287   | 1619   | 2575   | 4040   |
| cold (warm)            | (309) | (836)  | (1140) | (1307) | (1651) | (2621) | (4301) |
+------------------------+-------+--------+--------+--------+--------+--------+--------+
| Optimized PG conf [1]  | 327   | 1365   | 2291   |        |        |        |        |
| full_page_writes=on    | (334) | (1331) | (2049) |        |        |        |        |
| cold (warm)            |       |        |        |        |        |        |        |
+------------------------+-------+--------+--------+--------+--------+--------+--------+
| PG defaults            |       | 878    |        | 1334   |        |        |        |
| + kernel tuning [2]    |       | (882)  |        | (1363) |        |        |        |
| cold (warm)            |       |        |        |        |        |        |        |
+------------------------+-------+--------+--------+--------+--------+--------+--------+
| PG defaults            |       | 957    |        | 1372   | 2130   |        |        |
| + full_page_writes=off |       | (959)  |        | (1399) | (2154) |        |        |
| + kernel tuning [2]    |       |        |        |        |        |        |        |
| cold (warm)            |       |        |        |        |        |        |        |
+------------------------+-------+--------+--------+--------+--------+--------+--------+
| Optimized PG conf [1]  |       | 2335   |        | 4751   | 6969   | 7627   | 8429   |
| + full_page_writes=off |       | (2488) |        | (5072) | (5667) | (6485) | (7720) |
| + kernel tuning [2]    |       |        |        |        |        |        |        |
| cold (warm)            |       |        |        |        |        |        |        |
+------------------------+-------+--------+--------+--------+--------+--------+--------+
zpool create -f -o ashift=12 -m /firstzfs firstzfs /dev/nvme1n1p2
zfs create -o mountpoint=/firstzfs/postgres \
           -o atime=off \
           -o canmount=on \
           -o compression=lz4 \
           -o quota=100G \
           -o recordsize=8k \
           -o logbias=throughput \
           firstzfs/postgres
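After creating the dataset, it's worth sanity-checking that the properties actually took effect — recordsize=8k in particular matters because it matches Postgres's block size. A quick sketch (the pool/dataset names are the ones from the commands above):

```shell
# Confirm the dataset-level properties set at creation time
zfs get atime,compression,recordsize,logbias,quota firstzfs/postgres

# Confirm the pool-level ashift
zpool get ashift firstzfs
```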
- Give more memory to PG for shared_buffers
- Tweak the IO parameters for SSDs
max_connections = 100
shared_buffers = 16GB
effective_cache_size = 48GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 41943kB
min_wal_size = 1GB
max_wal_size = 2GB
max_worker_processes = 8
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
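If you'd rather not hand-edit postgresql.conf, the same settings can be applied with ALTER SYSTEM. A sketch for a few of them (assumes a local superuser connection; note that shared_buffers and max_connections only take effect after a restart, not a reload, and the service name varies by distro):

```shell
# Write settings into postgresql.auto.conf via ALTER SYSTEM
psql -U postgres -c "ALTER SYSTEM SET shared_buffers = '16GB'"
psql -U postgres -c "ALTER SYSTEM SET effective_cache_size = '48GB'"
psql -U postgres -c "ALTER SYSTEM SET random_page_cost = 1.1"
psql -U postgres -c "ALTER SYSTEM SET effective_io_concurrency = 200"

# shared_buffers requires a full restart to take effect
sudo systemctl restart postgresql
```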
full_page_writes=off
To be honest, I don't completely understand what this really does, but apparently it has a big positive impact for Postgres + ZFS. (The commonly cited rationale: ZFS's copy-on-write design makes torn page writes impossible, so the full-page images that Postgres normally writes into the WAL after each checkpoint become redundant.)
- Set the CPU governor to performance. I noticed that multiple machines had this set to powersave instead, which keeps reducing the CPU speed in periods of inactivity and raises it again when required, but only after a slight delay. In an old benchmark (not reported above), I think I saw an improvement of ~5% due to this tweak alone.
cpufreq-set -g performance
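cpufreq-set (from cpufrequtils) acts on one CPU at a time, so on a multi-core machine you may want to check and set every core. A sketch, assuming the sysfs cpufreq interface is available:

```shell
# Show the current governor on every core (counts of each value)
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c

# Set the performance governor on each core in turn
for c in $(seq 0 $(($(nproc) - 1))); do
    sudo cpufreq-set -c "$c" -g performance
done
```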
- Change the drive read-ahead to 4096. Warning: this is recommended in the book PostgreSQL 10 - High Performance, but I am not sure how relevant it is for ZFS, because ZFS has about 10+ (really!) tunable parameters of its own for prefetching data.
blockdev --setra 4096 /dev/sda
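Before changing it, you can read back the current value. Note that blockdev counts in 512-byte sectors, so 4096 sectors means 2 MiB of read-ahead (the device path is just an example):

```shell
# Current read-ahead, in 512-byte sectors
blockdev --getra /dev/sda
```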
- Change swappiness. Caveat: I haven't even configured swap on the benchmarking system, but I did this anyway for the sake of completeness. Again, another recommendation from PostgreSQL 10 - High Performance.
echo 0 > /proc/sys/vm/swappiness
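Writing into /proc does not survive a reboot; to make the setting persistent you can drop it into sysctl.d (the file name below is just an example):

```shell
# Persist vm.swappiness=0 across reboots
echo "vm.swappiness = 0" | sudo tee /etc/sysctl.d/99-postgres-tuning.conf

# Reload all sysctl configuration files now
sudo sysctl --system
```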
For the record, have you tried different recordsizes, like 128K and 1M?
And leaving the ashift at auto? Are you sure about the physical sector size of your NVMe device?
My output:
carles@xeon:~$ sudo fdisk /dev/nvme0n1
Welcome to fdisk (util-linux 2.31.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.
Command (m for help): p
Disk /dev/nvme0n1: 477 GiB, 512110190592 bytes, 1000215216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
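Worth noting: fdisk only shows the sector sizes the device currently advertises, and some NVMe drives ship formatted with 512-byte LBAs even though they also support a 4K format. A hedged way to double-check (the device path is an example; the second command needs nvme-cli installed):

```shell
# Logical and physical sector sizes as the kernel sees them
lsblk -o NAME,LOG-SEC,PHY-SEC /dev/nvme0n1

# LBA formats the namespace actually supports (look for the one marked "in use")
sudo nvme id-ns /dev/nvme0n1 -H | grep "LBA Format"
```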
What kind of data are you pushing to PostgreSQL? If your data compresses well, you'll write a lot of data with few IOPS and your performance will be great, but if PostgreSQL is pushing data that cannot be compressed (encrypted, blobs that are already compressed, etc.), then you'll just be wasting CPU.
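One way to check whether lz4 is actually paying off on a given workload is the dataset's compressratio property (the dataset name here assumes the zfs create step above); a ratio close to 1.00x suggests the CPU spent on compression is mostly wasted:

```shell
# Achieved compression ratio for the Postgres dataset
zfs get compressratio firstzfs/postgres
```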
With sync=always you are safe against power outages, but performance pays a price.