@saurabhnanda
Last active August 2, 2023 06:06
Tuning Postgres + ZFS

Tuning ZFS + Postgres to outperform EXT4 + Postgres

Please refer to ZFS 2-3x slower than EXT4 to see how ZFS defaults + Postgres defaults severely underperform EXT4 defaults + Postgres defaults (and also to know more about the system on which these benchmarks were performed). This page documents how to tune ZFS + Postgres to give better performance for the tpcb-like benchmark.
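For context, the tpcb-like benchmark is pgbench's built-in default script. As a rough sketch of how such a run is invoked (the scale factor, duration, and database name below are illustrative assumptions, not the actual values used for these numbers):

# one-time initialization of the pgbench tables at scale factor 1000
pgbench -i -s 1000 benchdb
# run the default tpcb-like script: 24 clients, 8 worker threads, 5 minutes
pgbench -c 24 -j 8 -T 300 benchdb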

BIG FAT WARNING

Please do not copy these settings blindly, because I myself am not clear on why/how these settings had the impact they did. For example, I cannot explain why full_page_writes=off on its own did not give much of a boost, and neither did an optimized PG configuration on its own. However, putting the two together gave a 2-4x boost over the baseline numbers.

Benchmark results

You'd want to compare the numbers in the first row and the last row to get a sense of the maximum performance gain that I could achieve.

+------------------------+-------+--------+--------+--------+--------+--------+--------+
| client                 | 1     | 4      | 8      | 12     | 24     | 48     | 96     |
+------------------------+-------+--------+--------+--------+--------+--------+--------+
| EXT4 defaults          | 719   | 1816   | 2835   | 3576   | 4531   | 5485   | 6876   |
| cold (warm)            | (763) | (1982) | (3286) | (3889) | (5071) | (5604) | (6540) |
+------------------------+-------+--------+--------+--------+--------+--------+--------+
| ZFS benchmarks                                                                       |
+--------------------------------------------------------------------------------------+
| PG defaults            | 302   | 808    | 1138   | 1287   | 1619   | 2575   | 4040   |
| cold (warm)            | (309) | (836)  | (1140) | (1307) | (1651) | (2621) | (4301) |
+------------------------+-------+--------+--------+--------+--------+--------+--------+
| Optimized PG conf [1]  | 327   | 1365   | 2291   |        |        |        |        |
| full_page_writes=on    | (334) | (1331) | (2049) |        |        |        |        |
| cold (warm)            |       |        |        |        |        |        |        |
+------------------------+-------+--------+--------+--------+--------+--------+--------+
| PG defaults            |       | 878    |        | 1334   |        |        |        |
| + kernel tuning [2]    |       | (882)  |        | (1363) |        |        |        |
| cold (warm)            |       |        |        |        |        |        |        |
+------------------------+-------+--------+--------+--------+--------+--------+--------+
| PG defaults            |       | 957    |        | 1372   | 2130   |        |        |
| + full_page_writes=off |       | (959)  |        | (1399) | (2154) |        |        |
| + kernel tuning [2]    |       |        |        |        |        |        |        |
| cold (warm)            |       |        |        |        |        |        |        |
+------------------------+-------+--------+--------+--------+--------+--------+--------+
| Optimized PG conf [1]  |       | 2335   |        | 4751   | 6969   | 7627   | 8429   |
| + full_page_writes=off |       | (2488) |        | (5072) | (5667) | (6485) | (7720) |
| + kernel tuning [2]    |       |        |        |        |        |        |        |
| cold (warm)            |       |        |        |        |        |        |        |
+------------------------+-------+--------+--------+--------+--------+--------+--------+

Explanation of the benchmark results

ZFS creation and settings

# ashift=12 = 4K sectors (2^12), appropriate for most modern NVMe drives
zpool create -f -o ashift=12 -m /firstzfs firstzfs /dev/nvme1n1p2

# recordsize=8k matches Postgres's 8kB block size; lz4 compression is cheap on CPU;
# logbias=throughput optimizes for bandwidth over commit latency
zfs create \
  -o mountpoint=/firstzfs/postgres \
  -o atime=off \
  -o canmount=on \
  -o compression=lz4 \
  -o quota=100G \
  -o recordsize=8k \
  -o logbias=throughput \
  firstzfs/postgres
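After creation, it's worth confirming the dataset actually picked up the intended properties; for example:

# confirm the tuning properties took effect on the new dataset
zfs get recordsize,compression,logbias,atime,quota firstzfs/postgres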

Optimized PG configuration

  • Give more memory to PG for shared_buffers
  • Tweak the IO parameters for SSDs
max_connections = 100
shared_buffers = 16GB
effective_cache_size = 48GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 41943kB
min_wal_size = 1GB
max_wal_size = 2GB
max_worker_processes = 8
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
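Most of these settings take effect on a reload, but shared_buffers (among others) needs a full restart. A minimal sketch, assuming a systemd-managed Postgres (the service name may differ on your distro):

# restart so that shared_buffers takes effect, then sanity-check the running values
systemctl restart postgresql
psql -c "SHOW shared_buffers;" -c "SHOW random_page_cost;"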

Disabling full page writes on Postgres

To be honest, I don't completely understand what this really does, but apparently it has a big positive impact for Postgres + ZFS. (The usual reasoning is that full_page_writes protects against torn pages from partial writes, and ZFS's copy-on-write design already rules out torn pages, which makes the protection redundant there.)

full_page_writes=off
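This parameter can be changed without a restart; a reload is enough. One way to do it, sketched with ALTER SYSTEM (which writes to postgresql.auto.conf):

psql -c "ALTER SYSTEM SET full_page_writes = off;"
psql -c "SELECT pg_reload_conf();"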

Kernel tuning

  • Set the CPU governor to performance. I noticed that multiple machines had this set to powersave instead, which reduces the CPU speed during periods of inactivity and raises it again when required, but only after a slight delay. In an old benchmark (not reported above), I think I saw an improvement of ~5% from this tweak alone.
cpufreq-set -g performance
  • Change the drive read-ahead to 4096. Warning: this is recommended in the book PostgreSQL 10 High Performance, but I am not sure how relevant it is for ZFS, because ZFS has 10+ (really!) tunable parameters of its own for prefetching data.
blockdev --setra 4096 /dev/sda
  • Change swappiness. Caveat: I haven't even configured swap on the benchmarking system, but I did this anyway for the sake of completeness. Again, another recommendation from PostgreSQL 10 High Performance.
echo 0 > /proc/sys/vm/swappiness
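None of the three tweaks above survives a reboot on its own. As a sketch of one way to persist the swappiness change (the file name here is an arbitrary choice; the governor and read-ahead settings would similarly need a boot-time unit or udev rule):

# persist vm.swappiness across reboots, then reload all sysctl settings now
echo 'vm.swappiness = 0' > /etc/sysctl.d/99-pgbench.conf
sysctl --system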
@thoro commented Feb 1, 2019

Read up on the postgres options here:
https://www.postgresql.org/docs/9.5/runtime-config-wal.html

With full_page_writes=off you are effectively reducing the amount of data that has to be committed to ZFS, therefore you get faster commit performance (i.e. the WAL files are smaller, so there is less fsync traffic).

For Postgres you can anyway take data out of the ARC entirely by setting primarycache=metadata, since Postgres's own read cache is VERY effective.

For the sake of completeness, please add 3 additional benchmarks if possible (the ZFS side of these settings is sketched after the list):

full_page_writes=off + ext4

sync=disabled + full_page_writes=on
sync=disabled + full_page_writes=off
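For reference, the ZFS side of those suggestions would be applied to the dataset from the gist roughly like so (note that sync=disabled trades away crash safety, so treat it as benchmark-only):

zfs set primarycache=metadata firstzfs/postgres  # keep only metadata in the ARC
zfs set sync=disabled firstzfs/postgres          # skip synchronous writes: fast, but unsafe on power loss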

@unicolet commented Feb 1, 2019

@saurabhnanda it would be interesting to see whether setting xattr=sa affects the results

@ryao commented Feb 1, 2019

@unicolet PostgreSQL should open files and then keep them open. There is basically zero opportunity for xattr=sa to improve things. It would not hurt to test it, but I would be very surprised if it was found to improve things.

@unicolet commented Feb 1, 2019

@ryao unfortunately I don't have data to back my guess and I'm OOO now, but from my experience Postgres creates quite a lot of files during operation; this doc has a list: https://www.postgresql.org/docs/11/storage-file-layout.html

It would be quite cool if there were a test suite that could be used to simulate different scenarios so we could compare results, something like https://github.com/elastic/rally

@carlesmateo

For the record, have you tried different recordsizes, like 128K and 1M?
And have you tried leaving ashift on auto? Are you sure about the physical sector size of your NVMe device?
My output:
carles@xeon:~$ sudo fdisk /dev/nvme0n1

Welcome to fdisk (util-linux 2.31.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Command (m for help): p
Disk /dev/nvme0n1: 477 GiB, 512110190592 bytes, 1000215216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos

What kind of data are you pushing to PostgreSQL? If your data compresses a lot, you'll write a lot of data with few IOPS and your performance will be great, but if PostgreSQL is pushing data that cannot be compressed (encrypted, blobs that are already compressed, etc.), then you'll be wasting CPU.
With sync=always you are safe against power outages, but performance pays a price.
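For what it's worth, a quick way to check whether compression is actually paying off on the dataset used in this gist:

zfs get compressratio firstzfs/postgres  # a ratio near 1.00x means the lz4 CPU cost buys nothing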

@kraduk commented Mar 25, 2021

A bit late to the party I know, but it's worth mentioning that the ZFS recordsize refers to the maximum record size, not the actual size; this is a very common misconception. Setting it to 32K vs 128K vs 1M just restricts or expands the block-size options ZFS has to play with, and is unlikely to result in read amplification. E.g. you can see below the spread of blocks on a pool that has only 128K set as its max recordsize.

root@4:~# zdb -S rpool
Simulated DDT histogram:

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    1.65M    198G    115G    115G    1.65M    198G    115G    115G
     2     344K   40.9G   8.11G   8.15G     694K   82.1G   16.3G   16.4G
     4    1.01K   31.5M   12.8M   14.3M    4.56K    140M   57.9M   64.7M
     8      110   5.29M   4.22M   4.35M    1.09K   52.9M   40.9M   42.2M
    16       19    566K     93K    112K      388   12.4M   1.91M   2.27M
    32       28   2.18M    504K    524K    1.50K    135M   30.5M   31.3M
    64       22    298K     45K     88K    2.03K   25.2M   4.01M   8.11M
   128       51     51K     51K    204K    8.92K   8.74M   8.74M   35.7M
 Total    1.99M    239G    123G    124G    2.35M    281G    132G    132G
