I ran some experiments with varying recordsizes, file sizes, and compression settings. The files were .csv, representative of a simple schema:
full_name,external_id,last_modified
'Past, Gabrielle',40605,'2006-07-09 23:17:20'
'Vachil, Corry',44277,'1996-09-05 05:12:44'
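The generator itself isn't shown here; a minimal sketch that produces rows in this shape (the name and values below are placeholders, not the real data) might look like:

```shell
# Hypothetical sketch: emit a header plus N rows matching the schema above.
rows=75
printf 'full_name,external_id,last_modified\n' > small_1.csv
for i in $(seq 1 "$rows"); do
  printf "'Doe, Jane %s',%s,'2006-07-09 23:17:20'\n" "$i" $((40000 + i))
done >> small_1.csv
wc -l small_1.csv   # header + 75 rows
```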
The files were all generated on an ext4 filesystem. There were three sets of five files, with 75, 100,000, and 1,000,000 rows each, resulting in the following sizes:
❯ find . -name '*small*.csv' -exec du -bc {} + | \
awk 'END {printf "%s %.2f %s\n", "Average file size:", ($1 / (NR-1) / 1024), "KiB"}'
Average file size: 3.13 KiB
# command repeated for `medium`
Average file size: 4410.09 KiB
# command repeated for `large`
Average file size: 45078.84 KiB
A dataset was then created; the underlying zpool has `ashift=12`, and is on a 3x3 RAIDZ1 of spinning disks with a 4K sector size.
❯ sudo zpool get ashift
NAME PROPERTY VALUE SOURCE
tank ashift 12 local
❯ sudo zfs get compression,recordsize tank/foobar
NAME PROPERTY VALUE SOURCE
tank/foobar compression off local
tank/foobar recordsize 128K local
Between each run, the files were deleted and rsync'd over from the ext4 filesystem.
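For reference, each run followed roughly this shape. This is a hedged sketch: the dataset name comes from the output above, the mount and source paths are assumptions, and `RUN=echo` makes it a dry run that only prints the commands.

```shell
# Sketch of one experiment run; set RUN= (empty) to actually execute.
RUN=echo
$RUN sudo zfs set recordsize=4K compression=off tank/foobar
$RUN rm -f /tank/foobar/*.csv                   # target path is an assumption
$RUN rsync -a /mnt/ext4/small/ /tank/foobar/    # source path is an assumption
```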
Using `recordsize=128K`, the `small` files yielded these results:
❯ du
41 .
❯ ls -ls | awk '{sum = sum + $6} END {print int(sum/1024)}'
15
❯ sudo zdb tank/foobar
# truncated output
Object lvl iblk dblk dsize dnsize lsize %full type
3596 1 128K 3.50K 5.50K 512 3.50K 100.00 ZFS plain file
3597 1 128K 3.50K 5.50K 512 3.50K 100.00 ZFS plain file
3598 1 128K 3.50K 5.50K 512 3.50K 100.00 ZFS plain file
3599 1 128K 3.50K 5.50K 512 3.50K 100.00 ZFS plain file
3600 1 128K 3.50K 5.50K 512 3.50K 100.00 ZFS plain file
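The gap between `dsize` and `lsize` above is the per-file RAIDZ1 parity and padding cost; as a quick check, using only the numbers zdb reported:

```shell
# 5.50K allocated for 3.50K of logical data => roughly 57% overhead
awk 'BEGIN { printf "%.0f%%\n", (5.5 - 3.5) / 3.5 * 100 }'
```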
Using `recordsize=4K`, the `small` files yielded these results:
❯ du
3 .
❯ ls -ls | awk '{sum = sum + $6} END {print int(sum/1024)}'
15
❯ sudo zdb tank/foobar
# truncated output
Object lvl iblk dblk dsize dnsize lsize %full type
2852 1 128K 3.50K 5.50K 512 3.50K 100.00 ZFS plain file
2853 1 128K 3.50K 5.50K 512 3.50K 100.00 ZFS plain file
2854 1 128K 3.50K 5.50K 512 3.50K 100.00 ZFS plain file
2855 1 128K 3.50K 5.50K 512 3.50K 100.00 ZFS plain file
2856 1 128K 3.50K 5.50K 512 3.50K 100.00 ZFS plain file
With `recordsize=2K`, the `small` files yielded these results:
❯ du
121 .
❯ ls -ls | awk '{sum = sum + $6} END {print int(sum/1024)}'
15
❯ sudo zdb tank/foobar
# truncated output
Object lvl iblk dblk dsize dnsize lsize %full type
3591 2 128K 2K 21.5K 512 4K 100.00 ZFS plain file
3592 2 128K 2K 21.5K 512 4K 100.00 ZFS plain file
3593 2 128K 2K 21.5K 512 4K 100.00 ZFS plain file
3594 2 128K 2K 21.5K 512 4K 100.00 ZFS plain file
3595 2 128K 2K 21.5K 512 4K 100.00 ZFS plain file
Setting `compression=on` had no effect for these files at any recordsize.
Using `recordsize=128K`, the `medium` files yielded these results:
❯ du
22446 .
❯ ls -ls | awk '{sum = sum + $6} END {print int(sum/1024)}'
22050
❯ sudo zdb tank/foobar
# truncated output
Object lvl iblk dblk dsize dnsize lsize %full type
3621 2 128K 128K 4.38M 512 4.38M 100.00 ZFS plain file
3622 2 128K 128K 4.38M 512 4.38M 100.00 ZFS plain file
3623 2 128K 128K 4.38M 512 4.38M 100.00 ZFS plain file
3624 2 128K 128K 4.38M 512 4.38M 100.00 ZFS plain file
3625 2 128K 128K 4.38M 512 4.38M 100.00 ZFS plain file
Using `recordsize=4K`, the `medium` files yielded these results:
❯ du
29973 .
❯ ls -ls | awk '{sum = sum + $6} END {print int(sum/1024)}'
22050
❯ sudo zdb tank/foobar
# truncated output
Object lvl iblk dblk dsize dnsize lsize %full type
3626 3 128K 4K 5.84M 512 4.31M 100.00 ZFS plain file
3627 3 128K 4K 5.85M 512 4.31M 100.00 ZFS plain file
3628 3 128K 4K 5.85M 512 4.31M 100.00 ZFS plain file
3629 3 128K 4K 5.85M 512 4.31M 100.00 ZFS plain file
3630 3 128K 4K 5.85M 512 4.31M 100.00 ZFS plain file
Using `recordsize=2K`, the `medium` files yielded these results:
❯ du
31429 .
❯ ls -ls | awk '{sum = sum + $6} END {print int(sum/1024)}'
22050
❯ sudo zdb tank/foobar
# truncated output
Object lvl iblk dblk dsize dnsize lsize %full type
2858 3 128K 2K 11.7M 512 4.31M 100.00 ZFS plain file
3631 3 128K 2K 11.7M 512 4.31M 100.00 ZFS plain file
3632 3 128K 2K 11.7M 512 4.31M 100.00 ZFS plain file
3633 3 128K 2K 11.7M 512 4.31M 100.00 ZFS plain file
3634 3 128K 2K 11.7M 512 4.31M 100.00 ZFS plain file
Setting `compression=on` had no effect for these files with `recordsize=2K`. At `recordsize=4K`, it yielded this:
❯ du
29963 .
❯ ls -ls | awk '{sum = sum + $6} END {print int(sum/1024)}'
22050
❯ sudo zdb tank/foobar
# truncated output
Object lvl iblk dblk dsize dnsize lsize %full type
2859 3 128K 4K 5.84M 512 4.31M 100.00 ZFS plain file
3640 3 128K 4K 5.85M 512 4.31M 100.00 ZFS plain file
3641 3 128K 4K 5.85M 512 4.31M 100.00 ZFS plain file
3642 3 128K 4K 5.85M 512 4.31M 100.00 ZFS plain file
3643 3 128K 4K 5.84M 512 4.31M 100.00 ZFS plain file
With `compression=on` and `recordsize=128K`, the `medium` files yielded this:
❯ du
16611 .
❯ ls -ls | awk '{sum = sum + $6} END {print int(sum/1024)}'
22050
❯ sudo zdb tank/foobar
# truncated output
Object lvl iblk dblk dsize dnsize lsize %full type
3644 2 128K 128K 3.24M 512 4.38M 100.00 ZFS plain file
3645 2 128K 128K 3.24M 512 4.38M 100.00 ZFS plain file
3646 2 128K 128K 3.24M 512 4.38M 100.00 ZFS plain file
3647 2 128K 128K 3.24M 512 4.38M 100.00 ZFS plain file
3648 2 128K 128K 3.24M 512 4.38M 100.00 ZFS plain file
Jumping up to `recordsize=1M` with `compression=on` yielded this:
❯ du
15610 .
❯ ls -ls | awk '{sum = sum + $6} END {print int(sum/1024)}'
22050
❯ sudo zdb tank/foobar
# truncated output
Object lvl iblk dblk dsize dnsize lsize %full type
3649 2 128K 1M 3.05M 512 5M 100.00 ZFS plain file
3650 2 128K 1M 3.04M 512 5M 100.00 ZFS plain file
3651 2 128K 1M 3.04M 512 5M 100.00 ZFS plain file
3652 2 128K 1M 3.04M 512 5M 100.00 ZFS plain file
3653 2 128K 1M 3.05M 512 5M 100.00 ZFS plain file
Resetting `compression=off` and using `recordsize=128K`, the `large` files yielded these results:
❯ du
225874 .
❯ ls -ls | awk '{sum = sum + $6} END {print int(sum/1024)}'
225394
❯ sudo zdb tank/foobar
# truncated output
Object lvl iblk dblk dsize dnsize lsize %full type
3654 2 128K 128K 44.1M 512 44.1M 100.00 ZFS plain file
3655 2 128K 128K 44.1M 512 44.1M 100.00 ZFS plain file
3656 2 128K 128K 44.1M 512 44.1M 100.00 ZFS plain file
3657 2 128K 128K 44.1M 512 44.1M 100.00 ZFS plain file
3658 2 128K 128K 44.1M 512 44.1M 100.00 ZFS plain file
Using `recordsize=4K`, the `large` files yielded this:
❯ du
305491 .
❯ ls -ls | awk '{sum = sum + $6} END {print int(sum/1024)}'
225394
❯ sudo zdb tank/foobar
# truncated output
Object lvl iblk dblk dsize dnsize lsize %full type
2860 3 128K 4K 59.7M 512 44.0M 100.00 ZFS plain file
3659 3 128K 4K 59.7M 512 44.0M 100.00 ZFS plain file
3660 3 128K 4K 59.7M 512 44.0M 100.00 ZFS plain file
3661 3 128K 4K 59.7M 512 44.0M 100.00 ZFS plain file
3662 3 128K 4K 59.6M 512 44.0M 100.00 ZFS plain file
Using `recordsize=2K`, the `large` files yielded this:
❯ du
609978 .
❯ ls -ls | awk '{sum = sum + $6} END {print int(sum/1024)}'
225394
❯ sudo zdb tank/foobar
# truncated output
Object lvl iblk dblk dsize dnsize lsize %full type
2034 3 128K 2K 119M 512 44.0M 100.00 ZFS plain file
2575 3 128K 2K 119M 512 44.0M 100.00 ZFS plain file
3663 3 128K 2K 119M 512 44.0M 100.00 ZFS plain file
3664 3 128K 2K 119M 512 44.0M 100.00 ZFS plain file
3665 3 128K 2K 119M 512 44.0M 100.00 ZFS plain file
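To put that worst case in perspective, the 2K-recordsize run allocates roughly 2.7x the logical data, computed from the `du` and `ls` totals above (both in KiB):

```shell
awk 'BEGIN { printf "%.2fx\n", 609978 / 225394 }'
```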
Setting `compression=on` had no effect for these files with `recordsize=2K`. At `recordsize=4K`, it yielded this:
❯ du
305150 .
❯ ls -ls | awk '{sum = sum + $6} END {print int(sum/1024)}'
225394
❯ sudo zdb tank/foobar
# truncated output
Object lvl iblk dblk dsize dnsize lsize %full type
2036 3 128K 4K 59.6M 512 44.0M 100.00 ZFS plain file
2862 3 128K 4K 59.6M 512 44.0M 100.00 ZFS plain file
3669 3 128K 4K 59.6M 512 44.0M 100.00 ZFS plain file
3670 3 128K 4K 59.6M 512 44.0M 100.00 ZFS plain file
3671 3 128K 4K 59.6M 512 44.0M 100.00 ZFS plain file
With `compression=on` and `recordsize=128K`, the `large` files yielded this:
❯ du
169070 .
❯ ls -ls | awk '{sum = sum + $6} END {print int(sum/1024)}'
225394
❯ sudo zdb tank/foobar
# truncated output
Object lvl iblk dblk dsize dnsize lsize %full type
2863 2 128K 128K 33.0M 512 44.1M 100.00 ZFS plain file
3672 2 128K 128K 33.0M 512 44.1M 100.00 ZFS plain file
3673 2 128K 128K 33.0M 512 44.1M 100.00 ZFS plain file
3674 2 128K 128K 33.0M 512 44.1M 100.00 ZFS plain file
3675 2 128K 128K 33.0M 512 44.1M 100.00 ZFS plain file
Jumping up to `recordsize=1M` with `compression=on` yielded this:
❯ du
161243 .
❯ ls -ls | awk '{sum = sum + $6} END {print int(sum/1024)}'
225394
❯ sudo zdb tank/foobar
# truncated output
Object lvl iblk dblk dsize dnsize lsize %full type
2576 2 128K 1M 31.5M 512 45M 100.00 ZFS plain file
2864 2 128K 1M 31.5M 512 45M 100.00 ZFS plain file
3676 2 128K 1M 31.5M 512 45M 100.00 ZFS plain file
3677 2 128K 1M 31.5M 512 45M 100.00 ZFS plain file
3678 2 128K 1M 31.5M 512 45M 100.00 ZFS plain file
Finally, to see the effect of compression on already-compressed files, I created `.tar.gz` files from each set, then rsync'd them over with `recordsize=1M` and `compression=on`:
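The archives were built the usual way; a sketch of reproducing one (the directory contents here are placeholders, not the real CSV sets):

```shell
# Hypothetical reproduction: tar+gzip one set of files.
mkdir -p small
printf 'full_name,external_id,last_modified\n' > small/small_1.csv
tar -czf small.tar.gz small/
# gzip output starts with the magic bytes 1f 8b
head -c 2 small.tar.gz | od -An -tx1
```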
❯ du
108045 .
❯ ls -ls | awk '{sum = sum + $6} END {print int(sum/1024)}'
108086
❯ sudo zdb tank/foobar
# truncated output
Object lvl iblk dblk dsize dnsize lsize %full type
3683 2 128K 1M 96.2M 512 97M 100.00 ZFS plain file
3684 2 128K 1M 9.29M 512 10M 100.00 ZFS plain file
3685 1 128K 7K 10.5K 512 7K 100.00 ZFS plain file
The difference in file sizes between ext4 and zfs (with `recordsize=1M` and `compression=on`) can be seen here:
# ext4
❯ du *.tar.gz
98580 large.tar.gz
9504 medium.tar.gz
8 small.tar.gz
# zfs
❯ du *.tar.gz
98512 large.tar.gz
9511 medium.tar.gz
11 small.tar.gz