Petabyte Scale Storage

Scaling to petabyte levels of storage, and then adding deduplication on top, presents significant challenges in managing multiple layers of technology. There are significant risks that can only be evaluated by building limited test deployments with real datasets and observing how they behave in practice. A high-level concern is that, because of where deduplication sits in the Linux storage stack, distributed parity is effectively useless in conjunction with a Linux solution. The result is a maximum storage efficiency of approximately 33% with Gluster plus deduplication, which works out to approximately $200/TB. This is not the case with a Microsoft solution, where efficiency reaches 66% at seven nodes, for a cost of approximately $100/TB; however, the Microsoft solution presents significant limitations in flexibility.
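
As a rough sanity check on those efficiency and cost figures, the sketch below recomputes them from the node and drive assumptions used in the Addendum tables at the end of these notes (60 drives per node, 12TB drives, replica 2 on top of RAIDZ2 for the Gluster case, and the drive/node prices from those tables). The function names are mine; nothing here reflects an actual tool.

```python
# Back-of-the-envelope check of the ~33% / ~$200/TB (Gluster + VDO, replica 2 on
# top of RAIDZ2) and ~66% / ~$100/TB (Storage Spaces dual parity, 7 nodes) figures.
# All inputs are assumptions taken from the Addendum tables, not measured values.

DISKS_PER_NODE = 60
DRIVE_TB = 12

def gluster_replica2_on_raidz2(nodes=4, vdevs=8, data_disks=5,
                               drive_cost=465, node_cost=20_000):
    raw_tb = nodes * DISKS_PER_NODE * DRIVE_TB            # 2,880 TB raw
    node_usable = vdevs * data_disks * DRIVE_TB           # 480 TB/node after RAIDZ2 + spares
    usable_tb = node_usable * nodes / 2                   # replica 2 halves it -> 960 TB
    cost = nodes * (DISKS_PER_NODE * drive_cost + node_cost)
    return usable_tb / raw_tb, cost / usable_tb           # ~0.33, ~$199.58/TB

def storage_spaces_dual_parity(nodes=7, spares_per_node=4, parity_disks=112,
                               drive_cost=380, node_cost=26_000):
    total_disks = nodes * DISKS_PER_NODE                  # 420 disks
    data_disks = total_disks - nodes * spares_per_node - parity_disks  # 280 data disks
    usable_tb = data_disks * DRIVE_TB                     # 3,360 TB usable
    cost = total_disks * drive_cost + nodes * node_cost
    return usable_tb / (total_disks * DRIVE_TB), cost / usable_tb      # ~0.67, ~$101.67/TB

print(gluster_replica2_on_raidz2())
print(storage_spaces_dual_parity())
```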

I have not yet extensively researched CephFS and BeeGFS, but I suspect the end result with both of them will be similar to Gluster.

Most of my notes are based on the documentation provided for Red Hat Gluster Storage 3.4, which I believe is a rebase of upstream 3.10. Upstream GlusterFS 4.1 has just been released and adds a few new API features that will be useful, but nothing in 4.1 changes the basics of implementation or the feature set that matters for a fresh deployment.

Red Hat VDO

  • Red Hat VDO
    • Summary:

      VDO combines three techniques — zero-block elimination, data deduplication, and data compression — to reduce data footprint. The first of these, zero-block elimination, works by eliminating blocks of data consisting entirely of zeros while the second technique, data deduplication, eliminates identical copies of blocks of data that have already been stored. Finally, data compression is applied, which reduces the size of the unique blocks of data stored. By utilizing these techniques, VDO can dramatically increase the efficiency for both storage and network bandwidth utilization.

      • Pros:
        • Deduplication requires 268MB of RAM per 1TB of physical disk; roughly 70GB for a 256TB volume (see the sizing sketch after this list).
      • Limits:
        • Max 256TB per physical volume (Base10 or Base2? Docs do not state, assume Base10)
        • Logical volume size is specified by the admin, who must manage the logical size according to the dedupe rate and fill rate.
      • Unknowns:
        • What is the performance impact of VDO compression?
        • What is the performance impact of VDO Deduplication?
        • How to prevent 80% use on ZFS? REFQUOTA?
          • Is ZFS the best choice for underlying storage in this scenario? Maybe better to use LVM RAID groups with XFS since snapshots may not be needed at all for a backup target.
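
A minimal sizing sketch for the RAM and logical-size bullets above, assuming the 268MB-per-TB figure quoted there and a hypothetical expected dedupe/compression ratio that the admin has to choose (the function names are mine, not part of the VDO tooling):

```python
# VDO sizing sketch. Assumption: the dedupe index needs ~268 MB of RAM per TB of
# physical storage (as cited above); the expected space-savings ratio is a
# planning input the admin has to pick, not something VDO reports up front.

def vdo_index_ram_gb(physical_tb, mb_per_tb=268):
    """Approximate RAM needed by the dedupe index for a given physical size."""
    return physical_tb * mb_per_tb / 1024

def vdo_logical_size_tb(physical_tb, expected_savings_ratio=2.0):
    """Logical size to present, given an assumed dedupe+compression ratio."""
    return physical_tb * expected_savings_ratio

print(vdo_index_ram_gb(256))      # ~67 GB -> the "~70GB for a 256TB volume" above
print(vdo_index_ram_gb(480))      # ~126 GB -> the "~128GB" per 480TB node later on
print(vdo_logical_size_tb(256))   # e.g. present 512 TB logical if we expect 2:1 savings
```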

General Gluster Notes

  • Deduplication with Gluster will NOT be at all effective on Dispersed volumes, because the rotating parity will not always land on the same brick. This means that if one wishes to pair deduplication with Gluster/Ceph, it must be in a replicated mode, which means the maximum storage efficiency drops to approximately 33% in all scenarios.

  • User Serviceable Snapshots are not supported on Erasure Coded (EC) volumes.

    • If Snapshots are a required feature:
      • Dispersed-Distributed (EC) is not a workable solution.
      • Use Arbitrated Replicated Volumes
      • The storage efficiency of EC is lost; we are now looking at a ~33% increase in the number of nodes for the same storage capacity.
  • Supports HA SMB with AD integration through CTDB

  • Gluster Tiering:

    • Hot tier
      • 4x 7.6TB SSDs in LVM RAID-10 w/XFS. (Avoid COW for Scratch FS)
      • Maybe compression on a VDO layer, but no deduplication.
      • Is there any real-world write performance benefit to a Hot Tier?
    • Cold tier
      • ZFS RAID-Z2 8x 5+2
      • Two zpools with 4x RZ2 vdevs per zpool.
      • Write caching is handled by the ZIL/SLOG, one 480GB Optane U.2 drive per zpool.
      • No compression/deduplication on ZFS (Maybe?)
      • VDO layer for compression and deduplication.
      • VDO overlay maxes at 256TB/volume.
      • Side note: Is this the best route? Maybe XFS+VDO per disk, all in a 400+ brick Gluster, would be better? That would also kill any hope of deduplication.
  • Node Specs:

    • 60x 12TB Drives, 56 active, 4 global hotspares
    • 480TB/node after ZFS redundancy + hotspare.
    • Absolute minimum 256GB RAM per node; ~128GB will be needed just for VDO dedupe (see the sizing sketch after these notes).
    • Would need a custom build using the H11SSL-NC (AMD SP3 board) to provide enough PCIe lanes for 100GbE NICs; otherwise you'll bottleneck at ~60Gbps. This also provides a path to up to 1TB of RAM per node, but that may not be needed (and it should help very little on writes).
  • Pitfalls:

    • Avoid SMR drives if possible (most large-capacity Seagate drives, WD Purple, and some others). They may incur significant write penalties due to how they size sectors for writing, and this gets worse on sustained writes as the PMR buffer fills and the drive must switch to SMR-only writes. SMR drives are fine for bursty or low-datarate workloads thanks to the PMR buffer, much like TLC SSDs using an MLC buffer to hide the poor write performance of TLC NAND.

    • Intel motherboards lack PCIe lanes: UP boards have 40-48 PCIe lanes, and DP boards (80-88 PCIe lanes) split the lanes across CPUs, forcing traffic to cross the CPU interconnect. This means performance could become unpredictable depending on the traffic path from HBA to NIC on a per-connection basis.

    • Intel DP systems may incur significant performance penalties because accessing memory across NUMA node boundaries takes longer than accessing memory on the local node. Because Intel processors share the last-level cache between cores on a node, cache contention between nodes is a much greater problem than cache contention within a node.

    • ZFS Deduplication is a minefield:

      • Difficult Constraints:
        • ~5GB of RAM per TB of deduplicated storage.
        • A 400TB node would require 2TB of RAM just for the DDT.
      • Potential Mitigations:
        • Optane could be used for L2ARC; this could relieve RAM pressure, but the performance impact is unknown.
        • Relatively easy to test: add Optane to a low-RAM server, turn on deduplication, and see how performance degrades once the DDT exceeds available RAM (minus the ARC reservation).
    • Virtual Data Optimizer (VDO) volumes are only supported as part of a Red Hat Hyperconverged Infrastructure for Virtualization 2.0 deployment. VDO is not supported with other Red Hat Gluster Storage deployments.

    • Gluster SMB 3 Multi-Channel is STILL in Tech Preview, after two years.

    • Read the entire Known Issues section for the RH Gluster 3.4 release; there are significant bugs.

      • Issues related to Samba
        • BZ#1329718 - Snapshot volumes are read-only. All snapshots are made available as directories inside the .snaps directory. Even though snapshots are read-only, the directory attributes of a snapshot are the same as those of the root of the snapshot volume, which can be read-write. This can lead to confusion, because Windows will assume that the snapshots directory is read-write. The "Restore previous version" option in file properties offers an Open option, which opens the file from the corresponding snapshot. If opening the file also creates temp files (for example, with Microsoft Word files), the open will fail, because temp file creation fails on the read-only snapshot volume.
        • BZ#1300572 - Due to a bug in the Linux CIFS client, SMB2.0+ connections from Linux to Red Hat Gluster Storage currently will not work properly. SMB1 connections from Linux to Red Hat Gluster Storage, and all connections with supported protocols from Windows continue to work.
          • Linux would probably be using NFS anyways.
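
As a rough comparison of the memory footprints behind the node spec and the ZFS deduplication concern above, here is a small sketch using the 268MB/TB (VDO index) and ~5GB/TB (ZFS DDT) figures cited in these notes; the variable names are mine.

```python
# Memory-footprint sketch for the 60x 12TB node spec above. Assumptions: 8x
# RAIDZ2 (5+2) vdevs per node, ~268 MB of RAM per TB for the VDO index, and
# ~5 GB of RAM per TB of deduplicated data for the ZFS DDT (figures cited above).

DRIVE_TB = 12
VDEVS, DATA_DISKS_PER_VDEV = 8, 5

node_usable_tb = VDEVS * DATA_DISKS_PER_VDEV * DRIVE_TB   # 480 TB after RAIDZ2 + spares

vdo_index_ram_gb = node_usable_tb * 268 / 1024            # ~126 GB -> the "~128GB" above
zfs_ddt_ram_tb = 400 * 5 / 1024                           # ~2 TB of RAM for a 400 TB node

print(node_usable_tb, round(vdo_index_ram_gb), round(zfs_ddt_ram_tb, 2))
```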

Potential Alternatives

Addendum:

  • Storage Spaces Efficiency table:

    | Size (TB, Base10) | Efficiency | $/Terabyte | Disks/Node | Drive Size (TB) | Nodes | Hotspares/Node | Data Disks | Hotspares | Parity Disks | Total Disks | Drive Cost | Drive Total | Node Cost | Node Total | Total Cost |
    | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
    | 672 | 31.11% | $268.75 | 60 | 12 | 3 | 4 | 56 | 12 | 112 | 180 | $570.00 | $102,600.00 | $26,000.00 | $78,000.00 | $180,600.00 |
    | 1344 | 46.67% | $179.17 | 60 | 12 | 4 | 4 | 112 | 16 | 112 | 240 | $570.00 | $136,800.00 | $26,000.00 | $104,000.00 | $240,800.00 |
    | 2016 | 56.00% | $133.68 | 60 | 12 | 5 | 4 | 168 | 20 | 112 | 300 | $465.00 | $139,500.00 | $26,000.00 | $130,000.00 | $269,500.00 |
    | 2688 | 62.22% | $120.31 | 60 | 12 | 6 | 4 | 224 | 24 | 112 | 360 | $465.00 | $167,400.00 | $26,000.00 | $156,000.00 | $323,400.00 |
    | 3360 | 66.67% | $101.67 | 60 | 12 | 7 | 4 | 280 | 28 | 112 | 420 | $380.00 | $159,600.00 | $26,000.00 | $182,000.00 | $341,600.00 |
  • Gluster Efficiency table:

    | Size (TB, Base10) | Efficiency | $/Terabyte | Disks/Node | Drive Size (TB) | Raw/Node (TB) | Data Disks/vdev | Parity Disks/vdev | vdevs | Hotspares | Node Cap (TB) | VDO Capacity (TB) | VDO (TiB) | Gluster Nodes | Data Bricks | Resil Bricks | Size (PB, Base2) | Drive Cost | Drive Total | Node Cost | Node Total | Total Cost |
    | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
    | 1120 | 33% | $193.57 | 60 | 14 | 840 | 5 | 2 | 8 | 4 | 560 | 280 | 254.66 | 4 | 4 | 4 | 0.99 | $570.00 | $136,800.00 | $20,000.00 | $80,000.00 | $216,800.00 |
    | 2240 | 44% | $145.18 | 60 | 14 | 840 | 5 | 2 | 8 | 4 | 560 | 280 | 254.66 | 6 | 8 | 4 | 1.99 | $570.00 | $205,200.00 | $20,000.00 | $120,000.00 | $325,200.00 |
    | 1792 | 36% | $181.47 | 60 | 14 | 840 | 4 | 3 | 8 | 4 | 448 | 224 | 203.73 | 6 | 8 | 4 | 1.59 | $570.00 | $205,200.00 | $20,000.00 | $120,000.00 | $325,200.00 |
    | 960 | 33% | $199.58 | 60 | 12 | 720 | 5 | 2 | 8 | 4 | 480 | 240 | 218.28 | 4 | 4 | 4 | 0.85 | $465.00 | $111,600.00 | $20,000.00 | $80,000.00 | $191,600.00 |
    | 1920 | 44% | $149.69 | 60 | 12 | 720 | 5 | 2 | 8 | 4 | 480 | 240 | 218.28 | 6 | 8 | 4 | 1.71 | $465.00 | $167,400.00 | $20,000.00 | $120,000.00 | $287,400.00 |
    | 1536 | 36% | $187.11 | 60 | 12 | 720 | 4 | 3 | 8 | 4 | 384 | 192 | 174.62 | 6 | 8 | 4 | 1.36 | $465.00 | $167,400.00 | $20,000.00 | $120,000.00 | $287,400.00 |
    | 1600 | 44% | $160.50 | 60 | 10 | 600 | 5 | 2 | 8 | 4 | 400 | 200 | 181.90 | 6 | 8 | 4 | 1.42 | $380.00 | $136,800.00 | $20,000.00 | $120,000.00 | $256,800.00 |
    | 1280 | 36% | $200.63 | 60 | 10 | 600 | 4 | 3 | 8 | 4 | 320 | 160 | 145.52 | 6 | 8 | 4 | 1.14 | $380.00 | $136,800.00 | $20,000.00 | $120,000.00 | $256,800.00 |
    | 1280 | 44% | $172.50 | 60 | 8 | 480 | 5 | 2 | 8 | 4 | 320 | 160 | 145.52 | 6 | 8 | 4 | 1.14 | $280.00 | $100,800.00 | $20,000.00 | $120,000.00 | $220,800.00 |
    | 1024 | 36% | $215.63 | 60 | 8 | 480 | 4 | 3 | 8 | 4 | 256 | 128 | 116.42 | 6 | 8 | 4 | 0.91 | $280.00 | $100,800.00 | $20,000.00 | $120,000.00 | $220,800.00 |
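
The capacity, Efficiency, and $/Terabyte columns in the Gluster table can be reproduced from the node geometry. The sketch below is my reconstruction from the numbers above (two zpools/bricks per node, one VDO volume per brick), not the original spreadsheet formulas.

```python
# Reconstruction of a Gluster-table row from node geometry (my reading of the
# columns above: each node is two zpools, each zpool is one brick with a VDO
# volume on top, and usable capacity is data bricks times per-brick capacity).

def gluster_row(nodes, drive_tb, drive_cost, data_bricks,
                vdevs=8, data_disks=5, disks_per_node=60, node_cost=20_000):
    node_cap = vdevs * data_disks * drive_tb        # per-node TB after RAIDZ2 + hotspares
    brick_cap = node_cap / 2                        # two bricks (zpools) per node
    vdo_base2 = brick_cap * 1000**4 / 1024**4       # TiB, to check against VDO's 256TB limit
    usable = data_bricks * brick_cap
    raw = nodes * disks_per_node * drive_tb
    cost = nodes * (disks_per_node * drive_cost + node_cost)
    return usable, usable / raw, cost / usable, round(vdo_base2, 2)

# First row of the Gluster table: 4 nodes, 14TB drives, 4 data bricks.
print(gluster_row(4, 14, 570, 4))   # (1120, ~0.33, ~$193.57, 254.66)
```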