Petabyte Scale Storage

Scaling to petabyte levels of storage, and then adding deduplication on top, presents significant challenges in managing multiple layers of technology. There are significant risks that can only be evaluated by building limited test deployments with real datasets and observing how they behave in practice. A high-level concern is that, because of where deduplication sits in the Linux storage stack, distributed parity is effectively useless in conjunction with a Linux solution. The result is a maximum storage efficiency of approximately 33% with Gluster plus deduplication, which works out to approximately $200/TB. This is not the case with a Microsoft solution, where efficiency reaches 66% at seven nodes, for a cost of approximately $100/TB; however, the Microsoft solution presents significant limitations in flexibility.
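
As a rough sanity check on those efficiency and cost figures, the sketch below recomputes them from the node and drive assumptions used in the Addendum tables at the end of these notes (60 drives per node, 12TB drives, replica 2 on top of RAIDZ2 for the Gluster case, and the drive/node prices from those tables). The function names are mine; nothing here reflects an actual tool.

```python
# Back-of-the-envelope check of the ~33% / ~$200/TB (Gluster + VDO, replica 2 on
# top of RAIDZ2) and ~66% / ~$100/TB (Storage Spaces dual parity, 7 nodes) figures.
# All inputs are assumptions taken from the Addendum tables, not measured values.

DISKS_PER_NODE = 60
DRIVE_TB = 12

def gluster_replica2_on_raidz2(nodes=4, vdevs=8, data_disks=5,
                               drive_cost=465, node_cost=20_000):
    raw_tb = nodes * DISKS_PER_NODE * DRIVE_TB            # 2,880 TB raw
    node_usable = vdevs * data_disks * DRIVE_TB           # 480 TB/node after RAIDZ2 + spares
    usable_tb = node_usable * nodes / 2                   # replica 2 halves it -> 960 TB
    cost = nodes * (DISKS_PER_NODE * drive_cost + node_cost)
    return usable_tb / raw_tb, cost / usable_tb           # ~0.33, ~$199.58/TB

def storage_spaces_dual_parity(nodes=7, spares_per_node=4, parity_disks=112,
                               drive_cost=380, node_cost=26_000):
    total_disks = nodes * DISKS_PER_NODE                  # 420 disks
    data_disks = total_disks - nodes * spares_per_node - parity_disks  # 280 data disks
    usable_tb = data_disks * DRIVE_TB                     # 3,360 TB usable
    cost = total_disks * drive_cost + nodes * node_cost
    return usable_tb / (total_disks * DRIVE_TB), cost / usable_tb      # ~0.67, ~$101.67/TB

print(gluster_replica2_on_raidz2())
print(storage_spaces_dual_parity())
```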

I have not yet extensively researched CephFS and BeeGFS, but I suspect the end result with both of them will be similar to Gluster.

Most of my notes are based on the documentation provided for Red Hat Gluster Storage 3.4, which I believe is a rebase of upstream 3.10. Upstream GlusterFS 4.1 has just been released and adds a few new API features that will be useful, but nothing in 4.1 changes the basics of implementation or the feature set that matters for a fresh deployment.

Red Hat VDO

  • Red Hat VDO
    • Summary:

      VDO combines three techniques — zero-block elimination, data deduplication, and data compression — to reduce data footprint. The first of these, zero-block elimination, works by eliminating blocks of data consisting entirely of zeros while the second technique, data deduplication, eliminates identical copies of blocks of data that have already been stored. Finally, data compression is applied, which reduces the size of the unique blocks of data stored. By utilizing these techniques, VDO can dramatically increase the efficiency for both storage and network bandwidth utilization.

      • Pros:
        • Deduplication requires 268MB of RAM per 1TB of physical disk; roughly 70GB for a 256TB volume (see the sizing sketch after this list).
      • Limits:
        • Max 256TB per physical volume (Base10 or Base2? Docs do not state, assume Base10)
        • Logical volume size is specified by the admin, who must manage the logical size according to the dedupe rate and fill rate.
      • Unknowns:
        • What is the performance impact of VDO compression?
        • What is the performance impact of VDO Deduplication?
        • How to prevent 80% use on ZFS? REFQUOTA?
          • Is ZFS the best choice for underlying storage in this scenario? Maybe better to use LVM RAID groups with XFS since snapshots may not be needed at all for a backup target.
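
A minimal sizing sketch for the RAM and logical-size bullets above, assuming the 268MB-per-TB figure quoted there and a hypothetical expected dedupe/compression ratio that the admin has to choose (the function names are mine, not part of the VDO tooling):

```python
# VDO sizing sketch. Assumption: the dedupe index needs ~268 MB of RAM per TB of
# physical storage (as cited above); the expected space-savings ratio is a
# planning input the admin has to pick, not something VDO reports up front.

def vdo_index_ram_gb(physical_tb, mb_per_tb=268):
    """Approximate RAM needed by the dedupe index for a given physical size."""
    return physical_tb * mb_per_tb / 1024

def vdo_logical_size_tb(physical_tb, expected_savings_ratio=2.0):
    """Logical size to present, given an assumed dedupe+compression ratio."""
    return physical_tb * expected_savings_ratio

print(vdo_index_ram_gb(256))      # ~67 GB -> the "~70GB for a 256TB volume" above
print(vdo_index_ram_gb(480))      # ~126 GB -> the "~128GB" per 480TB node later on
print(vdo_logical_size_tb(256))   # e.g. present 512 TB logical if we expect 2:1 savings
```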

General Gluster Notes

  • Deduplication with Gluster will NOT be at all effective on Dispersed volumes, because the rotating parity will not always land on the same brick. This means that if one wishes to pair deduplication with Gluster/Ceph, it must be in a replicated mode, which means the maximum storage efficiency drops to approximately 33% in all scenarios.

  • User Serviceable Snapshots are not supported on Erasure Coded (EC) volumes.

    • If Snapshots are a required feature:
      • Dispersed-Distributed (EC) is not a workable solution.
      • Use Arbitrated Replicated Volumes
      • The storage efficiency of EC is lost; we are now looking at a ~33% increase in the number of nodes for the same storage capacity.
  • Supports HA SMB with AD integration through CTDB

  • Gluster Tiering:

    • Hot tier
      • 4x 7.6TB SSDs in LVM RAID-10 w/XFS. (Avoid COW for Scratch FS)
      • Maybe compression on a VDO layer, but no deduplication.
      • Is there any real-world write performance benefit to a Hot Tier?
    • Cold tier
      • ZFS RAID-Z2 8x 5+2
      • Two zpools with 4x RZ2 vdevs per zpool.
      • Write caching is handled by the ZIL/SLOG, one 480GB Optane U.2 drive per zpool.
      • No compression/deduplication on ZFS (Maybe?)
      • VDO layer for compression and deduplication.
      • VDO overlay maxes at 256TB/volume.
      • Side note: Is this the best route? Maybe XFS+VDO per disk, all in a 400+ brick Gluster, would be better? That would also kill any hope of deduplication.
  • Node Specs:

    • 60x 12TB Drives, 56 active, 4 global hotspares
    • 480TB/node after ZFS redundancy + hotspare.
    • Absolute minimum 256GB RAM per node; ~128GB will be needed just for VDO dedupe (see the sizing sketch after these notes).
    • Would need a custom build using the H11SSL-NC (AMD SP3 board) to provide enough PCIe lanes for 100GbE NICs; otherwise you'll bottleneck at ~60Gbps. This also provides a path to up to 1TB of RAM per node, but that may not be needed (and it should help very little on writes).
  • Pitfalls:

    • Avoid SMR drives if possible (most large-capacity Seagate drives, WD Purple, and some others). They may incur significant write penalties due to how they size sectors for writing, and this gets worse on sustained writes as the PMR buffer fills and the drive must switch to SMR-only writes. SMR drives are fine for bursty or low-datarate workloads thanks to the PMR buffer, much like TLC SSDs using an MLC buffer to hide the poor write performance of TLC NAND.

    • Intel motherboards lack PCIe lanes: UP boards have 40-48 PCIe lanes, and DP boards (80-88 PCIe lanes) split the lanes across CPUs, forcing traffic to cross the CPU interconnect. This means performance could become unpredictable depending on the traffic path from HBA to NIC on a per-connection basis.

    • Intel DP systems may incur significant performance penalties because accessing memory across NUMA node boundaries takes longer than accessing memory on the local node. Because Intel processors share the last-level cache between cores on a node, cache contention between nodes is a much greater problem than cache contention within a node.

    • ZFS Deduplication is a minefield:

      • Difficult Constraints:
        • ~5GB of RAM per TB of deduplicated storage.
        • A 400TB node would require 2TB of RAM just for the DDT.
      • Potential Mitigations:
        • Optane could be used for L2ARC; this could relieve RAM pressure, but the performance impact is unknown.
        • Relatively easy to test: add Optane to a low-RAM server, turn on deduplication, and see how performance degrades once the DDT exceeds available RAM (minus the ARC reservation).
    • Virtual Data Optimizer (VDO) volumes are only supported as part of a Red Hat Hyperconverged Infrastructure for Virtualization 2.0 deployment. VDO is not supported with other Red Hat Gluster Storage deployments.

    • Gluster SMB 3 Multi-Channel is STILL in Tech Preview, after two years.

    • Read the entire Known Issues section for the RH Gluster 3.4 release; there are significant bugs.

      • Issues related to Samba
        • BZ#1329718 - Snapshot volumes are read-only. All snapshots are made available as directories inside the .snaps directory. Even though snapshots are read-only, the directory attributes of a snapshot are the same as those of the root of the snapshot volume, which can be read-write. This can lead to confusion, because Windows will assume that the snapshots directory is read-write. The "Restore previous version" option in file properties offers an Open option, which opens the file from the corresponding snapshot. If opening the file also creates temp files (for example, with Microsoft Word files), the open will fail, because temp file creation fails on the read-only snapshot volume.
        • BZ#1300572 - Due to a bug in the Linux CIFS client, SMB2.0+ connections from Linux to Red Hat Gluster Storage currently will not work properly. SMB1 connections from Linux to Red Hat Gluster Storage, and all connections with supported protocols from Windows continue to work.
          • Linux would probably be using NFS anyways.
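
As a rough comparison of the memory footprints behind the node spec and the ZFS deduplication concern above, here is a small sketch using the 268MB/TB (VDO index) and ~5GB/TB (ZFS DDT) figures cited in these notes; the variable names are mine.

```python
# Memory-footprint sketch for the 60x 12TB node spec above. Assumptions: 8x
# RAIDZ2 (5+2) vdevs per node, ~268 MB of RAM per TB for the VDO index, and
# ~5 GB of RAM per TB of deduplicated data for the ZFS DDT (figures cited above).

DRIVE_TB = 12
VDEVS, DATA_DISKS_PER_VDEV = 8, 5

node_usable_tb = VDEVS * DATA_DISKS_PER_VDEV * DRIVE_TB   # 480 TB after RAIDZ2 + spares

vdo_index_ram_gb = node_usable_tb * 268 / 1024            # ~126 GB -> the "~128GB" above
zfs_ddt_ram_tb = 400 * 5 / 1024                           # ~2 TB of RAM for a 400 TB node

print(node_usable_tb, round(vdo_index_ram_gb), round(zfs_ddt_ram_tb, 2))
```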

Potential Alternatives

Addendum:

  • Storage Spaces Efficiency table:

    | Size (TB, Base10) | Efficiency | $/Terabyte | Disks/Node | Drive Size (TB) | Nodes | Hotspares/Node | Data Disks | Hotspares | Parity Disks | Total Disks | Drive Cost | Drive Total | Node Cost | Node Total | Total Cost |
    | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
    | 672 | 31.11% | $268.75 | 60 | 12 | 3 | 4 | 56 | 12 | 112 | 180 | $570.00 | $102,600.00 | $26,000.00 | $78,000.00 | $180,600.00 |
    | 1344 | 46.67% | $179.17 | 60 | 12 | 4 | 4 | 112 | 16 | 112 | 240 | $570.00 | $136,800.00 | $26,000.00 | $104,000.00 | $240,800.00 |
    | 2016 | 56.00% | $133.68 | 60 | 12 | 5 | 4 | 168 | 20 | 112 | 300 | $465.00 | $139,500.00 | $26,000.00 | $130,000.00 | $269,500.00 |
    | 2688 | 62.22% | $120.31 | 60 | 12 | 6 | 4 | 224 | 24 | 112 | 360 | $465.00 | $167,400.00 | $26,000.00 | $156,000.00 | $323,400.00 |
    | 3360 | 66.67% | $101.67 | 60 | 12 | 7 | 4 | 280 | 28 | 112 | 420 | $380.00 | $159,600.00 | $26,000.00 | $182,000.00 | $341,600.00 |
  • Gluster Efficiency table:

    | Size (TB, Base10) | Efficiency | $/Terabyte | Disks/Node | Drive Size (TB) | Raw/Node (TB) | Data Disks/vdev | Parity Disks/vdev | vdevs | Hotspares | Node Cap (TB) | VDO Capacity (TB) | VDO (TiB) | Gluster Nodes | Data Bricks | Resil Bricks | Size (PB, Base2) | Drive Cost | Drive Total | Node Cost | Node Total | Total Cost |
    | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
    | 1120 | 33% | $193.57 | 60 | 14 | 840 | 5 | 2 | 8 | 4 | 560 | 280 | 254.66 | 4 | 4 | 4 | 0.99 | $570.00 | $136,800.00 | $20,000.00 | $80,000.00 | $216,800.00 |
    | 2240 | 44% | $145.18 | 60 | 14 | 840 | 5 | 2 | 8 | 4 | 560 | 280 | 254.66 | 6 | 8 | 4 | 1.99 | $570.00 | $205,200.00 | $20,000.00 | $120,000.00 | $325,200.00 |
    | 1792 | 36% | $181.47 | 60 | 14 | 840 | 4 | 3 | 8 | 4 | 448 | 224 | 203.73 | 6 | 8 | 4 | 1.59 | $570.00 | $205,200.00 | $20,000.00 | $120,000.00 | $325,200.00 |
    | 960 | 33% | $199.58 | 60 | 12 | 720 | 5 | 2 | 8 | 4 | 480 | 240 | 218.28 | 4 | 4 | 4 | 0.85 | $465.00 | $111,600.00 | $20,000.00 | $80,000.00 | $191,600.00 |
    | 1920 | 44% | $149.69 | 60 | 12 | 720 | 5 | 2 | 8 | 4 | 480 | 240 | 218.28 | 6 | 8 | 4 | 1.71 | $465.00 | $167,400.00 | $20,000.00 | $120,000.00 | $287,400.00 |
    | 1536 | 36% | $187.11 | 60 | 12 | 720 | 4 | 3 | 8 | 4 | 384 | 192 | 174.62 | 6 | 8 | 4 | 1.36 | $465.00 | $167,400.00 | $20,000.00 | $120,000.00 | $287,400.00 |
    | 1600 | 44% | $160.50 | 60 | 10 | 600 | 5 | 2 | 8 | 4 | 400 | 200 | 181.90 | 6 | 8 | 4 | 1.42 | $380.00 | $136,800.00 | $20,000.00 | $120,000.00 | $256,800.00 |
    | 1280 | 36% | $200.63 | 60 | 10 | 600 | 4 | 3 | 8 | 4 | 320 | 160 | 145.52 | 6 | 8 | 4 | 1.14 | $380.00 | $136,800.00 | $20,000.00 | $120,000.00 | $256,800.00 |
    | 1280 | 44% | $172.50 | 60 | 8 | 480 | 5 | 2 | 8 | 4 | 320 | 160 | 145.52 | 6 | 8 | 4 | 1.14 | $280.00 | $100,800.00 | $20,000.00 | $120,000.00 | $220,800.00 |
    | 1024 | 36% | $215.63 | 60 | 8 | 480 | 4 | 3 | 8 | 4 | 256 | 128 | 116.42 | 6 | 8 | 4 | 0.91 | $280.00 | $100,800.00 | $20,000.00 | $120,000.00 | $220,800.00 |
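
The capacity, Efficiency, and $/Terabyte columns in the Gluster table can be reproduced from the node geometry. The sketch below is my reconstruction from the numbers above (two zpools/bricks per node, one VDO volume per brick), not the original spreadsheet formulas.

```python
# Reconstruction of a Gluster-table row from node geometry (my reading of the
# columns above: each node is two zpools, each zpool is one brick with a VDO
# volume on top, and usable capacity is data bricks times per-brick capacity).

def gluster_row(nodes, drive_tb, drive_cost, data_bricks,
                vdevs=8, data_disks=5, disks_per_node=60, node_cost=20_000):
    node_cap = vdevs * data_disks * drive_tb        # per-node TB after RAIDZ2 + hotspares
    brick_cap = node_cap / 2                        # two bricks (zpools) per node
    vdo_base2 = brick_cap * 1000**4 / 1024**4       # TiB, to check against VDO's 256TB limit
    usable = data_bricks * brick_cap
    raw = nodes * disks_per_node * drive_tb
    cost = nodes * (disks_per_node * drive_cost + node_cost)
    return usable, usable / raw, cost / usable, round(vdo_base2, 2)

# First row of the Gluster table: 4 nodes, 14TB drives, 4 data bricks.
print(gluster_row(4, 14, 570, 4))   # (1120, ~0.33, ~$193.57, 254.66)
```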