Scaling to petabyte levels of storage and then adding deduplication on top presents significant challenges in managing multiple layers of technology. There are significant risks that can only be evaluated by building limited test solutions and datasets to observe real-world behavior. A high-level concern is that, because of where deduplication sits in the stack on the Linux side, distributed parity is useless in conjunction with a Linux solution. The result is a maximum storage efficiency of approximately 33% with Gluster and deduplication, or approximately $200/TB. This is not the case with a Microsoft solution, where efficiency reaches 66% at seven nodes, for a cost of approximately $100/TB. However, the Microsoft solution also presents significant limitations in flexibility.
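The two cost figures follow from dividing a raw per-terabyte cost by the storage efficiency. The sketch below illustrates that arithmetic only. It assumes a raw cost of about $66/TB, which is back-calculated from the figures above rather than taken from a quote, and it treats 33% and 66% as 1/3 and 2/3. The function name is hypothetical.

```python
def usable_cost_per_tb(raw_cost_per_tb: float, efficiency: float) -> float:
    """Cost per usable TB, given raw cost per TB and storage efficiency (0..1]."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    return raw_cost_per_tb / efficiency


# Assumption: raw cost implied by the figures in the text ($200/TB at ~33%).
RAW_COST_PER_TB = 66.0

# ~33% efficiency (Gluster with deduplication, no distributed parity).
gluster_cost = usable_cost_per_tb(RAW_COST_PER_TB, 1 / 3)

# ~66% efficiency (Microsoft solution at seven nodes).
microsoft_cost = usable_cost_per_tb(RAW_COST_PER_TB, 2 / 3)

print(f"Gluster + dedup: ${gluster_cost:.0f}/TB usable")
print(f"Microsoft:       ${microsoft_cost:.0f}/TB usable")
```

Under these assumptions, doubling the efficiency halves the cost per usable terabyte, which is consistent with the approximate $200/TB and $100/TB figures.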
I have not yet extensively researched CephFS and BeeGFS, but I suspect the end result with both of them will be similar to Gluster.
Most of my notes are based on the documentation provided by Red Ha