Skip to content

Instantly share code, notes, and snippets.

@Validark
Created March 14, 2024 00:12
Show Gist options
  • Save Validark/9455d410ccc8e4bfd1c1c5c4fa38f934 to your computer and use it in GitHub Desktop.
Save Validark/9455d410ccc8e4bfd1c1c5c4fa38f934 to your computer and use it in GitHub Desktop.
PDEP & PEXT information across architectures

Here is an overview of which ISA's support these instructions I compiled for an upcoming presentation:

  • BESM Soviet mainframes:
    • BESM-6 (Designed in 1965. Produced 1968-1987) => AUX & APX
  • x86-64:
    • BMI2 (Intel Haswell 2013+, AMD Excavator 2015+, Fast on AMD since Zen 3 2020+) => PDEP & PEXT
  • arm64 / aarch64:
  • PowerPC:
  • RISC-V:
    • Zbe => BDEP & BEXT, but did not make it into ratified B instruction set

As @Eisenwave said, x86-64 performs these operations on general purpose registers, whereas ARM does these in vector registers. PowerPC also uses vector registers. The jury is still out on RISC-V, so I'm not sure what the future holds for them.

Bikeshedding

I learned from reading through the RISC-V B instruction set discussions that apparently the 'P' in 'PDEP' and 'PEXT', meaning parallel, actually refers to the internal implementation of those operations. Which, I think most people should agree, is not useful to a consumer. I heard people mention it could be implemented on a butterfly network or even more directly than that, it doesn't have to be a butterfly network operation. All this to say, I don't think PDEP & PEXT are good names. BDEP and BEXT sound better to me, although BEXT often refers to extracting a single bit I think. Some names like BitScatter/BitGather or BCOMPRESS/BEXPAND were floated, as analogous to the vector instructions that operate at the byte/word/dword/quad granularity (VPEXPAND{B,W,D,Q} & VPCOMPRESS{B,W,D,Q} on x86, XXGENPCVM + VPERM on PowerPC, VIOTA+VRGATHER & VCOMPRESS on RISC-V, lookup-table-or-BDEP-routine+TBL/TBX & COMPACT on ARM, SPLICE can sometimes be useful on ARM as well).

Some use cases the compiler could optimize automatically

pdep

  • succinct "select", i.e. find k'th set bit in a bitstring
  • unset k 1 bits
  • on x86, one day, I imagine that under certain circumstances a compiler would want to convert some vector logic into SWAR logic, to overcome the latency of moving between vector and general purpose registers. If that ever were to occur, PDEP could fill in code like this:
    const byte_indices = comptime switch (builtin.cpu.arch.endian()) {
        .little => @as(@Vector(8, u8), @splat(1)) << std.simd.iota(u3, 8),
        .big => @as(@Vector(8, u8), @splat(0x80)) >> std.simd.iota(u3, 8),
    };
    const splatted = @as(@Vector(8, u8), @splat(x));
    const selector = (splatted & byte_indices) != byte_indices;
    const splatted_lsbs: u64 = @bitCast(@select(u8, selector, @splat(1), @splat(0)));
    // all of the above is equivalent to:
    pdep(x, 0x0101010101010101)
    (I also think I could have used a per-element vector shift, but in context, I used selector again for something else, so it made sense in my real code)

pext

I also have a little bit of ongoing work on some pext-emulation for when the mask is known at compile time. It's still pretty rudimentary, but it's better than nothing.

https://gist.github.com/Validark/40d2df74b87692fe135bbeac14eed50d

I have a few emulation routines, and have plans to add at least one more. It checks the number of bitgroups, and if sufficiently low, it might lower to regular shifts. I also have one where we use multiplication to concentrate the bits into the upper 64-bits, and then move it back to the LSb. I plan on adding another multiplication routine that would use an extra addition to try to move bits into locations where they are far enough away from other bits that a multiplication would be sufficient (if it was not before). E.g. pext(x, 0x8040201008040201) would become (((x & 0x8040201008040201) + (0x8080808080808080 - 0x8040201008040201)) * 0x0002040810204081) >> 56

More:

https://stackoverflow.com/questions/69966389/what-are-assembly-instructions-like-pext-actually-used-for

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment