Skip to content

Instantly share code, notes, and snippets.

@rygorous
Last active November 19, 2017 06:45
Embed
What would you like to do?
FMA3 codegen
So the general idea with FMA3 is that it's designed to have enough degrees of
freedom so you can just have a more general FMA4 op in the IR and pick the right
FMA3 op *very* late (ideally, after register allocation!).
The generalized FMA op looks like this:
dst = FMA ±src0, ±src1, ±src2
where at most one of the src's can be a memory operand. This computes
(±src0 * ±src1) ± src2. The two signs for src0 and src1 combine
into a sign for the multiply result and a sign for the add, so in terms of
instruction encoding we need two bits for signs, for the four possible combinations:
+(src0 * src1) + src2 [VFMADD]
+(src0 * src1) - src2 [VFMSUB]
-(src0 * src1) + src2 [VFNMADD]
-(src0 * src1) - src2 [VFNMSUB]
and as you can see all these options exist in the ISA.
- If dst, src0, src1, src2 are all pair-wise distinct, then we can't generate this
instruction directly. To legalize, we need to insert a MOV to seed one of the
operands (doesn't matter which) into the destination register, after which we're
in the restricted "at most 3 distinct registers" case.
- If dst appears as one of src0/src1/src2, we use the permutation of VFM* that has
the dst register as its first operand. There are 6 permutations of 3 elements;
we can swap the two arguments to multiply without changing the result, so we only
need to encode 3 such permutations, and FMA3 has a sufficient set.
This works even with memory operands (despite the additional constraint that the
memory operand has to be the 3rd source op of the encoded instruction). I don't
remember the details anymore though, but it all works out.
So in practice you just treat it as a FMA4 architecture and only really pick the
instruction you use very late. In some cases you end up having to insert an extra
MOV, but unless all 3 source operands stay live after the operation, you can always
avoid this - provided your register allocator knows that you'd like it to reuse
dead registers ASAP (i.e. ideally within the same instruction) if they appear as
source operands.
----
Explicit enumeration of cases:
1. |{dst, src0, src1, src2}| = 4, i.e. 4 distinct operands - need a temp MOV.
This is true whether there's a memory operand involved or not.
2. VFop dst, dst, src1, src2 (dst=src0 and no mem ops)
VFop dst, src0, dst, src2 (dst=src1 and no mem ops)
VFop dst, dst, [src1], src2 (dst=src0 with mem op on mul)
VFop dst, [src0], dst, src2 (dst=src1 with mem op on mul)
The two pairs are equivalent since src0/src1 commute, so you can swap src0 and src1
operands on the source op if necessary. So without loss of generality, dst=src0.
-> Emit: VFop132 dst, src2, src1
(note src1 is in the mem-op position already)
3. VFop dst, dst, src1, [src2] (dst=src0 with mem op on add)
VFop dst, src0, dst, [src2] (dst=src1 with mem op on add)
Symmetric again, so w.l.o.g. dst=src0.
-> Emit: VFop213 dst, src1, [src2]
4. VFop dst, src0, src1, dst (dst=src2 and no mem ops)
VFop dst, [src0], src1, dst (dst=src2 with mem op on mul)
VFop dst, src0, [src1], dst (dst=src2 with mem op on mul)
Again symmetric, so w.l.o.g. src0 is the memory operand.
-> Emit: VFop231 dst, src1, src0
That covers all cases (unless I missed something, but I don't think so).
So, the actual algorithm is this:
// input: VFop dst, src0, src1, src2; <=1 of the three sources a mem operand.
if (all_4_distinct) {
emit(MOV(dst, src0));
src0 = dst; // now we're in 3-distinct case.
}
if (dst == src1)
swap(src0, src1);
// now either dst=src0 or dst=src2.
if (dst == src0) {
if (is_mem_op(src2))
emit(VFop213(dst, src1, src2));
else
emit(VFop132(dst, src2, src1));
} else {
if (is_mem_op(src0))
emit(VFop231(dst, src1, src0));
else
emit(VFop231(dst, src0, src1));
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment