limitations:
- won't elide memcpys from allocas that are live only within a subscope, e.g., in
  if (c) { S b; g(0, &b); return b; } else { S a; g(&a, &a); return a; }
  only the more frequent of the two branches would have its copy elided. one way
  around this is to hoist the alloca out of the subscopes.
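a minimal sketch of the hoisting workaround (the struct S, the callee g, and the branch shapes are invented for illustration, not taken from the patch):

```c
#include <string.h>

typedef struct { unsigned char large[256]; } S;

/* stand-in for an opaque callee that writes its second argument */
static void g(S *scratch, S *out) {
  if (scratch) memset(scratch->large, 1, sizeof scratch->large);
  memset(out->large, 7, sizeof out->large);
}

/* before: two allocas, each live only within its subscope; at most one
   branch's copy into sret gets elided */
S f_subscope(int c) {
  if (c) { S b; g(0, &b); return b; }
  else   { S a; g(&a, &a); return a; }
}

/* after: the alloca is hoisted so a single NRVO candidate dominates
   both branches and the sret copy can be elided on every path */
S f_hoisted(int c) {
  S r;
  if (c) g(0, &r);
  else   g(&r, &r);
  return r;
}
```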
caveats:
- because it stomps away zero or one alloca, could it hinder later opt passes that rely on allocas?
benefits:
- nounwind functions lose the memcpy
- inlines better
- in theory, inlining would strip away the extra sret memcpy of the inner inlined
  function, but this fails in practice:
  __attribute__((nothrow)) S m(unsigned g) {
    S rv = {0};
    if (g) {
      S rv2;
      g_(&rv2);
      return rv2;
    }
    rv.large[g] = g + 1;
    return rv;
  }
  S n() {
    S k = m(32);
    k.large[445] = 2302;
    return k;
  }
- the overarching goal is to divine the AllocaInst that is most likely to be copied
  to sret. this implies some degree of walking basic blocks, i.e., identifying which
  AllocaInst has the "final say" at each point of the graph.
  this can't be done during the machine passes because the memcpy is likely to have
  been expanded during lowering and isel. on the other hand, even the IR-to-IR
  passes will expand small memcpys.
- only on sret noalias
- only on nounwind functions
- what about recursive functions?
fix memcpy elision during inlining instead of within nounwind fn:
theirs:
a -m-> b
b -m-> c
=>
a -m-> b
a -m-> c
ours:
a -m-> b
...stuff with b
b -m-> c
=> if a is never depped elsewhere
...stuff with a
a -m-> b
a -m-> c
also, if a == c:
a -m-> b
...stuff with b
b -m-> a
=> if a is never depped elsewhere
...stuff with a
a -m-> b
a -m-> a
=> elim ident memcpy
...stuff with a
a -m-> b
=> dse
...stuff with a
if a is depped somewhere after the second m:
a -m-> b
...stuff with b
b -m-> c
... = ... a
=>
...stuff with a
a -m-> b
a -m-> c
... = ... a
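the soundness claim behind this rewrite can be checked concretely. the sketch below compares the two sequences byte-for-byte; the buffers, the size, and the "stuff" step are invented for the test:

```c
#include <string.h>

enum { N = 64 };

/* original shape: a -m-> b, ...stuff with b, b -m-> c */
static void chain_before(unsigned char *a, unsigned char *b, unsigned char *c) {
  memcpy(b, a, N);
  b[3] ^= 0x55;                 /* ...stuff with b */
  memcpy(c, b, N);
}

/* rewritten shape: ...stuff with a, a -m-> b, a -m-> c
   (legal only when a is never depped elsewhere) */
static void chain_after(unsigned char *a, unsigned char *b, unsigned char *c) {
  a[3] ^= 0x55;                 /* ...stuff with a */
  memcpy(b, a, N);
  memcpy(c, a, N);
}
```

note that a itself ends up modified in the rewritten form, which is exactly why the transform requires a to have no other dependents.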
augment the existing mc-mc code:
in all cases, memcopies have the same length.
case 1:
memcpy a -> b
do stuff to b
memcpy b -> c
==> if c == a
do stuff to a
memcpy a -> b
case 2:
memcpy a -> b
do stuff to a
memcpy b -> c
==> if stuff doesn't mod or ref c or mod b
memcpy a -> b
memcpy a -> c
do stuff to a
==> dse
memcpy a -> c
do stuff to a
case 3:
memcpy a -> b
do stuff to c
memcpy b -> c
==> if stuff doesn't mod or ref b or mod a
do stuff to c
memcpy a -> b
memcpy a -> c
==> dse
do stuff to c
memcpy a -> c
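case 1 (c == a) can be sanity-checked the same way; the sketch below verifies that rotating the mutation onto a leaves both buffers with identical contents (buffer size and the "stuff" step are made up):

```c
#include <string.h>

enum { N = 32 };

/* case 1 before: memcpy a -> b; do stuff to b; memcpy b -> a */
static void case1_before(unsigned char *a, unsigned char *b) {
  memcpy(b, a, N);
  b[7] = 42;                    /* do stuff to b */
  memcpy(a, b, N);
}

/* case 1 after: do stuff to a; memcpy a -> b */
static void case1_after(unsigned char *a, unsigned char *b) {
  a[7] = 42;                    /* do stuff to a */
  memcpy(b, a, N);
}
```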
Currently, memcpy-memcpy pairs are only considered when there are no mods or refs of either the source or dest memory operands of the examined memcpy:
memcpy(b <- a) ; the "dependee" memcpy
... ; no mod/ref of a, b, or c in between
memcpy(c <- b) ; the examined memcpy
In the above, if b and/or c are mod/refed in the space between the two
memcopies, then the mod/ref-ing instruction closest to the examined memcpy is
matched and the dependee is never seen. If on the other hand only a is
mod/refed in between, then the memcpy pair is recognized but ultimately ignored
because the processMemCpyMemCpyDependence transformation would be invalid:
memcpy(b <- a); *a = 42; memcpy(c <- b)
=>
memcpy(b <- a); *a = 42; memcpy(c <- a)
What this patch does is search harder for memcpy pairs and then match and transform them against three general cases:
Case 1:
memcpy(b <- a); ...; *b = 42; ...; memcpy(a <- b);
=> if a is never mod/refed in between the two memcpys
...; *a = 42; ...; memcpy(b <- a);
Case 2 (essentially the TODO mentioned in processMemCpyMemCpyDependence):
memcpy(b <- a); ...; memcpy(c <- b);
=> if "..." doesn't mod/ref either c or b
memcpy(b <- a); memcpy(c <- a); ...;
Case 3:
memcpy(b <- a); ...; memcpy(c <- b)
=> if "..." doesn't mod/ref b or a
...; memcpy(b <- a); memcpy(c <- b)
Feedback on the soundness of these three cases is eagerly sought.
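Case 2, the implemented one, admits the same byte-level check; here the intervening code deliberately modifies a (only a), which is the situation the current pass gives up on. Buffer size and the mutation are invented for the test:

```c
#include <string.h>

enum { N = 16 };

/* case 2 before: the "..." may mod/ref a, but not b or c */
static void case2_before(unsigned char *a, unsigned char *b, unsigned char *c) {
  memcpy(b, a, N);
  a[0] = 42;                    /* "..." touches only a */
  memcpy(c, b, N);
}

/* case 2 after: both memcpys read a before "..." modifies it */
static void case2_after(unsigned char *a, unsigned char *b, unsigned char *c) {
  memcpy(b, a, N);
  memcpy(c, a, N);
  a[0] = 42;
}
```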
At this time, only case 2 has been implemented because it's the easiest and most useful. For instance:
typedef struct { unsigned char large[65536]; } S;
extern void g_(S *);
S p1(unsigned g) {
S rv = {0};
if (g) {
S rv2;
g_(&rv2);
return rv2;
}
rv.large[g] = g + 1;
return rv;
}
S p0() {
S k = p1(32);
k.large[445] = 2302;
return k;
}
S set(S x, unsigned n) {
x.large[n] = n;
return x;
}
S p() {
S k = p0();
k = set(k, 99);
k.large[22] += 23;
return k;
}
produces, at -O3 (without the patch; extraneous memcopies marked with <===):
define void @p(%struct.S* noalias nocapture sret) local_unnamed_addr #0 {
%2 = alloca %struct.S, align 1
%3 = alloca [22 x i8], align 8
%4 = alloca [76 x i8], align 1
%5 = alloca [345 x i8], align 1
%6 = alloca [65090 x i8], align 2
%7 = getelementptr inbounds [22 x i8], [22 x i8]* %3, i64 0, i64 0
call void @llvm.lifetime.start(i64 22, i8* %7)
%8 = getelementptr inbounds [76 x i8], [76 x i8]* %4, i64 0, i64 0
call void @llvm.lifetime.start(i64 76, i8* %8)
%9 = getelementptr inbounds [345 x i8], [345 x i8]* %5, i64 0, i64 0
call void @llvm.lifetime.start(i64 345, i8* %9)
%10 = getelementptr inbounds [65090 x i8], [65090 x i8]* %6, i64 0, i64 0
call void @llvm.lifetime.start(i64 65090, i8* %10)
%11 = getelementptr inbounds %struct.S, %struct.S* %2, i64 0, i32 0, i64 0
call void @llvm.lifetime.start(i64 65536, i8* %11) #3, !noalias !8
call void @g_(%struct.S* nonnull %2) #3, !noalias !8
call void @llvm.memcpy.p0i8.p0i8.i64(i8* %7, i8* %11, i64 22, i32 1, i1 false) <===
%12 = getelementptr inbounds %struct.S, %struct.S* %2, i64 0, i32 0, i64 22
%13 = load i8, i8* %12, align 1
%14 = getelementptr inbounds %struct.S, %struct.S* %2, i64 0, i32 0, i64 23
call void @llvm.memcpy.p0i8.p0i8.i64(i8* %8, i8* %14, i64 76, i32 1, i1 false) <===
%15 = getelementptr inbounds %struct.S, %struct.S* %2, i64 0, i32 0, i64 100
call void @llvm.memcpy.p0i8.p0i8.i64(i8* %9, i8* %15, i64 345, i32 1, i1 false) <===
%16 = getelementptr inbounds %struct.S, %struct.S* %2, i64 0, i32 0, i64 446
%17 = getelementptr inbounds [65090 x i8], [65090 x i8]* %6, i64 0, i64 0
call void @llvm.memcpy.p0i8.p0i8.i64(i8* %17, i8* %16, i64 65090, i32 1, i1 false) #3 <===
call void @llvm.lifetime.end(i64 65536, i8* %11) #3, !noalias !8
%18 = add i8 %13, 23
%19 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 0
call void @llvm.memcpy.p0i8.p0i8.i64(i8* %19, i8* %7, i64 22, i32 1, i1 false)
%20 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 22
store i8 %18, i8* %20, align 1
%21 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 23
call void @llvm.memcpy.p0i8.p0i8.i64(i8* %21, i8* %8, i64 76, i32 1, i1 false)
%22 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 99
store i8 99, i8* %22, align 1
%23 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 100
call void @llvm.memcpy.p0i8.p0i8.i64(i8* %23, i8* %9, i64 345, i32 1, i1 false)
%24 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 445
store i8 -2, i8* %24, align 1
%25 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 446
call void @llvm.memcpy.p0i8.p0i8.i64(i8* %25, i8* %10, i64 65090, i32 1, i1 false)
call void @llvm.lifetime.end(i64 22, i8* %7)
call void @llvm.lifetime.end(i64 76, i8* %8)
call void @llvm.lifetime.end(i64 345, i8* %9)
call void @llvm.lifetime.end(i64 65090, i8* %10)
ret void
}
With this patch, the marked memcopies are properly seen, transformed, and later removed by DSE.
since processStore potentially needs to move stores upward, we need to figure out how to keep mssa in sync.
in mssa, nodes are either phis, defs, or uses.
first, note what can be moved. phis must remain static at the top of each basic block. defs and uses can't move above or below any point that clobbers them, nor past any may-alias load.
moveUp, which is called by processStore, already ensures this. in other words, defs and uses are guaranteed not to move past clobbers or may-/must-alias loads.
upwards:
S past S': S' was above S and had optimized loads L pointing to it; therefore, none of the Ls alias S. re-order, update defs, and transfer uses of S to S'.
S past L: since S no-alias L and L was above S, no interaction. just re-order.
L past S: L no-alias S, so S can't be L's def. no interaction, just re-order.
L past L': L no-alias L'. just re-order.
downwards:
S past S': equivalent to moving S' up past S, so S' would absorb all of S's uses.
S past L: simple.
L past S: simple.
L past L': simple.
differential dependencies:
- run memcpyopt tests through update_test_checks.py
- restricted upward slice
- instructionClobbersQuery patched to only check UseLoc
pass itself:
- performCallSlotOptzn relies on caller to remove memcpy
- creates a new pass selectable by "-memcpyopt-mssa"
- adds a flag, -mco-mssa that forces mssa usage in -O3
- maintains simplicity by restricting matches to single bb
- performCallSlotOptzn
- processMemCpyMemCpyDependence
- abstracts instruction removal
- restricts moveUp's insertion point to a memory instruction
some ideas
- tryMergingIntoMemset: walk AccessList
- add a replaceMemoryInst API that will prevent renumbering and allocation
- don't clobber bb numbering on MemoryAccess removal
- maybe add a StopAt param to getClobberingMemoryAccess?