@bryant
Created March 31, 2017 03:01
Notes for porting MemCpyOpt to MemorySSA
  • load-to-store of agg => memcpy/memmove (processStore)
  • call-load-to-store of non-agg => call-direct (processStore, performCallSlotOptzn)
  • byte-wise store sequence => memset (processStore, tryMergingIntoMemset)
  • store to agg => memset (processStore)
  • memset => widen memset with neighboring stores/memsets (processMemSet)
  • erase identity memcpy (processMemCpy)
  • memcpy from global constant => memset (processMemCpy)
  • memset-memcpy => memcpy-memset with disjoint dest ranges (processMemCpy, processMemSetMemCpyDependence)
  • call-memcpy => memcpy-call (processMemCpy, performCallSlotOptzn)
  • memcpy-memcpy opt (processMemCpy, processMemCpyMemCpyDependence)
  • memcpy from alloca or lifetime.start with no clobber in between => remove (processMemCpy)
  • memset-to-memcpy => memset-memset (processMemCpy, performMemCpyToMemSetOptzn)
  • memmove, src `no-alias` dest => memcpy (processMemMove)
  • memcpy from byval (processByValArgument)
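
as a concrete source-level sketch of the byte-wise-store case (the function name is made up; S as elsewhere in these notes), tryMergingIntoMemset folds adjacent stores of the same byte value into one memset:

    typedef struct { unsigned char large[65536]; } S;

    /* the four adjacent zero-byte stores below are merged into a single
       memset(s->large, 0, 4) by tryMergingIntoMemset */
    void zero_prefix(S *s) {
        s->large[0] = 0;
        s->large[1] = 0;
        s->large[2] = 0;
        s->large[3] = 0;
    }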

limitations:

  • won't elide memcpys from allocas that are live only within a subscope, e.g.,

    S f(int c) {
        if (c) {
            S b;
            g(0, &b);
            return b;
        } else {
            S a;
            g(&a, &a);
            return a;
        }
    }

    only the more frequently taken of the two branches would have its memcpy elided. one way around this is to hoist the alloca out (sketched below).
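
    a hedged sketch of that workaround (f_hoisted is a made-up name; S and g as above):

    S f_hoisted(int c) {
        S tmp;              /* one alloca, hoisted out of both branches */
        if (c)
            g(0, &tmp);
        else
            g(&tmp, &tmp);
        return tmp;         /* leaves a single copy-to-sret candidate */
    }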

caveats:

  • because it stomps away zero or one alloca, could it hinder later opt passes that rely on allocas?

benefits:

  • nounwind functions lose the memcpy
  • inlines better
    • in theory, inlining would strip away the extra sret memcpy of the inner inlined function, but this fails in practice:

      typedef struct { unsigned char large[65536]; } S;
      extern void g_(S *);

      __attribute__((nothrow)) S m(unsigned g) {
        S rv = {0};
        if (g) {
          S rv2;
          g_(&rv2);
          return rv2;
        }
        rv.large[g] = g + 1;
        return rv;
      }

      S n() {
        S k = m(32);
        k.large[445] = 2302;
        return k;
      }

thoughts

the overarching goal is to divine the allocainst that is most likely to be copied to sret.

this implies some degree of walking basic blocks.

this means identifying which allocainsts have the "final say" at each point of the graph, and finding the one that actually reaches the copy into sret.

can't do it during the machine passes because the memcpy is likely to have been expanded during lowering and isel. on the other hand, even the ir-to-ir passes will expand small memcpies.

  • only on sret noalias
  • only on nounwind functions
  • what about recursives?

a better approach

fix memcpy elision during inlining instead of within nounwind fn:

theirs:

a -m-> b
b -m-> c
=>
a -m-> b
a -m-> c

ours:

a -m-> b
...stuff with b
b -m-> c
=> if a is never depped elsewhere
...stuff with a
a -m-> b
a -m-> c

also, if a == c:

a -m-> b
...stuff with b
b -m-> a
=> if a is never depped elsewhere
...stuff with a
a -m-> b
a -m-> a
=> elim ident memcpy
...stuff with a
a -m-> b
=> dse
...stuff with a

if a is depped somewhere after the second memcpy:

a -m-> b
...stuff with b
b -m-> c
... = ... a
=>
...stuff with a
a -m-> b
a -m-> c
... = ... a
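
to make the -m-> notation concrete, here is a hedged C rendering of the a == c chain (buffer names, the use of libc memcpy, and the "stuff" are all just illustration; n is the common length):

    #include <string.h>

    void chain(char *a, char *b, size_t n) {
        memcpy(b, a, n);    /* a -m-> b */
        b[0] += 1;          /* ...stuff with b */
        memcpy(a, b, n);    /* b -m-> a */
    }

    /* if a is never depped in between, rewrite the stuff to act on a;
       a -m-> a is an identity memcpy and disappears, and a -m-> b is then
       dead if b has no further uses (dse): */
    void chain_rewritten(char *a, char *b, size_t n) {
        a[0] += 1;          /* ...stuff with a */
        memcpy(b, a, n);    /* a -m-> b, removable by dse when b is dead */
    }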

a better better approach

augment the existing mc-mc code:

in all cases, memcopies have the same length.

case 1:

    memcpy a -> b
    do stuff to b
    memcpy b -> c
==> if c == a
    do stuff to a
    memcpy a -> b

case 2:

    memcpy a -> b
    do stuff to a
    memcpy b -> c
==> if stuff doesn't mod or ref c or mod b
    memcpy a -> b
    memcpy a -> c
    do stuff to a
==> dse
    memcpy a -> c
    do stuff to a

case 3:

    memcpy a -> b
    do stuff to c
    memcpy b -> c
==> if stuff doesn't mod or ref b or mod a
    do stuff to c
    memcpy a -> b
    memcpy a -> c
==> dse
    do stuff to c
    memcpy a -> c
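
and a hedged C rendering of case 3 (names illustrative; a, b, c assumed non-overlapping and n the common length):

    #include <string.h>

    void case3(char *a, char *b, char *c, size_t n) {
        memcpy(b, a, n);    /* memcpy a -> b */
        c[0] = 7;           /* do stuff to c; no mod/ref of b, no mod of a */
        memcpy(c, b, n);    /* memcpy b -> c */
    }

    /* after the case-3 rewrite plus dse (assuming b is otherwise unused): */
    void case3_after(char *a, char *b, char *c, size_t n) {
        (void)b;            /* b is no longer needed */
        c[0] = 7;           /* do stuff to c */
        memcpy(c, a, n);    /* memcpy a -> c */
    }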

phab write-up

Currently, memcpy-memcpy pairs are only considered when there are no mods or refs of either the source or dest memory operands of the examined memcpy:

memcpy(b <- a)  ; the "dependee" memcpy
...  ; no mod/ref of a, b, or c in between
memcpy(c <- b)  ; the examined memcpy

In the above, if b and/or c are mod/refed in the space between the two memcopies, then the mod/ref-ing instruction closest to the examined memcpy is matched and the dependee is never seen. If on the other hand only a is mod/refed in between, then the memcpy pair is recognized but ultimately ignored because the processMemCpyMemCpyDependence transformation would be invalid:

memcpy(b <- a); *a = 42; memcpy(c <- b)
    =>
memcpy(b <- a); *a = 42; memcpy(c <- a)

What this patch does is search harder for memcpy pairs and then match and transform them against three general cases:

Case 1:

memcpy(b <- a); ...; *b = 42; ...; memcpy(a <- b);
    => if a is never mod/refed in between the two memcpys
...; *a = 42; ...; memcpy(b <- a);

Case 2 (essentially the todo mentioned in processMemCpyMemCpyDependence):

memcpy(b <- a); ...;  memcpy(c <- b);
    => if "..." doesn't mod/ref either c or b
memcpy(c <- a); memcpy(b <- a); ...;

Case 3:

memcpy(b <- a); ...; memcpy(c <- b)
    => if "..." doesn't mod/ref b or a
...; memcpy(b <- a); memcpy(c <- b)

Feedback on the soundness of these three cases is eagerly sought.

At this time, only case 2 has been implemented because it's the easiest and most useful. For instance:

typedef struct { unsigned char large[65536]; } S;

extern void g_(S *);

S p1(unsigned g) {
  S rv = {0};
  if (g) {
    S rv2;
    g_(&rv2);
    return rv2;
  }
  rv.large[g] = g + 1;
  return rv;
}

S p0() {
  S k = p1(32);
  k.large[445] = 2302;
  return k;
}

S set(S x, unsigned n) {
  x.large[n] = n;
  return x;
}

S p() {
  S k = p0();
  k = set(k, 99);
  k.large[22] += 23;
  return k;
}

produces, at -O3 (without the patch; extraneous memcopies marked):

define void @p(%struct.S* noalias nocapture sret) local_unnamed_addr #0 {
  %2 = alloca %struct.S, align 1
  %3 = alloca [22 x i8], align 8
  %4 = alloca [76 x i8], align 1
  %5 = alloca [345 x i8], align 1
  %6 = alloca [65090 x i8], align 2
  %7 = getelementptr inbounds [22 x i8], [22 x i8]* %3, i64 0, i64 0
  call void @llvm.lifetime.start(i64 22, i8* %7)
  %8 = getelementptr inbounds [76 x i8], [76 x i8]* %4, i64 0, i64 0
  call void @llvm.lifetime.start(i64 76, i8* %8)
  %9 = getelementptr inbounds [345 x i8], [345 x i8]* %5, i64 0, i64 0
  call void @llvm.lifetime.start(i64 345, i8* %9)
  %10 = getelementptr inbounds [65090 x i8], [65090 x i8]* %6, i64 0, i64 0
  call void @llvm.lifetime.start(i64 65090, i8* %10)
  %11 = getelementptr inbounds %struct.S, %struct.S* %2, i64 0, i32 0, i64 0
  call void @llvm.lifetime.start(i64 65536, i8* %11) #3, !noalias !8
  call void @g_(%struct.S* nonnull %2) #3, !noalias !8
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %7, i8* %11, i64 22, i32 1, i1 false)            <===
  %12 = getelementptr inbounds %struct.S, %struct.S* %2, i64 0, i32 0, i64 22
  %13 = load i8, i8* %12, align 1
  %14 = getelementptr inbounds %struct.S, %struct.S* %2, i64 0, i32 0, i64 23
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %8, i8* %14, i64 76, i32 1, i1 false)            <===
  %15 = getelementptr inbounds %struct.S, %struct.S* %2, i64 0, i32 0, i64 100
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %9, i8* %15, i64 345, i32 1, i1 false)           <===
  %16 = getelementptr inbounds %struct.S, %struct.S* %2, i64 0, i32 0, i64 446
  %17 = getelementptr inbounds [65090 x i8], [65090 x i8]* %6, i64 0, i64 0
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %17, i8* %16, i64 65090, i32 1, i1 false) #3     <===
  call void @llvm.lifetime.end(i64 65536, i8* %11) #3, !noalias !8
  %18 = add i8 %13, 23
  %19 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 0
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %19, i8* %7, i64 22, i32 1, i1 false)
  %20 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 22
  store i8 %18, i8* %20, align 1
  %21 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 23
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %21, i8* %8, i64 76, i32 1, i1 false)
  %22 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 99
  store i8 99, i8* %22, align 1
  %23 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 100
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %23, i8* %9, i64 345, i32 1, i1 false)
  %24 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 445
  store i8 -2, i8* %24, align 1
  %25 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 446
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %25, i8* %10, i64 65090, i32 1, i1 false)
  call void @llvm.lifetime.end(i64 22, i8* %7)
  call void @llvm.lifetime.end(i64 76, i8* %8)
  call void @llvm.lifetime.end(i64 345, i8* %9)
  call void @llvm.lifetime.end(i64 65090, i8* %10)
  ret void
}

With this patch, the marked memcopies are properly recognized, transformed, and later removed by DSE.

the memory ssa attack angle

since processStore potentially needs to move stores upward, we need to figure out how to keep mssa in sync.

in mssa, nodes are either phis, defs, or uses.

first, note what can be moved. phis must remain static at the top of each basic block. defs and uses can't move above or below any point that clobbers them or any load that isn't known no-alias.

moveUp, which is called by processStore, already ensures this. in other words, defs and uses are guaranteed not to move past clobbers or may-/must-alias loads.

upwards:

S past S': S' was above S and had optimized L pointing to it. therefore, none of the L alias S. re-order, update defs, and transfer uses of S to S'.

S past L: since S no-alias L and L was above S, no interaction. just re-order.

L past S: L no-alias S, so S can't be L's def. no interaction, just re-order.

L past L': L no-alias L'. just re-order.

downwards:

S past S': equivalent to moving S' up past S, so S would absorb all of the uses of S'.

S past L: simple.

L past S: simple.

L past L': simple.
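
a hedged source-level illustration of the "S past L" rule, with restrict standing in for the no-alias facts AA would establish (names made up; the real check is on MemoryDef/MemoryUse nodes in the IR):

    int s_past_l(int *restrict p, int *restrict q) {
        int t = *q;     /* L: a MemoryUse of whatever last defined *q */
        *p = 1;         /* S: a MemoryDef; p no-alias q, so S can move above L */
        return t;       /* with no mssa edges to rewrite -- just re-order */
    }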

patch summary and notes

differential dependencies:

  • run memcpyopt tests through update_test_checks.py
  • restricted upward slice
  • instructionClobbersQuery patched to only check UseLoc

pass itself:

  • performCallSlotOptzn relies on caller to remove memcpy
  • creates a new pass selectable by "-memcpyopt-mssa"
  • adds a flag, -mco-mssa, that forces mssa usage at -O3
  • maintains simplicity by restricting matches to single bb
    • performCallSlotOptzn
    • processMemCpyMemCpyDependence
  • abstracts instruction removal
  • restricts moveUp's insertion point to a memory instruction

some ideas

  • tryMergingIntoMemset: walk AccessList
  • add replaceMemoryInst that will prevent renumbering and allocation
  • don't clobber bb numbering on MemoryAccess removal
  • maybe add stop_At param to getClobberingMemoryAccess?