@bryant
Created March 31, 2017 03:01
Notes for porting MemCpyOpt to MemorySSA
  • load-to-store of agg => memcpy/memmove (processStore)
  • call-load-to-store of non-agg => call-direct (processStore, performCallSlotOptzn)
  • byte-wise store sequence => memset (processStore, tryMergingIntoMemset)
  • store to agg => memset (processStore)
  • memset => widen memset with neighboring stores/memsets (processMemSet)
  • erase identity memcpy (processMemCpy)
  • memcpy from global constant => memset (processMemCpy)
  • memset-memcpy => memcpy-memset with disjoint dest ranges (processMemCpy, processMemSetMemCpyDependence)
  • call-memcpy => memcpy-call (processMemCpy, performCallSlotOptzn)
  • memcpy-memcpy opt (processMemCpy, processMemCpyMemCpyDependence)
  • memcpy from alloca or lifetime.start with no clobber in between => remove (processMemCpy)
  • memset-to-memcpy => memset-memset (processMemCpy, performMemCpyToMemSetOptzn)
  • memmove, src `no-alias` dest => memcpy (processMemMove)
  • memcpy from byval (processByValArgument)
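
as a concrete source-level sketch of the byte-wise-store case (the function name is made up; S as elsewhere in these notes), tryMergingIntoMemset folds adjacent stores of the same byte value into one memset:

    typedef struct { unsigned char large[65536]; } S;

    /* the four adjacent zero-byte stores below are merged into a single
       memset(s->large, 0, 4) by tryMergingIntoMemset */
    void zero_prefix(S *s) {
        s->large[0] = 0;
        s->large[1] = 0;
        s->large[2] = 0;
        s->large[3] = 0;
    }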

limitations:

  • won't elide memcpys from allocas that are live only within a subscope, e.g.,

    S f(int c) {
        if (c) {
            S b;
            g(0, &b);
            return b;
        } else {
            S a;
            g(&a, &a);
            return a;
        }
    }

    only the more frequently taken of the two branches would have its memcpy elided. one way around this is to hoist the alloca out (sketched below).
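
    a hedged sketch of that workaround (f_hoisted is a made-up name; S and g as above):

    S f_hoisted(int c) {
        S tmp;              /* one alloca, hoisted out of both branches */
        if (c)
            g(0, &tmp);
        else
            g(&tmp, &tmp);
        return tmp;         /* leaves a single copy-to-sret candidate */
    }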

caveats:

  • because it stomps away zero or one alloca, could it hinder later opt passes that rely on allocas?

benefits:

  • nounwind functions lose the memcpy
  • inlines better
    • in theory, inlining would strip away the extra sret memcpy of the inner inlined function, but this fails in practice:

      typedef struct { unsigned char large[65536]; } S;
      extern void g_(S *);

      __attribute__((nothrow)) S m(unsigned g) {
        S rv = {0};
        if (g) {
          S rv2;
          g_(&rv2);
          return rv2;
        }
        rv.large[g] = g + 1;
        return rv;
      }

      S n() {
        S k = m(32);
        k.large[445] = 2302;
        return k;
      }

thoughts

the overarching goal is to divine the allocainst that is most likely to be copied to sret.

this implies some degree of walking basic blocks.

this means identifying which allocainsts have the "final say" at each point of the graph, and finding the one that actually reaches the copy into sret.

can't do it during the machine passes because the memcpy is likely to have been expanded during lowering and isel. on the other hand, even the ir-to-ir passes will expand small memcpies.

  • only on sret noalias
  • only on nounwind functions
  • what about recursives?

a better approach

fix memcpy elision during inlining instead of within nounwind fn:

theirs:

a -m-> b
b -m-> c
=>
a -m-> b
a -m-> c

ours:

a -m-> b
...stuff with b
b -m-> c
=> if a is never depped elsewhere
...stuff with a
a -m-> b
a -m-> c

also, if a == c:

a -m-> b
...stuff with b
b -m-> a
=> if a is never depped elsewhere
...stuff with a
a -m-> b
a -m-> a
=> elim ident memcpy
...stuff with a
a -m-> b
=> dse
...stuff with a

if a is depped somewhere after the second memcpy:

a -m-> b
...stuff with b
b -m-> c
... = ... a
=>
...stuff with a
a -m-> b
a -m-> c
... = ... a
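
to make the -m-> notation concrete, here is a hedged C rendering of the a == c chain (buffer names, the use of libc memcpy, and the "stuff" are all just illustration; n is the common length):

    #include <string.h>

    void chain(char *a, char *b, size_t n) {
        memcpy(b, a, n);    /* a -m-> b */
        b[0] += 1;          /* ...stuff with b */
        memcpy(a, b, n);    /* b -m-> a */
    }

    /* if a is never depped in between, rewrite the stuff to act on a;
       a -m-> a is an identity memcpy and disappears, and a -m-> b is then
       dead if b has no further uses (dse): */
    void chain_rewritten(char *a, char *b, size_t n) {
        a[0] += 1;          /* ...stuff with a */
        memcpy(b, a, n);    /* a -m-> b, removable by dse when b is dead */
    }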

a better better approach

augment the existing mc-mc code:

in all cases, memcopies have the same length.

case 1:

    memcpy a -> b
    do stuff to b
    memcpy b -> c
==> if c == a
    do stuff to a
    memcpy a -> b

case 2:

    memcpy a -> b
    do stuff to a
    memcpy b -> c
==> if stuff doesn't mod or ref c or mod b
    memcpy a -> b
    memcpy a -> c
    do stuff to a
==> dse
    memcpy a -> c
    do stuff to a

case 3:

    memcpy a -> b
    do stuff to c
    memcpy b -> c
==> if stuff doesn't mod or ref b or mod a
    do stuff to c
    memcpy a -> b
    memcpy a -> c
==> dse
    do stuff to c
    memcpy a -> c
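
and a hedged C rendering of case 3 (names illustrative; a, b, c assumed non-overlapping and n the common length):

    #include <string.h>

    void case3(char *a, char *b, char *c, size_t n) {
        memcpy(b, a, n);    /* memcpy a -> b */
        c[0] = 7;           /* do stuff to c; no mod/ref of b, no mod of a */
        memcpy(c, b, n);    /* memcpy b -> c */
    }

    /* after the case-3 rewrite plus dse (assuming b is otherwise unused): */
    void case3_after(char *a, char *b, char *c, size_t n) {
        (void)b;            /* b is no longer needed */
        c[0] = 7;           /* do stuff to c */
        memcpy(c, a, n);    /* memcpy a -> c */
    }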

phab write-up

Currently, memcpy-memcpy pairs are only considered when there are no mods or refs of either the source or dest memory operands of the examined memcpy:

memcpy(b <- a)  ; the "dependee" memcpy
...  ; no mod/ref of a, b, or c in between
memcpy(c <- b)  ; the examined memcpy

In the above, if b and/or c are mod/refed in the space between the two memcopies, then the mod/ref-ing instruction closest to the examined memcpy is matched and the dependee is never seen. If on the other hand only a is mod/refed in between, then the memcpy pair is recognized but ultimately ignored because the processMemCpyMemCpyDependence transformation would be invalid:

memcpy(b <- a); *a = 42; memcpy(c <- b)
    =>
memcpy(b <- a); *a = 42; memcpy(c <- a)

What this patch does is search harder for memcpy pairs and then match and transform them against three general cases:

Case 1:

memcpy(b <- a); ...; *b = 42; ...; memcpy(a <- b);
    => if a is never mod/refed in between the two memcpys
...; *a = 42; ...; memcpy(b <- a);

Case 2 (essentially the todo mentioned in processMemCpyMemCpyDependence):

memcpy(b <- a); ...;  memcpy(c <- b);
    => if "..." doesn't mod/ref either c or b
memcpy(c <- a); memcpy(b <- a); ...;

Case 3:

memcpy(b <- a); ...; memcpy(c <- b)
    => if "..." doesn't mod/ref b or a
...; memcpy(b <- a); memcpy(c <- b)

Feedback on the soundness of these three cases is eagerly sought.

At this time, only case 2 has been implemented because it's the easiest and most useful. For instance:

typedef struct { unsigned char large[65536]; } S;

extern void g_(S *);

S p1(unsigned g) {
  S rv = {0};
  if (g) {
    S rv2;
    g_(&rv2);
    return rv2;
  }
  rv.large[g] = g + 1;
  return rv;
}

S p0() {
  S k = p1(32);
  k.large[445] = 2302;
  return k;
}

S set(S x, unsigned n) {
  x.large[n] = n;
  return x;
}

S p() {
  S k = p0();
  k = set(k, 99);
  k.large[22] += 23;
  return k;
}

produces, at -O3 (without the patch; extraneous memcopies marked):

define void @p(%struct.S* noalias nocapture sret) local_unnamed_addr #0 {
  %2 = alloca %struct.S, align 1
  %3 = alloca [22 x i8], align 8
  %4 = alloca [76 x i8], align 1
  %5 = alloca [345 x i8], align 1
  %6 = alloca [65090 x i8], align 2
  %7 = getelementptr inbounds [22 x i8], [22 x i8]* %3, i64 0, i64 0
  call void @llvm.lifetime.start(i64 22, i8* %7)
  %8 = getelementptr inbounds [76 x i8], [76 x i8]* %4, i64 0, i64 0
  call void @llvm.lifetime.start(i64 76, i8* %8)
  %9 = getelementptr inbounds [345 x i8], [345 x i8]* %5, i64 0, i64 0
  call void @llvm.lifetime.start(i64 345, i8* %9)
  %10 = getelementptr inbounds [65090 x i8], [65090 x i8]* %6, i64 0, i64 0
  call void @llvm.lifetime.start(i64 65090, i8* %10)
  %11 = getelementptr inbounds %struct.S, %struct.S* %2, i64 0, i32 0, i64 0
  call void @llvm.lifetime.start(i64 65536, i8* %11) #3, !noalias !8
  call void @g_(%struct.S* nonnull %2) #3, !noalias !8
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %7, i8* %11, i64 22, i32 1, i1 false)            <===
  %12 = getelementptr inbounds %struct.S, %struct.S* %2, i64 0, i32 0, i64 22
  %13 = load i8, i8* %12, align 1
  %14 = getelementptr inbounds %struct.S, %struct.S* %2, i64 0, i32 0, i64 23
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %8, i8* %14, i64 76, i32 1, i1 false)            <===
  %15 = getelementptr inbounds %struct.S, %struct.S* %2, i64 0, i32 0, i64 100
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %9, i8* %15, i64 345, i32 1, i1 false)           <===
  %16 = getelementptr inbounds %struct.S, %struct.S* %2, i64 0, i32 0, i64 446
  %17 = getelementptr inbounds [65090 x i8], [65090 x i8]* %6, i64 0, i64 0
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %17, i8* %16, i64 65090, i32 1, i1 false) #3     <===
  call void @llvm.lifetime.end(i64 65536, i8* %11) #3, !noalias !8
  %18 = add i8 %13, 23
  %19 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 0
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %19, i8* %7, i64 22, i32 1, i1 false)
  %20 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 22
  store i8 %18, i8* %20, align 1
  %21 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 23
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %21, i8* %8, i64 76, i32 1, i1 false)
  %22 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 99
  store i8 99, i8* %22, align 1
  %23 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 100
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %23, i8* %9, i64 345, i32 1, i1 false)
  %24 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 445
  store i8 -2, i8* %24, align 1
  %25 = getelementptr inbounds %struct.S, %struct.S* %0, i64 0, i32 0, i64 446
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %25, i8* %10, i64 65090, i32 1, i1 false)
  call void @llvm.lifetime.end(i64 22, i8* %7)
  call void @llvm.lifetime.end(i64 76, i8* %8)
  call void @llvm.lifetime.end(i64 345, i8* %9)
  call void @llvm.lifetime.end(i64 65090, i8* %10)
  ret void
}

With this patch, the marked memcopies are properly recognized, transformed, and later removed by DSE.

the memory ssa attack angle

since processStore potentially needs to move stores upward, we need to figure out how to keep mssa in sync.

in mssa, nodes are either phis, defs, or uses.

first, note what can be moved. phis must remain static at the top of each basic block. defs and uses can't move above or below any point that clobbers them or any load that isn't known no-alias.

moveUp, which is called by processStore, already ensures this. in other words, defs and uses are guaranteed not to move past clobbers or may-/must-alias loads.

upwards:

S past S': S' was above S and had optimized L pointing to it. therefore, none of the L alias S. re-order, update defs, and transfer uses of S to S'.

S past L: since S no-alias L and L was above S, no interaction. just re-order.

L past S: L no-alias S, so S can't be L's def. no interaction, just re-order.

L past L': L no-alias L'. just re-order.

downwards:

S past S': equivalent to moving S' up past S, so S would absorb all of the uses of S'.

S past L: simple.

L past S: simple.

L past L': simple.
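
a hedged source-level illustration of the "S past L" rule, with restrict standing in for the no-alias facts AA would establish (names made up; the real check is on MemoryDef/MemoryUse nodes in the IR):

    int s_past_l(int *restrict p, int *restrict q) {
        int t = *q;     /* L: a MemoryUse of whatever last defined *q */
        *p = 1;         /* S: a MemoryDef; p no-alias q, so S can move above L */
        return t;       /* with no mssa edges to rewrite -- just re-order */
    }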

patch summary and notes

differential dependencies:

  • run memcpyopt tests through update_test_checks.py
  • restricted upward slice
  • instructionClobbersQuery patched to only check UseLoc

pass itself:

  • performCallSlotOptzn relies on caller to remove memcpy
  • creates a new pass selectable by "-memcpyopt-mssa"
  • adds a flag, -mco-mssa, that forces mssa usage at -O3
  • maintains simplicity by restricting matches to single bb
    • performCallSlotOptzn
    • processMemCpyMemCpyDependence
  • abstracts instruction removal
  • restricts moveUp's insertion point to a memory instruction

some ideas

  • tryMergingIntoMemset: walk AccessList
  • add replaceMemoryInst that will prevent renumbering and allocation
  • don't clobber bb numbering on MemoryAccess removal
  • maybe add stop_At param to getClobberingMemoryAccess?