Skip to content

Instantly share code, notes, and snippets.

@binji
Last active May 18, 2018 20:38
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save binji/acc43b94c0a747e51dfafa1b5b099c9a to your computer and use it in GitHub Desktop.
Save binji/acc43b94c0a747e51dfafa1b5b099c9a to your computer and use it in GitHub Desktop.
Bulk Memory Operations

Bulk Memory Operations

Motivation

Some people have mentioned that memcpy and memmove functions are hot when profiling some WebAssembly benchmarks. Some examples:

I've been looking at perf profiles for wasm unity benchmark a bit recently and see that some of the hottest functions are doing memcpy or memset like things. If this is any indication of normal wasm code patterns, I think we could see significant improvement with an intrinsic so it may be worth prioritizing.

In a number of game engines I've been optimizing and benchmarking, interestingly the performance of memcpy() does show up relatively high in profiles. (~2%-5% of total execution time)

Prototype

I implemented a prototype implementation of a move_memory instruction in v8 which just calls out to v8's MemMove function. I compared this to an implementation generated by emscripten and currently used in the Unity demo. This implementation aligns then performs copies using i32.load and i32.store. I've also included performance achieved by unrolling this loop manually and increasing the size to i64.

Each test copies size bytes from one address to another, non-overlapping. This is repeated N times. Each row copies a total of 1 Gib of data, and only touches 1 Mib of memory in the source and destination ranges.

This is the core loop:

  let mask = Mib - 1;
  let start = performance.now();
  for (let i = 0; i < N; ++i) {
    f(dst_base + dst, src_base + src, size);
    dst = (dst + size) & mask;
    src = (src + size) & mask;
  }
  let end = performance.now();

Here are the results on my machine (x86_64, 2.9GHz, L1 32k, L2 256k, L3 256k):

intrinsic i64 load/store x 4 i64 load/store x 2 i32 load/store x 2 i32 load/store
size=32b, N=33554432 1.382 Gib/s 1.565 Gib/s 1.493 Gib/s 1.275 Gib/s 1.166 Gib/s
size=64b, N=16777216 3.285 Gib/s 2.669 Gib/s 2.383 Gib/s 1.861 Gib/s 1.639 Gib/s
size=128b, N=8388608 6.162 Gib/s 3.993 Gib/s 3.480 Gib/s 2.433 Gib/s 2.060 Gib/s
size=256b, N=4194304 9.939 Gib/s 5.323 Gib/s 4.462 Gib/s 2.724 Gib/s 2.213 Gib/s
size=512b, N=2097152 15.777 Gib/s 6.377 Gib/s 4.913 Gib/s 3.231 Gib/s 2.457 Gib/s
size=1.0Kib, N=1048576 17.902 Gib/s 7.010 Gib/s 6.112 Gib/s 3.568 Gib/s 2.614 Gib/s
size=2.0Kib, N=524288 19.870 Gib/s 8.248 Gib/s 6.915 Gib/s 3.764 Gib/s 2.699 Gib/s
size=4.0Kib, N=262144 20.940 Gib/s 9.145 Gib/s 7.400 Gib/s 3.871 Gib/s 2.729 Gib/s
size=8.0Kib, N=131072 21.162 Gib/s 9.258 Gib/s 7.672 Gib/s 3.925 Gib/s 2.763 Gib/s
size=16.0Kib, N=65536 20.991 Gib/s 9.758 Gib/s 7.756 Gib/s 3.945 Gib/s 2.773 Gib/s
size=32.0Kib, N=32768 22.504 Gib/s 9.956 Gib/s 7.861 Gib/s 3.966 Gib/s 2.780 Gib/s
size=64.0Kib, N=16384 22.534 Gib/s 10.088 Gib/s 7.931 Gib/s 3.974 Gib/s 2.782 Gib/s
size=128.0Kib, N=8192 29.728 Gib/s 10.032 Gib/s 7.934 Gib/s 3.975 Gib/s 2.782 Gib/s
size=256.0Kib, N=4096 29.742 Gib/s 10.116 Gib/s 7.625 Gib/s 3.886 Gib/s 2.781 Gib/s
size=512.0Kib, N=2048 29.994 Gib/s 10.090 Gib/s 7.627 Gib/s 3.985 Gib/s 2.785 Gib/s
size=1.0Mib, N=1024 11.760 Gib/s 10.091 Gib/s 7.959 Gib/s 3.989 Gib/s 2.787 Gib/s

Design

This proposal introduces 2 new instructions:

move_memory:

Copy data from a source memory region to destination region; these regions may overlap: the copy is performed as if the source region was first copied to a temporary buffer, then the temporary buffer is copied to the destination region.

The instruction has the signature [i32 i32 i32] -> []. The parameters are, in order:

  • top-2: destination address
  • top-1: source address
  • top-0: size of memory region in bytes

set_memory: Set all bytes in a memory region to a given byte.

The instruction has the signature [i32 i32 i32] -> []. The parameters are, in order:

  • top-2: destination address
  • top-1: byte value to set
  • top-0: size of memory region in bytes

Structure

Unlike most other memory operations, the bulk operations do not have a memarg immediate.

instr ::== ...
    | move_memory
    | set_memory

Validation

move_memory

  • The memory C.mems[0] must be defined in the context.
  • Then the instruction is valid with type [i32 i32 i32] -> [].

set_memory

  • The memory C.mems[0] must be defined in the context.
  • Then the instruction is valid with type [i32 i32 i32] -> [].

Execution

move_memory

  1. Let F be the current frame.
  2. Assert: due to validation, F.module.memaddrs[0] exists.
  3. Let a be the memory address F.module.memaddrs[0].
  4. Assert: due to validation, S.mems[a] exists.
  5. Let mem be the memory instance S.mems[a].
  6. Assert: due to validation, a value of value type i32 is on the top of the stack.
  7. Pop the value i32.const n from the stack.
  8. Assert: due to validation, a value of value type i32 is on the top of the stack.
  9. Pop the value i32.const s from the stack.
  10. If s + n is larger than the length of mem.data, then:
    1. Trap.
  11. Assert: due to validation, a value of value type i32 is on the top of the stack.
  12. Pop the value i32.const d from the stack.
  13. If d + n is larger than the length of mem.data, then:
    1. Trap.
  14. Let b* be the byte sequence mem.data[s:n].
  15. Replace the bytes mem.data[d:n] with b*.

set_memory

  1. Let F be the current frame.
  2. Assert: due to validation, F.module.memaddrs[0] exists.
  3. Let a be the memory address F.module.memaddrs[0].
  4. Assert: due to validation, S.mems[a] exists.
  5. Let mem be the memory instance S.mems[a].
  6. Assert: due to validation, a value of value type i32 is on the top of the stack.
  7. Pop the value i32.const n from the stack.
  8. Assert: due to validation, a value of value type i32 is on the top of the stack.
  9. Pop the value i32.const c from the stack.
  10. Assert: due to validation, a value of value type i32 is on the top of the stack.
  11. Pop the value i32.const d from the stack.
  12. If d + n is larger than the length of mem.data, then:
    1. Trap.
  13. Let c_w be the result of computing wrap_{32,8}(c).
  14. Let b* be the byte sequence {bytes_i8(c_w)}^n.
  15. Replace the bytes mem.data[d:n] with b*.

Binary Format

instr ::= ...
    | 0xC5 0x00 => move_memory
    | 0xC6 0x00 => set_memory

Note that this skips 0xC0..0xC4 because those are currently proposed to be used for the new sign-extending operators (see https://github.com/WebAssembly/threads/blob/master/proposals/threads/Overview.md#new-sign-extending-operators).

An immediate byte is included for future extensions. It currently must be zero.

Text Format

plaininstr_I ::= ...
    | `move_memory` => move_memory
    | `set_memory` => set_memory
@JohnSully
Copy link

Why do we need operators? Couldn't the embedding just export memmove and meset functions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment