cdyk/bytesinwasm.md

## bytesinwasm.md

      
    Raw
  

              bytesinwasm.md
            
          
    Who to blame for the bytes in the WASM?

Christopher Dyken
Recently, I've worked with Emscripten, producing web-assembly from C++. As a
result, I've become curious of how much the wasm-binary grows by including
this or that code. Even though caring about code size seems to be out of
fashion since we stopped using floppy disks to move programs around, it does
matter somewhat on the web, as every client has to download the binary, maybe
over a cellular connection etc.
So, I was curious: Given a wasm-binary of a given size, can I determine how
many bytes this and that source-file contributed? There might be some tool
that does this, but I didn't manage to find any. But I found a way.
Sample program

To have something concrete to discuss around, I present this silly little
program that populates and array with pseudo-random numbers, sorts them and
prints them out:
#include <vector>
#include <algorithm>
#include <cstdio>
#include <cstdlib>

int main() {
    std::vector<int> quux(25);
    srand(42);
    for(auto & e : quux) {
        e = rand();
    }
    std::sort(quux.begin(), quux.end());
    for(auto e : quux) {
        printf("%d\n", e);
    }
    return 0;
}
I compiled this with -O2 -g4, which produces a binary of 28 KB. The debug
flag is needed to produce the information I will use. Running objdump
produces the following info:
$ wasm-objdump -h main.wasm

main.wasm:      file format wasm 0x1

Sections:

     Type start=0x0000000b end=0x00000096 (size=0x0000008b) count: 20
   Import start=0x00000099 end=0x00000208 (size=0x0000016f) count: 17
 Function start=0x0000020a end=0x0000025b (size=0x00000051) count: 80
   Global start=0x0000025d end=0x0000026c (size=0x0000000f) count: 2
   Export start=0x0000026f end=0x000003d8 (size=0x00000169) count: 24
     Elem start=0x000003da end=0x00000401 (size=0x00000027) count: 1
     Code start=0x00000405 end=0x00006425 (size=0x00006020) count: 80
     Data start=0x00006428 end=0x00006749 (size=0x00000321) count: 22
   Custom start=0x0000674c end=0x0000706e (size=0x00000922) "name"
   Custom start=0x00007070 end=0x0000708f (size=0x0000001f) "sourceMappingURL"

That is, the code segment is 24 KB and is located between 0x405 and
0x6425.
Source maps

Emscripten can produce debug output (the -g4 flag) in the form of Javascript
sourcemaps, of which the specification can be found
here.
The sourcemap for my silly program is a JSON-file that looks like:
{
  "version": 3,
  "sources": [
    "main.cpp",
    "emsdk/fastcomp/emscripten/system/include/libcxx/vector",
    "emsdk/fastcomp/emscripten/system/include/libcxx/iterator",
    "emsdk/fastcomp/emscripten/system/include/libcxx/algorithm",
    "emsdk/fastcomp/emscripten/system/include/libcxx/memory",
    "emsdk/fastcomp/emscripten/system/include/libcxx/stdexcept",
    "emsdk/fastcomp/emscripten/system/include/libcxx/new"
  ],
  "names": [],
  "mappings": "2yEAMA,MACA,IC+9CA,OAgBA,cCAA,SF7+CA,...,SKhtBA,OA0CA"
}
The list of sources is pretty obvious, but the mappings is a sequence of
segments mapping between generated code positions in the source files. The
actual encoding is interesting, and is well explained
here.
A segment contains (up to) five pices of information:

A starting column in the generated code.
A source file index.
A starting line in the source file.
A starting column in the source file.
A symbol index.

The line number in the generated code is implicit, as the segments are emitted
line-by-line and next line is encoded as ;.
The wasm sourcemaps appear to interpret the wasm-binary as a single line of
bytes, where a column in the generated code is a byte offset. I interpreted a
segment to start at its starting position and continue until the next segment
starts.
Aggregate source map segments

Thus, my idea was to parse the source maps, run through the segments and
aggregate the lengths of each segment (i.e. segment byte size) on a
source-by-source basis. I made a small Python-script (see end of the post)
that does this, and it produces the following:
       7 bytes emsdk/fastcomp/emscripten/system/include/libcxx/memory
       8 bytes emsdk/fastcomp/emscripten/system/include/libcxx/algorithm
     100 bytes emsdk/fastcomp/emscripten/system/include/libcxx/new
     296 bytes emsdk/fastcomp/emscripten/system/include/libcxx/iterator
     365 bytes main.cpp

Accounted for 776 bytes
Range is [0x514, 0x81c] (length is 776)

The range covered by the segments lies within the code segment (between
0x405 and 0x6425), though, only 776 bytes of 24KB is covered. This is
probably some constant overhead that grows relatively large for such a small
program, if I run this script on a decently sized application, the coverage
grows above 90%.
Update: Looking at the disassembly, I discovered that the uncovered range
is the wasm-side of the C-library. In hindsight, not very surprising.
Interpreting the results

Now, there is not too much too learn with such a silly example. But on larger
code bases there is some insight to be had. In my experience, the byte count
for a cpp-file corresponds relatively to the reduction in wasm-size if that
file is removed from the build. Thus, one can get an idea of the cost of a
feature by summarizing byte count of cpp-files according to the feature they
belong to.
Header files are less conclusive. However, if a header-file comes up high in
the list, it might indicate that it gets inlined a lot and the inlining
actually produces a lot of code. So this might highlight code that might
benefit being moved from a header file into a cpp-file.
To conclude, parsing the sourcemaps in this way have at least given my some
insight into how much each feature of an application costs byte-wise. It is
pretty simple, and, for all I know, this might be some well-known standard
approach I didn't know about. But I found it interesting.
Source code

Here is the source code for the parser-aggregater. It is in no way polished
and I am not a great python programmer, so you are warned that it will
probably crash down horribly and burn.
import json

class WasmBlamer:

    def _debase64(self, c):
        v = ord(c)
        if ord('A') <= v and v <= ord('Z'):
            return v-ord('A')
        elif ord('a') <= v and v <= ord('z'):
            return v-ord('a') + 26
        elif ord('0') <= v and v <= ord('9'):
            return v-ord('0') + 52
        elif v == ord('+'):
            return 62
        elif v == ord('/'):
            return 63
        assert False, '"' + c + '" is not a valid base64 character'

    def _decodeVLQ(self, array):
        rv = []
        while array:
            e = self._debase64(array[0])
            array.pop(0)

            sign = 1 if (e & 0x1) == 0 else -1
            value = (e >> 1) & 0xF

            offset = 4
            while e & 0x20:
                e = self._debase64(array[0])
                array.pop(0)
                assert value < (1<<offset)
                value = value | ((e & 0x1f)<<offset)
                offset += 5
            rv.append(sign * value)
        return rv

    def _handleSegment(self, segment):
        dst_col_next = 0
        src_col_next = 0
        src_row_next = 0
        src_next = -1
        if segment:
            vlq = self._decodeVLQ(segment)
            dst_col_next = self.dst_col + vlq[0]
            if self.dst_col_min < 0 or dst_col_next < self.dst_col_min:
                self.dst_col_min = dst_col_next
            self.dst_col_max = max(self.dst_col_max, dst_col_next)

            if 4 <= len(vlq):
                src_next = max(0, self.src) + vlq[1]
                src_row_next = self.src_row + vlq[2]
                src_col_next = self.src_col + vlq[3]
                if 5 <= len(vlq):
                    pass # ignore symbols for now
        if 0 <= self.src:
            self.blame[self.src] += max(0, dst_col_next-self.dst_col)
        self.dst_col = dst_col_next
        self.src_col = src_col_next
        self.src_row = src_row_next
        self.src = src_next

    def _printReport(self):
        report = list(zip(self.blame, self.sources))
        report.sort(key=lambda e: e[0])
        sum = 0
        for item in report:
            print("%8d bytes %s" % (item[0], item[1]))
            sum += item[0]
        print("\nAccounted for %d bytes" % sum)
        print("Range is [%#0x, %#0x] (length is %d)" % (self.dst_col_min, self.dst_col_max,
        	                                            self.dst_col_max- self.dst_col_min))


    def _processMappings(self):
        segment = []
        for c in self.mappings:
            if c == ',' or c == ';':
                self._handleSegment(segment)
                segment.clear()
                if c == ';':
                    self.dst_col = 0
                    self.dst_row += 1
            else:
                segment.append(c)
        self._handleSegment([])

    def process(self, path):

        with open(path) as file:
            sourcemap = json.load(file)
            self.sources = sourcemap['sources']
            self.names = sourcemap['names']
            self.mappings = sourcemap['mappings']

        self.dst_col_min = -1
        self.dst_col_max = 0
        self.dst_col = 0
        self.dst_row = 0
        self.src = -1
        self.src_col = 0
        self.src_row = 0
        self.blame = [0]*len(self.sources)
        self._processMappings()
        self._printReport()

if __name__ == "__main__":
    import sys
    if len(sys.argv) != 2:
        print("Usage: %s <binary>.wasm.map" % sys.argv[0], file=sys.stderr)
        sys.exit(-1)
    mapfile = sys.argv[1]
    blamer = WasmBlamer()
    blamer.process(mapfile)