Skip to content

Instantly share code, notes, and snippets.

@demurgos

demurgos/merge.md

Created Oct 7, 2018
Embed
What would you like to do?

Merge

This document describes the merge algorithm.

The merge algorithm takes a list of process coverages (or smaller items) and outputs a single object combining the information in the inputs. The goal is to simulate the result V8 would have returned if the coverages were executed sequentially inside the same process instead of their own processes (so the counts are summed).

This algorithm depends only on the input coverages: it does not use the source text or Abstract Syntax Tree (AST) of the covered scripts. This means that the algorithm does not have as much information as the V8 profiler.

If you do an experiment and run the V8 profiler on two entry points in their own process and then in the same process (experimental merging), you will get different results than if you use the merge algorithm. The experimental result may have more or smaller ranges, but both results should be compatible. In both cases, using the count of the smallest range containing an AST node should return the same result. (This is what I call "coverage inclusion": one is more general than the other but they yield the same results when applied on the source text).

This lack of source text information is also the reason for one the main unproven assumptions of the algorithm: the left bias rule. It will be detailed further down, but it relates to how partially overlapping ranges are handled. There are two choices "left bias" or "right bias". Choosing the wrong may break the validity of the coverage but there is no way to decide without the source text. I know cases where you need to use the "left bias", but wasn't able to find (or create) cases where the "right bias" is required. Because of this, the algorithm applies the rule of always using the "left bias". If a case is ever found where the "right bias" must be used, it would imply that merging coverages without the source text is ambiguous. If the left bias rule does not hold, merging without the source text is impossible.

The algorithm has the following properties (if you find a counter example, please file an issue):

  • Commutativity: If your list has two items, their order does not matter. Both merge([a, b]) and merge([b, a]) return the same result.
  • Associativity: You can reduce your input list by calling the algorithm on any contiguous subset of the input and get the same result. All of the following return the same result:
    • merge([a, b, c, d])
    • [b, c, d].reduce(a, (acc, next) => merge([acc, next]))
    • [a, b, c].reduceRight(d, (acc, next) => merge([next, acc]))
    • merge([merge([a, b]), merge([c, d])])

These two properties together mean that the order of the inputs and calls does not matter: the final result will always be the same.

Overview

The algorithm has two main main steps: function matching and range merging. The functions are the main units of coverage. The first step corresponds to group function coverages coming from different process coverages if they relate to the same function in the source text. The second step is the harder part: it takes lists of range coverages and returns a single list that represents the sum of the counts for all the AST nodes in the function.

Function matching

The coverages have a hierarchical structure: process coverages contain script coverages which contain function coverages. The function matching step searches FunctionCov values from different coverages that are corresponding to the same function. A function is matching another one if it is in the same file and covers the same span of source text.

This means that we are assuming that the source text of the scripts does not change between process coverages. In case the program is that dynamic, there's no easy way to merge coverages so it's usually a pretty good assumption.

The algorithm first matches scripts: across process coverages, it takes scripts with the same url value. The scriptId is ignored because this value depends on the load order and is not relied in the presence of dynamically loaded modules (optional dependencies, conditional dependencies, etc.).

Then inside each list of matching scripts, it matches functions based on their span, or root range. For each function coverage, ranges[0] is defined and corresponds to the whole span of the function: from just before the function keyword to just after the closing brace } (ranges[0].count is the number of times the function was called). For FunctionCov values corresponding to the whole module, the span simply covers the entire file.

Inside a script, the functions are matched with the startOffset and endOffset of their root range. The functionName is ignored because it is not that reliable (multiple functions can have the same name inside the same script).

Assuming the source text is not dynamically modified, this method of matching is unambiguous. There could be one edge case if the whole source text is a single function definition without anything outside (no assignation, no import/expert, no semicolon, no new line). I haven't tried it yet, but it may cause two FunctionCov with the same root range in the same ScriptCov. In this situation, the algorithm simply matches the two FunctionCov.

When a list of FunctionCov is matched, it is passed to the "range merging" phase to produce a single FunctionCov. This process is applied for each group of matching FunctionCov. The matching ScriptCov are then simply replaced by a ScriptCov with the same url and list of merged FunctionCov. The scriptId is just chosen to be unique. A small normalization is applied to improve determinism: the scripts are sorted by url, and the functions by their root range (increasing startOffset, decreasing endOffset).

Range merging

This phase is the meat of algorithm: it sums lists of ranges.

I will first give more details about the structure of ranges in a FunctionCov, then explained the expected results (including the left bias rule), and finally outline the algorithm.

Ranges

Inside each FunctionCov, ranges is a list of pre-order sorted nodes of a tree covering the AST.

Let me go with an example. I'll write a single-line function with its offsets:

(foo) => { for (const _ of "rgb") { alert(foo ?  0 : 1) } return foo ? "true" : "false"}
▏                                 ▏           ▏   ▏      ▏           ▏       ▏          ▏
0                                 34          46  50     57          69      77         88

Calling this function once with the argument false produces this FunctionCov:

{
  "functionName": "",
  "ranges": [
    {"startOffset":  0, "endOffset": 88, "count": 1},
    {"startOffset": 34, "endOffset": 57, "count": 3},
    {"startOffset": 46, "endOffset": 50, "count": 0},
    {"startOffset": 69, "endOffset": 77, "count": 0}
  ],
  "isBlockCoverage": true,
}

The easiest way to visualize the ranges is to draw them under the function. Their representation starts and ends at the corresponding offsets, the number corresponds to the count.

(foo) => { for (const _ of "rgb") { alert(foo ?  0 : 1) } return foo ? "true" : "false"}
▏                                 ▏           ▏   ▏      ▏           ▏       ▏          ▏
0                                 34          46  50     57          69      77         88
▏                                 ▏           ▏   ▏      ▏           ▏       ▏          ▏
[1-------------------------------------------------------------------------------------)
                                  [3---------------------)           [0-----)
                                              [0-)

When drawn this way, the tree formed by the ranges becomes apparent. The ranges inside a FunctionCov either nest or are disjoint: based on their inclusion relation, you can form a tree. The first range is always the root: it is larger than all of the following ranges. There is never any partial overlap inside single FunctionCov: the ranges either nest (ancestor/descendant relationship) or are disjoint.

This tree of ranges is called a range tree.

Pre-order tree traversal means that you start at the root and traverse the tree by always going to the left-most node that you haven't visited yet. Every time you enter a node for the first time, you emit it. The ranges are sorted according this traversal. Another way to state that the ranges are pre-order sorted is that they are sorted by increasing startOffset then decreasing endOffset.

The range tree is "covering" the Abstract Syntax Tree (AST). I mean two things: the count of an AST node is the smallest range containing it, and multiple AST nodes can correspond to the same range tree node. For exemple, the alert(...) expression is contained by two ranges: the root range and range with a count of 3. The latter is the smaller one, so the count associated to alert(...) is 3: alert was called three times. The for and return statements both correspond to the root range (with a count of 1).

The merge algorithm handles the ranges as a tree: converting from a flat list to a tree is the first step of the algorithm. It then merges the trees and flattens the result back to a list.

Expected behavior

The range trees from matching FunctionCov can have different structures. Some can have more nodes, nodes at different offsets, or even with partially overlapping ranges (across FunctionCov values from different process coverages).

The trees are merged recursively, starting at the root. One of the conditions for the function coverages to match is that their root ranges all have the same offsets. In the merged tree, the root also has the same offsets; its count is the sum of the counts. For example, if there a 3 process coverages, and the root range of the same function has respectively count values of 1, 2 and 3 then it means that in total the function was called 6 times and the merged tree will have a root with a count of 6. Pretty straight forward.

The interesting part is when we start handling the children because they can have different ranges. The merge algorithm merges any number of coverages at the same time, but it's mostly an optimization. You can think about the expected result when merging two trees and then generalize it to many trees processed at the same time. If you take any two arbitrary children ranges, there are 13 ways to compare them one relative to the other. Scalars have 3 possibilities: "LesserThan", "Equal", "GreaterThan"; for ranges there are 13 possibilities. For our purpose, these possibilities can be grouped into 4 families:

Configuration "equal":
        [1---------)            (Reference)
        [2---------)            Equal

Configuration "disjoint":
        [1---------)            (Reference)
[2-----)                        BeforeAdjacent
[2-)                            BeforeStrict
                    [2-----)    AfterAdjacent
                        [2-)    AfterStrict

Configuration "nested":
        [1---------)            (Reference)
[2-------------------------)    OverlapStrict
[2-----------------)            OverlapEqualEnd
        [2-----------------)    OverlapEqualStart
        [2-----)                ContainedEqualStart
            [2-----)            ContainedEqualEnd
            [2-)                ContainedStrict

Configuration "partial overlap":
        [1---------)            (Reference)
[2-------------)                PartialOverlapStart
            [2-------------)    PartialOverlapEnd

Each family describes a configuration between two children nodes. Merge rules be defined based on how these 4 configurations are handled. These definitions are based on the count of the parent nodes (context) and count of the child nodes. The definition will be given as ASCII drawings. A and B are the parent counts, a and b the children counts.

Equal

--A---------------------------
         [a---------)
+
--B---------------------------
         [b---------)
=
--A+B-------------------------
         [a+b-------)

This is the easiest situation: if both trees have the same structure, you just sum the corresponding counts.

Disjoint

--A---------------------------
       [a-----)
+
--B---------------------------
                  [b-----)
=
--A+B-------------------------
       [a+B---)   [A+b---)

Any AST node contained by the range [a--) will have a count of B in the the second function coverage (inherited from the parent). Summing both, we get the count a+B. The same reasoning applies to the other child range. Notice that the original trees had a single child each, but the merged tree has two. These children do not break the "AST coverage" property because their boundaries have the same offsets and "orientation" as in the input trees (they don't split an AST node from the other function coverage).

Nested

--A---------------------------
      [a---------------)
+
--B---------------------------
           [b-----)
=
--A+B-------------------------
      [a+B-------------)
           [a+b---)

The [a+b--) result comes from the same argument as for the "disjoint" configuration (AST nodes contained in [b--) have a count of a in the other coverage). The count for the nodes contained in [a--) but not by [b--) is a+B, but why introduce an extra nesting level? This avoids introducing extra splits. Since AST nodes get the count of the smallest range containing them, this merge result will assign them the correct results.

One may have thought of the following solution instead, with extra split but without extra layers:

--A---------------------------
      [a---------------)
+
--B---------------------------
           [b-----)
=
--A+B-------------------------
      [a+B)[a+b---)[a+B)
          ^        ^

But this introduces a closing split and opening split at the marked positions. Those splits were not present in the original trees so it's not safe to add them. Here is an example where this would not work:

/* ... */ if (foo) { if(bar) { doAction() }    } /* ... */
--10------------------------------------------------------
                   [1--------------------------)
+
--20------------------------------------------------------
                             [2-----------)

(first case: called 9 times with {foo: false} and once with {foo: true, bar: true}; second case: called 18 times with {foo: true, bar: false} and twice with {foo: true, bar: true})

The valid "nested" rule and invalid "flat nested" rule would produce:

/* ... */ if (foo) { if(bar) { doAction() }    } /* ... */
nested (valid):
--30------------------------------------------------------
                   [21-------------------------)
                             [3-----------)
flat nested (invalid):
--30------------------------------------------------------
                   [21------)[2-----------)[21-)

The statement if(bar) { ... } is evaluated 21 times in total, but in the invalid case, the smallest range containing this statement has a count 30.

Partial overlap

Understanding why the "nested" rule introduces an extra layer to avoid nesting is important to understand the chosen "partial overlap" rule.

--A---------------------------
      [a----------)
+
--B---------------------------
           [b----------)
=
--A+B-------------------------
      [a+B--------)[A+b)
           [a+b---)

The partially overlapping configuration is the only configuration where you need to introduce splits in the merged tree to get matching counts on AST nodes attached to the leafs of the tree. The next best solution introduces a single split. The chose rule (above) is the "left bias" variant because it favors the left range and splits the right range. The "right bias" variant looks like this:

--A---------------------------
      [a----------)
+
--B---------------------------
           [b----------)
=
--A+B-------------------------
      [a+B)[A+b--------)
           [a+b---)

If the merge algorithm was working in the void, there would be no way to decide which which variant should be used. Without the AST, you don't know which of the splits is safe and which breaks an AST node. But we're no working on some imaginary trees, but ones produced based on the JS syntax.

A split usually corresponds to a branch in the control flow graph. This can be introduced by elements such as if, for, &&, default function parameters, return, throw, etc. Out of those, the first tokens of the node always executed, then there's a body and then an optional sequence of closing tokens. Because of this syntax, it's more common to have cases where parent/child ranges and at the same offset than start at the same offset. Two ranges starting at the same offset mean the start of a node gets a different count than its parent node.

I found a few cases where two ranges start at the same offset. Most of them were bugs (that I reported to the V8 team), but some may be legit situations. If there were no legit cases where two ranges can start at the same offset, it would have given an easy proof that the "right bias" variant is impossible. I haven't tried running my tests with the patched V8 version. Maybe all the cases I found were bugs. For example, this function exhibited this behavior.

Since I can't definitely discard the "right bias" rule, I can at least find cases supporting the "left bias" rule. Here's one such case (slightly simplified to remove an irrelevant range over whitespace):

(n) => { if (n > 0) { if (true) { return true; console.log('foo') } } return false; }

[1----------------------------------------------------------------------------------)
                    [0----------------------------------------------)
+
[2----------------------------------------------------------------------------------)
                                 [0----------------------------------------------)

(first case: called once with {n: 0}; second case: called twice with {n: 1})

Here is what you get depending on the variant:

(n) => { if (n > 0) { if (true) { return true; console.log('foo') } } return false; }
left bias:
[3----------------------------------------------------------------------------------)
                    [2----------------------------------------------)[1----------)
                                             [0-------------------)
right bias:
[3----------------------------------------------------------------------------------)
                    [2----------------------)[1----------------------------------)
                                             [0-------------------)

The if (true) { ... } statement is evaluated twice. You get this result using the left bias rule, but using the right-bias variant you get a count of 3. When generating the merged result experimentally (performing the calls inside the same process), V8 returns coverage matching the left bias rule. It confirms that it's the right one in this case.

Because there exists a case when the left bias must be used, that the right bias rule seems unlikely (after looking at the JS syntax) and because you must pick one, I have chose to use the left bias rule.

If a case is ever found when the left bias rule produces invalid results (so the right bias variant should have been used), it means that it is impossible to decide which variant to use without the source text and the whole algorithm is invalid.

Summary

Here are all the merge rules together:

--A---------------------------
         [a---------)
+
--B---------------------------
         [b---------)
=
--A+B-------------------------
         [a+b-------)
--A---------------------------
       [a-----)
+
--B---------------------------
                  [b-----)
=
--A+B-------------------------
       [a+B---)   [A+b---)
--A---------------------------
      [a---------------)
+
--B---------------------------
           [b-----)
=
--A+B-------------------------
      [a+B-------------)
           [a+b---)
--A---------------------------
      [a----------)
+
--B---------------------------
           [b----------)
=
--A+B-------------------------
      [a+B--------)[A+b)
           [a+b---)

Overview of the merge algorithm

This section gives an overview of the actual implementation of the algorithm. Check the actual implementation for the whole details.

This algorithm is designed to be fast. This algorithm should be used to handle the Node or npm test suites. These generates thousands of process coverages (adding up to hundreds of megabytes). The initial implementation took roughly 50 seconds on my computer to merge these coverages. Careful changes to the algorithm and implementation took this time down to about 3 seconds. These times are for the Node implementation. I also wrote a Rust implementation to compare the performance (and have another implementation to double check). I only managed to go down to 4 seconds with single-threaded Rust (1.2s with multithreading). I was very impressed that V8 managed to run the algorithm as fast (or faster) as native code. (Disclaimer: I'm good at Node, not so much at Rust, tips are welcome).

TODO: Actually explain the algorithm:

  • Tree construction
  • Delta counts
  • Scan line
  • Start events
  • Open range
  • Wrappers
  • Tree normalization
  • Flattening
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.