@TimvdLippe
Created February 1, 2018 16:13
Proposal for storage of static analysis warnings in WatchDog

Storage format of Static Analysis Warnings

There are multiple options for storing the static analysis warnings that WatchDog tracks. Because warnings change at a high frequency, we cannot send every single event to the server: the network load and the required storage capacity would quickly become overwhelming.

The following requirements are imposed:

  1. The network load imposed by WatchDog may not increase by more than 20% compared to the current network usage.
  2. All static analysis warning events must be preserved.
  3. All static analysis warnings must be anonymized on the client.
  4. Warnings can be clustered based on warning type.
  5. Warnings that existed for less than X seconds can be grouped into one event. (X to be determined)
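One possible reading of requirement 5 is that a warning which appears and disappears within X seconds produces a single combined event instead of a separate creation and removal event. The sketch below illustrates that reading only; the class name, the `Kind` values and the `threshold` parameter (standing in for the undetermined X) are hypothetical and not part of WatchDog.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class ShortLivedWarnings {
    /** Hypothetical event kinds; WatchDog's real event types may differ. */
    enum Kind { CREATED, REMOVED, SHORT_LIVED }

    /**
     * Returns the events to record for one warning, given when it appeared
     * and when it disappeared. The threshold plays the role of X.
     */
    static List<Kind> eventsFor(Instant appeared, Instant disappeared, Duration threshold) {
        List<Kind> events = new ArrayList<>();
        if (Duration.between(appeared, disappeared).compareTo(threshold) < 0) {
            // Lifetime below the threshold: one combined event.
            events.add(Kind.SHORT_LIVED);
        } else {
            // Otherwise the creation and the removal stay separate events.
            events.add(Kind.CREATED);
            events.add(Kind.REMOVED);
        }
        return events;
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2018-02-01T16:13:00Z");
        Duration threshold = Duration.ofSeconds(5); // assumed value for X
        System.out.println(eventsFor(start, start.plusSeconds(2), threshold));  // [SHORT_LIVED]
        System.out.println(eventsFor(start, start.plusSeconds(30), threshold)); // [CREATED, REMOVED]
    }
}
```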

Possible solutions

All solutions should adhere to the above requirements. Moreover, they must be compatible with and non-intrusive to the existing event-based WatchDog networking infrastructure.

Only store max, min, latest

The warnings can be grouped by time interval: all warning changes that occur in an interval X, from timestamp T1 to T2, are grouped and sent at once. They are sent as a map whose keys are the warning types and whose values are the maximum, minimum and latest number of warnings of that type present in the document during the interval. To gather this data, these three values must be maintained for each category during every interval. Data is only sent if at least one of the values differs from the interval that was sent last, which means that we have to keep the data of the last interval we sent.

Pseudo code

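class TimeStamped {
  int max;
  int min;
  int latest;
}

class TimeStamps {
  // Values of the interval that was sent last, per category.
  Map<Category, TimeStamped> previousTimestamps;
  // Values accumulated during the current interval, per category.
  Map<Category, TimeStamped> currentTimestamps;

  increaseForCategory(category) {
    stamped = getOrCreate(category);
    stamped.latest++;
    stamped.max = max(stamped.max, stamped.latest);
  }

  decreaseForCategory(category) {
    stamped = getOrCreate(category);
    stamped.latest--;
    stamped.min = min(stamped.min, stamped.latest);
  }

  // A category that is seen for the first time starts at zero warnings.
  getOrCreate(category) {
    if (!currentTimestamps[category]) {
      currentTimestamps[category] = new TimeStamped(0, 0, 0);
    }
    return currentTimestamps[category];
  }

  class TimerTask {
    // Runs once per interval.
    run() {
      // Iterate over a copy of the keys, since entries may be deleted.
      for (category in copyOf(keys(currentTimestamps))) {
        current = currentTimestamps[category];
        if (categoryIsDifferentInMaps(category)) {
          trackEventManager.addEvent(copyOf(current));
          // Store a copy: storing the same object would make
          // every later comparison report "no difference".
          previousTimestamps[category] = copyOf(current);
        }
        if (current.latest == 0) {
          // No warnings of this category remain, so stop tracking it.
          delete previousTimestamps[category];
          delete currentTimestamps[category];
        } else {
          // The next interval starts from the current count. Without
          // this reset, min and max would span the whole session.
          current.max = current.latest;
          current.min = current.latest;
        }
      }
    }

    categoryIsDifferentInMaps(category) {
      if (!previousTimestamps[category]) {
        return true;
      }
      for (field in {max, min, latest}) {
        if (previousTimestamps[category][field] != currentTimestamps[category][field]) {
          return true;
        }
      }
      return false;
    }
  }
}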
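The pseudo code above could be turned into Java roughly as follows. This is a minimal sketch, not WatchDog code: the class and method names are hypothetical, categories are plain strings, and `flush()` returns the snapshots instead of handing them to the event manager. It also assumes that max and min are reset to the latest count at the start of each interval, so that they describe one interval rather than the whole session.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

final class WarningIntervalTracker {
    /** Counts for one category; a copy of this is what would be sent. */
    static final class Counts {
        final String category;
        int max;
        int min;
        int latest;

        Counts(String category, int max, int min, int latest) {
            this.category = category;
            this.max = max;
            this.min = min;
            this.latest = latest;
        }

        Counts copy() {
            return new Counts(category, max, min, latest);
        }

        boolean sameValues(Counts other) {
            return max == other.max && min == other.min && latest == other.latest;
        }

        @Override
        public String toString() {
            return category + "{max=" + max + ", min=" + min + ", latest=" + latest + "}";
        }
    }

    private final Map<String, Counts> current = new HashMap<>();
    private final Map<String, Counts> lastSent = new HashMap<>();

    void increase(String category) {
        Counts c = current.computeIfAbsent(category, k -> new Counts(k, 0, 0, 0));
        c.latest++;
        c.max = Math.max(c.max, c.latest);
    }

    void decrease(String category) {
        Counts c = current.computeIfAbsent(category, k -> new Counts(k, 0, 0, 0));
        c.latest--;
        c.min = Math.min(c.min, c.latest);
    }

    /** Ends the current interval and returns the snapshots that need to be sent. */
    List<Counts> flush() {
        List<Counts> toSend = new ArrayList<>();
        Iterator<Counts> it = current.values().iterator();
        while (it.hasNext()) {
            Counts c = it.next();
            Counts previous = lastSent.get(c.category);
            if (previous == null || !previous.sameValues(c)) {
                // Copies keep sent data independent of later updates.
                toSend.add(c.copy());
                lastSent.put(c.category, c.copy());
            }
            if (c.latest == 0) {
                // No warnings of this category remain: stop tracking it.
                it.remove();
                lastSent.remove(c.category);
            } else {
                // The next interval starts from the current count.
                c.max = c.latest;
                c.min = c.latest;
            }
        }
        return toSend;
    }

    public static void main(String[] args) {
        WarningIntervalTracker tracker = new WarningIntervalTracker();
        tracker.increase("unused-import");
        tracker.increase("unused-import");
        tracker.decrease("unused-import");
        System.out.println(tracker.flush()); // [unused-import{max=2, min=0, latest=1}]
        System.out.println(tracker.flush()); // [unused-import{max=1, min=1, latest=1}]
        System.out.println(tracker.flush()); // []
    }
}
```

Under the reset assumption, the second flush still sends an event even though no warning changed, because the per-interval max and min have collapsed to the latest value. From the third interval onwards nothing is sent until the count changes again.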

Pros

  1. Grouping by time interval significantly reduces the network load
  2. Small memory footprint, as only the current warnings are stored
  3. Simple algorithm for checking updates

Cons

  1. Significantly less granular data due to the grouping per interval
  2. Only the maximum and minimum are known, with no data in between, so trend lines are harder to determine

Group changes in array

Instead of generating an event for every single warning, we can group the changes per interval. This means that we still store every single change, but avoid the overhead of a separate event for each one. The effect on the network load and storage is reduced, but could still be severe depending on how frequently warnings are generated. The timestamp of each change could be optional, as we can assume that the changes are uniformly distributed over the interval.
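If the timestamps are left out, the receiving side could estimate them from the interval boundaries under the uniform-distribution assumption. The sketch below shows one way to do that; the class and method names are hypothetical, and placing each change at the midpoint of its own slice of the interval is an illustrative choice, not something the proposal prescribes.

```java
import java.util.ArrayList;
import java.util.List;

final class UniformTimestamps {
    /**
     * Spreads changeCount changes evenly over [startMillis, endMillis),
     * placing each change at the midpoint of its own equal-sized slice.
     */
    static List<Long> estimate(long startMillis, long endMillis, int changeCount) {
        List<Long> estimates = new ArrayList<>();
        long length = endMillis - startMillis;
        for (int i = 0; i < changeCount; i++) {
            // Midpoint of the i-th of changeCount equal slices.
            estimates.add(startMillis + (2L * i + 1) * length / (2L * changeCount));
        }
        return estimates;
    }

    public static void main(String[] args) {
        // Four changes in a 60-second interval that starts at t = 0.
        System.out.println(estimate(0, 60_000, 4)); // [7500, 22500, 37500, 52500]
    }
}
```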

Pseudo code

class Change {
  Direction direction;
  optional DateTime time;

  enum Direction {
    INCREASE, DECREASE
  }
}

class CategoryChanges {
  // Changes recorded during the current interval, per category.
  Map<Category, List<Change>> currentChanges;

  increaseForCategory(category) {
    currentChanges[category] = currentChanges[category] || new List();
    currentChanges[category].push(new Change(INCREASE, now()));
  }

  decreaseForCategory(category) {
    currentChanges[category] = currentChanges[category] || new List();
    currentChanges[category].push(new Change(DECREASE, now()));
  }

  class TimerTask {
    // Runs once per interval.
    run() {
      // Iterate over a copy of the keys, since entries may be deleted.
      for (category in copyOf(keys(currentChanges))) {
        changes = currentChanges[category];
        if (changes.length == 0) {
          // Nothing changed in this interval, so stop tracking the category.
          delete currentChanges[category];
        } else {
          // Send a copy, so that clearing the list does not empty the event.
          trackEventManager.addEvent(copyOf(changes));
          changes.clear();
        }
      }
    }
  }
}
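A compact Java sketch of the same idea is shown below. The names are hypothetical, categories are plain strings, the optional timestamp is left out, and `flush()` returns the batches rather than passing them to the event manager. Unlike the pseudo code, it drops every category at the end of each interval instead of waiting for an interval without changes, which keeps the flush to a copy and a clear.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ChangeBuffer {
    enum Direction { INCREASE, DECREASE }

    // Changes recorded during the current interval, per category.
    private final Map<String, List<Direction>> changes = new HashMap<>();

    void record(String category, Direction direction) {
        changes.computeIfAbsent(category, k -> new ArrayList<>()).add(direction);
    }

    /**
     * Ends the interval: returns one batch per category that changed
     * and empties the buffer. The returned lists are no longer touched
     * by the buffer, because later changes go into fresh lists.
     */
    Map<String, List<Direction>> flush() {
        Map<String, List<Direction>> batches = new HashMap<>(changes);
        changes.clear();
        return batches;
    }

    public static void main(String[] args) {
        ChangeBuffer buffer = new ChangeBuffer();
        buffer.record("unused-import", Direction.INCREASE);
        buffer.record("unused-import", Direction.DECREASE);
        System.out.println(buffer.flush()); // {unused-import=[INCREASE, DECREASE]}
        System.out.println(buffer.flush()); // {}
    }
}
```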

Pros

  1. Extremely granular data, storing every change
  2. Less overhead than sending every event

Cons

  1. Potentially still very data-intensive, as no filtering is in place