@jmcarp · Last active April 8, 2018

Optimizing badge performance

Before proposing possible solutions, here are a few questions I would ask if this were more than an exercise:

  • The badge endpoint uses the pyramid app, the sql backend (authorization, block list), and the elastic backend (search). Which component(s) take up the most time or are most challenging to scale?
    • For the purpose of the exercise, I'll assume the slowest/most expensive component is elastic.
  • What are the current and anticipated costs of the badge endpoint under the status quo? Would the cost of additional complexity (e.g. learning a new database, web framework, or programming paradigm) outweigh the savings from buying more servers?
    • I'll assume that cloud infrastructure is cheap and developer time expensive, and lean toward keeping things simple rather than maximizing performance.
  • What's the hypothesized value-add of the badge feature, and how can we test that hypothesis?
    • Does the badge make users more likely to view or create annotations?
    • Does A/B testing validate that badges with exact counts increase engagement more than badges with public counts, a present/absent indicator, or no badges at all?

Some aspects of the problem cut across the different UX possibilities:

  • To offload expensive queries from either the sql or elastic backends, full or partial information about badge counts should be pre-computed. A few orthogonal considerations:
    • Real-time vs batch
      • Real-time: use a sql trigger to update badge information in sql, publish to a subscriber, write to a foreign data store, etc
      • Batch: periodically refresh a materialized view in sql, ingest data into a separate data store, etc
    • Store counts vs presence/absence
      • Counts: serve annotation counts directly
      • Presence/absence: if no annotations present, return zero count and bail out; else search elastic as normal
    • Use existing data stores or add separate store
    • Update existing app or move to separate app
      • Add logic to existing app
      • Add logic to proxy in front of existing app (nginx/openresty)
      • Move logic to a new app

Using these classifications, the solution proposed in https://github.com/hypothesis/badger/blob/master/docs/design.md uses batch updates to store counts in a new data store managed by a new app.

Because I'm assuming that (1) the bottleneck is elastic and (2) adding and maintaining new tools is expensive, I'm proposing the simplest options that will also improve performance. I'm not considering options that would improve performance while increasing complexity or expanding the stack, such as rewriting in falcon/aiohttp/node/go.

Note: example sql code requires postgres >=9.5.

Scenario #1

For the first scenario, which requires accurate counts of public and private annotations, I propose using real-time updates to store annotation presence/absence to an existing data store in an existing app. Specifically, this approach would use a sql trigger to update a badge presence table on inserting a new annotation:

create table badge_presence (uri text primary key);

create function badge_presence_update() returns trigger as $badge_presence_update$
  begin
    insert into badge_presence (uri) values (new.target_uri_normalized) on conflict do nothing;
    return null;
  end;
$badge_presence_update$ language plpgsql;

create trigger badge_presence_update
  after insert on annotation
  for each row
  execute procedure badge_presence_update();

On a request to the badge endpoint, the app checks the badge presence table for the requested uri: if absent, return zero; otherwise, search elastic as normal. Checking the badge presence table is at least an order of magnitude faster than searching elastic (based on local testing). Given the assumption that the large majority of badge queries return a count of zero, this logic makes most badge queries significantly faster while adding negligible overhead to the small minority that return non-zero results.
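A minimal sketch of this check in Python; the function and the injected lookups are illustrative, not the actual hypothesis codebase:

```python
def badge_count(uri, presence_check, elastic_count):
    """Return the badge count for `uri`.

    presence_check(uri) -> bool: the cheap lookup against badge_presence in sql.
    elastic_count(uri) -> int: the existing (expensive) elastic search.
    """
    if not presence_check(uri):
        # Fast path: no annotations for this uri, so skip elastic entirely.
        return 0
    # Slow path: fall through to the normal elastic search for exact counts.
    return elastic_count(uri)
```

In the real app, `presence_check` would be a single-row select against badge_presence and `elastic_count` the existing search call.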

We can also avoid the overhead of gunicorn and pyramid for the majority of requests by pushing the query against the badge presence table to the nginx layer. Specifically, if we replace nginx with openresty, a battle-tested distribution of nginx bundled with Lua scripting, we can skip requests to the backend entirely when we know that the annotation count is zero:

location /api/badge {
  access_by_lua_block {
    -- assumes a pg handle initialized elsewhere (e.g. via lua-resty-postgres);
    -- the uri to check arrives as a query argument
    local res = pg:query("select 1 from badge_presence where uri = " .. ngx.quote_sql_str(ngx.var.arg_uri))
    if #res == 0 then
      ngx.say('{"count": 0}')
      ngx.exit(ngx.HTTP_OK)
    end
  }
  proxy_pass http://api;
}

However, in the interest of keeping the stack simple, we can defer this kind of optimization until it becomes necessary (if ever).

Similarly, we could publish annotation presence/absence to a faster data store like redis using listen/notify or foreign data wrappers, at the cost of adding a new data store to the stack. Again, given the diminishing returns of optimization, this might be best deferred until needed.

Finally, if storing and querying the entire set of annotated uris becomes too expensive, we could instead test annotation presence using a bloom filter, a probabilistic data structure that efficiently checks whether an element is part of a set. Bloom filters may generate false positives (item is reported in set but isn't actually present) but don't generate false negatives (item is reported not in set but actually is present), so we'll never inappropriately short-circuit and skip searching elastic.
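A minimal bloom filter sketch in Python (stdlib only; the bit-array size, hash count, and salting scheme are illustrative, not tuned for production):

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: k hash positions per item over an m-bit array."""

    def __init__(self, m_bits=1 << 20, k=7):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # Derive k independent positions by salting blake2b differently.
        for i in range(self.k):
            h = hashlib.blake2b(item.encode(), salt=str(i).encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # May return True for items never added (false positive), but never
        # False for an added item -- so a miss safely short-circuits to a
        # zero count, and a hit falls through to elastic as usual.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

The false-positive rate is tunable via the bit-array size and hash count relative to the number of annotated uris.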

Scenario #2

For the second scenario, which requires accurate counts of public annotations, I propose a slight modification of the approach for the first scenario: storing annotation counts instead of presence/absence, and never falling through to elastic.

We can modify the sql trigger approach from above to store counts rather than presence/absence:

create table badge_count (
  uri text primary key,
  count int
);

create function should_count(ann annotation) returns bool as $should_count$
  begin
    return ann.shared and not ann.deleted;
  end;
$should_count$ language plpgsql;

create function badge_count_update() returns trigger as $badge_count_update$
  begin
    if (tg_op = 'INSERT') then
      if (should_count(new)) then
        insert into badge_count (uri, count) values (new.target_uri_normalized, 1)
          on conflict (uri) do update set count = badge_count.count + 1;
      end if;
    elsif (tg_op = 'UPDATE') then
      if (should_count(new) and not should_count(old)) then
        insert into badge_count (uri, count) values (new.target_uri_normalized, 1)
          on conflict (uri) do update set count = badge_count.count + 1;
      elsif (should_count(old) and not should_count(new)) then
        update badge_count set count = count - 1 where uri = new.target_uri_normalized;
      end if;
    end if;
    return null;
  end;
$badge_count_update$ language plpgsql;

create trigger badge_count_update
  after insert or update on annotation
  for each row
  execute procedure badge_count_update();

Because the pre-computed counts are enough to answer requests to the badge endpoint, we never need to query elastic to fetch public badge counts. (One edge case the trigger above doesn't handle: an update that changes target_uri_normalized itself would need to decrement the count for the old uri and increment the count for the new one.)
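The trigger's transition rules can be sanity-checked with a toy Python model (a mirror of the logic above, not the production path):

```python
def should_count(ann):
    # Mirrors the sql should_count: public and not deleted.
    return ann["shared"] and not ann["deleted"]

def apply_insert(counts, new):
    # Mirrors the INSERT branch of badge_count_update.
    if should_count(new):
        counts[new["uri"]] = counts.get(new["uri"], 0) + 1

def apply_update(counts, old, new):
    # Mirrors the UPDATE branch: only transitions across the countable
    # boundary change the count.
    if should_count(new) and not should_count(old):
        counts[new["uri"]] = counts.get(new["uri"], 0) + 1
    elif should_count(old) and not should_count(new):
        if new["uri"] in counts:  # the sql update matches no row otherwise
            counts[new["uri"]] -= 1
```

Inserting a private annotation, or updating without crossing the public/deleted boundary, leaves the counts untouched.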

As in the first scenario, we can take additional steps to take load off the pyramid app and sql store, such as publishing counts to redis or querying counts from a separate app; again, I propose taking those steps when necessary but not before. Often, postgres is good enough.

Scenario #3

The third scenario requires showing a badge for a non-zero annotation count. For this exercise, I'll assume this means a non-zero public annotation count; if the requirement is for non-zero annotation counts visible to the current user, I would modify the proposal slightly. For this scenario, I propose an approach similar to the second scenario's: store annotation counts to power present/absent queries, while never hitting elastic.

If we're serving present/absent requests, why store counts? Unlike the approach for the first scenario, in which we rely on elastic for exact counts, this approach never queries elastic. This means false positives in the badge table will result in incorrect behavior. To prevent false positives caused by annotations that are deleted or made private, we keep exact counts while serving present/absent information.
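To make the false-positive concern concrete, here's a toy illustration (hypothetical names, not production code) of why a bare presence set can't serve present/absent correctly once annotations are deleted or made private:

```python
presence = set()
counts = {}

# One public annotation is created on a page: both stores record it.
presence.add("http://example.com/page")
counts["http://example.com/page"] = 1

# The annotation is later deleted. The count can be decremented, but the
# presence set has no way to know whether this was the page's last
# annotation, so the uri must stay -- a permanent false positive.
counts["http://example.com/page"] -= 1

def badge_visible(uri):
    # Present/absent served from exact counts: zero correctly hides the badge.
    return counts.get(uri, 0) > 0
```

Keeping exact counts costs little over a presence set and keeps present/absent answers correct without ever consulting elastic.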

Notes

  • Why not redis?
    • Redis might be a useful addition to the stack, but I recommend keeping the stack as small as possible. Introducing a new data store adds new failure modes, such as redis and sql/elastic falling out of sync. Keeping data in sql means that pre-computed values are updated within transactions and will remain consistent with annotations.
  • Why not a separate app?
    • Again, factoring badge counts into a separate app could improve performance, but also increases the complexity of the stack. Based on the assumption that the current bottleneck is elastic, and that a stateless pyramid app is simple to scale horizontally, I propose keeping the badge endpoint in the current pyramid app until splitting it becomes necessary.