
Service Workers Union, Financial Times branch

Report of the Extraordinary Working Group on Controlling the Means of Releasing to Production

Contents:

  1. Introduction - Comrade Evans
  2. Universal test coverage - Comrade Florisca
  3. Cooperation, and freedom, among independent features - Comrade Legg
  4. Some workers are more equal than others - Comrade Phillips
  5. The purge of service worker traitors - Comrade Militaru
  6. Any other business

1. Introduction

For the past 4 years the Financial Times branch has proudly controlled the means of releasing to production. Long gone are the days of multiple bureaucratic environments, pitting developer against integration engineer, and product owner against proletariat. For four glorious years we have released code to production dozens of times every day, for the greater good of all. Our continuous deployment pipeline, backed by the boundless invention of our engineers in the field of test automation, has meant we can confidently release any application within 10 minutes of merging to master.

However, following our great patriotic service worker release, many comrades reported disturbances in the provinces:

  • Barrier pages would be shown even after members had provided adequate authorisation
  • A great veil of blank error pages would descend upon any member making changes to their global site preferences
  • Cookies, grown fat on the irresponsible largesse of third parties, prevented the release of bugfixes to small numbers of members

In the wake of these failures, the committee was put under great pressure to reconcile two opposing forces of history:

  • the irrepressible march of the continuous development pipeline, forever intent on deploying code to production within 10 minutes of merging to master
  • the irrepressible march of our members towards unsubscribing if serious flaws emerged, and persisted, in our production environment

Below follows a summary of the findings of the working group.

2. Universal test coverage

Investigations uncovered that Service Workers were being denied their rights. In particular, their right to universal test coverage, as enshrined in the TDD constitution. After interrogating those involved, protestations were made that this was an inevitable consequence of the immaturity of the service worker testing economy. Further investigations revealed that there was some truth in this defence but that, nonetheless, it was a situation that needed rectifying.

A pamphlet published by Comrade Gaunt was - following a thorough examination for traces of subversion - accepted as the basis of a possible solution. The working group strove over many months to bring the patterns it contained in line with orthodoxy. What emerged were the following five principles.

Service worker APIs are often double-agents for the DOM

Because most Service worker APIs are also implemented in the DOM, any modules that do not interact directly with the service worker lifecycle can often be tested directly in a web page, using any orthodox test driver. Even Cache is available in the DOM.
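For instance, a module that only touches the Cache API can be exercised by an ordinary in-page test. A sketch (the module and cache name here are made up for illustration, not taken from our suite):

// hypothetical module under test - it only uses the Cache API, which exists
// on window as well as in workers, so it can run in a plain web page
function storeSnapshot (url, body) {
	return caches.open('test-content-v1')
		.then(cache => cache.put(url, new Response(body)));
}

// ordinary in-page test (mocha-style) - no service worker involved
it('stores a snapshot in the cache', () => {
	return storeSnapshot('/a-page', 'hello')
		.then(() => caches.match('/a-page'))
		.then(res => res.text())
		.then(text => {
			if (text !== 'hello') {
				throw new Error('expected the cached snapshot to be retrievable');
			}
		});
});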

sw-mocks

Install and control a puppet Service Worker

// worker: listen for a 'claim' instruction from the test page and take
// control of it immediately, replying on the provided MessagePort once done
self.addEventListener('message', ev => {
	const msg = ev.data;
	if (msg.type === 'claim') {
		self.clients.claim()
			.then(() => ev.ports[0].postMessage('claimed'));
	}
});

Maintain a secure communication channel
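The worker above replies on the MessagePort it is handed, so the page side only needs a small helper that wraps a MessageChannel and resolves with the first reply. A sketch (messageWorker is a name introduced here for illustration, not part of any library):

// page-side helper: post a message to a worker along with a dedicated port,
// and resolve with whatever the worker posts back on that port
function messageWorker (worker, msg) {
	return new Promise(resolve => {
		const channel = new MessageChannel();
		channel.port1.onmessage = ev => resolve(ev.data);
		worker.postMessage(msg, [channel.port2]);
	});
}

// e.g. ask the puppet worker above to claim the page
// navigator.serviceWorker.ready.then(reg => messageWorker(reg.active, {type: 'claim'}));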

Set up and tear down of the worker

Allowing inspection of fetch calls

Looking at fetching and caching in particular, we wrap the native fetch implementation in code that tracks what it's called with. You could use sinon or a similar stubbing library to do this, but we found we didn't need anything beyond a basic check for whether fetch, in the service worker, was called with a given url. We also added a slight delay to fetch's responses, as this makes it easier to test scenarios where, in anything except your test environment, you would expect the network to be slower than local async processes. Finally, using the same messaging mechanism as described above, we are able to post a message to the worker to request information about the call history.

const nativeFetch = fetch;
let fetchCalls = [];

// convert a root-relative path into an absolute url within the worker's scope
function domainify (url) {
	return (url.charAt(0) === '/') ? self.registration.scope.replace(/\/$/, '') + url : url;
}

function queryFetchHistory (url, port) {
	port.postMessage(fetchCalls.indexOf(domainify(url)) > -1);
}

function clearFetchHistory (url, port) {
	fetchCalls = fetchCalls.filter(storedUrl => storedUrl !== domainify(url));
	port.postMessage('done');
}

self.fetch = function (req, opts) {
	fetchCalls.push(req.url || req);
	return nativeFetch.call(self, req, opts)
		// slow fetch down a little in test to make doubly sure it's slower than
		// local async operations
		.then(res => {
			return new Promise(resolve => setTimeout(() => resolve(res), 50));
		});
};

self.addEventListener('message', ev => {
	const msg = ev.data;
	if (msg.type === 'queryFetchHistory') {
		queryFetchHistory(msg.url, ev.ports[0]);
	} else if (msg.type === 'clearFetchHistory') {
		clearFetchHistory(msg.url, ev.ports[0]);
	}
});
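From a test running in the page, the same channel can then be used to ask the worker about its fetch history. A sketch, reusing the hypothetical messageWorker helper from above (the url and scenario are illustrative):

// page-side test: ask the worker whether it fetched a given url
function workerFetched (worker, url) {
	return messageWorker(worker, { type: 'queryFetchHistory', url });
}

it('fetches the article when asked to preload it', () => {
	return navigator.serviceWorker.ready
		.then(reg => workerFetched(reg.active, '/preloaded-article'))
		.then(wasFetched => {
			if (!wasFetched) {
				throw new Error('expected the worker to have fetched /preloaded-article');
			}
		});
});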

Who's afraid of the big bad worker …

or

It's the end of the web as we know it, and I feel fine

At the FT, we pride ourselves on having one of the fastest media websites in the world. We also release to production dozens of times a day, with small changes being rolled out to users on a near continuous basis. But when a service worker - the new browser API that promises much in the way of performance gains, and more besides - was added to the mix, things didn't go so smoothly.

Below I'll share a little about what went wrong, how we went back to the drawing board and, for the dedicated service worker enthusiast, some tips and code snippets for avoiding the same pitfalls.

How we release FT.com

At the heart of our ability to release software quickly and easily is our continuous deployment pipeline. The reason it works so effectively is we only have one environment (unless you count our local development machines), namely production. Within about 10 minutes of merging to master, a new version of the application is serving production traffic - exhilarating stuff if you're not used to it.

But we're not careless. A couple of practices help keep the wheels on the bus:

  • Before deploying the new version to production, we spin up a copy of the app in a near-production environment, and check a selection of urls to make sure they respond normally
  • Any feature (be it user-facing or some abstract API) can be safely hidden behind a feature flag, so its code can live in production for a long time while we test it, and if, when we turn it on, the feature turns out to be broken, we can easily switch it off outside of a release cycle.

How not to release your service worker

Perhaps somewhat arrogantly, and despite cautionary tales from our legendary webapp team, we dove in head first with our first service worker. The features we experimented with were the sorts of things advocated by numerous blog posts and conference talks, and not obviously massive risks. Things such as caching static assets, and preloading and caching content to be read offline.

But we failed to take into account a number of things, some of which we really should have seen coming, others that no-one could've guessed, and we were stung pretty badly:

  • Barrier pages were shown to users even after they had provided valid authorisation
  • Users changing their global site preferences were met with blank error pages
  • Users with cookies more than 4000 characters long got stuck on bad versions indefinitely

Those bugs are all pretty awful, particularly given that our users pay a lot to read our content, and we were putting some users into a persistent state where they were unable to read what they'd paid for.

So lesson 1 of rolling out service workers is don't rush in.

How to rush in

While we are normally very relaxed about releasing straight to production, when it comes to service workers, we should've realised that we were doing (or failing to do) a number of things which increased the risk level beyond reasonable limits:

No automated unit or integration tests in CI

This is fairly common for features on ft.com; while experimenting and not e.g. working with sensitive data we often release things that have few or no tests. I've been on the receiving end of some epic eye rolling in response to this admission, but it works for us, and our up-time and bug/error rate holds up to scrutiny.

Added to this, testing service workers is hard. A year ago (when we released our first service worker) resources and information on testing service workers were almost non-existent. We would've had to put a lot of effort into running tests for our service worker. Naively, we underestimated the risks and decided the effort wasn't worth it. Boy, oh boy, were we proved wrong.

No way to test changes to service worker without releasing to all users

We have no test environment, so any changes were released to all users. Any bugs in that release of the service worker would affect every user, and potentially stay around for a long time.

No easy way to roll back or turn off a broken service worker

We thought we had this covered with a feature flag to toggle our sw on and off, but it proved ineffective in extreme circumstances.

No way to turn off individual features of the service worker

Unlike with all other features on the site, which are neatly contained within a feature flag until well-tested in production, any code added to our service worker would be executed by every user. So even assuming the problems above could be tackled, we'd still be left with an all or nothing service worker, so if e.g. we broke the caching elements of the sw, we'd have to disable the entire sw, thus disabling e.g. push notifications too. This degree of tight coupling is anathema to our way of working.

How not to rush in

Before releasing the sw again we therefore had 4 conditions to satisfy, and I'll describe our approaches to these problems below:

  • Good coverage by tests, running, as far as necessary, in real browsers
  • Bulletproof kill switch
  • Mechanism for turning on features independently of one another
  • Mechanism for releasing to some users only (including targeting selected internal staff only)

Ok, now it's time to level with you, dear reader - this is gonna be a long one.

An effective kill switch

Targeted releases

Feature flags and service workers

Testing service workers

Resources for testing service workers have arrived since our first, ill-fated voyage. Notably sw-test-env, a mock sw environment running in node, which allows you to write tests run by your test driver of choice. This is a huge step forward - we introduced it late in our work on testing, and the benefits of speed and simplicity it brought to our test suite were huge.

But ideally browser code should, at least in part, be tested in real browsers. This is trickier.

We owe a lot to this article by Matt Gaunt of Google, which introduces the fundamental idea, and some sample code, to aid with manipulating the service worker life cycle in order to integration test features of your service worker in relative isolation. The basic idea is:

  • Register your service worker
  • Wait for the registration event
  • Run a test
  • Tear down the service worker
  • Repeat
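In mocha terms that loop looks roughly like this - a sketch only, with '/sw.js' and the activation-waiting logic standing in for whatever your harness actually does:

beforeEach(() => {
	// register a fresh copy of the worker and wait until it's activated
	return navigator.serviceWorker.register('/sw.js')
		.then(reg => new Promise(resolve => {
			const worker = reg.installing || reg.waiting || reg.active;
			if (worker.state === 'activated') {
				return resolve(reg);
			}
			worker.addEventListener('statechange', () => {
				if (worker.state === 'activated') {
					resolve(reg);
				}
			});
		}));
});

afterEach(() => {
	// tear the worker down so the next test starts from a clean slate
	return navigator.serviceWorker.getRegistration()
		.then(reg => reg && reg.unregister());
});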

Where we built on Matt's idea is in making this work with an off the shelf browser test harness, rather than having to write our own server and mechanism for capturing test results. We use karma (https://karma-runner.github.io/1.0/index.html) but many of the ideas should be transferable to other off the shelf tools.
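As a flavour of the configuration involved (paths and file names here are illustrative, not our actual setup), karma can serve the worker script without injecting it into the test page, and proxy it to a root path so its default scope covers the page the tests run in:

// karma.conf.js (sketch)
module.exports = function (config) {
	config.set({
		frameworks: ['mocha'],
		files: [
			'test/*.spec.js',
			// serve the worker file, but don't include it in the test page
			{ pattern: 'test/sw.js', included: false, served: true }
		],
		// registering the worker as '/sw.js' gives it a scope covering the context page
		proxies: {
			'/sw.js': '/base/test/sw.js'
		},
		browsers: ['Chrome']
	});
};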

Cache is a DOM API

An observation rather than a trick per se - Cache is a DOM API, so you can directly inspect, from your tests running in the page, what has been put in the sw Cache. The same is true for IndexedDB, if you also use that as a persistence mechanism.
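So a page-based test can assert directly on what the worker has put in its cache, along these lines (the cache name and url are illustrative):

it('caches the offline page during install', () => {
	// open the same cache the worker writes to and inspect its contents
	return caches.open('offline-v1')
		.then(cache => cache.keys())
		.then(requests => {
			const paths = requests.map(req => new URL(req.url).pathname);
			if (!paths.includes('/offline')) {
				throw new Error('expected /offline to have been cached by the worker');
			}
		});
});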

