Postmortem on a multi-layered bug
Also, a very long, mostly miserable day.
A web project was recently updated but now won't build on some computers. The updates largely involved migrating the project to our webpack-based Docker toolset which is in use on other projects and is supposed to be stable. Updates were authored on Macs, then tested on Windows -- where the whole project failed to run.
"THIS IS IMPOSSIBLE"
The first changes were made while experimenting on a significantly underpowered MacBook. The code worked, slowly, then was confirmed again on a MacBook Pro. Moving the code over to a Windows laptop, the tools failed to run. That initial failure was consistent on both Windows and WSL2 filesystems -- which should have been a big clue. After a while banging my head on that, I tried the code on a fourth, older Windows laptop where it built correctly.
With the codebase now cloned across four computers, five counting WSL, watching seemingly identical Docker builds fail for no apparent reason was maddening. The whole point of Docker is consistent, portable environments; this should not be happening.
Too many possibilities
Debugging this was an exercise in guessing wrong from a huge list of variables: macOS and Windows, different versions of Windows, multiple versions of Docker, multiple versions of node.js, and all the npm dependencies in node_modules.
My first thought was Docker. I run Docker Desktop Edge releases, and having been burned a couple times by new versions, I've learned to be cautious about upgrades. But a few days before picking this up I updated everything on my Windows machine, including the latest Docker. So when the build worked on the older PC, I first tried rolling Docker back to a previous version -- actually many previous versions going back six months. Having the build eventually fail on a known good version was somewhat reassuring in that it took Docker out of the equation.
Right? Because docker images are consistent?
Some time was wasted trying to debug Webpack in node. This was monotonous and fruitless.
Each computer had different versions of node.js installed. Not to mention the version running inside the Docker container. Those were normalized, but still no luck.
I was very close to giving up and restoring Windows on the PC -- which wouldn't have worked anyway.
Finally, trying to eliminate any variation, I switched to running builds directly inside the containers. This eventually led to the solution.
What took so long?
This was a nightmare to debug. Code was mirrored across four computers (five counting WSL), mostly using GitHub as the intermediary. Every step necessary to iterate took minutes to prepare. The build pipeline needed to run for dozens of seconds before it would fail. Dependencies were frequently cleared and re-installed, with both npm ci and npm install. Different versions of node.js were installed and re-installed. Backup snapshots of other, working projects were pulled, reconstructed, and spun up. Docker images were repeatedly rebuilt.
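The dependency resets looked roughly like this (a sketch; npm ci requires a package-lock.json that agrees with package.json, and will error out otherwise):

```shell
# Belt-and-suspenders: npm ci already deletes node_modules itself
# before installing exactly what package-lock.json specifies.
rm -rf node_modules
npm ci

# By contrast, npm install may resolve newer versions within the
# semver ranges in package.json and update the lockfile.
npm install
```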
What eventually proved most useful, running commands inside ephemeral Docker containers, required starting the project in one shell, finding the container name with docker ps in another shell, then finally entering the container with docker exec -it <container_name> bash. Each theory and attempted fix just took forever.
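For reference, the loop went something like this (assuming the project starts via docker-compose; substitute however your project launches):

```shell
# Shell 1: start the project and leave it running
docker-compose up

# Shell 2: list running containers to find the generated name...
docker ps --format '{{.Names}}'

# ...then open an interactive shell inside the suspect container
docker exec -it <container_name> bash
```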
Looking in the wrong places didn't help either.
The actual error
Well, there were actually two errors.
The first was a mistake on my part. During some earlier experiments, the writeToDisk DevServer option was disabled, pushed to Docker Hub, then locally reverted. That missing option crashes the build with this DevServer error message:
｢wdm｣: Error: ENOENT: no such file or directory, open '/usr/src/site/wp-content/themes/iop/dist/dependency-manifest.json'
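For reference, the option lives in the dev-server section of the webpack config. This is a sketch for webpack-dev-server v3, where writeToDisk is a top-level devServer option; the connection to this project's manifest file is my reading of the error above:

```javascript
// webpack.config.js (sketch)
module.exports = {
  // ...
  devServer: {
    // Without this, bundles stay in memory and files like
    // dependency-manifest.json never land on disk for other tools to read.
    writeToDisk: true,
  },
};
```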
Once that was discovered and resolved, I was left with this build error on half the computers:
Error [ERR_PACKAGE_PATH_NOT_EXPORTED]: No "exports" main resolved in /usr/src/tools/node_modules/@babel/helper-compilation-targets/package.json
That error, thankfully, led me to this comment on a Babel issue:
This is a regression in Node.js.
Yes, the docker-build image was built on v13 of node.js. But that still didn't answer why the same freshly rebuilt Docker images worked on some computers but not others.
Docker and the cache from hell
The real problem is that Docker caches the hell out of everything. Updates do not necessarily cascade, since a local docker build may quietly use a locally-cached, older version of its base image.
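There are ways to force Docker to re-resolve a floating tag against the registry instead of trusting the local cache (image name below is illustrative):

```shell
# Explicitly pull the floating tag; this updates the local copy
# if the registry has something newer behind the same tag.
docker pull node:13-slim

# Or have every build attempt the pull itself.
docker build --pull -t my-tools-image .
```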
Initially suspecting there might be a Docker issue, I did the prudent thing: deleted and locally rebuilt the Docker images being used by the project.
But that's where I ultimately failed. While the project-specific images were deleted and reconstructed, the base images they were built from were not cleared.
The build tools were using node:13-slim as a base image. Since that requirement is satisfied by every minor and patch release under v13, any existing node:13 image would work; Docker doesn't check to see if something newer would fulfill the requirement. Every computer where the project built successfully turned out to be one that had hung onto an older node:13 image. Where no image was available locally, Docker pulled the current version, which contained the regression.
Once the problem was finally isolated to loosely-defined base images and reproducible across machines, the fix was obvious and fast. Mostly born of frustration, I initially switched the base image to node:lts-slim and everything immediately worked on all platforms. But that's still a slippery target which could change and break.
Going forward, and this is a lesson I want to remember, base images for shared tools should always be tied to a specific version -- preferably a Long Term Support (LTS) release. In this case, the most recent LTS version of node is 12.16.2.
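In Dockerfile terms, the change amounts to a single line (versions from this post; pin to whatever the current LTS is when you read this):

```dockerfile
# Before: floating tag, silently satisfied by any cached node:13 image
# FROM node:13-slim

# After: exact version, so every machine resolves the same base
FROM node:12.16.2-slim
```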
In addition to using an explicit base image, versioned Docker tags are now published to Docker Hub to match every versioned Git tag. Each project can and should specify an explicit image version to make sure it keeps working regardless of what happens upstream.
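The publishing step can be sketched like this (the image name and version are hypothetical placeholders, not the project's actual names):

```shell
# Build and publish a tools image whose tag matches the Git tag,
# so projects can pin to an exact, immutable version.
VERSION=1.4.0
docker build -t example/build-tools:$VERSION .
docker push example/build-tools:$VERSION
```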
Docker should be consistent; next time it isn't, I'm going to start by clearing all images. Images should be disposable, so this should be a safe reset to baseline.
docker system prune -a should be enough, but double-check by running docker images -a afterwards to be sure they're all gone.
If some behavior is different between two computers, try checking Docker's Image ID hashes to see if the images differ.
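Comparing is quick, assuming the images are already present on both machines:

```shell
# List local node images with their IDs and registry digests
docker images --digests node

# Or print just the ID of the suspect base image for a side-by-side diff
docker inspect --format '{{.Id}}' node:13-slim
```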
I recall some introductory Docker tutorials warning about building from latest or version-less images. Obviously very good advice, but when a new technology all just works, warnings are easy to dismiss. Based on a quick random sample of Dockerfiles on GitHub, virtually no one else listened either.