The discussion of render-diffing strategies at https://gist.github.com/joshbachynski/a24df8e8f0deeb0c521a4013074edc9e covers the why and how. This article addresses a question that every team adopting render-diff faces a few weeks in: how much diff noise is acceptable before the rule is useless? Set the threshold too tight and you spend your week investigating false positives. Set it too loose and you miss real regressions. Finding the equilibrium is empirical, but there are shape constraints worth respecting.
The first useful number is the natural diff floor. Even when rendering itself is deterministic, network jitter, A/B test bucketing, and time-of-day content rotation still produce a baseline diff. Measure it before setting a threshold: render the same URL ten times in a row on a clean main branch and record the diff distribution. The 95th percentile of that distribution is your floor. An alarm threshold set below the floor is guaranteed to flake.
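As a rough illustration of the floor measurement, here is a minimal Python sketch. It assumes you already have some way to render a URL and score the result against a baseline as a single number. That step is represented by a hypothetical `diff_once` callable, which is not part of any particular render-diff tool. The percentile uses linear interpolation between neighboring samples, one common convention among several. With only ten samples the result sits close to the largest observed diff.

```python
import math
from typing import Callable, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Return the pct-th percentile using linear interpolation."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 <= pct <= 100:
        raise ValueError("pct must be between 0 and 100")
    ordered = sorted(samples)
    # Fractional index into the sorted samples.
    rank = (len(ordered) - 1) * pct / 100.0
    lo = math.floor(rank)
    hi = math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def measure_floor(
    diff_once: Callable[[], float],  # hypothetical: renders once, returns a diff score
    runs: int = 10,
    pct: float = 95.0,
) -> float:
    """Collect `runs` diff scores on a clean branch and return the floor."""
    samples = [diff_once() for _ in range(runs)]
    return percentile(samples, pct)


def threshold_is_above_floor(threshold: float, floor: float) -> bool:
    """A threshold at or below the floor will fire on noise alone."""
    return threshold > floor
```

In use, you would pass a function that renders the page on a clean main branch and returns its diff score, then check any candidate threshold with `threshold_is_above_floor` before enabling the alarm.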
A reasonable target for vis