Inspiration/prior art:

If you think about it, the blurry preview image technique is not great. A blurry preview offers a general impression of the colors of an image while omitting information about what's actually in the image, which is probably the information the viewer is more interested in. If we imagine, conversely, omitting color information by converting the image to grayscale, we actually retain the vast majority of the information the viewer is interested in! (Traditional image processing techniques like Canny edge detection and normalized graph cut operate on just grayscale images!)

Converting images to grayscale can potentially reduce them to like a quarter of the size, which is nowhere near small enough for a preview; we want a preview to be like ~1kb (1 packet ≈ 1.5kb). (Facebook's blurry image previews are 140B!)

First thing to try: simply blur with a median filter (edge-preserving, tho it still rounds down corners), downsample to a small resolution (128x128? That's 16K pixels), reduce to eg 4 levels of grayscale (at 2 bits per pixel, 16K pixels would be max 4KB), and expect that PNG compression (DEFLATE loves long runs of identical pixels) gets the size <1KB. However, that downsampling would lose a lot of detail!
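
To make that concrete, here's a rough sketch of that pipeline using Pillow/NumPy/SciPy; the filter size, resolution, and level count are just the guesses above, and saving as a small palette PNG is one way to actually get 2 bits per pixel:

```python
# Rough sketch of the grayscale-preview pipeline described above.
# Assumes Pillow, NumPy, and SciPy; the blur size, target resolution,
# and number of gray levels are just starting points to tweak.
import io
import numpy as np
from PIL import Image
from scipy.ndimage import median_filter

def grayscale_preview(path, size=128, levels=4, blur=5):
    img = Image.open(path).convert("L")                   # grayscale
    blurred = median_filter(np.asarray(img), size=blur)   # edge-preserving-ish blur
    small = Image.fromarray(blurred).resize((size, size), Image.LANCZOS)
    arr = np.asarray(small)
    # Quantize to `levels` evenly spaced gray values.
    step = 256 // levels
    quantized = ((arr // step) * step + step // 2).astype(np.uint8)
    out = Image.fromarray(quantized)
    # Saving as a tiny palette lets PNG use 2 bits/pixel for 4 levels.
    out = out.convert("P", palette=Image.ADAPTIVE, colors=levels)
    buf = io.BytesIO()
    out.save(buf, format="PNG", optimize=True)
    return buf.getvalue()

preview = grayscale_preview("photo.jpg")
print(len(preview), "bytes")   # hoping for <1KB
```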

Along similar lines, we can try 256x256 (64K pixels) black-and-white (1 bit per pixel, max 8KB); with so few buckets there should be even longer contiguous runs of the same color, so the compression should work even better. However, black-and-white loses even more detail, on top of the blurring and downsampling, so now the choice of threshold matters a lot. More on this in a bit.
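
Using the hypothetical grayscale_preview sketch above, this variant is just different parameters:

```python
# Same hypothetical pipeline as above, pushed to 256x256 and 2 gray levels
# (1 bit per pixel); the implicit threshold is the midpoint of the gray
# range, which is exactly the naive choice the next paragraphs try to improve on.
bw_preview = grayscale_preview("photo.jpg", size=256, levels=2)
print(len(bw_preview), "bytes")
```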

If we're already thresholding, we can also go further and potrace SVG silhouettes. If the silhouettes are good, this looks really good: see Mikael Ainalem's hand-drawn silhouettes, which are sublime. On the other hand, the automatically potraced silhouettes look subpar. (One cool thing about SVG silhouettes is that reducing detail to fit in a size budget won't make them look worse. Note that the best way to reduce detail is to downsample and re-trace; that should be best for preserving corners.)
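
A sketch of the threshold-then-trace idea, assuming the potrace CLI is installed, with plain Otsu thresholding via scikit-image standing in for whatever thresholding we end up with; the resolution and the absence of any potrace tuning flags are just starting points:

```python
# Threshold to a 1-bit PBM, then trace it to SVG with the potrace CLI.
# Shrinking `size` and re-tracing is also how we'd reduce detail to fit
# a size budget.
import subprocess
import numpy as np
from PIL import Image
from skimage.filters import threshold_otsu

def silhouette_svg(path, out_svg="silhouette.svg", size=256):
    img = Image.open(path).convert("L").resize((size, size), Image.LANCZOS)
    arr = np.asarray(img)
    dark = arr < threshold_otsu(arr)     # dark pixels become the "ink" potrace traces
    Image.fromarray(np.where(dark, 0, 255).astype(np.uint8)).convert("1").save("mask.pbm")
    subprocess.run(["potrace", "-s", "-o", out_svg, "mask.pbm"], check=True)

silhouette_svg("photo.jpg")
```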

Without machine learning, we probably can't get silhouettes as good as hand-drawn ones, and therefore can't trust our auto-generated silhouettes enough to shade in foreground pixels as all dark and background as all light (or vice versa) without regard to the actual brightness/darkness of the pixels in the original image, the way we can with hand-drawn silhouettes.

A conservative improvement to the existing potrace silhouette work is to improve thresholding. Instead of Otsu's method (which is almost certainly what they're doing), do very rough foreground detection and choose a threshold to maximize contrast between foreground and background. Foreground detection can be very rough, eg (a code sketch of the whole procedure follows this list):

  • loosely inspired by GrabCut, let's assume some colors are mostly foreground and other colors are mostly background, and estimate a Gaussian mixture model of ~9 components over all the colors in the image
    • Note: we want something like GMM and definitely don't want k-means. GMM finds well-separated clusters that can have very different sizes; k-means tends to carve the data into clusters containing similar numbers of points. Imagine an image with a dark thing in the foreground and many shades of blue in the background. For choosing a limited palette, k-means would be a good choice: it will find many clusters of various shades of blue, which is good because the palette-restricted image can capture a lot of detail in the background. For segmenting foreground from background, we want all the blues to be one cluster, because they're similar to each other and dissimilar to the dark foreground colors, so we want GMM.
  • let's also assume the foreground is mostly a contiguous thing (or a few contiguous things close together), whereas the background is a rectangle with a mostly contiguous hole in the middle.
    • Let's base our metric on the mean absolute deviation (MAD) of each pixel's distance from the set's centroid, for easy visualization/interpretability (altho stddev or variance or anything else monotonic with MAD should work fine too). Note that for a given number of pixels, a solid circular disk of pixels is the configuration with minimal MAD; and (if you're constrained to be inside a rectangular image bounds) a solid rectangle with a disk cut out from its center is the configuration with maximal MAD.
    • So our metric for contiguousness of a set of pixels will be the ratio between the MAD of that set and the MAD of a disk with that many pixels. The smaller that ratio, the more like a contiguous disk those pixels are.
    • Our goal shall be to find a set of GMM components of maximal contiguousness (minimal MAD ratio for the pixels of those colors), and whose complement has a maximal MAD ratio (this favors the foreground being near the center of the image in addition to being contiguous).
      • When identifying the set of pixels for each GMM component, ignore "outliers" (colors that are >2.5 stddevs from every component). (We can only ignore "outliers" during this contiguousness-classification step, ofc; we can't ignore them during GMM estimation.)
    • I don't think there are any especially clever ways to solve this (it feels very knapsack-like), but I think an A*-search-style approach should be able to significantly reduce the constant factor: keep around a "frontier" of sets of components, initialized with each component as a singleton set, then repeatedly explore from whichever set is currently the most promising (least MAD ratio so far), prioritizing adding components whose centroids are near those already in the set.
      • (Randomly subsampling the pixels we look at should also speed up both MAD computation and GMM estimation.)
  • Once we've decided which components we think are foreground and which we think are background, we choose a threshold that maximizes the contrast between them.
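
Here's roughly what that whole procedure could look like with scikit-learn's GaussianMixture. The exhaustive subset search (instead of the A*-style frontier search), the omission of the outlier-rejection step, and the particular reading of "maximize contrast" at the end are all placeholder choices:

```python
# Rough foreground detection sketch: fit a color GMM, score subsets of
# components by how disk-like their pixels are (MAD of distance-from-centroid
# vs. the MAD of a solid disk of equal area), then pick the black/white
# threshold that maximizes foreground/background contrast.
from itertools import chain, combinations
import numpy as np
from PIL import Image
from sklearn.mixture import GaussianMixture

def disk_ratio(ys, xs):
    """MAD of distance-from-centroid, divided by the same MAD for a solid
    disk with the same number of pixels (lower = more disk-like)."""
    mad = np.hypot(ys - ys.mean(), xs - xs.mean()).mean()
    r = np.sqrt(len(ys) / np.pi)      # radius of an equal-area solid disk
    return mad / (2 * r / 3)          # mean distance from a disk's center is 2r/3

def rough_foreground_threshold(path, n_components=9, sample=20_000):
    img = Image.open(path).convert("RGB")
    img.thumbnail((256, 256))                  # work at preview resolution for speed
    rgb = np.asarray(img, dtype=np.float64)
    h, w, _ = rgb.shape
    colors = rgb.reshape(-1, 3)

    # Fit the color GMM on a random subsample of pixels, then label every
    # pixel with its most likely component. (The outlier-rejection step
    # described above is omitted here for brevity.)
    rng = np.random.default_rng(0)
    idx = rng.choice(len(colors), size=min(sample, len(colors)), replace=False)
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(colors[idx])
    labels = gmm.predict(colors).reshape(h, w)

    ys, xs = np.mgrid[0:h, 0:w]
    best_score, best_fg = np.inf, None
    # Exhaustive search over component subsets (2^9 is tiny); the A*-style
    # frontier search described above would scale better.
    subsets = chain.from_iterable(
        combinations(range(n_components), k) for k in range(1, n_components))
    for subset in subsets:
        fg = np.isin(labels, subset)
        if not fg.any() or fg.all():
            continue
        # Foreground should be disk-like (low ratio) and its complement
        # shouldn't be (high ratio), so minimize the quotient.
        score = disk_ratio(ys[fg], xs[fg]) / disk_ratio(ys[~fg], xs[~fg])
        if score < best_score:
            best_score, best_fg = score, fg

    # One reading of "choose a threshold that maximizes the contrast": pick
    # the gray level with the biggest gap between the fraction of foreground
    # pixels and the fraction of background pixels that end up black.
    gray = np.asarray(img.convert("L"))
    fg_gray, bg_gray = gray[best_fg], gray[~best_fg]
    contrasts = [abs((fg_gray < t).mean() - (bg_gray < t).mean())
                 for t in range(1, 256)]
    return 1 + int(np.argmax(contrasts))

print(rough_foreground_threshold("photo.jpg"))
```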

Note that this categorization of pixels into foreground/background will be VERY rough, because similar colors may appear in both the foreground and the background; by contrast, GrabCut uses separate GMMs for foreground and background, which may have overlapping components, and pixels of ambiguous color get categorized smartly by the graph cut based on the contiguousness heuristic. But that's ok, because all we do with the foreground/background information is choose a threshold; what you actually see in the final preview image depends only on the brightness/darkness of each pixel in the original image, not at all on precisely which pixels were categorized as foreground vs background.

Looking at eg these examples, though, choosing a better threshold or even a reasonable adaptive threshold will only be of limited help. One idea, if we have silhouettes that are pretty good but still not as excellent as hand-drawn, is to use different thresholds for foreground and background to even further increase the contrast between foreground and background. For example if the foreground is darker than the background, we could use a high threshold for the foreground (so it gets biased towards black) and a low threshold for the background (so it gets biased towards white); there's still wiggle room here for really dark pixels that were (perhaps improperly?) categorized as "background" to end up thresholded as black, and vice versa. (Note that treating the silhouette as perfect and shading in all "foreground" pixels regardless of brightness is equivalent to pushing the different thresholds to opposite extremes.)
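
In code, the two-threshold idea could be as simple as this (the threshold values are made up, and fg_mask stands for a rough foreground labeling like the one sketched earlier):

```python
# Two thresholds: a generous (high) one for foreground pixels so they lean
# black, a stingy (low) one for background pixels so they lean white. A truly
# dark "background" pixel can still come out black, and vice versa.
import numpy as np

def dual_threshold(gray, fg_mask, t_fg=160, t_bg=96):
    """gray: 2D uint8 image; fg_mask: boolean rough-foreground labeling."""
    black = np.where(fg_mask, gray < t_fg, gray < t_bg)
    return ~black    # True = white in the preview
```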

This would be pretty hard because we can't use such a rough foreground detection; we'd have to use actual GrabCut (a minimal OpenCV GrabCut call is sketched after the list below). There are a few things about this that aren't great:

  • as far as I can tell, GrabCut's primary heuristic assumption is that the foreground and background have different color GMMs, and it only secondarily assumes contiguousness. It does not appear to use either gradient information or assumptions about shape (smooth curves, straight lines, occasional corners more likely than lots of jagged edges) the way active contours do. The one paper I found on Google that tries to combine GrabCut and active contour methods does a dumb weighted average between them. GrabCut does have an energy term that it minimizes; I wonder if the active contour assumptions (gradient & shape) could be incorporated into it? This would be a cool research project, but it feels like overkill, almost as much as machine learning.
  • GrabCut (and active contours) require an initial guess, and the original GrabCut paper actually assumes that the initial bounding box is correct: any pixels outside it are hard-assumed to be background (never updated during estimation). I think it can be easily adapted to use only soft assumptions and start from a fixed initial guess, but I expect it to perform poorly. Should we start with the above "rough foreground detection"? Seems a little redundant, since it makes similar assumptions to the ones GrabCut already makes, so it shouldn't really improve performance.
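
For reference, OpenCV ships a GrabCut implementation; a minimal mask-initialized call looks roughly like this. Initializing with "probable" labels rather than a hard rect corresponds to the soft-assumption variant discussed above, and the centered-box initial guess here is purely illustrative:

```python
# Minimal cv2.grabCut call initialized from a mask of "probable" labels.
import cv2
import numpy as np

img = cv2.imread("photo.jpg")
h, w = img.shape[:2]

mask = np.full((h, w), cv2.GC_PR_BGD, np.uint8)                # probably background
mask[h // 4: 3 * h // 4, w // 4: 3 * w // 4] = cv2.GC_PR_FGD   # probably foreground

bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)
cv2.grabCut(img, mask, None, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_MASK)

foreground = np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))
```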