@dherman
Created November 10, 2018 01:22
  • We want to provide good-quality progress meters in the console for the fetch+unpack operation
  • Measuring progress smoothly requires knowing not just how many compressed bytes you have read but also how many decompressed bytes you have written
  • Reporting the percentage of decompressed bytes written requires knowing the total decompressed size
  • Knowing the total decompressed size requires reading the field of the gzip format (ISIZE) that records the decompressed size, modulo 2^32
  • That field sits at a fixed offset from the end of the gzip file: it is the last four bytes
  • In order to still get the benefit of streaming, we have to make a separate HTTP HEAD request to learn the file's content length, and then a subsequent GET request that fetches just that tiny byte range to read that one field (sketched in the code below the question list)
  • (This does add the overhead of a couple of extra HTTP requests; anecdotally that seems cheap enough not to matter, but maybe it could matter in some environments?)
  • Unfortunately for GitHub releases, the files redirect to S3 URLs, which seem to reject HEAD requests with a 403
  • As a workaround, we created a GH repo with the files checked into the repo directly instead of served from GH release URLs, which meant the HEAD requests could succeed
  • But long term, there are a few questions:
  1. Is it true that progress reporting is noticeably smoother when based on decompressed size? My experience suggests yes, but it seems worth more empirical investigation (or domain knowledge from someone who knows better than I do)
  2. Is it not worth the trade-off of the extra up-front requests (3 round trips instead of 1)?
  3. Is there any possible way to do this well for zip files on Windows? I think the answer is no, and we just have to use the compressed size
  4. Should we just back off gracefully to compressed size when the HEAD request fails? (see the second sketch below)
  5. Or is there some way to get an S3 URL to respond successfully to a HEAD request?
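For reference, here is a minimal sketch of the two-request probe described above, assuming a Rust client built on reqwest's blocking API. The function name and error handling are illustrative, not the code we actually shipped; the gzip fact it relies on (the trailer's ISIZE field holds the uncompressed size modulo 2^32, little-endian, in the last four bytes) is from RFC 1952.

```rust
use reqwest::header::{CONTENT_LENGTH, RANGE};

/// Probe a gzip archive's uncompressed size before streaming it.
fn fetch_uncompressed_size(url: &str) -> Result<u64, Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();

    // Round trip 1: HEAD request to learn the compressed length.
    let head = client.head(url).send()?;
    let compressed_len: u64 = head
        .headers()
        .get(CONTENT_LENGTH)
        .ok_or("response had no Content-Length header")?
        .to_str()?
        .parse()?;

    // Round trip 2: ranged GET for the last 4 bytes, the gzip ISIZE field
    // (uncompressed size modulo 2^32, little-endian).
    let range = format!("bytes={}-{}", compressed_len - 4, compressed_len - 1);
    let tail = client.get(url).header(RANGE, range).send()?.bytes()?;
    let isize_field = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]);

    // Round trip 3 (not shown) is the full GET that streams and unpacks the
    // archive, reporting progress against this total.
    Ok(u64::from(isize_field))
}
```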
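And a sketch of the graceful fallback from question 4, assuming the probe above: if the probe fails (for example, S3 rejecting the HEAD with a 403), report progress against the compressed size from the main download instead. The type and function names here are hypothetical.

```rust
/// Which total the progress meter is measured against.
enum ProgressTotal {
    Decompressed(u64), // uncompressed size, read from the gzip ISIZE trailer
    Compressed(u64),   // compressed size, from the download's Content-Length
}

/// Hypothetical helper: prefer decompressed-size progress, but back off to
/// compressed-size progress whenever the trailer probe fails.
fn choose_progress_total(url: &str, compressed_len: u64) -> ProgressTotal {
    match fetch_uncompressed_size(url) {
        Ok(total) => ProgressTotal::Decompressed(total),
        Err(_) => ProgressTotal::Compressed(compressed_len),
    }
}
```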