@max-mapper
Last active November 14, 2017 20:01
streaming unzip example

experimental zip stream parsing

using punzip, which uses mount-url and yauzl

problem: there's a 500 MB ZIP with a few CSVs in it, but you only care about one of the files and don't want to download and unzip the whole archive just to get that one file

  1. brew install osxfuse (or however you install fuse on your OS)
  2. npm install punzip csv-parser -g
  3. punzip http://download.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Downloads/PartD_Prescriber_PUF_NPI_DRUG_13.zip --entry=2 | csv-parser --separator=$'\t'

this will:

  • mount the zip as a virtual fuse file and convert fs read calls into http range requests
  • initialize a streaming zip parser, which can read the zip's entry index thanks to the random read capabilities of the fuse layer
  • the entry table is located at the end of the file, which makes zip bad for streaming. but with this approach the zip parser's reads at the end of the file get translated into http range requests like this:
mount-url requested +542ms 514105344-514170879 received 65536 bytes
mount-url requested +173ms 514170880-514172204 received 1325 bytes
  • in the above case it only needed to download ~67 kilobytes (65536 + 1325 bytes) of a 500 MB file to learn where every file in the zip lives, which is enough to stream any of them out
  • --entry=2 makes it stream the 2nd file in the zip
  • it only needs to download the parts of the zip related to the entry you choose. and since everything is streaming you don't need to wait for the whole thing to download before you start getting output
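
to make the "entry table at the end of the file" part concrete, here is a rough python sketch of the idea. it is not punzip's, mount-url's, or yauzl's actual code: `read_entry_table`, `range_header`, and `fake_range_request` are made-up names for illustration, the archive is a tiny in-memory zip rather than the CMS file, and ZIP64 and multi-disk archives are ignored. it reads only the tail of the archive, finds the end-of-central-directory record, and then reads the central directory to learn each entry's name, compressed size, and offset.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"  # end-of-central-directory record signature
CD_HEADER_LEN = 46        # fixed part of a central directory file header


def range_header(start, end):
    """HTTP Range header value for an inclusive byte range."""
    return f"bytes={start}-{end}"


def read_entry_table(read_range, size, tail=65536):
    """Return [(name, compressed_size, local_header_offset), ...].

    read_range(start, end) must return bytes start..end inclusive; it
    stands in for a single HTTP range request (hypothetical helper).
    """
    start = max(0, size - tail)
    buf = read_range(start, size - 1)
    pos = buf.rfind(EOCD_SIG)
    if pos < 0:
        raise ValueError("end-of-central-directory record not found in tail")
    # sig, disk, cd disk, entries on disk, total entries, cd size, cd offset, comment len
    _, _, _, _, total, cd_size, cd_offset, _ = struct.unpack_from("<4sHHHHIIH", buf, pos)
    if cd_offset >= start:
        # the central directory was already inside the tail we fetched
        cd = buf[cd_offset - start:cd_offset - start + cd_size]
    else:
        cd = read_range(cd_offset, cd_offset + cd_size - 1)
    entries, p = [], 0
    for _ in range(total):
        f = struct.unpack_from("<4sHHHHHHIIIHHHHHII", cd, p)
        name_len, extra_len, comment_len = f[10], f[11], f[12]
        raw = cd[p + CD_HEADER_LEN:p + CD_HEADER_LEN + name_len]
        # general purpose flag bit 11 marks UTF-8 names, otherwise CP437
        name = raw.decode("utf-8" if f[3] & 0x800 else "cp437")
        entries.append((name, f[8], f[16]))
        p += CD_HEADER_LEN + name_len + extra_len + comment_len
    return entries


# build a small stand-in archive in memory (assumption: not the real dataset)
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    for entry_name in ("a.csv", "b.csv", "c.csv"):
        zf.writestr(entry_name, "npi\tdrug\n" * 1000)
data = archive.getvalue()

requests = []  # Range header values we would have sent over http


def fake_range_request(start, end):
    requests.append(range_header(start, end))
    return data[start:end + 1]


# a deliberately tiny tail forces a second read for the central directory
entries = read_entry_table(fake_range_request, len(data), tail=64)
for name, compressed_size, offset in entries:
    print(name, compressed_size, offset)
print(requests)
```

once you have an entry's offset and compressed size, streaming that one file only needs a range request starting at that offset, which is why the rest of the archive never has to be downloaded.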

issues
