@anjackson
Last active November 9, 2015 00:27
Ideal WARC ID result?

When analysing example.warc.gz, which contains an HTML response that was gzip-encoded, the ideal identification result would be:

  • application/warc
    • application/gzip
      (outer gzip chunk)
      • application/warc; version="1.0", type=response
        (The whole WARC Record)
        • application/http; msgtype=response
          (WARC Record content, i.e. HTTP headers and entity body)
          • application/gzip
            (i.e. the entity body is compressed)
            • text/html; version=5
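
A minimal sketch of how the nesting above could be walked, layer by layer, using only the Python standard library. Everything here is an assumption made for illustration: the helper names (`build_example`, `parse_headers`, `identify_layers`), the in-memory stand-in for example.warc.gz, and the crude doctype check used to report HTML5. A real identification tool would use proper signatures rather than these shortcuts.

```python
import gzip


def build_example() -> bytes:
    """Build a one-record stand-in for example.warc.gz (hypothetical content)."""
    html = b"<!DOCTYPE html><html><body>hi</body></html>"
    body = gzip.compress(html)  # the entity body is gzip-encoded
    http = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/html\r\n"
        b"Content-Encoding: gzip\r\n"
        b"Content-Length: " + str(len(body)).encode() + b"\r\n\r\n" + body
    )
    record = (
        b"WARC/1.0\r\n"
        b"WARC-Type: response\r\n"
        b"Content-Type: application/http; msgtype=response\r\n"
        b"Content-Length: " + str(len(http)).encode() + b"\r\n\r\n"
        + http + b"\r\n\r\n"
    )
    return gzip.compress(record)  # the outer gzip chunk


def parse_headers(block: bytes):
    """Split at the first blank line; return (first_line, headers, rest)."""
    head, _, rest = block.partition(b"\r\n\r\n")
    lines = head.decode("latin-1").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers, rest


def identify_layers(data: bytes):
    """Return the chain of types found while peeling the first record."""
    layers = ["application/warc"]
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        layers.append("application/gzip")
        data = gzip.decompress(data)
    version_line, warc_headers, rest = parse_headers(data)
    version = version_line.split("/")[1]
    layers.append(
        f'application/warc; version="{version}", type={warc_headers["warc-type"]}'
    )
    # The record block is exactly Content-Length bytes long.
    block = rest[: int(warc_headers["content-length"])]
    layers.append(warc_headers["content-type"])
    _, http_headers, entity = parse_headers(block)
    if http_headers.get("content-encoding") == "gzip":
        layers.append("application/gzip")
        entity = gzip.decompress(entity)
    # Crude assumption: treat a bare HTML5 doctype as "version=5".
    if entity.lstrip().lower().startswith(b"<!doctype html>"):
        layers.append("text/html; version=5")
    return layers


layers = identify_layers(build_example())
for depth, layer in enumerate(layers):
    print("  " * depth + layer)
```

Each entry in `layers` corresponds to one level of the bullet tree, outermost first.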
@richardlehane
This is helpful. At the moment, sf already makes most of the compromises you describe. It iterates through a WARC file by "payload" rather than by "record", skipping record types other than response and resource and merging continuations (https://godoc.org/github.com/richardlehane/webarchive#WARCReader.NextPayload), so transparently decoding transfer and content encodings may simply be carrying this approach through to its logical conclusion. Thanks for chiming in on this one, Andy!
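
To make the "payload rather than record" idea concrete, here is a rough Python sketch. It is not the webarchive package's Go API or implementation; the `next_payloads` function and the dict-based record shape (`id`, `type`, `origin`, `block`) are hypothetical, and it assumes continuation segments arrive in order after the record they continue.

```python
def next_payloads(records):
    """Yield (record_id, payload) for response/resource records only,
    with the blocks of any continuation records appended."""
    merged = {}  # record id -> accumulated payload, in first-seen order
    for rec in records:
        rtype = rec["type"]
        if rtype in ("response", "resource"):
            merged[rec["id"]] = bytearray(rec["block"])
        elif rtype == "continuation" and rec.get("origin") in merged:
            # Append this segment to the record it continues.
            merged[rec["origin"]] += rec["block"]
        # Every other record type (warcinfo, request, metadata, ...) is skipped.
    for record_id, payload in merged.items():
        yield record_id, bytes(payload)


# Hypothetical records standing in for a parsed WARC file.
records = [
    {"id": "r0", "type": "warcinfo", "block": b"software: demo"},
    {"id": "r1", "type": "response", "block": b"abc"},
    {"id": "r2", "type": "request", "block": b"GET / HTTP/1.1"},
    {"id": "r3", "type": "continuation", "origin": "r1", "block": b"def"},
    {"id": "r4", "type": "resource", "block": b"xyz"},
]
payloads = list(next_payloads(records))
print(payloads)
```

The sketch buffers everything before yielding, which keeps the merging logic obvious at the cost of memory; a streaming reader would need a different strategy.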