@anjackson
Last active November 9, 2015 00:27
Ideal WARC ID result?

The ideal identification result when analysing an example.warc.gz that contains an HTML response which was gzip-encoded:

  • application/warc
    • application/gzip
      (outer gzip chunk)
      • application/warc; version="1.0", type=response
        (The whole WARC Record)
        • application/http; msgtype=response
          (WARC Record content, i.e. HTTP headers and entity body)
          • application/gzip
            (i.e. the entity body is compressed)
            • text/html; version=5
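
To make the nesting above concrete, here is a minimal sketch in Python (standard library only) that peels the layers off a single-record file and reports one label per layer. The names `split_headers`, `describe_layers` and `build_example` are hypothetical helpers invented for this illustration, not part of any existing tool. It assumes one gzip member holding one WARC record, and it reports the HTTP Content-Type header as a stand-in for the innermost label, since real format identification (which would be needed to arrive at something like `version=5`) is out of scope here.

```python
import gzip


def split_headers(raw: bytes):
    """Split a header block from what follows it at the first blank line."""
    head, _, body = raw.partition(b"\r\n\r\n")
    first, *lines = head.decode("iso-8859-1").split("\r\n")
    fields = {}
    for line in lines:
        name, _, value = line.partition(":")
        fields[name.strip().lower()] = value.strip()
    return first, fields, body


def describe_layers(data: bytes) -> list:
    """Return one label per layer, outermost first (hypothetical helper)."""
    layers = []
    # Outer layer: the gzip chunk wrapping the record.
    if data[:2] == b"\x1f\x8b":
        layers.append("application/gzip")
        data = gzip.decompress(data)
    # The WARC record itself: version line, named fields, then the block.
    version_line, warc, block = split_headers(data)
    block = block[: int(warc["content-length"])]
    layers.append('application/warc; version="%s", type=%s'
                  % (version_line.split("/")[1], warc["warc-type"]))
    # The record content: HTTP headers plus entity body.
    layers.append(warc["content-type"])
    _status, http, body = split_headers(block)
    # The entity body may itself be compressed.
    if http.get("content-encoding", "").lower() == "gzip":
        layers.append("application/gzip")
        body = gzip.decompress(body)
    # Assumption: the Content-Type header is used only as a hint here.
    layers.append(http.get("content-type", "application/octet-stream"))
    return layers


def build_example() -> bytes:
    """Build a tiny in-memory stand-in for example.warc.gz."""
    entity = gzip.compress(b"<!DOCTYPE html><title>hi</title>")
    http = (b"HTTP/1.1 200 OK\r\n"
            b"Content-Type: text/html\r\n"
            b"Content-Encoding: gzip\r\n"
            b"Content-Length: " + str(len(entity)).encode() + b"\r\n\r\n"
            + entity)
    record = (b"WARC/1.0\r\n"
              b"WARC-Type: response\r\n"
              b"Content-Type: application/http; msgtype=response\r\n"
              b"Content-Length: " + str(len(http)).encode() + b"\r\n\r\n"
              + http + b"\r\n\r\n")
    return gzip.compress(record)


for depth, label in enumerate(describe_layers(build_example())):
    print("  " * depth + "• " + label)
```

The sketch starts at the outer gzip chunk, so the file-level `application/warc` label at the top of the list above is not produced by it.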
@richardlehane

thanks Andy, have linked to this from the issue (richardlehane/siegfried#57)

@anjackson

It does depend on the question, really. I think most people think of WARC as a 'container for files', and so getting into describing the WARC structure may make as much sense as trying to capture the structure and properties of the individual records in a ZIP file. So maybe you can ignore the WARC layer and just report on the content, thus hiding any details from the other levels (like WARC record version, type, etc.).

Note that this is pretty much what I do right now when scanning the formats in our (W)ARCs.

But these are not files, and treating them as such means you are ignoring anything other than the WARC-Type: response records, and also largely ignoring the HTTP headers (except for allowing Content-Type to act as a hint). This hides the Content-Encoding and Transfer-Encoding, and other things like the HTTP or WARC version. It's not good enough if you want to really understand what you have.
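
As an illustration of what gets hidden, the following Python sketch pulls exactly those details out of a single uncompressed record. `parse_head` and `summarise_record` are hypothetical names made up for this example, and the code assumes the input is exactly one record with CRLF line endings. It is a sketch of the idea, not how any existing scanner works.

```python
def parse_head(raw: bytes):
    """Split off the first line and the named fields before the blank line."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    first, *lines = head.decode("iso-8859-1").split("\r\n")
    fields = {}
    for line in lines:
        name, _, value = line.partition(":")
        fields[name.strip().lower()] = value.strip()
    return first, fields, rest


def summarise_record(record: bytes) -> dict:
    """Collect the details a 'files only' scan would drop (hypothetical)."""
    version_line, warc, block = parse_head(record)
    summary = {
        "warc_version": version_line.partition("/")[2],
        "warc_type": warc.get("warc-type"),
        "http_version": None,
        "content_type_hint": None,
        "content_encoding": None,
        "transfer_encoding": None,
    }
    # Only records whose block is an HTTP message carry HTTP headers.
    if warc.get("content-type", "").startswith("application/http"):
        start_line, http, _body = parse_head(block)
        # Works for "HTTP/1.1 200 OK" and for "GET / HTTP/1.1" alike.
        token = next(
            (t for t in start_line.split() if t.startswith("HTTP/")), None)
        summary["http_version"] = token.partition("/")[2] if token else None
        summary["content_type_hint"] = http.get("content-type")
        summary["content_encoding"] = http.get("content-encoding")
        summary["transfer_encoding"] = http.get("transfer-encoding")
    return summary
```

Because the summary is produced for every record type, a warcinfo or request record still yields its WARC version and type, with the HTTP fields left as `None` where they do not apply.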

But then again, I'm not expecting to get all the info I need in a single pass, at one point in time, with a single toolset. I'm expecting to go back over and over again to ask slightly different questions with (I hope) ever improving tools. Which is exactly why I would never actually use that WARC-Identified-Payload-Type field in my WARC records.

@richardlehane

This is helpful. At the moment, sf already makes most of the compromises you describe. It iterates through a WARC file by "payload" rather than by "record", skipping non-response/resource types & merging continuations (https://godoc.org/github.com/richardlehane/webarchive#WARCReader.NextPayload), & so transparently decoding transfer/content encoding may simply be carrying this approach through to its logical conclusion. Thanks for chiming in on this one Andy!
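
For readers unfamiliar with the "payload" rather than "record" view, here is a rough Python sketch of the idea. It is not the webarchive package's implementation: the `Record` class and the `payloads` function are hypothetical, records are assumed to be already parsed, and continuation segments are assumed to arrive in order after the record they continue.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass
class Record:
    """A pre-parsed WARC record, reduced to what this sketch needs."""
    warc_type: str
    record_id: str
    block: bytes
    segment_origin_id: Optional[str] = None  # set on continuation records


def payloads(records: Iterable[Record]) -> Iterator[Tuple[str, bytes]]:
    """Yield (record_id, merged block) for response/resource records only."""
    merged = {}  # insertion-ordered: record_id -> accumulated bytes
    for rec in records:
        if rec.warc_type in ("response", "resource"):
            merged[rec.record_id] = bytearray(rec.block)
        elif (rec.warc_type == "continuation"
              and rec.segment_origin_id in merged):
            # Append the segment to the record it continues.
            merged[rec.segment_origin_id] += rec.block
        # warcinfo, request, metadata, etc. are skipped entirely.
    for record_id, data in merged.items():
        yield record_id, bytes(data)


stream = [
    Record("warcinfo", "r0", b"software: example"),
    Record("request", "r1", b"GET / HTTP/1.1"),
    Record("response", "r2", b"abc"),
    Record("continuation", "r3", b"def", segment_origin_id="r2"),
    Record("resource", "r4", b"xyz"),
]
for record_id, data in payloads(stream):
    print(record_id, data)
```

This sketch buffers everything before yielding, which keeps the merging logic short at the cost of memory; decoding transfer or content encoding would be a further step applied to each merged payload.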