When analysing example.warc.gz containing an HTML response that was GZip encoded:
- application/warc
- application/gzip (outer gzip chunk)
- application/warc; version="1.0", type=response (The whole WARC Record)
- application/http; msgtype=response (WARC Record content, i.e. HTTP headers and entity body)
- application/gzip (i.e. the entity body is compressed)
- text/html; version=5
- application/gzip
- application/http; msgtype=response
- application/warc; version="1.0", type=response
- application/gzip
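As a rough sketch of what peeling those layers involves, the Python below builds a tiny in-memory stand-in for example.warc.gz and reports one type string per layer. Everything here is an illustrative assumption: the function names `build_example`, `split_headers` and `identify_layers` are hypothetical, the sample record omits mandatory WARC fields such as WARC-Record-ID and WARC-Date, and the innermost type is simply the server's declared Content-Type (getting to something like text/html; version=5 would need a real format identification tool).

```python
import gzip


def build_example() -> bytes:
    """Build a simplified stand-in for example.warc.gz (not a valid full WARC)."""
    html = b"<!DOCTYPE html><html><body>hi</body></html>"
    body = gzip.compress(html)  # Content-Encoding: gzip on the entity body
    http = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/html\r\n"
        b"Content-Encoding: gzip\r\n"
        b"Content-Length: " + str(len(body)).encode() + b"\r\n\r\n" + body
    )
    record = (
        b"WARC/1.0\r\n"
        b"WARC-Type: response\r\n"
        b"Content-Type: application/http; msgtype=response\r\n"
        b"Content-Length: " + str(len(http)).encode() + b"\r\n\r\n"
        + http + b"\r\n\r\n"
    )
    return gzip.compress(record)  # the outer gzip chunk


def split_headers(block: bytes):
    """Split 'first line, header fields, blank line, rest' into three parts."""
    head, _, rest = block.partition(b"\r\n\r\n")
    first, *field_lines = head.decode("iso-8859-1").split("\r\n")
    fields = {}
    for line in field_lines:
        name, _, value = line.partition(":")
        fields[name.strip().lower()] = value.strip()
    return first, fields, rest


def identify_layers(data: bytes) -> list:
    """Return one type string per layer, outermost first."""
    layers = []
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        layers.append("application/gzip")
        data = gzip.decompress(data)
    version_line, warc, rest = split_headers(data)
    version = version_line.split("/", 1)[1]
    layers.append(
        f'application/warc; version="{version}", type={warc["warc-type"]}'
    )
    payload = rest[: int(warc["content-length"])]
    layers.append(warc["content-type"])
    _status, http, body = split_headers(payload)
    if http.get("content-encoding") == "gzip":
        layers.append("application/gzip")
        body = gzip.decompress(body)
    # Declared type only; no sniffing of the decoded body is attempted here.
    layers.append(http.get("content-type", "application/octet-stream"))
    return layers
```

Calling `identify_layers(build_example())` walks the same chain as the list: outer gzip, WARC record, HTTP response, gzip-encoded entity body, and finally the declared text/html.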
It does depend on the question, really. I think most people think of WARC as a 'container for files', so describing the WARC structure may make about as much sense as trying to capture the structure and properties of the individual records in a ZIP file. So maybe you can ignore the WARC layer and just report on the Content, thus hiding any details from the other levels (like WARC record version, type, etc.).
Note that this is pretty much what I do right now when scanning the formats in our (W)ARCs.
But these are not files, and treating them as such means you are ignoring anything other than the WARC-Type: response records, and also largely ignoring the HTTP headers (except for allowing Content-Type to act as a hint). This hides the Content-Encoding and Transfer-Encoding, and other things like the HTTP or WARC version. It's not good enough if you want to really understand what you have.
But then again, I'm not expecting to get all the info I need in a single pass, at one point in time, with a single toolset. I'm expecting to go back over and over again to ask slightly different questions with (I hope) ever-improving tools. Which is exactly why I would never actually use that WARC-Identified-Payload-Type field in my WARC records.
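To make concrete what a content-only report leaves out, here is a minimal Python sketch that pulls the WARC version, HTTP version, Content-Encoding and Transfer-Encoding out of a single uncompressed response record. The names `parse_block`, `hidden_properties` and `SAMPLE` are hypothetical, the sample record is deliberately incomplete (no WARC-Record-ID, WARC-Date or Content-Length), the body is a placeholder that is never decoded, and the parsing assumes a WARC-Type: response record whose payload starts with an HTTP status line.

```python
def parse_block(block: bytes):
    """Split 'first line, header fields, blank line, rest' into three parts."""
    head, _, rest = block.partition(b"\r\n\r\n")
    first, *field_lines = head.decode("iso-8859-1").split("\r\n")
    fields = {}
    for line in field_lines:
        name, _, value = line.partition(":")
        fields[name.strip().lower()] = value.strip()
    return first, fields, rest


def hidden_properties(record: bytes) -> dict:
    """Collect the properties that a payload-only format scan would not report."""
    warc_line, warc, payload = parse_block(record)
    status_line, http, _body = parse_block(payload)
    return {
        "warc_version": warc_line.split("/", 1)[1],
        "warc_type": warc.get("warc-type"),
        # Status line looks like 'HTTP/1.1 200 OK' for response records.
        "http_version": status_line.split(" ", 1)[0].split("/", 1)[1],
        "declared_content_type": http.get("content-type"),
        "content_encoding": http.get("content-encoding"),
        "transfer_encoding": http.get("transfer-encoding"),
    }


# Simplified, illustrative record; not a complete or valid WARC record.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"Content-Type: application/http; msgtype=response\r\n"
    b"\r\n"
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Encoding: gzip\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
    b"(placeholder body, never decoded)"
)

props = hidden_properties(SAMPLE)
```

Because these values are recomputed from the stored headers on every run, the same records can be re-scanned later with different questions or better tools, without baking one tool's answer into the record itself.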