When analysing example.warc.gz containing an HTML response that was gzip encoded:
- application/warc
  - application/gzip (outer gzip chunk)
    - application/warc; version="1.0", type=response (the whole WARC record)
      - application/http; msgtype=response (WARC record content, i.e. HTTP headers and entity body)
        - application/gzip (i.e. the entity body is compressed)
          - text/html; version=5
This is helpful. At the moment, sf already makes most of the compromises you describe. It iterates through a WARC file by "payload" rather than by "record", skipping records that are not of type response or resource & merging continuations (https://godoc.org/github.com/richardlehane/webarchive#WARCReader.NextPayload), so transparently decoding transfer/content encoding may simply be carrying this approach through to its logical conclusion. Thanks for chiming in on this one, Andy!
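To make the layering concrete, here is a minimal sketch of peeling such a file down to its payload. It uses only the Go standard library and is not how sf or the webarchive package does it. The sample record, the helper names (`gz`, `sampleWARCGz`, `unwrap`) and the hand-rolled WARC header parsing are all assumptions made for illustration, and it only handles a single, well-formed response record.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"net/textproto"
	"strconv"
	"strings"
)

// gz gzip-compresses b. It is only used to build the in-memory sample.
func gz(b []byte) []byte {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write(b) // writes to a bytes.Buffer do not fail
	w.Close()
	return buf.Bytes()
}

// sampleWARCGz builds a hypothetical one-record example.warc.gz whose HTTP
// entity body is itself gzip encoded.
func sampleWARCGz(html string) []byte {
	body := gz([]byte(html))
	httpHead := fmt.Sprintf("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n"+
		"Content-Encoding: gzip\r\nContent-Length: %d\r\n\r\n", len(body))
	block := append([]byte(httpHead), body...)
	warcHead := fmt.Sprintf("WARC/1.0\r\nWARC-Type: response\r\n"+
		"Content-Type: application/http; msgtype=response\r\n"+
		"Content-Length: %d\r\n\r\n", len(block))
	record := append([]byte(warcHead), block...)
	record = append(record, "\r\n\r\n"...) // record terminator
	return gz(record)
}

// unwrap peels the first record layer by layer. It returns the media type
// seen at each layer (outermost first) and the fully decoded payload.
func unwrap(data []byte) ([]string, []byte, error) {
	layers := []string{"application/gzip"}
	outer, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, nil, err
	}
	br := bufio.NewReader(outer)
	tp := textproto.NewReader(br)
	version, err := tp.ReadLine() // e.g. "WARC/1.0"
	if err != nil {
		return nil, nil, err
	}
	hdr, err := tp.ReadMIMEHeader() // WARC named fields
	if err != nil {
		return nil, nil, err
	}
	layers = append(layers, fmt.Sprintf("application/warc; version=%q, type=%s",
		strings.TrimPrefix(version, "WARC/"), hdr.Get("WARC-Type")))
	layers = append(layers, hdr.Get("Content-Type"))

	n, err := strconv.ParseInt(hdr.Get("Content-Length"), 10, 64)
	if err != nil {
		return nil, nil, err
	}
	// The record block is an HTTP message. ReadResponse undoes any chunked
	// transfer encoding but leaves Content-Encoding alone.
	resp, err := http.ReadResponse(bufio.NewReader(io.LimitReader(br, n)), nil)
	if err != nil {
		return nil, nil, err
	}
	defer resp.Body.Close()
	var body io.Reader = resp.Body
	if strings.EqualFold(resp.Header.Get("Content-Encoding"), "gzip") {
		layers = append(layers, "application/gzip")
		zr, err := gzip.NewReader(resp.Body)
		if err != nil {
			return nil, nil, err
		}
		body = zr
	}
	layers = append(layers, resp.Header.Get("Content-Type"))
	payload, err := io.ReadAll(body)
	return layers, payload, err
}

func main() {
	layers, payload, err := unwrap(sampleWARCGz("<!DOCTYPE html><title>hi</title>"))
	if err != nil {
		panic(err)
	}
	for depth, l := range layers {
		fmt.Printf("%s- %s\n", strings.Repeat("  ", depth), l)
	}
	fmt.Printf("payload: %s\n", payload)
}
```

Running it prints one line per layer, indented by depth, followed by the decoded HTML. A format identifier that works "by payload" would presumably report only the innermost type and treat the outer layers as transport, which is the compromise discussed above.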