
@edsu
Last active December 23, 2015 22:39

Greetings,

At the Library of Congress we've recently been exploring rewriting a Java web archiving tool in Go. So far this has involved working with an existing body (~500TB) of data encoded using ISO/DIS 28500 aka the WARC file format. One of the features of WARC is its use of Gzip as a packaging format, which allows individual WARC records to be represented as separate members in the larger Gzip file. Or as the spec says:

Per section 2.2 of the GZIP specification, a valid GZIP file consists of any number of gzip "members", each independently compressed. Where possible, this property should be exploited to compress each record of a WARC file independently. This results in a valid GZIP file whose per-record subranges also stand alone as valid GZIP files. External indexes of WARC file content may then be used to record each record's starting position in the GZIP file, allowing for random access of individual records without requiring decompression of all preceding records.
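The property described in that passage can be illustrated with a minimal Go sketch. It assumes an in-memory buffer in place of a real WARC file, and the helper names `writeMembers` and `readMember` are hypothetical. Each record is written as its own gzip member, the starting offset of each member is recorded, and a single record is then decompressed from its byte subrange alone.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
)

// writeMembers (hypothetical helper) compresses each record as its own gzip
// member and returns the concatenated bytes plus each member's start offset.
func writeMembers(records []string) ([]byte, []int64, error) {
	var buf bytes.Buffer
	offsets := make([]int64, 0, len(records))
	for _, rec := range records {
		// The member begins wherever the previous one ended.
		offsets = append(offsets, int64(buf.Len()))
		zw := gzip.NewWriter(&buf)
		if _, err := zw.Write([]byte(rec)); err != nil {
			return nil, nil, err
		}
		// Close flushes the data and writes the member trailer.
		if err := zw.Close(); err != nil {
			return nil, nil, err
		}
	}
	return buf.Bytes(), offsets, nil
}

// readMember (hypothetical helper) decompresses only the member stored in
// data[start:end], without touching any preceding members.
func readMember(data []byte, start, end int64) (string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data[start:end]))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	out, err := io.ReadAll(zr)
	return string(out), err
}

func main() {
	records := []string{"record one\n", "record two\n", "record three\n"}
	data, offsets, err := writeMembers(records)
	if err != nil {
		log.Fatal(err)
	}
	// Random access to the second record: its subrange runs from its own
	// offset up to the offset of the next member.
	rec, err := readMember(data, offsets[1], offsets[2])
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("offsets=%v second=%q\n", offsets, rec)
}
```

In an actual archive the offsets would live in an external index, and the end of the last member would be the end of the file.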

We ran into difficulty using gzip.Reader, since it does not provide any insight into when a member has been read: it simply reads through all the members in the file. While looking for people with a similar problem, we came across a go-nuts thread started by Dan Kortschak, who needed to access individual members of a gzip file in Biogo, his project for processing genomic and metagenomic data sets.

We would like to propose a small design change to gzip that would introduce a MemberReader, which would expose when the end of a member has been reached, as well as the byte offset of that position in the underlying compressed data.

For example:
