@eikeon
Forked from edsu/MemberReader.md
Last active December 23, 2015 22:49

Greetings,

At the Library of Congress we've recently been exploring rewriting a Java web archiving tool in Go. So far this has involved working with an existing body (~500TB) of data encoded using ISO/DIS 28500 aka the WARC file format. One of the features of WARC is its use of Gzip as a packaging format, which allows individual WARC records to be represented as separate members in the larger Gzip file. Or as the spec says:

Per section 2.2 of the GZIP specification, a valid GZIP file consists of any number of gzip "members", each independently compressed. Where possible, this property should be exploited to compress each record of a WARC file independently. This results in a valid GZIP file whose per-record subranges also stand alone as valid GZIP files. External indexes of WARC file content may then be used to record each record's starting position in the GZIP file, allowing for random access of individual records without requiring decompression of all preceding records.
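To make that property concrete, here is a minimal sketch using only the existing `compress/gzip` package. It compresses each record as its own member, records each member's starting offset (a stand-in for an external index), and then decompresses one subrange by itself. The helper names `writeMembers` and `readSubrange` are ours for illustration, and the records are placeholder strings rather than real WARC records.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// writeMembers (hypothetical helper) compresses each record as its own gzip
// member and returns the concatenated bytes plus every member's start offset.
func writeMembers(records []string) ([]byte, []int64, error) {
	var buf bytes.Buffer
	offsets := make([]int64, 0, len(records))
	for _, rec := range records {
		offsets = append(offsets, int64(buf.Len()))
		zw := gzip.NewWriter(&buf)
		if _, err := zw.Write([]byte(rec)); err != nil {
			return nil, nil, err
		}
		// Close flushes the member and writes its trailer (CRC-32 and size).
		if err := zw.Close(); err != nil {
			return nil, nil, err
		}
	}
	return buf.Bytes(), offsets, nil
}

// readSubrange (hypothetical helper) decompresses data[start:end] as if it
// were a standalone gzip file.
func readSubrange(data []byte, start, end int64) (string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data[start:end]))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	out, err := io.ReadAll(zr)
	return string(out), err
}

func main() {
	data, offsets, err := writeMembers([]string{"record one\n", "record two\n"})
	if err != nil {
		panic(err)
	}
	// The second member's byte range decompresses without touching the first.
	second, err := readSubrange(data, offsets[1], int64(len(data)))
	if err != nil {
		panic(err)
	}
	fmt.Printf("offsets=%v second=%q\n", offsets, second)
}
```

Note that this only works because the end of each subrange is already known from the index. Discovering those boundaries while reading is the part the existing reader does not help with.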

We ran into difficulty using gzip.Reader because it gives no indication of where one member ends and the next begins. It simply reads through all the members in the file as a single stream. While looking for people with a similar problem we came across a go-nuts thread started by Dan Kortschak, who needed to access individual members of a gzip file in biogo, his library for processing genomic and metagenomic data sets.
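The behaviour is easy to reproduce. The sketch below, again limited to the existing `compress/gzip` API, writes two members with different header names and reads them back with a default `gzip.Reader`. The helper names `twoMembers` and `readAll` and the file names in the headers are made up for the example.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// twoMembers (hypothetical helper) returns a gzip stream made of two members
// whose headers carry different Name fields.
func twoMembers() ([]byte, error) {
	var buf bytes.Buffer
	members := []struct{ name, body string }{
		{"first.warc", "first record\n"},
		{"second.warc", "second record\n"},
	}
	for _, m := range members {
		zw := gzip.NewWriter(&buf)
		zw.Name = m.name // must be set before the first Write
		if _, err := zw.Write([]byte(m.body)); err != nil {
			return nil, err
		}
		if err := zw.Close(); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

// readAll (hypothetical helper) decompresses data with a default gzip.Reader
// and returns the output together with the header name the reader reports
// once everything has been read.
func readAll(data []byte) (string, string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return "", "", err
	}
	defer zr.Close()
	out, err := io.ReadAll(zr)
	return string(out), zr.Name, err
}

func main() {
	data, err := twoMembers()
	if err != nil {
		panic(err)
	}
	out, name, err := readAll(data)
	if err != nil {
		panic(err)
	}
	fmt.Printf("output=%q header name=%q\n", out, name)
}
```

The output is the concatenation of both payloads, and the reader's header fields still describe the first member only, so neither the boundary nor the second header is visible to the caller.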

We would like to propose a small addition to gzip: a MemberReader that would report when the end of a member has been reached, as well as the byte offset of that position in the underlying compressed data.

For example, if you want to print out the header and the end position of each member:

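f, err := os.Open("test.gz")
if err != nil {
        return err
}
defer f.Close()

// NewMemberReader, EndOfMember and EndPosition are the proposed additions;
// they are not part of compress/gzip.
gz, err := gzip.NewMemberReader(f)
if err != nil {
        return err
}
for {
        // io.Copy returns nil once the underlying file is exhausted.
        _, err := io.Copy(ioutil.Discard, gz)
        switch err {
        case nil:
                return nil
        case gzip.EndOfMember:
                fmt.Printf("Header: %#v\n", gz.Header)
                fmt.Println("End Position:", gz.EndPosition())
        default:
                return err
        }
}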
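For comparison, a similar walk over members can be approximated with what `compress/gzip` offers in Go 1.4 and later, namely `Reader.Multistream(false)` and `Reader.Reset`. This is a sketch under assumptions, not the proposed MemberReader: the helpers `concatMembers` and `memberEnds` are hypothetical, and the end offsets are only exact because the input is a `bytes.Reader`, which implements `io.ByteReader` so the decompressor does not read past a member's trailer.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// concatMembers (hypothetical helper) gzips each body as its own member.
func concatMembers(bodies ...string) []byte {
	var buf bytes.Buffer
	for _, b := range bodies {
		zw := gzip.NewWriter(&buf)
		if _, err := zw.Write([]byte(b)); err != nil {
			panic(err)
		}
		if err := zw.Close(); err != nil {
			panic(err)
		}
	}
	return buf.Bytes()
}

// memberEnds (hypothetical helper) returns the offset just past each member.
func memberEnds(data []byte) ([]int64, error) {
	// bytes.Reader implements io.ByteReader, so gzip reads from it directly
	// and r.Len() reflects exactly how much compressed input was consumed.
	r := bytes.NewReader(data)
	zr, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	var ends []int64
	for {
		// Reset re-enables multistream mode, so switch it off every time.
		zr.Multistream(false)
		if _, err := io.Copy(io.Discard, zr); err != nil {
			return nil, err
		}
		ends = append(ends, int64(len(data)-r.Len()))
		// Reset reads the next member's header, or reports io.EOF if the
		// input is exhausted.
		if err := zr.Reset(r); err == io.EOF {
			return ends, nil
		} else if err != nil {
			return nil, err
		}
	}
}

func main() {
	ends, err := memberEnds(concatMembers("one\n", "two\n", "three\n"))
	if err != nil {
		panic(err)
	}
	fmt.Println("member end offsets:", ends)
}
```

With a plain `*os.File` the reader buffers its input, so the file's own position would run ahead of the member boundary. That gap is one reason an explicit end position on the reader is attractive.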

Then to read one member at a known position:

f, _ := os.Open("test.gz")
f.Seek(position, io.SeekStart) // position is relative to the start of the file
gz, _ := gzip.NewMemberReader(f)
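As a point of reference, reading a single member at a known offset can be sketched with the existing package by seeking and then disabling multistream mode (`Reader.Multistream`, available since Go 1.4). The helpers `buildIndexed` and `readMemberAt` are hypothetical, and an in-memory `bytes.Reader` stands in for the file so the example runs on its own.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// buildIndexed (hypothetical helper) gzips each body as its own member and
// returns a seekable reader over the result plus each member's start offset.
func buildIndexed(bodies ...string) (*bytes.Reader, []int64) {
	var buf bytes.Buffer
	var starts []int64
	for _, b := range bodies {
		starts = append(starts, int64(buf.Len()))
		zw := gzip.NewWriter(&buf)
		if _, err := zw.Write([]byte(b)); err != nil {
			panic(err)
		}
		if err := zw.Close(); err != nil {
			panic(err)
		}
	}
	return bytes.NewReader(buf.Bytes()), starts
}

// readMemberAt (hypothetical helper) decompresses only the member that
// starts at position.
func readMemberAt(rs io.ReadSeeker, position int64) ([]byte, error) {
	if _, err := rs.Seek(position, io.SeekStart); err != nil {
		return nil, err
	}
	zr, err := gzip.NewReader(rs)
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	zr.Multistream(false) // report io.EOF at the end of this member
	return io.ReadAll(zr)
}

func main() {
	rs, starts := buildIndexed("record one\n", "record two\n", "record three\n")
	rec, err := readMemberAt(rs, starts[1])
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", rec)
}
```

An `*os.File` satisfies `io.ReadSeeker` too, so the same function would apply to a file on disk, with the offsets coming from an external index.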

Thoughts? We are ready to work on an implementation once the design looks good.
