@eikeon
eikeon / MemberReader.md
Last active December 23, 2015 22:49 — forked from edsu/MemberReader.md

Greetings,

At the Library of Congress we've recently been exploring rewriting a [Java web archiving tool][1] in Go. So far this has involved working with an existing body (~500TB) of data encoded using [ISO/DIS 28500][2] aka the WARC file format. One of the features of WARC is its use of [Gzip][3] as a packaging format, which allows individual WARC records to be represented as separate members in the larger Gzip file. Or as the spec says:

> Per section 2.2 of the GZIP specification, a valid GZIP file consists of any number of gzip "members", each independently compressed. Where possible, this property should be exploited to compress each record of a WARC file independently. This results in a valid GZIP file whose per-record subranges also stand alone as valid GZIP files. External indexes of WARC file content may then be used to record each record's starting position in the GZIP file, allowing for random access of individual records without requiring decompression of all preceding records.
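
To make the per-member property concrete, here is a minimal sketch of what the spec describes. It is an illustration only, not the tool's actual code: `writeMembers` and `readMember` are hypothetical helper names, the records are plain strings rather than real WARC records, and the "external index" is just a slice of offsets held in memory. It assumes a Go release that provides `(*gzip.Reader).Multistream` and `io.ReadAll`.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
)

// writeMembers compresses each record as its own gzip member and returns the
// concatenated bytes plus the starting offset of every member. The offsets
// stand in for the external index mentioned in the spec.
// (Hypothetical helper, for illustration only.)
func writeMembers(records []string) ([]byte, []int64, error) {
	var buf bytes.Buffer
	offsets := make([]int64, 0, len(records))
	for _, rec := range records {
		offsets = append(offsets, int64(buf.Len()))
		zw := gzip.NewWriter(&buf)
		if _, err := zw.Write([]byte(rec)); err != nil {
			return nil, nil, err
		}
		// Close flushes the data and writes the member trailer.
		if err := zw.Close(); err != nil {
			return nil, nil, err
		}
	}
	return buf.Bytes(), offsets, nil
}

// readMember seeks to offset and decompresses exactly one gzip member,
// without touching any of the members that precede it.
// (Hypothetical helper, for illustration only.)
func readMember(data []byte, offset int64) (string, error) {
	r := bytes.NewReader(data)
	if _, err := r.Seek(offset, io.SeekStart); err != nil {
		return "", err
	}
	zr, err := gzip.NewReader(r)
	if err != nil {
		return "", err
	}
	defer zr.Close()
	// Stop at the end of the current member instead of continuing
	// into the next one.
	zr.Multistream(false)
	out, err := io.ReadAll(zr)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	data, offsets, err := writeMembers([]string{"record one", "record two", "record three"})
	if err != nil {
		log.Fatal(err)
	}
	// Random access: jump straight to the second member.
	got, err := readMember(data, offsets[1])
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("offset %d: %q\n", offsets[1], got)
}
```

With a real WARC file the same idea would apply to an `*os.File` and an index stored on disk; the in-memory buffer here only keeps the sketch self-contained.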

We ran into di

@eikeon
eikeon / ammem.go
Created January 19, 2012 20:22 — forked from edsu/ammem.go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
	"net/http"
)

type Set struct {
@eikeon
eikeon / emacs.rb
Created August 1, 2011 13:04 — forked from pingles/emacs.rb
Homebrew Emacs for OSX Lion with native full-screen
require 'formula'

class Emacs < Formula
  url 'http://ftp.gnu.org/pub/gnu/emacs/emacs-23.3.tar.bz2'
  md5 'a673c163b4714362b94ff6096e4d784a'
  homepage 'http://www.gnu.org/software/emacs/'

  if ARGV.include? "--use-git-head"
    head 'git://repo.or.cz/emacs.git'
  else