@akdh
Created February 23, 2015 22:43
import sys
import warc
import json
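if len(sys.argv) != 2:
    print("usage: %s FILENAME" % sys.argv[0])
    sys.exit(1)

filename = sys.argv[1]
f = warc.open(filename)
for record in f:
    # Only 'response' records carry a fetched page.
    if record.type != 'response':
        continue
    trec_id = record.header['warc-trec-id']
    # Decode as ASCII, silently dropping any bytes that do not decode.
    body = record.payload.read().decode('ascii', errors='ignore')
    # Drop the HTTP headers: the body starts after the first blank line.
    # If no blank line is found, keep the payload unchanged.
    header_end = body.find("\r\n\r\n")
    if header_end != -1:
        body = body[header_end + 4:]
    # Emit one JSON object per record.
    print(json.dumps({'id': trec_id, 'content': body}))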