Skip to content

Instantly share code, notes, and snippets.

View Segerberg's full-sized avatar

Andreas Segerberg Segerberg

View GitHub Profile
#!/usr/bin/env python
"""
This utility extracts media urls from tweet jsonl.gz and save them as warc records.
Warcio (https://github.com/webrecorder/warcio) is a dependency and before you can use it you need to:
% pip install warcio
You run it like this:
% python media2warc.py /mnt/tweets/ferguson/tweets-0001.jsonl.gz /mnt/tweets/ferguson/tweets-0001.warc.gz