from warcio.archiveiterator import ArchiveIterator

# Iterate over the records of a gzipped WET (extracted plain text) file.
with open('path/to/file.wet.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # 'conversion' records hold the extracted text of a crawled page.
        if record.rec_type == 'conversion':
            url = record.rec_headers.get_header('WARC-Target-URI')
            text = record.content_stream().read().decode('utf-8')