@sebastian-nagel
Created October 9, 2019 13:07
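from warcio.archiveiterator import ArchiveIterator

# Open the gzipped WET file in binary mode; warcio decompresses it on the fly.
with open('path/to/file.wet.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # WET files store the extracted plain text in 'conversion' records.
        if record.rec_type == 'conversion':
            url = record.rec_headers.get_header('WARC-Target-URI')
            text = record.content_stream().read().decode('utf-8')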
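
The loop above overwrites `url` and `text` on every record, so in practice the pairs need to be handed to some consumer. A minimal sketch of one way to do that, wrapping the same logic in a generator: the names `iter_wet_texts` and `read_wet` are hypothetical helpers (not part of warcio), and `errors='replace'` is an assumption added so a stray invalid byte does not abort the whole file.

```python
def iter_wet_texts(records):
    """Yield (url, text) for every 'conversion' record in an iterable of records.

    Hypothetical helper: `records` is expected to behave like warcio's
    ArchiveIterator, i.e. each item has rec_type, rec_headers and content_stream().
    """
    for record in records:
        # Skip warcinfo and any other non-text records.
        if record.rec_type != 'conversion':
            continue
        url = record.rec_headers.get_header('WARC-Target-URI')
        # Assumption: replace undecodable bytes instead of raising UnicodeDecodeError.
        text = record.content_stream().read().decode('utf-8', errors='replace')
        yield url, text


def read_wet(path):
    """Yield (url, text) pairs from the WET file at `path` (hypothetical helper)."""
    # Imported here so the helper above can be used without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    with open(path, 'rb') as stream:
        yield from iter_wet_texts(ArchiveIterator(stream))
```

Usage would then look like `for url, text in read_wet('path/to/file.wet.gz'): ...`, with the file kept open only while the generator is being consumed.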