@edsu
Created August 3, 2023 14:51
Check a specific WARC file that is being discussed in IIPC Slack
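#!/usr/bin/env python
# Walk every record in one WARC file and print its index and target URI.
# Reading each response body in full forces the whole payload to be decoded.
from warcio.archiveiterator import ArchiveIterator

with open('archive/rec-20230722210008512613-81a34b41ee13.warc.gz', 'rb') as stream:
    for i, record in enumerate(ArchiveIterator(stream)):
        print(i, record.rec_headers.get_header('WARC-Target-URI'))
        if record.rec_type == 'response':
            # Read the entire payload; the value itself is not used afterwards.
            content = record.content_stream().read()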
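
A possible companion check, sketched with the standard library only: .warc.gz files are conventionally written as a series of concatenated gzip members, one per record, so counting the members and confirming each one decompresses to its end can help narrow down where a file goes wrong. The helper name count_gzip_members is hypothetical and not part of warcio, and the sketch assumes the whole file fits in memory.

```python
import gzip
import zlib


def count_gzip_members(data: bytes) -> int:
    """Decompress every gzip member in data and return how many there were.

    Hypothetical helper, not part of warcio. Raises zlib.error on corrupt
    data and EOFError if a member ends before its gzip trailer.
    """
    members = 0
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        decomp.decompress(data)
        if not decomp.eof:
            raise EOFError(f"gzip member {members} is truncated")
        members += 1
        # Whatever follows the finished member is the start of the next one.
        data = decomp.unused_data
    return members


# Demonstration on in-memory data: two members, like two compressed records.
sample = gzip.compress(b"record one") + gzip.compress(b"record two")
print(count_gzip_members(sample))  # prints 2
```

To try it on a real file, pass the bytes from open(path, 'rb').read(); an exception then means the file is damaged at the gzip layer, before any WARC parsing happens.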