@benoit74
Created November 16, 2023 07:21
List requests present in WARC files
import glob

from warcio import ArchiveIterator

# Walk every gzipped WARC file produced by the crawl.
for warc in glob.glob("output/.tmph919m5n3/collections/crawl-*/archive/*.warc.gz"):
    with open(warc, "rb") as fh:
        for record in ArchiveIterator(fh):
            # Only request records are of interest; responses and metadata are skipped.
            if record.rec_type == "request":
                print(record.rec_headers.get_header('WARC-Target-URI'))
print("DONE")
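
When a crawl produces many request records, the printed list can be hard to scan. The sketch below is one possible follow-up and is not part of the original gist: it assumes the `WARC-Target-URI` values have been collected into a list instead of printed, and it groups them by host using only the standard library. The helper name `count_requests_by_host` and the sample URIs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_requests_by_host(uris):
    """Return a Counter mapping each host to the number of request URIs seen."""
    hosts = Counter()
    for uri in uris:
        if not uri:  # skip records that carry no WARC-Target-URI header
            continue
        hosts[urlsplit(uri).netloc] += 1
    return hosts


if __name__ == "__main__":
    # Hypothetical sample; in practice, append each WARC-Target-URI to a list
    # inside the loop above and pass that list in.
    sample = [
        "https://example.com/",
        "https://example.com/style.css",
        "https://cdn.example.org/app.js",
    ]
    for host, count in count_requests_by_host(sample).most_common():
        print(f"{count:6d}  {host}")
```

Counting by host rather than by full URI keeps the summary short and makes it easier to spot unexpected third-party requests in the archive.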