@edsu
Created September 24, 2022 21:54
#!/usr/bin/env python3
#
# This demonstrates an inconsistency in results from the Internet Archive CDX
# API when querying with matchType=domain vs matchType=prefix. For context see:
#
# https://inkdroid.org/2022/09/24/pdfs/
#
# Note: you'll need to
#
# pip install git+https://github.com/edsu/wayback.git@t88-invalid-month
#
# until https://github.com/edgi-govdata-archiving/wayback/issues/88 is fixed
#
import pandas
from urllib.parse import urlparse
from wayback import WaybackClient
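ia = WaybackClient()
# Same site queried two ways: every capture under the assets subdomain (prefix)
# and every capture on lapdonline.org and its subdomains (domain)
prefix = pandas.DataFrame(ia.search('assets.lapdonline.org', matchType='prefix'))
domain = pandas.DataFrame(ia.search('lapdonline.org', matchType='domain'))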
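# Reduce each captured URL to its host (netloc keeps any explicit port)
def get_host(url):
    return urlparse(url).netloc
prefix['hostname'] = prefix.url.apply(get_host)
domain['hostname'] = domain.url.apply(get_host)
# Captures per host: prefix query first, then domain query
print(prefix.hostname.value_counts())
print(domain.hostname.value_counts())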
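#
# The two printed tables have to be compared by eye. As a rough sketch (not
# part of the original demonstration), the helpers below reduce the comparison
# to one number for a single host. count_host and host_count_gap are
# hypothetical names introduced here for illustration; they only need
# iterables of URL strings, such as the url columns built above.
#
from urllib.parse import urlparse

def count_host(urls, host):
    """Count the URLs whose netloc is exactly `host`."""
    return sum(1 for u in urls if urlparse(u).netloc == host)

def host_count_gap(prefix_urls, domain_urls, host):
    """Captures for `host` in the prefix results minus those in the domain results."""
    return count_host(prefix_urls, host) - count_host(domain_urls, host)

# Possible usage with the data frames above. A non-zero value would mean the
# two queries disagree about how many captures exist for that host:
#
# print(host_count_gap(prefix.url, domain.url, 'assets.lapdonline.org'))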