@JaimieMurdock
Last active September 25, 2017 22:32
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# IA API Workshop Notebook"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook contains a tutorial for using the various APIs at the Internet Archive via Python.\n",
"\n",
"Rather than using command line tools such as `curl` and `jq` to parse information, we will use Python to process each response. This will make it easy to visualize the data and do advanced transformations on the results.\n",
"\n",
"## Libraries\n",
"\n",
"\n",
"Python's standard library includes [`http`](https://docs.python.org/3/library/http.html) and [`urllib.request`](https://docs.python.org/3/library/urllib.request.html) modules for retrieving data from API endpoints. However, the documentation itself recommends using [`requests`](http://docs.python-requests.org/en/master/), a third-party library for interacting with APIs.\n",
"\n",
"Two especially useful features of `requests` are its built-in JSON decoding and its HTTP authentication support.\n",
"\n",
"We also use [`matplotlib`](https://matplotlib.org/) for visualization, especially the [`pyplot`](https://matplotlib.org/api/pyplot_summary.html) module."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"import requests"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# APIs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Wayback Availability API\n",
"\n",
"We start with the Wayback Availability API. First, we use `requests.get()` to complete a request cycle. Then, we show how to generalize the request into a function that reports whether any snapshot of a URL has been archived."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'archived_snapshots': {'closest': {'available': True,\n",
" 'status': '200',\n",
" 'timestamp': '20170925160637',\n",
" 'url': 'http://web.archive.org/web/20170925160637/http://example.com'}},\n",
" 'url': 'example.com'}"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Demonstrate a successful Wayback Availability query\n",
"URL = 'example.com'\n",
"r = requests.get('http://archive.org/wayback/available?url={}'.format(URL))\n",
"r.json()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'archived_snapshots': {}, 'url': 'foo.example.com'}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Demonstrate an unsuccessful Wayback Availability query\n",
"URL = 'foo.example.com'\n",
"r = requests.get('http://archive.org/wayback/available?url={}'.format(URL))\n",
"r.json()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"example.com True\n",
"foo.example.com False\n"
]
}
],
"source": [
"# convert to a boolean function\n",
"def available(URL):\n",
"    r = requests.get('http://archive.org/wayback/available?url={}'.format(URL))\n",
"    data = r.json()\n",
"    return bool(data['archived_snapshots'])\n",
"\n",
"print(\"example.com\", available('example.com'))\n",
"print(\"foo.example.com\", available('foo.example.com'))"
]
},
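{
"cell_type": "markdown",
"metadata": {},
"source": [
"When a snapshot exists, the response nests it under `archived_snapshots` and then `closest`. As a minimal sketch (not part of the original workshop code), the helpers below build the query string with `urllib.parse.urlencode`, so that characters such as `&` and `?` in the target URL are escaped, and pull the snapshot record out of an already-decoded response. The names `availability_url` and `closest_snapshot` are made up for illustration, and the optional `timestamp` parameter is assumed to work as described in the Wayback Availability API documentation.\n",
"\n",
"```python\n",
"from urllib.parse import urlencode\n",
"\n",
"def availability_url(url, timestamp=None):\n",
"    # Build the query URL; urlencode escapes characters such as '&' and '?'.\n",
"    params = {'url': url}\n",
"    if timestamp is not None:\n",
"        params['timestamp'] = timestamp  # assumed format: YYYYMMDDhhmmss\n",
"    return 'http://archive.org/wayback/available?' + urlencode(params)\n",
"\n",
"def closest_snapshot(data):\n",
"    # Return the 'closest' snapshot dict, or None when nothing is archived.\n",
"    return data.get('archived_snapshots', {}).get('closest')\n",
"\n",
"# Sample responses copied from the two queries above\n",
"hit = {'archived_snapshots': {'closest': {'available': True,\n",
"                                          'status': '200',\n",
"                                          'timestamp': '20170925160637',\n",
"                                          'url': 'http://web.archive.org/web/20170925160637/http://example.com'}},\n",
"       'url': 'example.com'}\n",
"miss = {'archived_snapshots': {}, 'url': 'foo.example.com'}\n",
"\n",
"print(availability_url('example.com/?a=1&b=2'))\n",
"print(closest_snapshot(hit)['timestamp'])\n",
"print(closest_snapshot(miss))\n",
"```\n",
"\n",
"Keeping the URL construction separate from the request makes both helpers easy to check without touching the network."
]
},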
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Wayback CDX API\n",
"\n",
"The Wayback CDX (Capture InDeX) stores one record per capture in an 11-column table keyed by URL and timestamp. By default, the CDX API returns seven of those columns as space-separated text, one capture per line. This API truly benefits from Python, as we can build a custom parsing function to extract and convert the values from each record."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"r = requests.get('http://web.archive.org/cdx/search?url=example.com')"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<generator object parse_cdx at 0x7f57e10f7150>"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import csv\n",
"from io import StringIO\n",
"from datetime import datetime\n",
"\n",
"def parse_cdx(URL):\n",
"    \"\"\"\n",
"    Parses the data from a Wayback CDX API request and returns a generator of CDX data.\n",
"    It converts the timestamp and length into native datatypes. The statuscode stays a\n",
"    string, because revisit records report it as '-'.\n",
"    \"\"\"\n",
"    FIELDS = ['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode', 'digest', 'length']\n",
"    r = requests.get('http://web.archive.org/cdx/search?url={}'.format(URL))\n",
"    text = StringIO(r.text)\n",
"    reader = csv.DictReader(text, fieldnames=FIELDS, delimiter=' ')\n",
"    for row in reader:\n",
"        try:\n",
"            row['timestamp'] = datetime.strptime(row['timestamp'], '%Y%m%d%H%M%S')\n",
"            row['length'] = 0 if row['statuscode'] == '-' else int(row['length'])\n",
"            yield row\n",
"        except ValueError:\n",
"            # skip rows whose timestamp or length cannot be parsed\n",
"            pass\n",
"\n",
"# Note, this only returns the generator! Not a list!\n",
"parse_cdx('example.com')"
]
},
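{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what `parse_cdx` does to a single record without making a network request, the sketch below applies the same conversions to one line of CDX text. The helper name `parse_cdx_line` is hypothetical, `str.split` stands in for `csv.DictReader` purely for brevity, and the second sample line uses invented values (including a placeholder digest).\n",
"\n",
"```python\n",
"from datetime import datetime\n",
"\n",
"FIELDS = ['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode', 'digest', 'length']\n",
"\n",
"def parse_cdx_line(line):\n",
"    # Split one space-delimited CDX record into a dict keyed by FIELDS.\n",
"    row = dict(zip(FIELDS, line.split(' ')))\n",
"    row['timestamp'] = datetime.strptime(row['timestamp'], '%Y%m%d%H%M%S')\n",
"    # Revisit records (status '-') count as zero bytes, as in parse_cdx.\n",
"    row['length'] = 0 if row['statuscode'] == '-' else int(row['length'])\n",
"    return row\n",
"\n",
"# A revisit record of the kind the API returns for example.com\n",
"revisit = parse_cdx_line('com,example)/ 20090925093245 http://www.example.com/ '\n",
"                         'warc/revisit - EF7YLJGKQUMLJFP3F7A7LBALC65T5W2O 498')\n",
"\n",
"# Invented values for illustration only\n",
"capture = parse_cdx_line('com,example)/ 20170925160637 http://example.com/ '\n",
"                         'text/html 200 HYPOTHETICALDIGEST 1234')\n",
"\n",
"print(revisit['timestamp'], revisit['statuscode'], revisit['length'])\n",
"print(capture['timestamp'], capture['statuscode'], capture['length'])\n",
"```\n",
"\n",
"Note that the revisit record reports a length in the raw text, but the parsed row carries zero, so it will not add to any size totals."
]
},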
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### CDX API Example: Counting Response Codes"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure at 0x7f580c4a4160>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from collections import defaultdict\n",
"\n",
"counts = defaultdict(int)\n",
"for row in parse_cdx('example.com'):\n",
"    counts[row['statuscode']] += 1\n",
"\n",
"# get the values and sort them\n",
"codes = sorted(counts.keys())\n",
"data = [counts[code] for code in codes]\n",
"\n",
"# plot the data\n",
"plt.title(\"Wayback CDX results by HTTP Status Code\")\n",
"plt.bar(np.arange(len(data)), data)\n",
"plt.xlabel(\"Status code\")\n",
"plt.ylabel(\"# URLs\")\n",
"plt.xticks(np.arange(len(data)), codes)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You'll notice that most of the captures have a status code of `-`. These entries also have a MIME type of `warc/revisit`.\n",
"\n",
"```\n",
"com,example)/ 20090925093245 http://www.example.com/ warc/revisit - EF7YLJGKQUMLJFP3F7A7LBALC65T5W2O 498\n",
"```\n",
"\n",
"`warc/revisit` marks a URL that was crawled again but whose content had not changed since the previous capture. Rather than store the same data twice, the crawler records that the content was unchanged and continues with the crawl. This drastically reduces the size of the WARC files, but it complicates analyses like this one, because a revisit record carries `-` in place of a status code.\n",
"\n",
"We handle revisits by setting their `length` to zero, so that they do not inflate the crawl size statistics.\n",
"When calculating response code statistics, we skip the entries with a `-` status code.\n",
"When calculating crawl visit statistics, we include them."
]
},
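{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two counting rules above can be sketched with a pair of tallies. This is only an illustration on hand-written rows: the values are invented and the variable names are not from the original notebook. With real data, the rows would come from `parse_cdx`.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# Invented rows carrying only the field this example needs\n",
"rows = [{'statuscode': '200'}, {'statuscode': '-'},\n",
"        {'statuscode': '302'}, {'statuscode': '-'},\n",
"        {'statuscode': '200'}]\n",
"\n",
"# Response-code statistics: skip revisits, which carry no status code\n",
"code_counts = Counter(r['statuscode'] for r in rows if r['statuscode'] != '-')\n",
"\n",
"# Crawl-visit statistics: every row is a visit, revisit or not\n",
"visits = len(rows)\n",
"revisits = sum(1 for r in rows if r['statuscode'] == '-')\n",
"\n",
"print(dict(code_counts), visits, revisits)\n",
"```\n",
"\n",
"Comparing strings with `!=` rather than an identity check keeps the filter reliable regardless of how Python happens to store the `-` string."
]
},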
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure at 0x7f57e0fe6048>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from collections import defaultdict\n",
"\n",
"counts = defaultdict(int)\n",
"for row in parse_cdx('example.com'):\n",
"    # skip revisit records, which report '-' instead of a status code\n",
"    if row['statuscode'] != '-':\n",
"        counts[row['statuscode']] += 1\n",
"\n",
"# get the values and sort them\n",
"codes = sorted(counts.keys())\n",
"data = [counts[code] for code in codes]\n",
"\n",
"# plot the data\n",
"plt.title(\"Wayback CDX results by HTTP Status Code\")\n",
"plt.bar(np.arange(len(data)), data)\n",
"plt.xlabel(\"Status code\")\n",
"plt.ylabel(\"# URLs\")\n",
"plt.xticks(np.arange(len(data)), codes)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Much better!**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### CDX API Example: Cumulative Capture Size\n",
"\n",
"In this example, we will use the timestamps to plot how much data was collected over time for a URL.\n",
"\n",
"It uses the [`matplotlib.dates`](https://matplotlib.org/api/dates_api.html) module to improve the chart's legibility."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEWCAYAAACJ0YulAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XmcXFWZ//HPl7CEJYBARAgJiRBlkJGtBVRQXBgBF3CQ\nkcVxNzKKyqgzgDqCgzqMjPxEQRCQwQ1QZBFliaOCATFKIhEIEIhsWUTCviuB5/fHOdW5Kaqqb3fq\n1tL9fb9e/eq6+1N1l+eec+6iiMDMzAxgtW4HYGZmvcNJwczMBjkpmJnZICcFMzMb5KRgZmaDnBTM\nzGyQk0IHSTpW0vdXYfr5kvZsY0gdJelsSV+sYL4haet2z7ffSfqMpDPbPM+1Jf1U0iOSzm/nvHuZ\npLskvbHbcXTCmEgKkg6RNEfS45L+LOlySbt3O65WGh1AI+JlEXFVm5czU9KRhe5J+SDbqN+L2rns\nbpC0Zk7Ot0t6Iu/sZ0ma2oZ5V5L0hljmfpLmSXpU0v2SfiVpGkBEfDkiPtjmRb4D2BTYOCIObPO8\nrQeM+qQg6ZPA14AvkzbmKcApwNu6GVcPmQW8ptD9GuDWBv1uj4h7OxlYRX5MWveHABsA2wNzgDd0\nMygASasPc/ytge8CnyJ9l2mkbfvZ9kc3aEvgtohYPtwJh/v9rEsiYtT+kXaUx4EDW4xzNvDFQvee\nwOJC913AvwE3AE8A3yYll8uBx4BfAC9oNG1h+jfmz8cC3y8MOx+4F3iEdHB+We4/A3gG+FuO/6fF\neQGbA08BGxXmtSNwP7BG7n4/cAvwEDAT2LLJ998DeBhYLXd/E/gw8Je6fmfmzy8AfgYsy/P+GbBF\nHnYgMLdu/p8EflL4rU8D/i//dr8uxgWcBCwCHgXmAnsUho0DPgP8KU87F5ichwWwdf68e57Hng2+\n6xvz7za5xfbwvvy7PQbcAXy4ftvIcdyf18ehQ6yzwdjqt7fC/I7M28H3cv+3APPyerkWeHmTWN8B\nzGvxXY4lb2/AyTmu2t9y4Ng8bHPggrxO7wQ+3mR+X8jf75k8jw+QTiw/B9wN3EdKUhvk8afm7/8B\n4B5gVpP5Nvy+wFbAg8BOhTiX1dZtyXX17zmuPwP7A/sCt+X5fqbut/ox8MM8vz8A2zfZj1cDjiJt\niw8AP6KwLzb4fh8qxHlz4fv8HXBV/t7zgbfVbSffJB1nHgd+A7yIdIL7EOnEbcdKjptVzLRX/oC9\n88a/eotxzmbopDCblAgm5Q3sD6SD8HjgV8AxjaZtsDEdy8pJ4f3ABGCtvLLnNYurwbx+BXyoMOwE\n4LT8eT9gYd7oVifttNc2+f5rkQ6UO+bum4AX542w2O/d+fPGwAHAOjn284GLC/N6EPi7wvyvBw4o\nfKfHSCWPtUhJ4JrCuO/K81+ddPZ7LzA+D/s34EbgpYBIZ/gb52EBbJ3X9yJglybf9Xjg10NsM28m\nHYwEvBZ4khU78Z55ezoxx/9a0onCS1uss6GSwnLgv/P81iZtV/cBu5IS4Xvyel+rQawvBp4G/h/w\nOmC9uuHHUtjeCv13IB1cdyQd4OYCnwfWzPO8A3hTk99npXmStuGFebr1gAtZkdym5u//XWBdYO0G\n82v5fUkH1JtJ29tM4H+Gua4+D6yR57MMOIe03b6MtN1PK3yvZ0iJdg3g06QEWTvJuosV+94nSMeE\nLfJ6+xZwbpPf60BgCfCKHOfWpNLWGvl3+0z+3V9P2jeK29L9wM6sOM7cCbw7/05fBK6s5LhZxUyr\n/gPOyhvSTUOMdyhpp52X/24DHq4b52yGTgqHFrovAE4tdH+MFQfFlaZtsDEdS4OdNA/bMO9AGzSK\nq8G8Pgj8Kn8W6WD4mtx9OfCBwnSr5R1myybLvipv6BsBi3K/4wv9nmsx7Q7AQ4XuU4Ev5c8vI53V\nrFX4TucVxl2PVNXR8Mw9T7t9/rwA2K/JeAEc
TTpb3a7F9nBGcfklt7WLgU8U1u9yYN3C8B8B/9Fi\nnQ2VFP5GTnyF3++4unksAF7bJL7dcgzLSAnibHJyaLS9ARPzdnRQ7t4VuKdunKOB/22yvJXmCfwS\n+Eih+6Wkg+vqrEgKL27x+w75fYFLSCcEN9AgObZYV08B43L3hBzLroXx5wL7F77X7Lp95s/k0ior\n73u3AG8ojLtZ7Ts3iGlmLaa6/nuQTnpWK/Q7lxWlt7OBMwrDPgbcUuj+e+qOZe3669c2hbNJZ4VD\neYCUyQciYgfgG6QzmeH6S+HzUw261xvuDCWNk3S8pD9JepS00QFsUnIWFwCvlLQZ6cz7OeDqPGxL\n4CRJD0t6mHT2LlJJp5Fau8IepBICwDWFfosi4u4c9zqSviXp7hz3LGBDSePydN8BDpEk4J+BH0XE\nXwvLWlT7EBGP59g2z/P+tKRb8pUtD5Oq/2q/x2RScb2ZI/KybmoxzgOkHbgpSftImi3pwRzDvqy8\nTh6KiCcK3XfX4h+hZRHxdKF7S+BTtXWXY5jcbBkRMTsi/ikiJpLW1WuAzzYaV9IapCqScyLivMLy\nNq9b3mdIJeMyNif9BjV3kxJCcfpFNFfm+54BbAd8o7gtlVhXD0RErX3lqfy/1b5b3DafI1U/Nfrd\ntwQuKsR7C+nkptFv1my73Zy0Xz1X6Hc3K++jbT/ulNGXSSEiZpEOJoMkbSXpCklzJV0taRvgt8Bf\nSXWJAAeTsnHRE6Siac2qXGGz0rzygXJik3EPIVXzvJF08Jtamyz/j1YLioiHgJ8D78zzOi/yKQRp\n4/5wRGxY+Fs7Iq5tMrtZrDig1BLLb4BX536zCuN+inQ2uGtErM+KBmnluGaTzn73yHF9r25Zk2sf\nJK1HKokslbQHqf73n0htNBuS2lpqv8ciUlVBMwcC+0v6RItxfgHsImmLRgMlrUVKtv8DbJpjuKwQ\nA8ALJK1b6J4CLM2fG62zJ2m9fdVPs4hU0iquu3Uion67fZ6IuI500rNdk1G+QWqv+Vzd8u6sW96E\niNh3qOVlS0kHyZoppNJU8QDWaltu+X3zNvI1UlvesZI2yv3LrKvhKm6bq5Gqh5Y2GG8RsE9dzOMj\nYkmTcRttt0uByXk5NVNIVU1d1ZdJoYnTgY9FxM6k+sBvRsQjpDrFUyR9iHR1xqx8hvGVPN08YF9J\nG+VLLo9YhRhuA8ZLenM+K/scqaTSyARSwnqAdND4ct3wv5DqaVs5h1TH+I78ueY04GhJLwOQtIGk\nVpcP/pZUffUuclLISWdZ7ldMChNIZykP5x30mAbz+y6pYfOZiLimbti+knaXtCZwHKnIvijPd3le\n5uqSPg+sX5juTOA4SdOVvFzSxoXhS0lXEH1C0r80+pIR8QtSI/dFknaWtLqkCZIOk/R+Ut3uWjmG\n5ZL2Af6hway+kC9t3YPUSFq7Xr/ROptHKjmNk7Q3qe67lTOAwyTtmr/nunl7mlA/Yv4dPyTphbl7\nG9KVVbMbjPvhvOxD685Ofw88JulIpXsQxknaTtIrhoiz5lzgXyVNywfwLwM/jPJXJw31fU8C5kS6\ntPZS0rYN5dfVcOws6R/zVVJHkPbP5/2WOYYvSdoSQNJESfs1meeZwKfz9iZJW+fpfkc6Yfh3SWso\n3X/0VuC8JvPpmFGRFPLG+CrgfEnzSA0/mwFExFdJV8AcQypa3g0cTqp/hHQm+0dS9c3PSVcfjEhO\nQh8hbQhLSCWHxU1G/26OZQmpIa1+4/s2sG0uol5cP3F2CTAduDci/liI4yJS4+V5uYrnJmCfFnE/\nQapfXTOPW3M18EJWTgpfIzWI3p9jvqLBLL9HOlttdKPeOaR18SCpEe1duf/MPK/bSL/L06xc7XAi\nqe7856Sz3W/nOIrf4x5SYjhKUrPr899BOqP8IakkchMwAPwiIh4DPp6X8xCppHNJ3fT35mFLgR8A\nh0XErXlY
o3X2CdLO/jCpjavZuqx9hzmkRtGT83IWAu9tMvrDpCRwo6THSb/fRcBXGox7MClhLVW6\nX+dxSZ/J1StvIbUN3Ular2eSSq9lnEVa37Py9E+T6r9LafV984F2b6CW5D8J7CTp0JLrarh+Qip5\nP0Sq+vzHiHimwXgn5WX9XNJjpP1g19rA/Nvukb/f+cCXSNv9Y6T1v1FE/I20XexD+s2/SbqY41a6\nTCtqHPqL0s1GP4uI7SStDyyIiKb1xZKuBz7aogrF2kTS2qQLAXaKiNu7HU+75LO570dEw+on61+S\njiVdEPCuocYd7UZFSSEiHgXurFWR5GLa9rXhuVj9AlI1iVXvX4DrRlNCMBsr+vIOQ0nnki4520TS\nYlJ1xKHAqZI+R7oG+DxStRDAQazcEGsVkXQXqbFv/yFGNbMe1LfVR2Zm1n6jovrIzMzao++qjzbZ\nZJOYOnVqt8MwM+src+fOvT/f5NhS3yWFqVOnMmfOnG6HYWbWVyTdPfRYrj4yM7MCJwUzMxvkpGBm\nZoOcFMzMbJCTgpmZDeq7q4/MzMaaqUdd+rx+dx3/5kqW5ZKCmVkPa5QQWvVfVU4KZmY9qqoDfytO\nCmZmPagbCQGcFMzMrMBJwczMBjkpmJn1mG5VHYGTgplZTymbEHxJqpmZAdUlBHBSMDPrGd2sNqpx\nUjAz6wG9kBDAScHMrK9UWXUETgpmZn2j6oQATgpmZlbgpGBm1mVl2hM6UUoAJwUzs57XqYQAFSYF\nSWdJuk/STU2GS9LXJS2UdIOknaqKxcysV/XKVUc1VZYUzgb2bjF8H2B6/psBnFphLGZmPafXEgJU\n+Oa1iJglaWqLUfYDvhsRAcyWtKGkzSLiz1XFZGbWC3oxGdR0s01hErCo0L0493seSTMkzZE0Z9my\nZR0JzsysCsNNCJ1sT4A+aWiOiNMjYiAiBiZOnNjtcMzMRqSXSwg13UwKS4DJhe4tcj8zs1FnJAmh\n06UE6G5SuAR4d74KaTfgEbcnmJkl3UgIUGFDs6RzgT2BTSQtBo4B1gCIiNOAy4B9gYXAk8D7qorF\nzKyber0doajKq48OHmJ4AB+tavlmZr2gnxICVJgUzMzGuuEkhG4ng5q+uPrIzKzf9GNCACcFM7O2\n64dLT5txUjAza6N+a0Oo56RgZtYm/Z4QwA3NZmarrF9uTCvDScHMbIT2OvEqbr/viWFP16sJAVx9\nZGY2IqMxIYCTgpnZiIzGhABOCmZmHdEPCQGcFMzMhm00XGXUjJOCmdkwjOaEAL76yMysEv2WDGpc\nUjAza7N+TQjgpGBmZgVOCmZmJZVpT+jnUgI4KZiZtU2/JwRwUjAzswInBTMzG+SkYGZmg5wUzMxK\nGKqReTS0J4CTgpmZFTgpmJkNoZ/fuTxcTgpmZjbIScHMzAY5KZiZtTAW7mIuclIwM7NBlSYFSXtL\nWiBpoaSjGgzfQNJPJf1R0nxJ76syHjOzdhtNpQSoMClIGgecAuwDbAscLGnbutE+CtwcEdsDewJf\nlbRmVTGZmVlrLV+yI2k88BZgD2Bz4CngJuDSiJg/xLx3ARZGxB15XucB+wE3F8YJYIIkAesBDwLL\nR/A9zMysDZomBUlfICWEq4DfAfcB44GXAMfnhPGpiLihySwmAYsK3YuBXevGORm4BFgKTADeGRHP\nDf9rmJl13mirOoLWJYXfR8QxTYadKOmFwJRVXP6bgHnA64GtgP+TdHVEPFocSdIMYAbAlCmrukgz\ns3LG0k1rNU3bFCLieb+GpNUkrZ+H3xcRc1rMewkwudC9Re5X9D7gwkgWAncC2zSI5fSIGIiIgYkT\nJ7ZYpJmZrYohG5olnSNpfUnrktoTbpb0byXmfR0wXdK03Hh8EKmqqOge4A15OZsCLwXuGM4XMDOz\n9ilz9dG2uTpnf+ByYBrwz0NNFBHLgcOBmcAtwI8iYr6kwyQdlkc7DniVpB
uBXwJHRsT9I/geZmbW\nBi2vPsrWkLQGKSmcHBHPSIoyM4+Iy4DL6vqdVvi8FPiHYcRrZmYVKlNS+BZwF7AuMEvSlsCjLacw\nMxvlRuOVR1CipBARXwe+Xuh1t6TXVReSmZl1S5mG5k0lfVvS5bl7W+A9lUdmZtZFY/FyVChXfXQ2\nqbF489x9G3BEVQGZmVn3lEkKm0TEj4DnYPCqomcrjcrMzLqiTFJ4QtLGpOcUIWk34JFKozIz66Kx\nWnUE5S5J/STpprOtJP0GmAgcWGlUZmZdMpYTApRLCvOB15LuNhawAL+cx8xGobIJYbRejgrlDu6/\njYjlETE/Im6KiGeA31YdmJmZdV6rR2e/iPT467Ul7UgqJQCsD6zTgdjMzKzDWlUfvQl4L+nppicW\n+j8KfKbCmMzMetZorjqCFkkhIr4DfEfSARFxQQdjMjPrSaM9IUC5NoUTJJ0g6e8qj8bMrEeNhYQA\n5ZLC9qS7mL8tabakGbUX7ZiZ2egyZFKIiMci4oyIeBVwJHAM8GdJ35G0deURmplZx5R5IN44SW+T\ndBHwNeCrwIuBn1L3rgQzM+tvZW5eux24EjghIq4t9P+xpNdUE5aZmXVDmaTw8oh4vNGAiPh4m+Mx\nM7MuKtPQ/EJJP5V0v6T7JP1E0osrj8zMzDquTFI4B/gR8CLSOxXOB86tMigzM+uOMklhnYj4Xn7+\n0fKI+D4wvurAzMys81o9+2ij/PFySUcB55HeqfBOfNWRmdmo1KqheS4pCdQehPfhwrAAjq4qKDMz\n645Wzz6a1slAzMys+5q2KUjavdWEktaXtF37QzIzs25pVX10gKSvAFeQqpKWkRqYtwZeB2wJfKry\nCM3MrGNaVR/9a25sPoD0TubNgKeAW4BvRcQ1nQnRzMw6peUdzRHxIHBG/jMzs1GuzH0KIyZpb0kL\nJC3Ml7U2GmdPSfMkzZf06yrjMTNrZupRl3Y7hJ5Q5tlHIyJpHHAKsBewGLhO0iURcXNhnA2BbwJ7\nR8Q9kl5YVTxmZs04IaxQZUlhF2BhRNwREX8j3fy2X904hwAXRsQ9ABFxX4XxmJk9jxPCysq8T2Ed\nSf8h6YzcPV3SW0rMexKwqNC9OPcregnwAklXSZor6d1NYpghaY6kOcuWLSuxaDOzoTkhPF+ZksL/\nAn8FXpm7lwBfbNPyVwd2Bt4MvAn4D0kvqR8pIk6PiIGIGJg4cWKbFm1mZvXKJIWtIuIrwDMAEfEk\nKx590coSYHKhe4vcr2gxMDMinoiI+4FZpHdCm5n1jLuOf3O3Q+iYMknhb5LWJj3vCElbkUoOQ7kO\nmC5pmqQ1gYOAS+rG+Qmwu6TVJa0D7Eq6D8LMrFJlq47GUkKAclcfHUu6q3mypB8ArwbeO9REEbFc\n0uHATGAccFZEzJd0WB5+WkTcIukK4AbgOeDMiLhpRN/EzKzNxlpCAFBEDD2StDGwG6naaHau6umK\ngYGBmDNnTrcWb2ajxFAlhdGWECTNjYiBocYrc/XRL4FdI+LSiPhZRNwv6fS2RGlmZj2lTJvCNOBI\nSccU+g2ZbczMrP+USQoPA28ANpX0U0kbVByTmVmlxlrV0XCUSQrK72b+CHABcA3gx1GYWV/yDWut\nlbn66LTah4g4W9KNwEerC8nMrBpOCENrmhQkrR8RjwLn5/cq1NwJfLryyMzMrONalRTOAd5Ceuta\nsPJdzAG8uMK4zMzayqWEclq9ee0t+f+0zoVjZtZdY7mRGcrdp/BqSevmz++SdKKkKdWHZmbWHn6k\nRXllrj46FXhS0vbAp4A/Ad+rNCozszZxQhieMklheaRnYewHnBwRpwATqg3LzMy6ocwlqY9JOhp4\nF/AaSasBa1QblpnZqnMpYfjKlBTeSXpU9gci4l7SexFOqDQqM7NVVDYhjB9X5vUwY8eQJYWcCE4s\ndN8DfLfKoMzMOuXWL+3b7RB6SpmSgp
lZX3G10cg5KZiZ2aBSSUHS2pJeWnUwZmad4lJCY2VuXnsr\nMI/0Sk4k7SCp/l3LZmY9oUzVkRNCc2VKCscCu5Deq0BEzCO9eMfMzEaZMknhmYh4pK7f0C92NjPr\nsG0+e9mQ47iU0FqZm9fmSzoEGCdpOvBx4NpqwzIzG76nn/X56qoqU1L4GPAy0g1s5wCPAEdUGZSZ\nmXVHy5KCpHHAf0bEp4HPdiYkM7PhcwNze7QsKUTEs8DuHYrFzMy6rEybwvX5EtTzgSdqPSPiwsqi\nMjNrM5cSyimTFMYDDwCvL/QLwEnBzGyUKfNAvPd1IhAzM+u+IZOCpP+lwX0JEfH+EtPuDZwEjAPO\njIjjm4z3CuC3wEER8eOh5mtmVjRUI7OrjsorU330s8Ln8cDbgaVDTZSvXDoF2AtYDFwn6ZKIuLnB\neP8N/Lxs0GZmVo0y1UcXFLslnQtcU2LeuwALI+KOPN15pFd63lw33seAC4BXlAnYzKyo7GOyrZyR\nPDp7OvDCEuNNAhYVuhfnfoMkTSKVPE5tNSNJMyTNkTRn2bJlwwzXzMzKKtOm8BgrtyncCxzZpuV/\nDTgyIp6Tmr8SLyJOB04HGBgY8H3sZmYVKVN9NGGE814CTC50b5H7FQ0A5+WEsAmwr6TlEXHxCJdp\nZmOI72JuvzLvU/hlmX4NXAdMlzRN0prAQcBK72GIiGkRMTUipgI/Bj7ihGBm1j1NSwqSxgPrAJtI\negFQq99Zn7q2gUYiYrmkw4GZpEtSz4qI+ZIOy8NPW9XgzWzscimhGq2qjz5Mehrq5sAfCv0fBU4u\nM/OIuAy4rK5fw2QQEe8tM08zM6tO06QQEScBJ0n6WER8o4MxmZlZl5S5ee1MSZ8kPS01gKuB0yLi\n6UojMzNrwlVH1SmTFL4DPAbUSguHAN8DDqwqKDMz644ySWG7iNi20H2lpPq7ks3MOsKlhGqVuaP5\nD5J2q3VI2hWYU11IZmaN+ZEW1StTUtgZuFbSPbl7CrBA0o1ARMTLK4vOzMw6qkxS2LvyKMzMhlC2\nlOCqo1VT5jEXdwNIeiHp0dm1/vc0ncjMzPpSmcdcvE3S7cCdwK+Bu4DLK47LzGyQSwmdU6ah+Thg\nN+C2iJgGvAGYXWlUZmaZE0JnlUkKz0TEA8BqklaLiCtJTzc1M6uUrzbqvDINzQ9LWg+YBfxA0n3A\nE9WGZWZWnksJ7VOmpLAf8CTwr8AVwJ+At1YZlJmZdUfTpCBpa0mvjognIuK5iFgeEd8hPTF1w86F\naGZjkdsSuqNVSeFrpMdk13skDzMzq4QTQve0SgqbRsSN9T1zv6mVRWRmY5oTQne1SgqtqojWbncg\nZmbWfa2SwhxJH6rvKemDwNzqQjKzscqXoHZfq0tSjwAuknQoK5LAALAm8PaqAzOzsWU4CcFVR9Vp\n9TrOvwCvkvQ6YLvc+9KI+FVHIjOzMcMJoXeUeSDelcCVHYjFzMYgJ4TeUubmNTMzGyOcFMysa1xK\n6D1OCmbW85wQOsdJwcx6mhNCZzkpmFlXlKk6ckLoPCcFM+s436TWuypNCpL2lrRA0kJJRzUYfqik\nGyTdKOlaSdtXGY+ZdZ+fbdTbKksKksYBpwD7ANsCB0vatm60O4HXRsTfk177eXpV8ZiZ2dCqLCns\nAiyMiDsi4m/AeaQX9gyKiGsj4qHcORvYosJ4zKzLXG3U+6pMCpOARYXuxblfMx8ALq8wHjPrIt+T\n0B/KvKO5cvn5Sh8Adm8yfAYwA2DKlCkdjMzMbGypsqSwBJhc6N4i91uJpJcDZwL7RcQDjWYUEadH\nxEBEDEycOLGSYM2sOi4l9I8qSwrXAdMlTSMlg4OAQ4ojSJoCXAj8c0TcVmEsZtYlvtqov1SWFCJi\nuaTDgZnAOOCsiJgv6bA8/DTg88DGwDclASyPiIGqYjKzznGjcn9SRHQ7hmEZGBiIOXPmdDsMM2ti\nJM
nApYTqSZpb5qS7Jxqazay/uVQwejgpmNmwtTMJuJTQW5wUzGxIVZUEnBB6j5OCma2kU1VBTgi9\nyUnBbAzrVluAE0LvclIwGyN6oTHYyaD3OSmYjUK9kACKnAz6h5OC2SjQa0kAnAj6lZOCWR/pxYN/\njZPA6OCkYNbDnASs05wUzHpMLyYCJ4Cxw0nBrAe8/JgrePSvz3Y7DMAJYKxzUjDrom6XCpwArJ6T\nglmXdDohOAFYGU4KZh3mx0hYL3NSMOugKhKCD/7WTk4KZh3gR01bv3BSMFtFe514Fbff90Sly3Ai\nsE5xUjCrM+2oS+mFl9Q6EVg3OCnYmNfty0LrORlYNzkp2Kj2uYtv5Puz7+l2GKU5IVi3OSlYadt8\n9jKefnZFxcr4ceL4d2zPCTMXsOThp7oYWf9zMrBe4aRgDZWpUnn62eCIH87rQDSjmxOC9RInBVtJ\nr9Wvj2ZOBtaLnBTGEB/wO8cHfOtXTgqjTH29v7WHD/I2Vjgp9JleuYZ+NHvXblP44v5/3+0wzLrC\nSaGki69fwgkzF7D04afYfMO1Wf7ss/zlsb+1dRlrrb4aUzZau/K7Y9vp1VttxF0PPNUXVx+tv9Y4\nbvjC3t0Ow6ynVZoUJO0NnASMA86MiOPrhisP3xd4EnhvRPyh3XHUH9Bft81Errx1GUsfforVV4Nn\nnhve/Ko6AP51+XN9lRB8Rm02+lSWFCSNA04B9gIWA9dJuiQibi6Mtg8wPf/tCpya/7fNxdcv4egL\nb+SpZ9JbrZY8/NRKNzMNNyGMVZtOWJPffXavbodhZhWrsqSwC7AwIu4AkHQesB9QTAr7Ad+NiABm\nS9pQ0mYR8ed2BXHCzAWDCcFa23TCmg2rxNzIajZ2VJkUJgGLCt2LeX4poNE4k4CVkoKkGcAMgClT\npgwriKV9UNfdLa7+MbN6fdHQHBGnA6cDDAwMDOvim803XLsvGkGrIuBOn+mbWUlVJoUlwORC9xa5\n33DHWSX/9qaXrtSm0MtGevWRq3fMrF2qTArXAdMlTSMd6A8CDqkb5xLg8NzesCvwSDvbEwD233ES\nQFuvPoJ0KeYPPvTKdoZqZtZ1lSWFiFgu6XBgJumS1LMiYr6kw/Lw04DLSJejLiRdkvq+KmLZf8dJ\ng8nBzMyaq7RNISIuIx34i/1OK3wO4KNVxmBmZuWt1u0AzMysdzgpmJnZICcFMzMb5KRgZmaDlNp6\n+4ekZcDdI5h0E+D+NofTLo5tZBzb8PVqXODYRqpsbFtGxMShRuq7pDBSkuZExEC342jEsY2MYxu+\nXo0LHNurXaDlAAAGmklEQVRItTs2Vx+ZmdkgJwUzMxs0lpLC6d0OoAXHNjKObfh6NS5wbCPV1tjG\nTJuCmZkNbSyVFMzMbAhOCmZmNqhvk4KkyZKulHSzpPmSPpH7byTp/yTdnv+/oDDN0ZIWSlog6U25\n3zqSLpV0a57P8b0SW908L5F0Uy/FJmlNSadLui3/fgf0UGwHS7pR0g2SrpC0SSdjk7RxHv9xSSfX\nzWvnHNtCSV+XpG7H1Qv7QavfrDDPruwHQ6zPru4HQ8Q2/P0gIvryD9gM2Cl/ngDcBmwLfAU4Kvc/\nCvjv/Hlb4I/AWsA04E+kR3qvA7wuj7MmcDWwTy/EVpjfPwLnADf1yu+Wh30B+GL+vBqwSS/ERnr6\n7321ePL0x3Y4tnWB3YHDgJPr5vV7YDfSi/EuX5XtrV1x9ch+0PQ364H9oNX67PZ+0Gydjmg/WKUf\ntpf+gJ8AewELgM0KP+6C/Plo4OjC+DOBVzaYz0nAh3olNmA94Jq8UazyztDm2BYB6/baOgXWAJYB\nW5IOvKcBMzoZW2G899btqJsBtxa6Dwa+1e24Gsyn4/tBq9i6vR8MEVtX94MW29qI9oO+rT4qkjQV\n2BH4HbBprHh7273ApvnzJNLKq1mc+xXnsyHwVuCXPRTbccBXSS8h
aqtViS3/VgDHSfqDpPMlbUqb\nrEpsEfEM8C/AjcBS0oHk2x2OrZlJOc6VYu6BuIrz6dZ+0Eq394Nm0/bCftDQSPeDvk8KktYDLgCO\niIhHi8MipctS19xKWh04F/h6RNzRC7FJ2gHYKiIuakc87YyNVDTdArg2InYCfgv8Ty/EJmkN0s6w\nI7A5cAOpVNH12Kri/aA7sTEK94O+Tgr5S18A/CAiLsy9/yJpszx8M1KdGqT3RE8uTL5F7ldzOnB7\nRHyth2J7JTAg6S5S0fklkq7qkdgeIJ211aY/H9ipR2LbASAi/pR3nh8Br+pwbM0syXHWx9ztuGq6\nuR800wv7QTO9sB80M6L9oG+TgiSRikK3RMSJhUGXAO/Jn99Dqo+r9T9I0lqSpgHTSQ1+SPoisAFw\nRC/FFhGnRsTmETGV1JB0W0Ts2SOxBfBToBbPG4CbeyE20kF2W0m1J0LuBdzS4dgaysX/RyXtluf5\n7qGm6URceV7d3g8a6pH9oFlsvbAfNDOy/aCqxpGq/0gbR5CKRPPy377AxqS60NuBXwAbFab5LOkK\nlQXkKytIZ2qRf6zafD7YC7HVzXMq7bnqom2xkRqwZuV5/RKY0kOxHZbX6Q2knXbjLsR2F/Ag8Dip\n7WDb3H8AuCnHfTL5yQLdjKuH9oOGv1mP7AfN1mcv7AfNYhv2fuDHXJiZ2aC+rT4yM7P2c1IwM7NB\nTgpmZjbIScHMzAY5KZiZ2SAnBRsTJL1I0nmS/iRprqTLJL2kzcu4SlLLF6hL2lPSz4YYZwdJ+7Yz\nNrOynBRs1Ms3A10EXBURW0XEzqTb/TetG2/1bsTXwA6k69LNOs5JwcaC1wHPRMRptR4R8ceIuDqf\nuV8t6RLynaiSLs6lifmSZuR+B0o6MX/+hKQ78ucXS/pNq4VL2lvpOft/ID3+udZ/F0m/lXS9pGsl\nvVTSmsB/Au+UNE/SOyWtK+ksSb/P4+7X5t/HbFCvnBmZVWk7YG6L4TsB20XEnbn7/RHxoKS1gesk\nXUB6v8C/5+F7AA9ImpQ/z2o2Y0njgTOA1wMLgR8WBt8K7BERyyW9EfhyRBwg6fPAQEQcnufxZeBX\nEfF+pady/l7SLyLiiWH9CmYlOCmYpWc53Vno/rikt+fPk4HpETFb0nqSJuR+5wCvISWFC2luG+DO\niLgdQNL3gRl52AbAdyRNJz3WYI0m8/gH4G2SPp27xwNTWMXnOZk14uojGwvmAzu3GD54xi1pT+CN\npBcJbQ9cTzoIA1wLvI/0nKWrSQnhlUDL6qMWjgOujIjtSO8vGN9kPAEHRMQO+W9KRDghWCWcFGws\n+BWwVq19AEDSyyXt0WDcDYCHIuJJSduQXptZczXwaVJ10fWktoq/RsQjLZZ9KzBV0la5++C6ZdUe\nm/3eQv/HSK9hrJkJfCw3mCNpxxbLM1slTgo26kV66uPbgTfmS1LnA/9FentVvSuA1SXdAhwPzC4M\nu5pUdTQrIp4lvfXtmiGW/TSpuujS3NBcfAb+V4D/knQ9K1flXkl65PE8Se8klSjWAG7IsR9X8qub\nDZufkmpmZoNcUjAzs0FOCmZmNshJwczMBjkpmJnZICcFMzMb5KRgZmaDnBTMzGzQ/wc3GdYSHwvf\nzQAAAABJRU5ErkJggg==\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x7f58080daf60>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.dates as mdates\n",
"\n",
"URL = 'example.com'\n",
"\n",
"# aggregate the data\n",
"dates = []; lengths = []\n",
"for row in parse_cdx(URL):\n",
" dates.append(mdates.date2num(row['timestamp']))\n",
" lengths.append(row['length'])\n",
"\n",
"# sort the data\n",
"dates, lengths = zip(*sorted(zip(dates, lengths)))\n",
"\n",
"# plot the data\n",
"plt.title(\"Cumulative Wayback Capture Size for {}\".format(URL))\n",
"plt.plot_date(dates, np.cumsum(lengths))\n",
"plt.xlabel(\"Crawl date\")\n",
"plt.ylabel(\"Capture size (bytes)\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### CDX Example: Crawl Rate"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The crawl rate for example.com is 19.085849 visits/day\n"
]
}
],
"source": [
"import matplotlib.dates as mdates\n",
"\n",
"URL = 'example.com'\n",
"\n",
"# aggregate the data\n",
"dates = []; lengths = []\n",
"for row in parse_cdx(URL):\n",
" dates.append(mdates.date2num(row['timestamp']))\n",
" lengths.append(row['length'])\n",
"\n",
"# sort the data\n",
"dates, lengths = zip(*sorted(zip(dates, lengths)))\n",
"\n",
"crawl_rate = len(dates) / (max(dates) - min(dates))\n",
"print(\"The crawl rate for {} is {:4f} visits/day\".format(URL, crawl_rate))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
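
The two CDX examples in the notebook boil down to a running total of capture sizes and a captures-per-day ratio. As a minimal sketch, the same calculations can be written as standalone helpers that do not depend on matplotlib. The names `crawl_rate` and `cumulative_sizes` and the sample rows are hypothetical, and we assume each row is a dict holding a `datetime` under `'timestamp'` and a byte count under `'length'`, as the cells above suggest.

```python
from datetime import datetime


def crawl_rate(timestamps):
    """Return captures per day over the span covered by `timestamps`.

    `timestamps` is an iterable of datetime objects. Returns None when the
    rate is undefined (fewer than two captures, or a zero-length span).
    """
    stamps = sorted(timestamps)
    if len(stamps) < 2:
        return None
    span_days = (stamps[-1] - stamps[0]).total_seconds() / 86400
    if span_days == 0:
        return None
    return len(stamps) / span_days


def cumulative_sizes(rows):
    """Return (dates, running totals of 'length'), ordered by capture date."""
    ordered = sorted(rows, key=lambda row: row['timestamp'])
    total = 0
    dates, totals = [], []
    for row in ordered:
        total += int(row['length'])  # tolerate lengths given as strings
        dates.append(row['timestamp'])
        totals.append(total)
    return dates, totals


# Made-up sample rows for illustration only (not real Wayback data).
sample = [
    {'timestamp': datetime(2017, 1, 3), 'length': '300'},
    {'timestamp': datetime(2017, 1, 1), 'length': '100'},
    {'timestamp': datetime(2017, 1, 5), 'length': '200'},
]

rate = crawl_rate(row['timestamp'] for row in sample)
dates, totals = cumulative_sizes(sample)
print(rate, totals)
```

Returning `None` for a single capture or a zero-length span avoids the division by zero that the one-line ratio runs into when a URL has only one capture.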