cjauvin/knn.json

## knn.json
{
 "metadata": {
  "name": "",
  "signature": "sha256:470129fc930ac898a2329a0d092295ee9959019757b377ff1f8aa45c837d7456"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Fun with kNN"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "My goal here is to explore and validate some results from my first kNN run, to better understand what goes on. In a nutshell, the idea is to first build a matrix of advertisers and their publishers. A cell of that matrix can be either the number of times that particular advertiser has used that particular publisher, or a binary value, indicating whether at least one such association has ever happened. Hopefully, the matrix will be very sparse, which should obviously help us. Next, from that matrix, we can compute the cosine similarity matrix giving us the pairwise similarity between **all** advertisers, which is what we ultimately want. The challenge here is that our matrix is quite big, so we need to take special care in how we do this. In particular, since we couldn't reasonably store the full similarity matrix (which is **not** sparse), we can keep only the best $k$ values for every advertiser, i.e. their $k$ nearest neighbors."
     ]
    },
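    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The last step, keeping only the best $k$ values per advertiser, can be sketched on toy data. This is only an illustration under assumptions, not the code of the actual run: the `toy_matrix` values and the `cosine` and `top_k_neighbors` helpers are hypothetical, and the brute-force loop over all pairs would probably not scale to the real matrix."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import heapq\n",
      "import math\n",
      "\n",
      "# Hypothetical toy matrix: advertiser id -> {publisher id: ad count}\n",
      "toy_matrix = {\n",
      "    'a': {1: 10, 2: 5},\n",
      "    'b': {1: 8, 2: 4, 3: 1},\n",
      "    'c': {3: 7},\n",
      "    'd': {2: 1, 3: 9},\n",
      "}\n",
      "\n",
      "def cosine(u, v):\n",
      "    # Cosine similarity of two sparse vectors stored as dicts\n",
      "    dot = sum(x * v[i] for i, x in u.items() if i in v)\n",
      "    norm_u = math.sqrt(sum(x * x for x in u.values()))\n",
      "    norm_v = math.sqrt(sum(x * x for x in v.values()))\n",
      "    return dot / (norm_u * norm_v)\n",
      "\n",
      "def top_k_neighbors(matrix, k):\n",
      "    # For every row, keep only the k most similar other rows,\n",
      "    # as (similarity, advertiser id) pairs sorted by decreasing similarity\n",
      "    neighbors = {}\n",
      "    for a, u in matrix.items():\n",
      "        sims = [(cosine(u, v), b) for b, v in matrix.items() if b != a]\n",
      "        neighbors[a] = heapq.nlargest(k, sims)\n",
      "    return neighbors\n",
      "\n",
      "top_k_neighbors(toy_matrix, 2)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },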
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from __future__ import division, print_function\n",
      "import numpy as np\n",
      "from sqlalchemy import create_engine\n",
      "\n",
      "# The connection URL is read from a local file that must stay out of version control\n",
      "engine = create_engine(open('credentials_KEEP_SECRET.txt').readline().strip())\n",
      "conn = engine.connect()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "One particular result of my first run was for `advertiser_id = 12`, so let's set it as our target:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "target_ad_id = 12\n",
      "r = conn.execute('select data from ad_end_redirect_domain where id = %s', [target_ad_id])\n",
      "print('target advertiser:', r.fetchone()[0])\n",
      "\n",
      "q = \"\"\"select publisher_id, pub.data, ad_count\n",
      "       from advertiser_publisher_overview, found_on_domain pub\n",
      "       where advertiser_id = %s\n",
      "       and publisher_id = pub.id\n",
      "       order by publisher_id\"\"\"\n",
      "\n",
      "r = conn.execute(q, [target_ad_id])\n",
      "rows = r.fetchall()\n",
      "for row in rows:\n",
      "    print(dict(row))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "target advertiser: adserver.adtech.de\n",
        "{u'data': 'jp.dk', u'publisher_id': 6, u'ad_count': 274}"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "{u'data': 'epn.dk', u'publisher_id': 345, u'ad_count': 44}\n",
        "{u'data': 'radio.jp.dk', u'publisher_id': 350, u'ad_count': 14}\n",
        "{u'data': 'foto.jp.dk', u'publisher_id': 453, u'ad_count': 6}\n",
        "{u'data': 'fpn.dk', u'publisher_id': 1551, u'ad_count': 93}\n",
        "{u'data': 'www.berlingske.dk', u'publisher_id': 2945, u'ad_count': 3}\n",
        "{u'data': 'kpn.dk', u'publisher_id': 3490, u'ad_count': 26}\n",
        "{u'data': 'spn.dk', u'publisher_id': 3514, u'ad_count': 114}\n",
        "{u'data': 'blogs.jp.dk', u'publisher_id': 4943, u'ad_count': 5}\n",
        "{u'data': 'guide.dk', u'publisher_id': 5234, u'ad_count': 7}\n",
        "{u'data': 'debat.bt.dk', u'publisher_id': 5294, u'ad_count': 2}\n"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can extract a sparse vector from that data, where each element is a column index (corresponding to a publisher) along with the associated count. Note that instead of regular NumPy vectors, I'll be using `dict`s here, as a sparse vector implementation:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "target_sparse_vec = {r.publisher_id: r.ad_count for r in rows}\n",
      "target_sparse_vec"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 3,
       "text": [
        "{6: 274,\n",
        " 345: 44,\n",
        " 350: 14,\n",
        " 453: 6,\n",
        " 1551: 93,\n",
        " 2945: 3,\n",
        " 3490: 26,\n",
        " 3514: 114,\n",
        " 4943: 5,\n",
        " 5234: 7,\n",
        " 5294: 2}"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next we need to L2-normalize that vector, i.e. divide each value by the square root of the sum of the squared elements (note that we can either use the counts or binary values):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def normalize(vec, use_counts=True):\n",
      "    # L2-normalize vec in place; with use_counts=False every entry counts as 1\n",
      "    s = np.sqrt(np.sum([(v if use_counts else 1) ** 2 for v in vec.itervalues()]))\n",
      "    for k in vec.iterkeys():\n",
      "        if use_counts:\n",
      "            vec[k] /= s\n",
      "        else:\n",
      "            vec[k] = 1 / s\n",
      "\n",
      "normalize(target_sparse_vec)\n",
      "target_sparse_vec"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 4,
       "text": [
        "{6: 0.8679756727896063,\n",
        " 345: 0.13938295475453533,\n",
        " 350: 0.044349121967352148,\n",
        " 453: 0.019006766557436636,\n",
        " 1551: 0.29460488164026782,\n",
        " 2945: 0.0095033832787183182,\n",
        " 3490: 0.082362655082225414,\n",
        " 3514: 0.36112856459129605,\n",
        " 4943: 0.015838972131197195,\n",
        " 5234: 0.022174560983676074,\n",
        " 5294: 0.0063355888524788788}"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Since both vectors are L2-normalized, the cosine similarity of this vector to some other is simply their dot product, i.e. the sum of the products of the elements they have in common:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def dot_product(vec1, vec2):\n",
      "    s = 0\n",
      "    for i, v in vec1.iteritems():\n",
      "        if i in vec2:\n",
      "            s += (v * vec2[i])\n",
      "    return s"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "My test run returned the list of advertisers `[39704, 12169, 16809, 9637, 27561]` as the nearest neighbors of our target (i.e. the advertisers with the greatest similarity to it), so let's look at them in detail:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for similar_ad_id in [39704, 12169, 16809, 9637, 27561]:\n",
      "\n",
      "    r = conn.execute('select data from ad_end_redirect_domain where id = %s', [similar_ad_id])\n",
      "    print('similar advertiser:', r.fetchone()[0])\n",
      "\n",
      "    r = conn.execute(q, [similar_ad_id])\n",
      "    rows = r.fetchall()\n",
      "    for row in rows:\n",
      "        print(dict(row))\n",
      "\n",
      "    similar_sparse_vec = {r.publisher_id: r.ad_count for r in rows}\n",
      "    normalize(similar_sparse_vec)\n",
      "\n",
      "    print('sim:', dot_product(target_sparse_vec, similar_sparse_vec), end='\\n\\n')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "similar advertiser: spot.dk\n",
        "{u'data': 'jp.dk', u'publisher_id': 6, u'ad_count': 77}"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "{u'data': 'epn.dk', u'publisher_id': 345, u'ad_count': 10}\n",
        "{u'data': 'radio.jp.dk', u'publisher_id': 350, u'ad_count': 3}\n",
        "{u'data': 'foto.jp.dk', u'publisher_id': 453, u'ad_count': 4}\n",
        "{u'data': 'fpn.dk', u'publisher_id': 1551, u'ad_count': 12}\n",
        "{u'data': 'kpn.dk', u'publisher_id': 3490, u'ad_count': 6}\n",
        "{u'data': 'spn.dk', u'publisher_id': 3514, u'ad_count': 10}\n",
        "{u'data': 'blogs.jp.dk', u'publisher_id': 4943, u'ad_count': 2}\n",
        "{u'data': 'guide.dk', u'publisher_id': 5234, u'ad_count': 1}\n",
        "sim: 0.956212905621\n",
        "\n",
        "similar advertiser:"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " www.mitspot.dk\n",
        "{u'data': 'jp.dk', u'publisher_id': 6, u'ad_count': 15}"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "{u'data': 'epn.dk', u'publisher_id': 345, u'ad_count': 4}\n",
        "{u'data': 'radio.jp.dk', u'publisher_id': 350, u'ad_count': 2}\n",
        "{u'data': 'foto.jp.dk', u'publisher_id': 453, u'ad_count': 2}\n",
        "{u'data': 'fpn.dk', u'publisher_id': 1551, u'ad_count': 4}\n",
        "{u'data': 'kpn.dk', u'publisher_id': 3490, u'ad_count': 4}\n",
        "{u'data': 'spn.dk', u'publisher_id': 3514, u'ad_count': 3}\n",
        "{u'data': 'blogs.jp.dk', u'publisher_id': 4943, u'ad_count': 1}\n",
        "sim: 0.956166200274\n",
        "\n",
        "similar advertiser:"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " www.sms1218.dk\n",
        "{u'data': 'jp.dk', u'publisher_id': 6, u'ad_count': 4}"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "{u'data': 'radio.jp.dk', u'publisher_id': 350, u'ad_count': 1}\n",
        "{u'data': 'foto.jp.dk', u'publisher_id': 453, u'ad_count': 1}\n",
        "{u'data': 'kpn.dk', u'publisher_id': 3490, u'ad_count': 3}\n",
        "{u'data': 'spn.dk', u'publisher_id': 3514, u'ad_count': 4}\n",
        "{u'data': 'blogs.jp.dk', u'publisher_id': 4943, u'ad_count': 1}\n",
        "sim: 0.790366729253\n",
        "\n",
        "similar advertiser:"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " mediawatch.dk\n",
        "{u'data': 'jp.dk', u'publisher_id': 6, u'ad_count': 13}"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "{u'data': 'epn.dk', u'publisher_id': 345, u'ad_count': 177}\n",
        "{u'data': 'radio.jp.dk', u'publisher_id': 350, u'ad_count': 1}\n",
        "{u'data': 'foto.jp.dk', u'publisher_id': 453, u'ad_count': 1}\n",
        "{u'data': 'fpn.dk', u'publisher_id': 1551, u'ad_count': 1}\n",
        "{u'data': 'kpn.dk', u'publisher_id': 3490, u'ad_count': 3}\n",
        "{u'data': 'spn.dk', u'publisher_id': 3514, u'ad_count': 9}\n",
        "sim: 0.223978802642\n",
        "\n",
        "similar advertiser:"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " www.profiloptik.dk\n",
        "{u'data': 'jp.dk', u'publisher_id': 6, u'ad_count': 3}"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "{u'data': 'www.bt.dk', u'publisher_id': 15, u'ad_count': 3104}\n",
        "{u'data': 'radio.jp.dk', u'publisher_id': 350, u'ad_count': 1}\n",
        "{u'data': 'kpn.dk', u'publisher_id': 3490, u'ad_count': 1}\n",
        "{u'data': 'spn.dk', u'publisher_id': 3514, u'ad_count': 1}\n",
        "sim: 0.000996058451881\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "OK, so the first two are clearly similar, but what happened with the last three? Didn't my first run retrieve those as the most similar? The explanation is simple: here I'm querying the DB directly, whereas in my test run I first applied low and high count filters on both the advertisers and the publishers, to significantly reduce the size of the input matrix. Since we don't do that here, the possibly very high counts of some publishers can heavily skew the normalization, and hence the results (see the 3104 ads of `www.profiloptik.dk` on `www.bt.dk`, which push its similarity down to about 0.001). I'm not sure of the exact implications of this, but it is an important observation: **the way we initially filter our input matrix has a great influence not only on the efficiency of the computation, but also on the results.** One way to mitigate this effect would be to use binary values instead of counts:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "r = conn.execute(q, [target_ad_id])\n",
      "rows = r.fetchall()\n",
      "target_sparse_vec = {r.publisher_id: r.ad_count for r in rows}\n",
      "normalize(target_sparse_vec, use_counts=False) # use binary values!\n",
      "\n",
      "for similar_ad_id in [39704, 12169, 16809, 9637, 27561]:\n",
      "\n",
      "    r = conn.execute('select data from ad_end_redirect_domain where id = %s', [similar_ad_id])\n",
      "    print('similar advertiser:', r.fetchone()[0])\n",
      "\n",
      "    r = conn.execute(q, [similar_ad_id])\n",
      "    rows = r.fetchall()\n",
      "    for row in rows:\n",
      "        print(dict(row))\n",
      "\n",
      "    similar_sparse_vec = {r.publisher_id: r.ad_count for r in rows}\n",
      "    normalize(similar_sparse_vec, use_counts=False) # use binary values!\n",
      "\n",
      "    print('sim:', dot_product(target_sparse_vec, similar_sparse_vec), end='\\n\\n')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "similar advertiser: spot.dk\n",
        "{u'data': 'jp.dk', u'publisher_id': 6, u'ad_count': 77}"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "{u'data': 'epn.dk', u'publisher_id': 345, u'ad_count': 10}\n",
        "{u'data': 'radio.jp.dk', u'publisher_id': 350, u'ad_count': 3}\n",
        "{u'data': 'foto.jp.dk', u'publisher_id': 453, u'ad_count': 4}\n",
        "{u'data': 'fpn.dk', u'publisher_id': 1551, u'ad_count': 12}\n",
        "{u'data': 'kpn.dk', u'publisher_id': 3490, u'ad_count': 6}\n",
        "{u'data': 'spn.dk', u'publisher_id': 3514, u'ad_count': 10}\n",
        "{u'data': 'blogs.jp.dk', u'publisher_id': 4943, u'ad_count': 2}\n",
        "{u'data': 'guide.dk', u'publisher_id': 5234, u'ad_count': 1}\n",
        "sim: 0.904534033733\n",
        "\n",
        "similar advertiser:"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " www.mitspot.dk\n",
        "{u'data': 'jp.dk', u'publisher_id': 6, u'ad_count': 15}"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "{u'data': 'epn.dk', u'publisher_id': 345, u'ad_count': 4}\n",
        "{u'data': 'radio.jp.dk', u'publisher_id': 350, u'ad_count': 2}\n",
        "{u'data': 'foto.jp.dk', u'publisher_id': 453, u'ad_count': 2}\n",
        "{u'data': 'fpn.dk', u'publisher_id': 1551, u'ad_count': 4}\n",
        "{u'data': 'kpn.dk', u'publisher_id': 3490, u'ad_count': 4}\n",
        "{u'data': 'spn.dk', u'publisher_id': 3514, u'ad_count': 3}\n",
        "{u'data': 'blogs.jp.dk', u'publisher_id': 4943, u'ad_count': 1}\n",
        "sim: 0.852802865422\n",
        "\n",
        "similar advertiser:"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " www.sms1218.dk\n",
        "{u'data': 'jp.dk', u'publisher_id': 6, u'ad_count': 4}"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "{u'data': 'radio.jp.dk', u'publisher_id': 350, u'ad_count': 1}\n",
        "{u'data': 'foto.jp.dk', u'publisher_id': 453, u'ad_count': 1}\n",
        "{u'data': 'kpn.dk', u'publisher_id': 3490, u'ad_count': 3}\n",
        "{u'data': 'spn.dk', u'publisher_id': 3514, u'ad_count': 4}\n",
        "{u'data': 'blogs.jp.dk', u'publisher_id': 4943, u'ad_count': 1}\n",
        "sim: 0.738548945876\n",
        "\n",
        "similar advertiser:"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " mediawatch.dk\n",
        "{u'data': 'jp.dk', u'publisher_id': 6, u'ad_count': 13}"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "{u'data': 'epn.dk', u'publisher_id': 345, u'ad_count': 177}\n",
        "{u'data': 'radio.jp.dk', u'publisher_id': 350, u'ad_count': 1}\n",
        "{u'data': 'foto.jp.dk', u'publisher_id': 453, u'ad_count': 1}\n",
        "{u'data': 'fpn.dk', u'publisher_id': 1551, u'ad_count': 1}\n",
        "{u'data': 'kpn.dk', u'publisher_id': 3490, u'ad_count': 3}\n",
        "{u'data': 'spn.dk', u'publisher_id': 3514, u'ad_count': 9}\n",
        "sim: 0.797724035217\n",
        "\n",
        "similar advertiser:"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " www.profiloptik.dk\n",
        "{u'data': 'jp.dk', u'publisher_id': 6, u'ad_count': 3}"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "{u'data': 'www.bt.dk', u'publisher_id': 15, u'ad_count': 3104}\n",
        "{u'data': 'radio.jp.dk', u'publisher_id': 350, u'ad_count': 1}\n",
        "{u'data': 'kpn.dk', u'publisher_id': 3490, u'ad_count': 1}\n",
        "{u'data': 'spn.dk', u'publisher_id': 3514, u'ad_count': 1}\n",
        "sim: 0.539359889971\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 8
    },
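    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The skew can be replayed without the database, using only the counts printed above for the target and for `www.profiloptik.dk`. This is a self-contained sketch: `cosine` is a hypothetical helper that normalizes on the fly rather than reusing `normalize` and `dot_product`. With counts, the 3104 ads on `www.bt.dk` dominate the norm of the second vector; with binary values, the similarity reduces to $4 / \\sqrt{11 \\times 5}$, since 4 publishers are shared between vectors of 11 and 5 publishers. The two printed values should be close to the similarities of about 0.001 and 0.54 obtained above."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import math\n",
      "\n",
      "# Counts copied from the query outputs above\n",
      "target_counts = {6: 274, 345: 44, 350: 14, 453: 6, 1551: 93, 2945: 3,\n",
      "                 3490: 26, 3514: 114, 4943: 5, 5234: 7, 5294: 2}\n",
      "profiloptik_counts = {6: 3, 15: 3104, 350: 1, 3490: 1, 3514: 1}\n",
      "\n",
      "def cosine(u, v, use_counts=True):\n",
      "    # Replace every count by 1 when binary values are requested\n",
      "    if not use_counts:\n",
      "        u = dict((i, 1) for i in u)\n",
      "        v = dict((i, 1) for i in v)\n",
      "    dot = sum(x * v[i] for i, x in u.items() if i in v)\n",
      "    norm_u = math.sqrt(sum(x * x for x in u.values()))\n",
      "    norm_v = math.sqrt(sum(x * x for x in v.values()))\n",
      "    return dot / (norm_u * norm_v)\n",
      "\n",
      "print('counts:', cosine(target_counts, profiloptik_counts))\n",
      "print('binary:', cosine(target_counts, profiloptik_counts, use_counts=False))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },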
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "These results make more sense, but they are still far from being in line with those obtained by filtering the matrix. Finally, let's look at the similarity with a bunch of random advertisers, just to be sure:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "i = 0\n",
      "while True:\n",
      "\n",
      "    random_ad_id = np.random.randint(1600000)\n",
      "    \n",
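      "    r = conn.execute('select data from ad_end_redirect_domain where id = %s', [random_ad_id])\n",
      "    name_row = r.fetchone()\n",
      "    if name_row is None: continue  # guard: the random id may not match any advertiser\n",
      "    print('random advertiser:', name_row[0])\n",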
      "\n",
      "    r = conn.execute(q, [random_ad_id])\n",
      "    rows = r.fetchall()\n",
      "\n",
      "    if len(rows) > 50: continue\n",
      "    \n",
      "    for row in rows:\n",
      "        print(dict(row))\n",
      "\n",
      "    random_sparse_vec = {r.publisher_id: r.ad_count for r in rows}\n",
      "    normalize(random_sparse_vec, use_counts=False) # use binary values!\n",
      "\n",
      "    print('sim:', dot_product(target_sparse_vec, random_sparse_vec), end='\\n\\n')\n",
      "\n",
      "    i += 1\n",
      "    if i > 5: break"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "random advertiser: www.baltimoreravensstore.com\n",
        "{u'data': 'www.baltimoreravens.com', u'publisher_id': 43381, u'ad_count': 134}"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "sim: 0\n",
        "\n",
        "random advertiser:"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " www.gringolocal.com\n",
        "{u'data': 'www.blogtopsites.com', u'publisher_id': 211, u'ad_count': 1}"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "{u'data': 'gulfnews.com', u'publisher_id': 3903, u'ad_count': 5}\n",
        "{u'data': 'www.travel.darkroastedblend.com', u'publisher_id': 3988, u'ad_count': 1}\n",
        "sim: 0\n",
        "\n",
        "random advertiser:"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " www.clazlaw.net\n",
        "{u'data': 'www.whaleoil.co.nz', u'publisher_id': 30604, u'ad_count': 1}"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "{u'data': 'freemasonrywatch.org', u'publisher_id': 52891, u'ad_count': 1}\n",
        "{u'data': 'parentalconsentletter.com', u'publisher_id': 60098, u'ad_count': 1}\n",
        "{u'data': 'bmhm.com', u'publisher_id': 74877, u'ad_count': 1}\n",
        "{u'data': 'www.bkzoom.com', u'publisher_id': 76490, u'ad_count': 1}\n",
        "{u'data': 'www.doglaw.com', u'publisher_id': 82489, u'ad_count': 1}\n",
        "{u'data': 'myeaglecountry.com', u'publisher_id': 90963, u'ad_count': 1}\n",
        "{u'data': 'www.oc-breeze.com', u'publisher_id': 105657, u'ad_count': 1}\n",
        "sim: 0\n",
        "\n",
        "random advertiser:"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " www.engleservices.com\n",
        "{u'data': 'www.dailyhome.com', u'publisher_id': 40687, u'ad_count': 1}"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "{u'data': 'www.thestclairtimes.com', u'publisher_id': 42964, u'ad_count': 1}\n",
        "sim: 0\n",
        "\n",
        "random advertiser:"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " www.playbadminton.sg\n",
        "{u'data': 'www.mybowlinggames.com', u'publisher_id': 516, u'ad_count': 1}"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "{u'data': 'www.independent.co.uk', u'publisher_id': 1532, u'ad_count': 2}\n",
        "{u'data': 'www.allkyhoops.com', u'publisher_id': 3637, u'ad_count': 2}\n",
        "{u'data': 'www.telegraph.co.uk', u'publisher_id': 5245, u'ad_count': 2}\n",
        "{u'data': 'www.short-hair-style.com', u'publisher_id': 8451, u'ad_count': 11}\n",
        "{u'data': 'www.unlimitedwebgames.com', u'publisher_id': 8693, u'ad_count': 3}\n",
        "{u'data': 'artistlife.craftgossip.com', u'publisher_id': 24555, u'ad_count': 4}\n",
        "{u'data': 'sportsvibe.co.uk', u'publisher_id': 26162, u'ad_count': 1}\n",
        "{u'data': 'www.covers.com', u'publisher_id': 26943, u'ad_count': 1}\n",
        "{u'data': 'www.designsponge.com', u'publisher_id': 28672, u'ad_count': 1}\n",
        "{u'data': 'www.theguardian.com', u'publisher_id': 120935, u'ad_count': 1}\n",
        "sim: 0\n",
        "\n",
        "random advertiser:"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " www.discoland.fi\n",
        "{u'data': 'klubitus.org', u'publisher_id': 60643, u'ad_count': 1}"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "{u'data': 'www.ruuvipenkki.fi', u'publisher_id': 120280, u'ad_count': 6}\n",
        "sim: 0\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "All dissimilar, so it looks good!"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"\n",
	"{u'data': 'www.bt.dk', u'publisher_id': 15, u'ad_count': 3104}\n",
	"{u'data': 'radio.jp.dk', u'publisher_id': 350, u'ad_count': 1}\n",
	"{u'data': 'kpn.dk', u'publisher_id': 3490, u'ad_count': 1}\n",
	"{u'data': 'spn.dk', u'publisher_id': 3514, u'ad_count': 1}\n",
	"sim: 0.000996058451881\n",
	"\n"
	]
	}
	],
	"prompt_number": 7
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Ok so the first two ones are clearly similar, but what happened with the last three, didn't my first run retrieve those as the most similar? The explanation is simple: here I'm querying directly the DB, whereas with my test run, I first applied low and high count filters on both the advertisers and the publishers, to significantly reduce the size of the input matrix. Here, since we don't do that, the possibly very high counts of some publishers might skew very heavily the normalization process, and hence the results. So this is an important observation, of which I'm not sure of the exact implications, but it seems that the way we initially filter our input matrix has a great influence not only on the efficiency of the computation, but also on the results. However, one way to mitigate this effect would be to use binary values instead of counts:"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"r = conn.execute(q, [target_ad_id])\n",
	"rows = r.fetchall()\n",
	"target_sparse_vec = dict([(r.publisher_id, r.ad_count) for r in rows])\n",
	"normalize(target_sparse_vec, use_counts=False) # use binary values!\n",
	"\n",
	"for similar_ad_id in [39704, 12169, 16809, 9637, 27561]:\n",
	"\n",
	" r = conn.execute('select data from ad_end_redirect_domain where id = %s', [similar_ad_id])\n",
	" print('similar advertiser:', r.fetchone()[0])\n",
	"\n",
	" r = conn.execute(q, [similar_ad_id])\n",
	" rows = r.fetchall()\n",
	" for row in rows:\n",
	" print(dict(row))\n",
	"\n",
	" similar_sparse_vec = {r.publisher_id: r.ad_count for r in rows}\n",
	" normalize(similar_sparse_vec, use_counts=False) # use binary values!\n",
	"\n",
	" print('sim:', dot_product(target_sparse_vec, similar_sparse_vec), end='\\n\\n')"
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"similar advertiser: spot.dk\n",
	"{u'data': 'jp.dk', u'publisher_id': 6, u'ad_count': 77}"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"\n",
	"{u'data': 'epn.dk', u'publisher_id': 345, u'ad_count': 10}\n",
	"{u'data': 'radio.jp.dk', u'publisher_id': 350, u'ad_count': 3}\n",
	"{u'data': 'foto.jp.dk', u'publisher_id': 453, u'ad_count': 4}\n",
	"{u'data': 'fpn.dk', u'publisher_id': 1551, u'ad_count': 12}\n",
	"{u'data': 'kpn.dk', u'publisher_id': 3490, u'ad_count': 6}\n",
	"{u'data': 'spn.dk', u'publisher_id': 3514, u'ad_count': 10}\n",
	"{u'data': 'blogs.jp.dk', u'publisher_id': 4943, u'ad_count': 2}\n",
	"{u'data': 'guide.dk', u'publisher_id': 5234, u'ad_count': 1}\n",
	"sim: 0.904534033733\n",
	"\n",
	"similar advertiser:"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	" www.mitspot.dk\n",
	"{u'data': 'jp.dk', u'publisher_id': 6, u'ad_count': 15}"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"\n",
	"{u'data': 'epn.dk', u'publisher_id': 345, u'ad_count': 4}\n",
	"{u'data': 'radio.jp.dk', u'publisher_id': 350, u'ad_count': 2}\n",
	"{u'data': 'foto.jp.dk', u'publisher_id': 453, u'ad_count': 2}\n",
	"{u'data': 'fpn.dk', u'publisher_id': 1551, u'ad_count': 4}\n",
	"{u'data': 'kpn.dk', u'publisher_id': 3490, u'ad_count': 4}\n",
	"{u'data': 'spn.dk', u'publisher_id': 3514, u'ad_count': 3}\n",
	"{u'data': 'blogs.jp.dk', u'publisher_id': 4943, u'ad_count': 1}\n",
	"sim: 0.852802865422\n",
	"\n",
	"similar advertiser:"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	" www.sms1218.dk\n",
	"{u'data': 'jp.dk', u'publisher_id': 6, u'ad_count': 4}"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"\n",
	"{u'data': 'radio.jp.dk', u'publisher_id': 350, u'ad_count': 1}\n",
	"{u'data': 'foto.jp.dk', u'publisher_id': 453, u'ad_count': 1}\n",
	"{u'data': 'kpn.dk', u'publisher_id': 3490, u'ad_count': 3}\n",
	"{u'data': 'spn.dk', u'publisher_id': 3514, u'ad_count': 4}\n",
	"{u'data': 'blogs.jp.dk', u'publisher_id': 4943, u'ad_count': 1}\n",
	"sim: 0.738548945876\n",
	"\n",
	"similar advertiser:"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	" mediawatch.dk\n",
	"{u'data': 'jp.dk', u'publisher_id': 6, u'ad_count': 13}"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"\n",
	"{u'data': 'epn.dk', u'publisher_id': 345, u'ad_count': 177}\n",
	"{u'data': 'radio.jp.dk', u'publisher_id': 350, u'ad_count': 1}\n",
	"{u'data': 'foto.jp.dk', u'publisher_id': 453, u'ad_count': 1}\n",
	"{u'data': 'fpn.dk', u'publisher_id': 1551, u'ad_count': 1}\n",
	"{u'data': 'kpn.dk', u'publisher_id': 3490, u'ad_count': 3}\n",
	"{u'data': 'spn.dk', u'publisher_id': 3514, u'ad_count': 9}\n",
	"sim: 0.797724035217\n",
	"\n",
	"similar advertiser:"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	" www.profiloptik.dk\n",
	"{u'data': 'jp.dk', u'publisher_id': 6, u'ad_count': 3}"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"\n",
	"{u'data': 'www.bt.dk', u'publisher_id': 15, u'ad_count': 3104}\n",
	"{u'data': 'radio.jp.dk', u'publisher_id': 350, u'ad_count': 1}\n",
	"{u'data': 'kpn.dk', u'publisher_id': 3490, u'ad_count': 1}\n",
	"{u'data': 'spn.dk', u'publisher_id': 3514, u'ad_count': 1}\n",
	"sim: 0.539359889971\n",
	"\n"
	]
	}
	],
	"prompt_number": 8
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"These results make more sense, but they are stil far from being in line with the those obtained by filtering the matrix. Finally, let's look at the similarity with a bunch of random advertisers, just to be sure:"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"i = 0\n",
	"while True:\n",
	"\n",
	" random_ad_id = np.random.randint(1600000)\n",
	" \n",
	" r = conn.execute('select data from ad_end_redirect_domain where id = %s', [random_ad_id])\n",
	" print('random advertiser:', r.fetchone()[0])\n",
	"\n",
	" r = conn.execute(q, [random_ad_id])\n",
	" rows = r.fetchall()\n",
	"\n",
	" if len(rows) > 50: continue\n",
	" \n",
	" for row in rows:\n",
	" print(dict(row))\n",
	"\n",
	" random_sparse_vec = {r.publisher_id: r.ad_count for r in rows}\n",
	" normalize(random_sparse_vec, False)\n",
	"\n",
	" print('sim:', dot_product(target_sparse_vec, random_sparse_vec), end='\\n\\n')\n",
	"\n",
	" i += 1\n",
	" if i > 5: break"
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"random advertiser: www.baltimoreravensstore.com\n",
	"{u'data': 'www.baltimoreravens.com', u'publisher_id': 43381, u'ad_count': 134}"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"\n",
	"sim: 0\n",
	"\n",
	"random advertiser:"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	" www.gringolocal.com\n",
	"{u'data': 'www.blogtopsites.com', u'publisher_id': 211, u'ad_count': 1}"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"\n",
	"{u'data': 'gulfnews.com', u'publisher_id': 3903, u'ad_count': 5}\n",
	"{u'data': 'www.travel.darkroastedblend.com', u'publisher_id': 3988, u'ad_count': 1}\n",
	"sim: 0\n",
	"\n",
	"random advertiser:"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	" www.clazlaw.net\n",
	"{u'data': 'www.whaleoil.co.nz', u'publisher_id': 30604, u'ad_count': 1}"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"\n",
	"{u'data': 'freemasonrywatch.org', u'publisher_id': 52891, u'ad_count': 1}\n",
	"{u'data': 'parentalconsentletter.com', u'publisher_id': 60098, u'ad_count': 1}\n",
	"{u'data': 'bmhm.com', u'publisher_id': 74877, u'ad_count': 1}\n",
	"{u'data': 'www.bkzoom.com', u'publisher_id': 76490, u'ad_count': 1}\n",
	"{u'data': 'www.doglaw.com', u'publisher_id': 82489, u'ad_count': 1}\n",
	"{u'data': 'myeaglecountry.com', u'publisher_id': 90963, u'ad_count': 1}\n",
	"{u'data': 'www.oc-breeze.com', u'publisher_id': 105657, u'ad_count': 1}\n",
	"sim: 0\n",
	"\n",
	"random advertiser:"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	" www.engleservices.com\n",
	"{u'data': 'www.dailyhome.com', u'publisher_id': 40687, u'ad_count': 1}"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"\n",
	"{u'data': 'www.thestclairtimes.com', u'publisher_id': 42964, u'ad_count': 1}\n",
	"sim: 0\n",
	"\n",
	"random advertiser:"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	" www.playbadminton.sg\n",
	"{u'data': 'www.mybowlinggames.com', u'publisher_id': 516, u'ad_count': 1}"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"\n",
	"{u'data': 'www.independent.co.uk', u'publisher_id': 1532, u'ad_count': 2}\n",
	"{u'data': 'www.allkyhoops.com', u'publisher_id': 3637, u'ad_count': 2}\n",
	"{u'data': 'www.telegraph.co.uk', u'publisher_id': 5245, u'ad_count': 2}\n",
	"{u'data': 'www.short-hair-style.com', u'publisher_id': 8451, u'ad_count': 11}\n",
	"{u'data': 'www.unlimitedwebgames.com', u'publisher_id': 8693, u'ad_count': 3}\n",
	"{u'data': 'artistlife.craftgossip.com', u'publisher_id': 24555, u'ad_count': 4}\n",
	"{u'data': 'sportsvibe.co.uk', u'publisher_id': 26162, u'ad_count': 1}\n",
	"{u'data': 'www.covers.com', u'publisher_id': 26943, u'ad_count': 1}\n",
	"{u'data': 'www.designsponge.com', u'publisher_id': 28672, u'ad_count': 1}\n",
	"{u'data': 'www.theguardian.com', u'publisher_id': 120935, u'ad_count': 1}\n",
	"sim: 0\n",
	"\n",
	"random advertiser:"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	" www.discoland.fi\n",
	"{u'data': 'klubitus.org', u'publisher_id': 60643, u'ad_count': 1}"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"\n",
	"{u'data': 'www.ruuvipenkki.fi', u'publisher_id': 120280, u'ad_count': 6}\n",
	"sim: 0\n",
	"\n"
	]
	}
	],
	"prompt_number": 9
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"All dissimilar, so it looks good!"
	]
	}
	],
	"metadata": {}
	}
	]
	}