{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Learning to rank using embeddings\n",
"\n",
"This notebook is a template for using keras with research paper data. In the real system, we'd load in our ranking data alongside and try to predict our ranker, but for now we'll predict keyphrases given titles as a proxy.\n",
"\n",
"## TODO\n",
"* Tokenizer should take a pre-built word_to_idx mapping\n",
"* Write tweak layer\n",
"* Load ranking data\n",
"* Try out convolution layer"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Using TensorFlow backend.\n"
]
}
],
"source": [
"import json\n",
"import sys\n",
"import os\n",
"\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"import keras\n",
"from keras.layers import Dense, Activation, Dropout, Flatten, Merge, Layer, RepeatVector\n",
"from keras.layers.recurrent import LSTM\n",
"from keras.layers.embeddings import Embedding\n",
"from keras.models import Sequential\n",
"from keras.preprocessing import text as keras_text\n",
"import keras.preprocessing.sequence as keras_sequence\n",
"\n",
"import keras.backend as K\n",
"\n",
"import deeplearn\n",
"\n",
"from __future__ import print_function"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loading Data\n",
"\n",
"To start, we'll be using the pre-built word2vec vectors from Google; we should be able to swap in our own w2v vectors at a later date. We will be using the filtered paper data from our corpus (340k documents)."
]
},
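{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Google vectors aren't actually wired in yet (the model below is built with `weights=None`). As a rough sketch of how they could be loaded, the helper below builds an initial embedding matrix from a local copy of the pre-trained vectors using gensim. The file path, the helper name, and the `word_to_idx` argument are illustrative assumptions, not part of the real pipeline; note also that the Google News vectors are 300-dimensional, so `EMBEDDING_SIZE` would need to match before handing the matrix to the embedding layer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.models import Word2Vec\n",
"\n",
"def w2v_embedding_weights(word_to_idx, path='/data/GoogleNews-vectors-negative300.bin'):\n",
"    # Hypothetical helper: the path and word_to_idx mapping are assumptions.\n",
"    # Words missing from word2vec keep small random (still trainable) rows.\n",
"    w2v = Word2Vec.load_word2vec_format(path, binary=True)\n",
"    weights = np.random.uniform(-0.05, 0.05, (len(word_to_idx) + 1, w2v.vector_size))\n",
"    for word, idx in word_to_idx.items():\n",
"        if word in w2v:\n",
"            weights[idx] = w2v[word]\n",
"    return weights.astype(np.float32)"
]
},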
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading.... /data/filtered-papers/joined-json/part-r-00000-4a18d38d-7ee1-4190-8c9a-f453040bea01\n",
"Loading.... /data/filtered-papers/joined-json/part-r-00001-4a18d38d-7ee1-4190-8c9a-f453040bea01\n",
"Loading.... /data/filtered-papers/joined-json/part-r-00002-4a18d38d-7ee1-4190-8c9a-f453040bea01\n",
"Loading.... /data/filtered-papers/joined-json/part-r-00003-4a18d38d-7ee1-4190-8c9a-f453040bea01\n",
"Loading.... /data/filtered-papers/joined-json/part-r-00004-4a18d38d-7ee1-4190-8c9a-f453040bea01\n",
"Loading.... /data/filtered-papers/joined-json/part-r-00005-4a18d38d-7ee1-4190-8c9a-f453040bea01\n",
"Loading.... /data/filtered-papers/joined-json/part-r-00006-4a18d38d-7ee1-4190-8c9a-f453040bea01\n",
"Loading.... /data/filtered-papers/joined-json/part-r-00007-4a18d38d-7ee1-4190-8c9a-f453040bea01\n",
"Loading.... /data/filtered-papers/joined-json/part-r-00008-4a18d38d-7ee1-4190-8c9a-f453040bea01\n",
"Loading.... /data/filtered-papers/joined-json/part-r-00009-4a18d38d-7ee1-4190-8c9a-f453040bea01\n"
]
}
],
"source": [
"import glob\n",
"files = sorted(glob.glob('/data/filtered-papers/joined-json/part-r-*'))\n",
"\n",
"def load_files():\n",
" frames = []\n",
" for file in files[:10]:\n",
" print('Loading....', file)\n",
" with open(file) as f:\n",
" frames.append(pd.DataFrame.from_records([\n",
" json.loads(line) for line in f\n",
" ]))\n",
" del frames[-1]['body_text']\n",
" \n",
" return pd.concat(frames).reset_index()\n",
" \n",
"paper_data = load_files()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"VOCAB_SIZE = 50000\n",
"BATCH_SIZE = 32\n",
"MAX_CONTEXT = 16\n",
"EMBEDDING_SIZE = 100"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Processing... 4.60MB."
]
}
],
"source": [
"from deeplearn import preprocessing\n",
"import importlib\n",
"importlib.reload(preprocessing)\n",
"tokenizer = preprocessing.Tokenizer(vocab_size=VOCAB_SIZE)\n",
"tokenizer.fit(paper_data.title)"
]
},
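{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `deeplearn.preprocessing.Tokenizer` source isn't included here. Judging only from how it's used in this notebook (fit on a series of texts, then map text to integer term ids, with id 0 reserved for padding and unknown words, as the `mask_zero=True` model below assumes), a minimal stand-in might look like the sketch below; the real implementation may tokenize and rank terms differently."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import collections\n",
"import re\n",
"\n",
"class SimpleTokenizer(object):\n",
"    # Hypothetical stand-in for deeplearn.preprocessing.Tokenizer.\n",
"    def __init__(self, vocab_size):\n",
"        self.vocab_size = vocab_size\n",
"        self.word_to_idx = {}\n",
"\n",
"    def fit(self, texts):\n",
"        counts = collections.Counter()\n",
"        for txt in texts:\n",
"            counts.update(re.findall(r'\\\\w+', txt.lower()))\n",
"        # Ids start at 1: id 0 is reserved for padding / out-of-vocabulary.\n",
"        for idx, (word, _) in enumerate(counts.most_common(self.vocab_size - 1)):\n",
"            self.word_to_idx[word] = idx + 1\n",
"\n",
"    def terms_from_text(self, txt):\n",
"        return [self.word_to_idx.get(w, 0) for w in re.findall(r'\\\\w+', txt.lower())]"
]
},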
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Predicting Keyphrases\n",
"\n",
"We'll use a toy problem: predicting the top keyphrase of a paper given it's title to get started. We'll take the top `MAX_KEYPHRASES` keyphrases by count, and build a predictor for them."
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"MAX_KEYPHRASES = 1000\n",
"\n",
"import collections\n",
"kp_counts = collections.Counter()\n",
"for kp in paper_data.key_phrases:\n",
" kp_counts.update(kp)\n",
" \n",
"keyphrases = dict(sorted(kp_counts.items(), key=lambda kv: kv[1], reverse=True))\n",
"kp_to_idx = dict(zip(keyphrases.keys(), range(0, MAX_KEYPHRASES)))\n",
"idx_to_kp = { idx:kp for (kp, idx) in kp_to_idx.items() }"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Batching...\n"
]
},
{
"data": {
"text/plain": [
"(array([[ 0, 0, 0, 0, 264, 1, 33, 2, 469,\n",
" 1845, 0, 1957, 94, 6, 550, 1612],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 4, 13135, 32, 152, 164, 23, 1699],\n",
" [ 0, 0, 0, 0, 0, 0, 1059, 1489, 1,\n",
" 255, 1622, 191, 0, 21, 690, 87],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 856, 1, 4, 47, 18, 82],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 1180, 1953, 0, 49, 223],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 576, 208, 1, 184, 401, 12],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 955, 3, 54, 184, 11],\n",
" [ 0, 0, 0, 0, 52, 94, 36, 7, 534,\n",
" 796, 1, 667, 6, 8078, 2040, 11],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 304, 44, 0, 406, 601, 6, 551],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 19, 8, 4, 1617, 1405],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 7, 4, 1],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 36, 3, 986, 391],\n",
" [ 0, 0, 0, 0, 0, 24, 701, 2, 462,\n",
" 3, 21, 130, 12, 783, 86, 1489],\n",
" [ 0, 0, 0, 0, 455, 7, 1562, 991, 337,\n",
" 99, 62, 0, 117, 93, 372, 989],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 87, 6, 122],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 109,\n",
" 468, 0, 221, 13, 2, 14, 356],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 4314, 136, 3757, 1597, 93, 25798, 4548],\n",
" [ 0, 0, 0, 0, 0, 153, 0, 417, 2690,\n",
" 414, 51, 7, 429, 2, 35, 2158],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 120, 11,\n",
" 0, 123, 340, 294, 1, 4, 13],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 1405,\n",
" 17160, 9421, 3, 385, 315, 20, 183],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 2565, 828, 162, 2, 14, 441],\n",
" [ 0, 0, 0, 0, 437, 4, 1278, 7404, 3,\n",
" 1003, 1705, 1278, 7404, 183, 383, 50],\n",
" [ 0, 0, 0, 0, 0, 0, 233, 7, 4,\n",
" 524, 1, 533, 2, 211, 666, 349],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 4, 1, 10],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 4731, 4176, 15, 217, 785],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 4661, 2418, 302, 55, 15],\n",
" [ 0, 0, 0, 608, 291, 987, 508, 3, 2334,\n",
" 726, 588, 719, 650, 79, 448, 529],\n",
" [ 0, 0, 0, 19, 8, 168, 4, 107, 1,\n",
" 179, 1000, 209, 3, 4, 2674, 359],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 912,\n",
" 109, 10, 1991, 1259, 3, 1473, 12],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 27,\n",
" 96, 364, 0, 352, 5, 20, 27],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 2, 3],\n",
" [ 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 95, 2014, 0, 1461, 821, 2288]], dtype=int32),\n",
" array([[ 0., 0., 0., ..., 0., 0., 0.],\n",
" [ 0., 0., 0., ..., 0., 0., 0.],\n",
" [ 0., 0., 0., ..., 0., 0., 0.],\n",
" ..., \n",
" [ 0., 0., 0., ..., 0., 0., 0.],\n",
" [ 0., 0., 0., ..., 0., 0., 0.],\n",
" [ 0., 0., 0., ..., 0., 0., 0.]], dtype=float32))"
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def _terms(txt):\n",
" terms_ary = np.zeros((MAX_CONTEXT,), dtype=np.int32)\n",
" terms = np.asarray(list(tokenizer.terms_from_text(txt)))\n",
" terms = terms[:MAX_CONTEXT]\n",
" if len(terms) < 3:\n",
" return None\n",
" terms_ary[-len(terms):] = terms \n",
" return terms_ary\n",
" \n",
"\n",
"def training_data():\n",
" batch = []\n",
" for idx, paper in paper_data.iterrows():\n",
" kp_ary = np.zeros((MAX_KEYPHRASES,), dtype=np.float32)\n",
" kps = [kp_to_idx[k] for k in paper.key_phrases if k in kp_to_idx]\n",
" terms_ary = _terms(paper.title)\n",
" \n",
" if terms_ary is None or len(kps) == 0:\n",
" continue\n",
"\n",
" kp_ary[kps[0]] = 1.\n",
" yield terms_ary, kp_ary\n",
" \n",
"def batchify(gen, batch_size):\n",
" print('Batching...')\n",
" examples = []\n",
" labels = []\n",
" for e, l in gen:\n",
" examples.append(e)\n",
" labels.append(l)\n",
" if len(examples) >= batch_size:\n",
" yield np.asarray(examples), np.asarray(labels)\n",
" examples = []\n",
" labels = []\n",
"\n",
"next(batchify(preprocessing.forever(lambda: training_data()), 32))"
]
},
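{
"cell_type": "markdown",
"metadata": {},
"source": [
"`preprocessing.forever` also comes from the private `deeplearn` package. From its usage here and in `fit_generator` below, it presumably re-creates the generator whenever the previous one is exhausted, so the batch stream never runs dry; a minimal equivalent (an assumption, not the actual source) would be:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Hypothetical equivalent of deeplearn.preprocessing.forever: loop over a\n",
"# freshly built generator each time the previous one runs out.\n",
"def forever(make_gen):\n",
"    while True:\n",
"        for item in make_gen():\n",
"            yield item"
]
},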
{
"cell_type": "code",
"execution_count": 94,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"____________________________________________________________________________________________________\n",
"Layer (type) Output Shape Param # Connected to \n",
"====================================================================================================\n",
"bow_5 (BOW) (None, 100) 5000000 bow_input_5[0][0] \n",
"____________________________________________________________________________________________________\n",
"stripmask_5 (StripMask) (None, 100) 0 bow_5[0][0] \n",
"____________________________________________________________________________________________________\n",
"dense_5 (Dense) (None, 1000) 101000 stripmask_5[0][0] \n",
"____________________________________________________________________________________________________\n",
"activation_5 (Activation) (None, 1000) 0 dense_5[0][0] \n",
"====================================================================================================\n",
"Total params: 5101000\n",
"____________________________________________________________________________________________________\n"
]
}
],
"source": [
"import keras.backend as K\n",
"import tensorflow as tf\n",
"from keras.regularizers import l2\n",
"from deeplearn.layers import BOW, StripMask\n",
"\n",
"def model():\n",
" with tf.device('/gpu:0'):\n",
" words = Sequential()\n",
" words.add(BOW(\n",
" input_dim=VOCAB_SIZE, \n",
" output_dim=EMBEDDING_SIZE,\n",
" input_length=MAX_CONTEXT, \n",
" mask_zero=True,\n",
" weights=None\n",
" ))\n",
" words.add(StripMask())\n",
" words.add(Dense(MAX_KEYPHRASES))\n",
" words.add(Activation('sigmoid'))\n",
" words.compile(loss='categorical_crossentropy', optimizer='adadelta')\n",
" return words\n",
"\n",
"training_model = model()\n",
"training_model.summary()"
]
},
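{
"cell_type": "markdown",
"metadata": {},
"source": [
"`BOW` and `StripMask` are custom layers from the private `deeplearn.layers` module. Judging from the summary above (a 50000x100-parameter embedding reduced to a single 100-dimensional vector per title), a rough equivalent built from stock Keras 1.x layers would average the word embeddings over the context window. The sketch below is an assumption about their behavior; in particular it ignores the zero-padding mask that `mask_zero=True` provides."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from keras.layers import Lambda\n",
"\n",
"def bow_model_sketch():\n",
"    # Hypothetical stand-in for BOW + StripMask: embed each term, then\n",
"    # mean-pool over the MAX_CONTEXT positions (padding included, unlike\n",
"    # the masked original).\n",
"    m = Sequential()\n",
"    m.add(Embedding(input_dim=VOCAB_SIZE, output_dim=EMBEDDING_SIZE,\n",
"                    input_length=MAX_CONTEXT))\n",
"    m.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(EMBEDDING_SIZE,)))\n",
"    m.add(Dense(MAX_KEYPHRASES))\n",
"    m.add(Activation('sigmoid'))\n",
"    m.compile(loss='categorical_crossentropy', optimizer='adadelta')\n",
"    return m"
]
},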
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Batching...Epoch 1/10\n",
"\n"
]
}
],
"source": [
"training_model.fit_generator(\n",
" batchify(preprocessing.forever(lambda: training_data()), 32),\n",
" samples_per_epoch=100000,\n",
" verbose=2,\n",
" nb_epoch=10)"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Exploring the role of individual differences in information visualization\n",
"6.76226e-05 DATA SET\n",
"6.79356e-05 SEMANTIC WEB\n",
"6.89067e-05 White Matter\n",
"7.18886e-05 Superposition\n",
"7.61459e-05 Pomdp\n",
"No terms!\n",
"No terms!\n",
"No terms!\n",
"Mahler measures, short walks and log-sine integrals\n",
"0.0145818 QOS\n",
"0.016177 Genetic Algorithm\n",
"0.0163752 DATA SET\n",
"0.0181475 Modularity\n",
"0.0212178 SEMANTIC WEB\n",
"No terms!\n",
"No terms!\n",
"No terms!\n",
"No terms!\n",
"An hybrid finite volume-finite element method for variable density incompressible flows\n",
"0.0161782 Magnetic Field\n",
"0.0214398 SVM\n",
"0.0215576 Reconfigurability\n",
"0.0329632 Design Space Exploration\n",
"0.0974607 Genetic Algorithm\n",
"No terms!\n",
"Minimization of exclusive sum-of-products expressions for multiple-valued input, incompletely specified functions\n",
"0.00200467 MSA\n",
"0.00236093 QOS\n",
"0.00250315 Constraint Satisfaction\n",
"0.00267944 Probability Density Function\n",
"0.00300554 Magnetic Field\n",
"No terms!\n",
"No terms!\n",
"Body-centric interaction with mobile devices\n",
"0.00463328 Collision Detection\n",
"0.00537107 Latent Variable\n",
"0.00651588 Machine Learning Technique\n",
"0.00730504 Manet\n",
"0.00765606 Gesture\n",
"Implicit modeling using subdivision curves\n",
"0.0485823 Convex Relaxation\n",
"0.0499089 Implicit Surface\n",
"0.0543536 Accelerometer\n",
"0.0547447 Applet\n",
"0.0649556 Lambda Calculus\n",
"Performance and Energy Benefits of Instruction Set Extensions in an FPGA Soft Core\n",
"0.000145312 Knowledge Representation\n",
"0.000151492 Machine Translation\n",
"0.000152743 EEG\n",
"0.000160667 Triangulation\n",
"0.000168405 User Profile\n",
"No terms!\n",
"No terms!\n",
"Visual analytic roadblocks for novice investigators\n",
"0.172402 DATA SET\n",
"0.193841 Lambda Calculus\n",
"0.253551 SVM\n",
"0.2562 SEMANTIC WEB\n",
"0.27894 Genetic Algorithm\n"
]
}
],
"source": [
"def evaluate(model, example):\n",
" terms = _terms(example)\n",
" if terms is None:\n",
" print('No terms!')\n",
" return\n",
" prediction = model.predict_on_batch(terms.reshape((1,16)))[0]\n",
" best_idx = np.argsort(prediction)[-5:]\n",
" print(example)\n",
" for i in best_idx:\n",
" print(prediction[i], idx_to_kp[i])\n",
" \n",
"for i in range(20):\n",
" evaluate(training_model, paper_data.loc[i].title)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<link href='http://fonts.googleapis.com/css?family=Fenix' rel='stylesheet' type='text/css'>\n",
"<link href='http://fonts.googleapis.com/css?family=Alegreya+Sans:100,300,400,500,700,800,900,100italic,300italic,400italic,500italic,700italic,800italic,900italic' rel='stylesheet' type='text/css'>\n",
"<link href='http://fonts.googleapis.com/css?family=Source+Code+Pro:300,400' rel='stylesheet' type='text/css'>\n",
"<style>\n",
" @font-face {\n",
" font-family: \"Computer Modern\";\n",
" src: url('http://mirrors.ctan.org/fonts/cm-unicode/fonts/otf/cmunss.otf');\n",
" }\n",
" div.cell{\n",
" width:800px;\n",
" margin-left:16% !important;\n",
" margin-right:auto;\n",
" }\n",
" h1 {\n",
" font-family: 'Alegreya Sans', sans-serif;\n",
" }\n",
" h2 {\n",
" font-family: 'Fenix', serif;\n",
" }\n",
" h3{\n",
"\t\tfont-family: 'Fenix', serif;\n",
" margin-top:12px;\n",
" margin-bottom: 3px;\n",
" }\n",
"\th4{\n",
"\t\tfont-family: 'Fenix', serif;\n",
" }\n",
" h5 {\n",
" font-family: 'Alegreya Sans', sans-serif;\n",
" }\t \n",
" div.text_cell_render{\n",
" font-family: 'Alegreya Sans',Computer Modern, \"Helvetica Neue\", Arial, Helvetica, Geneva, sans-serif;\n",
" line-height: 135%;\n",
" font-size: 120%;\n",
" width:600px;\n",
" margin-left:auto;\n",
" margin-right:auto;\n",
" }\n",
" .CodeMirror{\n",
" font-family: \"Source Code Pro\";\n",
"\t\t\tfont-size: 90%;\n",
" }\n",
"/* .prompt{\n",
" display: None;\n",
" }*/\n",
" .text_cell_render h1 {\n",
" font-weight: 200;\n",
" font-size: 50pt;\n",
"\t\tline-height: 100%;\n",
" color:#CD2305;\n",
" margin-bottom: 0.5em;\n",
" margin-top: 0.5em;\n",
" display: block;\n",
" }\t\n",
" .text_cell_render h5 {\n",
" font-weight: 300;\n",
" font-size: 16pt;\n",
" color: #CD2305;\n",
" font-style: italic;\n",
" margin-bottom: .5em;\n",
" margin-top: 0.5em;\n",
" display: block;\n",
" }\n",
" \n",
" .warning{\n",
" color: rgb( 240, 20, 20 )\n",
" } \n",
"</style>\n",
"<script>\n",
" MathJax.Hub.Config({\n",
" TeX: {\n",
" extensions: [\"AMSmath.js\"]\n",
" },\n",
" tex2jax: {\n",
" inlineMath: [ ['$','$'], [\"\\\\(\",\"\\\\)\"] ],\n",
" displayMath: [ ['$$','$$'], [\"\\\\[\",\"\\\\]\"] ]\n",
" },\n",
" displayAlign: 'center', // Change this to 'center' to center equations.\n",
" \"HTML-CSS\": {\n",
" styles: {'.MathJax_Display': {\"margin\": 4}}\n",
" }\n",
" });\n",
"</script>\n"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython.core.display import HTML\n",
"import requests\n",
"\n",
"def css_styling():\n",
" styles = requests.get('https://raw.githubusercontent.com/barbagroup/CFDPython/master/styles/custom.css').text\n",
" return HTML(styles)\n",
"css_styling()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}