rafguns/Confidence intervals for weighted cosine similarity.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Confidence intervals for weighted cosine similarity\n",
    "\n",
    "This document provides a detailed overview of our bootstrap-based approach to determining confidence intervals for weighted cosine similarity. It is a companion to the [guide that focuses on distance between barycenters](http://nbviewer.jupyter.org/gist/rafguns/6fa3460677741e356538337003692389). While the procedure in both cases is very similar, it is not exactly the same. Parts of this document are based on or taken from the barycenter guide.\n",
    "\n",
    "To make it easier to follow, we will use a small hypothetical example, determining the confidence interval for simialrity between two sets of publications. As in the barycenter guide, the implementation is in Python.\n",
    "\n",
    "First we load some libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Weighted cosine similarity\n",
    "\n",
    "### Getting the data ready\n",
    "\n",
    "Now suppose that we have a panel member `pm` and a research group `group`. For both, we have counted their number of publications in each Web of Science Subject Category (SC). For instance, `pm` has 65 publications in *Chemistry, Inorganic Nuclear*, 40 in *Chemistry, Organic*, and so on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pm = pd.Series({\n",
    "    'CHEMISTRY, INORGANIC & NUCLEAR': 65,\n",
    "    'CHEMISTRY, ORGANIC': 40,\n",
    "    'CHEMISTRY, MULTIDISCIPLINARY': 15,\n",
    "    'CRYSTALLOGRAPHY': 4,    \n",
    "})\n",
    "\n",
    "group = pd.Series({\n",
    "    'CHEMISTRY, MULTIDISCIPLINARY': 122,\n",
    "    'BIOPHYSICS': 89,\n",
    "    'CHEMISTRY, ORGANIC': 45,\n",
    "    'CHEMISTRY, PHYSICAL': 42,\n",
    "    'OPTICS': 26,\n",
    "    'ELECTROCHEMISTRY': 14,\n",
    "    'MICROSCOPY': 3,\n",
    "})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Apart from the publication profiles, we also need a similarity matrix. This is a symmetric matrix $S$ in which each element $S_{i,j}$ expresses the similarity between SCs $i$ and $j$. For this example we load a similarity matrix that is freely available on [Loet Leydesdorff's website](http://www.leydesdorff.net/overlaytoolkit/). Of course, other similarity matrices might yield different results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "matrix = pd.read_excel('http://www.leydesdorff.net/overlaytoolkit/matrix10.xlsx', sheetname='Cosin Sim Citing')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the matrix has 224 rows and columns, it is too large to display. We can, however, get an idea of what is inside by looking at the first few rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ACOUSTICS</th>\n",
       "      <th>AGRICULTURAL ECONOMICS &amp; POLICY</th>\n",
       "      <th>AGRICULTURAL ENGINEERING</th>\n",
       "      <th>AGRICULTURE, DAIRY &amp; ANIMAL SCIENCE</th>\n",
       "      <th>AGRICULTURE, MULTIDISCIPLINARY</th>\n",
       "      <th>AGRONOMY</th>\n",
       "      <th>ALLERGY</th>\n",
       "      <th>ANATOMY &amp; MORPHOLOGY</th>\n",
       "      <th>ANDROLOGY</th>\n",
       "      <th>ANESTHESIOLOGY</th>\n",
       "      <th>...</th>\n",
       "      <th>TRANSPORTATION</th>\n",
       "      <th>TRANSPORTATION SCIENCE &amp; TECHNOLOGY</th>\n",
       "      <th>TROPICAL MEDICINE</th>\n",
       "      <th>URBAN STUDIES</th>\n",
       "      <th>UROLOGY &amp; NEPHROLOGY</th>\n",
       "      <th>VETERINARY SCIENCES</th>\n",
       "      <th>VIROLOGY</th>\n",
       "      <th>WATER RESOURCES</th>\n",
       "      <th>WOMEN</th>\n",
       "      <th>ZOOLOGY</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>ACOUSTICS</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.007630</td>\n",
       "      <td>0.038938</td>\n",
       "      <td>0.018006</td>\n",
       "      <td>0.025958</td>\n",
       "      <td>0.013559</td>\n",
       "      <td>0.017154</td>\n",
       "      <td>0.097334</td>\n",
       "      <td>0.090629</td>\n",
       "      <td>0.057525</td>\n",
       "      <td>...</td>\n",
       "      <td>0.007886</td>\n",
       "      <td>0.102819</td>\n",
       "      <td>0.022850</td>\n",
       "      <td>0.001234</td>\n",
       "      <td>0.053652</td>\n",
       "      <td>0.024052</td>\n",
       "      <td>0.024069</td>\n",
       "      <td>0.045031</td>\n",
       "      <td>0.007219</td>\n",
       "      <td>0.044930</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>AGRICULTURAL ECONOMICS &amp; POLICY</th>\n",
       "      <td>0.007630</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.060740</td>\n",
       "      <td>0.069574</td>\n",
       "      <td>0.127465</td>\n",
       "      <td>0.047864</td>\n",
       "      <td>0.006178</td>\n",
       "      <td>0.013845</td>\n",
       "      <td>0.017045</td>\n",
       "      <td>0.003245</td>\n",
       "      <td>...</td>\n",
       "      <td>0.096653</td>\n",
       "      <td>0.048697</td>\n",
       "      <td>0.014034</td>\n",
       "      <td>0.326270</td>\n",
       "      <td>0.005122</td>\n",
       "      <td>0.031746</td>\n",
       "      <td>0.010665</td>\n",
       "      <td>0.065005</td>\n",
       "      <td>0.050875</td>\n",
       "      <td>0.036458</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>AGRICULTURAL ENGINEERING</th>\n",
       "      <td>0.038938</td>\n",
       "      <td>0.060740</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.159571</td>\n",
       "      <td>0.444598</td>\n",
       "      <td>0.319055</td>\n",
       "      <td>0.031748</td>\n",
       "      <td>0.115049</td>\n",
       "      <td>0.090755</td>\n",
       "      <td>0.011790</td>\n",
       "      <td>...</td>\n",
       "      <td>0.010778</td>\n",
       "      <td>0.066520</td>\n",
       "      <td>0.106832</td>\n",
       "      <td>0.004900</td>\n",
       "      <td>0.024656</td>\n",
       "      <td>0.122964</td>\n",
       "      <td>0.190731</td>\n",
       "      <td>0.494796</td>\n",
       "      <td>0.001752</td>\n",
       "      <td>0.109991</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>AGRICULTURE, DAIRY &amp; ANIMAL SCIENCE</th>\n",
       "      <td>0.018006</td>\n",
       "      <td>0.069574</td>\n",
       "      <td>0.159571</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.412862</td>\n",
       "      <td>0.121670</td>\n",
       "      <td>0.049027</td>\n",
       "      <td>0.184242</td>\n",
       "      <td>0.246865</td>\n",
       "      <td>0.021821</td>\n",
       "      <td>...</td>\n",
       "      <td>0.004274</td>\n",
       "      <td>0.002057</td>\n",
       "      <td>0.099812</td>\n",
       "      <td>0.000707</td>\n",
       "      <td>0.040783</td>\n",
       "      <td>0.560081</td>\n",
       "      <td>0.130650</td>\n",
       "      <td>0.029649</td>\n",
       "      <td>0.005266</td>\n",
       "      <td>0.166784</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>AGRICULTURE, MULTIDISCIPLINARY</th>\n",
       "      <td>0.025958</td>\n",
       "      <td>0.127465</td>\n",
       "      <td>0.444598</td>\n",
       "      <td>0.412862</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.614654</td>\n",
       "      <td>0.056437</td>\n",
       "      <td>0.159252</td>\n",
       "      <td>0.132730</td>\n",
       "      <td>0.025192</td>\n",
       "      <td>...</td>\n",
       "      <td>0.007933</td>\n",
       "      <td>0.014974</td>\n",
       "      <td>0.110815</td>\n",
       "      <td>0.005229</td>\n",
       "      <td>0.043964</td>\n",
       "      <td>0.194955</td>\n",
       "      <td>0.156331</td>\n",
       "      <td>0.191959</td>\n",
       "      <td>0.008421</td>\n",
       "      <td>0.170428</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 224 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                     ACOUSTICS  \\\n",
       "ACOUSTICS                             1.000000   \n",
       "AGRICULTURAL ECONOMICS & POLICY       0.007630   \n",
       "AGRICULTURAL ENGINEERING              0.038938   \n",
       "AGRICULTURE, DAIRY & ANIMAL SCIENCE   0.018006   \n",
       "AGRICULTURE, MULTIDISCIPLINARY        0.025958   \n",
       "\n",
       "                                     AGRICULTURAL ECONOMICS & POLICY  \\\n",
       "ACOUSTICS                                                   0.007630   \n",
       "AGRICULTURAL ECONOMICS & POLICY                             1.000000   \n",
       "AGRICULTURAL ENGINEERING                                    0.060740   \n",
       "AGRICULTURE, DAIRY & ANIMAL SCIENCE                         0.069574   \n",
       "AGRICULTURE, MULTIDISCIPLINARY                              0.127465   \n",
       "\n",
       "                                     AGRICULTURAL ENGINEERING  \\\n",
       "ACOUSTICS                                            0.038938   \n",
       "AGRICULTURAL ECONOMICS & POLICY                      0.060740   \n",
       "AGRICULTURAL ENGINEERING                             1.000000   \n",
       "AGRICULTURE, DAIRY & ANIMAL SCIENCE                  0.159571   \n",
       "AGRICULTURE, MULTIDISCIPLINARY                       0.444598   \n",
       "\n",
       "                                     AGRICULTURE, DAIRY & ANIMAL SCIENCE  \\\n",
       "ACOUSTICS                                                       0.018006   \n",
       "AGRICULTURAL ECONOMICS & POLICY                                 0.069574   \n",
       "AGRICULTURAL ENGINEERING                                        0.159571   \n",
       "AGRICULTURE, DAIRY & ANIMAL SCIENCE                             1.000000   \n",
       "AGRICULTURE, MULTIDISCIPLINARY                                  0.412862   \n",
       "\n",
       "                                     AGRICULTURE, MULTIDISCIPLINARY  AGRONOMY  \\\n",
       "ACOUSTICS                                                  0.025958  0.013559   \n",
       "AGRICULTURAL ECONOMICS & POLICY                            0.127465  0.047864   \n",
       "AGRICULTURAL ENGINEERING                                   0.444598  0.319055   \n",
       "AGRICULTURE, DAIRY & ANIMAL SCIENCE                        0.412862  0.121670   \n",
       "AGRICULTURE, MULTIDISCIPLINARY                             1.000000  0.614654   \n",
       "\n",
       "                                      ALLERGY  ANATOMY & MORPHOLOGY  \\\n",
       "ACOUSTICS                            0.017154              0.097334   \n",
       "AGRICULTURAL ECONOMICS & POLICY      0.006178              0.013845   \n",
       "AGRICULTURAL ENGINEERING             0.031748              0.115049   \n",
       "AGRICULTURE, DAIRY & ANIMAL SCIENCE  0.049027              0.184242   \n",
       "AGRICULTURE, MULTIDISCIPLINARY       0.056437              0.159252   \n",
       "\n",
       "                                     ANDROLOGY  ANESTHESIOLOGY    ...     \\\n",
       "ACOUSTICS                             0.090629        0.057525    ...      \n",
       "AGRICULTURAL ECONOMICS & POLICY       0.017045        0.003245    ...      \n",
       "AGRICULTURAL ENGINEERING              0.090755        0.011790    ...      \n",
       "AGRICULTURE, DAIRY & ANIMAL SCIENCE   0.246865        0.021821    ...      \n",
       "AGRICULTURE, MULTIDISCIPLINARY        0.132730        0.025192    ...      \n",
       "\n",
       "                                     TRANSPORTATION  \\\n",
       "ACOUSTICS                                  0.007886   \n",
       "AGRICULTURAL ECONOMICS & POLICY            0.096653   \n",
       "AGRICULTURAL ENGINEERING                   0.010778   \n",
       "AGRICULTURE, DAIRY & ANIMAL SCIENCE        0.004274   \n",
       "AGRICULTURE, MULTIDISCIPLINARY             0.007933   \n",
       "\n",
       "                                     TRANSPORTATION SCIENCE & TECHNOLOGY  \\\n",
       "ACOUSTICS                                                       0.102819   \n",
       "AGRICULTURAL ECONOMICS & POLICY                                 0.048697   \n",
       "AGRICULTURAL ENGINEERING                                        0.066520   \n",
       "AGRICULTURE, DAIRY & ANIMAL SCIENCE                             0.002057   \n",
       "AGRICULTURE, MULTIDISCIPLINARY                                  0.014974   \n",
       "\n",
       "                                     TROPICAL MEDICINE  URBAN STUDIES  \\\n",
       "ACOUSTICS                                     0.022850       0.001234   \n",
       "AGRICULTURAL ECONOMICS & POLICY               0.014034       0.326270   \n",
       "AGRICULTURAL ENGINEERING                      0.106832       0.004900   \n",
       "AGRICULTURE, DAIRY & ANIMAL SCIENCE           0.099812       0.000707   \n",
       "AGRICULTURE, MULTIDISCIPLINARY                0.110815       0.005229   \n",
       "\n",
       "                                     UROLOGY & NEPHROLOGY  \\\n",
       "ACOUSTICS                                        0.053652   \n",
       "AGRICULTURAL ECONOMICS & POLICY                  0.005122   \n",
       "AGRICULTURAL ENGINEERING                         0.024656   \n",
       "AGRICULTURE, DAIRY & ANIMAL SCIENCE              0.040783   \n",
       "AGRICULTURE, MULTIDISCIPLINARY                   0.043964   \n",
       "\n",
       "                                     VETERINARY SCIENCES  VIROLOGY  \\\n",
       "ACOUSTICS                                       0.024052  0.024069   \n",
       "AGRICULTURAL ECONOMICS & POLICY                 0.031746  0.010665   \n",
       "AGRICULTURAL ENGINEERING                        0.122964  0.190731   \n",
       "AGRICULTURE, DAIRY & ANIMAL SCIENCE             0.560081  0.130650   \n",
       "AGRICULTURE, MULTIDISCIPLINARY                  0.194955  0.156331   \n",
       "\n",
       "                                     WATER RESOURCES     WOMEN   ZOOLOGY  \n",
       "ACOUSTICS                                   0.045031  0.007219  0.044930  \n",
       "AGRICULTURAL ECONOMICS & POLICY             0.065005  0.050875  0.036458  \n",
       "AGRICULTURAL ENGINEERING                    0.494796  0.001752  0.109991  \n",
       "AGRICULTURE, DAIRY & ANIMAL SCIENCE         0.029649  0.005266  0.166784  \n",
       "AGRICULTURE, MULTIDISCIPLINARY              0.191959  0.008421  0.170428  \n",
       "\n",
       "[5 rows x 224 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "matrix.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One can see that similarity values range between zero and one.\n",
    "\n",
    "At the moment, the `pm` and `group` data only contain counts for the SCs in which they have actually published. We will now add zeroes for all the other SCs by copying them from the matrix. This way, the matrix, `pm` and `group` all carry information on the exact same set of SCs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pm = pm.reindex(matrix.index, fill_value=0).astype(int)\n",
    "group = group.reindex(matrix.index, fill_value=0).astype(int)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Determining weighted cosine similarity\n",
    "\n",
    "Weighted cosine similarity (WCS) was proposed by [Zhou et al. (2012)](http://doi.org/10.1007/s11192-012-0767-9). The WCS between a panel member, represented by publication vector $M$, and a research group, represented by publication vector $R$, is defined as:\n",
    "\n",
    "$$WCS(M, R) = \\frac{\\sum_{i=1}^N M_i (\\sum_{j=1}^N R_j S_{ji})}\n",
    "                   {\\sqrt{\\sum_{i=1}^N M_i (\\sum_{j=1}^N M_j S_{ji})} \\cdot \\sqrt{\\sum_{i=1}^N R_i (\\sum_{j=1}^N R_j S_{ji})}}$$\n",
    "                   \n",
    "Written as matrix operations, this becomes:\n",
    "\n",
    "$$WCS(M, R) = \\frac{M^t \\cdot S \\cdot R}{\\sqrt{M^t \\cdot S \\cdot M} \\cdot \\sqrt{R^t \\cdot S \\cdot R}}$$\n",
    "\n",
    "where $^t$ denotes matrix transposition and $S$ is the similarity matrix. If the similarity matrix $S$ is the identity matrix, the above equation reduces to regular cosine similarity. We can write a Python fuction to compute WCS using matrix operations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def weighted_cosine_sim(M, R, S):\n",
    "    # because of the way one-dimensional arrays work, explicit transposition is not necessary\n",
    "    return M.dot(S).dot(R) / np.sqrt(M.dot(S).dot(M) * R.dot(S).dot(R))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using this function we obtain the WCS between `pm` and `group`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.79151248916360006"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "weighted_cosine_sim(pm, group, matrix)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We find a WCS of about 0.79. Note that this is higher than one might expect based on the limited overlap in SCs between `pm` and `group`. The reason for the high similarity is that many of the SCs are very similar themselves.\n",
    "\n",
    "For comparison, the regular cosine similarity is only 0.28: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.28118138870339504"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "weighted_cosine_sim(pm, group, np.identity(len(pm)))"
   ]
  },
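  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see why the similarity matrix can raise the value this much, consider a minimal two-category sketch. The numbers are hypothetical and not taken from the data above: one profile publishes only in the first SC, the other only in the second, and the two SCs have similarity $s$.\n",
    "\n",
    "$$M = \\begin{pmatrix} 1 \\\\ 0 \\end{pmatrix}, \\quad R = \\begin{pmatrix} 0 \\\\ 1 \\end{pmatrix}, \\quad S = \\begin{pmatrix} 1 & s \\\\ s & 1 \\end{pmatrix}$$\n",
    "\n",
    "$$M^t \\cdot S \\cdot R = s, \\quad M^t \\cdot S \\cdot M = R^t \\cdot S \\cdot R = 1 \\quad \\Rightarrow \\quad WCS(M, R) = \\frac{s}{\\sqrt{1} \\cdot \\sqrt{1}} = s$$\n",
    "\n",
    "If $S$ is the identity matrix ($s = 0$), the two profiles share no SC and the regular cosine similarity is 0. With, say, $s = 0.6$, the WCS is 0.6 even though there is still no overlap in SCs."
   ]
  },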
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bootstrapped confidence intervals\n",
    "\n",
    "\n",
    "### Bootstrap sample\n",
    "\n",
    "We now turn to the problem of estimating the stability of the value we found. As in the barycenter guide, we do this using a bootstrapping approach.\n",
    "\n",
    "Bootstrapping is a simulation-based method for estimating standard error and confidence intervals. Bootstrapping depends on the notion of a **bootstrap sample**. To determine a bootstrap sample for a panel member or research group with N publications, we randomly sample with replacement N publications from its set of publications. In other words, the same publication can be chosen multiple times. Some publications in the original data set will not occur in the bootstrap data set, whereas others will occur once, twice or even more times.\n",
    "\n",
    "In the following small example, there are six papers, published in respectively SC 2, SC 1, SC 3, SC 2, SC 3 and SC 5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "example = np.array([2, 1, 3, 2, 3, 5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A bootstrap sample for this example is a new array with the same size (six elements), sampled from the original data. We sample with replacement, meaning that the same paper can be picked more than once."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def bootstrap_sample_from_papers(papers):\n",
    "    return np.random.choice(papers, papers.size, replace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because SCs 2 and 3 occur twice in the original data, these are also more likely to be chosen. We call `bootstrap_sample_from_papers` a few times to show how it samples from the input data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 3, 2, 2, 3, 3])"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bootstrap_sample_from_papers(example)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 1, 2, 5, 5, 5])"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bootstrap_sample_from_papers(example)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([3, 3, 2, 2, 1, 5])"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bootstrap_sample_from_papers(example)"
   ]
  },
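  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough indication of how much a bootstrap sample differs from the original data (a standard result, added here only as an illustration): when drawing $N$ times with replacement from $N$ papers, the probability that a given paper is never picked is\n",
    "\n",
    "$$P(\\text{paper not in sample}) = \\left(1 - \\frac{1}{N}\\right)^N$$\n",
    "\n",
    "For the six-paper example this is $(5/6)^6 \\approx 0.335$, and for large $N$ it approaches $e^{-1} \\approx 0.368$. On average, then, roughly a third of the original papers are absent from any one bootstrap sample, while others appear more than once."
   ]
  },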
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, our data are in a form that is centered around SCs rather than individual papers: each row of `pm` and `group` represents a SC and the number of papers in it. We therefore translate from the SC-centered representation to the paper-centered one that is expected by `bootstrap_sample_from_papers`. Once we have a sample, we translate it back to a SC-centered representation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def bootstrap_sample_from_categories(num_papers_per_category):\n",
    "    # First, transform num_papers_per_category into a paper array\n",
    "    # where each different value denotes a different SC.\n",
    "    n_categories = num_papers_per_category.size\n",
    "    papers = np.repeat(np.arange(n_categories), num_papers_per_category)\n",
    "\n",
    "    # Draw sample from papers\n",
    "    sample = bootstrap_sample_from_papers(papers)\n",
    "\n",
    "    # Count number of papers in each SC\n",
    "    return np.bincount(sample, minlength=n_categories)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate the kind of output this gives, we call it a few times with a fictional example: a list of SCs, in which we have respectively 5, 0, 4, 10, 0, and 0 papers. Note that the total number of papers (19) stays constant."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([7, 0, 3, 9, 0, 0], dtype=int64)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example_data = np.array([5, 0, 4, 10, 0, 0])\n",
    "bootstrap_sample_from_categories(example_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([6, 0, 4, 9, 0, 0], dtype=int64)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bootstrap_sample_from_categories(example_data)"
   ]
  },
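  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An equivalent way to think about this step, offered as a sketch rather than as a description of the code: drawing $N$ papers with replacement and counting them per SC amounts to a single draw from a multinomial distribution whose probabilities are the observed shares. Writing $n_i$ for the original count in SC $i$, $n_i^*$ for its bootstrap count and $k$ for the number of SCs (notation introduced here for illustration only):\n",
    "\n",
    "$$(n_1^*, \\ldots, n_k^*) \\sim \\text{Multinomial}\\left(N;\\ \\frac{n_1}{N}, \\ldots, \\frac{n_k}{N}\\right), \\qquad E[n_i^*] = n_i$$\n",
    "\n",
    "For the fictional example above, $N = 19$ and the probabilities are $\\frac{5}{19}, 0, \\frac{4}{19}, \\frac{10}{19}, 0, 0$. An SC without papers in the original data has probability 0 and therefore stays at 0 in every bootstrap sample, as the outputs above show."
   ]
  },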
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can determine a *bootstrap replication*, the statistic we are interested in, starting not from the original data but from the bootstrap sample data. Below, we obtain a bootstrap sample for `pm` and `group` and consequently the corresponding bootstrap replication of their similarity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.76507180844009237"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pm_sample = bootstrap_sample_from_categories(pm)\n",
    "group_sample = bootstrap_sample_from_categories(group)\n",
    "\n",
    "weighted_cosine_sim(pm_sample, group_sample, matrix)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The value we obtain is slightly lower than the similarity based on the empirically found counts (0.79)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Confidence intervals\n",
    "\n",
    "One or a few bootstrap samples are not enough to estimate a confidence interval. For that, it is recommended to obtain 1000 or more bootstrap samples. From each sample we calculate the bootstrap replication.\n",
    "\n",
    "In our case, we calculate 1000 bootstrapped similarity values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "replication_count = 1000\n",
    "replications = np.empty(replication_count)\n",
    "\n",
    "for i in range(replication_count):\n",
    "    pm_sample = bootstrap_sample_from_categories(pm)\n",
    "    group_sample = bootstrap_sample_from_categories(group)\n",
    "\n",
    "    replications[i] = weighted_cosine_sim(pm_sample, group_sample, matrix)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the first ten replications."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0.80221476,  0.76605643,  0.79051016,  0.82246141,  0.7661963 ,\n",
       "        0.80273074,  0.78869054,  0.78316525,  0.79785403,  0.78274641])"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "replications[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we expected, they all seem to lie fairly close to the WCS which was based on the empirical data. Below we show the minimum, maximum, median and mean."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Minimum: 0.730969905648\n",
      "Maximum: 0.847106600479\n",
      "Median: 0.790739216682\n",
      "Mean: 0.7904931757\n"
     ]
    }
   ],
   "source": [
    "print('Minimum:', np.min(replications))\n",
    "print('Maximum:', np.max(replications))\n",
    "print('Median:', np.median(replications))\n",
    "print('Mean:', np.mean(replications))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, to obtain a 95% confidence interval, we use **bootstrap percentiles** (other ways of determining confidence intervals in bootstrapping also exist). We first sort the distances from small to large. The lower and upper bound of the confidence interval are then (in this case) the 25th and 975th distance, such that 95% of the variation is within the interval.\n",
    "\n",
    "More generally, for a $1 - \\alpha$ interval, the lower and upper bound are found at $\\frac{N \\alpha}{2}$ and $\\frac{N (1 - \\alpha)}{2}$, respectively, where $N$ denotes the number of samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Lower = 0.756301798053\n",
      "Upper = 0.82287302336\n"
     ]
    }
   ],
   "source": [
    "replications = np.sort(replications)\n",
    "lower = replications[25]\n",
    "upper = replications[975]\n",
    "\n",
    "print('Lower =', lower)\n",
    "print('Upper =', upper)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In summary, for this example we obtain a confidence interval between (roughly) 0.76 and 0.82."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}