@rafguns
Last active November 24, 2016 10:06
Bootstrapping confidence intervals for the distances between barycenters
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Bootstrapping confidence intervals for the distances between barycenters\n",
"\n",
"This document is an appendix to the publication\n",
"\n",
"> Rahman, A.I.M.J., Guns, R., Leydesdorff, L., & Engels, T.C.E. (2016). Measuring the match between evaluators and evaluees: cognitive distances between panel members and research groups at the journal level. *Scientometrics*, in press. http://doi.org/10.1007/S11192-016-2132-X\n",
"\n",
"This appendix explains our bootstrap-based approach to determining confidence intervals. In general, we follow the standard textbook by Efron and Tibshirani (1998), although this document focuses on our specific application, namely distances between barycenters.\n",
"\n",
"To make it easier to follow, we will use a small hypothetical example, determining a confidence interval for the distance between two barycenters. The same logic can, however, be applied to much larger data sets. Our implementation is in Python and uses various libraries from the scientific Python stack like [Numpy](http://www.numpy.org) and [Pandas](http://pandas.pydata.org), but can in principle be translated to other languages and environments.\n",
"\n",
"First we load these libraries."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Euclidean distance between barycenters\n",
"\n",
"### Getting the data ready\n",
"\n",
"Now suppose that we have a panel member `pm` and a research group `group`. We have their journal publication profile and want to determine the cognitive distance between them. The data below is, again, a hypothetical example. Here, `pm` has 15 papers in _J Biol Chem_, 6 in _Plos One_, and so on. Likewise, `group` has 24 papers in _Embo J_, 17 in _Mol Cell Biol_, and so on."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pm = pd.Series({\n",
" 'J Biol Chem': 15,\n",
" 'Plos One': 6,\n",
" 'Gene': 5,\n",
" 'Cancer Lett': 1,\n",
" 'Febs J': 1, \n",
"})\n",
"\n",
"group = pd.Series({\n",
" 'Embo J': 24,\n",
" 'Mol Cell Biol': 17,\n",
" 'J Biol Chem': 10,\n",
" 'Febs Lett': 4,\n",
" 'Cell Signal': 2,\n",
" 'Plos One': 1,\n",
" 'Cancer Lett': 1, \n",
"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Apart from the publication profiles, we also need a journal base map in order to establish the location of the barycenters. We load a map that is publicly available. Since we only need the journal titles and their x and y coordinates, we keep those columns and disregard the rest."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"basemap = pd.read_csv('http://www.leydesdorff.net/journals11/cited_all.txt', index_col=1)\n",
"basemap = basemap[['x', 'y']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first five rows of our map data look as follows."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>x</th>\n",
" <th>y</th>\n",
" </tr>\n",
" <tr>\n",
" <th>label</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>4Or-Q J Oper Res</th>\n",
" <td>0.7355</td>\n",
" <td>0.1316</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Aaohn J</th>\n",
" <td>-0.1649</td>\n",
" <td>-0.3810</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Aapg Bull</th>\n",
" <td>0.3257</td>\n",
" <td>0.6377</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Aaps J</th>\n",
" <td>-0.2699</td>\n",
" <td>0.3145</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Aaps Pharmscitech</th>\n",
" <td>-0.2208</td>\n",
" <td>0.3412</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" x y\n",
"label \n",
"4Or-Q J Oper Res 0.7355 0.1316\n",
"Aaohn J -0.1649 -0.3810\n",
"Aapg Bull 0.3257 0.6377\n",
"Aaps J -0.2699 0.3145\n",
"Aaps Pharmscitech -0.2208 0.3412"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"basemap.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At the moment, the `pm` and `group` data only contain counts for the journals where they have actually published. We will now add zeroes for all the other journals by copying them from the basemap data. This way, the map, `pm` and `group` all carry information on the exact same set of journals.\n",
"\n",
"**Note**: In this example, the journals are chosen such that all of them also occur in the base map. In general this need not be the case; journals that do not appear in the base map are silently dropped by the reindexing step, so one should check for (and decide how to handle) such mismatches."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pm = pm.reindex(basemap.index, fill_value=0)\n",
"group = group.reindex(basemap.index, fill_value=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Determining the barycenters\n",
"\n",
"A barycenter is the point $C = (C_1, C_2)$ where:\n",
"\n",
"$$C_1 = \\frac{\\sum_{j=1}^N m_j L_{j,1}}{T} ; \\quad C_2 = \\frac{\\sum_{j=1}^N m_j L_{j,2}}{T}$$\n",
"\n",
"Here, $L_{j,1}$ and $L_{j,2}$ are the horizontal and vertical coordinates of journal $j$ on the map, $m_j$ is the number of publications in journal $j$, and $T = \\sum_{j=1}^N m_j$ is the total number of publications of the entity. There are $N$ journals.\n",
"\n",
"If $M = (m_1, \\ldots, m_N)$ is the column vector with the number of publications per journal, and\n",
"$B = \\begin{pmatrix}\n",
" L_{1,1} & L_{1,2}\\\\\n",
" \\vdots & \\vdots \\\\ \n",
" L_{N,1} & L_{N,2}\n",
"\\end{pmatrix}$\n",
"is the N-by-2 matrix containing the coordinates on the base map, the barycenter is:\n",
"\n",
"$$C = \\frac{B^T M}{T}$$\n",
"\n",
"The function below uses this matrix approach."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def barycenter(M, B):\n",
"    # Scale each journal's map coordinates by its number of publications,\n",
"    # then divide the coordinate-wise sums by the total number of publications.\n",
"    weighted = B.T * M\n",
"    return weighted.sum(axis=1) / np.sum(M)"
]
},
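As a quick sanity check (our addition, not part of the original notebook), the function can be verified against a hand computation on a hypothetical two-journal map:

```python
import numpy as np
import pandas as pd

def barycenter(M, B):
    # Scale each journal's map coordinates by its number of publications,
    # then divide the coordinate-wise sums by the total number of publications.
    weighted = B.T * M
    return weighted.sum(axis=1) / np.sum(M)

# Hypothetical map: journal A at (0, 0), journal B at (1, 2).
tiny_map = pd.DataFrame({'x': [0.0, 1.0], 'y': [0.0, 2.0]}, index=['A', 'B'])
# One paper in A, three in B: the barycenter lies 3/4 of the way towards B.
counts = pd.Series({'A': 1, 'B': 3})
center = barycenter(counts, tiny_map)
print(center['x'], center['y'])  # 0.75 1.5
```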
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using this function we obtain the barycenters for `pm` and `group`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Barycenter for panel member = (-0.8057249999999999, 0.19977499999999998)\n",
"Barycenter for research group = (-0.8685, 0.20312280701754384)\n"
]
}
],
"source": [
"pm_center = barycenter(pm, basemap)\n",
"group_center = barycenter(group, basemap)\n",
"\n",
"print('Barycenter for panel member = ({c.x}, {c.y})'.format(c=pm_center))\n",
"print('Barycenter for research group = ({c.x}, {c.y})'.format(c=group_center))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Determining the distance between the barycenters\n",
"\n",
"The Euclidean distance between points $C= (C_1, C_2)$ and $D = (D_1, D_2)$ is calculated as follows:\n",
"\n",
"$$d = \\sqrt{(C_1 - D_1)^2 + (C_2 - D_2)^2}$$\n",
"\n",
"The code below uses Numpy's `np.linalg.norm` routine to implement this, but it can easily be verified that it yields the same result on our example as a more direct transcription of the above formula."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def euclidean_distance(C, D):\n",
"    # Euclidean (L2) norm of the difference vector.\n",
"    return np.linalg.norm(C - D, axis=0)"
]
},
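To illustrate the equivalence claimed above, the `np.linalg.norm` version can be compared with a direct transcription of the formula on two hypothetical points (the coordinates below are illustrative, not taken from the base map):

```python
import numpy as np

def euclidean_distance(C, D):
    # Euclidean (L2) norm of the difference vector.
    return np.linalg.norm(C - D, axis=0)

def euclidean_distance_direct(C, D):
    # Direct transcription of d = sqrt((C1 - D1)^2 + (C2 - D2)^2)
    return np.sqrt((C[0] - D[0]) ** 2 + (C[1] - D[1]) ** 2)

C = np.array([-0.8057, 0.1998])
D = np.array([-0.8685, 0.2031])
print(euclidean_distance(C, D), euclidean_distance_direct(C, D))
```

Both calls return the same value, confirming that the two formulations agree.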
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using this function we obtain the distance between the two barycenters."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Euclidean distance between the two barycenters = 0.0628642063246\n"
]
}
],
"source": [
"dist = euclidean_distance(pm_center, group_center)\n",
"\n",
"print('Euclidean distance between the two barycenters =', dist)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The bootstrap\n",
"\n",
"\n",
"### Bootstrap sample\n",
"\n",
"The distance obtained (0.063) can be used to choose which of several potential panel members is best suited to this research group. While it seems logical to choose the panel member that is closest to the group, the question arises: is a panel member at distance 0.064, for instance, really a worse choice than the PM in our example? It seems fair to assume that such small differences lie within the margin of error. At the same time, it is unclear where the boundaries lie. What about, for instance, another PM at distance 0.08?\n",
"\n",
"Bootstrapping is a simulation-based method for estimating standard error and confidence intervals. Bootstrapping depends on the notion of a **bootstrap sample**. To determine a bootstrap sample for a panel member or research group with N publications, we randomly sample with replacement N publications from its set of publications. In other words, the same publication can be chosen multiple times. Some publications in the original data set will not occur in the bootstrap data set, whereas others will occur once, twice or even more times.\n",
"\n",
"In the following small example, there are six papers, published in respectively journal 2, journal 1, journal 3, journal 2, journal 3 and journal 5."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"example = np.array([2, 1, 3, 2, 3, 5])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A bootstrap sample for this example is a new array of the same size (six elements), sampled from the original data. We sample with replacement, meaning that the same paper can be picked more than once."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def bootstrap_sample_from_papers(papers):\n",
" return np.random.choice(papers, papers.size, replace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because journals 2 and 3 occur twice in the original data, these are also more likely to be chosen. We call `bootstrap_sample_from_papers` a few times to show how it samples from the input data."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([2, 2, 2, 5, 5, 2])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bootstrap_sample_from_papers(example)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([5, 3, 5, 3, 1, 1])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bootstrap_sample_from_papers(example)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([2, 3, 2, 3, 1, 5])"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bootstrap_sample_from_papers(example)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data we have used for barycenter calculation is 'journal-centered' rather than 'paper-centered'. That is, each row of `pm` and `group` represents a journal and the number of papers in it. We therefore translate from the journal-centered representation to the paper-centered one that is expected by `bootstrap_sample_from_papers`. Once we have a sample, we translate it back to a journal-centered representation."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def bootstrap_sample_from_journals(num_papers_per_journal):\n",
" # First, transform num_papers_per_journal into a paper array\n",
" # where each different value denotes a different journal.\n",
" n_journals = num_papers_per_journal.size\n",
" papers = np.repeat(np.arange(n_journals), num_papers_per_journal)\n",
"\n",
" # Draw sample from papers\n",
" sample = bootstrap_sample_from_papers(papers)\n",
"\n",
" # Count number of papers in each journal\n",
" return np.bincount(sample, minlength=n_journals)"
]
},
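One property worth checking (our addition, not in the original notebook): because each bootstrap sample draws exactly N papers, the total paper count of every sample equals that of the input, and a journal with zero papers can never be drawn:

```python
import numpy as np

def bootstrap_sample_from_papers(papers):
    return np.random.choice(papers, papers.size, replace=True)

def bootstrap_sample_from_journals(num_papers_per_journal):
    # Expand journal counts into one entry per paper, sample, then re-count.
    n_journals = num_papers_per_journal.size
    papers = np.repeat(np.arange(n_journals), num_papers_per_journal)
    sample = bootstrap_sample_from_papers(papers)
    return np.bincount(sample, minlength=n_journals)

example_data = np.array([5, 0, 4, 10])
for _ in range(100):
    sample = bootstrap_sample_from_journals(example_data)
    assert sample.sum() == example_data.sum()  # always 19 papers in total
    assert sample[1] == 0  # the empty journal is never sampled
```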
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the kind of output this gives, we call it a few times with a fictional example: a list of journals, in which we have respectively 5, 0, 4, and 10 papers. Note that the total number of papers (19) stays constant."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([6, 0, 4, 9], dtype=int64)"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"example_data = np.array([5, 0, 4, 10])\n",
"bootstrap_sample_from_journals(example_data)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([8, 0, 3, 8], dtype=int64)"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bootstrap_sample_from_journals(example_data)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 5, 0, 4, 10], dtype=int64)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bootstrap_sample_from_journals(example_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can determine a *bootstrap replication*: the statistic we are interested in, computed not from the original data but from a bootstrap sample. Below, we obtain a bootstrap sample for `pm` and subsequently the corresponding bootstrap replication of its barycenter."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"x -0.783582\n",
"y 0.195439\n",
"dtype: float64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pm_sample = bootstrap_sample_from_journals(pm)\n",
"barycenter(pm_sample, basemap)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These coordinates are indeed fairly similar to (but not the same as) the ones we got from the actual data (-0.806, 0.200)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Confidence intervals\n",
"\n",
"One or a few bootstrap samples are not enough to estimate a confidence interval. For that, it is recommended to obtain 1000 or more bootstrap samples. From each sample we calculate the bootstrap replication.\n",
"\n",
"In our specific case, we calculate 1000 bootstrapped barycenters for the PM and 1000 for the research group."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pm_barycenters = np.zeros((1000, 2))\n",
"group_barycenters = np.zeros((1000, 2))\n",
"\n",
"for i in range(1000):\n",
" # Panel member\n",
" pm_sample = bootstrap_sample_from_journals(pm)\n",
" pm_barycenter = barycenter(pm_sample, basemap)\n",
" pm_barycenters[i] = pm_barycenter\n",
" \n",
" # Research group\n",
" group_sample = bootstrap_sample_from_journals(group)\n",
" group_barycenter = barycenter(group_sample, basemap)\n",
" group_barycenters[i] = group_barycenter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we look at (a part of) `pm_barycenters`, we see that it has different X and Y coordinates, all within a similar range."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[-0.80268571, 0.2088 ],\n",
" [-0.81938214, 0.193875 ],\n",
" [-0.7721 , 0.19037857],\n",
" ..., \n",
" [-0.83252143, 0.21212143],\n",
" [-0.81429643, 0.18568929],\n",
" [-0.78823571, 0.20265714]])"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pm_barycenters"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our foremost concern is not the barycenter coordinates in themselves (although those might be interesting as well), but rather the *distances* between the PM's and the group's barycenters. From the bootstrap sample barycenters, we can obtain 1000 bootstrap sample distances by determining the distance between the first PM barycenter and the first group barycenter, between the second PM barycenter and the second group barycenter, and so on."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"distances = np.zeros(1000)\n",
"\n",
"for i in range(1000):\n",
" distances[i] = euclidean_distance(pm_barycenters[i], group_barycenters[i])"
]
},
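Since the barycenters are stored as 1000-by-2 arrays, the loop above can equivalently be written as a single vectorized call (a sketch using stand-in random arrays shaped like `pm_barycenters` and `group_barycenters`; `axis=1` takes the norm of each row):

```python
import numpy as np

# Stand-in bootstrap barycenters, shaped like the arrays in the notebook.
rng = np.random.RandomState(0)
pm_barycenters = rng.rand(1000, 2)
group_barycenters = rng.rand(1000, 2)

# Row-wise Euclidean distances in one vectorized call.
distances = np.linalg.norm(pm_barycenters - group_barycenters, axis=1)

# Same result as the explicit loop.
loop_distances = np.array([np.linalg.norm(pm_barycenters[i] - group_barycenters[i])
                           for i in range(1000)])
print(np.allclose(distances, loop_distances))  # True
```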
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have an array of 1000 bootstrapped distances. Below we show the minimum, maximum, median and mean."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Minimum: 0.0102225087121\n",
"Maximum: 0.13107897628\n",
"Median: 0.0623148978082\n",
"Mean: 0.062806860686\n"
]
}
],
"source": [
"print('Minimum:', np.min(distances))\n",
"print('Maximum:', np.max(distances))\n",
"print('Median:', np.median(distances))\n",
"print('Mean:', np.mean(distances))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, to obtain a 95% confidence interval, we use **bootstrap percentiles** (other ways of determining confidence intervals in bootstrapping also exist). We first sort the distances from small to large. The lower and upper bound of the confidence interval are then (in this case) the 25th and the 975th distance, such that 95% of the bootstrapped distances lie within the interval.\n",
"\n",
"More generally, for a $1 - \\alpha$ confidence interval, the lower and upper bound are found at positions $N \\frac{\\alpha}{2}$ and $N \\left(1 - \\frac{\\alpha}{2}\\right)$, respectively, where $N$ denotes the number of bootstrap samples. For $N = 1000$ and $\\alpha = 0.05$, these are positions $25$ and $975$."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Lower = 0.0238465342385\n",
"Upper = 0.105393614124\n"
]
}
],
"source": [
"distances = np.sort(distances)\n",
"lower = distances[25]\n",
"upper = distances[975]\n",
"\n",
"print('Lower =', lower)\n",
"print('Upper =', upper)"
]
},
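The same bounds can also be obtained via NumPy's percentile routine (a sketch on stand-in data, since the bootstrapped distances themselves are random); note that `np.percentile` interpolates between order statistics by default, so its result can differ very slightly from picking the 25th and 975th sorted value:

```python
import numpy as np

rng = np.random.RandomState(42)
distances = rng.rand(1000)  # stand-in for the 1000 bootstrapped distances

# Bounds by explicit sorting, as in the notebook.
sorted_d = np.sort(distances)
lower, upper = sorted_d[25], sorted_d[975]

# Percentile-based bounds for a 95% interval.
p_lower, p_upper = np.percentile(distances, [2.5, 97.5])
print(lower, p_lower)
print(upper, p_upper)
```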
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at this confidence interval, we see that the difference between this PM and another PM at distance 0.064 or even 0.08 – the examples used earlier – falls within the margin of error."
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}