Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save artificialsoph/a9d200b3d0b2ca6e78bbbba6b6ba22e7 to your computer and use it in GitHub Desktop.
Save artificialsoph/a9d200b3d0b2ca6e78bbbba6b6ba22e7 to your computer and use it in GitHub Desktop.
Evaluate runtime of pairwise distance calculations with sparse arrays
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2017-09-12T17:00:55.334918Z",
"start_time": "2017-09-12T17:00:53.845091Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Populating the interactive namespace from numpy and matplotlib\n"
]
}
],
"source": [
"%pylab inline\n",
"import numpy\n",
"import scipy\n",
"from scipy.spatial.distance import pdist\n",
"from sklearn.metrics.pairwise import pairwise_distances"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Sparse Matrix creation\n",
"Create a sparse matrix where 90% of cells are exactly zero. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2017-09-12T17:00:55.568337Z",
"start_time": "2017-09-12T17:00:55.337899Z"
}
},
"outputs": [],
"source": [
"dims = (2000,1000)\n",
"non_zero_ratio = .1\n",
"num_elements = int(dims[0]*dims[1]*non_zero_ratio)\n",
"\n",
"flat_indicies = numpy.random.choice(prod(dims), size=num_elements,replace=False)\n",
"values = numpy.random.randint(100,size=num_elements)\n",
"x_i, y_i = numpy.unravel_index(flat_indicies, dims)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2017-09-12T15:19:05.203590Z",
"start_time": "2017-09-12T15:19:05.169649Z"
}
},
"source": [
"# Compare scikit-learn `pairwise_distances` performance on different sparse matrices\n",
"\n",
"scikit-learn's library supports sparse matrices for Eucledian and cosine distance metrics. However, the details of this support are unclear from the documentation. Intuitively, sparse optimization should be specific to the sparse representation, e.g. CSR matrices provide fast operations across rows while CSC matrices do not. With this in mind I test three of the most useful sparse representations.\n",
"\n",
"The results show that the input representation is not particularly important for the runtime of the pairwise distance calculation. We can infer that, under the hood, the library is likely converting representations to the most helpful one (likely CSR) and that this conversion is relatively inexpensive."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2017-09-12T17:01:04.755551Z",
"start_time": "2017-09-12T17:00:55.571520Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"420 ms ± 44.3 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit -r 20\n",
"sparse_row = scipy.sparse.csr_matrix( (values, (x_i,y_i)) , shape=dims)\n",
"pd_row = pairwise_distances(sparse_row)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2017-09-12T17:01:13.653505Z",
"start_time": "2017-09-12T17:01:04.759771Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"417 ms ± 21 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit -r 20\n",
"sparse_col = scipy.sparse.csc_matrix( (values, (x_i,y_i)) , shape=dims)\n",
"pd_col = pairwise_distances(sparse_col)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2017-09-12T17:01:22.055313Z",
"start_time": "2017-09-12T17:01:13.655917Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"391 ms ± 17.3 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit -r 20\n",
"sparse_coo = scipy.sparse.coo_matrix( (values, (x_i,y_i)) , shape=dims)\n",
"pd_coo = pairwise_distances(sparse_coo)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Compare SciPy `pdist` on a CSR matrix\n",
"\n",
"`pdist` throws an error when a sparse matrix is used as an input (see this [issue](https://github.com/scipy/scipy/issues/7565)) so we have to convert to dense.\n",
"\n",
"In our use case, we see that using dense matrices causes a tenfold increase in runtime, a tremendous performance drag."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2017-09-12T17:02:55.147586Z",
"start_time": "2017-09-12T17:01:22.058596Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"4.44 s ± 415 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit -r 20\n",
"sparse_row = scipy.sparse.csr_matrix( (values, (x_i,y_i)) , shape=dims)\n",
"pd_row = pdist(sparse_row.todense())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using Numpy and linear algebra\n",
"\n",
"The eucledian distance can be implemented using linear algebra. Here, I test the speed of this implementation strategy against that of external libraries.\n",
"\n",
"This function is slower (by about 2 standard deviations) but distinctly close."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2017-09-12T17:02:56.594604Z",
"start_time": "2017-09-12T17:02:55.150158Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def pd_euc(x,y=None):\n",
" if not y:\n",
" y = x\n",
" return np.sqrt(x.power(2).sum(1) + y.power(2).sum(1).T - 2*x.dot(y.T))\n",
"\n",
"# test to make sure it works\n",
"sparse_row = scipy.sparse.csr_matrix( (values, (x_i,y_i)) , shape=dims)\n",
"pd_row = pairwise_distances(sparse_row)\n",
"pd = pd_euc(sparse_row)\n",
"numpy.allclose(pd_row,pd)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2017-09-12T17:03:08.217710Z",
"start_time": "2017-09-12T17:02:56.599197Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"541 ms ± 66 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit -r 20\n",
"sparse_row = scipy.sparse.csr_matrix( (values, (x_i,y_i)) , shape=dims)\n",
"pd = pd_euc(sparse_row)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment