{ | |
"cells": [ | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"id": "b3807586-479a-4f66-8ecf-524ab66a8f75", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"import pandas as pd" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "7f3832dd-d4fb-4930-b13e-2f17f461d6ed", | |
"metadata": {}, | |
"source": [ | |
"#### Today I learned how to pack a Python function into a NumPy ufunc for broadcasting and vectorization to speed up the computation" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "aa72ee17-3dd0-4061-ba74-b816a378c2c7", | |
"metadata": {}, | |
"source": [ | |
"Let's say we have a Python function that works on a single input, and we want to apply it row-wise over the entire dataset." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"id": "fde37878-a0dd-40ca-8873-5f6a0f94d7c9", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def pad_str_get_list(input_str:str, out_len: int) ->list:\n", | |
" \"\"\"\n", | |
" Pad an input string with 'X' to out_len and then return a list\n", | |
" \n", | |
" :param input_str: one input string, e.g 'ABC'\n", | |
" :param out_len: the targeted length after padding, an integer, such as 5\n", | |
" \n", | |
" return:\n", | |
" a list reflecting the padded output, such as ['A', 'B', 'C', 'X', 'X']\n", | |
" \n", | |
" \"\"\"\n", | |
" out_str = input_str + (out_len - len(input_str))* 'X'\n", | |
" return list(out_str)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "83e3b785-3544-4ea8-b6d0-2d4971b4db73", | |
"metadata": {}, | |
"source": [ | |
"Verify this function works." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"id": "2b37a4ca-05c8-493d-a696-8953bdf2b28c", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"['A', 'B', 'C', 'D', 'E', 'X', 'X', 'X', 'X', 'X']" | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"pad_str_get_list('ABCDE', 10)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "e5cbd7da-bdd6-4b81-b493-91b403f899d8", | |
"metadata": {}, | |
"source": [ | |
"**First example**: on 1D array" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"id": "12da9526-7844-4dfa-8a95-b600728b9659", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"example_1D = np.array(['apples', 'foobar', 'cowboy', 'banana', 'watermelon', 'pear'])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"id": "51ec8557-6ef1-4882-bc5c-0bf2898e0359", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array(['apples', 'foobar', 'cowboy', 'banana', 'watermelon', 'pear'],\n", | |
" dtype='<U10')" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"example_1D" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"id": "0957f92a-aa6c-4331-b61c-f26926a47638", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(list,\n", | |
" [['a', 'p', 'p', 'l', 'e', 's', 'X', 'X', 'X', 'X'],\n", | |
" ['f', 'o', 'o', 'b', 'a', 'r', 'X', 'X', 'X', 'X'],\n", | |
" ['c', 'o', 'w', 'b', 'o', 'y', 'X', 'X', 'X', 'X'],\n", | |
" ['b', 'a', 'n', 'a', 'n', 'a', 'X', 'X', 'X', 'X'],\n", | |
" ['w', 'a', 't', 'e', 'r', 'm', 'e', 'l', 'o', 'n'],\n", | |
" ['p', 'e', 'a', 'r', 'X', 'X', 'X', 'X', 'X', 'X']])" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"pd_apply_out = pd.Series(example_1D).apply(lambda x: pad_str_get_list(x, 10)).tolist()\n", | |
"type(pd_apply_out), pd_apply_out" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"id": "38b637c9-e8b8-46bd-b40f-edba4b78f852", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[['a', 'p', 'p', 'l', 'e', 's', 'X', 'X', 'X', 'X'],\n", | |
" ['f', 'o', 'o', 'b', 'a', 'r', 'X', 'X', 'X', 'X'],\n", | |
" ['c', 'o', 'w', 'b', 'o', 'y', 'X', 'X', 'X', 'X'],\n", | |
" ['b', 'a', 'n', 'a', 'n', 'a', 'X', 'X', 'X', 'X'],\n", | |
" ['w', 'a', 't', 'e', 'r', 'm', 'e', 'l', 'o', 'n'],\n", | |
" ['p', 'e', 'a', 'r', 'X', 'X', 'X', 'X', 'X', 'X']]" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"vec_pad_str_get_list = np.frompyfunc(pad_str_get_list, 2,1)\n", | |
"vec_pad_str_get_list(example_1D, 10).tolist()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"id": "1ae7f421-620f-429a-8357-c85ca4f0f741", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"The pandas approach on 1d array takes ..\n", | |
"242 µs ± 43.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n", | |
"The vectorized approach on 1d array takes ..\n", | |
"5.92 µs ± 746 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"print('The pandas approach on 1d array takes ..')\n", | |
"%timeit pd.Series(example_1D).apply(lambda x: pad_str_get_list(x, 10))\n", | |
"print('The vectorized approach on 1d array takes ..')\n", | |
"%timeit vec_pad_str_get_list(example_1D, 10).tolist()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "7990e783-8a22-422d-94d5-46a83f8e53d3", | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "f17a042b-4e3d-477c-9cf7-3d173196564a", | |
"metadata": {}, | |
"source": [ | |
"**Second example**: on 2D array" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"id": "8195abe3-7487-4fa9-b90d-a72d1b442771", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"example_2D = np.array([['apples', 'foobar'], ['cowboy', 'banana'], ['watermelon', 'pear']])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "310f687a-f35d-45aa-8ff6-fb7e367eab01", | |
"metadata": {}, | |
"source": [ | |
"Vectorized result" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"id": "70af5bfe-2f37-49d3-81f1-9e1992e655b1", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([[list(['a', 'p', 'p', 'l', 'e', 's', 'X', 'X', 'X', 'X']),\n", | |
" list(['f', 'o', 'o', 'b', 'a', 'r', 'X', 'X', 'X', 'X'])],\n", | |
" [list(['c', 'o', 'w', 'b', 'o', 'y', 'X', 'X', 'X', 'X']),\n", | |
" list(['b', 'a', 'n', 'a', 'n', 'a', 'X', 'X', 'X', 'X'])],\n", | |
" [list(['w', 'a', 't', 'e', 'r', 'm', 'e', 'l', 'o', 'n']),\n", | |
" list(['p', 'e', 'a', 'r', 'X', 'X', 'X', 'X', 'X', 'X'])]],\n", | |
" dtype=object)" | |
] | |
}, | |
"execution_count": 10, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"vec_pad_str_get_list(example_2D, 10)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "8b56040d-c22a-402a-b577-0f501b892481", | |
"metadata": {}, | |
"source": [ | |
"The pandas approach - same as the vectorized result" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"id": "178a550e-6e63-435b-a676-e5336ca49ee9", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([[list(['a', 'p', 'p', 'l', 'e', 's', 'X', 'X', 'X', 'X']),\n", | |
" list(['f', 'o', 'o', 'b', 'a', 'r', 'X', 'X', 'X', 'X'])],\n", | |
" [list(['c', 'o', 'w', 'b', 'o', 'y', 'X', 'X', 'X', 'X']),\n", | |
" list(['b', 'a', 'n', 'a', 'n', 'a', 'X', 'X', 'X', 'X'])],\n", | |
" [list(['w', 'a', 't', 'e', 'r', 'm', 'e', 'l', 'o', 'n']),\n", | |
" list(['p', 'e', 'a', 'r', 'X', 'X', 'X', 'X', 'X', 'X'])]],\n", | |
" dtype=object)" | |
] | |
}, | |
"execution_count": 11, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"pd.DataFrame(example_2D).applymap(lambda x: pad_str_get_list(x, 10)).to_numpy()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"id": "30142be2-099e-4a09-8922-f610d6cd76fc", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"The pandas approach on 2d array takes ..\n", | |
"710 µs ± 40.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n", | |
"The vectorized approach on 2d array takes ..\n", | |
"7.97 µs ± 1.12 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n" | |
] | |
} | |
], | |
"source": [ | |
"print('The pandas approach on 2d array takes ..')\n", | |
"%timeit pd.DataFrame(example_2D).applymap(lambda x: pad_str_get_list(x, 10)).to_numpy()\n", | |
"\n", | |
"print('The vectorized approach on 2d array takes ..')\n", | |
"%timeit vec_pad_str_get_list(example_2D, 10)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "a995be23-231e-4e97-a668-e3c9c4643a6f", | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "esol_graph", | |
"language": "python", | |
"name": "esol_graph" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.8.12" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} |
Wow, didn't know about np.frompyfunc, but I think it might not actually do vectorization... that would be another interesting comparison.
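One rough way to check this is to count how often Python is actually invoked. The sketch below is not from the gist: `calls`, `counted_len`, and `ufunc_len` are made-up names for illustration. If the wrapped function runs once per element, the ufunc is looping in Python under the hood rather than running compiled, vectorized code.

```python
import numpy as np

calls = {"n": 0}  # hypothetical counter, not part of the gist


def counted_len(s):
    """Return len(s) and record that the Python function ran once."""
    calls["n"] += 1
    return len(s)


# Wrap the plain Python function as a ufunc: 1 input, 1 output
ufunc_len = np.frompyfunc(counted_len, 1, 1)

words = np.array(["apples", "foobar", "cowboy", "banana", "watermelon", "pear"])
result = ufunc_len(words)

print(calls["n"])    # number of Python-level calls made for the 6 elements
print(result.dtype)  # dtype of the array that frompyfunc hands back
```

So what `np.frompyfunc` buys you is broadcasting and ufunc semantics with little overhead, not elimination of the per-element Python call.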
Okay, check out my gist!
The pandas approach on 1d array takes ..
371 µs ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The vectorized approach on 1d array takes ..
7.14 µs ± 406 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
A list comprehension takes ..
8.7 µs ± 732 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
If anything, you have demonstrated that pandas apply is surprisingly slow.
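For reference, here is a sketch of what the list-comprehension variant presumably looks like. The exact code behind the timing above is not shown in this thread, so treat this as an assumption; `list_comp_out` and `ufunc_out` are illustrative names, while `pad_str_get_list` and `example_1D` are restated from the notebook.

```python
import numpy as np


def pad_str_get_list(input_str: str, out_len: int) -> list:
    """Pad input_str with 'X' up to out_len and return its characters."""
    return list(input_str + (out_len - len(input_str)) * "X")


example_1D = np.array(["apples", "foobar", "cowboy", "banana", "watermelon", "pear"])

# Plain Python loop over the array elements
list_comp_out = [pad_str_get_list(s, 10) for s in example_1D]

# The same function wrapped as a ufunc, as in the notebook
vec_pad_str_get_list = np.frompyfunc(pad_str_get_list, 2, 1)
ufunc_out = vec_pad_str_get_list(example_1D, 10).tolist()

# Both approaches should produce identical nested lists
print(list_comp_out == ufunc_out)
```

Both versions call the Python function once per element, which would explain why their timings land so close together while `pandas` `apply` pays extra overhead on top.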
Final comment: why use gists instead of bitbucket repos? I found it a pain to "clone" and run this locally, and it seems it will be harder to keep track of than a repo. Also what if you ever want to add another file?
@alokito I think gists are a good staging ground, perhaps? I've put code snippets and notebooks (with outputs) up here to quickly share them with others. NBs with outputs are a bad idea in a repo. Perhaps once @wusixer amasses a collection of NBs, that would be a good time to move things into a repo?
@alokito Nice! The reason I put it in a gist is that it was quick and easy for me to do, and I can look up what I need by searching gist names (I guess you can do that in Bitbucket/GitHub too, but it would take a bit more structure to make it a repo). Maybe someday I will have a repo called "little_things_Iearned_from_work" :)
This is so totally a blog post, @wusixer. Add a few more annotations - and maybe compare it to JAX's vmap, which operates only on numerical data but is still good for you to see in action.