Last active
June 5, 2017 06:16
-
-
Save jgrotts/9f61a53f3509b6f94c1e8d97c01f8672 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Recently, I was working on image data and was running a for-loop to resize the images. This was embarrassingly parallel, but I wasn't familiar with running for-loops in parallel in python. This worksheet provides a template for for-loop parallelization. More information on parallel computing can be found [here](https://en.wikipedia.org/wiki/Parallel_computing).\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np # linear algebra\n", | |
"import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", | |
"import time\n", | |
"import matplotlib.pyplot as plt\n", | |
"import numpy as np # linear algebra\n", | |
"import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", | |
"import cv2 \n", | |
"from tqdm import tqdm\n", | |
"from joblib import Parallel, delayed\n", | |
"from joblib import load, dump\n", | |
"from time import time" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The data is from the competition titled [Planet: Understanding the Amazon from Space](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space). The images used in this workbook are from the training dataset. I'm using the [cv2](http://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_gui/py_image_display/py_image_display.html) package to import and resize images. The [tqdm](https://pypi.python.org/pypi/tqdm) is a neat package that runs a progress bar as your loop runs. Need to use threading with the joblib package because the default *multiprocessing* option would not work on my windows machine.\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████████████████████████████| 40479/40479 [00:14<00:00, 2861.35it/s]\n", | |
"100%|██████████████████████████████████| 40479/40479 [00:11<00:00, 3613.34it/s]\n", | |
"100%|███████████████████████████████████| 40479/40479 [00:46<00:00, 862.62it/s]\n" | |
] | |
} | |
], | |
"source": [ | |
"\n", | |
"def image_resize(file_hold): \n", | |
" img = cv2.imread('...\\\\train-jpg\\\\{}.jpg'.format(file_hold))\n", | |
" return cv2.resize(img, (32, 32))\n", | |
" \n", | |
"if __name__ == '__main__':\n", | |
" \n", | |
" df_train = pd.read_csv('...\\\\train_v2.csv')\n", | |
" \n", | |
" \n", | |
" t0_4 = time()\n", | |
" results_4 = Parallel(n_jobs=4, backend=\"threading\")(delayed(image_resize)(f) for f in tqdm(df_train.values[:,0], miniters=1000)) \n", | |
" t1_4 = time()\n", | |
" \n", | |
" t0_8 = time()\n", | |
" results_8 = Parallel(n_jobs=8, backend=\"threading\")(delayed(image_resize)(f) for f in tqdm(df_train.values[:,0], miniters=1000)) \n", | |
" t1_8 = time()\n", | |
" \n", | |
" t0_0 = time()\n", | |
" results_0 = []\n", | |
" for f in tqdm(df_train.values[:,0], miniters=1000):\n", | |
" results_0.append(image_resize(f))\n", | |
" t1_0 = time()\n", | |
" " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"The run time for the for-loop without running in parallel was 46.963686\n", | |
"The run time for the for-loop with parallel on 4 cores was 14.210813\n", | |
"The run time for the for-loop with parallel on 8 cores was 11.293646\n" | |
] | |
} | |
], | |
"source": [ | |
"print(\"The run time for the for-loop without running in parallel was %f\" %(t1_0 - t0_0))\n", | |
"print(\"The run time for the for-loop with parallel on 4 cores was %f\" %(t1_4 - t0_4))\n", | |
"print(\"The run time for the for-loop with parallel on 8 cores was %f\" %(t1_8 - t0_8))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The 8-core for-loop doesn't decrease the time by half of the 4-core for-loop because there is a high cost of efficients for data i/o on the threads. This is a well-known phenomenon and means that more cores is not always better. " | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.5.3" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment