@pavithraes
Created July 15, 2021 22:37
Code for video: pandas dataframe to dask dataframe
{
"cells": [
{
"cell_type": "markdown",
"id": "26e1fc35-08f1-441b-ac11-cde9b3a38fb7",
"metadata": {
"tags": []
},
"source": [
"# Convert a pandas DataFrame to a Dask DataFrame"
]
},
{
"cell_type": "markdown",
"id": "34b91c20-8ebb-4b9f-8266-1a23baa6dc0e",
"metadata": {},
"source": [
"## Reading data into a pandas DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "2be2aae9-d517-4037-9be2-d92fc4b2dc52",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c3ad57a6-b719-4ddf-8c6d-f0613f5c72d7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 51.8 s, sys: 10.5 s, total: 1min 2s\n",
"Wall time: 1min 5s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"df = pd.read_csv(\"checkouts.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "6299a750-8495-4a4a-9cea-d7cc31135747",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 17446311 entries, 0 to 17446310\n",
"Data columns (total 12 columns):\n",
" # Column Dtype \n",
"--- ------ ----- \n",
" 0 Unnamed: 0 int64 \n",
" 1 UsageClass object\n",
" 2 CheckoutType object\n",
" 3 MaterialType object\n",
" 4 CheckoutYear int64 \n",
" 5 CheckoutMonth int64 \n",
" 6 Checkouts int64 \n",
" 7 Title object\n",
" 8 Creator object\n",
" 9 Subjects object\n",
" 10 Publisher object\n",
" 11 PublicationYear object\n",
"dtypes: int64(4), object(8)\n",
"memory usage: 1.6+ GB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "d61f3377-9d82-437f-a2c3-2db3688fc152",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>UsageClass</th>\n",
" <th>CheckoutType</th>\n",
" <th>MaterialType</th>\n",
" <th>CheckoutYear</th>\n",
" <th>CheckoutMonth</th>\n",
" <th>Checkouts</th>\n",
" <th>Title</th>\n",
" <th>Creator</th>\n",
" <th>Subjects</th>\n",
" <th>Publisher</th>\n",
" <th>PublicationYear</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>Physical</td>\n",
" <td>Horizon</td>\n",
" <td>BOOK</td>\n",
" <td>2006</td>\n",
" <td>6</td>\n",
" <td>1</td>\n",
" <td>McGraw-Hill's dictionary of American slang and...</td>\n",
" <td>Spears, Richard A.</td>\n",
" <td>English language United States Slang Dictionar...</td>\n",
" <td>McGraw-Hill,</td>\n",
" <td>c2006.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>Physical</td>\n",
" <td>Horizon</td>\n",
" <td>BOOK</td>\n",
" <td>2006</td>\n",
" <td>6</td>\n",
" <td>1</td>\n",
" <td>Emma, Lady Hamilton / Flora Fraser.</td>\n",
" <td>Fraser, Flora</td>\n",
" <td>Hamilton Emma Lady 1761 1815, Nelson Horatio N...</td>\n",
" <td>Knopf : Distributed by Random House,</td>\n",
" <td>1987, c1986.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>Physical</td>\n",
" <td>Horizon</td>\n",
" <td>BOOK</td>\n",
" <td>2006</td>\n",
" <td>6</td>\n",
" <td>2</td>\n",
" <td>Red midnight</td>\n",
" <td>NaN</td>\n",
" <td>Survival Fiction, Emigration and immigration F...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>Physical</td>\n",
" <td>Horizon</td>\n",
" <td>BOOK</td>\n",
" <td>2006</td>\n",
" <td>6</td>\n",
" <td>1</td>\n",
" <td>Just the financial facts how to identify nugge...</td>\n",
" <td>NaN</td>\n",
" <td>Investments Information services</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>Physical</td>\n",
" <td>Horizon</td>\n",
" <td>SOUNDCASS</td>\n",
" <td>2006</td>\n",
" <td>6</td>\n",
" <td>3</td>\n",
" <td>single shard</td>\n",
" <td>NaN</td>\n",
" <td>Korea Fiction, Pottery Fiction</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 UsageClass CheckoutType MaterialType CheckoutYear \\\n",
"0 0 Physical Horizon BOOK 2006 \n",
"1 1 Physical Horizon BOOK 2006 \n",
"2 2 Physical Horizon BOOK 2006 \n",
"3 3 Physical Horizon BOOK 2006 \n",
"4 4 Physical Horizon SOUNDCASS 2006 \n",
"\n",
" CheckoutMonth Checkouts \\\n",
"0 6 1 \n",
"1 6 1 \n",
"2 6 2 \n",
"3 6 1 \n",
"4 6 3 \n",
"\n",
" Title Creator \\\n",
"0 McGraw-Hill's dictionary of American slang and... Spears, Richard A. \n",
"1 Emma, Lady Hamilton / Flora Fraser. Fraser, Flora \n",
"2 Red midnight NaN \n",
"3 Just the financial facts how to identify nugge... NaN \n",
"4 single shard NaN \n",
"\n",
" Subjects \\\n",
"0 English language United States Slang Dictionar... \n",
"1 Hamilton Emma Lady 1761 1815, Nelson Horatio N... \n",
"2 Survival Fiction, Emigration and immigration F... \n",
"3 Investments Information services \n",
"4 Korea Fiction, Pottery Fiction \n",
"\n",
" Publisher PublicationYear \n",
"0 McGraw-Hill, c2006. \n",
"1 Knopf : Distributed by Random House, 1987, c1986. \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "43122602-6082-45e6-8006-93ac60c203bb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1.01 s, sys: 348 ms, total: 1.35 s\n",
"Wall time: 1.43 s\n"
]
},
{
"data": {
"text/plain": [
"UsageClass\n",
"Digital 8772938\n",
"Physical 52609482\n",
"Name: Checkouts, dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"\n",
"df.groupby(\"UsageClass\").Checkouts.sum()"
]
},
{
"cell_type": "markdown",
"id": "2958f9a0-b7a7-4ac3-bb75-0aefd8e912e5",
"metadata": {},
"source": [
"## Convert to Dask DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "0af2f2b7-3eb5-4672-a8d2-d0b4bfd5af18",
"metadata": {},
"outputs": [],
"source": [
"import dask.dataframe as dd"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "5998f4e1-ca13-4586-a152-418eac28147d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 31.1 s, sys: 19.5 s, total: 50.5 s\n",
"Wall time: 1min 4s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"# Split the in-memory pandas DataFrame into 10 partitions\n",
"ddf = dd.from_pandas(df, npartitions=10)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "37ad8b83-1ecf-490a-9cd0-4a9b35a7cdc6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 5.29 ms, sys: 9.41 ms, total: 14.7 ms\n",
"Wall time: 18.9 ms\n"
]
}
],
"source": [
"%%time\n",
"\n",
"# Lazy: this only builds the task graph, nothing is computed yet\n",
"result = ddf.groupby(\"UsageClass\").Checkouts.sum()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "6d7fe469-b342-4e6c-b03b-505c740fcd3b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Dask Series Structure:\n",
"npartitions=1\n",
" int64\n",
" ...\n",
"Name: Checkouts, dtype: int64\n",
"Dask Name: series-groupby-sum-agg, 23 tasks"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "cca8e84a-c2cd-4a2d-aa49-97ce32a666e1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 928 ms, sys: 93.2 ms, total: 1.02 s\n",
"Wall time: 521 ms\n"
]
},
{
"data": {
"text/plain": [
"UsageClass\n",
"Digital 8772938\n",
"Physical 52609482\n",
"Name: Checkouts, dtype: int64"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"\n",
"# compute() runs the task graph and returns a pandas Series\n",
"result.compute()"
]
},
{
"cell_type": "markdown",
"id": "44b2f73e-e821-47bd-a62f-62c520938911",
"metadata": {},
"source": [
"## Using the distributed scheduler"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "cb7dc62c-7388-48dd-8ce6-405b64435167",
"metadata": {},
"outputs": [],
"source": [
"from dask.distributed import Client"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "0ef375e0-3d6d-423f-9d7b-39180cd7f6c5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <div>\n",
" <div style=\"\n",
" width: 24px;\n",
" height: 24px;\n",
" background-color: #e1e1e1;\n",
" border: 3px solid #9D9D9D;\n",
" border-radius: 5px;\n",
" position: absolute;\"> </div>\n",
" <div style=\"margin-left: 48px;\">\n",
" <h3 style=\"margin-bottom: 0px;\">Client</h3>\n",
" <p style=\"color: #9D9D9D; margin-bottom: 0px;\">Client-5f7bdee4-e5bb-11eb-9621-acde48001122</p>\n",
" <table style=\"width: 100%; text-align: left;\">\n",
" \n",
" <tr>\n",
" <td style=\"text-align: left;\"><strong>Connection method:</strong> Cluster object</td>\n",
" <td style=\"text-align: left;\"><strong>Cluster type:</strong> LocalCluster</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Dashboard: </strong>\n",
" <a href=\"http://127.0.0.1:8787/status\">http://127.0.0.1:8787/status</a>\n",
" </td>\n",
" <td style=\"text-align: left;\"></td>\n",
" </tr>\n",
" \n",
" </table>\n",
" \n",
" <details>\n",
" <summary style=\"margin-bottom: 20px;\"><h3 style=\"display: inline;\">Cluster Info</h3></summary>\n",
" \n",
" <div class=\"jp-RenderedHTMLCommon jp-RenderedHTML jp-mod-trusted jp-OutputArea-output\">\n",
" <div style=\"\n",
" width: 24px;\n",
" height: 24px;\n",
" background-color: #e1e1e1;\n",
" border: 3px solid #9D9D9D;\n",
" border-radius: 5px;\n",
" position: absolute;\"> </div>\n",
" <div style=\"margin-left: 48px;\">\n",
" <h3 style=\"margin-bottom: 0px; margin-top: 0px;\">LocalCluster</h3>\n",
" <p style=\"color: #9D9D9D; margin-bottom: 0px;\">95642f34</p>\n",
" <table style=\"width: 100%; text-align: left;\">\n",
" \n",
" <tr>\n",
" <td style=\"text-align: left;\"><strong>Status:</strong> running</td>\n",
" <td style=\"text-align: left;\"><strong>Using processes:</strong> True</td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Dashboard:</strong> <a href=\"http://127.0.0.1:8787/status\">http://127.0.0.1:8787/status</a>\n",
" </td>\n",
" <td style=\"text-align: left;\"><strong>Workers:</strong> 4</td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Total threads:</strong>\n",
" 12\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Total memory:</strong>\n",
" 16.00 GiB\n",
" </td>\n",
" </tr>\n",
" \n",
" </table>\n",
" <details>\n",
" <summary style=\"margin-bottom: 20px;\"><h3 style=\"display: inline;\">Scheduler Info</h3></summary>\n",
" \n",
" <div style=\"\">\n",
" \n",
" <div>\n",
" <div style=\"\n",
" width: 24px;\n",
" height: 24px;\n",
" background-color: #FFF7E5;\n",
" border: 3px solid #FF6132;\n",
" border-radius: 5px;\n",
" position: absolute;\"> </div>\n",
" <div style=\"margin-left: 48px;\">\n",
" <h3 style=\"margin-bottom: 0px;\">Scheduler</h3>\n",
" <p style=\"color: #9D9D9D; margin-bottom: 0px;\">Scheduler-e9b83dbc-89df-42ae-830a-3a32d86daae7</p>\n",
" <table style=\"width: 100%; text-align: left;\">\n",
" <tr>\n",
" <td style=\"text-align: left;\"><strong>Comm:</strong> tcp://127.0.0.1:50304</td>\n",
" <td style=\"text-align: left;\"><strong>Workers:</strong> 4</td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Dashboard:</strong> <a href=\"http://127.0.0.1:8787/status\">http://127.0.0.1:8787/status</a>\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Total threads:</strong>\n",
" 12\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Started:</strong>\n",
" Just now\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Total memory:</strong>\n",
" 16.00 GiB\n",
" </td>\n",
" </tr>\n",
" </table>\n",
" </div>\n",
" </div>\n",
" \n",
" <details style=\"margin-left: 48px;\">\n",
" <summary style=\"margin-bottom: 20px;\"><h3 style=\"display: inline;\">Workers</h3></summary>\n",
" \n",
" <div style=\"margin-bottom: 20px;\">\n",
" <div style=\"width: 24px;\n",
" height: 24px;\n",
" background-color: #DBF5FF;\n",
" border: 3px solid #4CC9FF;\n",
" border-radius: 5px;\n",
" position: absolute;\"> </div>\n",
" <div style=\"margin-left: 48px;\">\n",
" <details>\n",
" <summary>\n",
" <h4 style=\"margin-bottom: 0px; display: inline;\">Worker: 0</h4>\n",
" </summary>\n",
" <table style=\"width: 100%; text-align: left;\">\n",
" <tr>\n",
" <td style=\"text-align: left;\"><strong>Comm: </strong> tcp://127.0.0.1:50311</td>\n",
" <td style=\"text-align: left;\"><strong>Total threads: </strong> 3</td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Dashboard: </strong>\n",
" <a href=\"http://127.0.0.1:50314/status\">http://127.0.0.1:50314/status</a>\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Memory: </strong>\n",
" 4.00 GiB\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\"><strong>Nanny: </strong> tcp://127.0.0.1:50306</td>\n",
" <td style=\"text-align: left;\"></td>\n",
" </tr>\n",
" <tr>\n",
" <td colspan=\"2\" style=\"text-align: left;\">\n",
" <strong>Local directory: </strong>\n",
" /Users/pavithra-coiled/Developer/pandas-dask/dask-worker-space/worker-xrh6h3wm\n",
" </td>\n",
" </tr>\n",
" \n",
" \n",
" </table>\n",
" </details>\n",
" </div>\n",
" </div>\n",
" \n",
" <div style=\"margin-bottom: 20px;\">\n",
" <div style=\"width: 24px;\n",
" height: 24px;\n",
" background-color: #DBF5FF;\n",
" border: 3px solid #4CC9FF;\n",
" border-radius: 5px;\n",
" position: absolute;\"> </div>\n",
" <div style=\"margin-left: 48px;\">\n",
" <details>\n",
" <summary>\n",
" <h4 style=\"margin-bottom: 0px; display: inline;\">Worker: 1</h4>\n",
" </summary>\n",
" <table style=\"width: 100%; text-align: left;\">\n",
" <tr>\n",
" <td style=\"text-align: left;\"><strong>Comm: </strong> tcp://127.0.0.1:50312</td>\n",
" <td style=\"text-align: left;\"><strong>Total threads: </strong> 3</td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Dashboard: </strong>\n",
" <a href=\"http://127.0.0.1:50313/status\">http://127.0.0.1:50313/status</a>\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Memory: </strong>\n",
" 4.00 GiB\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\"><strong>Nanny: </strong> tcp://127.0.0.1:50307</td>\n",
" <td style=\"text-align: left;\"></td>\n",
" </tr>\n",
" <tr>\n",
" <td colspan=\"2\" style=\"text-align: left;\">\n",
" <strong>Local directory: </strong>\n",
" /Users/pavithra-coiled/Developer/pandas-dask/dask-worker-space/worker-f7o8b2f2\n",
" </td>\n",
" </tr>\n",
" \n",
" \n",
" </table>\n",
" </details>\n",
" </div>\n",
" </div>\n",
" \n",
" <div style=\"margin-bottom: 20px;\">\n",
" <div style=\"width: 24px;\n",
" height: 24px;\n",
" background-color: #DBF5FF;\n",
" border: 3px solid #4CC9FF;\n",
" border-radius: 5px;\n",
" position: absolute;\"> </div>\n",
" <div style=\"margin-left: 48px;\">\n",
" <details>\n",
" <summary>\n",
" <h4 style=\"margin-bottom: 0px; display: inline;\">Worker: 2</h4>\n",
" </summary>\n",
" <table style=\"width: 100%; text-align: left;\">\n",
" <tr>\n",
" <td style=\"text-align: left;\"><strong>Comm: </strong> tcp://127.0.0.1:50326</td>\n",
" <td style=\"text-align: left;\"><strong>Total threads: </strong> 3</td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Dashboard: </strong>\n",
" <a href=\"http://127.0.0.1:50327/status\">http://127.0.0.1:50327/status</a>\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Memory: </strong>\n",
" 4.00 GiB\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\"><strong>Nanny: </strong> tcp://127.0.0.1:50310</td>\n",
" <td style=\"text-align: left;\"></td>\n",
" </tr>\n",
" <tr>\n",
" <td colspan=\"2\" style=\"text-align: left;\">\n",
" <strong>Local directory: </strong>\n",
" /Users/pavithra-coiled/Developer/pandas-dask/dask-worker-space/worker-_uan04op\n",
" </td>\n",
" </tr>\n",
" \n",
" \n",
" </table>\n",
" </details>\n",
" </div>\n",
" </div>\n",
" \n",
" <div style=\"margin-bottom: 20px;\">\n",
" <div style=\"width: 24px;\n",
" height: 24px;\n",
" background-color: #DBF5FF;\n",
" border: 3px solid #4CC9FF;\n",
" border-radius: 5px;\n",
" position: absolute;\"> </div>\n",
" <div style=\"margin-left: 48px;\">\n",
" <details>\n",
" <summary>\n",
" <h4 style=\"margin-bottom: 0px; display: inline;\">Worker: 3</h4>\n",
" </summary>\n",
" <table style=\"width: 100%; text-align: left;\">\n",
" <tr>\n",
" <td style=\"text-align: left;\"><strong>Comm: </strong> tcp://127.0.0.1:50317</td>\n",
" <td style=\"text-align: left;\"><strong>Total threads: </strong> 3</td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Dashboard: </strong>\n",
" <a href=\"http://127.0.0.1:50318/status\">http://127.0.0.1:50318/status</a>\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Memory: </strong>\n",
" 4.00 GiB\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\"><strong>Nanny: </strong> tcp://127.0.0.1:50308</td>\n",
" <td style=\"text-align: left;\"></td>\n",
" </tr>\n",
" <tr>\n",
" <td colspan=\"2\" style=\"text-align: left;\">\n",
" <strong>Local directory: </strong>\n",
" /Users/pavithra-coiled/Developer/pandas-dask/dask-worker-space/worker-ppxzbpyt\n",
" </td>\n",
" </tr>\n",
" \n",
" \n",
" </table>\n",
" </details>\n",
" </div>\n",
" </div>\n",
" \n",
" </details>\n",
" </div>\n",
" \n",
" </details>\n",
" </div>\n",
" </div>\n",
" \n",
" </details>\n",
" \n",
" </div>\n",
" </div>\n",
" "
],
"text/plain": [
"<Client: 'tcp://127.0.0.1:50304' processes=4 threads=12, memory=16.00 GiB>"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Start a local cluster with 4 worker processes and connect a client to it\n",
"client = Client(n_workers=4)\n",
"client"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "255f3324-0c93-4b52-a4b7-6c249692cf82",
"metadata": {},
"outputs": [],
"source": [
"import dask.dataframe as dd"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "33defad2-b207-4280-a5f3-a16cbf117d1d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 17.1 ms, sys: 12.1 ms, total: 29.2 ms\n",
"Wall time: 30.5 ms\n"
]
}
],
"source": [
"%%time\n",
"\n",
"# Lazy: Dask only samples the file to infer dtypes; the full read is deferred\n",
"ddf = dd.read_csv(\"checkouts.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "a62b9035-8ce1-40f5-ab54-249236bddbf4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div><strong>Dask DataFrame Structure:</strong></div>\n",
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>UsageClass</th>\n",
" <th>CheckoutType</th>\n",
" <th>MaterialType</th>\n",
" <th>CheckoutYear</th>\n",
" <th>CheckoutMonth</th>\n",
" <th>Checkouts</th>\n",
" <th>Title</th>\n",
" <th>Creator</th>\n",
" <th>Subjects</th>\n",
" <th>Publisher</th>\n",
" <th>PublicationYear</th>\n",
" </tr>\n",
" <tr>\n",
" <th>npartitions=62</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th></th>\n",
" <td>int64</td>\n",
" <td>object</td>\n",
" <td>object</td>\n",
" <td>object</td>\n",
" <td>int64</td>\n",
" <td>int64</td>\n",
" <td>int64</td>\n",
" <td>object</td>\n",
" <td>object</td>\n",
" <td>object</td>\n",
" <td>object</td>\n",
" <td>object</td>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
"<div>Dask Name: read-csv, 62 tasks</div>"
],
"text/plain": [
"Dask DataFrame Structure:\n",
" Unnamed: 0 UsageClass CheckoutType MaterialType CheckoutYear CheckoutMonth Checkouts Title Creator Subjects Publisher PublicationYear\n",
"npartitions=62 \n",
" int64 object object object int64 int64 int64 object object object object object\n",
" ... ... ... ... ... ... ... ... ... ... ... ...\n",
"... ... ... ... ... ... ... ... ... ... ... ... ...\n",
" ... ... ... ... ... ... ... ... ... ... ... ...\n",
" ... ... ... ... ... ... ... ... ... ... ... ...\n",
"Dask Name: read-csv, 62 tasks"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "f3d79981-97cb-45ed-a678-880d8f4d5573",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>UsageClass</th>\n",
" <th>CheckoutType</th>\n",
" <th>MaterialType</th>\n",
" <th>CheckoutYear</th>\n",
" <th>CheckoutMonth</th>\n",
" <th>Checkouts</th>\n",
" <th>Title</th>\n",
" <th>Creator</th>\n",
" <th>Subjects</th>\n",
" <th>Publisher</th>\n",
" <th>PublicationYear</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>Physical</td>\n",
" <td>Horizon</td>\n",
" <td>BOOK</td>\n",
" <td>2006</td>\n",
" <td>6</td>\n",
" <td>1</td>\n",
" <td>McGraw-Hill's dictionary of American slang and...</td>\n",
" <td>Spears, Richard A.</td>\n",
" <td>English language United States Slang Dictionar...</td>\n",
" <td>McGraw-Hill,</td>\n",
" <td>c2006.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>Physical</td>\n",
" <td>Horizon</td>\n",
" <td>BOOK</td>\n",
" <td>2006</td>\n",
" <td>6</td>\n",
" <td>1</td>\n",
" <td>Emma, Lady Hamilton / Flora Fraser.</td>\n",
" <td>Fraser, Flora</td>\n",
" <td>Hamilton Emma Lady 1761 1815, Nelson Horatio N...</td>\n",
" <td>Knopf : Distributed by Random House,</td>\n",
" <td>1987, c1986.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>Physical</td>\n",
" <td>Horizon</td>\n",
" <td>BOOK</td>\n",
" <td>2006</td>\n",
" <td>6</td>\n",
" <td>2</td>\n",
" <td>Red midnight</td>\n",
" <td>NaN</td>\n",
" <td>Survival Fiction, Emigration and immigration F...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>Physical</td>\n",
" <td>Horizon</td>\n",
" <td>BOOK</td>\n",
" <td>2006</td>\n",
" <td>6</td>\n",
" <td>1</td>\n",
" <td>Just the financial facts how to identify nugge...</td>\n",
" <td>NaN</td>\n",
" <td>Investments Information services</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>Physical</td>\n",
" <td>Horizon</td>\n",
" <td>SOUNDCASS</td>\n",
" <td>2006</td>\n",
" <td>6</td>\n",
" <td>3</td>\n",
" <td>single shard</td>\n",
" <td>NaN</td>\n",
" <td>Korea Fiction, Pottery Fiction</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 UsageClass CheckoutType MaterialType CheckoutYear \\\n",
"0 0 Physical Horizon BOOK 2006 \n",
"1 1 Physical Horizon BOOK 2006 \n",
"2 2 Physical Horizon BOOK 2006 \n",
"3 3 Physical Horizon BOOK 2006 \n",
"4 4 Physical Horizon SOUNDCASS 2006 \n",
"\n",
" CheckoutMonth Checkouts \\\n",
"0 6 1 \n",
"1 6 1 \n",
"2 6 2 \n",
"3 6 1 \n",
"4 6 3 \n",
"\n",
" Title Creator \\\n",
"0 McGraw-Hill's dictionary of American slang and... Spears, Richard A. \n",
"1 Emma, Lady Hamilton / Flora Fraser. Fraser, Flora \n",
"2 Red midnight NaN \n",
"3 Just the financial facts how to identify nugge... NaN \n",
"4 single shard NaN \n",
"\n",
" Subjects \\\n",
"0 English language United States Slang Dictionar... \n",
"1 Hamilton Emma Lady 1761 1815, Nelson Horatio N... \n",
"2 Survival Fiction, Emigration and immigration F... \n",
"3 Investments Information services \n",
"4 Korea Fiction, Pottery Fiction \n",
"\n",
" Publisher PublicationYear \n",
"0 McGraw-Hill, c2006. \n",
"1 Knopf : Distributed by Random House, 1987, c1986. \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN "
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.head()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "74a76c5e-e3f2-4b42-a301-e3fd82834591",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 4.43 ms, sys: 706 µs, total: 5.13 ms\n",
"Wall time: 5.78 ms\n"
]
}
],
"source": [
"%%time\n",
"\n",
"result = ddf.groupby(\"UsageClass\").Checkouts.sum()"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "9f13b454-e5f3-4fdf-badf-64daead21989",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1.85 s, sys: 324 ms, total: 2.18 s\n",
"Wall time: 23.2 s\n"
]
},
{
"data": {
"text/plain": [
"UsageClass\n",
"Digital 8772938\n",
"Physical 52609482\n",
"Name: Checkouts, dtype: int64"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"\n",
"result.compute()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b859acbc-f466-485e-aae6-0321ada60ba4",
"metadata": {},
"outputs": [],
"source": [
"client.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "351ac182-7567-4d8b-ade3-6841c40e7ff8",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}