@beckernick
Last active March 19, 2021 17:01
10 Minutes to cuDF and CuPy
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 10 Minutes to cuDF and CuPy\n",
"\n",
"This notebook provides introductory examples of how you can use cuDF and CuPy together to take advantage of CuPy array functionality (such as advanced linear algebra operations)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import timeit\n",
"\n",
"import cupy as cp\n",
"import cudf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Converting a cuDF DataFrame to a CuPy Array\n",
"\n",
"There are multiple ways to convert a cuDF DataFrame to a CuPy ndarray:\n",
"\n",
"1. We can use the [dlpack](https://github.com/dmlc/dlpack) interface.\n",
"\n",
"2. We can also use `DataFrame.values`.\n",
"\n",
"3. We can also convert via the [CUDA array interface](https://numba.pydata.org/numba-doc/dev/cuda/cuda_array_interface.html) by using cuDF's `as_gpu_matrix` and CuPy's `asarray` functionality."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"44.1 µs ± 689 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n",
"209 µs ± 2.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n",
"208 µs ± 3.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
]
}
],
"source": [
"nelem = 10000\n",
"df = cudf.DataFrame({'a': range(nelem),\n",
"                     'b': range(500, nelem + 500),\n",
"                     'c': range(1000, nelem + 1000)})\n",
"\n",
"%timeit arr_cupy = cp.fromDlpack(df.to_dlpack())\n",
"%timeit arr_cupy = df.values\n",
"%timeit arr_cupy = cp.asarray(df.as_gpu_matrix())"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0, 500, 1000],\n",
" [ 1, 501, 1001],\n",
" [ 2, 502, 1002],\n",
" ...,\n",
" [ 9997, 10497, 10997],\n",
" [ 9998, 10498, 10998],\n",
" [ 9999, 10499, 10999]])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"arr_cupy = cp.fromDlpack(df.to_dlpack())\n",
"arr_cupy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Converting a cuDF Series to a CuPy Array"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are also multiple ways to convert a cuDF Series to a CuPy array:\n",
"\n",
"1. We can pass the Series to `cupy.asarray`, since a cuDF Series exposes [`__cuda_array_interface__`](https://docs-cupy.chainer.org/en/stable/reference/interoperability.html).\n",
"2. We can leverage the dlpack interface via `to_dlpack()`.\n",
"3. We can also use `Series.values`.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"22.1 µs ± 518 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n",
"58.3 µs ± 647 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n",
"80.2 µs ± 647 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
]
}
],
"source": [
"col = 'a'\n",
"\n",
"%timeit cola_cupy = cp.asarray(df[col])\n",
"%timeit cola_cupy = cp.fromDlpack(df[col].to_dlpack())\n",
"%timeit cola_cupy = df[col].values"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0, 1, 2, ..., 9997, 9998, 9999])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cola_cupy = cp.asarray(df[col])\n",
"cola_cupy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From here, we can proceed with normal CuPy workflows, such as reshaping the array, getting the diagonal, or calculating the norm."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0, 1, 2, ..., 197, 198, 199],\n",
" [ 200, 201, 202, ..., 397, 398, 399],\n",
" [ 400, 401, 402, ..., 597, 598, 599],\n",
" ...,\n",
" [9400, 9401, 9402, ..., 9597, 9598, 9599],\n",
" [9600, 9601, 9602, ..., 9797, 9798, 9799],\n",
" [9800, 9801, 9802, ..., 9997, 9998, 9999]])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reshaped_arr = cola_cupy.reshape(50, 200)\n",
"reshaped_arr"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0, 201, 402, 603, 804, 1005, 1206, 1407, 1608, 1809, 2010,\n",
" 2211, 2412, 2613, 2814, 3015, 3216, 3417, 3618, 3819, 4020, 4221,\n",
" 4422, 4623, 4824, 5025, 5226, 5427, 5628, 5829, 6030, 6231, 6432,\n",
" 6633, 6834, 7035, 7236, 7437, 7638, 7839, 8040, 8241, 8442, 8643,\n",
" 8844, 9045, 9246, 9447, 9648, 9849])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reshaped_arr.diagonal()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(577306.967739)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cp.linalg.norm(reshaped_arr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Converting a CuPy Array to a cuDF DataFrame\n",
"\n",
"We can also convert a CuPy ndarray to a cuDF DataFrame. Like before, there are multiple ways to do it:\n",
"\n",
"1. **Easiest:** We can directly use the `DataFrame` constructor.\n",
"\n",
"2. We can use CUDA array interface with the `DataFrame` constructor.\n",
"\n",
"3. We can also use the [dlpack](https://github.com/dmlc/dlpack) interface.\n",
"\n",
"For the latter two cases, we'll need to make sure that our CuPy array is Fortran contiguous in memory (if it's not already). We can either transpose the array or simply coerce it to be Fortran contiguous beforehand."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"13.1 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
"source": [
"%timeit reshaped_df = cudf.DataFrame(reshaped_arr)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>...</th>\n",
" <th>190</th>\n",
" <th>191</th>\n",
" <th>192</th>\n",
" <th>193</th>\n",
" <th>194</th>\n",
" <th>195</th>\n",
" <th>196</th>\n",
" <th>197</th>\n",
" <th>198</th>\n",
" <th>199</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" <td>...</td>\n",
" <td>190</td>\n",
" <td>191</td>\n",
" <td>192</td>\n",
" <td>193</td>\n",
" <td>194</td>\n",
" <td>195</td>\n",
" <td>196</td>\n",
" <td>197</td>\n",
" <td>198</td>\n",
" <td>199</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>200</td>\n",
" <td>201</td>\n",
" <td>202</td>\n",
" <td>203</td>\n",
" <td>204</td>\n",
" <td>205</td>\n",
" <td>206</td>\n",
" <td>207</td>\n",
" <td>208</td>\n",
" <td>209</td>\n",
" <td>...</td>\n",
" <td>390</td>\n",
" <td>391</td>\n",
" <td>392</td>\n",
" <td>393</td>\n",
" <td>394</td>\n",
" <td>395</td>\n",
" <td>396</td>\n",
" <td>397</td>\n",
" <td>398</td>\n",
" <td>399</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>400</td>\n",
" <td>401</td>\n",
" <td>402</td>\n",
" <td>403</td>\n",
" <td>404</td>\n",
" <td>405</td>\n",
" <td>406</td>\n",
" <td>407</td>\n",
" <td>408</td>\n",
" <td>409</td>\n",
" <td>...</td>\n",
" <td>590</td>\n",
" <td>591</td>\n",
" <td>592</td>\n",
" <td>593</td>\n",
" <td>594</td>\n",
" <td>595</td>\n",
" <td>596</td>\n",
" <td>597</td>\n",
" <td>598</td>\n",
" <td>599</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>600</td>\n",
" <td>601</td>\n",
" <td>602</td>\n",
" <td>603</td>\n",
" <td>604</td>\n",
" <td>605</td>\n",
" <td>606</td>\n",
" <td>607</td>\n",
" <td>608</td>\n",
" <td>609</td>\n",
" <td>...</td>\n",
" <td>790</td>\n",
" <td>791</td>\n",
" <td>792</td>\n",
" <td>793</td>\n",
" <td>794</td>\n",
" <td>795</td>\n",
" <td>796</td>\n",
" <td>797</td>\n",
" <td>798</td>\n",
" <td>799</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>800</td>\n",
" <td>801</td>\n",
" <td>802</td>\n",
" <td>803</td>\n",
" <td>804</td>\n",
" <td>805</td>\n",
" <td>806</td>\n",
" <td>807</td>\n",
" <td>808</td>\n",
" <td>809</td>\n",
" <td>...</td>\n",
" <td>990</td>\n",
" <td>991</td>\n",
" <td>992</td>\n",
" <td>993</td>\n",
" <td>994</td>\n",
" <td>995</td>\n",
" <td>996</td>\n",
" <td>997</td>\n",
" <td>998</td>\n",
" <td>999</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 200 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3 4 5 6 7 8 9 ... 190 191 192 193 \\\n",
"0 0 1 2 3 4 5 6 7 8 9 ... 190 191 192 193 \n",
"1 200 201 202 203 204 205 206 207 208 209 ... 390 391 392 393 \n",
"2 400 401 402 403 404 405 406 407 408 409 ... 590 591 592 593 \n",
"3 600 601 602 603 604 605 606 607 608 609 ... 790 791 792 793 \n",
"4 800 801 802 803 804 805 806 807 808 809 ... 990 991 992 993 \n",
"\n",
" 194 195 196 197 198 199 \n",
"0 194 195 196 197 198 199 \n",
"1 394 395 396 397 398 399 \n",
"2 594 595 596 597 598 599 \n",
"3 794 795 796 797 798 799 \n",
"4 994 995 996 997 998 999 \n",
"\n",
"[5 rows x 200 columns]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reshaped_df = cudf.DataFrame(reshaped_arr)\n",
"reshaped_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can check whether our array is Fortran contiguous by using `cupy.isfortran` or by looking at the [flags](https://docs-cupy.chainer.org/en/stable/reference/generated/cupy.ndarray.html#cupy.ndarray.flags) of the array."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cp.isfortran(reshaped_arr)"
]
},
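{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also inspect the array's `flags` attribute directly to see its contiguity:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"reshaped_arr.flags"
]
},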
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, we'll need to convert it before going to a cuDF DataFrame. In the next two cells, we create the DataFrame by leveraging dlpack and the CUDA array interface, respectively."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"4.9 ms ± 26.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
"source": [
"%%timeit\n",
"\n",
"fortran_arr = cp.asfortranarray(reshaped_arr)\n",
"reshaped_df = cudf.DataFrame(fortran_arr)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5.1 ms ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
"source": [
"%%timeit\n",
"\n",
"fortran_arr = cp.asfortranarray(reshaped_arr)\n",
"reshaped_df = cudf.from_dlpack(fortran_arr.toDlpack())"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>...</th>\n",
" <th>190</th>\n",
" <th>191</th>\n",
" <th>192</th>\n",
" <th>193</th>\n",
" <th>194</th>\n",
" <th>195</th>\n",
" <th>196</th>\n",
" <th>197</th>\n",
" <th>198</th>\n",
" <th>199</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" <td>...</td>\n",
" <td>190</td>\n",
" <td>191</td>\n",
" <td>192</td>\n",
" <td>193</td>\n",
" <td>194</td>\n",
" <td>195</td>\n",
" <td>196</td>\n",
" <td>197</td>\n",
" <td>198</td>\n",
" <td>199</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>200</td>\n",
" <td>201</td>\n",
" <td>202</td>\n",
" <td>203</td>\n",
" <td>204</td>\n",
" <td>205</td>\n",
" <td>206</td>\n",
" <td>207</td>\n",
" <td>208</td>\n",
" <td>209</td>\n",
" <td>...</td>\n",
" <td>390</td>\n",
" <td>391</td>\n",
" <td>392</td>\n",
" <td>393</td>\n",
" <td>394</td>\n",
" <td>395</td>\n",
" <td>396</td>\n",
" <td>397</td>\n",
" <td>398</td>\n",
" <td>399</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>400</td>\n",
" <td>401</td>\n",
" <td>402</td>\n",
" <td>403</td>\n",
" <td>404</td>\n",
" <td>405</td>\n",
" <td>406</td>\n",
" <td>407</td>\n",
" <td>408</td>\n",
" <td>409</td>\n",
" <td>...</td>\n",
" <td>590</td>\n",
" <td>591</td>\n",
" <td>592</td>\n",
" <td>593</td>\n",
" <td>594</td>\n",
" <td>595</td>\n",
" <td>596</td>\n",
" <td>597</td>\n",
" <td>598</td>\n",
" <td>599</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>600</td>\n",
" <td>601</td>\n",
" <td>602</td>\n",
" <td>603</td>\n",
" <td>604</td>\n",
" <td>605</td>\n",
" <td>606</td>\n",
" <td>607</td>\n",
" <td>608</td>\n",
" <td>609</td>\n",
" <td>...</td>\n",
" <td>790</td>\n",
" <td>791</td>\n",
" <td>792</td>\n",
" <td>793</td>\n",
" <td>794</td>\n",
" <td>795</td>\n",
" <td>796</td>\n",
" <td>797</td>\n",
" <td>798</td>\n",
" <td>799</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>800</td>\n",
" <td>801</td>\n",
" <td>802</td>\n",
" <td>803</td>\n",
" <td>804</td>\n",
" <td>805</td>\n",
" <td>806</td>\n",
" <td>807</td>\n",
" <td>808</td>\n",
" <td>809</td>\n",
" <td>...</td>\n",
" <td>990</td>\n",
" <td>991</td>\n",
" <td>992</td>\n",
" <td>993</td>\n",
" <td>994</td>\n",
" <td>995</td>\n",
" <td>996</td>\n",
" <td>997</td>\n",
" <td>998</td>\n",
" <td>999</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 200 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3 4 5 6 7 8 9 ... 190 191 192 193 \\\n",
"0 0 1 2 3 4 5 6 7 8 9 ... 190 191 192 193 \n",
"1 200 201 202 203 204 205 206 207 208 209 ... 390 391 392 393 \n",
"2 400 401 402 403 404 405 406 407 408 409 ... 590 591 592 593 \n",
"3 600 601 602 603 604 605 606 607 608 609 ... 790 791 792 793 \n",
"4 800 801 802 803 804 805 806 807 808 809 ... 990 991 992 993 \n",
"\n",
" 194 195 196 197 198 199 \n",
"0 194 195 196 197 198 199 \n",
"1 394 395 396 397 398 399 \n",
"2 594 595 596 597 598 599 \n",
"3 794 795 796 797 798 799 \n",
"4 994 995 996 997 998 999 \n",
"\n",
"[5 rows x 200 columns]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fortran_arr = cp.asfortranarray(reshaped_arr)\n",
"reshaped_df = cudf.DataFrame(fortran_arr)\n",
"reshaped_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Converting a CuPy Array to a cuDF Series\n",
"\n",
"To convert an array to a Series, we can directly pass the array to the `Series` constructor."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0\n",
"1 201\n",
"2 402\n",
"3 603\n",
"4 804\n",
"dtype: int64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cudf.Series(reshaped_arr.diagonal()).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interweaving cuDF and CuPy for Smooth PyData Workflows\n",
"\n",
"RAPIDS libraries and the entire GPU PyData ecosystem are developing quickly, but sometimes one library may not have the functionality you need. One example is taking the row-wise sum (or mean) of a DataFrame. cuDF's support for row-wise operations isn't mature, so you'd need to either transpose the DataFrame or write a UDF that explicitly calculates the sum across each row. Depending on your data's shape, transposing could produce hundreds of thousands of columns (which cuDF doesn't handle well), and writing a UDF can be time intensive.\n",
"\n",
"By leveraging the interoperability of the GPU PyData ecosystem, this operation becomes very easy. Let's take the row-wise sum of our previously reshaped cuDF DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>...</th>\n",
" <th>190</th>\n",
" <th>191</th>\n",
" <th>192</th>\n",
" <th>193</th>\n",
" <th>194</th>\n",
" <th>195</th>\n",
" <th>196</th>\n",
" <th>197</th>\n",
" <th>198</th>\n",
" <th>199</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" <td>...</td>\n",
" <td>190</td>\n",
" <td>191</td>\n",
" <td>192</td>\n",
" <td>193</td>\n",
" <td>194</td>\n",
" <td>195</td>\n",
" <td>196</td>\n",
" <td>197</td>\n",
" <td>198</td>\n",
" <td>199</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>200</td>\n",
" <td>201</td>\n",
" <td>202</td>\n",
" <td>203</td>\n",
" <td>204</td>\n",
" <td>205</td>\n",
" <td>206</td>\n",
" <td>207</td>\n",
" <td>208</td>\n",
" <td>209</td>\n",
" <td>...</td>\n",
" <td>390</td>\n",
" <td>391</td>\n",
" <td>392</td>\n",
" <td>393</td>\n",
" <td>394</td>\n",
" <td>395</td>\n",
" <td>396</td>\n",
" <td>397</td>\n",
" <td>398</td>\n",
" <td>399</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>400</td>\n",
" <td>401</td>\n",
" <td>402</td>\n",
" <td>403</td>\n",
" <td>404</td>\n",
" <td>405</td>\n",
" <td>406</td>\n",
" <td>407</td>\n",
" <td>408</td>\n",
" <td>409</td>\n",
" <td>...</td>\n",
" <td>590</td>\n",
" <td>591</td>\n",
" <td>592</td>\n",
" <td>593</td>\n",
" <td>594</td>\n",
" <td>595</td>\n",
" <td>596</td>\n",
" <td>597</td>\n",
" <td>598</td>\n",
" <td>599</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>600</td>\n",
" <td>601</td>\n",
" <td>602</td>\n",
" <td>603</td>\n",
" <td>604</td>\n",
" <td>605</td>\n",
" <td>606</td>\n",
" <td>607</td>\n",
" <td>608</td>\n",
" <td>609</td>\n",
" <td>...</td>\n",
" <td>790</td>\n",
" <td>791</td>\n",
" <td>792</td>\n",
" <td>793</td>\n",
" <td>794</td>\n",
" <td>795</td>\n",
" <td>796</td>\n",
" <td>797</td>\n",
" <td>798</td>\n",
" <td>799</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>800</td>\n",
" <td>801</td>\n",
" <td>802</td>\n",
" <td>803</td>\n",
" <td>804</td>\n",
" <td>805</td>\n",
" <td>806</td>\n",
" <td>807</td>\n",
" <td>808</td>\n",
" <td>809</td>\n",
" <td>...</td>\n",
" <td>990</td>\n",
" <td>991</td>\n",
" <td>992</td>\n",
" <td>993</td>\n",
" <td>994</td>\n",
" <td>995</td>\n",
" <td>996</td>\n",
" <td>997</td>\n",
" <td>998</td>\n",
" <td>999</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 200 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3 4 5 6 7 8 9 ... 190 191 192 193 \\\n",
"0 0 1 2 3 4 5 6 7 8 9 ... 190 191 192 193 \n",
"1 200 201 202 203 204 205 206 207 208 209 ... 390 391 392 393 \n",
"2 400 401 402 403 404 405 406 407 408 409 ... 590 591 592 593 \n",
"3 600 601 602 603 604 605 606 607 608 609 ... 790 791 792 793 \n",
"4 800 801 802 803 804 805 806 807 808 809 ... 990 991 992 993 \n",
"\n",
" 194 195 196 197 198 199 \n",
"0 194 195 196 197 198 199 \n",
"1 394 395 396 397 398 399 \n",
"2 594 595 596 597 598 599 \n",
"3 794 795 796 797 798 799 \n",
"4 994 995 996 997 998 999 \n",
"\n",
"[5 rows x 200 columns]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reshaped_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can just transform it into a CuPy array and use the `axis` argument of `sum`."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 19900, 59900, 99900, 139900, 179900, 219900, 259900,\n",
" 299900, 339900, 379900, 419900, 459900, 499900, 539900,\n",
" 579900, 619900, 659900, 699900, 739900, 779900, 819900,\n",
" 859900, 899900, 939900, 979900, 1019900, 1059900, 1099900,\n",
" 1139900, 1179900, 1219900, 1259900, 1299900, 1339900, 1379900,\n",
" 1419900, 1459900, 1499900, 1539900, 1579900, 1619900, 1659900,\n",
" 1699900, 1739900, 1779900, 1819900, 1859900, 1899900, 1939900,\n",
" 1979900])"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_arr = cp.fromDlpack(reshaped_df.to_dlpack())\n",
"new_arr.sum(axis=1)"
]
},
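{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of one possible next step, we could wrap the result back into a cuDF Series to continue working in cuDF:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cudf.Series(new_arr.sum(axis=1)).head()"
]
},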
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With just that single line, we're able to seamlessly move between data structures in this ecosystem, giving us enormous flexibility without sacrificing speed."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Converting a cuDF DataFrame to a CuPy Sparse Matrix\n",
"\n",
"We can also convert a DataFrame or Series to a CuPy sparse matrix. We might want to do this if downstream processes expect CuPy sparse matrices as an input.\n",
"\n",
"A compressed sparse matrix (CSR/CSC) is defined by three dense arrays: the nonzero values, their indices, and the compressed index pointers. We'll define a small helper function for cleanliness."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"def cudf_to_cupy_sparse_matrix(data, sparseformat='column'):\n",
"    \"\"\"Convert a cuDF DataFrame or Series to a CuPy sparse matrix\n",
"    (CSC for 'column', CSR for 'row').\n",
"    \"\"\"\n",
"    if sparseformat not in ('row', 'column'):\n",
"        raise ValueError(\"sparseformat must be 'row' or 'column'.\")\n",
"\n",
"    _sparse_constructor = cp.sparse.csc_matrix\n",
"    if sparseformat == 'row':\n",
"        _sparse_constructor = cp.sparse.csr_matrix\n",
"\n",
"    return _sparse_constructor(cp.fromDlpack(data.to_dlpack()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can define a sparsely populated DataFrame to illustrate this conversion to either sparse matrix format."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"df = cudf.DataFrame()\n",
"nelem = 10000\n",
"nonzero = 1000\n",
"for i in range(20):\n",
"    arr = cp.random.normal(5, 5, nelem)\n",
"    arr[cp.random.choice(arr.shape[0], nelem - nonzero, replace=False)] = 0\n",
"    df['a' + str(i)] = arr"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>a0</th>\n",
" <th>a1</th>\n",
" <th>a2</th>\n",
" <th>a3</th>\n",
" <th>a4</th>\n",
" <th>a5</th>\n",
" <th>a6</th>\n",
" <th>a7</th>\n",
" <th>a8</th>\n",
" <th>a9</th>\n",
" <th>a10</th>\n",
" <th>a11</th>\n",
" <th>a12</th>\n",
" <th>a13</th>\n",
" <th>a14</th>\n",
" <th>a15</th>\n",
" <th>a16</th>\n",
" <th>a17</th>\n",
" <th>a18</th>\n",
" <th>a19</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>16.822959</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>6.618972</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>2.25678</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>2.715802</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>4.296568</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>4.865495</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 \\\n",
"0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.00000 \n",
"1 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.00000 \n",
"2 0.0 0.0 6.618972 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 2.25678 \n",
"3 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.00000 \n",
"4 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.296568 0.00000 \n",
"\n",
" a12 a13 a14 a15 a16 a17 a18 a19 \n",
"0 0.0 16.822959 0.0 0.000000 0.0 0.0 0.0 0.000000 \n",
"1 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.000000 \n",
"2 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.000000 \n",
"3 0.0 0.000000 0.0 2.715802 0.0 0.0 0.0 0.000000 \n",
"4 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 4.865495 "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<cupyx.scipy.sparse.csc.csc_matrix at 0x7f25e49466a0>"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sparse_data = cudf_to_cupy_sparse_matrix(df)\n",
"sparse_data"
]
},
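{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration of continuing the workflow, we can operate on the sparse matrix directly, e.g. summing the values in each column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sparse_data.sum(axis=0)"
]
},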
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From here, we could continue our workflow with a CuPy sparse matrix.\n",
"\n",
"For a full list of the functionality built into these libraries, we encourage you to check out the API docs for [cuDF](https://docs.rapids.ai/api/cudf/nightly/) and [CuPy](https://docs-cupy.chainer.org/en/stable/index.html)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
@kaish114 commented May 7, 2020

How can I solve this? I need help converting the titanic.csv dataset (cuDF) to a CuPy ndarray.