devin-petersohn/a_pandas_on_ray_blogpost_01.ipynb

## a_pandas_on_ray_blogpost_01.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pandas on Ray - Early Lessons from Parallelizing Pandas\n",
    "\n",
    "##### Devin Petersohn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In our last [blog post](https://rise.cs.berkeley.edu/blog/pandas-on-ray/), we introduced [Pandas on Ray](https://github.com/modin-project/modin) with some preliminary progress for making Pandas workflows faster by requiring only a single line of code change. Since then, we have received a lot of feedback from the community and in response we worked to significantly improve the functionality and performance. In this blog post, we will go over a few of the lessons we learned along the way and talk about performance and how we plan to continue improving the library moving forward."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Useful links\n",
    "#### [[What is Ray?](https://github.com/ray-project/ray)] [[What is Pandas on Ray?](http://modin.readthedocs.io/en/latest/pandas_on_ray.html)] [[What is Modin?](https://github.com/modin-project/modin)]\n",
    "### [[Download this notebook!](https://gist.github.com/devin-petersohn/d860eaf3a9524a89a54b2d93f1a2aac0#file-a_pandas_on_ray_blogpost_01-ipynb)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lesson 1: Choosing What to Implement First\n",
    "It is easy in a young project to start implementing the components of the API that offer easy performance wins or are low effort. In order to avoid bias in the order of implementation, we decided to take a data-driven approach to API completion. We went to [Kaggle](https://www.kaggle.com/) and grabbed the 1800 most upvoted Python notebooks and scripts and scraped them for the Pandas DataFrame methods they were invoking ([code](https://github.com/adgirish/kaggleScape)). \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Plot](https://docs.google.com/spreadsheets/d/e/2PACX-1vSJAqz2lmMe2yxUEV1BDYYJcb7F_javeq1mwW_uoiqOi8WuXQBnDIBAOkeF_WJ9iOtxuJxgvr_8PzFv/pubchart?oid=108581991&format=image)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We opted to complete the 90th percentile of operations first, such that the most popular methods would be completed. Currently, Pandas on Ray supports about 180 of the Pandas DataFrame methods. The list of currently supported APIs, details, and limitations of these implementations can be found in the [documentation](http://modin.readthedocs.io/en/latest/pandas_supported.html). We felt it was important that the implementation of the project be community-driven, rather than driven by our own bias."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lesson 2: The Importance of Scaling the Pandas API\n",
    "When we started this project, we focused on developing a full drop-in replacement for Pandas that would scale with hardware availability. Pandas is still used to teach introductory Data Science [online](https://www.kaggle.com/learn) and in many [college courses](http://www.ds100.org/). Many experienced Data Scientists have multiple years of experience working with Pandas, and what we really want to ensure with this project is that experience and knowledge is not wasted. Learning a new API is time consuming, as is learning a new framework. Such costs should not be overlooked when building an API or choosing the best tool for your workflow."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lesson 3: Lazy Execution in Interactive Environments\n",
    "Data Science is about interacting with datasets, visualizing them, and gathering some insight from the data. It is inherently interactive - meaning there’s a lot of thinking, typing, and digesting that happens between notebook blocks.\n",
    "\n",
    "Contrast this with the way that systems are being built: lazy execution. The systems community is fixated on minimizing the number of CPU cycles it takes to execute a given query or command. This type of execution works extremely well for batch data processing, however the real disconnect happens on interactive workflows. In an eager system, as a Data Scientist is typing a statement the previous one is executing. In contrast, with lazy evaluation nothing is executed until you need a result. In an interactive workflow, this means that the CPU is not working toward a result between statements or queries, leading to poor CPU utilization over the life of the workflow.\n",
    "\n",
    "Even in an interactive workflow, a lazy system will optimize for minimizing the number of CPU cycles it takes to complete the entire notebook. For example:\n",
    "* lazy systems can have inefficient resource utilization over the course of an interactive workflow\n",
    "* lazy systems are hard to debug as the error can happen in any of the statements that are lazily evaluated.\n",
    "* lazy systems are harder to understand, particularly because most developers are used to eager-execution systems."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The Future: The Best of Both Worlds\n",
    "The future of Pandas on Ray is a system that can switch between both eager and lazy execution. We want to achieve the scale and CPU benefits of lazy systems for large workflows and datasets while avoiding inefficient resource (CPU) utilization and poor debuggability they provide for small and medium interactive workflows."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Apache Arrow\n",
    "Some of the performance improvements we can boast (shown below) are due to the work by Wes McKinney (the original author of Pandas) and the team at [Apache Arrow](https://arrow.apache.org/). Apache Arrow is a cross-language development platform for in-memory data. Pandas on Ray relies on the Plasma Store and the zero-copy serialization for Pandas DataFrames that Arrow provides. In most distributed systems, (de)serialization costs can dominate single operation tasks. Having zero copy, fast serialization is crucial for overall system performance. Check out their [GitHub](https://github.com/apache/arrow) and [docs](https://arrow.apache.org/docs/).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using Pandas on Ray\n",
    "<a id='usingPandasOnRay'></a>\n",
    "Pandas on Ray has moved to the Modin project! See the **What is Modin?** section of this blog to learn more.\n",
    "\n",
    "To install simply use `pip`:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`pip install modin`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is still only a single line of code change to accelerate Pandas workflows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Process STDOUT and STDERR is being redirected to /tmp/raylogs/.\n",
      "Waiting for redis server at 127.0.0.1:33485 to respond...\n",
      "Waiting for redis server at 127.0.0.1:43424 to respond...\n",
      "Starting local scheduler with the following resources: {'CPU': 8, 'GPU': 0}.\n",
      "\n",
      "======================================================================\n",
      "View the web UI at http://localhost:8889/notebooks/ray_ui67261.ipynb?token=2077a90717f57f0ab5b4f066c5d419f6ed4ef9153fd3b818\n",
      "======================================================================\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# import pandas as pd\n",
    "import modin.pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, you are ready to use Pandas on Ray as you would use Pandas."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This file is 372MB decompressed and is running on a Laptop with 4 physical cores and 16GB of memory. [Data source](https://www.kaggle.com/cdc/mortality)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_2005 = pd.read_csv(\"2005_data.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you're running this notebook for yourself, you may have noticed that it completes almost instantly. In the background, the file is still being read, but we return the prompt to you as soon as we have deployed all of the tasks to read different parts of the file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "modin.pandas.dataframe.DataFrame"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(df_2005)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['resident_status', 'education_1989_revision', 'education_2003_revision',\n",
       "       'education_reporting_flag', 'month_of_death', 'sex', 'detail_age_type',\n",
       "       'detail_age', 'age_substitution_flag', 'age_recode_52', 'age_recode_27',\n",
       "       'age_recode_12', 'infant_age_recode_22',\n",
       "       'place_of_death_and_decedents_status', 'marital_status',\n",
       "       'day_of_week_of_death', 'current_data_year', 'injury_at_work',\n",
       "       'manner_of_death', 'method_of_disposition', 'autopsy', 'activity_code',\n",
       "       'place_of_injury_for_causes_w00_y34_except_y06_and_y07_',\n",
       "       'icd_code_10th_revision', '358_cause_recode', '113_cause_recode',\n",
       "       '130_infant_cause_recode', '39_cause_recode',\n",
       "       'number_of_entity_axis_conditions', 'entity_condition_1',\n",
       "       'entity_condition_2', 'entity_condition_3', 'entity_condition_4',\n",
       "       'entity_condition_5', 'entity_condition_6', 'entity_condition_7',\n",
       "       'entity_condition_8', 'entity_condition_9', 'entity_condition_10',\n",
       "       'entity_condition_11', 'entity_condition_12', 'entity_condition_13',\n",
       "       'entity_condition_14', 'entity_condition_15', 'entity_condition_16',\n",
       "       'entity_condition_17', 'entity_condition_18', 'entity_condition_19',\n",
       "       'entity_condition_20', 'number_of_record_axis_conditions',\n",
       "       'record_condition_1', 'record_condition_2', 'record_condition_3',\n",
       "       'record_condition_4', 'record_condition_5', 'record_condition_6',\n",
       "       'record_condition_7', 'record_condition_8', 'record_condition_9',\n",
       "       'record_condition_10', 'record_condition_11', 'record_condition_12',\n",
       "       'record_condition_13', 'record_condition_14', 'record_condition_15',\n",
       "       'record_condition_16', 'record_condition_17', 'record_condition_18',\n",
       "       'record_condition_19', 'record_condition_20', 'race',\n",
       "       'bridged_race_flag', 'race_imputation_flag', 'race_recode_3',\n",
       "       'race_recode_5', 'hispanic_origin', 'hispanic_originrace_recode'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_2005.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "count_by_marital_status = df_2005.groupby(by=\"marital_status\").count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>resident_status</th>\n",
       "      <th>education_1989_revision</th>\n",
       "      <th>education_2003_revision</th>\n",
       "      <th>education_reporting_flag</th>\n",
       "      <th>month_of_death</th>\n",
       "      <th>sex</th>\n",
       "      <th>detail_age_type</th>\n",
       "      <th>detail_age</th>\n",
       "      <th>age_substitution_flag</th>\n",
       "      <th>age_recode_52</th>\n",
       "      <th>...</th>\n",
       "      <th>record_condition_18</th>\n",
       "      <th>record_condition_19</th>\n",
       "      <th>record_condition_20</th>\n",
       "      <th>race</th>\n",
       "      <th>bridged_race_flag</th>\n",
       "      <th>race_imputation_flag</th>\n",
       "      <th>race_recode_3</th>\n",
       "      <th>race_recode_5</th>\n",
       "      <th>hispanic_origin</th>\n",
       "      <th>hispanic_originrace_recode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>D</th>\n",
       "      <td>300582</td>\n",
       "      <td>182218</td>\n",
       "      <td>118364</td>\n",
       "      <td>300582</td>\n",
       "      <td>300582</td>\n",
       "      <td>300582</td>\n",
       "      <td>300582</td>\n",
       "      <td>300582</td>\n",
       "      <td>10</td>\n",
       "      <td>300582</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>300582</td>\n",
       "      <td>676</td>\n",
       "      <td>937</td>\n",
       "      <td>300582</td>\n",
       "      <td>300582</td>\n",
       "      <td>300582</td>\n",
       "      <td>300582</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>M</th>\n",
       "      <td>931986</td>\n",
       "      <td>567362</td>\n",
       "      <td>364624</td>\n",
       "      <td>931986</td>\n",
       "      <td>931986</td>\n",
       "      <td>931986</td>\n",
       "      <td>931986</td>\n",
       "      <td>931986</td>\n",
       "      <td>24</td>\n",
       "      <td>931986</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>931986</td>\n",
       "      <td>1517</td>\n",
       "      <td>2876</td>\n",
       "      <td>931986</td>\n",
       "      <td>931986</td>\n",
       "      <td>931986</td>\n",
       "      <td>931986</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>S</th>\n",
       "      <td>298436</td>\n",
       "      <td>178236</td>\n",
       "      <td>120200</td>\n",
       "      <td>298436</td>\n",
       "      <td>298436</td>\n",
       "      <td>298436</td>\n",
       "      <td>298436</td>\n",
       "      <td>298436</td>\n",
       "      <td>12</td>\n",
       "      <td>298436</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>298436</td>\n",
       "      <td>1243</td>\n",
       "      <td>2214</td>\n",
       "      <td>298436</td>\n",
       "      <td>298436</td>\n",
       "      <td>298436</td>\n",
       "      <td>298436</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>U</th>\n",
       "      <td>12142</td>\n",
       "      <td>4583</td>\n",
       "      <td>7559</td>\n",
       "      <td>12142</td>\n",
       "      <td>12142</td>\n",
       "      <td>12142</td>\n",
       "      <td>12142</td>\n",
       "      <td>12142</td>\n",
       "      <td>6</td>\n",
       "      <td>12142</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>12142</td>\n",
       "      <td>12</td>\n",
       "      <td>691</td>\n",
       "      <td>12142</td>\n",
       "      <td>12142</td>\n",
       "      <td>12142</td>\n",
       "      <td>12142</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>W</th>\n",
       "      <td>909360</td>\n",
       "      <td>554771</td>\n",
       "      <td>354589</td>\n",
       "      <td>909360</td>\n",
       "      <td>909360</td>\n",
       "      <td>909360</td>\n",
       "      <td>909360</td>\n",
       "      <td>909360</td>\n",
       "      <td>25</td>\n",
       "      <td>909360</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>909360</td>\n",
       "      <td>969</td>\n",
       "      <td>1742</td>\n",
       "      <td>909360</td>\n",
       "      <td>909360</td>\n",
       "      <td>909360</td>\n",
       "      <td>909360</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows x 77 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   resident_status  education_1989_revision  education_2003_revision  \\\n",
       "D           300582                   182218                   118364   \n",
       "M           931986                   567362                   364624   \n",
       "S           298436                   178236                   120200   \n",
       "U            12142                     4583                     7559   \n",
       "W           909360                   554771                   354589   \n",
       "\n",
       "   education_reporting_flag  month_of_death     sex  detail_age_type  \\\n",
       "D                    300582          300582  300582           300582   \n",
       "M                    931986          931986  931986           931986   \n",
       "S                    298436          298436  298436           298436   \n",
       "U                     12142           12142   12142            12142   \n",
       "W                    909360          909360  909360           909360   \n",
       "\n",
       "   detail_age  age_substitution_flag  age_recode_52  \\\n",
       "D      300582                     10         300582   \n",
       "M      931986                     24         931986   \n",
       "S      298436                     12         298436   \n",
       "U       12142                      6          12142   \n",
       "W      909360                     25         909360   \n",
       "\n",
       "              ...             record_condition_18  record_condition_19  \\\n",
       "D             ...                               0                    0   \n",
       "M             ...                               0                    0   \n",
       "S             ...                               0                    0   \n",
       "U             ...                               0                    0   \n",
       "W             ...                               0                    0   \n",
       "\n",
       "   record_condition_20    race  bridged_race_flag  race_imputation_flag  \\\n",
       "D                    0  300582                676                   937   \n",
       "M                    0  931986               1517                  2876   \n",
       "S                    0  298436               1243                  2214   \n",
       "U                    0   12142                 12                   691   \n",
       "W                    0  909360                969                  1742   \n",
       "\n",
       "   race_recode_3  race_recode_5  hispanic_origin  hispanic_originrace_recode  \n",
       "D         300582         300582           300582                      300582  \n",
       "M         931986         931986           931986                      931986  \n",
       "S         298436         298436           298436                      298436  \n",
       "U          12142          12142            12142                       12142  \n",
       "W         909360         909360           909360                      909360  \n",
       "\n",
       "[5 rows x 77 columns]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "count_by_marital_status.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pandas on Ray leverages key parts of the Ray runtime. If you are not familiar with Ray, check out the **What is Ray?** section at the foot of this blog. However, **familiarity with Ray is not required to use Pandas on Ray.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using Pandas on Ray with scikit-learn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An important part of being able to use Pandas, is being able to use it with other libraries. We have put together an example of how to use Pandas on Ray with scikit-learn in the [examples found in the GitHub repo](https://github.com/modin-project/modin/blob/master/examples/modin-scikit-learn-example.ipynb)\n",
    "\n",
    "We demonstrate in that Jupyter Notebook that there is still no modification necessary outside of the `import` statement. You can pass a Pandas on Ray DataFrame to a numpy or scikit-learn function without the need to change your data at all. These functions will not run in parallel when they are passed in because we convert the Pandas on Ray DataFrames back to a single-threaded object."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='modin'></a>\n",
    "## What is Modin?\n",
    "\n",
    "### Modin - A Step Toward Unifying APIs\n",
    "Recently, we decided to separate Pandas on Ray from the Ray project to enable more rapid development and facilitate a long term vision of supporting a wide range of distributed compute environments in addition to Ray (e.g., Spark, Dask).  To embrace this more flexible vision of a common compute abstraction framework we moved Pandas on Ray into [Modin](https://github.com/modin-project/modin), a new RISELab project aimed at unifying the way the Data Scientists interact with data. At a high level, our plans for Modin are simple: To present a familiar set of APIs (e.g. Pandas, SQL) to Data Scientists with a pluggable back-end.\n",
    "\n",
    "Currently, Modin only supports the Ray execution engine with the Pandas API, though we have implemented a simple SQL example to give an idea of what we’re aiming for.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "import modin.sql as sql"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "conn = sql.connect(\"db_name\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "c = conn.cursor()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "c.execute(\"CREATE TABLE example (col1, col2, column 3, col4)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  col1  col2                 column 3  col4\n",
      "0    1   2.0  A String of information  True\n"
     ]
    }
   ],
   "source": [
    "c.execute(\"INSERT INTO example VALUES ('1', 2.0, 'A String of information', True)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  col1  col2                           column 3   col4\n",
      "0    1   2.0            A String of information   True\n",
      "1    6  17.0  A String of different information  False\n"
     ]
    }
   ],
   "source": [
    "c.execute(\"INSERT INTO example VALUES ('6', 17.0, 'A String of different information', False)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Over the coming months, we’ll be working to connect more back-end compute engines. We are also hoping to gauge the interest of the community in this general idea.\n",
    "\n",
    "For more information about Modin, visit the [GitHub repo](https://github.com/modin-project/modin) and [documentation](http://modin.readthedocs.io). We will talk more about Modin over the next several months as it develops, but in the meantime, you can also ask questions on the Modin [mailing list](https://groups.google.com/forum/#!forum/modin-dev)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What’s new, what works, what needs improvement?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Implementation Progress\n",
    "We have implemented a significant number of methods since our last blog post. For a full list of what we’re currently supporting, please visit our [docs](http://modin.readthedocs.io/en/latest/pandas_supported.html). Some of the operations can still be further optimized, and we’re excited about the potential that many of these operations have with respect to both memory and execution efficiency."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Distributed Series\n",
    "One of the more important realizations we’ve come to is that we can’t only parallelize the Pandas DataFrame; we also have to parallelize the Pandas Series. Currently, when you get a column with `df[‘column’]` or `df.column`, it will not operate in parallel on the resulting Series. A number of the unoptimized methods (e.g. `groupby`, `sort_values`, etc.) will be much faster after we implement a distributed Series object."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### MultiLevel Index\n",
    "Operations that are based on MultiLevel Index (e.g. `df.sum(level=...)`), are also not yet supported. This decision was made to ensure that we could implement basic functionality and have higher API completeness on the first pass. Over the course of the next few months we are planning to roll out MultiLevel Index operation support."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Optimization on Columnar Operations\n",
    "In our last blog, we mentioned that we had some performance issues with columnar operations. We have resolved these with a simple change to the partitioning scheme:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Columnar operations\n",
    "![columns_1](https://gist.github.com/devin-petersohn/d860eaf3a9524a89a54b2d93f1a2aac0/raw/ff73b19b380b717f2a49049a444ecbf6e8bb5b67/columnar_axis_0.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Row-wise operations\n",
    "![columns_2](https://gist.github.com/devin-petersohn/d860eaf3a9524a89a54b2d93f1a2aac0/raw/ff73b19b380b717f2a49049a444ecbf6e8bb5b67/columnar_axis_1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We chose these plots because they clearly illustrate the point at which it starts becoming beneficial to use Pandas on Ray. There are overheads, some of which are fixed, associated with distributing a DataFrame.\n",
    "\n",
    "These improvements were accomplished by adjusting the way that we partition the data. Instead of partitioning into row groups (i.e. horizontally cutting the DataFrame) or column groups (vertically cutting the DataFrame), we partition into smaller blocks that are easy to ship around and group."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Groupby + Aggregate\n",
    "Another of the more commonly requested operations was `groupby` + `agg`. There are a number of tricky things about the `groupby` API, for example iterating over a grouped DataFrame:\n",
    "```python\n",
    "grouped = df.groupby(by=df.col)\n",
    "for g_id, g_df in grouped:\n",
    "    type(g_df)\n",
    "    g_df.mean()\n",
    "    # Other code here\n",
    "```\n",
    "We have implemented support for this style of interacting with the DataFrame object, and the resulting grouped DataFrames are distributed Pandas on Ray DataFrame objects. \n",
    "\n",
    "Additionally, we support a majority of the aggregation operations performed after a groupby. There is still some continued optimization needed in this area, but here is a plot of our current performance as compared with Pandas."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![groupby_plot](https://gist.github.com/devin-petersohn/d860eaf3a9524a89a54b2d93f1a2aac0/raw/ca1a9e4bfa4b89aaf952c7ac0ed8f092d8eeb6b1/groupby_sum.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we mentioned above, once we have a Distributed Series in place, it will be much easier to implement more efficient algorithms for `groupby`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Areas Where There is Room for Improvement\n",
    "### Memory Overhead and Garbage Collection\n",
    "Memory overhead is an issue with [Pandas](http://wesmckinney.com/blog/apache-arrow-pandas-internals/), and because we rely on Pandas, we are not immune to this memory cost. In this area, there is a lot of room for improvement. One area that we plan to improve on in the future is reducing the memory footprint and having smarter garbage collection.\n",
    "\n",
    "### Printing DataFrames and the Overheads Associated\n",
    "An important part of interacting with datasets is being able to print out part or all of a DataFrame. Currently, this operation is supported in Pandas on Ray, but it is not optimized yet. This is due in part to the cost of bringing objects out of the remote memory object store of Ray. This cost is similar to the cost of a `collect` or `compute` operation that other frameworks have, so one area of focus in the future will be looking at this.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Roadmap Moving Forward\n",
    "To give a sense of what we are planning moving forward, here is a list of plans for the future (in general priority order, subject to change):"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "<table style=\"width:100%; border:1px solid black;\">\n",
    "  <tr style=\"border-bottom:1px solid black;\">\n",
    "    <th style=\"text-align:left; border-right:1px solid black;\">Pandas on Ray</th>\n",
    "    <th style=\"text-align:left;\">Modin</th>\n",
    "  </tr>\n",
    "  <tr>\n",
    "    <td style=\"text-align:left; border-right:1px solid black;\">MultiLevel index support</td>\n",
    "    <td style=\"text-align:left;\">Gather feedback from the community</td> \n",
    "  </tr>\n",
    "  <tr>\n",
    "    <td style=\"text-align:left; border-right:1px solid black;\">Continue to work toward completing API coverage (about 100 methods left)</td>\n",
    "    <td style=\"text-align:left;\">Continue work on exposing a simple SQL interface</td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "    <td style=\"text-align:left; border-right:1px solid black;\">Implement a Distributed Series object</td>\n",
    "    <td style=\"text-align:left;\">Start adding support for another compute engine (TBD)</td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "    <td style=\"text-align:left; border-right:1px solid black;\">Optimize operations that can benefit from Distributed Series</td>\n",
    "    <td style=\"text-align:left;\"></td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "    <td style=\"text-align:left; border-right:1px solid black;\">Work on being able to switch between eager and lazy environments</td>\n",
    "    <td style=\"text-align:left;\"></td>\n",
    "  </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "Since our last blog post, we have implemented a number of performance improvements and improved our API coverage of Pandas. We have moved the Pandas on Ray development into Modin, a project we started with the idea of allowing Data Scientists to use comfortable APIs without worrying about back-ends or distribution. From the start, this project has been community-focused. Ultimately, we want to empower Data Scientists, which means supporting a wide variety of use-cases and operations. We continue to be excited about the potential of these projects, and hope to provide the community with useful tools that enables and empowers Data Science."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Acknowledgements\n",
    "\n",
    "#### Contributors and  Co-authors\n",
    "Devin Petersohn, Kunal Gosar, Simon Mo, Peter Schafhalter, Robert Nishihara, Philipp Moritz, Patrick Yang, Omkar Salpekar, Rohan Singh, Adithya Girish, Harikaran Subbaraj, Peter Veerman, Helen Che, Jaemin Kim, Joseph Gonzalez, Ion Stoica, Anthony Joseph\n",
    "\n",
    "We would also like to thank the Ray-core development team. Without Ray, there can be no Pandas on Ray."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Footnotes\n",
    "<a id='pandasOnRay'></a>\n",
    "### What is Pandas on Ray?\n",
    "Pandas on Ray is an early stage DataFrame library that wraps Pandas and transparently distributes the data and computation. The user does not need to know how many cores their system has, nor do they need to specify how to distribute the data. In fact, users can continue using their previous Pandas notebooks while experiencing a considerable speedup from Pandas on Ray, even on a single machine. Only the import statement needs to be modified, as we demonstrate below. Once you’ve changed your import statement, you’re ready to use Pandas on Ray just like you would Pandas.\n",
    "<a id='ray'></a>\n",
    "### What is Ray?\n",
    "Pandas on Ray uses Ray as its underlying execution framework. Ray is a high-performance distributed execution framework targeted at large-scale machine learning and reinforcement learning applications. It achieves scalability and fault tolerance by abstracting the control state of the system in a global control store and keeping all other components stateless. It uses a distributed object store to efficiently handle large data through shared memory, and it uses a bottom-up hierarchical scheduling architecture to achieve low-latency and high-throughput scheduling. It uses a lightweight API based on dynamic task graphs and actors to express a wide range of applications in a flexible manner. You can find Ray on GitHub: github.com/ray-project/ray."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

## columnar_axis_0.png

      
    Raw
  

              columnar_axis_0.png
            
          
## columnar_axis_1.png

      
    Raw
  

              columnar_axis_1.png
            
          
## groupby.png

      
    Raw
  

              groupby.png
            
          
## groupby_sum.png

      
    Raw
  

              groupby_sum.png
            
          
## groupby_sum_bigger.png

      
    Raw
  

              groupby_sum_bigger.png