Skip to content

Instantly share code, notes, and snippets.

@devin-petersohn
Last active October 14, 2018 19:14
Show Gist options
  • Star 6 You must be signed in to star a gist
  • Fork 3 You must be signed in to fork a gist
  • Save devin-petersohn/d860eaf3a9524a89a54b2d93f1a2aac0 to your computer and use it in GitHub Desktop.
Save devin-petersohn/d860eaf3a9524a89a54b2d93f1a2aac0 to your computer and use it in GitHub Desktop.
Pandas on Ray - Lessons learned Blog Post. Also introduces Modin, a project for unifying the APIs of computing engines.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pandas on Ray - Early Lessons from Parallelizing Pandas\n",
"\n",
"##### Devin Petersohn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our last [blog post](https://rise.cs.berkeley.edu/blog/pandas-on-ray/), we introduced [Pandas on Ray](https://github.com/modin-project/modin) with some preliminary progress for making Pandas workflows faster by requiring only a single line of code change. Since then, we have received a lot of feedback from the community and in response we worked to significantly improve the functionality and performance. In this blog post, we will go over a few of the lessons we learned along the way and talk about performance and how we plan to continue improving the library moving forward."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Useful links\n",
"#### [[What is Ray?](https://github.com/ray-project/ray)] [[What is Pandas on Ray?](http://modin.readthedocs.io/en/latest/pandas_on_ray.html)] [[What is Modin?](https://github.com/modin-project/modin)]\n",
"### [[Download this notebook!](https://gist.github.com/devin-petersohn/d860eaf3a9524a89a54b2d93f1a2aac0#file-a_pandas_on_ray_blogpost_01-ipynb)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Lesson 1: Choosing What to Implement First\n",
"It is easy in a young project to start implementing the components of the API that offer easy performance wins or are low effort. In order to avoid bias in the order of implementation, we decided to take a data-driven approach to API completion. We went to [Kaggle](https://www.kaggle.com/) and grabbed the 1800 most upvoted Python notebooks and scripts and scraped them for the Pandas DataFrame methods they were invoking ([code](https://github.com/adgirish/kaggleScape)). \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Plot](https://docs.google.com/spreadsheets/d/e/2PACX-1vSJAqz2lmMe2yxUEV1BDYYJcb7F_javeq1mwW_uoiqOi8WuXQBnDIBAOkeF_WJ9iOtxuJxgvr_8PzFv/pubchart?oid=108581991&format=image)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We opted to complete the 90th percentile of operations first, such that the most popular methods would be completed. Currently, Pandas on Ray supports about 180 of the Pandas DataFrame methods. The list of currently supported APIs, details, and limitations of these implementations can be found in the [documentation](http://modin.readthedocs.io/en/latest/pandas_supported.html). We felt it was important that the implementation of the project be community-driven, rather than driven by our own bias."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Lesson 2: The Importance of Scaling the Pandas API\n",
"When we started this project, we focused on developing a full drop-in replacement for Pandas that would scale with hardware availability. Pandas is still used to teach introductory Data Science [online](https://www.kaggle.com/learn) and in many [college courses](http://www.ds100.org/). Many experienced Data Scientists have multiple years of experience working with Pandas, and what we really want to ensure with this project is that experience and knowledge is not wasted. Learning a new API is time consuming, as is learning a new framework. Such costs should not be overlooked when building an API or choosing the best tool for your workflow."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Lesson 3: Lazy Execution in Interactive Environments\n",
"Data Science is about interacting with datasets, visualizing them, and gathering some insight from the data. It is inherently interactive - meaning there’s a lot of thinking, typing, and digesting that happens between notebook blocks.\n",
"\n",
"Contrast this with the way that systems are being built: lazy execution. The systems community is fixated on minimizing the number of CPU cycles it takes to execute a given query or command. This type of execution works extremely well for batch data processing, however the real disconnect happens on interactive workflows. In an eager system, as a Data Scientist is typing a statement the previous one is executing. In contrast, with lazy evaluation nothing is executed until you need a result. In an interactive workflow, this means that the CPU is not working toward a result between statements or queries, leading to poor CPU utilization over the life of the workflow.\n",
"\n",
"Even in an interactive workflow, a lazy system will optimize for minimizing the number of CPU cycles it takes to complete the entire notebook. For example:\n",
"* lazy systems can have inefficient resource utilization over the course of an interactive workflow\n",
"* lazy systems are hard to debug as the error can happen in any of the statements that are lazily evaluated.\n",
"* lazy systems are harder to understand, particularly because most developers are used to eager-execution systems."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Future: The Best of Both Worlds\n",
"The future of Pandas on Ray is a system that can switch between both eager and lazy execution. We want to achieve the scale and CPU benefits of lazy systems for large workflows and datasets while avoiding inefficient resource (CPU) utilization and poor debuggability they provide for small and medium interactive workflows."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Apache Arrow\n",
"Some of the performance improvements we can boast (shown below) are due to the work by Wes McKinney (the original author of Pandas) and the team at [Apache Arrow](https://arrow.apache.org/). Apache Arrow is a cross-language development platform for in-memory data. Pandas on Ray relies on the Plasma Store and the zero-copy serialization for Pandas DataFrames that Arrow provides. In most distributed systems, (de)serialization costs can dominate single operation tasks. Having zero copy, fast serialization is crucial for overall system performance. Check out their [GitHub](https://github.com/apache/arrow) and [docs](https://arrow.apache.org/docs/).\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Pandas on Ray\n",
"<a id='usingPandasOnRay'></a>\n",
"Pandas on Ray has moved to the Modin project! See the **What is Modin?** section of this blog to learn more.\n",
"\n",
"To install simply use `pip`:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`pip install modin`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is still only a single line of code change to accelerate Pandas workflows."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Process STDOUT and STDERR is being redirected to /tmp/raylogs/.\n",
"Waiting for redis server at 127.0.0.1:33485 to respond...\n",
"Waiting for redis server at 127.0.0.1:43424 to respond...\n",
"Starting local scheduler with the following resources: {'CPU': 8, 'GPU': 0}.\n",
"\n",
"======================================================================\n",
"View the web UI at http://localhost:8889/notebooks/ray_ui67261.ipynb?token=2077a90717f57f0ab5b4f066c5d419f6ed4ef9153fd3b818\n",
"======================================================================\n",
"\n"
]
}
],
"source": [
"# import pandas as pd\n",
"import modin.pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, you are ready to use Pandas on Ray as you would use Pandas."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This file is 372MB decompressed and is running on a Laptop with 4 physical cores and 16GB of memory. [Data source](https://www.kaggle.com/cdc/mortality)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"df_2005 = pd.read_csv(\"2005_data.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you're running this notebook for yourself, you may have noticed that it completes almost instantly. In the background, the file is still being read, but we return the prompt to you as soon as we have deployed all of the tasks to read different parts of the file."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"modin.pandas.dataframe.DataFrame"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(df_2005)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['resident_status', 'education_1989_revision', 'education_2003_revision',\n",
" 'education_reporting_flag', 'month_of_death', 'sex', 'detail_age_type',\n",
" 'detail_age', 'age_substitution_flag', 'age_recode_52', 'age_recode_27',\n",
" 'age_recode_12', 'infant_age_recode_22',\n",
" 'place_of_death_and_decedents_status', 'marital_status',\n",
" 'day_of_week_of_death', 'current_data_year', 'injury_at_work',\n",
" 'manner_of_death', 'method_of_disposition', 'autopsy', 'activity_code',\n",
" 'place_of_injury_for_causes_w00_y34_except_y06_and_y07_',\n",
" 'icd_code_10th_revision', '358_cause_recode', '113_cause_recode',\n",
" '130_infant_cause_recode', '39_cause_recode',\n",
" 'number_of_entity_axis_conditions', 'entity_condition_1',\n",
" 'entity_condition_2', 'entity_condition_3', 'entity_condition_4',\n",
" 'entity_condition_5', 'entity_condition_6', 'entity_condition_7',\n",
" 'entity_condition_8', 'entity_condition_9', 'entity_condition_10',\n",
" 'entity_condition_11', 'entity_condition_12', 'entity_condition_13',\n",
" 'entity_condition_14', 'entity_condition_15', 'entity_condition_16',\n",
" 'entity_condition_17', 'entity_condition_18', 'entity_condition_19',\n",
" 'entity_condition_20', 'number_of_record_axis_conditions',\n",
" 'record_condition_1', 'record_condition_2', 'record_condition_3',\n",
" 'record_condition_4', 'record_condition_5', 'record_condition_6',\n",
" 'record_condition_7', 'record_condition_8', 'record_condition_9',\n",
" 'record_condition_10', 'record_condition_11', 'record_condition_12',\n",
" 'record_condition_13', 'record_condition_14', 'record_condition_15',\n",
" 'record_condition_16', 'record_condition_17', 'record_condition_18',\n",
" 'record_condition_19', 'record_condition_20', 'race',\n",
" 'bridged_race_flag', 'race_imputation_flag', 'race_recode_3',\n",
" 'race_recode_5', 'hispanic_origin', 'hispanic_originrace_recode'],\n",
" dtype='object')"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_2005.columns"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"count_by_marital_status = df_2005.groupby(by=\"marital_status\").count()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>resident_status</th>\n",
" <th>education_1989_revision</th>\n",
" <th>education_2003_revision</th>\n",
" <th>education_reporting_flag</th>\n",
" <th>month_of_death</th>\n",
" <th>sex</th>\n",
" <th>detail_age_type</th>\n",
" <th>detail_age</th>\n",
" <th>age_substitution_flag</th>\n",
" <th>age_recode_52</th>\n",
" <th>...</th>\n",
" <th>record_condition_18</th>\n",
" <th>record_condition_19</th>\n",
" <th>record_condition_20</th>\n",
" <th>race</th>\n",
" <th>bridged_race_flag</th>\n",
" <th>race_imputation_flag</th>\n",
" <th>race_recode_3</th>\n",
" <th>race_recode_5</th>\n",
" <th>hispanic_origin</th>\n",
" <th>hispanic_originrace_recode</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>D</th>\n",
" <td>300582</td>\n",
" <td>182218</td>\n",
" <td>118364</td>\n",
" <td>300582</td>\n",
" <td>300582</td>\n",
" <td>300582</td>\n",
" <td>300582</td>\n",
" <td>300582</td>\n",
" <td>10</td>\n",
" <td>300582</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>300582</td>\n",
" <td>676</td>\n",
" <td>937</td>\n",
" <td>300582</td>\n",
" <td>300582</td>\n",
" <td>300582</td>\n",
" <td>300582</td>\n",
" </tr>\n",
" <tr>\n",
" <th>M</th>\n",
" <td>931986</td>\n",
" <td>567362</td>\n",
" <td>364624</td>\n",
" <td>931986</td>\n",
" <td>931986</td>\n",
" <td>931986</td>\n",
" <td>931986</td>\n",
" <td>931986</td>\n",
" <td>24</td>\n",
" <td>931986</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>931986</td>\n",
" <td>1517</td>\n",
" <td>2876</td>\n",
" <td>931986</td>\n",
" <td>931986</td>\n",
" <td>931986</td>\n",
" <td>931986</td>\n",
" </tr>\n",
" <tr>\n",
" <th>S</th>\n",
" <td>298436</td>\n",
" <td>178236</td>\n",
" <td>120200</td>\n",
" <td>298436</td>\n",
" <td>298436</td>\n",
" <td>298436</td>\n",
" <td>298436</td>\n",
" <td>298436</td>\n",
" <td>12</td>\n",
" <td>298436</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>298436</td>\n",
" <td>1243</td>\n",
" <td>2214</td>\n",
" <td>298436</td>\n",
" <td>298436</td>\n",
" <td>298436</td>\n",
" <td>298436</td>\n",
" </tr>\n",
" <tr>\n",
" <th>U</th>\n",
" <td>12142</td>\n",
" <td>4583</td>\n",
" <td>7559</td>\n",
" <td>12142</td>\n",
" <td>12142</td>\n",
" <td>12142</td>\n",
" <td>12142</td>\n",
" <td>12142</td>\n",
" <td>6</td>\n",
" <td>12142</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>12142</td>\n",
" <td>12</td>\n",
" <td>691</td>\n",
" <td>12142</td>\n",
" <td>12142</td>\n",
" <td>12142</td>\n",
" <td>12142</td>\n",
" </tr>\n",
" <tr>\n",
" <th>W</th>\n",
" <td>909360</td>\n",
" <td>554771</td>\n",
" <td>354589</td>\n",
" <td>909360</td>\n",
" <td>909360</td>\n",
" <td>909360</td>\n",
" <td>909360</td>\n",
" <td>909360</td>\n",
" <td>25</td>\n",
" <td>909360</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>909360</td>\n",
" <td>969</td>\n",
" <td>1742</td>\n",
" <td>909360</td>\n",
" <td>909360</td>\n",
" <td>909360</td>\n",
" <td>909360</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows x 77 columns</p>\n",
"</div>"
],
"text/plain": [
" resident_status education_1989_revision education_2003_revision \\\n",
"D 300582 182218 118364 \n",
"M 931986 567362 364624 \n",
"S 298436 178236 120200 \n",
"U 12142 4583 7559 \n",
"W 909360 554771 354589 \n",
"\n",
" education_reporting_flag month_of_death sex detail_age_type \\\n",
"D 300582 300582 300582 300582 \n",
"M 931986 931986 931986 931986 \n",
"S 298436 298436 298436 298436 \n",
"U 12142 12142 12142 12142 \n",
"W 909360 909360 909360 909360 \n",
"\n",
" detail_age age_substitution_flag age_recode_52 \\\n",
"D 300582 10 300582 \n",
"M 931986 24 931986 \n",
"S 298436 12 298436 \n",
"U 12142 6 12142 \n",
"W 909360 25 909360 \n",
"\n",
" ... record_condition_18 record_condition_19 \\\n",
"D ... 0 0 \n",
"M ... 0 0 \n",
"S ... 0 0 \n",
"U ... 0 0 \n",
"W ... 0 0 \n",
"\n",
" record_condition_20 race bridged_race_flag race_imputation_flag \\\n",
"D 0 300582 676 937 \n",
"M 0 931986 1517 2876 \n",
"S 0 298436 1243 2214 \n",
"U 0 12142 12 691 \n",
"W 0 909360 969 1742 \n",
"\n",
" race_recode_3 race_recode_5 hispanic_origin hispanic_originrace_recode \n",
"D 300582 300582 300582 300582 \n",
"M 931986 931986 931986 931986 \n",
"S 298436 298436 298436 298436 \n",
"U 12142 12142 12142 12142 \n",
"W 909360 909360 909360 909360 \n",
"\n",
"[5 rows x 77 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"count_by_marital_status.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas on Ray leverages key parts of the Ray runtime. If you are not familiar with Ray, check out the **What is Ray?** section at the foot of this blog. However, **familiarity with Ray is not required to use Pandas on Ray.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using Pandas on Ray with scikit-learn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An important part of being able to use Pandas, is being able to use it with other libraries. We have put together an example of how to use Pandas on Ray with scikit-learn in the [examples found in the GitHub repo](https://github.com/modin-project/modin/blob/master/examples/modin-scikit-learn-example.ipynb)\n",
"\n",
"We demonstrate in that Jupyter Notebook that there is still no modification necessary outside of the `import` statement. You can pass a Pandas on Ray DataFrame to a numpy or scikit-learn function without the need to change your data at all. These functions will not run in parallel when they are passed in because we convert the Pandas on Ray DataFrames back to a single-threaded object."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='modin'></a>\n",
"## What is Modin?\n",
"\n",
"### Modin - A Step Toward Unifying APIs\n",
"Recently, we decided to separate Pandas on Ray from the Ray project to enable more rapid development and facilitate a long term vision of supporting a wide range of distributed compute environments in addition to Ray (e.g., Spark, Dask). To embrace this more flexible vision of a common compute abstraction framework we moved Pandas on Ray into [Modin](https://github.com/modin-project/modin), a new RISELab project aimed at unifying the way the Data Scientists interact with data. At a high level, our plans for Modin are simple: To present a familiar set of APIs (e.g. Pandas, SQL) to Data Scientists with a pluggable back-end.\n",
"\n",
"Currently, Modin only supports the Ray execution engine with the Pandas API, though we have implemented a simple SQL example to give an idea of what we’re aiming for.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"import modin.sql as sql"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"conn = sql.connect(\"db_name\")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"c = conn.cursor()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"c.execute(\"CREATE TABLE example (col1, col2, column 3, col4)\")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" col1 col2 column 3 col4\n",
"0 1 2.0 A String of information True\n"
]
}
],
"source": [
"c.execute(\"INSERT INTO example VALUES ('1', 2.0, 'A String of information', True)\")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" col1 col2 column 3 col4\n",
"0 1 2.0 A String of information True\n",
"1 6 17.0 A String of different information False\n"
]
}
],
"source": [
"c.execute(\"INSERT INTO example VALUES ('6', 17.0, 'A String of different information', False)\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Over the coming months, we’ll be working to connect more back-end compute engines. We are also hoping to gauge the interest of the community in this general idea.\n",
"\n",
"For more information about Modin, visit the [GitHub repo](https://github.com/modin-project/modin) and [documentation](http://modin.readthedocs.io). We will talk more about Modin over the next several months as it develops, but in the meantime, you can also ask questions on the Modin [mailing list](https://groups.google.com/forum/#!forum/modin-dev)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What’s new, what works, what needs improvement?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Implementation Progress\n",
"We have implemented a significant number of methods since our last blog post. For a full list of what we’re currently supporting, please visit our [docs](http://modin.readthedocs.io/en/latest/pandas_supported.html). Some of the operations can still be further optimized, and we’re excited about the potential that many of these operations have with respect to both memory and execution efficiency."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Distributed Series\n",
"One of the more important realizations we’ve come to is that we can’t only parallelize the Pandas DataFrame; we also have to parallelize the Pandas Series. Currently, when you get a column with `df[‘column’]` or `df.column`, it will not operate in parallel on the resulting Series. A number of the unoptimized methods (e.g. `groupby`, `sort_values`, etc.) will be much faster after we implement a distributed Series object."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### MultiLevel Index\n",
"Operations that are based on MultiLevel Index (e.g. `df.sum(level=...)`), are also not yet supported. This decision was made to ensure that we could implement basic functionality and have higher API completeness on the first pass. Over the course of the next few months we are planning to roll out MultiLevel Index operation support."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Optimization on Columnar Operations\n",
"In our last blog, we mentioned that we had some performance issues with columnar operations. We have resolved these with a simple change to the partitioning scheme:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Columnar operations\n",
"![columns_1](https://gist.github.com/devin-petersohn/d860eaf3a9524a89a54b2d93f1a2aac0/raw/ff73b19b380b717f2a49049a444ecbf6e8bb5b67/columnar_axis_0.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Row-wise operations\n",
"![columns_2](https://gist.github.com/devin-petersohn/d860eaf3a9524a89a54b2d93f1a2aac0/raw/ff73b19b380b717f2a49049a444ecbf6e8bb5b67/columnar_axis_1.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We chose these plots because they clearly illustrate the point at which it starts becoming beneficial to use Pandas on Ray. There are overheads, some of which are fixed, associated with distributing a DataFrame.\n",
"\n",
"These improvements were accomplished by adjusting the way that we partition the data. Instead of partitioning into row groups (i.e. horizontally cutting the DataFrame) or column groups (vertically cutting the DataFrame), we partition into smaller blocks that are easy to ship around and group."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Groupby + Aggregate\n",
"Another of the more commonly requested operations was `groupby` + `agg`. There are a number of tricky things about the `groupby` API, for example iterating over a grouped DataFrame:\n",
"```python\n",
"grouped = df.groupby(by=df.col)\n",
"for g_id, g_df in grouped:\n",
" type(g_df)\n",
" g_df.mean()\n",
" # Other code here\n",
"```\n",
"We have implemented support for this style of interacting with the DataFrame object, and the resulting grouped DataFrames are distributed Pandas on Ray DataFrame objects. \n",
"\n",
"Additionally, we support a majority of the aggregation operations performed after a groupby. There is still some continued optimization needed in this area, but here is a plot of our current performance as compared with Pandas."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![groupby_plot](https://gist.github.com/devin-petersohn/d860eaf3a9524a89a54b2d93f1a2aac0/raw/ca1a9e4bfa4b89aaf952c7ac0ed8f092d8eeb6b1/groupby_sum.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we mentioned above, once we have a Distributed Series in place, it will be much easier to implement more efficient algorithms for `groupby`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Areas Where There is Room for Improvement\n",
"### Memory Overhead and Garbage Collection\n",
"Memory overhead is an issue with [Pandas](http://wesmckinney.com/blog/apache-arrow-pandas-internals/), and because we rely on Pandas, we are not immune to this memory cost. In this area, there is a lot of room for improvement. One area that we plan to improve on in the future is reducing the memory footprint and having smarter garbage collection.\n",
"\n",
"### Printing DataFrames and the Overheads Associated\n",
"An important part of interacting with datasets is being able to print out part or all of a DataFrame. Currently, this operation is supported in Pandas on Ray, but it is not optimized yet. This is due in part to the cost of bringing objects out of the remote memory object store of Ray. This cost is similar to the cost of a `collect` or `compute` operation that other frameworks have, so one area of focus in the future will be looking at this.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Roadmap Moving Forward\n",
"To give a sense of what we are planning moving forward, here is a list of plans for the future (in general priority order, subject to change):"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"<table style=\"width:100%; border:1px solid black;\">\n",
" <tr style=\"border-bottom:1px solid black;\">\n",
" <th style=\"text-align:left; border-right:1px solid black;\">Pandas on Ray</th>\n",
" <th style=\"text-align:left;\">Modin</th>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align:left; border-right:1px solid black;\">MultiLevel index support</td>\n",
" <td style=\"text-align:left;\">Gather feedback from the community</td> \n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align:left; border-right:1px solid black;\">Continue to work toward completing API coverage (about 100 methods left)</td>\n",
" <td style=\"text-align:left;\">Continue work on exposing a simple SQL interface</td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align:left; border-right:1px solid black;\">Implement a Distributed Series object</td>\n",
" <td style=\"text-align:left;\">Start adding support for another compute engine (TBD)</td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align:left; border-right:1px solid black;\">Optimize operations that can benefit from Distributed Series</td>\n",
" <td style=\"text-align:left;\"></td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align:left; border-right:1px solid black;\">Work on being able to switch between eager and lazy environments</td>\n",
" <td style=\"text-align:left;\"></td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"Since our last blog post, we have implemented a number of performance improvements and improved our API coverage of Pandas. We have moved the Pandas on Ray development into Modin, a project we started with the idea of allowing Data Scientists to use comfortable APIs without worrying about back-ends or distribution. From the start, this project has been community-focused. Ultimately, we want to empower Data Scientists, which means supporting a wide variety of use-cases and operations. We continue to be excited about the potential of these projects, and hope to provide the community with useful tools that enables and empowers Data Science."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Acknowledgements\n",
"\n",
"#### Contributors and Co-authors\n",
"Devin Petersohn, Kunal Gosar, Simon Mo, Peter Schafhalter, Robert Nishihara, Philipp Moritz, Patrick Yang, Omkar Salpekar, Rohan Singh, Adithya Girish, Harikaran Subbaraj, Peter Veerman, Helen Che, Jaemin Kim, Joseph Gonzalez, Ion Stoica, Anthony Joseph\n",
"\n",
"We would also like to thank the Ray-core development team. Without Ray, there can be no Pandas on Ray."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Footnotes\n",
"<a id='pandasOnRay'></a>\n",
"### What is Pandas on Ray?\n",
"Pandas on Ray is an early stage DataFrame library that wraps Pandas and transparently distributes the data and computation. The user does not need to know how many cores their system has, nor do they need to specify how to distribute the data. In fact, users can continue using their previous Pandas notebooks while experiencing a considerable speedup from Pandas on Ray, even on a single machine. Only the import statement needs to be modified, as we demonstrate below. Once you’ve changed your import statement, you’re ready to use Pandas on Ray just like you would Pandas.\n",
"<a id='ray'></a>\n",
"### What is Ray?\n",
"Pandas on Ray uses Ray as its underlying execution framework. Ray is a high-performance distributed execution framework targeted at large-scale machine learning and reinforcement learning applications. It achieves scalability and fault tolerance by abstracting the control state of the system in a global control store and keeping all other components stateless. It uses a distributed object store to efficiently handle large data through shared memory, and it uses a bottom-up hierarchical scheduling architecture to achieve low-latency and high-throughput scheduling. It uses a lightweight API based on dynamic task graphs and actors to express a wide range of applications in a flexible manner. You can find Ray on GitHub: github.com/ray-project/ray."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@StrategicVisionary
Copy link

Do you have any examples of join, merge, or concat e.g. pd.merge(df1, df2, how='outer')? Would make an excellent new blog post. Specifically with datetime indexes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment