devin-petersohn/blog_post_20181025_modin.ipynb

## blog_post_20181025_modin.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br><br><br><b><h1>Modin (Pandas on Ray) - October 2018</h1></b>\n",
    "\n",
    "<h3>Accelerate your pandas workflows by changing one line of code</h3>\n",
    "\n",
    "<h5>Devin Petersohn</h5>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Earlier this year we started a DataFrame project, **Modin** (formerly Pandas on Ray), with a proof of concept system and wrote a [blog post](https://rise.cs.berkeley.edu/blog/pandas-on-ray) about it. The reception to our project was overwhelmingly positive, and the Python community was extremely excited at the prospect of speeding up pandas workflows by changing one line of code. Even though we got great performance numbers at the time, it was still just a prototype system that was not yet ready for production workloads. Because it was a prototype, it had a high memory overhead and suboptimal performance in some cases. Over the past several months we have made several major improvements to the internals of the system in terms of efficiency and memory footprint.\n",
    "\n",
    "We spent about a month redesigning and reimplementing the backend of Modin for extensibility and memory efficiency. In the remainder of this blog we will discuss what we've done to improve the performance, efficiency, and scalability of the overall system.\n",
    "\n",
    "**What is Modin?**\n",
    "\n",
    "Modin is a DataFrame library that allows you to speed up your pandas workflows by changing one line of code. We were motivated to create this tool by an interesting trend we observed. We noticed that many Data Scientists are often forced to use different tools to perform the same tasks at different data scales. Not only are these tools completely different, they expose new APIs and sometimes require user input for system information (e.g. partitioning). When we started Modin, we took the approach that Modin should be a DataFrame for datasets ranging from 1KB to 1TB+, without needing to specify partitioning. We believe Data Scientists should be spending their time extracting value from their data in the DataFrame API that they use the most: pandas.\n",
    "\n",
    "If you want to learn more about Modin, Ray, or read the previous blog post, check out these useful links:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>Useful Links</h3>\n",
    "<p>\n",
    "    [ <b><a href=\"https://github.com/modin-project/modin\">Modin Repo</a></b> ]\n",
    "    [ <b><a href=\"http://modin.readthedocs.io\">Modin Documentation</a></b> ]\n",
    "    [ <b><a href=\"https://rise.cs.berkeley.edu/blog/pandas-on-ray-early-lessons/\">Previous blog post</a></b> ]\n",
    "    [ <b><a href=\"https://github.com/ray-project/ray\">Ray Repo</a></b> ]\n",
    "    [ <b><a href=\"http://ray.readthedocs.io/en/latest/index.html\">Ray Documentation</a></b> ]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>Blog Post Contents</h3>\n",
    "<p>\n",
    "    [ <b>New features</b> ]\n",
    "    [ <b>New additions to the API</b> ]\n",
    "    [ <b>Improvements in runtime & efficiency</b> ]\n",
    "    [ <b>Next Steps & Getting involved</b> ]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Modin can be installed with `pip`**:\n",
    "\n",
    "```\n",
    "pip install modin\n",
    "```\n",
    "\n",
    "**For more information on Modin, visit the [Modin documentation](http://modin.readthedocs.io)!**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<b>To use Modin to improve your pandas workloads, you only need to change your import statement:</b>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Process STDOUT and STDERR is being redirected to /tmp/raylogs/.\n",
      "Waiting for redis server at 127.0.0.1:35672 to respond...\n",
      "Waiting for redis server at 127.0.0.1:55334 to respond...\n",
      "Starting the Plasma object store with 27.00 GB memory.\n"
     ]
    }
   ],
   "source": [
    "# import pandas as pd\n",
    "import modin.pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2> New feature in Modin - Default to pandas implementation! </h2>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Modin now features a way to complete all operations, even those not yet implemented or optimized. This makes the system usable for notebooks that use operations not yet implemented in Modin.\n",
    "\n",
    "There is a performance penalty for going to/from a pandas DataFrame, but we felt it was important to allow **all** users to use Modin as we continue to improve the API coverage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "frame_data = np.random.randint(0, 100, size=(2**10, 2**6))\n",
    "df = pd.DataFrame(frame_data).add_prefix(\"col_\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "modin.pandas.dataframe.DataFrame"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>col_0</th>\n",
       "      <th>col_1</th>\n",
       "      <th>col_2</th>\n",
       "      <th>col_3</th>\n",
       "      <th>col_4</th>\n",
       "      <th>col_5</th>\n",
       "      <th>col_6</th>\n",
       "      <th>col_7</th>\n",
       "      <th>col_8</th>\n",
       "      <th>col_9</th>\n",
       "      <th>...</th>\n",
       "      <th>col_54</th>\n",
       "      <th>col_55</th>\n",
       "      <th>col_56</th>\n",
       "      <th>col_57</th>\n",
       "      <th>col_58</th>\n",
       "      <th>col_59</th>\n",
       "      <th>col_60</th>\n",
       "      <th>col_61</th>\n",
       "      <th>col_62</th>\n",
       "      <th>col_63</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>18</td>\n",
       "      <td>28</td>\n",
       "      <td>84</td>\n",
       "      <td>79</td>\n",
       "      <td>51</td>\n",
       "      <td>76</td>\n",
       "      <td>66</td>\n",
       "      <td>61</td>\n",
       "      <td>33</td>\n",
       "      <td>57</td>\n",
       "      <td>...</td>\n",
       "      <td>53</td>\n",
       "      <td>46</td>\n",
       "      <td>19</td>\n",
       "      <td>37</td>\n",
       "      <td>47</td>\n",
       "      <td>14</td>\n",
       "      <td>22</td>\n",
       "      <td>91</td>\n",
       "      <td>84</td>\n",
       "      <td>90</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>29</td>\n",
       "      <td>32</td>\n",
       "      <td>23</td>\n",
       "      <td>62</td>\n",
       "      <td>51</td>\n",
       "      <td>95</td>\n",
       "      <td>39</td>\n",
       "      <td>85</td>\n",
       "      <td>97</td>\n",
       "      <td>...</td>\n",
       "      <td>90</td>\n",
       "      <td>72</td>\n",
       "      <td>0</td>\n",
       "      <td>89</td>\n",
       "      <td>41</td>\n",
       "      <td>58</td>\n",
       "      <td>97</td>\n",
       "      <td>45</td>\n",
       "      <td>78</td>\n",
       "      <td>33</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>20</td>\n",
       "      <td>80</td>\n",
       "      <td>46</td>\n",
       "      <td>68</td>\n",
       "      <td>30</td>\n",
       "      <td>10</td>\n",
       "      <td>80</td>\n",
       "      <td>36</td>\n",
       "      <td>75</td>\n",
       "      <td>46</td>\n",
       "      <td>...</td>\n",
       "      <td>98</td>\n",
       "      <td>22</td>\n",
       "      <td>99</td>\n",
       "      <td>17</td>\n",
       "      <td>42</td>\n",
       "      <td>47</td>\n",
       "      <td>76</td>\n",
       "      <td>45</td>\n",
       "      <td>32</td>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>89</td>\n",
       "      <td>5</td>\n",
       "      <td>51</td>\n",
       "      <td>4</td>\n",
       "      <td>36</td>\n",
       "      <td>56</td>\n",
       "      <td>54</td>\n",
       "      <td>52</td>\n",
       "      <td>60</td>\n",
       "      <td>49</td>\n",
       "      <td>...</td>\n",
       "      <td>23</td>\n",
       "      <td>6</td>\n",
       "      <td>99</td>\n",
       "      <td>42</td>\n",
       "      <td>71</td>\n",
       "      <td>7</td>\n",
       "      <td>82</td>\n",
       "      <td>87</td>\n",
       "      <td>85</td>\n",
       "      <td>30</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>17</td>\n",
       "      <td>87</td>\n",
       "      <td>24</td>\n",
       "      <td>58</td>\n",
       "      <td>65</td>\n",
       "      <td>84</td>\n",
       "      <td>6</td>\n",
       "      <td>19</td>\n",
       "      <td>46</td>\n",
       "      <td>4</td>\n",
       "      <td>...</td>\n",
       "      <td>45</td>\n",
       "      <td>98</td>\n",
       "      <td>12</td>\n",
       "      <td>92</td>\n",
       "      <td>85</td>\n",
       "      <td>8</td>\n",
       "      <td>57</td>\n",
       "      <td>96</td>\n",
       "      <td>92</td>\n",
       "      <td>60</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows x 64 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   col_0  col_1  col_2  col_3  col_4  col_5  col_6  col_7  col_8  col_9  \\\n",
       "0     18     28     84     79     51     76     66     61     33     57   \n",
       "1      2     29     32     23     62     51     95     39     85     97   \n",
       "2     20     80     46     68     30     10     80     36     75     46   \n",
       "3     89      5     51      4     36     56     54     52     60     49   \n",
       "4     17     87     24     58     65     84      6     19     46      4   \n",
       "\n",
       "    ...    col_54  col_55  col_56  col_57  col_58  col_59  col_60  col_61  \\\n",
       "0   ...        53      46      19      37      47      14      22      91   \n",
       "1   ...        90      72       0      89      41      58      97      45   \n",
       "2   ...        98      22      99      17      42      47      76      45   \n",
       "3   ...        23       6      99      42      71       7      82      87   \n",
       "4   ...        45      98      12      92      85       8      57      96   \n",
       "\n",
       "   col_62  col_63  \n",
       "0      84      90  \n",
       "1      78      33  \n",
       "2      32      27  \n",
       "3      85      30  \n",
       "4      92      60  \n",
       "\n",
       "[5 rows x 64 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>When defaulting to pandas, you will see a warning:</h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/DevinPetersohn/software_builds/modin/modin/pandas/dataframe.py:5064: UserWarning: Defaulting to Pandas implementation\n",
      "  warnings.warn(\"Defaulting to Pandas implementation\", UserWarning)\n"
     ]
    }
   ],
   "source": [
    "dot_df = df.dot(df.T)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p>\n",
    "    We return the DataFrame to a distributed Modin DataFrame once the computation is complete.\n",
    "</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "modin.pandas.dataframe.DataFrame"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(dot_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "      <th>...</th>\n",
       "      <th>1014</th>\n",
       "      <th>1015</th>\n",
       "      <th>1016</th>\n",
       "      <th>1017</th>\n",
       "      <th>1018</th>\n",
       "      <th>1019</th>\n",
       "      <th>1020</th>\n",
       "      <th>1021</th>\n",
       "      <th>1022</th>\n",
       "      <th>1023</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>168579</td>\n",
       "      <td>149654</td>\n",
       "      <td>130667</td>\n",
       "      <td>159250</td>\n",
       "      <td>142854</td>\n",
       "      <td>138115</td>\n",
       "      <td>160633</td>\n",
       "      <td>128078</td>\n",
       "      <td>141950</td>\n",
       "      <td>129142</td>\n",
       "      <td>...</td>\n",
       "      <td>144672</td>\n",
       "      <td>160224</td>\n",
       "      <td>147642</td>\n",
       "      <td>139209</td>\n",
       "      <td>133699</td>\n",
       "      <td>148443</td>\n",
       "      <td>149135</td>\n",
       "      <td>150800</td>\n",
       "      <td>157532</td>\n",
       "      <td>143403</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>149654</td>\n",
       "      <td>230984</td>\n",
       "      <td>149733</td>\n",
       "      <td>191966</td>\n",
       "      <td>164851</td>\n",
       "      <td>157974</td>\n",
       "      <td>178768</td>\n",
       "      <td>165793</td>\n",
       "      <td>153779</td>\n",
       "      <td>159321</td>\n",
       "      <td>...</td>\n",
       "      <td>173161</td>\n",
       "      <td>187568</td>\n",
       "      <td>163921</td>\n",
       "      <td>155004</td>\n",
       "      <td>144276</td>\n",
       "      <td>160195</td>\n",
       "      <td>177890</td>\n",
       "      <td>193852</td>\n",
       "      <td>179666</td>\n",
       "      <td>169774</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>130667</td>\n",
       "      <td>149733</td>\n",
       "      <td>176962</td>\n",
       "      <td>150915</td>\n",
       "      <td>130242</td>\n",
       "      <td>140749</td>\n",
       "      <td>135845</td>\n",
       "      <td>130759</td>\n",
       "      <td>137752</td>\n",
       "      <td>141496</td>\n",
       "      <td>...</td>\n",
       "      <td>148870</td>\n",
       "      <td>159224</td>\n",
       "      <td>138752</td>\n",
       "      <td>125384</td>\n",
       "      <td>112177</td>\n",
       "      <td>142116</td>\n",
       "      <td>140801</td>\n",
       "      <td>168335</td>\n",
       "      <td>155353</td>\n",
       "      <td>137603</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>159250</td>\n",
       "      <td>191966</td>\n",
       "      <td>150915</td>\n",
       "      <td>234321</td>\n",
       "      <td>159415</td>\n",
       "      <td>151796</td>\n",
       "      <td>176795</td>\n",
       "      <td>144677</td>\n",
       "      <td>161011</td>\n",
       "      <td>156245</td>\n",
       "      <td>...</td>\n",
       "      <td>184012</td>\n",
       "      <td>188178</td>\n",
       "      <td>159772</td>\n",
       "      <td>160776</td>\n",
       "      <td>147160</td>\n",
       "      <td>162679</td>\n",
       "      <td>188875</td>\n",
       "      <td>184331</td>\n",
       "      <td>175773</td>\n",
       "      <td>168834</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>142854</td>\n",
       "      <td>164851</td>\n",
       "      <td>130242</td>\n",
       "      <td>159415</td>\n",
       "      <td>189329</td>\n",
       "      <td>134701</td>\n",
       "      <td>159938</td>\n",
       "      <td>139161</td>\n",
       "      <td>137310</td>\n",
       "      <td>133352</td>\n",
       "      <td>...</td>\n",
       "      <td>134250</td>\n",
       "      <td>155996</td>\n",
       "      <td>146016</td>\n",
       "      <td>139073</td>\n",
       "      <td>129364</td>\n",
       "      <td>141505</td>\n",
       "      <td>149731</td>\n",
       "      <td>164819</td>\n",
       "      <td>158681</td>\n",
       "      <td>135631</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows x 1024 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     0       1       2       3       4       5       6       7       8     \\\n",
       "0  168579  149654  130667  159250  142854  138115  160633  128078  141950   \n",
       "1  149654  230984  149733  191966  164851  157974  178768  165793  153779   \n",
       "2  130667  149733  176962  150915  130242  140749  135845  130759  137752   \n",
       "3  159250  191966  150915  234321  159415  151796  176795  144677  161011   \n",
       "4  142854  164851  130242  159415  189329  134701  159938  139161  137310   \n",
       "\n",
       "     9      ...      1014    1015    1016    1017    1018    1019    1020  \\\n",
       "0  129142   ...    144672  160224  147642  139209  133699  148443  149135   \n",
       "1  159321   ...    173161  187568  163921  155004  144276  160195  177890   \n",
       "2  141496   ...    148870  159224  138752  125384  112177  142116  140801   \n",
       "3  156245   ...    184012  188178  159772  160776  147160  162679  188875   \n",
       "4  133352   ...    134250  155996  146016  139073  129364  141505  149731   \n",
       "\n",
       "     1021    1022    1023  \n",
       "0  150800  157532  143403  \n",
       "1  193852  179666  169774  \n",
       "2  168335  155353  137603  \n",
       "3  184331  175773  168834  \n",
       "4  164819  158681  135631  \n",
       "\n",
       "[5 rows x 1024 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dot_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>New additions to the API</h2>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since our last blog post, we improved our overall API coverage by over 17%. Currently, this accounts for 93% of usage based on our recent study of how Data Scientists use pandas.\n",
    "\n",
    "Because we chose to expose an API identical to the pandas API, we designed the system around supporting even the more challenging and obscure API components. In certain architectures, it is inefficient or impossible to perform operations such as `iloc`. In Modin, we can also support the assignment of values based on a slice of the DataFrame from `iloc`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.iloc[:5, :5] = -1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>col_0</th>\n",
       "      <th>col_1</th>\n",
       "      <th>col_2</th>\n",
       "      <th>col_3</th>\n",
       "      <th>col_4</th>\n",
       "      <th>col_5</th>\n",
       "      <th>col_6</th>\n",
       "      <th>col_7</th>\n",
       "      <th>col_8</th>\n",
       "      <th>col_9</th>\n",
       "      <th>...</th>\n",
       "      <th>col_54</th>\n",
       "      <th>col_55</th>\n",
       "      <th>col_56</th>\n",
       "      <th>col_57</th>\n",
       "      <th>col_58</th>\n",
       "      <th>col_59</th>\n",
       "      <th>col_60</th>\n",
       "      <th>col_61</th>\n",
       "      <th>col_62</th>\n",
       "      <th>col_63</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>76</td>\n",
       "      <td>66</td>\n",
       "      <td>61</td>\n",
       "      <td>33</td>\n",
       "      <td>57</td>\n",
       "      <td>...</td>\n",
       "      <td>53</td>\n",
       "      <td>46</td>\n",
       "      <td>19</td>\n",
       "      <td>37</td>\n",
       "      <td>47</td>\n",
       "      <td>14</td>\n",
       "      <td>22</td>\n",
       "      <td>91</td>\n",
       "      <td>84</td>\n",
       "      <td>90</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>51</td>\n",
       "      <td>95</td>\n",
       "      <td>39</td>\n",
       "      <td>85</td>\n",
       "      <td>97</td>\n",
       "      <td>...</td>\n",
       "      <td>90</td>\n",
       "      <td>72</td>\n",
       "      <td>0</td>\n",
       "      <td>89</td>\n",
       "      <td>41</td>\n",
       "      <td>58</td>\n",
       "      <td>97</td>\n",
       "      <td>45</td>\n",
       "      <td>78</td>\n",
       "      <td>33</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>10</td>\n",
       "      <td>80</td>\n",
       "      <td>36</td>\n",
       "      <td>75</td>\n",
       "      <td>46</td>\n",
       "      <td>...</td>\n",
       "      <td>98</td>\n",
       "      <td>22</td>\n",
       "      <td>99</td>\n",
       "      <td>17</td>\n",
       "      <td>42</td>\n",
       "      <td>47</td>\n",
       "      <td>76</td>\n",
       "      <td>45</td>\n",
       "      <td>32</td>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>56</td>\n",
       "      <td>54</td>\n",
       "      <td>52</td>\n",
       "      <td>60</td>\n",
       "      <td>49</td>\n",
       "      <td>...</td>\n",
       "      <td>23</td>\n",
       "      <td>6</td>\n",
       "      <td>99</td>\n",
       "      <td>42</td>\n",
       "      <td>71</td>\n",
       "      <td>7</td>\n",
       "      <td>82</td>\n",
       "      <td>87</td>\n",
       "      <td>85</td>\n",
       "      <td>30</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>84</td>\n",
       "      <td>6</td>\n",
       "      <td>19</td>\n",
       "      <td>46</td>\n",
       "      <td>4</td>\n",
       "      <td>...</td>\n",
       "      <td>45</td>\n",
       "      <td>98</td>\n",
       "      <td>12</td>\n",
       "      <td>92</td>\n",
       "      <td>85</td>\n",
       "      <td>8</td>\n",
       "      <td>57</td>\n",
       "      <td>96</td>\n",
       "      <td>92</td>\n",
       "      <td>60</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>25</td>\n",
       "      <td>76</td>\n",
       "      <td>30</td>\n",
       "      <td>62</td>\n",
       "      <td>3</td>\n",
       "      <td>22</td>\n",
       "      <td>55</td>\n",
       "      <td>48</td>\n",
       "      <td>22</td>\n",
       "      <td>69</td>\n",
       "      <td>...</td>\n",
       "      <td>94</td>\n",
       "      <td>78</td>\n",
       "      <td>94</td>\n",
       "      <td>70</td>\n",
       "      <td>17</td>\n",
       "      <td>4</td>\n",
       "      <td>27</td>\n",
       "      <td>11</td>\n",
       "      <td>21</td>\n",
       "      <td>79</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>58</td>\n",
       "      <td>3</td>\n",
       "      <td>35</td>\n",
       "      <td>73</td>\n",
       "      <td>5</td>\n",
       "      <td>27</td>\n",
       "      <td>1</td>\n",
       "      <td>99</td>\n",
       "      <td>80</td>\n",
       "      <td>34</td>\n",
       "      <td>...</td>\n",
       "      <td>57</td>\n",
       "      <td>98</td>\n",
       "      <td>43</td>\n",
       "      <td>99</td>\n",
       "      <td>92</td>\n",
       "      <td>32</td>\n",
       "      <td>83</td>\n",
       "      <td>60</td>\n",
       "      <td>69</td>\n",
       "      <td>84</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>28</td>\n",
       "      <td>89</td>\n",
       "      <td>26</td>\n",
       "      <td>21</td>\n",
       "      <td>95</td>\n",
       "      <td>33</td>\n",
       "      <td>77</td>\n",
       "      <td>83</td>\n",
       "      <td>46</td>\n",
       "      <td>76</td>\n",
       "      <td>...</td>\n",
       "      <td>19</td>\n",
       "      <td>38</td>\n",
       "      <td>9</td>\n",
       "      <td>59</td>\n",
       "      <td>7</td>\n",
       "      <td>86</td>\n",
       "      <td>63</td>\n",
       "      <td>17</td>\n",
       "      <td>34</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>71</td>\n",
       "      <td>14</td>\n",
       "      <td>67</td>\n",
       "      <td>84</td>\n",
       "      <td>17</td>\n",
       "      <td>38</td>\n",
       "      <td>33</td>\n",
       "      <td>99</td>\n",
       "      <td>73</td>\n",
       "      <td>8</td>\n",
       "      <td>...</td>\n",
       "      <td>84</td>\n",
       "      <td>41</td>\n",
       "      <td>81</td>\n",
       "      <td>96</td>\n",
       "      <td>30</td>\n",
       "      <td>27</td>\n",
       "      <td>79</td>\n",
       "      <td>58</td>\n",
       "      <td>17</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>96</td>\n",
       "      <td>66</td>\n",
       "      <td>14</td>\n",
       "      <td>11</td>\n",
       "      <td>28</td>\n",
       "      <td>13</td>\n",
       "      <td>52</td>\n",
       "      <td>1</td>\n",
       "      <td>48</td>\n",
       "      <td>96</td>\n",
       "      <td>...</td>\n",
       "      <td>97</td>\n",
       "      <td>24</td>\n",
       "      <td>34</td>\n",
       "      <td>83</td>\n",
       "      <td>40</td>\n",
       "      <td>29</td>\n",
       "      <td>28</td>\n",
       "      <td>0</td>\n",
       "      <td>24</td>\n",
       "      <td>46</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>10 rows x 64 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   col_0  col_1  col_2  col_3  col_4  col_5  col_6  col_7  col_8  col_9  \\\n",
       "0     -1     -1     -1     -1     -1     76     66     61     33     57   \n",
       "1     -1     -1     -1     -1     -1     51     95     39     85     97   \n",
       "2     -1     -1     -1     -1     -1     10     80     36     75     46   \n",
       "3     -1     -1     -1     -1     -1     56     54     52     60     49   \n",
       "4     -1     -1     -1     -1     -1     84      6     19     46      4   \n",
       "5     25     76     30     62      3     22     55     48     22     69   \n",
       "6     58      3     35     73      5     27      1     99     80     34   \n",
       "7     28     89     26     21     95     33     77     83     46     76   \n",
       "8     71     14     67     84     17     38     33     99     73      8   \n",
       "9     96     66     14     11     28     13     52      1     48     96   \n",
       "\n",
       "    ...    col_54  col_55  col_56  col_57  col_58  col_59  col_60  col_61  \\\n",
       "0   ...        53      46      19      37      47      14      22      91   \n",
       "1   ...        90      72       0      89      41      58      97      45   \n",
       "2   ...        98      22      99      17      42      47      76      45   \n",
       "3   ...        23       6      99      42      71       7      82      87   \n",
       "4   ...        45      98      12      92      85       8      57      96   \n",
       "5   ...        94      78      94      70      17       4      27      11   \n",
       "6   ...        57      98      43      99      92      32      83      60   \n",
       "7   ...        19      38       9      59       7      86      63      17   \n",
       "8   ...        84      41      81      96      30      27      79      58   \n",
       "9   ...        97      24      34      83      40      29      28       0   \n",
       "\n",
       "   col_62  col_63  \n",
       "0      84      90  \n",
       "1      78      33  \n",
       "2      32      27  \n",
       "3      85      30  \n",
       "4      92      60  \n",
       "5      21      79  \n",
       "6      69      84  \n",
       "7      34      10  \n",
       "8      17       1  \n",
       "9      24      46  \n",
       "\n",
       "[10 rows x 64 columns]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Notice the first 5 columns are -1 for the first 5 rows.\n",
    "df.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make the \n",
    "df.iloc[:6, -6:] = df.iloc[:6, :6]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>col_0</th>\n",
       "      <th>col_1</th>\n",
       "      <th>col_2</th>\n",
       "      <th>col_3</th>\n",
       "      <th>col_4</th>\n",
       "      <th>col_5</th>\n",
       "      <th>col_6</th>\n",
       "      <th>col_7</th>\n",
       "      <th>col_8</th>\n",
       "      <th>col_9</th>\n",
       "      <th>...</th>\n",
       "      <th>col_54</th>\n",
       "      <th>col_55</th>\n",
       "      <th>col_56</th>\n",
       "      <th>col_57</th>\n",
       "      <th>col_58</th>\n",
       "      <th>col_59</th>\n",
       "      <th>col_60</th>\n",
       "      <th>col_61</th>\n",
       "      <th>col_62</th>\n",
       "      <th>col_63</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>76</td>\n",
       "      <td>66</td>\n",
       "      <td>61</td>\n",
       "      <td>33</td>\n",
       "      <td>57</td>\n",
       "      <td>...</td>\n",
       "      <td>53</td>\n",
       "      <td>46</td>\n",
       "      <td>19</td>\n",
       "      <td>37</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>76</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>51</td>\n",
       "      <td>95</td>\n",
       "      <td>39</td>\n",
       "      <td>85</td>\n",
       "      <td>97</td>\n",
       "      <td>...</td>\n",
       "      <td>90</td>\n",
       "      <td>72</td>\n",
       "      <td>0</td>\n",
       "      <td>89</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>51</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>10</td>\n",
       "      <td>80</td>\n",
       "      <td>36</td>\n",
       "      <td>75</td>\n",
       "      <td>46</td>\n",
       "      <td>...</td>\n",
       "      <td>98</td>\n",
       "      <td>22</td>\n",
       "      <td>99</td>\n",
       "      <td>17</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>56</td>\n",
       "      <td>54</td>\n",
       "      <td>52</td>\n",
       "      <td>60</td>\n",
       "      <td>49</td>\n",
       "      <td>...</td>\n",
       "      <td>23</td>\n",
       "      <td>6</td>\n",
       "      <td>99</td>\n",
       "      <td>42</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>56</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>84</td>\n",
       "      <td>6</td>\n",
       "      <td>19</td>\n",
       "      <td>46</td>\n",
       "      <td>4</td>\n",
       "      <td>...</td>\n",
       "      <td>45</td>\n",
       "      <td>98</td>\n",
       "      <td>12</td>\n",
       "      <td>92</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>84</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>25</td>\n",
       "      <td>76</td>\n",
       "      <td>30</td>\n",
       "      <td>62</td>\n",
       "      <td>3</td>\n",
       "      <td>22</td>\n",
       "      <td>55</td>\n",
       "      <td>48</td>\n",
       "      <td>22</td>\n",
       "      <td>69</td>\n",
       "      <td>...</td>\n",
       "      <td>94</td>\n",
       "      <td>78</td>\n",
       "      <td>94</td>\n",
       "      <td>70</td>\n",
       "      <td>25</td>\n",
       "      <td>76</td>\n",
       "      <td>30</td>\n",
       "      <td>62</td>\n",
       "      <td>3</td>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>58</td>\n",
       "      <td>3</td>\n",
       "      <td>35</td>\n",
       "      <td>73</td>\n",
       "      <td>5</td>\n",
       "      <td>27</td>\n",
       "      <td>1</td>\n",
       "      <td>99</td>\n",
       "      <td>80</td>\n",
       "      <td>34</td>\n",
       "      <td>...</td>\n",
       "      <td>57</td>\n",
       "      <td>98</td>\n",
       "      <td>43</td>\n",
       "      <td>99</td>\n",
       "      <td>92</td>\n",
       "      <td>32</td>\n",
       "      <td>83</td>\n",
       "      <td>60</td>\n",
       "      <td>69</td>\n",
       "      <td>84</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>28</td>\n",
       "      <td>89</td>\n",
       "      <td>26</td>\n",
       "      <td>21</td>\n",
       "      <td>95</td>\n",
       "      <td>33</td>\n",
       "      <td>77</td>\n",
       "      <td>83</td>\n",
       "      <td>46</td>\n",
       "      <td>76</td>\n",
       "      <td>...</td>\n",
       "      <td>19</td>\n",
       "      <td>38</td>\n",
       "      <td>9</td>\n",
       "      <td>59</td>\n",
       "      <td>7</td>\n",
       "      <td>86</td>\n",
       "      <td>63</td>\n",
       "      <td>17</td>\n",
       "      <td>34</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>71</td>\n",
       "      <td>14</td>\n",
       "      <td>67</td>\n",
       "      <td>84</td>\n",
       "      <td>17</td>\n",
       "      <td>38</td>\n",
       "      <td>33</td>\n",
       "      <td>99</td>\n",
       "      <td>73</td>\n",
       "      <td>8</td>\n",
       "      <td>...</td>\n",
       "      <td>84</td>\n",
       "      <td>41</td>\n",
       "      <td>81</td>\n",
       "      <td>96</td>\n",
       "      <td>30</td>\n",
       "      <td>27</td>\n",
       "      <td>79</td>\n",
       "      <td>58</td>\n",
       "      <td>17</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>96</td>\n",
       "      <td>66</td>\n",
       "      <td>14</td>\n",
       "      <td>11</td>\n",
       "      <td>28</td>\n",
       "      <td>13</td>\n",
       "      <td>52</td>\n",
       "      <td>1</td>\n",
       "      <td>48</td>\n",
       "      <td>96</td>\n",
       "      <td>...</td>\n",
       "      <td>97</td>\n",
       "      <td>24</td>\n",
       "      <td>34</td>\n",
       "      <td>83</td>\n",
       "      <td>40</td>\n",
       "      <td>29</td>\n",
       "      <td>28</td>\n",
       "      <td>0</td>\n",
       "      <td>24</td>\n",
       "      <td>46</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>10 rows x 64 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   col_0  col_1  col_2  col_3  col_4  col_5  col_6  col_7  col_8  col_9  \\\n",
       "0     -1     -1     -1     -1     -1     76     66     61     33     57   \n",
       "1     -1     -1     -1     -1     -1     51     95     39     85     97   \n",
       "2     -1     -1     -1     -1     -1     10     80     36     75     46   \n",
       "3     -1     -1     -1     -1     -1     56     54     52     60     49   \n",
       "4     -1     -1     -1     -1     -1     84      6     19     46      4   \n",
       "5     25     76     30     62      3     22     55     48     22     69   \n",
       "6     58      3     35     73      5     27      1     99     80     34   \n",
       "7     28     89     26     21     95     33     77     83     46     76   \n",
       "8     71     14     67     84     17     38     33     99     73      8   \n",
       "9     96     66     14     11     28     13     52      1     48     96   \n",
       "\n",
       "    ...    col_54  col_55  col_56  col_57  col_58  col_59  col_60  col_61  \\\n",
       "0   ...        53      46      19      37      -1      -1      -1      -1   \n",
       "1   ...        90      72       0      89      -1      -1      -1      -1   \n",
       "2   ...        98      22      99      17      -1      -1      -1      -1   \n",
       "3   ...        23       6      99      42      -1      -1      -1      -1   \n",
       "4   ...        45      98      12      92      -1      -1      -1      -1   \n",
       "5   ...        94      78      94      70      25      76      30      62   \n",
       "6   ...        57      98      43      99      92      32      83      60   \n",
       "7   ...        19      38       9      59       7      86      63      17   \n",
       "8   ...        84      41      81      96      30      27      79      58   \n",
       "9   ...        97      24      34      83      40      29      28       0   \n",
       "\n",
       "   col_62  col_63  \n",
       "0      -1      76  \n",
       "1      -1      51  \n",
       "2      -1      10  \n",
       "3      -1      56  \n",
       "4      -1      84  \n",
       "5       3      22  \n",
       "6      69      84  \n",
       "7      34      10  \n",
       "8      17       1  \n",
       "9      24      46  \n",
       "\n",
       "[10 rows x 64 columns]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Notice that the last 6 columns now match the first 6 columns for the first 6 rows\n",
    "df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Visit the [Modin Documentation](http://modin.readthedocs.io) for more information about APIs supported**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Improvements in runtime and efficiency</h2>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We continue to focus our efforts toward optimizing the most-used and most-requested operations. We spent some time re-architecting the system to improve its memory efficiency. In the end, we were able to reduce the memory footprint of most operations by 2-3x. The architecture is discussed at length in the [documentation](https://modin.readthedocs.io/en/latest/architecture.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>Runtime and memory efficiency of `read_csv`</h3>\n",
    "\n",
    "<h5>...or how to get more than 1GB/second of read throughput on your CSV file</h5>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`read_csv` is by far the most used operation in pandas (see the study in our [previous blog post](https://rise.cs.berkeley.edu/blog/pandas-on-ray-early-lessons/)), so naturally it is the most important operation to optimize. We reduced the memory footprint by 2x and gained approximately 30% in speed.\n",
    "\n",
    "The current `read_csv` performance is shown in the plot below:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img align=\"left\" style=\"display:inline;\" src=\"https://gist.github.com/devin-petersohn/2384d06e536df1f14519e18b3ce46ecd/raw/af2c65350dc6e17a4b90e1d4fecaeb0867627c07/modin_plot.png\">\n",
    "<br>\n",
    "The plot on the left was run on a single node with 144 cores and 1TB RAM. <b>On this large machine, Modin is able to get more than 1GB/s of throughput from its `read_csv`.</b> Reading a 2GB file takes around 2 seconds and an 18GB file takes about 16 seconds. The pandas `read_csv` scales as you would expect, but Modin is able to use the full capacity of the machine, including all 144 cores, to read the CSV file.\n",
    "<br><br>\n",
    "These improvements are primarily due to effiency in partitioning as a result of the new architecture. Additional optimizations include reducing the communication overheads and reducing the size of the task reading the file. From these improvements in efficiency, we can get a massive performance improvement over pandas, even on a 4-core desktop."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<b>The following code was run on a 2013 4-core iMac with 32GB RAM.</b>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as old_pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### pandas `read_csv`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 27.5 s, sys: 3.37 s, total: 30.9 s\n",
      "Wall time: 30.9 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "pandas_csv_df = old_pd.read_csv(\"../750MB.csv\")\n",
    "pandas_tail = pandas_csv_df.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### `modin.pandas` `read_csv`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 90.1 ms, sys: 7.27 ms, total: 97.4 ms\n",
      "Wall time: 8.06 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "modin_csv_df = pd.read_csv(\"../750MB.csv\")\n",
    "modin_tail = modin_csv_df.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "modin.pandas.dataframe.DataFrame"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(modin_tail)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**With Modin, `read_csv` performs up to 4x faster on a 4-core machine just by changing the import statement**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>Memory efficiency</h3>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As discussed in a earlier [blog post](https://rise.cs.berkeley.edu/blog/pandas-on-ray-early-lessons/), we previously had some issues with the way that pandas copies data in memory and how that interacts with Ray and Arrow. Because of the immutable shared memory abstraction, we end up creating a full copy of partitions to modify a single cell. This is an area we're continuously improving, but also one in which we've made significant strides. \n",
    "\n",
    "We reworked the internals of Modin to reduce the copying of data during certain operations. These operations, e.g. `reindex`, are memory intensive and require shuffling. To solve this issue, we developed a no-copy shuffle technique for moving the data around in Modin. As we work to develop this method of shuffling it will continue to get more efficient, but already in the current iteration we see a reduction in memory overhead by 2-3x compared to the prototype implementation. We will discuss this approach in a future blog post."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also implemented some basic garbage collection for cleaning up the partitions to reduce the memory footprint. We will continue improving the memory efficiency of Modin as the system matures."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Next Steps and Getting Involved</h2>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we mentioned in our last [blog post](https://rise.cs.berkeley.edu/blog/pandas-on-ray-early-lessons/), we are continuously improving the API coverage of Modin. We want to continue to improve the way that Data Scientists extract value from their data. Much of the effort since the last release went into rebuilding the backend. We are still working on a distributed Series object, which would drastically improve the performance of many of the methods in `DataFrame`.\n",
    "\n",
    "Multi-level index support is preliminary, and we continue working toward supporting the full capabilities of Multi-level indexing natively in Modin.\n",
    "\n",
    "If you're interested in contributing to Modin, we have put together some [documentation](https://modin.readthedocs.io/en/latest/contributing.html) for getting started. Feel free to also contact the [developer mailing list](https://groups.google.com/forum/#!forum/modin-dev) with any questions!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

## modin_plot.png

      
    Raw
  

              modin_plot.png