@devin-petersohn
Last active November 13, 2018 13:22
Modin blog - October 25, 2018
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br><br><br><b><h1>Modin (Pandas on Ray) - October 2018</h1></b>\n",
"\n",
"<h3>Accelerate your pandas workflows by changing one line of code</h3>\n",
"\n",
"<h5>Devin Petersohn</h5>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Earlier this year we started a DataFrame project, **Modin** (formerly Pandas on Ray), with a proof of concept system and wrote a [blog post](https://rise.cs.berkeley.edu/blog/pandas-on-ray) about it. The reception to our project was overwhelmingly positive, and the Python community was extremely excited at the prospect of speeding up pandas workflows by changing one line of code. Even though we got great performance numbers at the time, it was still just a prototype system that was not yet ready for production workloads. Because it was a prototype, it had a high memory overhead and suboptimal performance in some cases. Over the past several months we have made several major improvements to the internals of the system in terms of efficiency and memory footprint.\n",
"\n",
"We spent about a month redesigning and reimplementing the backend of Modin for extensibility and memory efficiency. In the remainder of this blog we will discuss what we've done to improve the performance, efficiency, and scalability of the overall system.\n",
"\n",
"**What is Modin?**\n",
"\n",
"Modin is a DataFrame library that allows you to speed up your pandas workflows by changing one line of code. We were motivated to create this tool by an interesting trend we observed. We noticed that many Data Scientists are often forced to use different tools to perform the same tasks at different data scales. Not only are these tools completely different, they expose new APIs and sometimes require user input for system information (e.g. partitioning). When we started Modin, we took the approach that Modin should be a DataFrame for datasets ranging from 1KB to 1TB+, without needing to specify partitioning. We believe Data Scientists should be spending their time extracting value from their data in the DataFrame API that they use the most: pandas.\n",
"\n",
"If you want to learn more about Modin, Ray, or read the previous blog post, check out these useful links:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Useful Links</h3>\n",
"<p>\n",
" [ <b><a href=\"https://github.com/modin-project/modin\">Modin Repo</a></b> ]\n",
" [ <b><a href=\"http://modin.readthedocs.io\">Modin Documentation</a></b> ]\n",
" [ <b><a href=\"https://rise.cs.berkeley.edu/blog/pandas-on-ray-early-lessons/\">Previous blog post</a></b> ]\n",
" [ <b><a href=\"https://github.com/ray-project/ray\">Ray Repo</a></b> ]\n",
" [ <b><a href=\"http://ray.readthedocs.io/en/latest/index.html\">Ray Documentation</a></b> ]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Blog Post Contents</h3>\n",
"<p>\n",
" [ <b>New features</b> ]\n",
" [ <b>New additions to the API</b> ]\n",
" [ <b>Improvements in runtime & efficiency</b> ]\n",
" [ <b>Next Steps & Getting involved</b> ]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Modin can be installed with `pip`**:\n",
"\n",
"```\n",
"pip install modin\n",
"```\n",
"\n",
"**For more information on Modin, visit the [Modin documentation](http://modin.readthedocs.io)!**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>To use Modin to improve your pandas workloads, you only need to change your import statement:</b>"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Process STDOUT and STDERR is being redirected to /tmp/raylogs/.\n",
"Waiting for redis server at 127.0.0.1:35672 to respond...\n",
"Waiting for redis server at 127.0.0.1:55334 to respond...\n",
"Starting the Plasma object store with 27.00 GB memory.\n"
]
}
],
"source": [
"# import pandas as pd\n",
"import modin.pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2> New feature in Modin - Default to pandas implementation! </h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Modin now features a way to complete all operations, even those not yet implemented or optimized. This makes the system usable for notebooks that use operations not yet implemented in Modin.\n",
"\n",
"There is a performance penalty for going to/from a pandas DataFrame, but we felt it was important to allow **all** users to use Modin as we continue to improve the API coverage."
]
},
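{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fallback idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Modin's actual implementation: `PartitionedFrame`, `to_local`, `row_sums`, and `default_to_local` are hypothetical names, and nested lists stand in for DataFrame partitions.\n",
"\n",
"```python\n",
"import warnings\n",
"\n",
"\n",
"class PartitionedFrame:\n",
"    # Hypothetical stand-in for a distributed frame: rows are stored in chunks.\n",
"    def __init__(self, rows, chunk_size=2):\n",
"        self.chunk_size = chunk_size\n",
"        self.chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]\n",
"\n",
"    def to_local(self):\n",
"        # Gather every chunk into one in-memory table (the costly step).\n",
"        return [row for chunk in self.chunks for row in chunk]\n",
"\n",
"    def row_sums(self):\n",
"        # An operation implemented natively, chunk by chunk.\n",
"        return [sum(row) for chunk in self.chunks for row in chunk]\n",
"\n",
"    def default_to_local(self, func):\n",
"        # Fallback path: warn, run func on the gathered table, re-partition the result.\n",
"        warnings.warn('Defaulting to local implementation', UserWarning)\n",
"        return PartitionedFrame(func(self.to_local()), self.chunk_size)\n",
"\n",
"\n",
"frame = PartitionedFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])\n",
"# Transpose is not implemented natively here, so it goes through the fallback.\n",
"transposed = frame.default_to_local(lambda rows: [list(col) for col in zip(*rows)])\n",
"print(type(transposed).__name__, transposed.to_local())\n",
"```\n",
"\n",
"In this sketch the round trip through `to_local` is where the performance penalty comes from: the whole table is gathered in one place, processed, and split into chunks again."
]
},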
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"frame_data = np.random.randint(0, 100, size=(2**10, 2**6))\n",
"df = pd.DataFrame(frame_data).add_prefix(\"col_\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"modin.pandas.dataframe.DataFrame"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(df)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col_0</th>\n",
" <th>col_1</th>\n",
" <th>col_2</th>\n",
" <th>col_3</th>\n",
" <th>col_4</th>\n",
" <th>col_5</th>\n",
" <th>col_6</th>\n",
" <th>col_7</th>\n",
" <th>col_8</th>\n",
" <th>col_9</th>\n",
" <th>...</th>\n",
" <th>col_54</th>\n",
" <th>col_55</th>\n",
" <th>col_56</th>\n",
" <th>col_57</th>\n",
" <th>col_58</th>\n",
" <th>col_59</th>\n",
" <th>col_60</th>\n",
" <th>col_61</th>\n",
" <th>col_62</th>\n",
" <th>col_63</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>18</td>\n",
" <td>28</td>\n",
" <td>84</td>\n",
" <td>79</td>\n",
" <td>51</td>\n",
" <td>76</td>\n",
" <td>66</td>\n",
" <td>61</td>\n",
" <td>33</td>\n",
" <td>57</td>\n",
" <td>...</td>\n",
" <td>53</td>\n",
" <td>46</td>\n",
" <td>19</td>\n",
" <td>37</td>\n",
" <td>47</td>\n",
" <td>14</td>\n",
" <td>22</td>\n",
" <td>91</td>\n",
" <td>84</td>\n",
" <td>90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>29</td>\n",
" <td>32</td>\n",
" <td>23</td>\n",
" <td>62</td>\n",
" <td>51</td>\n",
" <td>95</td>\n",
" <td>39</td>\n",
" <td>85</td>\n",
" <td>97</td>\n",
" <td>...</td>\n",
" <td>90</td>\n",
" <td>72</td>\n",
" <td>0</td>\n",
" <td>89</td>\n",
" <td>41</td>\n",
" <td>58</td>\n",
" <td>97</td>\n",
" <td>45</td>\n",
" <td>78</td>\n",
" <td>33</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>20</td>\n",
" <td>80</td>\n",
" <td>46</td>\n",
" <td>68</td>\n",
" <td>30</td>\n",
" <td>10</td>\n",
" <td>80</td>\n",
" <td>36</td>\n",
" <td>75</td>\n",
" <td>46</td>\n",
" <td>...</td>\n",
" <td>98</td>\n",
" <td>22</td>\n",
" <td>99</td>\n",
" <td>17</td>\n",
" <td>42</td>\n",
" <td>47</td>\n",
" <td>76</td>\n",
" <td>45</td>\n",
" <td>32</td>\n",
" <td>27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>89</td>\n",
" <td>5</td>\n",
" <td>51</td>\n",
" <td>4</td>\n",
" <td>36</td>\n",
" <td>56</td>\n",
" <td>54</td>\n",
" <td>52</td>\n",
" <td>60</td>\n",
" <td>49</td>\n",
" <td>...</td>\n",
" <td>23</td>\n",
" <td>6</td>\n",
" <td>99</td>\n",
" <td>42</td>\n",
" <td>71</td>\n",
" <td>7</td>\n",
" <td>82</td>\n",
" <td>87</td>\n",
" <td>85</td>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>17</td>\n",
" <td>87</td>\n",
" <td>24</td>\n",
" <td>58</td>\n",
" <td>65</td>\n",
" <td>84</td>\n",
" <td>6</td>\n",
" <td>19</td>\n",
" <td>46</td>\n",
" <td>4</td>\n",
" <td>...</td>\n",
" <td>45</td>\n",
" <td>98</td>\n",
" <td>12</td>\n",
" <td>92</td>\n",
" <td>85</td>\n",
" <td>8</td>\n",
" <td>57</td>\n",
" <td>96</td>\n",
" <td>92</td>\n",
" <td>60</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows x 64 columns</p>\n",
"</div>"
],
"text/plain": [
" col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 \\\n",
"0 18 28 84 79 51 76 66 61 33 57 \n",
"1 2 29 32 23 62 51 95 39 85 97 \n",
"2 20 80 46 68 30 10 80 36 75 46 \n",
"3 89 5 51 4 36 56 54 52 60 49 \n",
"4 17 87 24 58 65 84 6 19 46 4 \n",
"\n",
" ... col_54 col_55 col_56 col_57 col_58 col_59 col_60 col_61 \\\n",
"0 ... 53 46 19 37 47 14 22 91 \n",
"1 ... 90 72 0 89 41 58 97 45 \n",
"2 ... 98 22 99 17 42 47 76 45 \n",
"3 ... 23 6 99 42 71 7 82 87 \n",
"4 ... 45 98 12 92 85 8 57 96 \n",
"\n",
" col_62 col_63 \n",
"0 84 90 \n",
"1 78 33 \n",
"2 32 27 \n",
"3 85 30 \n",
"4 92 60 \n",
"\n",
"[5 rows x 64 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>When defaulting to pandas, you will see a warning:</h3>"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/DevinPetersohn/software_builds/modin/modin/pandas/dataframe.py:5064: UserWarning: Defaulting to Pandas implementation\n",
" warnings.warn(\"Defaulting to Pandas implementation\", UserWarning)\n"
]
}
],
"source": [
"dot_df = df.dot(df.T)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>\n",
" We return the DataFrame to a distributed Modin DataFrame once the computation is complete.\n",
"</p>"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"modin.pandas.dataframe.DataFrame"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(dot_df)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>...</th>\n",
" <th>1014</th>\n",
" <th>1015</th>\n",
" <th>1016</th>\n",
" <th>1017</th>\n",
" <th>1018</th>\n",
" <th>1019</th>\n",
" <th>1020</th>\n",
" <th>1021</th>\n",
" <th>1022</th>\n",
" <th>1023</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>168579</td>\n",
" <td>149654</td>\n",
" <td>130667</td>\n",
" <td>159250</td>\n",
" <td>142854</td>\n",
" <td>138115</td>\n",
" <td>160633</td>\n",
" <td>128078</td>\n",
" <td>141950</td>\n",
" <td>129142</td>\n",
" <td>...</td>\n",
" <td>144672</td>\n",
" <td>160224</td>\n",
" <td>147642</td>\n",
" <td>139209</td>\n",
" <td>133699</td>\n",
" <td>148443</td>\n",
" <td>149135</td>\n",
" <td>150800</td>\n",
" <td>157532</td>\n",
" <td>143403</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>149654</td>\n",
" <td>230984</td>\n",
" <td>149733</td>\n",
" <td>191966</td>\n",
" <td>164851</td>\n",
" <td>157974</td>\n",
" <td>178768</td>\n",
" <td>165793</td>\n",
" <td>153779</td>\n",
" <td>159321</td>\n",
" <td>...</td>\n",
" <td>173161</td>\n",
" <td>187568</td>\n",
" <td>163921</td>\n",
" <td>155004</td>\n",
" <td>144276</td>\n",
" <td>160195</td>\n",
" <td>177890</td>\n",
" <td>193852</td>\n",
" <td>179666</td>\n",
" <td>169774</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>130667</td>\n",
" <td>149733</td>\n",
" <td>176962</td>\n",
" <td>150915</td>\n",
" <td>130242</td>\n",
" <td>140749</td>\n",
" <td>135845</td>\n",
" <td>130759</td>\n",
" <td>137752</td>\n",
" <td>141496</td>\n",
" <td>...</td>\n",
" <td>148870</td>\n",
" <td>159224</td>\n",
" <td>138752</td>\n",
" <td>125384</td>\n",
" <td>112177</td>\n",
" <td>142116</td>\n",
" <td>140801</td>\n",
" <td>168335</td>\n",
" <td>155353</td>\n",
" <td>137603</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>159250</td>\n",
" <td>191966</td>\n",
" <td>150915</td>\n",
" <td>234321</td>\n",
" <td>159415</td>\n",
" <td>151796</td>\n",
" <td>176795</td>\n",
" <td>144677</td>\n",
" <td>161011</td>\n",
" <td>156245</td>\n",
" <td>...</td>\n",
" <td>184012</td>\n",
" <td>188178</td>\n",
" <td>159772</td>\n",
" <td>160776</td>\n",
" <td>147160</td>\n",
" <td>162679</td>\n",
" <td>188875</td>\n",
" <td>184331</td>\n",
" <td>175773</td>\n",
" <td>168834</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>142854</td>\n",
" <td>164851</td>\n",
" <td>130242</td>\n",
" <td>159415</td>\n",
" <td>189329</td>\n",
" <td>134701</td>\n",
" <td>159938</td>\n",
" <td>139161</td>\n",
" <td>137310</td>\n",
" <td>133352</td>\n",
" <td>...</td>\n",
" <td>134250</td>\n",
" <td>155996</td>\n",
" <td>146016</td>\n",
" <td>139073</td>\n",
" <td>129364</td>\n",
" <td>141505</td>\n",
" <td>149731</td>\n",
" <td>164819</td>\n",
" <td>158681</td>\n",
" <td>135631</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows x 1024 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3 4 5 6 7 8 \\\n",
"0 168579 149654 130667 159250 142854 138115 160633 128078 141950 \n",
"1 149654 230984 149733 191966 164851 157974 178768 165793 153779 \n",
"2 130667 149733 176962 150915 130242 140749 135845 130759 137752 \n",
"3 159250 191966 150915 234321 159415 151796 176795 144677 161011 \n",
"4 142854 164851 130242 159415 189329 134701 159938 139161 137310 \n",
"\n",
" 9 ... 1014 1015 1016 1017 1018 1019 1020 \\\n",
"0 129142 ... 144672 160224 147642 139209 133699 148443 149135 \n",
"1 159321 ... 173161 187568 163921 155004 144276 160195 177890 \n",
"2 141496 ... 148870 159224 138752 125384 112177 142116 140801 \n",
"3 156245 ... 184012 188178 159772 160776 147160 162679 188875 \n",
"4 133352 ... 134250 155996 146016 139073 129364 141505 149731 \n",
"\n",
" 1021 1022 1023 \n",
"0 150800 157532 143403 \n",
"1 193852 179666 169774 \n",
"2 168335 155353 137603 \n",
"3 184331 175773 168834 \n",
"4 164819 158681 135631 \n",
"\n",
"[5 rows x 1024 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dot_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>New additions to the API</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since our last blog post, we improved our overall API coverage by over 17%. Currently, this accounts for 93% of usage based on our recent study of how Data Scientists use pandas.\n",
"\n",
"Because we chose to expose an API identical to the pandas API, we designed the system around supporting even the more challenging and obscure API components. In certain architectures, it is inefficient or impossible to perform operations such as `iloc`. In Modin, we can also support the assignment of values based on a slice of the DataFrame from `iloc`:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"df.iloc[:5, :5] = -1"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col_0</th>\n",
" <th>col_1</th>\n",
" <th>col_2</th>\n",
" <th>col_3</th>\n",
" <th>col_4</th>\n",
" <th>col_5</th>\n",
" <th>col_6</th>\n",
" <th>col_7</th>\n",
" <th>col_8</th>\n",
" <th>col_9</th>\n",
" <th>...</th>\n",
" <th>col_54</th>\n",
" <th>col_55</th>\n",
" <th>col_56</th>\n",
" <th>col_57</th>\n",
" <th>col_58</th>\n",
" <th>col_59</th>\n",
" <th>col_60</th>\n",
" <th>col_61</th>\n",
" <th>col_62</th>\n",
" <th>col_63</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>76</td>\n",
" <td>66</td>\n",
" <td>61</td>\n",
" <td>33</td>\n",
" <td>57</td>\n",
" <td>...</td>\n",
" <td>53</td>\n",
" <td>46</td>\n",
" <td>19</td>\n",
" <td>37</td>\n",
" <td>47</td>\n",
" <td>14</td>\n",
" <td>22</td>\n",
" <td>91</td>\n",
" <td>84</td>\n",
" <td>90</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>51</td>\n",
" <td>95</td>\n",
" <td>39</td>\n",
" <td>85</td>\n",
" <td>97</td>\n",
" <td>...</td>\n",
" <td>90</td>\n",
" <td>72</td>\n",
" <td>0</td>\n",
" <td>89</td>\n",
" <td>41</td>\n",
" <td>58</td>\n",
" <td>97</td>\n",
" <td>45</td>\n",
" <td>78</td>\n",
" <td>33</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>10</td>\n",
" <td>80</td>\n",
" <td>36</td>\n",
" <td>75</td>\n",
" <td>46</td>\n",
" <td>...</td>\n",
" <td>98</td>\n",
" <td>22</td>\n",
" <td>99</td>\n",
" <td>17</td>\n",
" <td>42</td>\n",
" <td>47</td>\n",
" <td>76</td>\n",
" <td>45</td>\n",
" <td>32</td>\n",
" <td>27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>56</td>\n",
" <td>54</td>\n",
" <td>52</td>\n",
" <td>60</td>\n",
" <td>49</td>\n",
" <td>...</td>\n",
" <td>23</td>\n",
" <td>6</td>\n",
" <td>99</td>\n",
" <td>42</td>\n",
" <td>71</td>\n",
" <td>7</td>\n",
" <td>82</td>\n",
" <td>87</td>\n",
" <td>85</td>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>84</td>\n",
" <td>6</td>\n",
" <td>19</td>\n",
" <td>46</td>\n",
" <td>4</td>\n",
" <td>...</td>\n",
" <td>45</td>\n",
" <td>98</td>\n",
" <td>12</td>\n",
" <td>92</td>\n",
" <td>85</td>\n",
" <td>8</td>\n",
" <td>57</td>\n",
" <td>96</td>\n",
" <td>92</td>\n",
" <td>60</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>25</td>\n",
" <td>76</td>\n",
" <td>30</td>\n",
" <td>62</td>\n",
" <td>3</td>\n",
" <td>22</td>\n",
" <td>55</td>\n",
" <td>48</td>\n",
" <td>22</td>\n",
" <td>69</td>\n",
" <td>...</td>\n",
" <td>94</td>\n",
" <td>78</td>\n",
" <td>94</td>\n",
" <td>70</td>\n",
" <td>17</td>\n",
" <td>4</td>\n",
" <td>27</td>\n",
" <td>11</td>\n",
" <td>21</td>\n",
" <td>79</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>58</td>\n",
" <td>3</td>\n",
" <td>35</td>\n",
" <td>73</td>\n",
" <td>5</td>\n",
" <td>27</td>\n",
" <td>1</td>\n",
" <td>99</td>\n",
" <td>80</td>\n",
" <td>34</td>\n",
" <td>...</td>\n",
" <td>57</td>\n",
" <td>98</td>\n",
" <td>43</td>\n",
" <td>99</td>\n",
" <td>92</td>\n",
" <td>32</td>\n",
" <td>83</td>\n",
" <td>60</td>\n",
" <td>69</td>\n",
" <td>84</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>28</td>\n",
" <td>89</td>\n",
" <td>26</td>\n",
" <td>21</td>\n",
" <td>95</td>\n",
" <td>33</td>\n",
" <td>77</td>\n",
" <td>83</td>\n",
" <td>46</td>\n",
" <td>76</td>\n",
" <td>...</td>\n",
" <td>19</td>\n",
" <td>38</td>\n",
" <td>9</td>\n",
" <td>59</td>\n",
" <td>7</td>\n",
" <td>86</td>\n",
" <td>63</td>\n",
" <td>17</td>\n",
" <td>34</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>71</td>\n",
" <td>14</td>\n",
" <td>67</td>\n",
" <td>84</td>\n",
" <td>17</td>\n",
" <td>38</td>\n",
" <td>33</td>\n",
" <td>99</td>\n",
" <td>73</td>\n",
" <td>8</td>\n",
" <td>...</td>\n",
" <td>84</td>\n",
" <td>41</td>\n",
" <td>81</td>\n",
" <td>96</td>\n",
" <td>30</td>\n",
" <td>27</td>\n",
" <td>79</td>\n",
" <td>58</td>\n",
" <td>17</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>96</td>\n",
" <td>66</td>\n",
" <td>14</td>\n",
" <td>11</td>\n",
" <td>28</td>\n",
" <td>13</td>\n",
" <td>52</td>\n",
" <td>1</td>\n",
" <td>48</td>\n",
" <td>96</td>\n",
" <td>...</td>\n",
" <td>97</td>\n",
" <td>24</td>\n",
" <td>34</td>\n",
" <td>83</td>\n",
" <td>40</td>\n",
" <td>29</td>\n",
" <td>28</td>\n",
" <td>0</td>\n",
" <td>24</td>\n",
" <td>46</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10 rows x 64 columns</p>\n",
"</div>"
],
"text/plain": [
" col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 \\\n",
"0 -1 -1 -1 -1 -1 76 66 61 33 57 \n",
"1 -1 -1 -1 -1 -1 51 95 39 85 97 \n",
"2 -1 -1 -1 -1 -1 10 80 36 75 46 \n",
"3 -1 -1 -1 -1 -1 56 54 52 60 49 \n",
"4 -1 -1 -1 -1 -1 84 6 19 46 4 \n",
"5 25 76 30 62 3 22 55 48 22 69 \n",
"6 58 3 35 73 5 27 1 99 80 34 \n",
"7 28 89 26 21 95 33 77 83 46 76 \n",
"8 71 14 67 84 17 38 33 99 73 8 \n",
"9 96 66 14 11 28 13 52 1 48 96 \n",
"\n",
" ... col_54 col_55 col_56 col_57 col_58 col_59 col_60 col_61 \\\n",
"0 ... 53 46 19 37 47 14 22 91 \n",
"1 ... 90 72 0 89 41 58 97 45 \n",
"2 ... 98 22 99 17 42 47 76 45 \n",
"3 ... 23 6 99 42 71 7 82 87 \n",
"4 ... 45 98 12 92 85 8 57 96 \n",
"5 ... 94 78 94 70 17 4 27 11 \n",
"6 ... 57 98 43 99 92 32 83 60 \n",
"7 ... 19 38 9 59 7 86 63 17 \n",
"8 ... 84 41 81 96 30 27 79 58 \n",
"9 ... 97 24 34 83 40 29 28 0 \n",
"\n",
" col_62 col_63 \n",
"0 84 90 \n",
"1 78 33 \n",
"2 32 27 \n",
"3 85 30 \n",
"4 92 60 \n",
"5 21 79 \n",
"6 69 84 \n",
"7 34 10 \n",
"8 17 1 \n",
"9 24 46 \n",
"\n",
"[10 rows x 64 columns]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Notice the first 5 columns are -1 for the first 5 rows.\n",
"df.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# Make the \n",
"df.iloc[:6, -6:] = df.iloc[:6, :6]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col_0</th>\n",
" <th>col_1</th>\n",
" <th>col_2</th>\n",
" <th>col_3</th>\n",
" <th>col_4</th>\n",
" <th>col_5</th>\n",
" <th>col_6</th>\n",
" <th>col_7</th>\n",
" <th>col_8</th>\n",
" <th>col_9</th>\n",
" <th>...</th>\n",
" <th>col_54</th>\n",
" <th>col_55</th>\n",
" <th>col_56</th>\n",
" <th>col_57</th>\n",
" <th>col_58</th>\n",
" <th>col_59</th>\n",
" <th>col_60</th>\n",
" <th>col_61</th>\n",
" <th>col_62</th>\n",
" <th>col_63</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>76</td>\n",
" <td>66</td>\n",
" <td>61</td>\n",
" <td>33</td>\n",
" <td>57</td>\n",
" <td>...</td>\n",
" <td>53</td>\n",
" <td>46</td>\n",
" <td>19</td>\n",
" <td>37</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>76</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>51</td>\n",
" <td>95</td>\n",
" <td>39</td>\n",
" <td>85</td>\n",
" <td>97</td>\n",
" <td>...</td>\n",
" <td>90</td>\n",
" <td>72</td>\n",
" <td>0</td>\n",
" <td>89</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>51</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>10</td>\n",
" <td>80</td>\n",
" <td>36</td>\n",
" <td>75</td>\n",
" <td>46</td>\n",
" <td>...</td>\n",
" <td>98</td>\n",
" <td>22</td>\n",
" <td>99</td>\n",
" <td>17</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>56</td>\n",
" <td>54</td>\n",
" <td>52</td>\n",
" <td>60</td>\n",
" <td>49</td>\n",
" <td>...</td>\n",
" <td>23</td>\n",
" <td>6</td>\n",
" <td>99</td>\n",
" <td>42</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>84</td>\n",
" <td>6</td>\n",
" <td>19</td>\n",
" <td>46</td>\n",
" <td>4</td>\n",
" <td>...</td>\n",
" <td>45</td>\n",
" <td>98</td>\n",
" <td>12</td>\n",
" <td>92</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>-1</td>\n",
" <td>84</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>25</td>\n",
" <td>76</td>\n",
" <td>30</td>\n",
" <td>62</td>\n",
" <td>3</td>\n",
" <td>22</td>\n",
" <td>55</td>\n",
" <td>48</td>\n",
" <td>22</td>\n",
" <td>69</td>\n",
" <td>...</td>\n",
" <td>94</td>\n",
" <td>78</td>\n",
" <td>94</td>\n",
" <td>70</td>\n",
" <td>25</td>\n",
" <td>76</td>\n",
" <td>30</td>\n",
" <td>62</td>\n",
" <td>3</td>\n",
" <td>22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>58</td>\n",
" <td>3</td>\n",
" <td>35</td>\n",
" <td>73</td>\n",
" <td>5</td>\n",
" <td>27</td>\n",
" <td>1</td>\n",
" <td>99</td>\n",
" <td>80</td>\n",
" <td>34</td>\n",
" <td>...</td>\n",
" <td>57</td>\n",
" <td>98</td>\n",
" <td>43</td>\n",
" <td>99</td>\n",
" <td>92</td>\n",
" <td>32</td>\n",
" <td>83</td>\n",
" <td>60</td>\n",
" <td>69</td>\n",
" <td>84</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>28</td>\n",
" <td>89</td>\n",
" <td>26</td>\n",
" <td>21</td>\n",
" <td>95</td>\n",
" <td>33</td>\n",
" <td>77</td>\n",
" <td>83</td>\n",
" <td>46</td>\n",
" <td>76</td>\n",
" <td>...</td>\n",
" <td>19</td>\n",
" <td>38</td>\n",
" <td>9</td>\n",
" <td>59</td>\n",
" <td>7</td>\n",
" <td>86</td>\n",
" <td>63</td>\n",
" <td>17</td>\n",
" <td>34</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>71</td>\n",
" <td>14</td>\n",
" <td>67</td>\n",
" <td>84</td>\n",
" <td>17</td>\n",
" <td>38</td>\n",
" <td>33</td>\n",
" <td>99</td>\n",
" <td>73</td>\n",
" <td>8</td>\n",
" <td>...</td>\n",
" <td>84</td>\n",
" <td>41</td>\n",
" <td>81</td>\n",
" <td>96</td>\n",
" <td>30</td>\n",
" <td>27</td>\n",
" <td>79</td>\n",
" <td>58</td>\n",
" <td>17</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>96</td>\n",
" <td>66</td>\n",
" <td>14</td>\n",
" <td>11</td>\n",
" <td>28</td>\n",
" <td>13</td>\n",
" <td>52</td>\n",
" <td>1</td>\n",
" <td>48</td>\n",
" <td>96</td>\n",
" <td>...</td>\n",
" <td>97</td>\n",
" <td>24</td>\n",
" <td>34</td>\n",
" <td>83</td>\n",
" <td>40</td>\n",
" <td>29</td>\n",
" <td>28</td>\n",
" <td>0</td>\n",
" <td>24</td>\n",
" <td>46</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10 rows x 64 columns</p>\n",
"</div>"
],
"text/plain": [
" col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 \\\n",
"0 -1 -1 -1 -1 -1 76 66 61 33 57 \n",
"1 -1 -1 -1 -1 -1 51 95 39 85 97 \n",
"2 -1 -1 -1 -1 -1 10 80 36 75 46 \n",
"3 -1 -1 -1 -1 -1 56 54 52 60 49 \n",
"4 -1 -1 -1 -1 -1 84 6 19 46 4 \n",
"5 25 76 30 62 3 22 55 48 22 69 \n",
"6 58 3 35 73 5 27 1 99 80 34 \n",
"7 28 89 26 21 95 33 77 83 46 76 \n",
"8 71 14 67 84 17 38 33 99 73 8 \n",
"9 96 66 14 11 28 13 52 1 48 96 \n",
"\n",
" ... col_54 col_55 col_56 col_57 col_58 col_59 col_60 col_61 \\\n",
"0 ... 53 46 19 37 -1 -1 -1 -1 \n",
"1 ... 90 72 0 89 -1 -1 -1 -1 \n",
"2 ... 98 22 99 17 -1 -1 -1 -1 \n",
"3 ... 23 6 99 42 -1 -1 -1 -1 \n",
"4 ... 45 98 12 92 -1 -1 -1 -1 \n",
"5 ... 94 78 94 70 25 76 30 62 \n",
"6 ... 57 98 43 99 92 32 83 60 \n",
"7 ... 19 38 9 59 7 86 63 17 \n",
"8 ... 84 41 81 96 30 27 79 58 \n",
"9 ... 97 24 34 83 40 29 28 0 \n",
"\n",
" col_62 col_63 \n",
"0 -1 76 \n",
"1 -1 51 \n",
"2 -1 10 \n",
"3 -1 56 \n",
"4 -1 84 \n",
"5 3 22 \n",
"6 69 84 \n",
"7 34 10 \n",
"8 17 1 \n",
"9 24 46 \n",
"\n",
"[10 rows x 64 columns]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Notice that the last 6 columns now match the first 6 columns for the first 6 rows\n",
"df.head(10)"
]
},
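{
"cell_type": "markdown",
"metadata": {},
"source": [
"To give a feel for why positional indexing takes extra work in a partitioned DataFrame, here is a small sketch of how a global slice such as `:6` or `-6:` could be translated into per-partition local slices. The function `locate` and the layout of four column partitions with 16 columns each are assumptions made for illustration; they do not describe Modin's internals.\n",
"\n",
"```python\n",
"def locate(global_slice, part_sizes):\n",
"    # Map a global positional slice onto (partition index, local slice) pairs.\n",
"    start, stop, step = global_slice.indices(sum(part_sizes))\n",
"    if step != 1:\n",
"        raise ValueError('this sketch only handles contiguous slices')\n",
"    pieces, offset = [], 0\n",
"    for idx, size in enumerate(part_sizes):\n",
"        lo = max(start, offset)\n",
"        hi = min(stop, offset + size)\n",
"        if lo < hi:\n",
"            pieces.append((idx, slice(lo - offset, hi - offset)))\n",
"        offset += size\n",
"    return pieces\n",
"\n",
"\n",
"# 64 columns split into four column partitions of 16 columns each (assumed layout).\n",
"col_parts = [16, 16, 16, 16]\n",
"print(locate(slice(None, 6), col_parts))   # first 6 columns\n",
"print(locate(slice(-6, None), col_parts))  # last 6 columns\n",
"```\n",
"\n",
"With this layout, the first 6 columns fall in local columns 0 to 5 of partition 0, and the last 6 columns fall in local columns 10 to 15 of partition 3, so an assignment between the two has to move data across partitions."
]
},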
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Visit the [Modin Documentation](http://modin.readthedocs.io) for more information about APIs supported**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>Improvements in runtime and efficiency</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We continue to focus our efforts toward optimizing the most-used and most-requested operations. We spent some time re-architecting the system to improve its memory efficiency. In the end, we were able to reduce the memory footprint of most operations by 2-3x. The architecture is discussed at length in the [documentation](https://modin.readthedocs.io/en/latest/architecture.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Runtime and memory efficiency of `read_csv`</h3>\n",
"\n",
"<h5>...or how to get more than 1GB/second of read throughput on your CSV file</h5>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`read_csv` is by far the most used operation in pandas (see the study in our [previous blog post](https://rise.cs.berkeley.edu/blog/pandas-on-ray-early-lessons/)), so naturally it is the most important operation to optimize. We reduced the memory footprint by 2x and gained approximately 30% in speed.\n",
"\n",
"The current `read_csv` performance is shown in the plot below:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img align=\"left\" style=\"display:inline;\" src=\"https://gist.github.com/devin-petersohn/2384d06e536df1f14519e18b3ce46ecd/raw/af2c65350dc6e17a4b90e1d4fecaeb0867627c07/modin_plot.png\">\n",
"<br>\n",
"The plot on the left was run on a single node with 144 cores and 1TB RAM. <b>On this large machine, Modin is able to get more than 1GB/s of throughput from its `read_csv`.</b> Reading a 2GB file takes around 2 seconds and an 18GB file takes about 16 seconds. The pandas `read_csv` scales as you would expect, but Modin is able to use the full capacity of the machine, including all 144 cores, to read the CSV file.\n",
"<br><br>\n",
"These improvements are primarily due to effiency in partitioning as a result of the new architecture. Additional optimizations include reducing the communication overheads and reducing the size of the task reading the file. From these improvements in efficiency, we can get a massive performance improvement over pandas, even on a 4-core desktop."
]
},
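{
"cell_type": "markdown",
"metadata": {},
"source": [
"The partitioning idea behind a parallel CSV reader can be sketched with the standard library alone. This is an illustrative sketch, not Modin's reader: `chunk_bounds`, `parse_chunk`, and `read_rows_parallel` are hypothetical helpers, threads stand in for the worker processes a real system would use, and the sketch assumes no quoted field contains a line break.\n",
"\n",
"```python\n",
"import csv\n",
"import os\n",
"import tempfile\n",
"from concurrent.futures import ThreadPoolExecutor\n",
"\n",
"\n",
"def chunk_bounds(path, n_chunks):\n",
"    # Split the data rows into byte ranges that begin and end on line boundaries.\n",
"    size = os.path.getsize(path)\n",
"    with open(path, 'rb') as f:\n",
"        f.readline()  # skip the header row\n",
"        bounds = [f.tell()]\n",
"        step = max(1, (size - bounds[0]) // n_chunks)\n",
"        for i in range(1, n_chunks):\n",
"            f.seek(bounds[0] + i * step)\n",
"            f.readline()  # move forward to the next line boundary\n",
"            if bounds[-1] < f.tell() < size:\n",
"                bounds.append(f.tell())\n",
"    bounds.append(size)\n",
"    return list(zip(bounds[:-1], bounds[1:]))\n",
"\n",
"\n",
"def parse_chunk(path, start, end):\n",
"    # Each task reads and parses only its own byte range.\n",
"    with open(path, 'rb') as f:\n",
"        f.seek(start)\n",
"        text = f.read(end - start).decode('utf-8')\n",
"    return list(csv.reader(text.splitlines()))\n",
"\n",
"\n",
"def read_rows_parallel(path, n_chunks=4):\n",
"    bounds = chunk_bounds(path, n_chunks)\n",
"    with ThreadPoolExecutor(max_workers=n_chunks) as pool:\n",
"        parts = list(pool.map(lambda b: parse_chunk(path, *b), bounds))\n",
"    # Concatenate the partitions in file order.\n",
"    return [row for part in parts for row in part]\n",
"\n",
"\n",
"# Demo on a small temporary file.\n",
"with tempfile.NamedTemporaryFile('w', newline='', suffix='.csv', delete=False) as tmp:\n",
"    writer = csv.writer(tmp)\n",
"    writer.writerow(['a', 'b'])\n",
"    writer.writerows([i, i * i] for i in range(1000))\n",
"rows = read_rows_parallel(tmp.name)\n",
"os.remove(tmp.name)\n",
"print(len(rows), rows[0], rows[-1])\n",
"```\n",
"\n",
"Keeping each task limited to one byte range means that no single task has to hold or ship the whole file, which is the kind of saving the paragraph above attributes to smaller read tasks."
]
},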
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>The following code was run on a 2013 4-core iMac with 32GB RAM.</b>"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"import pandas as old_pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### pandas `read_csv`"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 27.5 s, sys: 3.37 s, total: 30.9 s\n",
"Wall time: 30.9 s\n"
]
}
],
"source": [
"%%time\n",
"pandas_csv_df = old_pd.read_csv(\"../750MB.csv\")\n",
"pandas_tail = pandas_csv_df.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### `modin.pandas` `read_csv`"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 90.1 ms, sys: 7.27 ms, total: 97.4 ms\n",
"Wall time: 8.06 s\n"
]
}
],
"source": [
"%%time\n",
"modin_csv_df = pd.read_csv(\"../750MB.csv\")\n",
"modin_tail = modin_csv_df.tail()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"modin.pandas.dataframe.DataFrame"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(modin_tail)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**With Modin, `read_csv` performs up to 4x faster on a 4-core machine just by changing the import statement**"
]
},
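{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check on the numbers, the speedup and throughput can be recomputed from the figures reported above. The values below are copied from the timing output and from the approximate figures quoted for the 144-core node; nothing here is a new measurement.\n",
"\n",
"```python\n",
"# Wall-clock times reported in the timing cells above (seconds).\n",
"pandas_wall = 30.9\n",
"modin_wall = 8.06\n",
"\n",
"speedup = pandas_wall / modin_wall\n",
"print(f'speedup on the 4-core iMac: {speedup:.1f}x')\n",
"\n",
"# Throughput on the 144-core node, using the approximate figures quoted earlier.\n",
"for size_gb, seconds in [(2, 2), (18, 16)]:\n",
"    print(f'{size_gb} GB in {seconds} s -> {size_gb / seconds:.2f} GB/s')\n",
"```\n",
"\n",
"The measured ratio on the iMac comes out at about 3.8x, which is consistent with the rounded figure of up to 4x."
]
},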
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Memory efficiency</h3>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As discussed in a earlier [blog post](https://rise.cs.berkeley.edu/blog/pandas-on-ray-early-lessons/), we previously had some issues with the way that pandas copies data in memory and how that interacts with Ray and Arrow. Because of the immutable shared memory abstraction, we end up creating a full copy of partitions to modify a single cell. This is an area we're continuously improving, but also one in which we've made significant strides. \n",
"\n",
"We reworked the internals of Modin to reduce the copying of data during certain operations. These operations, e.g. `reindex`, are memory intensive and require shuffling. To solve this issue, we developed a no-copy shuffle technique for moving the data around in Modin. As we work to develop this method of shuffling it will continue to get more efficient, but already in the current iteration we see a reduction in memory overhead by 2-3x compared to the prototype implementation. We will discuss this approach in a future blog post."
]
},
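{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cost of immutability described above can be illustrated with a toy model. In this sketch, tuples stand in for immutable objects in a shared store, and `set_cell` is a hypothetical helper; it shows why changing one cell forces a new copy of the partition that holds it, and it is not the no-copy shuffle technique itself.\n",
"\n",
"```python\n",
"def set_cell(partitions, part_size, row, col, value):\n",
"    # Partitions are immutable tuples of row tuples, so nothing is edited in place.\n",
"    idx, local_row = divmod(row, part_size)\n",
"    old = partitions[idx]\n",
"    new_row = old[local_row][:col] + (value,) + old[local_row][col + 1:]\n",
"    new_part = old[:local_row] + (new_row,) + old[local_row + 1:]\n",
"    # Only the touched partition is rebuilt; the others are reused by reference.\n",
"    return partitions[:idx] + (new_part,) + partitions[idx + 1:]\n",
"\n",
"\n",
"# Three partitions of two rows each.\n",
"parts = tuple(tuple((r, r) for r in range(p * 2, p * 2 + 2)) for p in range(3))\n",
"updated = set_cell(parts, part_size=2, row=3, col=0, value=-1)\n",
"print(updated[1])\n",
"print(updated[0] is parts[0], updated[2] is parts[2])\n",
"```\n",
"\n",
"The untouched partitions are shared between the old and new frames, so in this model the extra memory is proportional to the size of the one partition that changed."
]
},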
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also implemented some basic garbage collection for cleaning up the partitions to reduce the memory footprint. We will continue improving the memory efficiency of Modin as the system matures."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>Next Steps and Getting Involved</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we mentioned in our last [blog post](https://rise.cs.berkeley.edu/blog/pandas-on-ray-early-lessons/), we are continuously improving the API coverage of Modin. We want to continue to improve the way that Data Scientists extract value from their data. Much of the effort since the last release went into rebuilding the backend. We are still working on a distributed Series object, which would drastically improve the performance of many of the methods in `DataFrame`.\n",
"\n",
"Multi-level index support is preliminary, and we continue working toward supporting the full capabilities of Multi-level indexing natively in Modin.\n",
"\n",
"If you're interested in contributing to Modin, we have put together some [documentation](https://modin.readthedocs.io/en/latest/contributing.html) for getting started. Feel free to also contact the [developer mailing list](https://groups.google.com/forum/#!forum/modin-dev) with any questions!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}