@Nithanaroy
Last active December 27, 2019 12:36
blog_percentile-buckets_long-tail-distribution.ipynb
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "## Code for Percentile Buckets"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import seaborn as sns\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport numpy as np",
"execution_count": 1,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Load the prices"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df = pd.read_csv(\"item_prices.csv\")\ndf.head()",
"execution_count": 2,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 2,
"data": {
"text/plain": " price\n0 1833\n1 296\n2 199\n3 4936\n4 1595",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>price</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1833</td>\n </tr>\n <tr>\n <th>1</th>\n <td>296</td>\n </tr>\n <tr>\n <th>2</th>\n <td>199</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4936</td>\n </tr>\n <tr>\n <th>4</th>\n <td>1595</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.describe()",
"execution_count": 3,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 3,
"data": {
"text/plain": " price\ncount 989.000000\nmean 961.792720\nstd 1338.413805\nmin 5.000000\n25% 188.000000\n50% 483.000000\n75% 1140.000000\nmax 8708.000000",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>price</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>count</th>\n <td>989.000000</td>\n </tr>\n <tr>\n <th>mean</th>\n <td>961.792720</td>\n </tr>\n <tr>\n <th>std</th>\n <td>1338.413805</td>\n </tr>\n <tr>\n <th>min</th>\n <td>5.000000</td>\n </tr>\n <tr>\n <th>25%</th>\n <td>188.000000</td>\n </tr>\n <tr>\n <th>50%</th>\n <td>483.000000</td>\n </tr>\n <tr>\n <th>75%</th>\n <td>1140.000000</td>\n </tr>\n <tr>\n <th>max</th>\n <td>8708.000000</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Distribution of prices"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "plt.figure(figsize=(15, 4))\nax = sns.distplot( df.price, kde=True )\n# plt.ylabel(\"fraction of samples\")\nax.grid()",
"execution_count": 4,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": "<Figure size 1080x288 with 1 Axes>",
"image/png": "\n"
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
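"source": "That's a pretty long tail!\n\nBucket the prices into **10** equal-sized buckets, i.e. buckets holding roughly the same number of items rather than covering the same price range"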
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "num_bins = 10",
"execution_count": 5,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "bucketed_price, bins = pd.qcut(df.price, q=num_bins, labels=False, retbins=True)",
"execution_count": 6,
"outputs": []
},
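{
"metadata": {},
"cell_type": "markdown",
"source": "`qcut` places the bin edges at evenly spaced quantiles, which is what makes these *percentile* buckets. As a rough sanity check (a sketch added for illustration, not part of the original analysis), the same edges can be computed directly with `np.percentile`. The name `manual_edges` is introduced here, and the two arrays are expected to agree up to floating-point error."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# Edges at the 0th, 10th, ..., 100th percentiles of price\nmanual_edges = np.percentile(df.price, np.linspace(0, 100, num_bins + 1))\n# Compare with the edges returned by qcut above\nnp.allclose(manual_edges, bins)",
"execution_count": null,
"outputs": []
},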
{
"metadata": {},
"cell_type": "markdown",
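"source": "Boundaries of the bins identified by `qcut`. There are 11 edges for 10 bins: the first value is the lower edge of bin 0, and each later value is the upper bound of the bin before it."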
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "pd.DataFrame(np.round(bins), columns=[\"Upper Bound\"]).rename_axis(index='Bin')",
"execution_count": 7,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 7,
"data": {
"text/plain": " Upper Bound\nBin \n0 5.0\n1 50.0\n2 144.0\n3 233.0\n4 340.0\n5 483.0\n6 685.0\n7 952.0\n8 1373.0\n9 2490.0\n10 8708.0",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Upper Bound</th>\n </tr>\n <tr>\n <th>Bin</th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>5.0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>50.0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>144.0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>233.0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>340.0</td>\n </tr>\n <tr>\n <th>5</th>\n <td>483.0</td>\n </tr>\n <tr>\n <th>6</th>\n <td>685.0</td>\n </tr>\n <tr>\n <th>7</th>\n <td>952.0</td>\n </tr>\n <tr>\n <th>8</th>\n <td>1373.0</td>\n </tr>\n <tr>\n <th>9</th>\n <td>2490.0</td>\n </tr>\n <tr>\n <th>10</th>\n <td>8708.0</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
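{
"metadata": {},
"cell_type": "markdown",
"source": "The returned `bins` can be reused to bucket prices that were not part of the original data. A minimal sketch, assuming the made-up values in `new_prices` (they are not from `item_prices.csv`): `pd.cut` assigns each value to the interval it falls in, and values outside the learned range come back as `NaN`."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# Hypothetical unseen prices, made up for illustration\nnew_prices = pd.Series([12, 480, 3000], name=\"price\")\n# Reuse the edges learned by qcut; include_lowest=True keeps a value equal to the smallest edge in bin 0\nnew_buckets = pd.cut(new_prices, bins=bins, labels=False, include_lowest=True)\nnew_buckets",
"execution_count": null,
"outputs": []
},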
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df[\"price_bucketed\"] = bucketed_price\ndf.head()",
"execution_count": 8,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 8,
"data": {
"text/plain": " price price_bucketed\n0 1833 8\n1 296 3\n2 199 2\n3 4936 9\n4 1595 8",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>price</th>\n <th>price_bucketed</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1833</td>\n <td>8</td>\n </tr>\n <tr>\n <th>1</th>\n <td>296</td>\n <td>3</td>\n </tr>\n <tr>\n <th>2</th>\n <td>199</td>\n <td>2</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4936</td>\n <td>9</td>\n </tr>\n <tr>\n <th>4</th>\n <td>1595</td>\n <td>8</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "The number of examples in each bucket is now $\\approx$ equal :)"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.price_bucketed.value_counts()",
"execution_count": 9,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 9,
"data": {
"text/plain": "0 100\n9 99\n8 99\n7 99\n6 99\n4 99\n3 99\n2 99\n5 98\n1 98\nName: price_bucketed, dtype: int64"
},
"metadata": {}
}
]
},
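{
"metadata": {},
"cell_type": "markdown",
"source": "With 989 prices and 10 buckets, an exact split is not possible, so the counts differ by one or two. A quick way to see the balance (a sketch added for illustration) is to look at each bucket's share of the data, which should be close to 0.1."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# Fraction of rows per bucket, ordered by bucket index\ndf.price_bucketed.value_counts(normalize=True).sort_index()",
"execution_count": null,
"outputs": []
},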
{
"metadata": {},
"cell_type": "markdown",
"source": "Plot the distribution of `price_bucketed`"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "plt.figure(figsize=(15, 4))\nax = sns.distplot( df.price_bucketed, bins=num_bins, kde=False )\nplt.ylabel(\"count\")\nax.grid()",
"execution_count": 10,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": "<Figure size 1080x288 with 1 Axes>",
"image/png": "\n"
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Similarly, [Quantile Discretizer](https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.feature.QuantileDiscretizer) is available in Apache Spark if the data is too big for a single machine and/or is distributed across a cluster of machines."
}
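,
{
"metadata": {},
"cell_type": "markdown",
"source": "A rough PySpark equivalent of the `qcut` step might look like the sketch below. It assumes `pyspark` is installed and that `item_prices.csv` is readable by Spark; the app name and the `sdf` / `sdf_bucketed` variable names are placeholders introduced here. `QuantileDiscretizer` uses approximate quantiles, so bucket counts may differ slightly from the pandas result."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# Sketch only: assumes pyspark is installed and item_prices.csv is readable by Spark\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql import functions as F\nfrom pyspark.ml.feature import QuantileDiscretizer\n\nspark = SparkSession.builder.appName(\"percentile-buckets\").getOrCreate()\nsdf = spark.read.csv(\"item_prices.csv\", header=True, inferSchema=True)\n# Cast to double so the input column has a numeric type the discretizer accepts\nsdf = sdf.withColumn(\"price\", F.col(\"price\").cast(\"double\"))\n\n# Fit learns the split points; transform adds the bucket index column\ndiscretizer = QuantileDiscretizer(numBuckets=num_bins, inputCol=\"price\", outputCol=\"price_bucketed\")\nsdf_bucketed = discretizer.fit(sdf).transform(sdf)\nsdf_bucketed.groupBy(\"price_bucketed\").count().orderBy(\"price_bucketed\").show()",
"execution_count": null,
"outputs": []
}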
],
"metadata": {
"gist": {
"id": "fc0c34dacae7bf3fb46b2e6b6595681b",
"data": {
"description": "blog_percentile-buckets_long-tail-distribution.ipynb",
"public": true
}
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.6.8",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"toc": {
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"base_numbering": 1,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
},
"_draft": {
"nbviewer_url": "https://gist.github.com/fc0c34dacae7bf3fb46b2e6b6595681b"
}
},
"nbformat": 4,
"nbformat_minor": 2
}