Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
Pandas - Groupby and Categorical Data.ipynb
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "## Groupby\n* DataFrame.groupby ( )\n* Grouped.agg ( )\n* Grouped.filter ( )\n\n## Rank\n* Pandas.DataFrame.rank ( )\n\n## Categorcial Data\n* Pandas.Series.cat ( )\n* Pandas.getdummies ( )"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import pandas as pd ",
"execution_count": 1,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### create a test dataset to demo functions"
},
{
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"trusted": true
},
"cell_type": "code",
"source": "df = pd.DataFrame(data = {'date':['2020-01-03',\n '2020-03-04',\n '2020-04-05',\n '2020-02-01',\n '2020-02-09',\n '2020-03-12',\n '2020-03-19'\n ],\n 'name':['Peter','Max','Ella','Maria','Tom','Emma','Lisa'],\n 'rate':[4.5,3.5,4,4.5,3,3.5,5],\n 'review':['It is a really nice restaurant! I would recommend it.',\n 'Good price. Not really satisfied with the service. Not recommend.',\n 'Like it. Staff there was quite nice. Recommend.',\n 'Looooove it!!!!! Really good! Highly recommend!',\n '',\n 'Not so bad. I will probably come back again',\n 'Such a nice restaurant!'\n ]\n })\ndf",
"execution_count": 2,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 2,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>date</th>\n <th>name</th>\n <th>rate</th>\n <th>review</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>2020-01-03</td>\n <td>Peter</td>\n <td>4.5</td>\n <td>It is a really nice restaurant! I would recomm...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2020-03-04</td>\n <td>Max</td>\n <td>3.5</td>\n <td>Good price. Not really satisfied with the serv...</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2020-04-05</td>\n <td>Ella</td>\n <td>4.0</td>\n <td>Like it. Staff there was quite nice. Recommend.</td>\n </tr>\n <tr>\n <th>3</th>\n <td>2020-02-01</td>\n <td>Maria</td>\n <td>4.5</td>\n <td>Looooove it!!!!! Really good! Highly recommend!</td>\n </tr>\n <tr>\n <th>4</th>\n <td>2020-02-09</td>\n <td>Tom</td>\n <td>3.0</td>\n <td></td>\n </tr>\n <tr>\n <th>5</th>\n <td>2020-03-12</td>\n <td>Emma</td>\n <td>3.5</td>\n <td>Not so bad. I will probably come back again</td>\n </tr>\n <tr>\n <th>6</th>\n <td>2020-03-19</td>\n <td>Lisa</td>\n <td>5.0</td>\n <td>Such a nice restaurant!</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " date name rate review\n0 2020-01-03 Peter 4.5 It is a really nice restaurant! I would recomm...\n1 2020-03-04 Max 3.5 Good price. Not really satisfied with the serv...\n2 2020-04-05 Ella 4.0 Like it. Staff there was quite nice. Recommend.\n3 2020-02-01 Maria 4.5 Looooove it!!!!! Really good! Highly recommend!\n4 2020-02-09 Tom 3.0 \n5 2020-03-12 Emma 3.5 Not so bad. I will probably come back again\n6 2020-03-19 Lisa 5.0 Such a nice restaurant!"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.info()",
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"text": "<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 7 entries, 0 to 6\nData columns (total 4 columns):\ndate 7 non-null object\nname 7 non-null object\nrate 7 non-null float64\nreview 7 non-null object\ndtypes: float64(1), object(3)\nmemory usage: 304.0+ bytes\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Groupby\n\n* for single and multi index\n* applying multiple stats simultaneously\n* filter"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### DataFrame.groupby( by, level, dropna, etc ...)\n\nParameters:\n\n* by: label, or list of labels. Used to determine the groups for the groupby.\n* level: int, level name, or sequence of such, default None. If the axis is a MultiIndex (hierarchical), group by a particular level or levels.\n* dropna: bool, default True\n\n#### example : \nGroup by column names, check each group and get one of groups "
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "grouped = df.groupby(by = ['rate'])\ngrouped",
"execution_count": 5,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 5,
"data": {
"text/plain": "<pandas.core.groupby.DataFrameGroupBy object at 0x114d37160>"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "for name, group in grouped:\n print (name)\n print (group)",
"execution_count": 6,
"outputs": [
{
"output_type": "stream",
"text": "3.0\n date name rate review\n4 2020-02-09 Tom 3.0 \n3.5\n date name rate review\n1 2020-03-04 Max 3.5 Good price. Not really satisfied with the serv...\n5 2020-03-12 Emma 3.5 Not so bad. I will probably come back again\n4.0\n date name rate review\n2 2020-04-05 Ella 4.0 Like it. Staff there was quite nice. Recommend.\n4.5\n date name rate review\n0 2020-01-03 Peter 4.5 It is a really nice restaurant! I would recomm...\n3 2020-02-01 Maria 4.5 Looooove it!!!!! Really good! Highly recommend!\n5.0\n date name rate review\n6 2020-03-19 Lisa 5.0 Such a nice restaurant!\n",
"name": "stdout"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "grouped.get_group(3.5)",
"execution_count": 7,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 7,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>date</th>\n <th>name</th>\n <th>rate</th>\n <th>review</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>1</th>\n <td>2020-03-04</td>\n <td>Max</td>\n <td>3.5</td>\n <td>Good price. Not really satisfied with the serv...</td>\n </tr>\n <tr>\n <th>5</th>\n <td>2020-03-12</td>\n <td>Emma</td>\n <td>3.5</td>\n <td>Not so bad. I will probably come back again</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " date name rate review\n1 2020-03-04 Max 3.5 Good price. Not really satisfied with the serv...\n5 2020-03-12 Emma 3.5 Not so bad. I will probably come back again"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### create data with single and multi index :"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_eg = df.copy()\ndf_eg['date'] = pd.to_datetime(df_eg['date'])\ndf_eg['month'] = df_eg['date'].dt.month\ndf_eg['year'] = df_eg['date'].dt.year\ndf_eg",
"execution_count": 16,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 16,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>date</th>\n <th>name</th>\n <th>rate</th>\n <th>review</th>\n <th>month</th>\n <th>year</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>2020-01-03</td>\n <td>Peter</td>\n <td>4.5</td>\n <td>It is a really nice restaurant! I would recomm...</td>\n <td>1</td>\n <td>2020</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2020-03-04</td>\n <td>Max</td>\n <td>3.5</td>\n <td>Good price. Not really satisfied with the serv...</td>\n <td>3</td>\n <td>2020</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2020-04-05</td>\n <td>Ella</td>\n <td>4.0</td>\n <td>Like it. Staff there was quite nice. Recommend.</td>\n <td>4</td>\n <td>2020</td>\n </tr>\n <tr>\n <th>3</th>\n <td>2020-02-01</td>\n <td>Maria</td>\n <td>4.5</td>\n <td>Looooove it!!!!! Really good! Highly recommend!</td>\n <td>2</td>\n <td>2020</td>\n </tr>\n <tr>\n <th>4</th>\n <td>2020-02-09</td>\n <td>Tom</td>\n <td>3.0</td>\n <td></td>\n <td>2</td>\n <td>2020</td>\n </tr>\n <tr>\n <th>5</th>\n <td>2020-03-12</td>\n <td>Emma</td>\n <td>3.5</td>\n <td>Not so bad. I will probably come back again</td>\n <td>3</td>\n <td>2020</td>\n </tr>\n <tr>\n <th>6</th>\n <td>2020-03-19</td>\n <td>Lisa</td>\n <td>5.0</td>\n <td>Such a nice restaurant!</td>\n <td>3</td>\n <td>2020</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " date name rate review \\\n0 2020-01-03 Peter 4.5 It is a really nice restaurant! I would recomm... \n1 2020-03-04 Max 3.5 Good price. Not really satisfied with the serv... \n2 2020-04-05 Ella 4.0 Like it. Staff there was quite nice. Recommend. \n3 2020-02-01 Maria 4.5 Looooove it!!!!! Really good! Highly recommend! \n4 2020-02-09 Tom 3.0 \n5 2020-03-12 Emma 3.5 Not so bad. I will probably come back again \n6 2020-03-19 Lisa 5.0 Such a nice restaurant! \n\n month year \n0 1 2020 \n1 3 2020 \n2 4 2020 \n3 2 2020 \n4 2 2020 \n5 3 2020 \n6 3 2020 "
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_single = df_eg.copy().set_index('month')\ndf_single",
"execution_count": 17,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 17,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>date</th>\n <th>name</th>\n <th>rate</th>\n <th>review</th>\n <th>year</th>\n </tr>\n <tr>\n <th>month</th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>1</th>\n <td>2020-01-03</td>\n <td>Peter</td>\n <td>4.5</td>\n <td>It is a really nice restaurant! I would recomm...</td>\n <td>2020</td>\n </tr>\n <tr>\n <th>3</th>\n <td>2020-03-04</td>\n <td>Max</td>\n <td>3.5</td>\n <td>Good price. Not really satisfied with the serv...</td>\n <td>2020</td>\n </tr>\n <tr>\n <th>4</th>\n <td>2020-04-05</td>\n <td>Ella</td>\n <td>4.0</td>\n <td>Like it. Staff there was quite nice. Recommend.</td>\n <td>2020</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2020-02-01</td>\n <td>Maria</td>\n <td>4.5</td>\n <td>Looooove it!!!!! Really good! Highly recommend!</td>\n <td>2020</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2020-02-09</td>\n <td>Tom</td>\n <td>3.0</td>\n <td></td>\n <td>2020</td>\n </tr>\n <tr>\n <th>3</th>\n <td>2020-03-12</td>\n <td>Emma</td>\n <td>3.5</td>\n <td>Not so bad. I will probably come back again</td>\n <td>2020</td>\n </tr>\n <tr>\n <th>3</th>\n <td>2020-03-19</td>\n <td>Lisa</td>\n <td>5.0</td>\n <td>Such a nice restaurant!</td>\n <td>2020</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " date name rate \\\nmonth \n1 2020-01-03 Peter 4.5 \n3 2020-03-04 Max 3.5 \n4 2020-04-05 Ella 4.0 \n2 2020-02-01 Maria 4.5 \n2 2020-02-09 Tom 3.0 \n3 2020-03-12 Emma 3.5 \n3 2020-03-19 Lisa 5.0 \n\n review year \nmonth \n1 It is a really nice restaurant! I would recomm... 2020 \n3 Good price. Not really satisfied with the serv... 2020 \n4 Like it. Staff there was quite nice. Recommend. 2020 \n2 Looooove it!!!!! Really good! Highly recommend! 2020 \n2 2020 \n3 Not so bad. I will probably come back again 2020 \n3 Such a nice restaurant! 2020 "
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_single.index",
"execution_count": 18,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 18,
"data": {
"text/plain": "Int64Index([1, 3, 4, 2, 2, 3, 3], dtype='int64', name='month')"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### example for single index:"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_single.groupby(level = 0).count()",
"execution_count": 19,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 19,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>date</th>\n <th>name</th>\n <th>rate</th>\n <th>review</th>\n <th>year</th>\n </tr>\n <tr>\n <th>month</th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>1</th>\n <td>1</td>\n <td>1</td>\n <td>1</td>\n <td>1</td>\n <td>1</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2</td>\n <td>2</td>\n <td>2</td>\n <td>2</td>\n <td>2</td>\n </tr>\n <tr>\n <th>3</th>\n <td>3</td>\n <td>3</td>\n <td>3</td>\n <td>3</td>\n <td>3</td>\n </tr>\n <tr>\n <th>4</th>\n <td>1</td>\n <td>1</td>\n <td>1</td>\n <td>1</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " date name rate review year\nmonth \n1 1 1 1 1 1\n2 2 2 2 2 2\n3 3 3 3 3 3\n4 1 1 1 1 1"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### example for multi-index:"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_multi = df_eg.copy().set_index(['month','year'])\ndf_multi",
"execution_count": 21,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 21,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th></th>\n <th>date</th>\n <th>name</th>\n <th>rate</th>\n <th>review</th>\n </tr>\n <tr>\n <th>month</th>\n <th>year</th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>1</th>\n <th>2020</th>\n <td>2020-01-03</td>\n <td>Peter</td>\n <td>4.5</td>\n <td>It is a really nice restaurant! I would recomm...</td>\n </tr>\n <tr>\n <th>3</th>\n <th>2020</th>\n <td>2020-03-04</td>\n <td>Max</td>\n <td>3.5</td>\n <td>Good price. Not really satisfied with the serv...</td>\n </tr>\n <tr>\n <th>4</th>\n <th>2020</th>\n <td>2020-04-05</td>\n <td>Ella</td>\n <td>4.0</td>\n <td>Like it. Staff there was quite nice. Recommend.</td>\n </tr>\n <tr>\n <th rowspan=\"2\" valign=\"top\">2</th>\n <th>2020</th>\n <td>2020-02-01</td>\n <td>Maria</td>\n <td>4.5</td>\n <td>Looooove it!!!!! Really good! Highly recommend!</td>\n </tr>\n <tr>\n <th>2020</th>\n <td>2020-02-09</td>\n <td>Tom</td>\n <td>3.0</td>\n <td></td>\n </tr>\n <tr>\n <th rowspan=\"2\" valign=\"top\">3</th>\n <th>2020</th>\n <td>2020-03-12</td>\n <td>Emma</td>\n <td>3.5</td>\n <td>Not so bad. I will probably come back again</td>\n </tr>\n <tr>\n <th>2020</th>\n <td>2020-03-19</td>\n <td>Lisa</td>\n <td>5.0</td>\n <td>Such a nice restaurant!</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " date name rate \\\nmonth year \n1 2020 2020-01-03 Peter 4.5 \n3 2020 2020-03-04 Max 3.5 \n4 2020 2020-04-05 Ella 4.0 \n2 2020 2020-02-01 Maria 4.5 \n 2020 2020-02-09 Tom 3.0 \n3 2020 2020-03-12 Emma 3.5 \n 2020 2020-03-19 Lisa 5.0 \n\n review \nmonth year \n1 2020 It is a really nice restaurant! I would recomm... \n3 2020 Good price. Not really satisfied with the serv... \n4 2020 Like it. Staff there was quite nice. Recommend. \n2 2020 Looooove it!!!!! Really good! Highly recommend! \n 2020 \n3 2020 Not so bad. I will probably come back again \n 2020 Such a nice restaurant! "
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_multi.groupby(level = 1).count()",
"execution_count": 22,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 22,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>date</th>\n <th>name</th>\n <th>rate</th>\n <th>review</th>\n </tr>\n <tr>\n <th>year</th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>2020</th>\n <td>7</td>\n <td>7</td>\n <td>7</td>\n <td>7</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " date name rate review\nyear \n2020 7 7 7 7"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_multi.groupby(level = [0,1]).count()",
"execution_count": 23,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 23,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th></th>\n <th>date</th>\n <th>name</th>\n <th>rate</th>\n <th>review</th>\n </tr>\n <tr>\n <th>month</th>\n <th>year</th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>1</th>\n <th>2020</th>\n <td>1</td>\n <td>1</td>\n <td>1</td>\n <td>1</td>\n </tr>\n <tr>\n <th>2</th>\n <th>2020</th>\n <td>2</td>\n <td>2</td>\n <td>2</td>\n <td>2</td>\n </tr>\n <tr>\n <th>3</th>\n <th>2020</th>\n <td>3</td>\n <td>3</td>\n <td>3</td>\n <td>3</td>\n </tr>\n <tr>\n <th>4</th>\n <th>2020</th>\n <td>1</td>\n <td>1</td>\n <td>1</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " date name rate review\nmonth year \n1 2020 1 1 1 1\n2 2020 2 2 2 2\n3 2020 3 3 3 3\n4 2020 1 1 1 1"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### applying multiple calculation simultaneously by using aggregation\n\nOnce the GroupBy object has been created, several methods are available to perform a computation on the grouped data. \n\n#### example: "
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df",
"execution_count": 24,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 24,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>date</th>\n <th>name</th>\n <th>rate</th>\n <th>review</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>2020-01-03</td>\n <td>Peter</td>\n <td>4.5</td>\n <td>It is a really nice restaurant! I would recomm...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2020-03-04</td>\n <td>Max</td>\n <td>3.5</td>\n <td>Good price. Not really satisfied with the serv...</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2020-04-05</td>\n <td>Ella</td>\n <td>4.0</td>\n <td>Like it. Staff there was quite nice. Recommend.</td>\n </tr>\n <tr>\n <th>3</th>\n <td>2020-02-01</td>\n <td>Maria</td>\n <td>4.5</td>\n <td>Looooove it!!!!! Really good! Highly recommend!</td>\n </tr>\n <tr>\n <th>4</th>\n <td>2020-02-09</td>\n <td>Tom</td>\n <td>3.0</td>\n <td></td>\n </tr>\n <tr>\n <th>5</th>\n <td>2020-03-12</td>\n <td>Emma</td>\n <td>3.5</td>\n <td>Not so bad. I will probably come back again</td>\n </tr>\n <tr>\n <th>6</th>\n <td>2020-03-19</td>\n <td>Lisa</td>\n <td>5.0</td>\n <td>Such a nice restaurant!</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " date name rate review\n0 2020-01-03 Peter 4.5 It is a really nice restaurant! I would recomm...\n1 2020-03-04 Max 3.5 Good price. Not really satisfied with the serv...\n2 2020-04-05 Ella 4.0 Like it. Staff there was quite nice. Recommend.\n3 2020-02-01 Maria 4.5 Looooove it!!!!! Really good! Highly recommend!\n4 2020-02-09 Tom 3.0 \n5 2020-03-12 Emma 3.5 Not so bad. I will probably come back again\n6 2020-03-19 Lisa 5.0 Such a nice restaurant!"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "grouped.agg({'name':'count',\n 'date':['min','max'],\n 'review':'sum'\n })",
"execution_count": 25,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 25,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead tr th {\n text-align: left;\n }\n\n .dataframe thead tr:last-of-type th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr>\n <th></th>\n <th colspan=\"2\" halign=\"left\">date</th>\n <th>name</th>\n <th>review</th>\n </tr>\n <tr>\n <th></th>\n <th>min</th>\n <th>max</th>\n <th>count</th>\n <th>sum</th>\n </tr>\n <tr>\n <th>rate</th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>3.0</th>\n <td>2020-02-09</td>\n <td>2020-02-09</td>\n <td>1</td>\n <td></td>\n </tr>\n <tr>\n <th>3.5</th>\n <td>2020-03-04</td>\n <td>2020-03-12</td>\n <td>2</td>\n <td>Good price. Not really satisfied with the serv...</td>\n </tr>\n <tr>\n <th>4.0</th>\n <td>2020-04-05</td>\n <td>2020-04-05</td>\n <td>1</td>\n <td>Like it. Staff there was quite nice. Recommend.</td>\n </tr>\n <tr>\n <th>4.5</th>\n <td>2020-01-03</td>\n <td>2020-02-01</td>\n <td>2</td>\n <td>It is a really nice restaurant! I would recomm...</td>\n </tr>\n <tr>\n <th>5.0</th>\n <td>2020-03-19</td>\n <td>2020-03-19</td>\n <td>1</td>\n <td>Such a nice restaurant!</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " date name \\\n min max count \nrate \n3.0 2020-02-09 2020-02-09 1 \n3.5 2020-03-04 2020-03-12 2 \n4.0 2020-04-05 2020-04-05 1 \n4.5 2020-01-03 2020-02-01 2 \n5.0 2020-03-19 2020-03-19 1 \n\n review \n sum \nrate \n3.0 \n3.5 Good price. Not really satisfied with the serv... \n4.0 Like it. Staff there was quite nice. Recommend. \n4.5 It is a really nice restaurant! I would recomm... \n5.0 Such a nice restaurant! "
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### filter\n\nElements from groups are filtered if they do not satisfy the boolean criterion specified by func. Return a copy of a DataFrame excluding filtered elements.\n\n#### example for filter out the review is shorter than 5 charaters"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "grouped.count()",
"execution_count": 26,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 26,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>date</th>\n <th>name</th>\n <th>review</th>\n </tr>\n <tr>\n <th>rate</th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>3.0</th>\n <td>1</td>\n <td>1</td>\n <td>1</td>\n </tr>\n <tr>\n <th>3.5</th>\n <td>2</td>\n <td>2</td>\n <td>2</td>\n </tr>\n <tr>\n <th>4.0</th>\n <td>1</td>\n <td>1</td>\n <td>1</td>\n </tr>\n <tr>\n <th>4.5</th>\n <td>2</td>\n <td>2</td>\n <td>2</td>\n </tr>\n <tr>\n <th>5.0</th>\n <td>1</td>\n <td>1</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " date name review\nrate \n3.0 1 1 1\n3.5 2 2 2\n4.0 1 1 1\n4.5 2 2 2\n5.0 1 1 1"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "grouped.filter(lambda x: x['review'].str.len().min() > 5)",
"execution_count": 27,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 27,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>date</th>\n <th>name</th>\n <th>rate</th>\n <th>review</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>2020-01-03</td>\n <td>Peter</td>\n <td>4.5</td>\n <td>It is a really nice restaurant! I would recomm...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2020-03-04</td>\n <td>Max</td>\n <td>3.5</td>\n <td>Good price. Not really satisfied with the serv...</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2020-04-05</td>\n <td>Ella</td>\n <td>4.0</td>\n <td>Like it. Staff there was quite nice. Recommend.</td>\n </tr>\n <tr>\n <th>3</th>\n <td>2020-02-01</td>\n <td>Maria</td>\n <td>4.5</td>\n <td>Looooove it!!!!! Really good! Highly recommend!</td>\n </tr>\n <tr>\n <th>5</th>\n <td>2020-03-12</td>\n <td>Emma</td>\n <td>3.5</td>\n <td>Not so bad. I will probably come back again</td>\n </tr>\n <tr>\n <th>6</th>\n <td>2020-03-19</td>\n <td>Lisa</td>\n <td>5.0</td>\n <td>Such a nice restaurant!</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " date name rate review\n0 2020-01-03 Peter 4.5 It is a really nice restaurant! I would recomm...\n1 2020-03-04 Max 3.5 Good price. Not really satisfied with the serv...\n2 2020-04-05 Ella 4.0 Like it. Staff there was quite nice. Recommend.\n3 2020-02-01 Maria 4.5 Looooove it!!!!! Really good! Highly recommend!\n5 2020-03-12 Emma 3.5 Not so bad. I will probably come back again\n6 2020-03-19 Lisa 5.0 Such a nice restaurant!"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Rank"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Pandas.DataFrame.rank(axis, method, numeric_only, ascending, ...)\n\nCompute numerical data ranks (1 through n) along axis.By default, equal values are assigned a rank that is the average of the ranks of those values.\n\nParameters:\n\n* axis: {0 or ‘index’, 1 or ‘columns’}, default 0. Index to direct ranking.\n* method: {‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}, default ‘average’, indicating how to rank the group of records that have the same value\n* numeric_only: bool, optional, default False\n* ascending: bool, default True\n\nReturn:\nSame type as the caller\n\n#### example : "
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.rank()",
"execution_count": 28,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 28,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>date</th>\n <th>name</th>\n <th>rate</th>\n <th>review</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1.0</td>\n <td>6.0</td>\n <td>5.5</td>\n <td>3.0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>4.0</td>\n <td>5.0</td>\n <td>2.5</td>\n <td>2.0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>7.0</td>\n <td>1.0</td>\n <td>4.0</td>\n <td>4.0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>2.0</td>\n <td>4.0</td>\n <td>5.5</td>\n <td>5.0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>3.0</td>\n <td>7.0</td>\n <td>1.0</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>5</th>\n <td>5.0</td>\n <td>2.0</td>\n <td>2.5</td>\n <td>6.0</td>\n </tr>\n <tr>\n <th>6</th>\n <td>6.0</td>\n <td>3.0</td>\n <td>7.0</td>\n <td>7.0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " date name rate review\n0 1.0 6.0 5.5 3.0\n1 4.0 5.0 2.5 2.0\n2 7.0 1.0 4.0 4.0\n3 2.0 4.0 5.5 5.0\n4 3.0 7.0 1.0 1.0\n5 5.0 2.0 2.5 6.0\n6 6.0 3.0 7.0 7.0"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.rank(numeric_only = True,ascending= False,method = 'first')",
"execution_count": 29,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 29,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>rate</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>2.0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>5.0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>4.0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>3.0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>7.0</td>\n </tr>\n <tr>\n <th>5</th>\n <td>6.0</td>\n </tr>\n <tr>\n <th>6</th>\n <td>1.0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " rate\n0 2.0\n1 5.0\n2 4.0\n3 3.0\n4 7.0\n5 6.0\n6 1.0"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_example = df.copy()\ndf_example['rate_rank'] = df['rate'].rank(ascending= False,\n method = 'first'\n )\ndf_example",
"execution_count": 30,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 30,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>date</th>\n <th>name</th>\n <th>rate</th>\n <th>review</th>\n <th>rate_rank</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>2020-01-03</td>\n <td>Peter</td>\n <td>4.5</td>\n <td>It is a really nice restaurant! I would recomm...</td>\n <td>2.0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2020-03-04</td>\n <td>Max</td>\n <td>3.5</td>\n <td>Good price. Not really satisfied with the serv...</td>\n <td>5.0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2020-04-05</td>\n <td>Ella</td>\n <td>4.0</td>\n <td>Like it. Staff there was quite nice. Recommend.</td>\n <td>4.0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>2020-02-01</td>\n <td>Maria</td>\n <td>4.5</td>\n <td>Looooove it!!!!! Really good! Highly recommend!</td>\n <td>3.0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>2020-02-09</td>\n <td>Tom</td>\n <td>3.0</td>\n <td></td>\n <td>7.0</td>\n </tr>\n <tr>\n <th>5</th>\n <td>2020-03-12</td>\n <td>Emma</td>\n <td>3.5</td>\n <td>Not so bad. I will probably come back again</td>\n <td>6.0</td>\n </tr>\n <tr>\n <th>6</th>\n <td>2020-03-19</td>\n <td>Lisa</td>\n <td>5.0</td>\n <td>Such a nice restaurant!</td>\n <td>1.0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " date name rate review \\\n0 2020-01-03 Peter 4.5 It is a really nice restaurant! I would recomm... \n1 2020-03-04 Max 3.5 Good price. Not really satisfied with the serv... \n2 2020-04-05 Ella 4.0 Like it. Staff there was quite nice. Recommend. \n3 2020-02-01 Maria 4.5 Looooove it!!!!! Really good! Highly recommend! \n4 2020-02-09 Tom 3.0 \n5 2020-03-12 Emma 3.5 Not so bad. I will probably come back again \n6 2020-03-19 Lisa 5.0 Such a nice restaurant! \n\n rate_rank \n0 2.0 \n1 5.0 \n2 4.0 \n3 3.0 \n4 7.0 \n5 6.0 \n6 1.0 "
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Categorical Data"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### pandas.Series.cat()\n\nAccessor object for categorical properties of the Series values.\n\nParamters: \n* data: Series or CategoricalIndex\n\nCreate a category column for the current dataframe"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_example = df.copy()\ndf_example['level'] = pd.Series(list(['Good',\n 'Okay',\n 'Okay',\n 'Good',\n 'Soso',\n 'Soso',\n 'Good'\n ])).astype('category')\ndf_example",
"execution_count": 31,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 31,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>date</th>\n <th>name</th>\n <th>rate</th>\n <th>review</th>\n <th>level</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>2020-01-03</td>\n <td>Peter</td>\n <td>4.5</td>\n <td>It is a really nice restaurant! I would recomm...</td>\n <td>Good</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2020-03-04</td>\n <td>Max</td>\n <td>3.5</td>\n <td>Good price. Not really satisfied with the serv...</td>\n <td>Okay</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2020-04-05</td>\n <td>Ella</td>\n <td>4.0</td>\n <td>Like it. Staff there was quite nice. Recommend.</td>\n <td>Okay</td>\n </tr>\n <tr>\n <th>3</th>\n <td>2020-02-01</td>\n <td>Maria</td>\n <td>4.5</td>\n <td>Looooove it!!!!! Really good! Highly recommend!</td>\n <td>Good</td>\n </tr>\n <tr>\n <th>4</th>\n <td>2020-02-09</td>\n <td>Tom</td>\n <td>3.0</td>\n <td></td>\n <td>Soso</td>\n </tr>\n <tr>\n <th>5</th>\n <td>2020-03-12</td>\n <td>Emma</td>\n <td>3.5</td>\n <td>Not so bad. I will probably come back again</td>\n <td>Soso</td>\n </tr>\n <tr>\n <th>6</th>\n <td>2020-03-19</td>\n <td>Lisa</td>\n <td>5.0</td>\n <td>Such a nice restaurant!</td>\n <td>Good</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " date name rate review \\\n0 2020-01-03 Peter 4.5 It is a really nice restaurant! I would recomm... \n1 2020-03-04 Max 3.5 Good price. Not really satisfied with the serv... \n2 2020-04-05 Ella 4.0 Like it. Staff there was quite nice. Recommend. \n3 2020-02-01 Maria 4.5 Looooove it!!!!! Really good! Highly recommend! \n4 2020-02-09 Tom 3.0 \n5 2020-03-12 Emma 3.5 Not so bad. I will probably come back again \n6 2020-03-19 Lisa 5.0 Such a nice restaurant! \n\n level \n0 Good \n1 Okay \n2 Okay \n3 Good \n4 Soso \n5 Soso \n6 Good "
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_example['level'].cat.categories",
"execution_count": 32,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 32,
"data": {
"text/plain": "Index(['Good', 'Okay', 'Soso'], dtype='object')"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_example['rename_cat'] = df_example.level.cat.rename_categories(['Cat.a','Cat.b','Cat.c'])\ndf_example",
"execution_count": 33,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 33,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>date</th>\n <th>name</th>\n <th>rate</th>\n <th>review</th>\n <th>level</th>\n <th>rename_cat</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>2020-01-03</td>\n <td>Peter</td>\n <td>4.5</td>\n <td>It is a really nice restaurant! I would recomm...</td>\n <td>Good</td>\n <td>Cat.a</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2020-03-04</td>\n <td>Max</td>\n <td>3.5</td>\n <td>Good price. Not really satisfied with the serv...</td>\n <td>Okay</td>\n <td>Cat.b</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2020-04-05</td>\n <td>Ella</td>\n <td>4.0</td>\n <td>Like it. Staff there was quite nice. Recommend.</td>\n <td>Okay</td>\n <td>Cat.b</td>\n </tr>\n <tr>\n <th>3</th>\n <td>2020-02-01</td>\n <td>Maria</td>\n <td>4.5</td>\n <td>Looooove it!!!!! Really good! Highly recommend!</td>\n <td>Good</td>\n <td>Cat.a</td>\n </tr>\n <tr>\n <th>4</th>\n <td>2020-02-09</td>\n <td>Tom</td>\n <td>3.0</td>\n <td></td>\n <td>Soso</td>\n <td>Cat.c</td>\n </tr>\n <tr>\n <th>5</th>\n <td>2020-03-12</td>\n <td>Emma</td>\n <td>3.5</td>\n <td>Not so bad. I will probably come back again</td>\n <td>Soso</td>\n <td>Cat.c</td>\n </tr>\n <tr>\n <th>6</th>\n <td>2020-03-19</td>\n <td>Lisa</td>\n <td>5.0</td>\n <td>Such a nice restaurant!</td>\n <td>Good</td>\n <td>Cat.a</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " date name rate review \\\n0 2020-01-03 Peter 4.5 It is a really nice restaurant! I would recomm... \n1 2020-03-04 Max 3.5 Good price. Not really satisfied with the serv... \n2 2020-04-05 Ella 4.0 Like it. Staff there was quite nice. Recommend. \n3 2020-02-01 Maria 4.5 Looooove it!!!!! Really good! Highly recommend! \n4 2020-02-09 Tom 3.0 \n5 2020-03-12 Emma 3.5 Not so bad. I will probably come back again \n6 2020-03-19 Lisa 5.0 Such a nice restaurant! \n\n level rename_cat \n0 Good Cat.a \n1 Okay Cat.b \n2 Okay Cat.b \n3 Good Cat.a \n4 Soso Cat.c \n5 Soso Cat.c \n6 Good Cat.a "
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_example['removed_cat'] = df_example.level.cat.remove_categories(['Soso'])\ndf_example",
"execution_count": 34,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 34,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>date</th>\n <th>name</th>\n <th>rate</th>\n <th>review</th>\n <th>level</th>\n <th>rename_cat</th>\n <th>removed_cat</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>2020-01-03</td>\n <td>Peter</td>\n <td>4.5</td>\n <td>It is a really nice restaurant! I would recomm...</td>\n <td>Good</td>\n <td>Cat.a</td>\n <td>Good</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2020-03-04</td>\n <td>Max</td>\n <td>3.5</td>\n <td>Good price. Not really satisfied with the serv...</td>\n <td>Okay</td>\n <td>Cat.b</td>\n <td>Okay</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2020-04-05</td>\n <td>Ella</td>\n <td>4.0</td>\n <td>Like it. Staff there was quite nice. Recommend.</td>\n <td>Okay</td>\n <td>Cat.b</td>\n <td>Okay</td>\n </tr>\n <tr>\n <th>3</th>\n <td>2020-02-01</td>\n <td>Maria</td>\n <td>4.5</td>\n <td>Looooove it!!!!! Really good! Highly recommend!</td>\n <td>Good</td>\n <td>Cat.a</td>\n <td>Good</td>\n </tr>\n <tr>\n <th>4</th>\n <td>2020-02-09</td>\n <td>Tom</td>\n <td>3.0</td>\n <td></td>\n <td>Soso</td>\n <td>Cat.c</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>5</th>\n <td>2020-03-12</td>\n <td>Emma</td>\n <td>3.5</td>\n <td>Not so bad. I will probably come back again</td>\n <td>Soso</td>\n <td>Cat.c</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>6</th>\n <td>2020-03-19</td>\n <td>Lisa</td>\n <td>5.0</td>\n <td>Such a nice restaurant!</td>\n <td>Good</td>\n <td>Cat.a</td>\n <td>Good</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " date name rate review \\\n0 2020-01-03 Peter 4.5 It is a really nice restaurant! I would recomm... \n1 2020-03-04 Max 3.5 Good price. Not really satisfied with the serv... \n2 2020-04-05 Ella 4.0 Like it. Staff there was quite nice. Recommend. \n3 2020-02-01 Maria 4.5 Looooove it!!!!! Really good! Highly recommend! \n4 2020-02-09 Tom 3.0 \n5 2020-03-12 Emma 3.5 Not so bad. I will probably come back again \n6 2020-03-19 Lisa 5.0 Such a nice restaurant! \n\n level rename_cat removed_cat \n0 Good Cat.a Good \n1 Okay Cat.b Okay \n2 Okay Cat.b Okay \n3 Good Cat.a Good \n4 Soso Cat.c NaN \n5 Soso Cat.c NaN \n6 Good Cat.a Good "
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_example['add_cat'] = df_example.level.cat.add_categories(['Bad'])\ndf_example",
"execution_count": 35,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 35,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>date</th>\n <th>name</th>\n <th>rate</th>\n <th>review</th>\n <th>level</th>\n <th>rename_cat</th>\n <th>removed_cat</th>\n <th>add_cat</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>2020-01-03</td>\n <td>Peter</td>\n <td>4.5</td>\n <td>It is a really nice restaurant! I would recomm...</td>\n <td>Good</td>\n <td>Cat.a</td>\n <td>Good</td>\n <td>Good</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2020-03-04</td>\n <td>Max</td>\n <td>3.5</td>\n <td>Good price. Not really satisfied with the serv...</td>\n <td>Okay</td>\n <td>Cat.b</td>\n <td>Okay</td>\n <td>Okay</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2020-04-05</td>\n <td>Ella</td>\n <td>4.0</td>\n <td>Like it. Staff there was quite nice. Recommend.</td>\n <td>Okay</td>\n <td>Cat.b</td>\n <td>Okay</td>\n <td>Okay</td>\n </tr>\n <tr>\n <th>3</th>\n <td>2020-02-01</td>\n <td>Maria</td>\n <td>4.5</td>\n <td>Looooove it!!!!! Really good! Highly recommend!</td>\n <td>Good</td>\n <td>Cat.a</td>\n <td>Good</td>\n <td>Good</td>\n </tr>\n <tr>\n <th>4</th>\n <td>2020-02-09</td>\n <td>Tom</td>\n <td>3.0</td>\n <td></td>\n <td>Soso</td>\n <td>Cat.c</td>\n <td>NaN</td>\n <td>Soso</td>\n </tr>\n <tr>\n <th>5</th>\n <td>2020-03-12</td>\n <td>Emma</td>\n <td>3.5</td>\n <td>Not so bad. I will probably come back again</td>\n <td>Soso</td>\n <td>Cat.c</td>\n <td>NaN</td>\n <td>Soso</td>\n </tr>\n <tr>\n <th>6</th>\n <td>2020-03-19</td>\n <td>Lisa</td>\n <td>5.0</td>\n <td>Such a nice restaurant!</td>\n <td>Good</td>\n <td>Cat.a</td>\n <td>Good</td>\n <td>Good</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " date name rate review \\\n0 2020-01-03 Peter 4.5 It is a really nice restaurant! I would recomm... \n1 2020-03-04 Max 3.5 Good price. Not really satisfied with the serv... \n2 2020-04-05 Ella 4.0 Like it. Staff there was quite nice. Recommend. \n3 2020-02-01 Maria 4.5 Looooove it!!!!! Really good! Highly recommend! \n4 2020-02-09 Tom 3.0 \n5 2020-03-12 Emma 3.5 Not so bad. I will probably come back again \n6 2020-03-19 Lisa 5.0 Such a nice restaurant! \n\n level rename_cat removed_cat add_cat \n0 Good Cat.a Good Good \n1 Okay Cat.b Okay Okay \n2 Okay Cat.b Okay Okay \n3 Good Cat.a Good Good \n4 Soso Cat.c NaN Soso \n5 Soso Cat.c NaN Soso \n6 Good Cat.a Good Good "
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_example['add_cat'].cat.categories",
"execution_count": 36,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 36,
"data": {
"text/plain": "Index(['Good', 'Okay', 'Soso', 'Bad'], dtype='object')"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### pandas.getdummies(data, prefix, etc...)\n\nConvert categorical variable into dummy/indicator variables.\n\nParameters: \n* data: array-like, Series, or DataFrame\n* prefix: str, list of str, or dict of str, default None\n\nReturns: \n* DataFrame, Dummy-coded data\n\nUse the DataFrame we created at 'cat' part:\n\n#### example:"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "pd.get_dummies(df_example['level'])",
"execution_count": 37,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 37,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Good</th>\n <th>Okay</th>\n <th>Soso</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>5</th>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>6</th>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " Good Okay Soso\n0 1 0 0\n1 0 1 0\n2 0 1 0\n3 1 0 0\n4 0 0 1\n5 0 0 1\n6 1 0 0"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "pd.get_dummies(df_example['level'], prefix = 'level_')",
"execution_count": 38,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 38,
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>level__Good</th>\n <th>level__Okay</th>\n <th>level__Soso</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>5</th>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>6</th>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " level__Good level__Okay level__Soso\n0 1 0 0\n1 0 1 0\n2 0 1 0\n3 1 0 0\n4 0 0 1\n5 0 0 1\n6 1 0 0"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## End\n\nThey are only a means to the end of learning how to do good data analysis "
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"gist": {
"id": "",
"data": {
"description": "Pandas - Groupby and Categorical Data.ipynb",
"public": false
}
},
"language_info": {
"file_extension": ".py",
"nbconvert_exporter": "python",
"version": "3.5.4",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"name": "python",
"mimetype": "text/x-python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment