Skip to content

Instantly share code, notes, and snippets.

@machinelearning-blog
Created January 16, 2018 15:27
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save machinelearning-blog/bf5cb00bf63061935caf87779c111b07 to your computer and use it in GitHub Desktop.
Save machinelearning-blog/bf5cb00bf63061935caf87779c111b07 to your computer and use it in GitHub Desktop.
Introduction to Pandas
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"metadata": {
"_uuid": "79d8fa343e5b0b43bd342efa42b30025c2057741",
"_cell_guid": "0e95d0cb-3064-4dd6-b110-1590ec214caa"
},
"cell_type": "markdown",
"source": "## **Introduction**\nThis Kernel covers the basic pandas commands, every ML engineer or Data Scientist should know. Therefore it is intended for beginners. By using these commands we will transform and clean the Rossman Store Sales Dataset. Some transformation I will do in this Kernel doesn't really makes sense out of a machine learning persepective, I just do them for illustrational purposes of pandas. I will use the steps described in the Youtube Tutiorial: \"Introduction To Data Analytics With Pandas\" from Quentin Caudron, but with the Rossmann Data set. "
},
{
"metadata": {
"_uuid": "c4b505bc88cdb7521461b8334aa3cef69728f73b",
"_cell_guid": "ce3de614-aeff-4539-9948-9122e67efd90"
},
"cell_type": "markdown",
"source": "----------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n"
},
{
"metadata": {
"_uuid": "43b811718fa71157ee9c74f843f63fd4e8f04a3f",
"collapsed": true,
"_cell_guid": "17a056b4-98f0-4192-9a57-a4c072eb288e",
"trusted": true
},
"cell_type": "code",
"source": "import pandas as pd\nimport numpy as np\nimport matplotlib",
"execution_count": 1,
"outputs": []
},
{
"metadata": {
"_uuid": "062e2d8b46b0f945a4ab09e67dedb442526259e6",
"_cell_guid": "6c745296-c8b0-4488-a024-41a7c485037e",
"trusted": true
},
"cell_type": "code",
"source": "train_df = pd.read_csv(\"../input/train.csv\")",
"execution_count": 2,
"outputs": [
{
"output_type": "stream",
"text": "/opt/conda/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2698: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.\n interactivity=interactivity, compiler=compiler, result=result)\n",
"name": "stderr"
}
]
},
{
"metadata": {
"_uuid": "5621f6fde374b4208d0deda3b0cb84df670259bd",
"_cell_guid": "8a682f5d-4c4c-4c89-bbd1-c63caad4c085"
},
"cell_type": "markdown",
"source": "----------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n"
},
{
"metadata": {
"_uuid": "335bf65960756bfcd040c8ed2ed231effd2f8b9f",
"_cell_guid": "8d0d65b9-8880-4d46-b336-a5a50872782f"
},
"cell_type": "markdown",
"source": ""
},
{
"metadata": {
"_uuid": "165ba81e1b1ef5b8186243867c3958c97e6828e3",
"_cell_guid": "97a5b0a5-7499-4377-89ae-8cb039792d3e"
},
"cell_type": "markdown",
"source": "# **Data Exploration**\n----------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n"
},
{
"metadata": {
"_uuid": "c6e70bf247c548c9fb4e9a99ece4468a462c8f16",
"_cell_guid": "472297ae-310a-4d29-8a36-7ab7dc9d3fa5",
"trusted": true
},
"cell_type": "code",
"source": "# .head() returns the first 5 rows of a dataset\ntrain_df.head()",
"execution_count": 3,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 3,
"data": {
"text/plain": " Store DayOfWeek Date Sales Customers Open Promo StateHoliday \\\n0 1 5 2015-07-31 5263 555 1 1 0 \n1 2 5 2015-07-31 6064 625 1 1 0 \n2 3 5 2015-07-31 8314 821 1 1 0 \n3 4 5 2015-07-31 13995 1498 1 1 0 \n4 5 5 2015-07-31 4822 559 1 1 0 \n\n SchoolHoliday \n0 1 \n1 1 \n2 1 \n3 1 \n4 1 ",
"text/html": "<div>\n<style>\n .dataframe thead tr:only-child th {\n text-align: right;\n }\n\n .dataframe thead th {\n text-align: left;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Store</th>\n <th>DayOfWeek</th>\n <th>Date</th>\n <th>Sales</th>\n <th>Customers</th>\n <th>Open</th>\n <th>Promo</th>\n <th>StateHoliday</th>\n <th>SchoolHoliday</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>5</td>\n <td>2015-07-31</td>\n <td>5263</td>\n <td>555</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>5</td>\n <td>2015-07-31</td>\n <td>6064</td>\n <td>625</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>5</td>\n <td>2015-07-31</td>\n <td>8314</td>\n <td>821</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>5</td>\n <td>2015-07-31</td>\n <td>13995</td>\n <td>1498</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>5</td>\n <td>2015-07-31</td>\n <td>4822</td>\n <td>559</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"_uuid": "e7e366326d7564df1d106caf24ade986497a7e6c",
"_cell_guid": "a97fed19-3399-45b1-b749-bff07dfed3c2",
"trusted": true
},
"cell_type": "code",
"source": "# .tail() returns the last 5 rows of a dataset\ntrain_df.tail()",
"execution_count": 4,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 4,
"data": {
"text/plain": " Store DayOfWeek Date Sales Customers Open Promo \\\n1017204 1111 2 2013-01-01 0 0 0 0 \n1017205 1112 2 2013-01-01 0 0 0 0 \n1017206 1113 2 2013-01-01 0 0 0 0 \n1017207 1114 2 2013-01-01 0 0 0 0 \n1017208 1115 2 2013-01-01 0 0 0 0 \n\n StateHoliday SchoolHoliday \n1017204 a 1 \n1017205 a 1 \n1017206 a 1 \n1017207 a 1 \n1017208 a 1 ",
"text/html": "<div>\n<style>\n .dataframe thead tr:only-child th {\n text-align: right;\n }\n\n .dataframe thead th {\n text-align: left;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Store</th>\n <th>DayOfWeek</th>\n <th>Date</th>\n <th>Sales</th>\n <th>Customers</th>\n <th>Open</th>\n <th>Promo</th>\n <th>StateHoliday</th>\n <th>SchoolHoliday</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>1017204</th>\n <td>1111</td>\n <td>2</td>\n <td>2013-01-01</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>a</td>\n <td>1</td>\n </tr>\n <tr>\n <th>1017205</th>\n <td>1112</td>\n <td>2</td>\n <td>2013-01-01</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>a</td>\n <td>1</td>\n </tr>\n <tr>\n <th>1017206</th>\n <td>1113</td>\n <td>2</td>\n <td>2013-01-01</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>a</td>\n <td>1</td>\n </tr>\n <tr>\n <th>1017207</th>\n <td>1114</td>\n <td>2</td>\n <td>2013-01-01</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>a</td>\n <td>1</td>\n </tr>\n <tr>\n <th>1017208</th>\n <td>1115</td>\n <td>2</td>\n <td>2013-01-01</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>a</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"_uuid": "73be4188b29c533e5b0736f9286388f528d9ed5d",
"_cell_guid": "5ab84b5e-a381-4e0b-8fe7-6e20cc945531",
"trusted": true
},
"cell_type": "code",
"source": "# .info() shows generell information about the datafram like total entrie number, \n# total number of features, feature types etc.\ntrain_df.info()",
"execution_count": 5,
"outputs": [
{
"output_type": "stream",
"text": "<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 1017209 entries, 0 to 1017208\nData columns (total 9 columns):\nStore 1017209 non-null int64\nDayOfWeek 1017209 non-null int64\nDate 1017209 non-null object\nSales 1017209 non-null int64\nCustomers 1017209 non-null int64\nOpen 1017209 non-null int64\nPromo 1017209 non-null int64\nStateHoliday 1017209 non-null object\nSchoolHoliday 1017209 non-null int64\ndtypes: int64(7), object(2)\nmemory usage: 69.8+ MB\n",
"name": "stdout"
}
]
},
{
"metadata": {
"_uuid": "930b37135164e1ef5f02ff54ae76d2dc73a9852b",
"_cell_guid": "f0d6d85c-08cd-48f3-a64b-8635512bf4ff",
"trusted": true
},
"cell_type": "code",
"source": "# .iloc[] returns a specific row of the dataframe. Just put in the index you wish to see.\n# there is also .loc[] which is for selection by label, but also used with a boolean array\ntrain_df.iloc[2]",
"execution_count": 6,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 6,
"data": {
"text/plain": "Store 3\nDayOfWeek 5\nDate 2015-07-31\nSales 8314\nCustomers 821\nOpen 1\nPromo 1\nStateHoliday 0\nSchoolHoliday 1\nName: 2, dtype: object"
},
"metadata": {}
}
]
},
{
"metadata": {
"_uuid": "665eea0df4e4ef247e2be087f1167cce3c3bc420",
"_cell_guid": "76b90322-e691-4561-b415-4c1c6d513573",
"trusted": true
},
"cell_type": "code",
"source": "# .describe() shows you several statistical stats about your dataset.\ntrain_df.describe()",
"execution_count": 7,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 7,
"data": {
"text/plain": " Store DayOfWeek Sales Customers Open \\\ncount 1.017209e+06 1.017209e+06 1.017209e+06 1.017209e+06 1.017209e+06 \nmean 5.584297e+02 3.998341e+00 5.773819e+03 6.331459e+02 8.301067e-01 \nstd 3.219087e+02 1.997391e+00 3.849926e+03 4.644117e+02 3.755392e-01 \nmin 1.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 \n25% 2.800000e+02 2.000000e+00 3.727000e+03 4.050000e+02 1.000000e+00 \n50% 5.580000e+02 4.000000e+00 5.744000e+03 6.090000e+02 1.000000e+00 \n75% 8.380000e+02 6.000000e+00 7.856000e+03 8.370000e+02 1.000000e+00 \nmax 1.115000e+03 7.000000e+00 4.155100e+04 7.388000e+03 1.000000e+00 \n\n Promo SchoolHoliday \ncount 1.017209e+06 1.017209e+06 \nmean 3.815145e-01 1.786467e-01 \nstd 4.857586e-01 3.830564e-01 \nmin 0.000000e+00 0.000000e+00 \n25% 0.000000e+00 0.000000e+00 \n50% 0.000000e+00 0.000000e+00 \n75% 1.000000e+00 0.000000e+00 \nmax 1.000000e+00 1.000000e+00 ",
"text/html": "<div>\n<style>\n .dataframe thead tr:only-child th {\n text-align: right;\n }\n\n .dataframe thead th {\n text-align: left;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Store</th>\n <th>DayOfWeek</th>\n <th>Sales</th>\n <th>Customers</th>\n <th>Open</th>\n <th>Promo</th>\n <th>SchoolHoliday</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>count</th>\n <td>1.017209e+06</td>\n <td>1.017209e+06</td>\n <td>1.017209e+06</td>\n <td>1.017209e+06</td>\n <td>1.017209e+06</td>\n <td>1.017209e+06</td>\n <td>1.017209e+06</td>\n </tr>\n <tr>\n <th>mean</th>\n <td>5.584297e+02</td>\n <td>3.998341e+00</td>\n <td>5.773819e+03</td>\n <td>6.331459e+02</td>\n <td>8.301067e-01</td>\n <td>3.815145e-01</td>\n <td>1.786467e-01</td>\n </tr>\n <tr>\n <th>std</th>\n <td>3.219087e+02</td>\n <td>1.997391e+00</td>\n <td>3.849926e+03</td>\n <td>4.644117e+02</td>\n <td>3.755392e-01</td>\n <td>4.857586e-01</td>\n <td>3.830564e-01</td>\n </tr>\n <tr>\n <th>min</th>\n <td>1.000000e+00</td>\n <td>1.000000e+00</td>\n <td>0.000000e+00</td>\n <td>0.000000e+00</td>\n <td>0.000000e+00</td>\n <td>0.000000e+00</td>\n <td>0.000000e+00</td>\n </tr>\n <tr>\n <th>25%</th>\n <td>2.800000e+02</td>\n <td>2.000000e+00</td>\n <td>3.727000e+03</td>\n <td>4.050000e+02</td>\n <td>1.000000e+00</td>\n <td>0.000000e+00</td>\n <td>0.000000e+00</td>\n </tr>\n <tr>\n <th>50%</th>\n <td>5.580000e+02</td>\n <td>4.000000e+00</td>\n <td>5.744000e+03</td>\n <td>6.090000e+02</td>\n <td>1.000000e+00</td>\n <td>0.000000e+00</td>\n <td>0.000000e+00</td>\n </tr>\n <tr>\n <th>75%</th>\n <td>8.380000e+02</td>\n <td>6.000000e+00</td>\n <td>7.856000e+03</td>\n <td>8.370000e+02</td>\n <td>1.000000e+00</td>\n <td>1.000000e+00</td>\n <td>0.000000e+00</td>\n </tr>\n <tr>\n <th>max</th>\n <td>1.115000e+03</td>\n <td>7.000000e+00</td>\n <td>4.155100e+04</td>\n <td>7.388000e+03</td>\n <td>1.000000e+00</td>\n <td>1.000000e+00</td>\n <td>1.000000e+00</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"_uuid": "505a1662862b3bfbea05c607896c87d1999a15c2",
"_cell_guid": "a81dab5f-1732-49bb-870a-dcd2eaaf3bf1"
},
"cell_type": "markdown",
"source": ""
},
{
"metadata": {
"_uuid": "7d1310df6f735b44e5dd8592ba1aadad30bfd170",
"_cell_guid": "3be91fe6-7e4b-4e42-bfa9-a0fd8f75c9cb"
},
"cell_type": "markdown",
"source": "## **Missing Data ?**"
},
{
"metadata": {
"_uuid": "5c819fc6482575206d84d1b23d54d248cba31ac5",
"_cell_guid": "73877acc-4a31-4200-ab5d-70d1e850fd3d"
},
"cell_type": "markdown",
"source": "Let's look if you we have some missing data in our dataset. "
},
{
"metadata": {
"_uuid": "195846d78641dfbf4cbc7b0bd77e0261a9342905",
"_cell_guid": "8d6c3bb9-8269-445b-89f4-f09ad41431d1",
"trusted": true
},
"cell_type": "code",
"source": "# .isnull() detects missing values.\n# .sum() sums values up \n# .max() returns the maximum of the values in the object\n# By using these 3 methods together we can easily see how many missing values our data contains.\ntrain_df.isnull().sum().max() ",
"execution_count": 8,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 8,
"data": {
"text/plain": "0"
},
"metadata": {}
}
]
},
{
"metadata": {
"_uuid": "d5fae37032e0dbbb067fcbab6b85bc604a42b360",
"_cell_guid": "a3573993-f354-4618-9ba4-7a913ad9e4e0"
},
"cell_type": "markdown",
"source": ""
},
{
"metadata": {
"_uuid": "2a2c5b09f784023568e792d9ff41c9be94ae5343",
"_cell_guid": "d1065a54-9d69-40bb-babf-c2d3cc6404e8"
},
"cell_type": "markdown",
"source": "## **Changing pandas dtypes**"
},
{
"metadata": {
"_uuid": "4843479d5192799783d0bc2c07245f512e6dd264",
"_cell_guid": "c87ae40b-d51d-43f0-899a-412f7c567f07",
"trusted": true
},
"cell_type": "code",
"source": "# dtypes shows you all features of the dataset and at what type panadas stored them\ntrain_df.dtypes",
"execution_count": 9,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 9,
"data": {
"text/plain": "Store int64\nDayOfWeek int64\nDate object\nSales int64\nCustomers int64\nOpen int64\nPromo int64\nStateHoliday object\nSchoolHoliday int64\ndtype: object"
},
"metadata": {}
}
]
},
{
"metadata": {
"_uuid": "b13a71fa81c9b4891a3f20f36044cdf7918755c1",
"_cell_guid": "268fab0a-114e-42a5-8216-0fd749c430f8",
"trusted": true
},
"cell_type": "code",
"source": "print(train_df.Date[0])\n# type() returns the type of the input\nprint(type(train_df.Date[0]))",
"execution_count": 10,
"outputs": [
{
"output_type": "stream",
"text": "2015-07-31\n<class 'str'>\n",
"name": "stdout"
}
]
},
{
"metadata": {
"_uuid": "7698b65440c7a8209a19ddfb3bffc61018655a9a",
"_cell_guid": "ad24fb05-1e52-4b1e-956a-c649601991f3"
},
"cell_type": "markdown",
"source": "The Date feature is stored as a string. We will convert it into a pandas Datetime object, so that it is easier to work with."
},
{
"metadata": {
"_uuid": "56e1f45d5cb73018237539178c7110fe23829e07",
"_cell_guid": "37cc7853-c315-405f-8d43-06686f24c56c",
"trusted": true
},
"cell_type": "code",
"source": "# pd.to_datetime() transform it into a datetime object\ntrain_df.Date = pd.to_datetime(train_df.Date)\n# confirm the types\ntrain_df.dtypes",
"execution_count": 11,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 11,
"data": {
"text/plain": "Store int64\nDayOfWeek int64\nDate datetime64[ns]\nSales int64\nCustomers int64\nOpen int64\nPromo int64\nStateHoliday object\nSchoolHoliday int64\ndtype: object"
},
"metadata": {}
}
]
},
{
"metadata": {
"_uuid": "7176eaa66fcad49ef7ad42332fe2d177083362af",
"_cell_guid": "b566e2c7-c2ed-4715-96a2-3f618993b32d"
},
"cell_type": "markdown",
"source": ""
},
{
"metadata": {
"_uuid": "f587375f027eb81b61dcf2435a81bae14ee7a76b",
"_cell_guid": "7190b05e-87ae-4619-a6e4-c97600c7ac13"
},
"cell_type": "markdown",
"source": "## **Converting Values**"
},
{
"metadata": {
"_uuid": "d54b06dcf1b2ff822e9156153623098bf09a8b6f",
"_cell_guid": "9647bb11-2cda-41f2-ae7e-6fee215e6ec3",
"trusted": true
},
"cell_type": "code",
"source": "# train_df.StateHoliday selects the StateHoliday feature\n# value_counts() returns how many different values a feature has and counts how often they occur.\ntrain_df.StateHoliday.value_counts()",
"execution_count": 12,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 12,
"data": {
"text/plain": "0 855087\n0 131072\na 20260\nb 6690\nc 4100\nName: StateHoliday, dtype: int64"
},
"metadata": {}
}
]
},
{
"metadata": {
"_uuid": "062698727b56cf4de0df47829ea83eb6bee9720f",
"_cell_guid": "f4d6f06d-511b-4c7e-ac1d-f8d852e78a2b"
},
"cell_type": "markdown",
"source": "As you can see the StateHoliday feature contains not only numbers. Because of that we will convert the letters into numeric values."
},
{
"metadata": {
"_uuid": "e3ac55a2bd0d2b99395d1ca027b82e2f18bf650d",
"collapsed": true,
"_cell_guid": "67e641f3-2e22-4a60-b359-ef6f64d42304",
"trusted": true
},
"cell_type": "code",
"source": "# create a mapping dictionary\nmapping_dictionary = {\"StateHoliday\": {\"a\": 1, \"b\": 2, \"c\": 3}}",
"execution_count": 13,
"outputs": []
},
{
"metadata": {
"_uuid": "bdcb3e512e8389a92293008ac40da88d0d25c9b2",
"collapsed": true,
"_cell_guid": "fd01011d-6340-4658-80eb-09a41d121615",
"trusted": true
},
"cell_type": "code",
"source": "# .replace() replaces the values. \ntrain_df.replace(mapping_dictionary, inplace = True)",
"execution_count": 14,
"outputs": []
},
{
"metadata": {
"_uuid": "83e92c42a539a1e2386bd8bee805ee6d0e430c42",
"_cell_guid": "679e6513-1e84-41ed-856a-89865025d27c",
"trusted": true
},
"cell_type": "code",
"source": "# Let's see if it worked:\ntrain_df.StateHoliday.value_counts()",
"execution_count": 15,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 15,
"data": {
"text/plain": "0 855087\n0 131072\n1 20260\n2 6690\n3 4100\nName: StateHoliday, dtype: int64"
},
"metadata": {}
}
]
},
{
"metadata": {
"_uuid": "f66e46346487c0ac913fc269e4d213e57379fd62",
"_cell_guid": "f5cefb0d-5e7d-4b33-b8ab-0b8c346963c9",
"trusted": true
},
"cell_type": "code",
"source": "# Let's check the dtype again.\ntrain_df.dtypes",
"execution_count": 16,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 16,
"data": {
"text/plain": "Store int64\nDayOfWeek int64\nDate datetime64[ns]\nSales int64\nCustomers int64\nOpen int64\nPromo int64\nStateHoliday object\nSchoolHoliday int64\ndtype: object"
},
"metadata": {}
}
]
},
{
"metadata": {
"_uuid": "4f322230b69e9b1d6f4aac82647fad025a69404e",
"_cell_guid": "7f2d0600-02a3-4378-8c25-40fdb31e3e14"
},
"cell_type": "markdown",
"source": "We succesfully converted the features values but pandas has the StateHoliday feature still stored as an object dtype. We will convert it into an int64."
},
{
"metadata": {
"_uuid": "e9052522b4b182c00dd19500e6825dd6e81f2849",
"collapsed": true,
"_cell_guid": "4df968cb-e2aa-401d-9b0a-0e18874d4a3f",
"trusted": true
},
"cell_type": "code",
"source": "# astype() transform the dtype, at this example into an integer (int64)\ntrain_df.StateHoliday = train_df.StateHoliday.astype(int)",
"execution_count": 17,
"outputs": []
},
{
"metadata": {
"_uuid": "a8a46e317410e5a591480f5603ec586d6d07e0d7",
"_cell_guid": "d92ebb10-9b4f-44c9-b8d7-154e0f8143bd",
"trusted": true
},
"cell_type": "code",
"source": "train_df.dtypes",
"execution_count": 18,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 18,
"data": {
"text/plain": "Store int64\nDayOfWeek int64\nDate datetime64[ns]\nSales int64\nCustomers int64\nOpen int64\nPromo int64\nStateHoliday int64\nSchoolHoliday int64\ndtype: object"
},
"metadata": {}
}
]
},
{
"metadata": {
"_uuid": "4cf03b4c00ae86a26c0fc06e3e411ca2297f0bab",
"_cell_guid": "092672fa-2fcf-4bb2-9d3a-b89e11614476",
"trusted": true
},
"cell_type": "code",
"source": "train_df.head()",
"execution_count": 19,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 19,
"data": {
"text/plain": " Store DayOfWeek Date Sales Customers Open Promo StateHoliday \\\n0 1 5 2015-07-31 5263 555 1 1 0 \n1 2 5 2015-07-31 6064 625 1 1 0 \n2 3 5 2015-07-31 8314 821 1 1 0 \n3 4 5 2015-07-31 13995 1498 1 1 0 \n4 5 5 2015-07-31 4822 559 1 1 0 \n\n SchoolHoliday \n0 1 \n1 1 \n2 1 \n3 1 \n4 1 ",
"text/html": "<div>\n<style>\n .dataframe thead tr:only-child th {\n text-align: right;\n }\n\n .dataframe thead th {\n text-align: left;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Store</th>\n <th>DayOfWeek</th>\n <th>Date</th>\n <th>Sales</th>\n <th>Customers</th>\n <th>Open</th>\n <th>Promo</th>\n <th>StateHoliday</th>\n <th>SchoolHoliday</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>5</td>\n <td>2015-07-31</td>\n <td>5263</td>\n <td>555</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>5</td>\n <td>2015-07-31</td>\n <td>6064</td>\n <td>625</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>5</td>\n <td>2015-07-31</td>\n <td>8314</td>\n <td>821</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>5</td>\n <td>2015-07-31</td>\n <td>13995</td>\n <td>1498</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>5</td>\n <td>2015-07-31</td>\n <td>4822</td>\n <td>559</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"_uuid": "0ffe1c99a53a1fbfc5d770e61446d6e7e3469209",
"_cell_guid": "a214af2f-84c2-4e1d-898e-7edaf24f6fa3"
},
"cell_type": "markdown",
"source": ""
},
{
"metadata": {
"_uuid": "9cd7390090015d8328eb485178414096b9cf7121",
"_cell_guid": "84c0242f-1eb1-4d8b-b246-266d39e725af"
},
"cell_type": "markdown",
"source": "## **Creating new features and using the Datetime object**"
},
{
"metadata": {
"_uuid": "dfa0f50f2e24cf120dbc924edf36b6474c376622",
"_cell_guid": "01707143-5b99-40bf-bd95-3c5a71611955"
},
"cell_type": "markdown",
"source": "The DayOfWeek feature stores the day as a number, which is good to put it into an algorithm but out of illustrational purposes we will create a new feature that contains the actual weekday as a string. We will delete the DayofWeek feature because it didn't works inline with the Datetime object, which is way easier to work with."
},
{
"metadata": {
"_uuid": "27f31fc58c5d6cde4fe060db093a15b3f74ab0e3",
"collapsed": true,
"_cell_guid": "1e1c411b-755f-41f8-be42-52ffa99375d3",
"trusted": true
},
"cell_type": "code",
"source": "# .drop() to drop the DayOfWeek feature\ntrain_df = train_df.drop(\"DayOfWeek\", axis=1)",
"execution_count": 20,
"outputs": []
},
{
"metadata": {
"_uuid": "059555cb94dbf22e36989a2f456d3bf1863046fd",
"collapsed": true,
"_cell_guid": "aabc45b6-6a15-457d-a80c-3aab2ab46ffa",
"trusted": true
},
"cell_type": "code",
"source": "# Create a series for the weekdays for each entry using dt.weekday. \n# Pandas automatically finds the right day to a specific data because we previously \n# transformed the Date feature into a Datetime object.\nweekdays = train_df.Date.dt.weekday\n# assign() assigns the new weekdays feature to our dataframe.\ntrain_df = train_df.assign(weekdays = weekdays)",
"execution_count": 21,
"outputs": []
},
{
"metadata": {
"_uuid": "a1446a7aa982bd93a4989fcc362f319593dd6a71",
"_cell_guid": "edbdabf0-45bc-4d3e-a111-e584deb214bd",
"trusted": true
},
"cell_type": "code",
"source": "train_df.head()",
"execution_count": 22,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 22,
"data": {
"text/plain": " Store Date Sales Customers Open Promo StateHoliday \\\n0 1 2015-07-31 5263 555 1 1 0 \n1 2 2015-07-31 6064 625 1 1 0 \n2 3 2015-07-31 8314 821 1 1 0 \n3 4 2015-07-31 13995 1498 1 1 0 \n4 5 2015-07-31 4822 559 1 1 0 \n\n SchoolHoliday weekdays \n0 1 4 \n1 1 4 \n2 1 4 \n3 1 4 \n4 1 4 ",
"text/html": "<div>\n<style>\n .dataframe thead tr:only-child th {\n text-align: right;\n }\n\n .dataframe thead th {\n text-align: left;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Store</th>\n <th>Date</th>\n <th>Sales</th>\n <th>Customers</th>\n <th>Open</th>\n <th>Promo</th>\n <th>StateHoliday</th>\n <th>SchoolHoliday</th>\n <th>weekdays</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>2015-07-31</td>\n <td>5263</td>\n <td>555</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>4</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>2015-07-31</td>\n <td>6064</td>\n <td>625</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>4</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>2015-07-31</td>\n <td>8314</td>\n <td>821</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>4</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>2015-07-31</td>\n <td>13995</td>\n <td>1498</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>4</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>2015-07-31</td>\n <td>4822</td>\n <td>559</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>4</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"_uuid": "e747b8949890a84648a679e34f0549ce5e2a5b14",
"_cell_guid": "ca2b3526-eda6-424b-9bf3-6ba9f6a0734b"
},
"cell_type": "markdown",
"source": "We now have a new feature that stores the day of week, called: weekdays. Now we will transform the numbers it contains into actual weekdays."
},
{
"metadata": {
"_uuid": "c0986e2ad63bb93d0d0922053e5176af067ad296",
"collapsed": true,
"_cell_guid": "a9c8a33d-0d8b-4cdd-adf6-1748a5763d1a",
"trusted": true
},
"cell_type": "code",
"source": "# creating a list of the days\nweekday_names = [\"Monday\", \"Tuesday\", \"Wednesday\", \"Thursday\", \"Friday\", \"Saturday\", \"Sunday\"]\n# for-loop to assign these days\nweekday_dict = {key: weekday_names[key] for key in range(7)}\n\n# fucntion to actually replace the numbers with the days\ndef day_of_week(idx):\n return weekday_dict[idx]\n# use apply() to apply our function to the weekdays column\ntrain_df.weekdays = train_df.weekdays.apply(day_of_week)",
"execution_count": 23,
"outputs": []
},
{
"metadata": {
"_uuid": "875c355d1f1beb922a105f2d220941d039228371",
"_cell_guid": "ce5ca3c8-afbc-4fa6-9feb-a2ea5cf0cb11",
"trusted": true
},
"cell_type": "code",
"source": "train_df.weekdays.value_counts()",
"execution_count": 24,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 24,
"data": {
"text/plain": "Thursday 145845\nFriday 145845\nWednesday 145665\nTuesday 145664\nSunday 144730\nMonday 144730\nSaturday 144730\nName: weekdays, dtype: int64"
},
"metadata": {}
}
]
},
{
"metadata": {
"_uuid": "add761342bb7939826bb964f041af021944753eb",
"_cell_guid": "2db62853-62d1-46dd-8db0-d00b44ac5671"
},
"cell_type": "markdown",
"source": ""
},
{
"metadata": {
"_uuid": "3bdb6fcf485bb01652eabd835eafafed008809e5",
"_cell_guid": "0b2111f9-6812-41cf-a502-600eea84cb21"
},
"cell_type": "markdown",
"source": "## **Grouping data by a Feature**\nLet's group these weekdays."
},
{
"metadata": {
"_uuid": "997f66ab0fc502035fb23be0c05a35ccc9b601dc",
"_cell_guid": "ffc7ba60-bba0-441a-adcd-da5c5b03d0da",
"scrolled": true,
"trusted": true
},
"cell_type": "code",
"source": "# groupby() groups our weekdays and count() counts the rows in each group\nweekday_counts = train_df.groupby(\"weekdays\").count()\n\n# We can reorder this dataframe by our weekday_names list\nweekday_counts = weekday_counts.loc[weekday_names]\n\nweekday_counts",
"execution_count": 25,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 25,
"data": {
"text/plain": " Store Date Sales Customers Open Promo StateHoliday \\\nweekdays \nMonday 144730 144730 144730 144730 144730 144730 144730 \nTuesday 145664 145664 145664 145664 145664 145664 145664 \nWednesday 145665 145665 145665 145665 145665 145665 145665 \nThursday 145845 145845 145845 145845 145845 145845 145845 \nFriday 145845 145845 145845 145845 145845 145845 145845 \nSaturday 144730 144730 144730 144730 144730 144730 144730 \nSunday 144730 144730 144730 144730 144730 144730 144730 \n\n SchoolHoliday \nweekdays \nMonday 144730 \nTuesday 145664 \nWednesday 145665 \nThursday 145845 \nFriday 145845 \nSaturday 144730 \nSunday 144730 ",
"text/html": "<div>\n<style>\n .dataframe thead tr:only-child th {\n text-align: right;\n }\n\n .dataframe thead th {\n text-align: left;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Store</th>\n <th>Date</th>\n <th>Sales</th>\n <th>Customers</th>\n <th>Open</th>\n <th>Promo</th>\n <th>StateHoliday</th>\n <th>SchoolHoliday</th>\n </tr>\n <tr>\n <th>weekdays</th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>Monday</th>\n <td>144730</td>\n <td>144730</td>\n <td>144730</td>\n <td>144730</td>\n <td>144730</td>\n <td>144730</td>\n <td>144730</td>\n <td>144730</td>\n </tr>\n <tr>\n <th>Tuesday</th>\n <td>145664</td>\n <td>145664</td>\n <td>145664</td>\n <td>145664</td>\n <td>145664</td>\n <td>145664</td>\n <td>145664</td>\n <td>145664</td>\n </tr>\n <tr>\n <th>Wednesday</th>\n <td>145665</td>\n <td>145665</td>\n <td>145665</td>\n <td>145665</td>\n <td>145665</td>\n <td>145665</td>\n <td>145665</td>\n <td>145665</td>\n </tr>\n <tr>\n <th>Thursday</th>\n <td>145845</td>\n <td>145845</td>\n <td>145845</td>\n <td>145845</td>\n <td>145845</td>\n <td>145845</td>\n <td>145845</td>\n <td>145845</td>\n </tr>\n <tr>\n <th>Friday</th>\n <td>145845</td>\n <td>145845</td>\n <td>145845</td>\n <td>145845</td>\n <td>145845</td>\n <td>145845</td>\n <td>145845</td>\n <td>145845</td>\n </tr>\n <tr>\n <th>Saturday</th>\n <td>144730</td>\n <td>144730</td>\n <td>144730</td>\n <td>144730</td>\n <td>144730</td>\n <td>144730</td>\n <td>144730</td>\n <td>144730</td>\n </tr>\n <tr>\n <th>Sunday</th>\n <td>144730</td>\n <td>144730</td>\n <td>144730</td>\n <td>144730</td>\n <td>144730</td>\n <td>144730</td>\n <td>144730</td>\n <td>144730</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"_uuid": "78ae9f36bb676a9af6d83e0f8218aeb689f3c4d0",
"_cell_guid": "32eb1869-4ba5-4a9b-8647-6e3d28937c4f",
"trusted": true
},
"cell_type": "code",
"source": "train_df.head()",
"execution_count": 26,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 26,
"data": {
"text/plain": " Store Date Sales Customers Open Promo StateHoliday \\\n0 1 2015-07-31 5263 555 1 1 0 \n1 2 2015-07-31 6064 625 1 1 0 \n2 3 2015-07-31 8314 821 1 1 0 \n3 4 2015-07-31 13995 1498 1 1 0 \n4 5 2015-07-31 4822 559 1 1 0 \n\n SchoolHoliday weekdays \n0 1 Friday \n1 1 Friday \n2 1 Friday \n3 1 Friday \n4 1 Friday ",
"text/html": "<div>\n<style>\n .dataframe thead tr:only-child th {\n text-align: right;\n }\n\n .dataframe thead th {\n text-align: left;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Store</th>\n <th>Date</th>\n <th>Sales</th>\n <th>Customers</th>\n <th>Open</th>\n <th>Promo</th>\n <th>StateHoliday</th>\n <th>SchoolHoliday</th>\n <th>weekdays</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>2015-07-31</td>\n <td>5263</td>\n <td>555</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>Friday</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>2015-07-31</td>\n <td>6064</td>\n <td>625</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>Friday</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>2015-07-31</td>\n <td>8314</td>\n <td>821</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>Friday</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>2015-07-31</td>\n <td>13995</td>\n <td>1498</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>Friday</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>2015-07-31</td>\n <td>4822</td>\n <td>559</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>Friday</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"_uuid": "e726da9a1cdd83f4a9c71b64569a83f56f1c34dd",
"_cell_guid": "557fbe94-7087-4365-ab2d-01f23fc59411"
},
"cell_type": "markdown",
"source": ""
},
{
"metadata": {
"_uuid": "4d1a887f473329c8c6b1ef6499ccf3fdec74bb60",
"_cell_guid": "1ba8d587-4db9-4cda-ae77-4e6416a321bf"
},
"cell_type": "markdown",
"source": "## **Changing the index**"
},
{
"metadata": {
"_uuid": "e79cc81a8733b81d1cf8c77c3aecdaf9ede600ad",
"collapsed": true,
"_cell_guid": "c5b9945b-6fc7-4b4c-bac6-020aecd9fb30",
"trusted": true
},
"cell_type": "code",
"source": "# .index to set the index equal to the Date feature\ntrain_df.index = train_df.Date\n# Let's drop the \"old\" Date Feature because we no longer need it since it's values are \n# now the index.\n# .drop() to drop the feature\ntrain_df.drop([\"Date\"], axis = 1, inplace = True)",
"execution_count": 27,
"outputs": []
},
{
"metadata": {
"_uuid": "3226de66ab295ca4f81f1473722cd07e77acc668",
"_cell_guid": "f1cdc436-cbc0-4330-8aeb-687213b5e123",
"trusted": true
},
"cell_type": "code",
"source": "train_df.head()",
"execution_count": 28,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 28,
"data": {
"text/plain": " Store Sales Customers Open Promo StateHoliday SchoolHoliday \\\nDate \n2015-07-31 1 5263 555 1 1 0 1 \n2015-07-31 2 6064 625 1 1 0 1 \n2015-07-31 3 8314 821 1 1 0 1 \n2015-07-31 4 13995 1498 1 1 0 1 \n2015-07-31 5 4822 559 1 1 0 1 \n\n weekdays \nDate \n2015-07-31 Friday \n2015-07-31 Friday \n2015-07-31 Friday \n2015-07-31 Friday \n2015-07-31 Friday ",
"text/html": "<div>\n<style>\n .dataframe thead tr:only-child th {\n text-align: right;\n }\n\n .dataframe thead th {\n text-align: left;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Store</th>\n <th>Sales</th>\n <th>Customers</th>\n <th>Open</th>\n <th>Promo</th>\n <th>StateHoliday</th>\n <th>SchoolHoliday</th>\n <th>weekdays</th>\n </tr>\n <tr>\n <th>Date</th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>2015-07-31</th>\n <td>1</td>\n <td>5263</td>\n <td>555</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>Friday</td>\n </tr>\n <tr>\n <th>2015-07-31</th>\n <td>2</td>\n <td>6064</td>\n <td>625</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>Friday</td>\n </tr>\n <tr>\n <th>2015-07-31</th>\n <td>3</td>\n <td>8314</td>\n <td>821</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>Friday</td>\n </tr>\n <tr>\n <th>2015-07-31</th>\n <td>4</td>\n <td>13995</td>\n <td>1498</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>Friday</td>\n </tr>\n <tr>\n <th>2015-07-31</th>\n <td>5</td>\n <td>4822</td>\n <td>559</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>Friday</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"_uuid": "5250f2d9be5bb385ec8f1df33e7179e0a70dc40c",
"_cell_guid": "0b409143-9101-4b4c-b9d5-c25017099f47"
},
"cell_type": "markdown",
"source": "Instead of 0, 1, 2, 3, 4... we now have the actual Dates as index on the left of the dataframe. "
},
{
"metadata": {
"_uuid": "44f9a507e2a951c30caf0fcd0bb4cea0d66bbe63",
"_cell_guid": "dcc959f6-fee0-4c6f-9bd8-1104157b21d3"
},
"cell_type": "markdown",
"source": "----------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n"
},
{
"metadata": {
"_uuid": "b673325cc2cd810a06f28bdc3f0efffeeead5aca",
"_cell_guid": "9bee979a-585b-46c9-adc1-bc5cec7957fa"
},
"cell_type": "markdown",
"source": ""
}
],
"metadata": {
"language_info": {
"name": "python",
"version": "3.6.4",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment