Skip to content

Instantly share code, notes, and snippets.

@aristhera
Created April 18, 2020 15:51
Show Gist options
  • Save aristhera/7d2d127b1c192508fb168fca6cdf3791 to your computer and use it in GitHub Desktop.
Save aristhera/7d2d127b1c192508fb168fca6cdf3791 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n <a href=\"https://cocl.us/PY0101EN_edx_add_top\">\n <img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/PY0101EN/Ad/TopAd.png\" width=\"750\" align=\"center\">\n </a>\n</div>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<a href=\"https://cognitiveclass.ai/\">\n <img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/PY0101EN/Ad/CCLog.png\" width=\"200\" align=\"center\">\n</a>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<h1>Introduction to Pandas Python</h1>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<p><strong>Welcome!</strong> This notebook will teach you about using <code>Pandas</code> in the Python Programming Language. By the end of this lab, you'll know how to use <code>Pandas</code> package to view and access data.</p>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<h2>Table of Contents</h2>\n<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n <ul>\n <li><a href=\"dataset\">About the Dataset</a></li>\n <li><a href=\"pandas\">Introduction of <code>Pandas</code></a></li>\n <li><a href=\"data\">Viewing Data and Accessing Data</a></li>\n <li><a href=\"quiz\">Quiz on DataFrame</a></li>\n </ul>\n <p>\n Estimated time needed: <strong>15 min</strong>\n </p>\n</div>\n\n<hr>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<h2 id=\"dataset\">About the Dataset</h2>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The table has one row for each album and several columns\n\n<ul>\n <li><b>artist</b>: Name of the artist</li>\n <li><b>album</b>: Name of the album</li>\n <li><b>released_year</b>: Year the album was released</li>\n <li><b>length_min_sec</b>: Length of the album (hours,minutes,seconds)</li>\n <li><b>genre</b>: Genre of the album</li>\n <li><b>music_recording_sales_millions</b>: Music recording sales (millions in USD) on <a href=\"http://www.song-database.com/\">[SONG://DATABASE]</a></li>\n <li><b>claimed_sales_millions</b>: Album's claimed sales (millions in USD) on <a href=\"http://www.song-database.com/\">[SONG://DATABASE]</a></li>\n <li><b>date_released</b>: Date on which the album was released</li>\n <li><b>soundtrack</b>: Indicates if the album is the movie soundtrack (Y) or (N)</li>\n <li><b>rating_of_friends</b>: Indicates the rating from your friends from 1 to 10</li>\n</ul>\n\nYou can see the dataset here:\n\n<font size=\"1\">\n<table font-size:xx-small style=\"width:25%\">\n <tr>\n <th>Artist</th>\n <th>Album</th> \n <th>Released</th>\n <th>Length</th>\n <th>Genre</th> \n <th>Music recording sales (millions)</th>\n <th>Claimed sales (millions)</th>\n <th>Released</th>\n <th>Soundtrack</th>\n <th>Rating (friends)</th>\n </tr>\n <tr>\n <td>Michael Jackson</td>\n <td>Thriller</td> \n <td>1982</td>\n <td>00:42:19</td>\n <td>Pop, rock, R&B</td>\n <td>46</td>\n <td>65</td>\n <td>30-Nov-82</td>\n <td></td>\n <td>10.0</td>\n </tr>\n <tr>\n <td>AC/DC</td>\n <td>Back in Black</td> \n <td>1980</td>\n <td>00:42:11</td>\n <td>Hard rock</td>\n <td>26.1</td>\n <td>50</td>\n <td>25-Jul-80</td>\n <td></td>\n <td>8.5</td>\n </tr>\n <tr>\n <td>Pink Floyd</td>\n <td>The Dark Side of the Moon</td> \n <td>1973</td>\n <td>00:42:49</td>\n <td>Progressive rock</td>\n <td>24.2</td>\n <td>45</td>\n <td>01-Mar-73</td>\n <td></td>\n <td>9.5</td>\n </tr>\n <tr>\n <td>Whitney Houston</td>\n <td>The Bodyguard</td> \n <td>1992</td>\n <td>00:57:44</td>\n <td>Soundtrack/R&B, soul, pop</td>\n <td>26.1</td>\n <td>50</td>\n <td>25-Jul-80</td>\n <td>Y</td>\n <td>7.0</td>\n </tr>\n <tr>\n <td>Meat Loaf</td>\n <td>Bat Out of Hell</td> \n <td>1977</td>\n <td>00:46:33</td>\n <td>Hard rock, progressive rock</td>\n <td>20.6</td>\n <td>43</td>\n <td>21-Oct-77</td>\n <td></td>\n <td>7.0</td>\n </tr>\n <tr>\n <td>Eagles</td>\n <td>Their Greatest Hits (1971-1975)</td> \n <td>1976</td>\n <td>00:43:08</td>\n <td>Rock, soft rock, folk rock</td>\n <td>32.2</td>\n <td>42</td>\n <td>17-Feb-76</td>\n <td></td>\n <td>9.5</td>\n </tr>\n <tr>\n <td>Bee Gees</td>\n <td>Saturday Night Fever</td> \n <td>1977</td>\n <td>1:15:54</td>\n <td>Disco</td>\n <td>20.6</td>\n <td>40</td>\n <td>15-Nov-77</td>\n <td>Y</td>\n <td>9.0</td>\n </tr>\n <tr>\n <td>Fleetwood Mac</td>\n <td>Rumours</td> \n <td>1977</td>\n <td>00:40:01</td>\n <td>Soft rock</td>\n <td>27.9</td>\n <td>40</td>\n <td>04-Feb-77</td>\n <td></td>\n <td>9.5</td>\n </tr>\n</table></font>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<hr>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<h2 id=\"pandas\">Introduction of <code>Pandas</code></h2>"
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "Requirement already satisfied: xlrd in /opt/conda/envs/Python36/lib/python3.6/site-packages (1.2.0)\r\n"
}
],
"source": "# Dependency needed to install file \n\n!pip install xlrd"
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": "# Import required library\n\nimport pandas as pd"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "After the import command, we now have access to a large number of pre-built classes and functions. This assumes the library is installed; in our lab environment all the necessary libraries are installed. One way pandas allows you to work with data is a dataframe. Let's go through the process to go from a comma separated values (<b>.csv</b>) file to a dataframe. This variable <code>csv_path</code> stores the path of the <b>.csv</b>, that is used as an argument to the <code>read_csv</code> function. The result is stored in the object <code>df</code>, this is a common short form used for a variable referring to a Pandas dataframe. "
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": "# Read data from CSV file\n\ncsv_path = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/PY0101EN/Chapter%204/Datasets/TopSellingAlbums.csv'\ndf = pd.read_csv(csv_path)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We can use the method <code>head()</code> to examine the first five rows of a dataframe: "
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Artist</th>\n <th>Album</th>\n <th>Released</th>\n <th>Length</th>\n <th>Genre</th>\n <th>Music Recording Sales (millions)</th>\n <th>Claimed Sales (millions)</th>\n <th>Released.1</th>\n <th>Soundtrack</th>\n <th>Rating</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>Michael Jackson</td>\n <td>Thriller</td>\n <td>1982</td>\n <td>0:42:19</td>\n <td>pop, rock, R&amp;B</td>\n <td>46.0</td>\n <td>65</td>\n <td>30-Nov-82</td>\n <td>NaN</td>\n <td>10.0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>AC/DC</td>\n <td>Back in Black</td>\n <td>1980</td>\n <td>0:42:11</td>\n <td>hard rock</td>\n <td>26.1</td>\n <td>50</td>\n <td>25-Jul-80</td>\n <td>NaN</td>\n <td>9.5</td>\n </tr>\n <tr>\n <th>2</th>\n <td>Pink Floyd</td>\n <td>The Dark Side of the Moon</td>\n <td>1973</td>\n <td>0:42:49</td>\n <td>progressive rock</td>\n <td>24.2</td>\n <td>45</td>\n <td>01-Mar-73</td>\n <td>NaN</td>\n <td>9.0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>Whitney Houston</td>\n <td>The Bodyguard</td>\n <td>1992</td>\n <td>0:57:44</td>\n <td>R&amp;B, soul, pop</td>\n <td>27.4</td>\n <td>44</td>\n <td>17-Nov-92</td>\n <td>Y</td>\n <td>8.5</td>\n </tr>\n <tr>\n <th>4</th>\n <td>Meat Loaf</td>\n <td>Bat Out of Hell</td>\n <td>1977</td>\n <td>0:46:33</td>\n <td>hard rock, progressive rock</td>\n <td>20.6</td>\n <td>43</td>\n <td>21-Oct-77</td>\n <td>NaN</td>\n <td>8.0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " Artist Album Released Length \\\n0 Michael Jackson Thriller 1982 0:42:19 \n1 AC/DC Back in Black 1980 0:42:11 \n2 Pink Floyd The Dark Side of the Moon 1973 0:42:49 \n3 Whitney Houston The Bodyguard 1992 0:57:44 \n4 Meat Loaf Bat Out of Hell 1977 0:46:33 \n\n Genre Music Recording Sales (millions) \\\n0 pop, rock, R&B 46.0 \n1 hard rock 26.1 \n2 progressive rock 24.2 \n3 R&B, soul, pop 27.4 \n4 hard rock, progressive rock 20.6 \n\n Claimed Sales (millions) Released.1 Soundtrack Rating \n0 65 30-Nov-82 NaN 10.0 \n1 50 25-Jul-80 NaN 9.5 \n2 45 01-Mar-73 NaN 9.0 \n3 44 17-Nov-92 Y 8.5 \n4 43 21-Oct-77 NaN 8.0 "
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Print first five rows of the dataframe\n\ndf.head()"
},
{
"cell_type": "markdown",
"metadata": {},
"source": " We use the path of the excel file and the function <code>read_excel</code>. The result is a data frame as before:"
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Artist</th>\n <th>Album</th>\n <th>Released</th>\n <th>Length</th>\n <th>Genre</th>\n <th>Music Recording Sales (millions)</th>\n <th>Claimed Sales (millions)</th>\n <th>Released.1</th>\n <th>Soundtrack</th>\n <th>Rating</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>Michael Jackson</td>\n <td>Thriller</td>\n <td>1982</td>\n <td>00:42:19</td>\n <td>pop, rock, R&amp;B</td>\n <td>46.0</td>\n <td>65</td>\n <td>1982-11-30</td>\n <td>NaN</td>\n <td>10.0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>AC/DC</td>\n <td>Back in Black</td>\n <td>1980</td>\n <td>00:42:11</td>\n <td>hard rock</td>\n <td>26.1</td>\n <td>50</td>\n <td>1980-07-25</td>\n <td>NaN</td>\n <td>9.5</td>\n </tr>\n <tr>\n <th>2</th>\n <td>Pink Floyd</td>\n <td>The Dark Side of the Moon</td>\n <td>1973</td>\n <td>00:42:49</td>\n <td>progressive rock</td>\n <td>24.2</td>\n <td>45</td>\n <td>1973-03-01</td>\n <td>NaN</td>\n <td>9.0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>Whitney Houston</td>\n <td>The Bodyguard</td>\n <td>1992</td>\n <td>00:57:44</td>\n <td>R&amp;B, soul, pop</td>\n <td>27.4</td>\n <td>44</td>\n <td>1992-11-17</td>\n <td>Y</td>\n <td>8.5</td>\n </tr>\n <tr>\n <th>4</th>\n <td>Meat Loaf</td>\n <td>Bat Out of Hell</td>\n <td>1977</td>\n <td>00:46:33</td>\n <td>hard rock, progressive rock</td>\n <td>20.6</td>\n <td>43</td>\n <td>1977-10-21</td>\n <td>NaN</td>\n <td>8.0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " Artist Album Released Length \\\n0 Michael Jackson Thriller 1982 00:42:19 \n1 AC/DC Back in Black 1980 00:42:11 \n2 Pink Floyd The Dark Side of the Moon 1973 00:42:49 \n3 Whitney Houston The Bodyguard 1992 00:57:44 \n4 Meat Loaf Bat Out of Hell 1977 00:46:33 \n\n Genre Music Recording Sales (millions) \\\n0 pop, rock, R&B 46.0 \n1 hard rock 26.1 \n2 progressive rock 24.2 \n3 R&B, soul, pop 27.4 \n4 hard rock, progressive rock 20.6 \n\n Claimed Sales (millions) Released.1 Soundtrack Rating \n0 65 1982-11-30 NaN 10.0 \n1 50 1980-07-25 NaN 9.5 \n2 45 1973-03-01 NaN 9.0 \n3 44 1992-11-17 Y 8.5 \n4 43 1977-10-21 NaN 8.0 "
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Read data from Excel File and print the first five rows\n\nxlsx_path = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/PY0101EN/Chapter%204/Datasets/TopSellingAlbums.xlsx'\n\ndf = pd.read_excel(xlsx_path)\ndf.head()"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We can access the column <b>Length</b> and assign it a new dataframe <b>x</b>:"
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Length</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>00:42:19</td>\n </tr>\n <tr>\n <th>1</th>\n <td>00:42:11</td>\n </tr>\n <tr>\n <th>2</th>\n <td>00:42:49</td>\n </tr>\n <tr>\n <th>3</th>\n <td>00:57:44</td>\n </tr>\n <tr>\n <th>4</th>\n <td>00:46:33</td>\n </tr>\n <tr>\n <th>5</th>\n <td>00:43:08</td>\n </tr>\n <tr>\n <th>6</th>\n <td>01:15:54</td>\n </tr>\n <tr>\n <th>7</th>\n <td>00:40:01</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " Length\n0 00:42:19\n1 00:42:11\n2 00:42:49\n3 00:57:44\n4 00:46:33\n5 00:43:08\n6 01:15:54\n7 00:40:01"
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Access to the column Length\n\nx = df[['Length']]\nx"
},
{
"cell_type": "markdown",
"metadata": {},
"source": " The process is shown in the figure: "
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/PY0101EN/Chapter%204/Images/DataEgOne.png\" width=\"750\" />"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<hr>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<h2 id=\"data\">Viewing Data and Accessing Data</h2>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "You can also get a column as a series. You can think of a Pandas series as a 1-D dataframe. Just use one bracket: "
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "0 00:42:19\n1 00:42:11\n2 00:42:49\n3 00:57:44\n4 00:46:33\n5 00:43:08\n6 01:15:54\n7 00:40:01\nName: Length, dtype: object"
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Get the column as a series\n\nx = df['Length']\nx"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "You can also get a column as a dataframe. For example, we can assign the column <b>Artist</b>:"
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Artist</th>\n <th>Album</th>\n <th>Released</th>\n <th>Length</th>\n <th>Genre</th>\n <th>Music Recording Sales (millions)</th>\n <th>Claimed Sales (millions)</th>\n <th>Released.1</th>\n <th>Soundtrack</th>\n <th>Rating</th>\n <th>artist</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>Michael Jackson</td>\n <td>Thriller</td>\n <td>1982</td>\n <td>00:42:19</td>\n <td>pop, rock, R&amp;B</td>\n <td>46.0</td>\n <td>65</td>\n <td>1982-11-30</td>\n <td>NaN</td>\n <td>10.0</td>\n <td>Michael Jackson</td>\n </tr>\n <tr>\n <th>1</th>\n <td>AC/DC</td>\n <td>Back in Black</td>\n <td>1980</td>\n <td>00:42:11</td>\n <td>hard rock</td>\n <td>26.1</td>\n <td>50</td>\n <td>1980-07-25</td>\n <td>NaN</td>\n <td>9.5</td>\n <td>AC/DC</td>\n </tr>\n <tr>\n <th>2</th>\n <td>Pink Floyd</td>\n <td>The Dark Side of the Moon</td>\n <td>1973</td>\n <td>00:42:49</td>\n <td>progressive rock</td>\n <td>24.2</td>\n <td>45</td>\n <td>1973-03-01</td>\n <td>NaN</td>\n <td>9.0</td>\n <td>Pink Floyd</td>\n </tr>\n <tr>\n <th>3</th>\n <td>Whitney Houston</td>\n <td>The Bodyguard</td>\n <td>1992</td>\n <td>00:57:44</td>\n <td>R&amp;B, soul, pop</td>\n <td>27.4</td>\n <td>44</td>\n <td>1992-11-17</td>\n <td>Y</td>\n <td>8.5</td>\n <td>Whitney Houston</td>\n </tr>\n <tr>\n <th>4</th>\n <td>Meat Loaf</td>\n <td>Bat Out of Hell</td>\n <td>1977</td>\n <td>00:46:33</td>\n <td>hard rock, progressive rock</td>\n <td>20.6</td>\n <td>43</td>\n <td>1977-10-21</td>\n <td>NaN</td>\n <td>8.0</td>\n <td>Meat Loaf</td>\n </tr>\n <tr>\n <th>5</th>\n <td>Eagles</td>\n <td>Their Greatest Hits (1971-1975)</td>\n <td>1976</td>\n <td>00:43:08</td>\n <td>rock, soft rock, folk rock</td>\n <td>32.2</td>\n <td>42</td>\n <td>1976-02-17</td>\n <td>NaN</td>\n <td>7.5</td>\n <td>Eagles</td>\n </tr>\n <tr>\n <th>6</th>\n <td>Bee Gees</td>\n <td>Saturday Night Fever</td>\n <td>1977</td>\n <td>01:15:54</td>\n <td>disco</td>\n <td>20.6</td>\n <td>40</td>\n <td>1977-11-15</td>\n <td>Y</td>\n <td>7.0</td>\n <td>Bee Gees</td>\n </tr>\n <tr>\n <th>7</th>\n <td>Fleetwood Mac</td>\n <td>Rumours</td>\n <td>1977</td>\n <td>00:40:01</td>\n <td>soft rock</td>\n <td>27.9</td>\n <td>40</td>\n <td>1977-02-04</td>\n <td>NaN</td>\n <td>6.5</td>\n <td>Fleetwood Mac</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " Artist Album Released Length \\\n0 Michael Jackson Thriller 1982 00:42:19 \n1 AC/DC Back in Black 1980 00:42:11 \n2 Pink Floyd The Dark Side of the Moon 1973 00:42:49 \n3 Whitney Houston The Bodyguard 1992 00:57:44 \n4 Meat Loaf Bat Out of Hell 1977 00:46:33 \n5 Eagles Their Greatest Hits (1971-1975) 1976 00:43:08 \n6 Bee Gees Saturday Night Fever 1977 01:15:54 \n7 Fleetwood Mac Rumours 1977 00:40:01 \n\n Genre Music Recording Sales (millions) \\\n0 pop, rock, R&B 46.0 \n1 hard rock 26.1 \n2 progressive rock 24.2 \n3 R&B, soul, pop 27.4 \n4 hard rock, progressive rock 20.6 \n5 rock, soft rock, folk rock 32.2 \n6 disco 20.6 \n7 soft rock 27.9 \n\n Claimed Sales (millions) Released.1 Soundtrack Rating artist \n0 65 1982-11-30 NaN 10.0 Michael Jackson \n1 50 1980-07-25 NaN 9.5 AC/DC \n2 45 1973-03-01 NaN 9.0 Pink Floyd \n3 44 1992-11-17 Y 8.5 Whitney Houston \n4 43 1977-10-21 NaN 8.0 Meat Loaf \n5 42 1976-02-17 NaN 7.5 Eagles \n6 40 1977-11-15 Y 7.0 Bee Gees \n7 40 1977-02-04 NaN 6.5 Fleetwood Mac "
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Get the column as a dataframe\n\nx = df.assign(artist=df['Artist']) #df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)\nx"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "You can do the same thing for multiple columns; we just put the dataframe name, in this case, <code>df</code>, and the name of the multiple column headers enclosed in double brackets. The result is a new dataframe comprised of the specified columns:"
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Artist</th>\n <th>Length</th>\n <th>Genre</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>Michael Jackson</td>\n <td>00:42:19</td>\n <td>pop, rock, R&amp;B</td>\n </tr>\n <tr>\n <th>1</th>\n <td>AC/DC</td>\n <td>00:42:11</td>\n <td>hard rock</td>\n </tr>\n <tr>\n <th>2</th>\n <td>Pink Floyd</td>\n <td>00:42:49</td>\n <td>progressive rock</td>\n </tr>\n <tr>\n <th>3</th>\n <td>Whitney Houston</td>\n <td>00:57:44</td>\n <td>R&amp;B, soul, pop</td>\n </tr>\n <tr>\n <th>4</th>\n <td>Meat Loaf</td>\n <td>00:46:33</td>\n <td>hard rock, progressive rock</td>\n </tr>\n <tr>\n <th>5</th>\n <td>Eagles</td>\n <td>00:43:08</td>\n <td>rock, soft rock, folk rock</td>\n </tr>\n <tr>\n <th>6</th>\n <td>Bee Gees</td>\n <td>01:15:54</td>\n <td>disco</td>\n </tr>\n <tr>\n <th>7</th>\n <td>Fleetwood Mac</td>\n <td>00:40:01</td>\n <td>soft rock</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " Artist Length Genre\n0 Michael Jackson 00:42:19 pop, rock, R&B\n1 AC/DC 00:42:11 hard rock\n2 Pink Floyd 00:42:49 progressive rock\n3 Whitney Houston 00:57:44 R&B, soul, pop\n4 Meat Loaf 00:46:33 hard rock, progressive rock\n5 Eagles 00:43:08 rock, soft rock, folk rock\n6 Bee Gees 01:15:54 disco\n7 Fleetwood Mac 00:40:01 soft rock"
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Access to multiple columns\n\ny = df[['Artist','Length','Genre']]\ny"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The process is shown in the figure:"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<img src = \"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/PY0101EN/Chapter%204/Images/DataEgTwo.png\" width=\"1100\" />"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "One way to access unique elements is the <code>iloc</code> method, where you can access the 1st row and the 1st column as follows:"
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "'Michael Jackson'"
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Access the value on the first row and the first column\n\ndf.iloc[0, 0]"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "You can access the 2nd row and the 1st column as follows:"
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "'AC/DC'"
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Access the value on the second row and the first column\n\ndf.iloc[1,0]"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "You can access the 1st row and the 3rd column as follows: "
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "1982"
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Access the value on the first row and the third column\n\ndf.iloc[0,2]"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "You can access the column using the name as well, the following are the same as above: "
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "'Michael Jackson'"
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Access the column using the name\n\ndf.loc[0, 'Artist']"
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "'AC/DC'"
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Access the column using the name\n\ndf.loc[1, 'Artist']"
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "1982"
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Access the column using the name\n\ndf.loc[0, 'Released']"
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "1980"
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Access the column using the name\n\ndf.loc[1, 'Released']"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "You can perform slicing using both the index and the name of the column:"
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Artist</th>\n <th>Album</th>\n <th>Released</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>Michael Jackson</td>\n <td>Thriller</td>\n <td>1982</td>\n </tr>\n <tr>\n <th>1</th>\n <td>AC/DC</td>\n <td>Back in Black</td>\n <td>1980</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " Artist Album Released\n0 Michael Jackson Thriller 1982\n1 AC/DC Back in Black 1980"
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Slicing the dataframe\n\ndf.iloc[0:2, 0:3]"
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Artist</th>\n <th>Album</th>\n <th>Released</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>Michael Jackson</td>\n <td>Thriller</td>\n <td>1982</td>\n </tr>\n <tr>\n <th>1</th>\n <td>AC/DC</td>\n <td>Back in Black</td>\n <td>1980</td>\n </tr>\n <tr>\n <th>2</th>\n <td>Pink Floyd</td>\n <td>The Dark Side of the Moon</td>\n <td>1973</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " Artist Album Released\n0 Michael Jackson Thriller 1982\n1 AC/DC Back in Black 1980\n2 Pink Floyd The Dark Side of the Moon 1973"
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Slicing the dataframe using name\n\ndf.loc[0:2, 'Artist':'Released']"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<hr>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<h2 id=\"quiz\">Quiz on DataFrame</h2>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Use a variable <code>q</code> to store the column <b>Rating</b> as a dataframe"
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Rating</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>10.0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>9.5</td>\n </tr>\n <tr>\n <th>2</th>\n <td>9.0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>8.5</td>\n </tr>\n <tr>\n <th>4</th>\n <td>8.0</td>\n </tr>\n <tr>\n <th>5</th>\n <td>7.5</td>\n </tr>\n <tr>\n <th>6</th>\n <td>7.0</td>\n </tr>\n <tr>\n <th>7</th>\n <td>6.5</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " Rating\n0 10.0\n1 9.5\n2 9.0\n3 8.5\n4 8.0\n5 7.5\n6 7.0\n7 6.5"
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Write your code below and press Shift+Enter to execute\nq=df[['Rating']]\nq"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Double-click __here__ for the solution.\n\n<!-- Your answer is below:\nq = df[['Rating']]\nq\n-->"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Assign the variable <code>q</code> to the dataframe that is made up of the column <b>Released</b> and <b>Artist</b>:"
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Artist</th>\n <th>Released</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>Michael Jackson</td>\n <td>1982</td>\n </tr>\n <tr>\n <th>1</th>\n <td>AC/DC</td>\n <td>1980</td>\n </tr>\n <tr>\n <th>2</th>\n <td>Pink Floyd</td>\n <td>1973</td>\n </tr>\n <tr>\n <th>3</th>\n <td>Whitney Houston</td>\n <td>1992</td>\n </tr>\n <tr>\n <th>4</th>\n <td>Meat Loaf</td>\n <td>1977</td>\n </tr>\n <tr>\n <th>5</th>\n <td>Eagles</td>\n <td>1976</td>\n </tr>\n <tr>\n <th>6</th>\n <td>Bee Gees</td>\n <td>1977</td>\n </tr>\n <tr>\n <th>7</th>\n <td>Fleetwood Mac</td>\n <td>1977</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " Artist Released\n0 Michael Jackson 1982\n1 AC/DC 1980\n2 Pink Floyd 1973\n3 Whitney Houston 1992\n4 Meat Loaf 1977\n5 Eagles 1976\n6 Bee Gees 1977\n7 Fleetwood Mac 1977"
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Write your code below and press Shift+Enter to execute - df.assign(temp_f=df['temp_c'] * 9 / 5 + 32) y = df[['Artist','Length','Genre']]\nq = df[['Artist','Released']]\nq\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Double-click __here__ for the solution.\n\n<!-- Your answer is below:\nq = df[['Released', 'Artist']]\nq\n-->"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Access the 2nd row and the 3rd column of <code>df</code>:"
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "1980"
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# Write your code below and press Shift+Enter to execute\ndf.iloc[1,2]"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Double-click __here__ for the solution.\n\n<!-- Your answer is below:\ndf.iloc[1, 2]\n-->"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<hr>\n<h2>The last exercise!</h2>\n<p>Congratulations, you have completed your first lesson and hands-on lab in Python. However, there is one more thing you need to do. The Data Science community encourages sharing work. The best way to share and showcase your work is to share it on GitHub. By sharing your notebook on GitHub you are not only building your reputation with fellow data scientists, but you can also show it off when applying for a job. Even though this was your first piece of work, it is never too early to start building good habits. So, please read and follow <a href=\"https://cognitiveclass.ai/blog/data-scientists-stand-out-by-sharing-your-notebooks/\" target=\"_blank\">this article</a> to learn how to share your work.\n<hr>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n<h2>Get IBM Watson Studio free of charge!</h2>\n <p><a href=\"https://cocl.us/PY0101EN_edx_add_bbottom\"><img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/PY0101EN/Ad/BottomAd.png\" width=\"750\" align=\"center\"></a></p>\n</div>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<h3>About the Authors:</h3> \n<p><a href=\"https://www.linkedin.com/in/joseph-s-50398b136/\" target=\"_blank\">Joseph Santarcangelo</a> is a Data Scientist at IBM, and holds a PhD in Electrical Engineering. His research focused on using Machine Learning, Signal Processing, and Computer Vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.</p>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Other contributors: <a href=\"www.linkedin.com/in/jiahui-mavis-zhou-a4537814a\">Mavis Zhou</a>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<hr>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<p>Copyright &copy; 2018 IBM Developer Skills Network. This notebook and its source code are released under the terms of the <a href=\"https://cognitiveclass.ai/mit-license/\">MIT License</a>.</p>"
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment