Created on Skills Network Labs
{"cells":[{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["<center>\n"," <img src=\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/labs/Module%201/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n","</center>\n","\n","# Data Visualization\n","\n","Estimated time needed: **30** minutes\n","\n","## Objectives\n","\n","After completing this lab you will be able to:\n","\n","- Create data visualizations with Python\n","- Use various Python libraries for visualization\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["## Introduction\n","\n","The aim of these labs is to introduce you to data visualization with Python as concretely and as consistently as possible. \n","Speaking of consistency, because there was no single _best_ data visualization library available for Python at the time these labs were created, we have to introduce different libraries and show their benefits as we discuss new visualization concepts. 
Doing so, we hope to make students well-rounded with visualization libraries and concepts so that they are able to judge and decide on the best visualitzation technique and tool for a given problem _and_ audience.\n","\n","Please make sure that you have completed the prerequisites for this course, namely [**Python Basics for Data Science**](https://www.edx.org/course/python-basics-for-data-science-2?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ) and [**Analyzing Data with Python**](https://www.edx.org/course/data-analysis-with-python?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ).\n","\n","**Note**: The majority of the plots and visualizations will be generated using data stored in _pandas_ dataframes. Therefore, in this lab, we provide a brief crash course on _pandas_. 
However, if you are interested in learning more about the _pandas_ library, detailed description and explanation of how to use it and how to clean, munge, and process data stored in a _pandas_ dataframe are provided in our course [**Analyzing Data with Python**](https://www.edx.org/course/data-analysis-with-python?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=emai
l.Newsletter.M12345678&cvo_campaign=000026UJ).\n","\n","* * *\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["## Table of Contents\n","\n","<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n","\n","1. [Exploring Datasets with _pandas_](#0)<br>\n"," 1.1 [The Dataset: Immigration to Canada from 1980 to 2013](#2)<br>\n"," 1.2 [_pandas_ Basics](#4) <br>\n"," 1.3 [_pandas_ Intermediate: Indexing and Selection](#6) <br>\n","2. [Visualizing Data using Matplotlib](#8) <br>\n"," 2.1 [Matplotlib: Standard Python Visualization Library](#10) <br>\n","3. [Line Plots](#12)\n"," </div>\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["# Exploring Datasets with _pandas_ <a id=\"0\"></a>\n","\n","_pandas_ is an essential data analysis toolkit for Python. From their [website](http://pandas.pydata.org?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ):\n","\n","> _pandas_ is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. 
It aims to be the fundamental high-level building block for doing practical, **real world** data analysis in Python.\n","\n","The course heavily relies on _pandas_ for data wrangling, analysis, and visualization. We encourage you to spend some time and familizare yourself with the _pandas_ API Reference: [http://pandas.pydata.org/pandas-docs/stable/api.html](http://pandas.pydata.org/pandas-docs/stable/api.html?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ).\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["## The Dataset: Immigration to Canada from 1980 to 2013 <a id=\"2\"></a>\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Dataset Source: [International migration flows to and from selected countries - The 2015 
revision](http://www.un.org/en/development/desa/population/migration/data/empirical2/migrationflows.shtml?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ).\n","\n","The dataset contains annual data on the flows of international immigrants as recorded by the countries of destination. The data presents both inflows and outflows according to the place of birth, citizenship or place of previous / next residence both for foreigners and nationals. The current version presents data pertaining to 45 countries.\n","\n","In this lab, we will focus on the Canadian immigration data.\n","\n","<img src = \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/labs/Module%201/images/DataSnapshot.png\" align=\"center\" width=900>\n","\n"," The Canada Immigration dataset can be fetched from <a href=\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/Data%20Files/Canada.xlsx\">here</a>.\n","\n","* * *\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["## _pandas_ Basics<a id=\"4\"></a>\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["The first thing we'll do is import two key data analysis modules: _pandas_ and 
**NumPy**.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":false},"outputs":[],"source":["import numpy as np # useful for scientific computing in Python\n","import pandas as pd # primary data structure library"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Let's download and import our primary Canadian Immigration dataset using the _pandas_ `read_excel()` function. Normally, before we can do that, we would need to install a module that _pandas_ requires to read Excel files. This module is **xlrd** (recent versions of _pandas_ read `.xlsx` files with **openpyxl** instead). For your convenience, we have pre-installed this module, so you do not have to worry about that. Otherwise, you would need to run the following line of code to install the **xlrd** module:\n","\n","```\n","!conda install -c anaconda xlrd --yes\n","```\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Now we are ready to read in our data.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":false},"outputs":[],"source":["df_can = pd.read_excel('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/Data%20Files/Canada.xlsx',\n"," sheet_name='Canada by Citizenship',\n"," skiprows=range(20),\n"," skipfooter=2)\n","\n","print('Data read into a pandas dataframe!')"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Let's view the top 5 rows of the dataset using the `head()` method.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["df_can.head()\n","# tip: You can specify the number of rows you'd like to see as follows: 
df_can.head(10) "]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["We can also view the bottom 5 rows of the dataset using the `tail()` method.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["df_can.tail()"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["When analyzing a dataset, it's always a good idea to start by getting basic information about your dataframe. We can do this by using the `info()` method.\n","\n","This method can be used to get a short summary of the dataframe.\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["df_can.info(verbose=False)"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["To get the list of column headers, we can use the dataframe's `.columns` attribute.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":false},"outputs":[],"source":["df_can.columns.values "]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Similarly, to get the list of indices, we use the `.index` attribute.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":true},"outputs":[],"source":["df_can.index.values"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Note: The default type of index and columns is NOT 
list.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":false},"outputs":[],"source":["print(type(df_can.columns))\n","print(type(df_can.index))"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["To get the index and columns as lists, we can use the `tolist()` method.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":false},"outputs":[],"source":["df_can.columns.tolist()\n","df_can.index.tolist()\n","\n","print(type(df_can.columns.tolist()))\n","print(type(df_can.index.tolist()))"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["To view the dimensions of the dataframe, we use the `.shape` attribute.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":false},"outputs":[],"source":["# size of dataframe (rows, columns)\n","df_can.shape "]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Note: The main types stored in _pandas_ objects are _float_, _int_, _bool_, _datetime64[ns]_ and _datetime64[ns, tz] (in >= 0.17.0)_, _timedelta[ns]_, _category (in >= 0.15.0)_, and _object_ (string). In addition, these dtypes have item sizes, e.g. int64 and int32. \n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Let's clean the data set to remove a few unnecessary columns. 
We can use the _pandas_ `drop()` method as follows:\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":false},"outputs":[],"source":["# in pandas axis=0 represents rows (default) and axis=1 represents columns.\n","df_can.drop(['AREA','REG','DEV','Type','Coverage'], axis=1, inplace=True)\n","df_can.head(2)"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Let's rename the columns so that they make sense. We can use the `rename()` method by passing in a dictionary of old and new names as follows:\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":false},"outputs":[],"source":["df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent', 'RegName':'Region'}, inplace=True)\n","df_can.columns"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["We will also add a 'Total' column that sums up the total immigrants by country over the entire period 1980 - 2013, as follows:\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":false},"outputs":[],"source":["# numeric_only=True restricts the sum to the numeric (year) columns and skips the text columns\n","df_can['Total'] = df_can.sum(axis=1, numeric_only=True)"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["We can check to see how many null values we have in each column of the dataset as follows:\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":true},"outputs":[],"source":["df_can.isnull().sum()"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Finally, let's view a quick summary of each column in our dataframe using the `describe()` 
method.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":false},"outputs":[],"source":["df_can.describe()"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["* * *\n","\n","## _pandas_ Intermediate: Indexing and Selection (slicing)<a id=\"6\"></a>\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["### Select Column\n","\n","**There are two ways to select a column by name:**\n","\n","Method 1: Quick and easy, but only works if the column name does NOT have spaces or special characters.\n","\n","```python\n"," df.column_name \n"," # returns a series\n","```\n","\n","Method 2: More robust, and can select multiple columns.\n","\n","```python\n"," df['column'] \n"," # returns a series\n","```\n","\n","```python\n"," df[['column 1', 'column 2']] \n"," # returns a dataframe\n","```\n","\n","* * *\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Example: Let's try selecting the list of countries ('Country').\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":true},"outputs":[],"source":["df_can.Country # returns a series"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Let's try selecting the list of countries ('Country') and the data for years: 1980 - 1985.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["df_can[['Country', 1980, 1981, 1982, 1983, 1984, 1985]] # returns a dataframe\n","# notice that 'Country' is a string, and the years are integers. 
\n","# for the sake of consistency, we will convert all column names to strings later on."]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["### Select Row\n","\n","There are two main ways to select rows:\n","\n","```python\n"," df.loc[label] \n"," #filters by the labels of the index/column\n"," df.iloc[index] \n"," #filters by the positions of the index/column\n","```\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Before we proceed, notice that the default index of the dataset is a numeric range from 0 to 194. This makes it very difficult to query by a specific country. For example, to search for data on Japan, we need to know the corresponding index value.\n","\n","This can be fixed very easily by setting the 'Country' column as the index using the `set_index()` method.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":true},"outputs":[],"source":["df_can.set_index('Country', inplace=True)\n","# tip: The opposite of set is reset. So to reset the index, we can use df_can.reset_index()"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":false},"outputs":[],"source":["df_can.head(3)"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":false},"outputs":[],"source":["# optional: to remove the name of the index\n","df_can.index.name = None"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Example: Let's view the number of immigrants from Japan (row 87) for the following scenarios:\n","\n","```\n","1. The full row data (all columns)\n","2. For year 2013\n","3. 
For years 1980 to 1985\n","```\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":true},"outputs":[],"source":["# 1. the full row data (all columns)\n","print(df_can.loc['Japan'])\n","\n","# alternate methods\n","print(df_can.iloc[87])\n","print(df_can[df_can.index == 'Japan'].T.squeeze())"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":true},"outputs":[],"source":["# 2. for year 2013\n","print(df_can.loc['Japan', 2013])\n","\n","# alternate method\n","print(df_can.iloc[87, 36]) # year 2013 is the last year column (just before 'Total'), with a positional index of 36"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# 3. for years 1980 to 1985\n","print(df_can.loc['Japan', [1980, 1981, 1982, 1983, 1984, 1985]])\n","print(df_can.iloc[87, [3, 4, 5, 6, 7, 8]])"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Column names that are integers (such as the years) might introduce some confusion. For example, when we are referencing the year 2013, one might confuse that with the 2013th positional index. 
\n","\n","To avoid this ambiguity, let's convert the column names into strings: '1980' to '2013'.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":false},"outputs":[],"source":["df_can.columns = list(map(str, df_can.columns))\n","# [print (type(x)) for x in df_can.columns.values] #<-- uncomment to check type of column headers"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Since we converted the years to strings, let's declare a variable that will allow us to easily refer to the full range of years:\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":false},"outputs":[],"source":["# useful for plotting later on\n","years = list(map(str, range(1980, 2014)))\n","years"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["### Filtering based on criteria\n","\n","To filter the dataframe based on a condition, we simply pass the condition as a boolean vector. \n","\n","For example, let's filter the dataframe to show the data on Asian countries (Continent = Asia).\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":true},"outputs":[],"source":["# 1. create the condition boolean series\n","condition = df_can['Continent'] == 'Asia'\n","print(condition)"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# 2. pass this condition into the dataframe\n","df_can[condition]"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["# we can pass multiple criteria in the same line. 
\n","# let's filter for AreaNAme = Asia and RegName = Southern Asia\n","\n","df_can[(df_can['Continent']=='Asia') & (df_can['Region']=='Southern Asia')]\n","\n","# note: When using 'and' and 'or' operators, pandas requires we use '&' and '|' instead of 'and' and 'or'\n","# don't forget to enclose the two conditions in parentheses"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Before we proceed: let's review the changes we have made to our dataframe.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":true},"outputs":[],"source":["print('data dimensions:', df_can.shape)\n","print(df_can.columns)\n","df_can.head(2)"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["* * *\n","\n","# Visualizing Data using Matplotlib<a id=\"8\"></a>\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["## Matplotlib: Standard Python Visualization Library<a id=\"10\"></a>\n","\n","The primary plotting library we will explore in the course is [Matplotlib](http://matplotlib.org?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ). As mentioned on their website: \n","\n","> Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. 
Matplotlib can be used in Python scripts, the Python and IPython shell, the jupyter notebook, web application servers, and four graphical user interface toolkits.\n","\n","If you are aspiring to create impactful visualizations with Python, Matplotlib is an essential tool to have at your disposal.\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["### Matplotlib.Pyplot\n","\n","One of the core aspects of Matplotlib is `matplotlib.pyplot`. It is Matplotlib's scripting layer, which we studied in detail in the videos about Matplotlib. Recall that it is a collection of command-style functions that make Matplotlib work like MATLAB. Each `pyplot` function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc. In this lab, we will work with the scripting layer to learn how to generate line plots. In future labs, we will also work with the Artist layer to see firsthand how it differs from the scripting layer. 
\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Let's start by importing `matplotlib` and `matplotlib.pyplot` as follows:\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":false},"outputs":[],"source":["# we are using the inline backend\n","%matplotlib inline \n","\n","import matplotlib as mpl\n","import matplotlib.pyplot as plt"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["*optional: check if Matplotlib is loaded.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":false},"outputs":[],"source":["print('Matplotlib version: ', mpl.__version__) # >= 2.0.0"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["*optional: apply a style to Matplotlib.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":false},"outputs":[],"source":["print(plt.style.available)\n","mpl.style.use(['ggplot']) # optional: for ggplot-like style"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["### Plotting in _pandas_\n","\n","Fortunately, _pandas_ has a built-in wrapper around Matplotlib that we can use. 
Plotting in _pandas_ is as simple as calling the `.plot()` method on a series or dataframe.\n","\n","Documentation:\n","\n","- [Plotting with Series](http://pandas.pydata.org/pandas-docs/stable/api.html#plotting?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ)<br>\n","- [Plotting with Dataframes](http://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-plotting?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ)\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["# Line Plots (Series/Dataframe) <a id=\"12\"></a>\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["**What is a line plot and why use it?**\n","\n","A line chart or line plot is a type of plot which displays information as a series of data points called 'markers' connected by straight line segments. It is a basic type of chart common in many fields.\n","Use a line plot when you have a continuous data set. Line plots are best suited for trend-based visualizations of data over a period of time.\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["**Let's start with a case study:**\n","\n","In 2010, Haiti suffered a catastrophic magnitude 7.0 earthquake. The quake caused widespread devastation and loss of life, and about three million people were affected by this natural disaster. As part of Canada's humanitarian effort, the Government of Canada stepped up its effort in accepting refugees from Haiti. 
We can quickly visualize this effort using a `Line` plot:\n","\n","**Question:** Plot a line graph of immigration from Haiti using `df.plot()`.\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["First, we will extract the data series for Haiti.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":false},"outputs":[],"source":["haiti = df_can.loc['Haiti', years] # passing in years 1980 - 2013 to exclude the 'Total' column\n","haiti.head()"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Next, we will draw a line plot by calling `.plot()` on the `haiti` series.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":false},"outputs":[],"source":["haiti.plot()"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["_pandas_ automatically populated the x-axis with the index values (years), and the y-axis with the series values (number of immigrants). However, notice how the years were not displayed because they are of type _string_. 
Therefore, let's change the type of the index values to _integer_ for plotting.\n","\n","Also, let's add a title and label the x and y axes using `plt.title()`, `plt.ylabel()`, and `plt.xlabel()` as follows:\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":true},"outputs":[],"source":["haiti.index = haiti.index.map(int) # let's change the index values of Haiti to type integer for plotting\n","haiti.plot(kind='line')\n","\n","plt.title('Immigration from Haiti')\n","plt.ylabel('Number of Immigrants')\n","plt.xlabel('Years')\n","\n","plt.show() # need this line to show the updates made to the figure"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["We can clearly see how the number of immigrants from Haiti spiked from 2010 onward as Canada stepped up its efforts to accept refugees from Haiti. Let's annotate this spike in the plot by using the `plt.text()` method.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["haiti.plot(kind='line')\n","\n","plt.title('Immigration from Haiti')\n","plt.ylabel('Number of Immigrants')\n","plt.xlabel('Years')\n","\n","# annotate the 2010 Earthquake. \n","# syntax: plt.text(x, y, label)\n","plt.text(2000, 6000, '2010 Earthquake') # see note below\n","\n","plt.show() "]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["With just a few lines of code, you were able to quickly identify and visualize the spike in immigration!\n","\n","Quick note on x and y values in `plt.text(x, y, label)`:\n","\n","```\n"," Since the x-axis (years) is type 'integer', we specified x as a year. 
The y axis (number of immigrants) is type 'integer', so we can just specify the value y = 6000.\n","```\n","\n","```python\n"," plt.text(2000, 6000, '2010 Earthquake') # years stored as type int\n","```\n","\n","```\n","If the years were stored as type 'string', we would need to specify x as the index position of the year. E.g., index position 20 is the year 2000, since 2000 is 20 years after the base year of 1980.\n","```\n","\n","```python\n"," plt.text(20, 6000, '2010 Earthquake') # years stored as type str\n","```\n","\n","```\n","We will cover advanced annotation methods in later modules.\n","```\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["We can easily add more countries to the line plot to make meaningful comparisons of immigration from different countries. \n","\n","**Question:** Let's compare the number of immigrants from India and China from 1980 to 2013.\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Step 1: Get the data set for China and India, and display the dataframe.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":true},"outputs":[],"source":["### type your answer here\n","\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["<details><summary>Click here for a sample python solution</summary>\n","\n","```python\n"," #The correct answer is:\n"," df_CI = df_can.loc[['India', 'China'], years]\n"," df_CI.head()\n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["Step 2: Plot graph. 
We will explicitly specify a line plot by passing the `kind` parameter to `plot()`.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false},"scrolled":true},"outputs":[],"source":["### type your answer here\n","\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["<details><summary>Click here for a sample python solution</summary>\n","\n","```python\n"," #The correct answer is:\n"," df_CI.plot(kind='line')\n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["That doesn't look right...\n","\n","Recall that _pandas_ plots the indices on the x-axis and the columns as individual lines on the y-axis. Since `df_CI` is a dataframe with the `country` as the index and `years` as the columns, we must first transpose the dataframe using the `transpose()` method to swap the rows and columns.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["df_CI = df_CI.transpose()\n","df_CI.head()"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["_pandas_ will automatically graph the two countries on the same graph. Go ahead and plot the new transposed dataframe. 
Make sure to add a title to the plot and label the axes.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["### type your answer here\n","\n","\n","\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["<details><summary>Click here for a sample python solution</summary>\n","\n","```python\n"," #The correct answer is:\n"," df_CI.index = df_CI.index.map(int) # let's change the index values of df_CI to type integer for plotting\n"," df_CI.plot(kind='line')\n","\n"," plt.title('Immigrants from China and India')\n"," plt.ylabel('Number of Immigrants')\n"," plt.xlabel('Years')\n","\n"," plt.show()\n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["<br>From the above plot, we can observe that China and India have very similar immigration trends through the years. \n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["_Note_: How come we didn't need to transpose Haiti's data before plotting (like we did for df_CI)?\n","\n","That's because `haiti` is a series as opposed to a dataframe, and has the years as its indices as shown below. \n","\n","```python\n","print(type(haiti))\n","print(haiti.head(5))\n","```\n","\n","> class 'pandas.core.series.Series' <br>\n","> 1980 1666 <br>\n","> 1981 3692 <br>\n","> 1982 3498 <br>\n","> 1983 2860 <br>\n","> 1984 1418 <br>\n","> Name: Haiti, dtype: int64 <br>\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["A line plot is a handy tool to display several dependent variables against one independent variable. 
However, it is recommended to plot no more than 5-10 lines on a single graph; any more than that and it becomes difficult to interpret.\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["**Question:** Compare the trends of the top 5 countries that contributed the most to immigration to Canada.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"outputs":[],"source":["### type your answer here\n","\n","\n","\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["<details><summary>Click here for a sample python solution</summary>\n","\n","```python\n"," #The correct answer is: \n"," #Step 1: Get the dataset. Recall that we created a Total column that calculates cumulative immigration by country. \n"," #We will sort on this column to get our top 5 countries using pandas sort_values() method.\n"," \n"," # the inplace=True parameter saves the changes to the original df_can dataframe\n"," df_can.sort_values(by='Total', ascending=False, axis=0, inplace=True)\n","\n"," # get the top 5 entries\n"," df_top5 = df_can.head(5)\n","\n"," # transpose the dataframe\n"," df_top5 = df_top5[years].transpose() \n","\n"," print(df_top5)\n","\n","\n"," #Step 2: Plot the dataframe. 
To make the plot more readable, we will change the size using the `figsize` parameter.\n"," df_top5.index = df_top5.index.map(int) # let's change the index values of df_top5 to type integer for plotting\n"," df_top5.plot(kind='line', figsize=(14, 8)) # pass a tuple (x, y) size\n","\n","\n","\n"," plt.title('Immigration Trend of Top 5 Countries')\n"," plt.ylabel('Number of Immigrants')\n"," plt.xlabel('Years')\n","\n","\n"," plt.show()\n","\n","```\n","\n","</details>\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["### Other Plots\n","\n","Congratulations! You have learned how to wrangle data with Python and create a line plot with Matplotlib. There are many other plotting styles available other than the default line plot, all of which can be accessed by passing the `kind` keyword to `plot()`. The full list of available plots is as follows:\n","\n","- `bar` for vertical bar plots\n","- `barh` for horizontal bar plots\n","- `hist` for histogram\n","- `box` for boxplot\n","- `kde` or `density` for density plots\n","- `area` for area plots\n","- `pie` for pie plots\n","- `scatter` for scatter plots\n","- `hexbin` for hexbin plot\n"]},{"cell_type":"markdown","metadata":{"button":false,"new_sheet":false,"run_control":{"read_only":false}},"source":["### Thank you for completing this lab!\n","\n","## Author\n","\n","<a href=\"https://www.linkedin.com/in/aklson/\" target=\"_blank\">Alex Aklson</a>\n","\n","### Other Contributors\n","\n","[Jay Rajasekharan](https://www.linkedin.com/in/jayrajasekharan?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ)\n","[Ehsan M. 
Kermani](https://www.linkedin.com/in/ehsanmkermani?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ)\n","[Slobodan Markovic](https://www.linkedin.com/in/slobodan-markovic?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ).\n","\n","## Change Log\n","\n","| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n","| ----------------- | ------- | ------------- | ---------------------------------- |\n","| 2021-01-20 | 2.3 | Lakshmi Holla | Changed TOC cell markdown |\n","| 2020-11-20 | 2.2 | Lakshmi Holla | Changed IBM box URL |\n","| 2020-11-03 | 2.1 | Lakshmi Holla | Changed URL and info method |\n","| 2020-08-27 | 2.0 | Lavanya | Moved Lab to course repo in GitLab |\n","| | | | |\n","| | | | |\n","\n","## <h3 align=\"center\"> © IBM Corporation 2020. All rights reserved. 
</h3>\n"]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.7.6"},"widgets":{"state":{},"version":"1.1.2"}},"nbformat":4,"nbformat_minor":2}