IBM Data Analyst Capstone Project > Week 1 > Lab 4: Collecting Data Using Web Scraping
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "<center>\n <img src=\"https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n</center>\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# **Hands-on Lab : Web Scraping**\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Estimated time needed: **30 to 45** minutes\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Objectives\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "In this lab you will perform the following:\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "- Extract information from a given web site \n- Write the scraped data into a csv file.\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Extract information from the given web site\n\nYou will extract the data from the below web site: <br> \n"
},
{
"metadata": {},
"cell_type": "code",
"source": "#this url contains the data you need to scrape\nurl = \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/Programming_Languages.html\"",
"execution_count": 26,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "The data you need to scrape is the **name of the programming language** and **average annual salary**.<br> It is a good idea to open the url in your web broswer and study the contents of the web page before you start to scrape.\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Import the required libraries\n"
},
{
"metadata": {},
"cell_type": "code",
"source": "# Your code here\nfrom bs4 import BeautifulSoup # this module helps in web scrapping.\nimport requests # this module helps us to download a web page\nimport pandas as pd\nimport numpy as np",
"execution_count": 27,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Download the webpage at the url\n"
},
{
"metadata": {},
"cell_type": "code",
"source": "#your code goes here\n# get the contents of the webpage in text format and store in a variable called data\ndata = requests.get(url).text",
"execution_count": 28,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Create a soup object\n"
},
{
"metadata": {},
"cell_type": "code",
"source": "#your code goes here\nsoup = BeautifulSoup(data,\"html5lib\") # create a soup object using the variable 'data'",
"execution_count": 29,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Scrape the `Language name` and `annual average salary`.\n"
},
{
"metadata": {},
"cell_type": "code",
"source": "#your code goes here\n#find a html table in the web page\ntable = soup.find('table') # in html table is represented by the tag <table>",
"execution_count": 30,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Save the scrapped data into a file named _popular-languages.csv_\n"
},
{
"metadata": {},
"cell_type": "code",
"source": "# Get all rows from the table\nlanguage_list = []\nfor row in table.find_all('tr'): # in html table row is represented by the tag <tr>\n # Get all columns in each row.\n cols = row.find_all('td') # in html a column is represented by the tag <td>\n language = cols[1].getText() # store the value in column 3 as language_name\n salary = cols[3].getText() # store the value in column 4 as annual_average_salary\n language_list.append([language,salary])\n\n# convert to dataframe:\ndf_lang = pd.DataFrame(language_list, columns=['Language','Average Annual Salary']) \n\n# save as csv:\nfilename = \"popular-languages.csv\"\ndf_lang.to_csv(filename)\n\n# now print out the file:\ndf = pd.read_csv(filename, header=1) \ndf",
"execution_count": 61,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 61,
"data": {
"text/plain": " 0 Language Average Annual Salary\n0 1 Python $114,383\n1 2 Java $101,013\n2 3 R $92,037\n3 4 Javascript $110,981\n4 5 Swift $130,801\n5 6 C++ $113,865\n6 7 C# $88,726\n7 8 PHP $84,727\n8 9 SQL $84,793\n9 10 Go $94,082",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>0</th>\n <th>Language</th>\n <th>Average Annual Salary</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>Python</td>\n <td>$114,383</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>Java</td>\n <td>$101,013</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>R</td>\n <td>$92,037</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>Javascript</td>\n <td>$110,981</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>Swift</td>\n <td>$130,801</td>\n </tr>\n <tr>\n <th>5</th>\n <td>6</td>\n <td>C++</td>\n <td>$113,865</td>\n </tr>\n <tr>\n <th>6</th>\n <td>7</td>\n <td>C#</td>\n <td>$88,726</td>\n </tr>\n <tr>\n <th>7</th>\n <td>8</td>\n <td>PHP</td>\n <td>$84,727</td>\n </tr>\n <tr>\n <th>8</th>\n <td>9</td>\n <td>SQL</td>\n <td>$84,793</td>\n </tr>\n <tr>\n <th>9</th>\n <td>10</td>\n <td>Go</td>\n <td>$94,082</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "code",
"source": "# use the inline backend to generate the plots within the browser\n%matplotlib inline \n\nimport matplotlib as mpl\nimport matplotlib.pyplot as plt\n\nmpl.style.use('ggplot') # optional: for ggplot-like style\n\n# check for latest version of Matplotlib\nprint ('Matplotlib version: ', mpl.__version__) # >= 2.0.0",
"execution_count": 74,
"outputs": [
{
"output_type": "stream",
"text": "Matplotlib version: 3.2.2\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "code",
"source": "#step 1 - clean data\ndf['Average Annual Salary'].replace('[\\$,]', '', regex=True, inplace=True)\ndf['Average Annual Salary'] = df['Average Annual Salary'].astype(float)\ndf.sort_values('Average Annual Salary', ascending=True, inplace=True)\n#del df['0']\ndf = df.set_index('Language')\ndf\n",
"execution_count": 79,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 79,
"data": {
"text/plain": " Average Annual Salary\nLanguage \nPHP 84727.0\nSQL 84793.0\nC# 88726.0\nR 92037.0\nGo 94082.0\nJava 101013.0\nJavascript 110981.0\nC++ 113865.0\nPython 114383.0\nSwift 130801.0",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Average Annual Salary</th>\n </tr>\n <tr>\n <th>Language</th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>PHP</th>\n <td>84727.0</td>\n </tr>\n <tr>\n <th>SQL</th>\n <td>84793.0</td>\n </tr>\n <tr>\n <th>C#</th>\n <td>88726.0</td>\n </tr>\n <tr>\n <th>R</th>\n <td>92037.0</td>\n </tr>\n <tr>\n <th>Go</th>\n <td>94082.0</td>\n </tr>\n <tr>\n <th>Java</th>\n <td>101013.0</td>\n </tr>\n <tr>\n <th>Javascript</th>\n <td>110981.0</td>\n </tr>\n <tr>\n <th>C++</th>\n <td>113865.0</td>\n </tr>\n <tr>\n <th>Python</th>\n <td>114383.0</td>\n </tr>\n <tr>\n <th>Swift</th>\n <td>130801.0</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "code",
"source": "# step 2: plot data\ndf.plot(kind='barh', figsize=(10, 6))\n\nplt.xlabel('Average Annual Salary') # add to x-label to the plot\nplt.ylabel('Programming Language') # add y-label to the plot\nplt.title('Average Annual Salary by Progamming Language') # add title to the plot\n\nplt.show()",
"execution_count": 80,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": "<Figure size 720x432 with 1 Axes>",
"image/png": "\n"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Authors\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Ramesh Sannareddy\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Other Contributors\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Rav Ahuja\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Change Log\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n| ----------------- | ------- | ----------------- | ---------------------------------- |\n| 2020-10-17 | 0.1 | Ramesh Sannareddy | Created initial version of the lab |\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": " Copyright \u00a9 2020 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ).\n"
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3.7",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.7.9",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
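
The notebook above keeps the table's header row in `language_list` and recovers it later with `header=1`, then strips `$` and `,` from the salaries in a separate step. As a rough alternative sketch, the same two steps can be done before the file is written, here using only the Python standard library instead of pandas. The helper names `parse_salary` and `rows_to_csv` are hypothetical and not part of the lab; the sample rows reuse values shown in the notebook output.

```python
import csv
import io


def parse_salary(text):
    """Convert a string such as '$114,383' to the float 114383.0."""
    return float(text.replace("$", "").replace(",", "").strip())


def rows_to_csv(rows):
    """Return CSV text for scraped rows, treating rows[0] as the header row."""
    header, *body = rows
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for language, salary in body:
        writer.writerow([language.strip(), parse_salary(salary)])
    return buffer.getvalue()


# Sample rows shaped like language_list in the notebook: the first entry is
# the table's own header row, the rest are values from the notebook output.
scraped_rows = [
    ["Language", "Average Annual Salary"],
    ["Python", "$114,383"],
    ["Java", "$101,013"],
    ["R", "$92,037"],
]

csv_text = rows_to_csv(scraped_rows)
print(csv_text)
```

Writing the header once and the numeric values directly means the file can later be read without `header=1` and without a leftover index column to drop.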
@priyansh8994

wow. loved this

@darryllamoureux
Author

@priyansh8994 - Thanks! This was part of the capstone project for the IBM Data Analyst specialization, available on Coursera.org. Check it out - it was very worthwhile. It not only improved my understanding of the languages and technology involved in data analysis, but also gave context on how to present data so that it is more meaningful to your audience.
