Skip to content

Instantly share code, notes, and snippets.

@maryamalqasmi
Created August 26, 2021 11:59
Show Gist options
  • Save maryamalqasmi/697971776ca267d563651ca1eb0f9c3d to your computer and use it in GitHub Desktop.
Save maryamalqasmi/697971776ca267d563651ca1eb0f9c3d to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "<center>\n <img src=\"https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n</center>\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "# **Space X Falcon 9 First Stage Landing Prediction**\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Web scraping Falcon 9 and Falcon Heavy Launches Records from Wikipedia\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Estimated time needed: **40** minutes\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "In this lab, you will be performing web scraping to collect Falcon 9 historical launch records from a Wikipedia page titled `List of Falcon 9 and Falcon Heavy launches`\n\n[https://en.wikipedia.org/wiki/List_of_Falcon\\_9\\_and_Falcon_Heavy_launches](https://en.wikipedia.org/wiki/List_of_Falcon\\_9\\_and_Falcon_Heavy_launches?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01)\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module\\_1\\_L2/images/Falcon9\\_rocket_family.svg)\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Falcon 9 first stage will land successfully\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/landing\\_1.gif)\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Several examples of an unsuccessful landing are shown here:\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/crash.gif)\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "More specifically, the launch records are stored in a HTML table shown below:\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module\\_1\\_L2/images/falcon9-launches-wiki.png)\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Objectives\n\nWeb scrap Falcon 9 launch records with `BeautifulSoup`:\n\n* Extract a Falcon 9 launch records HTML table from Wikipedia\n* Parse the table and convert it into a Pandas data frame\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "First let's import required packages for this lab\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!pip3 install beautifulsoup4\n!pip3 install requests"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "import sys\n\nimport requests\nfrom bs4 import BeautifulSoup\nimport re\nimport unicodedata\nimport pandas as pd"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "and we will provide some helper functions for you to process web scraped HTML table\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "def date_time(table_cells):\n \"\"\"\n This function returns the data and time from the HTML table cell\n Input: the element of a table data cell extracts extra row\n \"\"\"\n return [data_time.strip() for data_time in list(table_cells.strings)][0:2]\n\ndef booster_version(table_cells):\n \"\"\"\n This function returns the booster version from the HTML table cell \n Input: the element of a table data cell extracts extra row\n \"\"\"\n out=''.join([booster_version for i,booster_version in enumerate( table_cells.strings) if i%2==0][0:-1])\n return out\n\ndef landing_status(table_cells):\n \"\"\"\n This function returns the landing status from the HTML table cell \n Input: the element of a table data cell extracts extra row\n \"\"\"\n out=[i for i in table_cells.strings][0]\n return out\n\n\ndef get_mass(table_cells):\n mass=unicodedata.normalize(\"NFKD\", table_cells.text).strip()\n if mass:\n mass.find(\"kg\")\n new_mass=mass[0:mass.find(\"kg\")+2]\n else:\n new_mass=0\n return new_mass\n\n\ndef extract_column_from_header(row):\n \"\"\"\n This function returns the landing status from the HTML table cell \n Input: the element of a table data cell extracts extra row\n \"\"\"\n if (row.br):\n row.br.extract()\n if row.a:\n row.a.extract()\n if row.sup:\n row.sup.extract()\n \n colunm_name = ' '.join(row.contents)\n \n # Filter the digit and empty names\n if not(colunm_name.strip().isdigit()):\n colunm_name = colunm_name.strip()\n return colunm_name \n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "To keep the lab tasks consistent, you will be asked to scrape the data from a snapshot of the `List of Falcon 9 and Falcon Heavy launches` Wikipage updated on\n`9th June 2021`\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "static_url = \"https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922\""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Next, request the HTML page from the above URL and get a `response` object\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### TASK 1: Request the Falcon9 Launch Wiki page from its URL\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "First, let's perform an HTTP GET method to request the Falcon9 Launch HTML page, as an HTTP response.\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# use requests.get() method with the provided static_url\n# assign the response to a object"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Create a `BeautifulSoup` object from the HTML `response`\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Use BeautifulSoup() to create a BeautifulSoup object from a response text content"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Print the page title to verify if the `BeautifulSoup` object was created properly\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Use soup.title attribute"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### TASK 2: Extract all column/variable names from the HTML table header\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Next, we want to collect all relevant column names from the HTML table header\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Let's try to find all tables on the wiki page first. If you need to refresh your memory about `BeautifulSoup`, please check the external reference link towards the end of this lab\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Use the find_all function in the BeautifulSoup object, with element type `table`\n# Assign the result to a list called `html_tables`\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Starting from the third table is our target table contains the actual launch records.\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Let's print the third table and check its content\nfirst_launch_table = html_tables[2]\nprint(first_launch_table)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "You should able to see the columns names embedded in the table header elements `<th>` as follows:\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "```\n<tr>\n<th scope=\"col\">Flight No.\n</th>\n<th scope=\"col\">Date and<br/>time (<a href=\"/wiki/Coordinated_Universal_Time\" title=\"Coordinated Universal Time\">UTC</a>)\n</th>\n<th scope=\"col\"><a href=\"/wiki/List_of_Falcon_9_first-stage_boosters\" title=\"List of Falcon 9 first-stage boosters\">Version,<br/>Booster</a> <sup class=\"reference\" id=\"cite_ref-booster_11-0\"><a href=\"#cite_note-booster-11\">[b]</a></sup>\n</th>\n<th scope=\"col\">Launch site\n</th>\n<th scope=\"col\">Payload<sup class=\"reference\" id=\"cite_ref-Dragon_12-0\"><a href=\"#cite_note-Dragon-12\">[c]</a></sup>\n</th>\n<th scope=\"col\">Payload mass\n</th>\n<th scope=\"col\">Orbit\n</th>\n<th scope=\"col\">Customer\n</th>\n<th scope=\"col\">Launch<br/>outcome\n</th>\n<th scope=\"col\"><a href=\"/wiki/Falcon_9_first-stage_landing_tests\" title=\"Falcon 9 first-stage landing tests\">Booster<br/>landing</a>\n</th></tr>\n```\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Next, we just need to iterate through the `<th>` elements and apply the provided `extract_column_from_header()` to extract column name one by one\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "column_names = []\n\n# Apply find_all() function with `th` element on first_launch_table\n# Iterate each th element and apply the provided extract_column_from_header() to get a column name\n# Append the Non-empty column name (`if name is not None and len(name) > 0`) into a list called column_names\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Check the extracted column names\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "print(column_names)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## TASK 3: Create a data frame by parsing the launch HTML tables\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We will create an empty dictionary with keys from the extracted column names in the previous task. Later, this dictionary will be converted into a Pandas dataframe\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "launch_dict= dict.fromkeys(column_names)\n\n# Remove an irrelvant column\ndel launch_dict['Date and time ( )']\n\n# Let's initial the launch_dict with each value to be an empty list\nlaunch_dict['Flight No.'] = []\nlaunch_dict['Launch site'] = []\nlaunch_dict['Payload'] = []\nlaunch_dict['Payload mass'] = []\nlaunch_dict['Orbit'] = []\nlaunch_dict['Customer'] = []\nlaunch_dict['Launch outcome'] = []\n# Added some new columns\nlaunch_dict['Version Booster']=[]\nlaunch_dict['Booster landing']=[]\nlaunch_dict['Date']=[]\nlaunch_dict['Time']=[]"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Next, we just need to fill up the `launch_dict` with launch records extracted from table rows.\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Usually, HTML tables in Wiki pages are likely to contain unexpected annotations and other types of noises, such as reference links `B0004.1[8]`, missing values `N/A [e]`, inconsistent formatting, etc.\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "To simplify the parsing process, we have provided an incomplete code snippet below to help you to fill up the `launch_dict`. Please complete the following code snippet with TODOs or you can choose to write your own logic to parse all launch tables:\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "extracted_row = 0\n#Extract each table \nfor table_number,table in enumerate(soup.find_all('table',\"wikitable plainrowheaders collapsible\")):\n # get table row \n for rows in table.find_all(\"tr\"):\n #check to see if first table heading is as number corresponding to launch a number \n if rows.th:\n if rows.th.string:\n flight_number=rows.th.string.strip()\n flag=flight_number.isdigit()\n else:\n flag=False\n #get table element \n row=rows.find_all('td')\n #if it is number save cells in a dictonary \n if flag:\n extracted_row += 1\n # Flight Number value\n # TODO: Append the flight_number into launch_dict with key `Flight No.`\n #print(flight_number)\n datatimelist=date_time(row[0])\n \n # Date value\n # TODO: Append the date into launch_dict with key `Date`\n date = datatimelist[0].strip(',')\n #print(date)\n \n # Time value\n # TODO: Append the time into launch_dict with key `Time`\n time = datatimelist[1]\n #print(time)\n \n # Booster version\n # TODO: Append the bv into launch_dict with key `Version Booster`\n bv=booster_version(row[1])\n if not(bv):\n bv=row[1].a.string\n print(bv)\n \n # Launch Site\n # TODO: Append the bv into launch_dict with key `Launch Site`\n launch_site = row[2].a.string\n #print(launch_site)\n \n # Payload\n # TODO: Append the payload into launch_dict with key `Payload`\n payload = row[3].a.string\n #print(payload)\n \n # Payload Mass\n # TODO: Append the payload_mass into launch_dict with key `Payload mass`\n payload_mass = get_mass(row[4])\n #print(payload)\n \n # Orbit\n # TODO: Append the orbit into launch_dict with key `Orbit`\n orbit = row[5].a.string\n #print(orbit)\n \n # Customer\n # TODO: Append the customer into launch_dict with key `Customer`\n customer = row[6].a.string\n #print(customer)\n \n # Launch outcome\n # TODO: Append the launch_outcome into launch_dict with key `Launch outcome`\n launch_outcome = list(row[7].strings)[0]\n #print(launch_outcome)\n \n # Booster landing\n # TODO: Append the launch_outcome into launch_dict with key `Booster landing`\n booster_landing = landing_status(row[8])\n #print(booster_landing)\n "
},
{
"cell_type": "markdown",
"metadata": {},
"source": "After you have fill in the parsed launch record values into `launch_dict`, you can create a dataframe from it.\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "df=pd.DataFrame(launch_dict)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We can now export it to a <b>CSV</b> for the next section, but to make the answers consistent and in case you have difficulties finishing this lab.\n\nFollowing labs will be using a provided dataset to make each lab independent.\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<code>df.to_csv('spacex_web_scraped.csv', index=False)</code>\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Authors\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<a href=\"https://www.linkedin.com/in/yan-luo-96288783/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01\">Yan Luo</a>\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<a href=\"https://www.linkedin.com/in/nayefaboutayoun/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01\">Nayef Abou Tayoun</a>\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Change Log\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n| ----------------- | ------- | ---------- | --------------------------- |\n| 2021-06-09 | 1.0 | Yan Luo | Tasks updates |\n| 2020-11-10 | 1.0 | Nayef | Created the initial version |\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Copyright \u00a9 2021 IBM Corporation. All rights reserved.\n"
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment