maryamalqasmi/scrap.ipynb

## scrap.ipynb
{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "<center>\n    <img src=\"https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\"  />\n</center>\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "# **Space X  Falcon 9 First Stage Landing Prediction**\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "## Web scraping Falcon 9 and Falcon Heavy Launches Records from Wikipedia\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "Estimated time needed: **40** minutes\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "In this lab, you will be performing web scraping to collect Falcon 9 historical launch records from a Wikipedia page titled `List of Falcon 9 and Falcon Heavy launches`\n\n[https://en.wikipedia.org/wiki/List_of_Falcon\\_9\\_and_Falcon_Heavy_launches](https://en.wikipedia.org/wiki/List_of_Falcon\\_9\\_and_Falcon_Heavy_launches?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01)\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module\\_1\\_L2/images/Falcon9\\_rocket_family.svg)\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "Falcon 9 first stage will land successfully\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/landing\\_1.gif)\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "Several examples of an unsuccessful landing are shown here:\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/crash.gif)\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "More specifically, the launch records are stored in a HTML table shown below:\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module\\_1\\_L2/images/falcon9-launches-wiki.png)\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "## Objectives\n\nWeb scrap Falcon 9 launch records with `BeautifulSoup`:\n\n*   Extract a Falcon 9 launch records HTML table from Wikipedia\n*   Parse the table and convert it into a Pandas data frame\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "First let's import required packages for this lab\n"
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": "!pip3 install beautifulsoup4\n!pip3 install requests"
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": "import sys\n\nimport requests\nfrom bs4 import BeautifulSoup\nimport re\nimport unicodedata\nimport pandas as pd"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "and we will provide some helper functions for you to process web scraped HTML table\n"
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": "def date_time(table_cells):\n    \"\"\"\n    This function returns the data and time from the HTML  table cell\n    Input: the  element of a table data cell extracts extra row\n    \"\"\"\n    return [data_time.strip() for data_time in list(table_cells.strings)][0:2]\n\ndef booster_version(table_cells):\n    \"\"\"\n    This function returns the booster version from the HTML  table cell \n    Input: the  element of a table data cell extracts extra row\n    \"\"\"\n    out=''.join([booster_version for i,booster_version in enumerate( table_cells.strings) if i%2==0][0:-1])\n    return out\n\ndef landing_status(table_cells):\n    \"\"\"\n    This function returns the landing status from the HTML table cell \n    Input: the  element of a table data cell extracts extra row\n    \"\"\"\n    out=[i for i in table_cells.strings][0]\n    return out\n\n\ndef get_mass(table_cells):\n    mass=unicodedata.normalize(\"NFKD\", table_cells.text).strip()\n    if mass:\n        mass.find(\"kg\")\n        new_mass=mass[0:mass.find(\"kg\")+2]\n    else:\n        new_mass=0\n    return new_mass\n\n\ndef extract_column_from_header(row):\n    \"\"\"\n    This function returns the landing status from the HTML table cell \n    Input: the  element of a table data cell extracts extra row\n    \"\"\"\n    if (row.br):\n        row.br.extract()\n    if row.a:\n        row.a.extract()\n    if row.sup:\n        row.sup.extract()\n        \n    colunm_name = ' '.join(row.contents)\n    \n    # Filter the digit and empty names\n    if not(colunm_name.strip().isdigit()):\n        colunm_name = colunm_name.strip()\n        return colunm_name    \n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "To keep the lab tasks consistent, you will be asked to scrape the data from a snapshot of the  `List of Falcon 9 and Falcon Heavy launches` Wikipage updated on\n`9th June 2021`\n"
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": "static_url = \"https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922\""
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "Next, request the HTML page from the above URL and get a `response` object\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "### TASK 1: Request the Falcon9 Launch Wiki page from its URL\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "First, let's perform an HTTP GET method to request the Falcon9 Launch HTML page, as an HTTP response.\n"
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": "# use requests.get() method with the provided static_url\n# assign the response to a object"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "Create a `BeautifulSoup` object from the HTML `response`\n"
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": "# Use BeautifulSoup() to create a BeautifulSoup object from a response text content"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "Print the page title to verify if the `BeautifulSoup` object was created properly\n"
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": "# Use soup.title attribute"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "### TASK 2: Extract all column/variable names from the HTML table header\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "Next, we want to collect all relevant column names from the HTML table header\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "Let's try to find all tables on the wiki page first. If you need to refresh your memory about `BeautifulSoup`, please check the external reference link towards the end of this lab\n"
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": "# Use the find_all function in the BeautifulSoup object, with element type `table`\n# Assign the result to a list called `html_tables`\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "Starting from the third table is our target table contains the actual launch records.\n"
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": "# Let's print the third table and check its content\nfirst_launch_table = html_tables[2]\nprint(first_launch_table)"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "You should able to see the columns names embedded in the table header elements `<th>` as follows:\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "```\n<tr>\n<th scope=\"col\">Flight No.\n</th>\n<th scope=\"col\">Date and<br/>time (<a href=\"/wiki/Coordinated_Universal_Time\" title=\"Coordinated Universal Time\">UTC</a>)\n</th>\n<th scope=\"col\"><a href=\"/wiki/List_of_Falcon_9_first-stage_boosters\" title=\"List of Falcon 9 first-stage boosters\">Version,<br/>Booster</a> <sup class=\"reference\" id=\"cite_ref-booster_11-0\"><a href=\"#cite_note-booster-11\">[b]</a></sup>\n</th>\n<th scope=\"col\">Launch site\n</th>\n<th scope=\"col\">Payload<sup class=\"reference\" id=\"cite_ref-Dragon_12-0\"><a href=\"#cite_note-Dragon-12\">[c]</a></sup>\n</th>\n<th scope=\"col\">Payload mass\n</th>\n<th scope=\"col\">Orbit\n</th>\n<th scope=\"col\">Customer\n</th>\n<th scope=\"col\">Launch<br/>outcome\n</th>\n<th scope=\"col\"><a href=\"/wiki/Falcon_9_first-stage_landing_tests\" title=\"Falcon 9 first-stage landing tests\">Booster<br/>landing</a>\n</th></tr>\n```\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "Next, we just need to iterate through the `<th>` elements and apply the provided `extract_column_from_header()` to extract column name one by one\n"
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": "column_names = []\n\n# Apply find_all() function with `th` element on first_launch_table\n# Iterate each th element and apply the provided extract_column_from_header() to get a column name\n# Append the Non-empty column name (`if name is not None and len(name) > 0`) into a list called column_names\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "Check the extracted column names\n"
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": "print(column_names)"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "## TASK 3: Create a data frame by parsing the launch HTML tables\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "We will create an empty dictionary with keys from the extracted column names in the previous task. Later, this dictionary will be converted into a Pandas dataframe\n"
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": "launch_dict= dict.fromkeys(column_names)\n\n# Remove an irrelvant column\ndel launch_dict['Date and time ( )']\n\n# Let's initial the launch_dict with each value to be an empty list\nlaunch_dict['Flight No.'] = []\nlaunch_dict['Launch site'] = []\nlaunch_dict['Payload'] = []\nlaunch_dict['Payload mass'] = []\nlaunch_dict['Orbit'] = []\nlaunch_dict['Customer'] = []\nlaunch_dict['Launch outcome'] = []\n# Added some new columns\nlaunch_dict['Version Booster']=[]\nlaunch_dict['Booster landing']=[]\nlaunch_dict['Date']=[]\nlaunch_dict['Time']=[]"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "Next, we just need to fill up the `launch_dict` with launch records extracted from table rows.\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "Usually, HTML tables in Wiki pages are likely to contain unexpected annotations and other types of noises, such as reference links `B0004.1[8]`, missing values `N/A [e]`, inconsistent formatting, etc.\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "To simplify the parsing process, we have provided an incomplete code snippet below to help you to fill up the `launch_dict`. Please complete the following code snippet with TODOs or you can choose to write your own logic to parse all launch tables:\n"
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": "extracted_row = 0\n#Extract each table \nfor table_number,table in enumerate(soup.find_all('table',\"wikitable plainrowheaders collapsible\")):\n   # get table row \n    for rows in table.find_all(\"tr\"):\n        #check to see if first table heading is as number corresponding to launch a number \n        if rows.th:\n            if rows.th.string:\n                flight_number=rows.th.string.strip()\n                flag=flight_number.isdigit()\n        else:\n            flag=False\n        #get table element \n        row=rows.find_all('td')\n        #if it is number save cells in a dictonary \n        if flag:\n            extracted_row += 1\n            # Flight Number value\n            # TODO: Append the flight_number into launch_dict with key `Flight No.`\n            #print(flight_number)\n            datatimelist=date_time(row[0])\n            \n            # Date value\n            # TODO: Append the date into launch_dict with key `Date`\n            date = datatimelist[0].strip(',')\n            #print(date)\n            \n            # Time value\n            # TODO: Append the time into launch_dict with key `Time`\n            time = datatimelist[1]\n            #print(time)\n              \n            # Booster version\n            # TODO: Append the bv into launch_dict with key `Version Booster`\n            bv=booster_version(row[1])\n            if not(bv):\n                bv=row[1].a.string\n            print(bv)\n            \n            # Launch Site\n            # TODO: Append the bv into launch_dict with key `Launch Site`\n            launch_site = row[2].a.string\n            #print(launch_site)\n            \n            # Payload\n            # TODO: Append the payload into launch_dict with key `Payload`\n            payload = row[3].a.string\n            #print(payload)\n            \n            # Payload Mass\n            # TODO: Append the payload_mass into launch_dict with key `Payload mass`\n            payload_mass = get_mass(row[4])\n            #print(payload)\n            \n            # Orbit\n            # TODO: Append the orbit into launch_dict with key `Orbit`\n            orbit = row[5].a.string\n            #print(orbit)\n            \n            # Customer\n            # TODO: Append the customer into launch_dict with key `Customer`\n            customer = row[6].a.string\n            #print(customer)\n            \n            # Launch outcome\n            # TODO: Append the launch_outcome into launch_dict with key `Launch outcome`\n            launch_outcome = list(row[7].strings)[0]\n            #print(launch_outcome)\n            \n            # Booster landing\n            # TODO: Append the launch_outcome into launch_dict with key `Booster landing`\n            booster_landing = landing_status(row[8])\n            #print(booster_landing)\n            "
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "After you have fill in the parsed launch record values into `launch_dict`, you can create a dataframe from it.\n"
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": "df=pd.DataFrame(launch_dict)"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "We can now export it to a <b>CSV</b> for the next section, but to make the answers consistent and in case you have difficulties finishing this lab.\n\nFollowing labs will be using a provided dataset to make each lab independent.\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "<code>df.to_csv('spacex_web_scraped.csv', index=False)</code>\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "## Authors\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "<a href=\"https://www.linkedin.com/in/yan-luo-96288783/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01\">Yan Luo</a>\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "<a href=\"https://www.linkedin.com/in/nayefaboutayoun/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01\">Nayef Abou Tayoun</a>\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "## Change Log\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "| Date (YYYY-MM-DD) | Version | Changed By | Change Description          |\n| ----------------- | ------- | ---------- | --------------------------- |\n| 2021-06-09        | 1.0     | Yan Luo    | Tasks updates               |\n| 2020-11-10        | 1.0     | Nayef      | Created the initial version |\n"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "Copyright \u00a9 2021 IBM Corporation. All rights reserved.\n"
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.8.8"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 4
}
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "<center>\n <img src=\"https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n</center>\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "# Space X Falcon 9 First Stage Landing Prediction\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "## Web scraping Falcon 9 and Falcon Heavy Launches Records from Wikipedia\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "Estimated time needed: 40 minutes\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "In this lab, you will be performing web scraping to collect Falcon 9 historical launch records from a Wikipedia page titled `List of Falcon 9 and Falcon Heavy launches`\n\n[https://en.wikipedia.org/wiki/List_of_Falcon\\_9\\_and_Falcon_Heavy_launches](https://en.wikipedia.org/wiki/List_of_Falcon\\_9\\_and_Falcon_Heavy_launches?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01)\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module\\_1\\_L2/images/Falcon9\\_rocket_family.svg)\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "Falcon 9 first stage will land successfully\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/landing\\_1.gif)\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "Several examples of an unsuccessful landing are shown here:\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/crash.gif)\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "More specifically, the launch records are stored in a HTML table shown below:\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module\\_1\\_L2/images/falcon9-launches-wiki.png)\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "## Objectives\n\nWeb scrap Falcon 9 launch records with `BeautifulSoup`:\n\n* Extract a Falcon 9 launch records HTML table from Wikipedia\n* Parse the table and convert it into a Pandas data frame\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "First let's import required packages for this lab\n"
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": "!pip3 install beautifulsoup4\n!pip3 install requests"
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": "import sys\n\nimport requests\nfrom bs4 import BeautifulSoup\nimport re\nimport unicodedata\nimport pandas as pd"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "and we will provide some helper functions for you to process web scraped HTML table\n"
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": "def date_time(table_cells):\n \"\"\"\n This function returns the data and time from the HTML table cell\n Input: the element of a table data cell extracts extra row\n \"\"\"\n return [data_time.strip() for data_time in list(table_cells.strings)][0:2]\n\ndef booster_version(table_cells):\n \"\"\"\n This function returns the booster version from the HTML table cell \n Input: the element of a table data cell extracts extra row\n \"\"\"\n out=''.join([booster_version for i,booster_version in enumerate( table_cells.strings) if i%2==0][0:-1])\n return out\n\ndef landing_status(table_cells):\n \"\"\"\n This function returns the landing status from the HTML table cell \n Input: the element of a table data cell extracts extra row\n \"\"\"\n out=[i for i in table_cells.strings][0]\n return out\n\n\ndef get_mass(table_cells):\n mass=unicodedata.normalize(\"NFKD\", table_cells.text).strip()\n if mass:\n mass.find(\"kg\")\n new_mass=mass[0:mass.find(\"kg\")+2]\n else:\n new_mass=0\n return new_mass\n\n\ndef extract_column_from_header(row):\n \"\"\"\n This function returns the landing status from the HTML table cell \n Input: the element of a table data cell extracts extra row\n \"\"\"\n if (row.br):\n row.br.extract()\n if row.a:\n row.a.extract()\n if row.sup:\n row.sup.extract()\n \n colunm_name = ' '.join(row.contents)\n \n # Filter the digit and empty names\n if not(colunm_name.strip().isdigit()):\n colunm_name = colunm_name.strip()\n return colunm_name \n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "To keep the lab tasks consistent, you will be asked to scrape the data from a snapshot of the `List of Falcon 9 and Falcon Heavy launches` Wikipage updated on\n`9th June 2021`\n"
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": "static_url = \"https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922\""
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "Next, request the HTML page from the above URL and get a `response` object\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "### TASK 1: Request the Falcon9 Launch Wiki page from its URL\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "First, let's perform an HTTP GET method to request the Falcon9 Launch HTML page, as an HTTP response.\n"
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": "# use requests.get() method with the provided static_url\n# assign the response to a object"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "Create a `BeautifulSoup` object from the HTML `response`\n"
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": "# Use BeautifulSoup() to create a BeautifulSoup object from a response text content"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "Print the page title to verify if the `BeautifulSoup` object was created properly\n"
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": "# Use soup.title attribute"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "### TASK 2: Extract all column/variable names from the HTML table header\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "Next, we want to collect all relevant column names from the HTML table header\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "Let's try to find all tables on the wiki page first. If you need to refresh your memory about `BeautifulSoup`, please check the external reference link towards the end of this lab\n"
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": "# Use the find_all function in the BeautifulSoup object, with element type `table`\n# Assign the result to a list called `html_tables`\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "Starting from the third table is our target table contains the actual launch records.\n"
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": "# Let's print the third table and check its content\nfirst_launch_table = html_tables[2]\nprint(first_launch_table)"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "You should able to see the columns names embedded in the table header elements `<th>` as follows:\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "```\n<tr>\n<th scope=\"col\">Flight No.\n</th>\n<th scope=\"col\">Date and<br/>time (<a href=\"/wiki/Coordinated_Universal_Time\" title=\"Coordinated Universal Time\">UTC</a>)\n</th>\n<th scope=\"col\"><a href=\"/wiki/List_of_Falcon_9_first-stage_boosters\" title=\"List of Falcon 9 first-stage boosters\">Version,<br/>Booster</a> <sup class=\"reference\" id=\"cite_ref-booster_11-0\"><a href=\"#cite_note-booster-11\">[b]</a></sup>\n</th>\n<th scope=\"col\">Launch site\n</th>\n<th scope=\"col\">Payload<sup class=\"reference\" id=\"cite_ref-Dragon_12-0\"><a href=\"#cite_note-Dragon-12\">[c]</a></sup>\n</th>\n<th scope=\"col\">Payload mass\n</th>\n<th scope=\"col\">Orbit\n</th>\n<th scope=\"col\">Customer\n</th>\n<th scope=\"col\">Launch<br/>outcome\n</th>\n<th scope=\"col\"><a href=\"/wiki/Falcon_9_first-stage_landing_tests\" title=\"Falcon 9 first-stage landing tests\">Booster<br/>landing</a>\n</th></tr>\n```\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "Next, we just need to iterate through the `<th>` elements and apply the provided `extract_column_from_header()` to extract column name one by one\n"
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": "column_names = []\n\n# Apply find_all() function with `th` element on first_launch_table\n# Iterate each th element and apply the provided extract_column_from_header() to get a column name\n# Append the Non-empty column name (`if name is not None and len(name) > 0`) into a list called column_names\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "Check the extracted column names\n"
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": "print(column_names)"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "## TASK 3: Create a data frame by parsing the launch HTML tables\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "We will create an empty dictionary with keys from the extracted column names in the previous task. Later, this dictionary will be converted into a Pandas dataframe\n"
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": "launch_dict= dict.fromkeys(column_names)\n\n# Remove an irrelvant column\ndel launch_dict['Date and time ( )']\n\n# Let's initial the launch_dict with each value to be an empty list\nlaunch_dict['Flight No.'] = []\nlaunch_dict['Launch site'] = []\nlaunch_dict['Payload'] = []\nlaunch_dict['Payload mass'] = []\nlaunch_dict['Orbit'] = []\nlaunch_dict['Customer'] = []\nlaunch_dict['Launch outcome'] = []\n# Added some new columns\nlaunch_dict['Version Booster']=[]\nlaunch_dict['Booster landing']=[]\nlaunch_dict['Date']=[]\nlaunch_dict['Time']=[]"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "Next, we just need to fill up the `launch_dict` with launch records extracted from table rows.\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "Usually, HTML tables in Wiki pages are likely to contain unexpected annotations and other types of noises, such as reference links `B0004.1[8]`, missing values `N/A [e]`, inconsistent formatting, etc.\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "To simplify the parsing process, we have provided an incomplete code snippet below to help you to fill up the `launch_dict`. Please complete the following code snippet with TODOs or you can choose to write your own logic to parse all launch tables:\n"
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": "extracted_row = 0\n#Extract each table \nfor table_number,table in enumerate(soup.find_all('table',\"wikitable plainrowheaders collapsible\")):\n # get table row \n for rows in table.find_all(\"tr\"):\n #check to see if first table heading is as number corresponding to launch a number \n if rows.th:\n if rows.th.string:\n flight_number=rows.th.string.strip()\n flag=flight_number.isdigit()\n else:\n flag=False\n #get table element \n row=rows.find_all('td')\n #if it is number save cells in a dictonary \n if flag:\n extracted_row += 1\n # Flight Number value\n # TODO: Append the flight_number into launch_dict with key `Flight No.`\n #print(flight_number)\n datatimelist=date_time(row[0])\n \n # Date value\n # TODO: Append the date into launch_dict with key `Date`\n date = datatimelist[0].strip(',')\n #print(date)\n \n # Time value\n # TODO: Append the time into launch_dict with key `Time`\n time = datatimelist[1]\n #print(time)\n \n # Booster version\n # TODO: Append the bv into launch_dict with key `Version Booster`\n bv=booster_version(row[1])\n if not(bv):\n bv=row[1].a.string\n print(bv)\n \n # Launch Site\n # TODO: Append the bv into launch_dict with key `Launch Site`\n launch_site = row[2].a.string\n #print(launch_site)\n \n # Payload\n # TODO: Append the payload into launch_dict with key `Payload`\n payload = row[3].a.string\n #print(payload)\n \n # Payload Mass\n # TODO: Append the payload_mass into launch_dict with key `Payload mass`\n payload_mass = get_mass(row[4])\n #print(payload)\n \n # Orbit\n # TODO: Append the orbit into launch_dict with key `Orbit`\n orbit = row[5].a.string\n #print(orbit)\n \n # Customer\n # TODO: Append the customer into launch_dict with key `Customer`\n customer = row[6].a.string\n #print(customer)\n \n # Launch outcome\n # TODO: Append the launch_outcome into launch_dict with key `Launch outcome`\n launch_outcome = list(row[7].strings)[0]\n #print(launch_outcome)\n \n # Booster landing\n # TODO: Append the launch_outcome into launch_dict with key `Booster landing`\n booster_landing = landing_status(row[8])\n #print(booster_landing)\n "
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "After you have fill in the parsed launch record values into `launch_dict`, you can create a dataframe from it.\n"
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": "df=pd.DataFrame(launch_dict)"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "We can now export it to a <b>CSV</b> for the next section, but to make the answers consistent and in case you have difficulties finishing this lab.\n\nFollowing labs will be using a provided dataset to make each lab independent.\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "<code>df.to_csv('spacex_web_scraped.csv', index=False)</code>\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "## Authors\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "<a href=\"https://www.linkedin.com/in/yan-luo-96288783/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01\">Yan Luo</a>\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "<a href=\"https://www.linkedin.com/in/nayefaboutayoun/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork26802033-2021-01-01\">Nayef Abou Tayoun</a>\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "## Change Log\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "\| Date (YYYY-MM-DD) \| Version \| Changed By \| Change Description \|\n\| ----------------- \| ------- \| ---------- \| --------------------------- \|\n\| 2021-06-09 \| 1.0 \| Yan Luo \| Tasks updates \|\n\| 2020-11-10 \| 1.0 \| Nayef \| Created the initial version \|\n"
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": "Copyright \u00a9 2021 IBM Corporation. All rights reserved.\n"
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.8.8"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 4
	}