Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save mahooi/d505ec45dc847bc77334529b7b36acb9 to your computer and use it in GitHub Desktop.
Save mahooi/d505ec45dc847bc77334529b7b36acb9 to your computer and use it in GitHub Desktop.
Created on Skills Network Labs
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://cognitiveclass.ai\"><img src = \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%202/images/IDSNlogo.png\" width = 400> </a>\n",
"\n",
"<h1 align=center><font size = 5>From Requirements to Collection</font></h1>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"\n",
"In this lab, we will continue learning about the data science methodology, and focus on the **Data Requirements** and the **Data Collection** stages.\n",
"\n",
"Estaimted time needed: **15** minutes\n",
"\n",
"## Objectives\n",
"\n",
"After complting this lab you will be able to:\n",
"\n",
"- Understand Data Requirements\n",
"- Explore the stages in Data Collection\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"\n",
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n",
" \n",
"1. [Data Requirements](#0)<br>\n",
"2. [Data Collection](#2)<br>\n",
"</div>\n",
"<hr>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Requirements <a id=\"0\"></a>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%202/images/lab2_fig1_flowchart_data_requirements.png\" width=500>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the videos, we learned that the chosen analytic approach determines the data requirements. Specifically, the analytic methods to be used require certain data content, formats and representations, guided by domain knowledge.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the **From Problem to Approach Lab**, we determined that automating the process of determining the cuisine of a given recipe or dish is potentially possible using the ingredients of the recipe or the dish. In order to build a model, we need extensive data of different cuisines and recipes.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Identifying the required data fulfills the data requirements stage of the data science methodology.\n",
"\n",
"* * *\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Collection <a id=\"2\"></a>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%202/images/lab2_fig2_flowchart_data_collection.png\" width=500>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the initial data collection stage, data scientists identify and gather the available data resources. These can be in the form of structured, unstructured, and even semi-structured data relevant to the problem domain.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Web Scraping of Online Food Recipes\n",
"\n",
"A researcher named Yong-Yeol Ahn scraped tens of thousands of food recipes (cuisines and ingredients) from three different websites, namely:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%202/images/lab2_fig3_allrecipes.png\" width=500>\n",
"<div align=\"center\">\n",
"www.allrecipes.com\n",
"</div>\n",
"<br/><br/>\n",
"<img src=\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%202/images/lab2_fig4_epicurious.png\" width=500>\n",
"<div align=\"center\">\n",
"www.epicurious.com\n",
"</div>\n",
"<br/><br/>\n",
"<img src=\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%202/images/lab2_fig5_menupan.png\" width=500>\n",
"<div align=\"center\">\n",
"www.menupan.com\n",
"</div>\n",
"<br/><br/>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more information on Yong-Yeol Ahn and his research, you can read his paper on [Flavor Network and the Principles of Food Pairing](http://yongyeol.com/papers/ahn-flavornet-2011.pdf?utm_email=Email&utm_source=Nurture&utm_content=000026UJ&utm_term=10006555&utm_campaign=PLACEHOLDER&utm_id=SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork-20083987).\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Luckily, we will not need to carry out any data collection as the data that we need to meet the goal defined in the business understanding stage is readily available.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### We have already acquired the data and placed it on an IBM server. Let's download the data and take a look at it.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4 style=\"color: red\";> Important note: Please note that you are not expected to know how to program in R. The following code is meant to illustrate the stage of data collection, so it is totally fine if you do not understand the individual lines of code. We have a full course on programming in R, <a href=\"http://cocl.us/RP0101EN_DS0103EN_LAB2_R\">R101</a>, so please feel free to complete the course if you are interested in learning how to program in R.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using this notebook:\n",
"\n",
"To run any of the following cells of code, you can type **Shift + Enter** to excute the code in a cell.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the version of R installed.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import requests as rq"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Download the data from the IBM server.\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"recipes='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0103EN-SkillsNetwork/labs/Module%202/recipes.csv'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read the data into an R dataframe and name it **recipes**.\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(recipes)\n"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"df.to_csv(r'C:\\Users\\Maha Al-Khater\\Documents\\GitHub\\resipes.csv', index = False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show the first few rows.\n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>country</th>\n",
" <th>almond</th>\n",
" <th>angelica</th>\n",
" <th>anise</th>\n",
" <th>anise_seed</th>\n",
" <th>apple</th>\n",
" <th>apple_brandy</th>\n",
" <th>apricot</th>\n",
" <th>armagnac</th>\n",
" <th>artemisia</th>\n",
" <th>...</th>\n",
" <th>whiskey</th>\n",
" <th>white_bread</th>\n",
" <th>white_wine</th>\n",
" <th>whole_grain_wheat_flour</th>\n",
" <th>wine</th>\n",
" <th>wood</th>\n",
" <th>yam</th>\n",
" <th>yeast</th>\n",
" <th>yogurt</th>\n",
" <th>zucchini</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Vietnamese</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Vietnamese</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Vietnamese</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Vietnamese</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Vietnamese</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Vietnamese</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Vietnamese</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Vietnamese</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Vietnamese</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Vietnamese</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>...</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10 rows × 384 columns</p>\n",
"</div>"
],
"text/plain": [
" country almond angelica anise anise_seed apple apple_brandy apricot \\\n",
"0 Vietnamese No No No No No No No \n",
"1 Vietnamese No No No No No No No \n",
"2 Vietnamese No No No No No No No \n",
"3 Vietnamese No No No No No No No \n",
"4 Vietnamese No No No No No No No \n",
"5 Vietnamese No No No No No No No \n",
"6 Vietnamese No No No No No No No \n",
"7 Vietnamese No No No No No No No \n",
"8 Vietnamese No No No No No No No \n",
"9 Vietnamese No No No No No No No \n",
"\n",
" armagnac artemisia ... whiskey white_bread white_wine \\\n",
"0 No No ... No No No \n",
"1 No No ... No No No \n",
"2 No No ... No No No \n",
"3 No No ... No No No \n",
"4 No No ... No No No \n",
"5 No No ... No No No \n",
"6 No No ... No No No \n",
"7 No No ... No No No \n",
"8 No No ... No No No \n",
"9 No No ... No No No \n",
"\n",
" whole_grain_wheat_flour wine wood yam yeast yogurt zucchini \n",
"0 No No No No No No No \n",
"1 No No No No No No No \n",
"2 No No No No No No No \n",
"3 No No No No No No No \n",
"4 No No No No No No No \n",
"5 No No No No No No No \n",
"6 No No No No No No No \n",
"7 No No No No No No No \n",
"8 No No No No No No No \n",
"9 No No No No No No No \n",
"\n",
"[10 rows x 384 columns]"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the dimensions of the dataframe.\n"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"(57691, 384)"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So our dataset consists of 57,691 recipes. Each row represents a recipe, and for each recipe, the corresponding cuisine is documented as well as whether 384 ingredients exist in the recipe or not, beginning with almond and ending with zucchini.\n",
"\n",
"* * *\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that the data collection stage is complete, data scientists typically use descriptive statistics and visualization techniques to better understand the data and get acquainted with it. Data scientists, essentially, explore the data to:\n",
"\n",
"- understand its content,\n",
"- assess its quality,\n",
"- discover any interesting preliminary insights, and,\n",
"- determine whether additional data is necessary to fill any gaps in the data.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Thank you for completing this lab!\n",
"\n",
"This notebook is part of the free course on **Cognitive Class** called _Data Science Methodology_. If you accessed this notebook outside the course, you can take this free self-paced course, online by clicking [here](http://cocl.us/DS0103EN_LAB2_R).\n",
"\n",
"## Author\n",
"\n",
"<a href=\"https://www.linkedin.com/in/aklson/\" target=\"_blank\">Alex Aklson</a>\n",
"\n",
"## Change Log\n",
"\n",
"| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n",
"| ----------------- | ------- | ---------- | ---------------------------------- |\n",
"| 2021-04-06 | 2.1 | Malika | Updated lab link |\n",
"| 2020-09-16 | 2.0 | Simran | Moved lab to course repo in GitLab |\n",
"\n",
"<hr>\n",
"\n",
"## <h3 align=\"center\"> © IBM Corporation 2020. All rights reserved. <h3/>\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python",
"language": "python",
"name": "conda-env-python-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment