@727021
Last active September 28, 2020 16:59
Assessment.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Assessment.ipynb",
"provenance": [],
"collapsed_sections": [],
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/727021/73d9c4f54081d59a4cf3ac1cb02082cb/assessment.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "E5dpj74Hya2q"
},
"source": [
"# Introduction\n",
"This assignment will test how well you're able to perform various data science-related tasks using Python, pandas, and seaborn, as well as how well you can interpret statistical and probabilistic results.\n",
"\n",
"Each Problem Group below will center around a particular dataset that you have worked with before.\n",
"\n",
"To ensure you receive full credit for a question, make sure you demonstrate the appropriate pandas, seaborn, or other commands as requested in the provided code blocks. \n",
"\n",
"You may find that some questions require multiple steps to fully answer. Others require some mental arithmetic in addition to pandas commands. Use your best judgment.\n",
"\n",
"## Submission\n",
"Each problem group asks a series of questions. This assignment consists of two submissions:\n",
"\n",
"1. After completing the questions below, open the Module 01 Assessment Quiz in Canvas and enter your answers to these questions there.\n",
"\n",
"2. After completing and submitting the quiz, save this Colab notebook as a GitHub Gist (You'll need to create a GitHub account for this), by selecting `Save a copy as a GitHub Gist` from the `File` menu above.\n",
"\n",
" In Canvas, open the Module 01 Assessment GitHub Gist assignment and paste the GitHub Gist URL for this notebook. Then submit that assignment."
]
},
{
"cell_type": "code",
"metadata": {
"id": "sth8revcidyV"
},
"source": [
"# https://stackoverflow.com/a/17303428/2031203\n",
"# Make answers a little easier to see\n",
"class color:\n",
" PURPLE = '\\033[95m'\n",
" CYAN = '\\033[96m'\n",
" DARKCYAN = '\\033[36m'\n",
" BLUE = '\\033[94m'\n",
" GREEN = '\\033[92m'\n",
" YELLOW = '\\033[93m'\n",
" RED = '\\033[91m'\n",
" BOLD = '\\033[1m'\n",
" UNDERLINE = '\\033[4m'\n",
" END = '\\033[0m'"
],
"execution_count": 28,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "WgYhh-W3xk4e"
},
"source": [
"## Problem Group 1\n",
"\n",
"For the questions in this group, you'll work with the Netflix Movies Dataset found at this url: [https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/netflix_titles.csv](https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/netflix_titles.csv)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kulUs9K7xxB3"
},
"source": [
"### Question 1\n",
"Load the dataset into a Pandas data frame and determine what data type is used to store the `release_year` feature."
]
},
{
"cell_type": "code",
"metadata": {
"id": "ecSfSVrsxjx5"
},
"source": [
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"\n",
"sns.set_style('darkgrid')"
],
"execution_count": 2,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "IJwMI4R-daM1",
"outputId": "8a0d27dc-4baa-48c1-991a-68b8e9c62039",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 652
}
},
"source": [
"netflix = pd.read_csv('https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/netflix_titles.csv')\n",
"netflix.head()"
],
"execution_count": 3,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>show_id</th>\n",
" <th>type</th>\n",
" <th>title</th>\n",
" <th>director</th>\n",
" <th>cast</th>\n",
" <th>country</th>\n",
" <th>date_added</th>\n",
" <th>release_year</th>\n",
" <th>rating</th>\n",
" <th>duration</th>\n",
" <th>listed_in</th>\n",
" <th>description</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>81145628</td>\n",
" <td>Movie</td>\n",
" <td>Norm of the North: King Sized Adventure</td>\n",
" <td>Richard Finn, Tim Maltby</td>\n",
" <td>Alan Marriott, Andrew Toth, Brian Dobson, Cole...</td>\n",
" <td>United States, India, South Korea, China</td>\n",
" <td>September 9, 2019</td>\n",
" <td>2019</td>\n",
" <td>TV-PG</td>\n",
" <td>90 min</td>\n",
" <td>Children &amp; Family Movies, Comedies</td>\n",
" <td>Before planning an awesome wedding for his gra...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>80117401</td>\n",
" <td>Movie</td>\n",
" <td>Jandino: Whatever it Takes</td>\n",
" <td>NaN</td>\n",
" <td>Jandino Asporaat</td>\n",
" <td>United Kingdom</td>\n",
" <td>September 9, 2016</td>\n",
" <td>2016</td>\n",
" <td>TV-MA</td>\n",
" <td>94 min</td>\n",
" <td>Stand-Up Comedy</td>\n",
" <td>Jandino Asporaat riffs on the challenges of ra...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>70234439</td>\n",
" <td>TV Show</td>\n",
" <td>Transformers Prime</td>\n",
" <td>NaN</td>\n",
" <td>Peter Cullen, Sumalee Montano, Frank Welker, J...</td>\n",
" <td>United States</td>\n",
" <td>September 8, 2018</td>\n",
" <td>2013</td>\n",
" <td>TV-Y7-FV</td>\n",
" <td>1 Season</td>\n",
" <td>Kids' TV</td>\n",
" <td>With the help of three human allies, the Autob...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>80058654</td>\n",
" <td>TV Show</td>\n",
" <td>Transformers: Robots in Disguise</td>\n",
" <td>NaN</td>\n",
" <td>Will Friedle, Darren Criss, Constance Zimmer, ...</td>\n",
" <td>United States</td>\n",
" <td>September 8, 2018</td>\n",
" <td>2016</td>\n",
" <td>TV-Y7</td>\n",
" <td>1 Season</td>\n",
" <td>Kids' TV</td>\n",
" <td>When a prison ship crash unleashes hundreds of...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>80125979</td>\n",
" <td>Movie</td>\n",
" <td>#realityhigh</td>\n",
" <td>Fernando Lebrija</td>\n",
" <td>Nesta Cooper, Kate Walsh, John Michael Higgins...</td>\n",
" <td>United States</td>\n",
" <td>September 8, 2017</td>\n",
" <td>2017</td>\n",
" <td>TV-14</td>\n",
" <td>99 min</td>\n",
" <td>Comedies</td>\n",
" <td>When nerdy high schooler Dani finally attracts...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" show_id ... description\n",
"0 81145628 ... Before planning an awesome wedding for his gra...\n",
"1 80117401 ... Jandino Asporaat riffs on the challenges of ra...\n",
"2 70234439 ... With the help of three human allies, the Autob...\n",
"3 80058654 ... When a prison ship crash unleashes hundreds of...\n",
"4 80125979 ... When nerdy high schooler Dani finally attracts...\n",
"\n",
"[5 rows x 12 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 3
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "3CfKMtvppESb",
"outputId": "5d17afa5-f133-4fd7-86af-099ddbc4009d",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"print('Release Year dtype: {}{}'.format(color.BOLD+color.RED,netflix.dtypes['release_year']))"
],
"execution_count": 71,
"outputs": [
{
"output_type": "stream",
"text": [
"Release Year dtype: \u001b[1m\u001b[91mint64\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ivpHTGpczpyM"
},
"source": [
"### Question 2\n",
"Filter your dataset so it contains only `TV Shows`. How many of those TV Shows were rated `TV-Y7`?"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Zf6QABfXx5Xh",
"outputId": "7c08d1c6-bf4b-405f-e3aa-a7ce6b65adbe",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"tv = netflix[netflix['type'] == 'TV Show']\n",
"print('TV Shows rated TV-Y7: ', len(tv[tv['rating'] == 'TV-Y7']), sep=color.BOLD + color.RED)"
],
"execution_count": 30,
"outputs": [
{
"output_type": "stream",
"text": [
"TV Shows rated TV-Y7: \u001b[1m\u001b[91m100\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-esock-41eGo"
},
"source": [
"### Question 3\n",
"Further filter your dataset so it only contains TV Shows released between the years 2000 and 2009 inclusive. How many of *those* shows were rated `TV-Y7`?"
]
},
{
"cell_type": "code",
"metadata": {
"id": "cBNHqCgz0WDp",
"outputId": "87370712-4cf9-4a89-abe2-b4f4d5a903b2",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"filtered_tv = tv[(tv['rating'] == 'TV-Y7') & (2000 <= tv['release_year']) & (tv['release_year'] <= 2009)]\n",
"print('TV Shows rated TV-Y7 between 2000 and 2009: ', len(filtered_tv), sep=color.BOLD + color.RED)"
],
"execution_count": 31,
"outputs": [
{
"output_type": "stream",
"text": [
"TV Shows rated TV-Y7 between 2000 and 2009: \u001b[1m\u001b[91m4\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "s9pLfYQx2W8s"
},
"source": [
"### Question 4\n",
"Use seaborn to create a count plot showing the relative counts of the ratings of TV-Shows released between the years 2000 and 2009, inclusive.\n",
"\n",
"What is the top-most number on the y-axis scale?"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Akz49cnq1qUI",
"outputId": "ffec16dc-6be4-424c-e2fa-57a8befe2810",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 295
}
},
"source": [
"ratings = sns.countplot(x='rating',data=tv[(tv['release_year'] >= 2000) & (tv['release_year'] <= 2009)])\n",
"ratings.set_xlabel('Rating')\n",
"ratings.set_ylabel('# TV Shows')\n",
"ratings.set_title('Netflix TV Show Rating Counts (2000-2009)')\n",
"plt.show()"
],
"execution_count": 33,
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"tags": []
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cB4zmoJm3XDj"
},
"source": [
"## Problem Group 2\n",
"\n",
"For the questions in this group, you'll work with the Cereal Dataset found at this url: [https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/cereal.csv](https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/cereal.csv)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BE_TdsHa3sMj"
},
"source": [
"### Question 5\n",
"After importing the dataset into a pandas data frame, determine the median amount of `protein` in cereal brands manufactured by Kelloggs. (`mfr` code \"K\")"
]
},
{
"cell_type": "code",
"metadata": {
"id": "fBXFGnfP2tfV",
"outputId": "f4ad365f-ca50-4f64-c292-a44fb6ebb64c",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 366
}
},
"source": [
"cereal = pd.read_csv('https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/cereal.csv')\n",
"cereal.head()"
],
"execution_count": 22,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>mfr</th>\n",
" <th>type</th>\n",
" <th>calories</th>\n",
" <th>protein</th>\n",
" <th>fat</th>\n",
" <th>sodium</th>\n",
" <th>fiber</th>\n",
" <th>carbo</th>\n",
" <th>sugars</th>\n",
" <th>potass</th>\n",
" <th>vitamins</th>\n",
" <th>shelf</th>\n",
" <th>weight</th>\n",
" <th>cups</th>\n",
" <th>rating</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>100% Bran</td>\n",
" <td>N</td>\n",
" <td>C</td>\n",
" <td>70</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>130</td>\n",
" <td>10.0</td>\n",
" <td>5.0</td>\n",
" <td>6</td>\n",
" <td>280</td>\n",
" <td>25</td>\n",
" <td>3</td>\n",
" <td>1.0</td>\n",
" <td>0.33</td>\n",
" <td>68.402973</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>100% Natural Bran</td>\n",
" <td>Q</td>\n",
" <td>C</td>\n",
" <td>120</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" <td>15</td>\n",
" <td>2.0</td>\n",
" <td>8.0</td>\n",
" <td>8</td>\n",
" <td>135</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>1.0</td>\n",
" <td>1.00</td>\n",
" <td>33.983679</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>All-Bran</td>\n",
" <td>K</td>\n",
" <td>C</td>\n",
" <td>70</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>260</td>\n",
" <td>9.0</td>\n",
" <td>7.0</td>\n",
" <td>5</td>\n",
" <td>320</td>\n",
" <td>25</td>\n",
" <td>3</td>\n",
" <td>1.0</td>\n",
" <td>0.33</td>\n",
" <td>59.425505</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>All-Bran with Extra Fiber</td>\n",
" <td>K</td>\n",
" <td>C</td>\n",
" <td>50</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>140</td>\n",
" <td>14.0</td>\n",
" <td>8.0</td>\n",
" <td>0</td>\n",
" <td>330</td>\n",
" <td>25</td>\n",
" <td>3</td>\n",
" <td>1.0</td>\n",
" <td>0.50</td>\n",
" <td>93.704912</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Almond Delight</td>\n",
" <td>R</td>\n",
" <td>C</td>\n",
" <td>110</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>200</td>\n",
" <td>1.0</td>\n",
" <td>14.0</td>\n",
" <td>8</td>\n",
" <td>-1</td>\n",
" <td>25</td>\n",
" <td>3</td>\n",
" <td>1.0</td>\n",
" <td>0.75</td>\n",
" <td>34.384843</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name mfr type calories ... shelf weight cups rating\n",
"0 100% Bran N C 70 ... 3 1.0 0.33 68.402973\n",
"1 100% Natural Bran Q C 120 ... 3 1.0 1.00 33.983679\n",
"2 All-Bran K C 70 ... 3 1.0 0.33 59.425505\n",
"3 All-Bran with Extra Fiber K C 50 ... 3 1.0 0.50 93.704912\n",
"4 Almond Delight R C 110 ... 3 1.0 0.75 34.384843\n",
"\n",
"[5 rows x 16 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 22
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "Vn6nmEjXh8s2",
"outputId": "d81fa50c-f694-4cd0-c521-5943ca085b68",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"kelloggs = cereal[cereal['mfr'] == 'K']\n",
"print('Median Kelloggs protein: ', kelloggs['protein'].median(), sep=color.BOLD + color.RED)"
],
"execution_count": 34,
"outputs": [
{
"output_type": "stream",
"text": [
"Median Kelloggs protein: \u001b[1m\u001b[91m3.0\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_tAtBuST4vFq"
},
"source": [
"### Question 6\n",
"If you were to choose a brand of cereal made by Kelloggs at random (counting only the brands in this dataset), what is:\n",
"\n",
" P( protein = 3 )\n",
"\n",
"For the brand you chose? Round your answer to two decimal places."
]
},
{
"cell_type": "code",
"metadata": {
"id": "PLIl-nwZ4PT9",
"outputId": "f851942c-49df-4ee6-8a8e-c2ea48d78ec6",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"P_protein_3 = kelloggs['protein'].value_counts(normalize=True)[3]\n",
"print('P(protein=3) = {}{:.2f}'.format(color.BOLD + color.RED, P_protein_3))"
],
"execution_count": 72,
"outputs": [
{
"output_type": "stream",
"text": [
"P(protein=3) = \u001b[1m\u001b[91m0.39\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "W6nZYMTM6pbi"
},
"source": [
"### Question 7\n",
"If you were to choose a brand of cereal made by Kelloggs at random (counting only the brands in this dataset), what is:\n",
"\n",
" P( calories > 100 | protein = 3)\n",
"\n",
"for the brand you chose? Round your answer to two decimal places."
]
},
{
"cell_type": "code",
"metadata": {
"id": "TJITPcq25KwS",
"outputId": "9b9a47f9-a454-4a86-cbc2-1e4873f0665d",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"kelloggs_protein_3 = kelloggs[kelloggs['protein'] == 3]\n",
"P_cal_gt_100_given_protein_3 = len(kelloggs_protein_3[kelloggs_protein_3['calories'] > 100]) / len(kelloggs_protein_3)\n",
"print('P(calories > 100 | protein = 3) = {}{:.2f}'.format(color.BOLD + color.RED, P_cal_gt_100_given_protein_3))"
],
"execution_count": 73,
"outputs": [
{
"output_type": "stream",
"text": [
"P(calories > 100 | protein = 3) = \u001b[1m\u001b[91m0.67\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "W3del4PC9NJ-"
},
"source": [
"### Question 8\n",
"In order to comply with new government regulations, all cereals must now come with a \"Healthiness\" rating. This rating is calculated based on this formula:\n",
"\n",
" healthiness = (protein + fiber) / sugar\n",
"\n",
"Create a new `healthiness` column populated with values based on the above formula.\n",
"\n",
"Then, determine the median healthiness value for only General Mills cereals (`mfr` = \"G\"), rounded to two decimal places."
]
},
{
"cell_type": "code",
"metadata": {
"id": "TqFx9yvV6LDX",
"outputId": "b7434c1b-dee1-48b6-e1e1-6f8936c57b72",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"cereal['healthiness'] = (cereal['protein'] + cereal['fiber']) / cereal['sugars']\n",
"print('Median General Mills Healthiness: {}{:.2f}'.format(color.BOLD + color.RED, cereal[cereal['mfr'] == 'G']['healthiness'].median()))"
],
"execution_count": 74,
"outputs": [
{
"output_type": "stream",
"text": [
"Median General Mills Healthiness: \u001b[1m\u001b[91m0.47\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "AcuUNAxC-g7c"
},
"source": [
"## Problem Group 3\n",
"\n",
"For the questions in this group, you'll work with the Titanic Dataset found at this url: [https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/titanic.csv](https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/titanic.csv)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "POOuuXfYAJQK"
},
"source": [
"### Question 9\n",
"\n",
"After loading the dataset into a pandas DataFrame, create a new column called `NameGroup` that contains the first letter of the passenger's surname in lower case.\n",
"\n",
"Note that in the dataset, passenger's names are provided in the `Name` column and are listed as:\n",
"\n",
" Surname, Given names\n",
"\n",
"For example, if a passenger's `Name` is `Braund, Mr. Owen Harris`, the `NameGroup` column should contain the value `b`.\n",
"\n",
"Then count how many passengers have a `NameGroup` value of `k`."
]
},
{
"cell_type": "code",
"metadata": {
"id": "T80gU65-ASFa",
"outputId": "9ea1660f-dbe9-4ac1-b299-a98df0f591db",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 484
}
},
"source": [
"titanic = pd.read_csv('https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/titanic.csv')\n",
"titanic.head()"
],
"execution_count": 59,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass ... Fare Cabin Embarked\n",
"0 1 0 3 ... 7.2500 NaN S\n",
"1 2 1 1 ... 71.2833 C85 C\n",
"2 3 1 3 ... 7.9250 NaN S\n",
"3 4 1 1 ... 53.1000 C123 S\n",
"4 5 0 3 ... 8.0500 NaN S\n",
"\n",
"[5 rows x 12 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 59
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "E81qACTfAWl2",
"outputId": "977ff587-5c37-48f8-8207-f24f852478d3",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"titanic['NameGroup'] = titanic['Name'].str.slice(0,1).str.lower()\n",
"print('Titanic Passengers in NameGroup {}k{}: {}{}'.format(color.BOLD, color.END, color.BOLD + color.RED, titanic['NameGroup'].value_counts()['k']))"
],
"execution_count": 64,
"outputs": [
{
"output_type": "stream",
"text": [
"Titanic Passengers in NameGroup \u001b[1mk\u001b[0m: \u001b[1m\u001b[91m28\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6iZtc5-zBZQG"
},
"source": [
"### Question 10\n",
"Using seaborn, create a boxplot showing the distribution of passenger `Age` grouped by the port they `Embarked` from.\n",
"\n",
"How many outliers were there in the group embarked from Queenstown (Embarked value `Q`)?"
]
},
{
"cell_type": "code",
"metadata": {
"id": "9D6TzGQjAZIw",
"outputId": "c6966d55-14af-4906-99d3-a4365435da2c",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 295
}
},
"source": [
"age_vs_embarked = sns.boxplot(y='Age',x='Embarked',data=titanic)\n",
"# (C = Cherbourg; Q = Queenstown; S = Southampton)\n",
"age_vs_embarked.set_xticklabels(['Southampton','Cherbourg','Queenstown'])\n",
"age_vs_embarked.set_title('Passenger Age Distribution by Port')\n",
"plt.show()"
],
"execution_count": 69,
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"tags": []
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0lylG1LUDaHB"
},
"source": [
"## 🌟 Bonus Question 🌟\n",
"\n",
"*Note that you have to enter the answer to this question in the Module 01 Bonus Assessment.*\n",
"\n",
"Use a Bayesian Classifier from sklearn to predict the probability that a passenger would surive given they were a female embarking from Queenstown.\n",
"\n",
"Encode any non-numeric values as necessary, and be sure to drop any rows with missing data for features that would affect the classification.\n",
"\n",
"Enter your answer as a decimal value rounded to 2 decimal places."
]
},
{
"cell_type": "code",
"metadata": {
"id": "t7xEDOStBwbJ"
},
"source": [
""
],
"execution_count": null,
"outputs": []
}
]
}
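The bonus question in the notebook above asks for a naive Bayes estimate of P(Survived | Sex = female, Embarked = Q), and its code cell is empty. The sketch below is not the notebook author's solution and does not use the Titanic file: it works through the same calculation by hand on a handful of hypothetical rows, so the numbers it produces say nothing about the real dataset. It follows the steps the question lists (drop rows with missing features, treat the categorical features as discrete values, then combine a class prior with per-feature likelihoods). The function name `naive_bayes_posterior`, the `rows` list, and the smoothing parameter `alpha` are assumptions introduced for illustration. The add-one smoothing is intended to mirror what a categorical naive Bayes estimator such as sklearn's `CategoricalNB` does by default, but treat that correspondence as approximate.

```python
from collections import Counter

# Hypothetical toy rows of (sex, port, survived). NOT real Titanic data.
rows = [
    ("female", "Q", 1),
    ("male", "S", 0),
    ("female", "C", 1),
    ("male", "Q", 0),
    ("female", "Q", 1),
    ("male", "S", 0),
    ("female", "S", 0),
    (None, "Q", 1),  # missing feature value; dropped before fitting
]


def naive_bayes_posterior(rows, sex, port, alpha=1.0):
    """Estimate P(survived = 1 | sex, port) under the naive Bayes assumption."""
    # Drop rows with missing data in any field the classifier uses.
    clean = [r for r in rows if None not in r]

    class_counts = Counter(label for _, _, label in clean)
    sex_values = {s for s, _, _ in clean}
    port_values = {p for _, p, _ in clean}

    scores = {}
    for label, n_label in class_counts.items():
        # Count how often each queried feature value occurs within this class.
        n_sex = sum(1 for s, _, l in clean if l == label and s == sex)
        n_port = sum(1 for _, p, l in clean if l == label and p == port)

        prior = n_label / len(clean)
        # Additive (Laplace) smoothing keeps unseen combinations above zero.
        like_sex = (n_sex + alpha) / (n_label + alpha * len(sex_values))
        like_port = (n_port + alpha) / (n_label + alpha * len(port_values))

        # Naive Bayes: features are treated as independent given the class.
        scores[label] = prior * like_sex * like_port

    # Normalise the joint scores so they sum to one across classes.
    return scores[1] / sum(scores.values())


p_survive = naive_bayes_posterior(rows, "female", "Q")
print(round(p_survive, 2))
```

To apply the same idea to the real data, the rows would come from the `Sex`, `Embarked`, and `Survived` columns of the `titanic` data frame after dropping missing values, and the result should be cross-checked against whichever sklearn estimator the course expects.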