@qbeer
Created September 17, 2021 06:38
01_SOLVED_EDA.ipynb
{
"nbformat": 4,
"nbformat_minor": 5,
"metadata": {
"language_info": {
"name": "plaintext"
},
"colab": {
"name": "01_SOLVED_EDA.ipynb",
"provenance": [],
"include_colab_link": true
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "mVrfcB0reopX"
},
"source": [
""
],
"id": "mVrfcB0reopX"
},
{
"cell_type": "markdown",
"metadata": {
"id": "oKcFgCLleopZ"
},
"source": [
"Exploratory data analysis\n",
"=========================\n",
"\n",
"<http://patbaa.web.elte.hu/physdm/data/titanic.csv>\n",
"\n",
"At the link above you will find a dataset about the Titanic passengers.\n",
"Your task is to explore the dataset.\n",
"\n",
"Help for the columns:\n",
"\n",
"- SibSp - number of siblings/spouses on the ship\n",
"- Parch - number of parents/children on the ship\n",
"- Cabin - the cabin they slept in (if they had a cabin)\n",
"- Embarked - port where they boarded the ship\n",
"- Pclass - passenger class (like on trains)\n",
"\n",
"#### 1. Load the above-linked CSV file as a pandas DataFrame. Check and plot whether any of the columns have missing values. If they do, investigate whether the missingness is random or not.\n",
"\n",
"Impute the missing values in a sensible way:\n",
"\n",
"- if only a very small percentage is missing, imputing with the\n",
"  column-wise mean or removing the rows with missing values makes\n",
"  sense\n",
"- if almost all the entries in a row are missing, it is worth\n",
"  removing that row\n",
"- if a larger portion of a column is missing, it is usually worth\n",
"  encoding the missing entries with a value that does not appear in\n",
"  the dataset (e.g. -1).\n",
"\n",
"The imputation method affects different machine learning models in\n",
"different ways, but for now we are interested only in EDA, so try to\n",
"keep as much information as possible!\n",
"\n",
"#### 2. Create a heatmap that shows how many people survived and died in each Pclass. You need to create a table where the columns indicate whether a person survived or not, the rows indicate the different Pclass values, and the cell values contain the number of people belonging to that category. The table should be colored based on the values of its cells.\n",
"\n",
"#### 3. Create a boxplot for each Pclass. Each boxplot should show the age distribution for the given Pclass. Plot all of these next to each other in a row to make them easier to compare!\n",
"\n",
"#### 4. Calculate the correlation matrix for the numerical columns. Also show it as a heatmap, as described in the 2nd task.\n",
"\n",
"Which feature seems to play the most important role in surviving/not\n",
"surviving? Explain how and why that feature could be important!\n",
"\n",
"#### 5. Create two plots which you think are meaningful. Interpret both of them. (E.g.: do older people buy more expensive tickets? Do people who buy more expensive tickets survive more often? etc.)\n",
"\n",
"### Hints:\n",
"\n",
"- In total you can get 10 points for fully completing all tasks.\n",
"- Decorate your notebook with questions, explanations, etc. Make it\n",
"  self-contained and understandable!\n",
"- Comment your code when necessary\n",
"- Write functions for repetitive tasks!\n",
"- Use the pandas package for data loading and handling\n",
"- Use matplotlib and seaborn for plotting or bokeh and plotly for\n",
"  interactive investigation\n",
"- Use the scikit-learn package for almost everything\n",
"- Use for loops only if it is really necessary!\n",
"- Code sharing between students is not allowed! Sharing code will\n",
"  result in zero points.\n",
"- If you use code found on the web, that is OK, but make its source clear!\n",
"\n"
],
"id": "oKcFgCLleopZ"
},
{
"cell_type": "markdown",
"metadata": {
"id": "gRJuXgo2eopd"
},
"source": [
"In \\[15\\]:\n",
"\n",
"    # `factorplot` was renamed to `catplot` in seaborn; the column is passed as `x=`\n",
"    sns.catplot(x='Sex', data=data[data.Survived == 0], kind='count')\n",
"    plt.title('Not survived')\n",
"    sns.catplot(x='Sex', data=data[data.Survived == 1], kind='count')\n",
"    plt.title('Survived')\n",
"    plt.show()\n",
"\n",
"![Count of passengers by Sex among those who did not survive]()\n",
"\n",
"![Count of passengers by Sex among those who survived]()"
],
"id": "gRJuXgo2eopd"
},
{
"cell_type": "markdown",
"metadata": {
"id": "9jW6Mtfmeopk"
},
"source": [
"In \\[16\\]:\n",
"\n",
" data.head()\n",
"\n",
"Out\\[16\\]:\n",
"\n",
"| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | has\\_no\\_cabin | has\\_no\\_age |\n",
"|-----|----------|--------|--------|------|-------|-------|---------|-------|----------|----------------|--------------|\n",
"| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | NaN | S | 1 | 0 |\n",
"| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C | 0 | 0 |\n",
"| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | NaN | S | 1 | 0 |\n",
"| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | C123 | S | 0 | 0 |\n",
"| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | NaN | S | 1 | 0 |"
],
"id": "9jW6Mtfmeopk"
},
{
"cell_type": "markdown",
"metadata": {
"id": "lZKppttHeopm"
},
"source": [
"In \\[17\\]:\n",
"\n",
"    sns.heatmap(data.groupby(['Sex'])[['Parch', 'SibSp']].mean(), annot=True)\n",
"\n",
"Out\\[17\\]:\n",
"\n",
"    <matplotlib.axes._subplots.AxesSubplot at 0x7f3f4e307f28>\n",
"\n",
"![Heatmap of mean Parch and SibSp by Sex]()"
],
"id": "lZKppttHeopm"
},
{
"cell_type": "markdown",
"metadata": {
"id": "RKBoiodMeopn"
},
"source": [
"In \\[18\\]:\n",
"\n",
" Counter(data.Sex)\n",
"\n",
"Out\\[18\\]:\n",
"\n",
" Counter({'male': 577, 'female': 314})"
],
"id": "RKBoiodMeopn"
},
{
"cell_type": "markdown",
"metadata": {
"id": "l348KbyIeopn"
},
"source": [
"It seems that males often traveled alone!"
],
"id": "l348KbyIeopn"
},
{
"cell_type": "markdown",
"metadata": {
"id": "2d_GmkDLeopn"
},
"source": [
"In \\[19\\]:\n",
"\n",
" Counter(data[(data.SibSp == 0) & (data.Parch == 0)].Sex)\n",
"\n",
"Out\\[19\\]:\n",
"\n",
" Counter({'female': 126, 'male': 411})"
],
"id": "2d_GmkDLeopn"
},
{
"cell_type": "markdown",
"metadata": {
"id": "w2nYWTEreopo"
},
"source": [
""
],
"id": "w2nYWTEreopo"
}
]
}
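
The notebook's remark that males often traveled alone can be checked by dividing the counts printed in Out[19] (passengers with `SibSp == 0` and `Parch == 0`) by the totals printed in Out[18]. The following is a minimal sketch that uses only those printed numbers and the standard library; the helper name `solo_share` is hypothetical and is not part of the original notebook.

```python
from collections import Counter

# Counts copied from the notebook outputs:
# Out[18]: all passengers by sex
# Out[19]: passengers with SibSp == 0 and Parch == 0, by sex
total_by_sex = Counter({'male': 577, 'female': 314})
solo_by_sex = Counter({'female': 126, 'male': 411})


def solo_share(solo, total):
    """Return, per key, the fraction of passengers who traveled alone.

    Hypothetical helper for illustration only.
    """
    return {sex: solo[sex] / total[sex] for sex in total}


shares = solo_share(solo_by_sex, total_by_sex)
for sex, share in shares.items():
    print(f"{sex}: {share:.1%} traveled without relatives on board")
```

With these counts, roughly 71% of the males and 40% of the females had neither siblings/spouses nor parents/children on board, which supports the remark. Comparing shares rather than raw counts matters here because there are almost twice as many males as females in the dataset.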