Kaggle 5 Day Data Challenge - Day 1 - Pokemon
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Kaggle 5 Day Data Challenge - Day 1 - Pokemon\n",
"## Importing and summarizing a .csv with Python and pandas\n",
"This is the first day of the Kaggle 5 Day Data Challenge, and it's a simple but very important one!\n",
"\n",
"Today's challenge is to read in data from a .csv and summarize it. I'm going to use Python, pandas and this Pokemon dataset from Kaggle: https://www.kaggle.com/sekarmg/pokemon"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Step 1 - Import the libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Step 2 - Read your data into a dataframe"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pokemon = pd.read_csv('pokemon_alopez247.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Step 3 - Summarize your data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First we'll use pandas' describe() method to generate descriptive statistics that summarize the central tendency, dispersion and shape of the dataset’s distribution, excluding NaN values. By default it covers only the numeric columns, which is why the output below shows 13 of the 23 columns."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Number</th>\n",
" <th>Total</th>\n",
" <th>HP</th>\n",
" <th>Attack</th>\n",
" <th>Defense</th>\n",
" <th>Sp_Atk</th>\n",
" <th>Sp_Def</th>\n",
" <th>Speed</th>\n",
" <th>Generation</th>\n",
" <th>Pr_Male</th>\n",
" <th>Height_m</th>\n",
" <th>Weight_kg</th>\n",
" <th>Catch_Rate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>721.00000</td>\n",
" <td>721.000000</td>\n",
" <td>721.000000</td>\n",
" <td>721.000000</td>\n",
" <td>721.000000</td>\n",
" <td>721.000000</td>\n",
" <td>721.000000</td>\n",
" <td>721.000000</td>\n",
" <td>721.000000</td>\n",
" <td>644.000000</td>\n",
" <td>721.000000</td>\n",
" <td>721.000000</td>\n",
" <td>721.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>361.00000</td>\n",
" <td>417.945908</td>\n",
" <td>68.380028</td>\n",
" <td>75.013870</td>\n",
" <td>70.808599</td>\n",
" <td>68.737864</td>\n",
" <td>69.291262</td>\n",
" <td>65.714286</td>\n",
" <td>3.323162</td>\n",
" <td>0.553377</td>\n",
" <td>1.144979</td>\n",
" <td>56.773370</td>\n",
" <td>100.246879</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>208.27906</td>\n",
" <td>109.663671</td>\n",
" <td>25.848272</td>\n",
" <td>28.984475</td>\n",
" <td>29.296558</td>\n",
" <td>28.788005</td>\n",
" <td>27.015860</td>\n",
" <td>27.277920</td>\n",
" <td>1.669873</td>\n",
" <td>0.199969</td>\n",
" <td>1.044369</td>\n",
" <td>89.095667</td>\n",
" <td>76.573513</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>1.00000</td>\n",
" <td>180.000000</td>\n",
" <td>1.000000</td>\n",
" <td>5.000000</td>\n",
" <td>5.000000</td>\n",
" <td>10.000000</td>\n",
" <td>20.000000</td>\n",
" <td>5.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.100000</td>\n",
" <td>0.100000</td>\n",
" <td>3.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>181.00000</td>\n",
" <td>320.000000</td>\n",
" <td>50.000000</td>\n",
" <td>53.000000</td>\n",
" <td>50.000000</td>\n",
" <td>45.000000</td>\n",
" <td>50.000000</td>\n",
" <td>45.000000</td>\n",
" <td>2.000000</td>\n",
" <td>0.500000</td>\n",
" <td>0.610000</td>\n",
" <td>9.400000</td>\n",
" <td>45.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>361.00000</td>\n",
" <td>424.000000</td>\n",
" <td>65.000000</td>\n",
" <td>74.000000</td>\n",
" <td>65.000000</td>\n",
" <td>65.000000</td>\n",
" <td>65.000000</td>\n",
" <td>65.000000</td>\n",
" <td>3.000000</td>\n",
" <td>0.500000</td>\n",
" <td>0.990000</td>\n",
" <td>28.000000</td>\n",
" <td>65.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>541.00000</td>\n",
" <td>499.000000</td>\n",
" <td>80.000000</td>\n",
" <td>95.000000</td>\n",
" <td>85.000000</td>\n",
" <td>90.000000</td>\n",
" <td>85.000000</td>\n",
" <td>85.000000</td>\n",
" <td>5.000000</td>\n",
" <td>0.500000</td>\n",
" <td>1.400000</td>\n",
" <td>61.000000</td>\n",
" <td>180.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>721.00000</td>\n",
" <td>720.000000</td>\n",
" <td>255.000000</td>\n",
" <td>165.000000</td>\n",
" <td>230.000000</td>\n",
" <td>154.000000</td>\n",
" <td>230.000000</td>\n",
" <td>160.000000</td>\n",
" <td>6.000000</td>\n",
" <td>1.000000</td>\n",
" <td>14.500000</td>\n",
" <td>950.000000</td>\n",
" <td>255.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Number Total HP Attack Defense Sp_Atk \\\n",
"count 721.00000 721.000000 721.000000 721.000000 721.000000 721.000000 \n",
"mean 361.00000 417.945908 68.380028 75.013870 70.808599 68.737864 \n",
"std 208.27906 109.663671 25.848272 28.984475 29.296558 28.788005 \n",
"min 1.00000 180.000000 1.000000 5.000000 5.000000 10.000000 \n",
"25% 181.00000 320.000000 50.000000 53.000000 50.000000 45.000000 \n",
"50% 361.00000 424.000000 65.000000 74.000000 65.000000 65.000000 \n",
"75% 541.00000 499.000000 80.000000 95.000000 85.000000 90.000000 \n",
"max 721.00000 720.000000 255.000000 165.000000 230.000000 154.000000 \n",
"\n",
" Sp_Def Speed Generation Pr_Male Height_m Weight_kg \\\n",
"count 721.000000 721.000000 721.000000 644.000000 721.000000 721.000000 \n",
"mean 69.291262 65.714286 3.323162 0.553377 1.144979 56.773370 \n",
"std 27.015860 27.277920 1.669873 0.199969 1.044369 89.095667 \n",
"min 20.000000 5.000000 1.000000 0.000000 0.100000 0.100000 \n",
"25% 50.000000 45.000000 2.000000 0.500000 0.610000 9.400000 \n",
"50% 65.000000 65.000000 3.000000 0.500000 0.990000 28.000000 \n",
"75% 85.000000 85.000000 5.000000 0.500000 1.400000 61.000000 \n",
"max 230.000000 160.000000 6.000000 1.000000 14.500000 950.000000 \n",
"\n",
" Catch_Rate \n",
"count 721.000000 \n",
"mean 100.246879 \n",
"std 76.573513 \n",
"min 3.000000 \n",
"25% 45.000000 \n",
"50% 65.000000 \n",
"75% 180.000000 \n",
"max 255.000000 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pokemon.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also use pandas to look at the first few rows of our data, for instance the first five rows (index 0 to 4) as below:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Number Name Type_1 Type_2 Total HP Attack Defense Sp_Atk \\\n",
"0 1 Bulbasaur Grass Poison 318 45 49 49 65 \n",
"1 2 Ivysaur Grass Poison 405 60 62 63 80 \n",
"2 3 Venusaur Grass Poison 525 80 82 83 100 \n",
"3 4 Charmander Fire NaN 309 39 52 43 60 \n",
"4 5 Charmeleon Fire NaN 405 58 64 58 80 \n",
"\n",
" Sp_Def ... Color hasGender Pr_Male Egg_Group_1 Egg_Group_2 \\\n",
"0 65 ... Green True 0.875 Monster Grass \n",
"1 80 ... Green True 0.875 Monster Grass \n",
"2 100 ... Green True 0.875 Monster Grass \n",
"3 50 ... Red True 0.875 Monster Dragon \n",
"4 65 ... Red True 0.875 Monster Dragon \n",
"\n",
" hasMegaEvolution Height_m Weight_kg Catch_Rate Body_Style \n",
"0 False 0.71 6.9 45 quadruped \n",
"1 False 0.99 13.0 45 quadruped \n",
"2 True 2.01 100.0 45 quadruped \n",
"3 False 0.61 8.5 45 bipedal_tailed \n",
"4 False 1.09 19.0 45 bipedal_tailed \n",
"\n",
"[5 rows x 23 columns]\n"
]
}
],
"source": [
"print(pokemon[0:5])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's helpful to see the size of our data (the total number of values) and its shape (rows, columns):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"16583 (721, 23)\n"
]
}
],
"source": [
"print(pokemon.size, pokemon.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's also good to see how many nulls we are dealing with in each column:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Number 0\n",
"Name 0\n",
"Type_1 0\n",
"Type_2 371\n",
"Total 0\n",
"HP 0\n",
"Attack 0\n",
"Defense 0\n",
"Sp_Atk 0\n",
"Sp_Def 0\n",
"Speed 0\n",
"Generation 0\n",
"isLegendary 0\n",
"Color 0\n",
"hasGender 0\n",
"Pr_Male 77\n",
"Egg_Group_1 0\n",
"Egg_Group_2 530\n",
"hasMegaEvolution 0\n",
"Height_m 0\n",
"Weight_kg 0\n",
"Catch_Rate 0\n",
"Body_Style 0\n",
"dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pokemon.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### And that's it!\n",
"From the output above, we can see that the dataset has 721 rows and 23 columns. The Type_2 column has 371 null values, the Pr_Male column has 77 null values and the Egg_Group_2 column has 530 null values.\n",
"\n",
"We've now successfully read in and summarized a .csv. Stay tuned for Day 2 of the 5 Day Data Challenge!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
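
The summary steps in the notebook above can be collected into one small helper. This is only a sketch, not part of the original notebook: the `summarize` function and the `SAMPLE_CSV` string are hypothetical names introduced for illustration, and the inline five-row sample (a subset of columns, with values copied from the first rows printed above) stands in for `pokemon_alopez247.csv` so that the example runs without the file.

```python
import io

import pandas as pd

# Hypothetical five-row sample with a subset of the columns. The values are
# copied from the first rows printed in the notebook; this string is an
# assumption standing in for pokemon_alopez247.csv.
SAMPLE_CSV = """Number,Name,Type_1,Type_2,Total,HP
1,Bulbasaur,Grass,Poison,318,45
2,Ivysaur,Grass,Poison,405,60
3,Venusaur,Grass,Poison,525,80
4,Charmander,Fire,,309,39
5,Charmeleon,Fire,,405,58
"""


def summarize(df):
    """Collect the summaries used in the notebook into one dictionary."""
    n_rows, n_cols = df.shape
    return {
        "rows": n_rows,
        "columns": n_cols,
        "size": df.size,                       # rows * columns
        "nulls": df.isnull().sum().to_dict(),  # missing values per column
        "stats": df.describe(),                # numeric columns only by default
    }


# Empty fields in the CSV (Type_2 for the Fire rows) are read as NaN.
sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = summarize(sample)

print(sample.head())
print(summary["rows"], summary["columns"], summary["size"])
print(summary["nulls"])
```

With the real dataset, the same helper could be called as `summarize(pd.read_csv('pokemon_alopez247.csv'))`, assuming the file sits in the working directory.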