Created
March 15, 2018 01:53
-
-
Save datapluspeople/12f4632094628db13d3168c059101bdf to your computer and use it in GitHub Desktop.
Exploring IBM HR Data with Python | Part 0 - Statistics
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h1>Exploring IBM HR data using Python</h1>\n", | |
"<h2>Intro</h2>\n", | |
"<p>This tutorial series explores the IBM HR data set. This data is typically used to demonstrate the ability of various machine learning algorithms applied to HR data.</p>\n", | |
"<p>In this series, I'll use it to demonstrate the awesome power Python can bring to HR data</p>\n", | |
"<p><b>Sections</b></p>\n", | |
"<ul>\n", | |
" <li>Statistics</li>\n", | |
" <li>Matplotlib</li>\n", | |
" <li>Pandas</li>\n", | |
" <li>Seaborn</li>\n", | |
" <li>Plotly</li>\n", | |
" <li>Findings</li>\n", | |
" </ul>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"__author__ = \"adam\"\n", | |
"__version__ = \"1.0.0\"\n", | |
"__maintainer__ = \"adam\"\n", | |
"__email__ = \"adam@datapluspeople.com\"" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# imports \n", | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"\n", | |
"import matplotlib.pyplot as plt\n", | |
"import seaborn as sns\n", | |
"\n", | |
"%matplotlib inline" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# read the data directly from IBM Watson Analytics\n", | |
"# using pandas read excel file into dataframe\n", | |
"url = \"https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-HR-Employee-Attrition.xlsx\"\n", | |
"empl_data = pd.read_excel(url)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# alternatively, save the file for repeated use\n", | |
"# we'll reference the saved file in the future portions of this analysis\n", | |
"\n", | |
"# empl_data.to_excel(\"WA_Fn-UseC_-HR-Employee-Attrition.xlsx\")\n", | |
"# empl_data = pd.read_excel(\"WA_Fn-UseC_-HR-Employee-Attrition.xlsx\")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h2>pandas</h2>\n", | |
"<p>pandas provides a number of built-in methods allowing us to quickly and easily explore our data. When reading our data, we stored it in a DataFrame. If that it is a new term, for now think of it sort like an Excel sheet - but way better.</p>\n", | |
"<p>Let's see what we've got...</p>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Age</th>\n", | |
" <th>Attrition</th>\n", | |
" <th>BusinessTravel</th>\n", | |
" <th>DailyRate</th>\n", | |
" <th>Department</th>\n", | |
" <th>DistanceFromHome</th>\n", | |
" <th>Education</th>\n", | |
" <th>EducationField</th>\n", | |
" <th>EmployeeCount</th>\n", | |
" <th>EmployeeNumber</th>\n", | |
" <th>...</th>\n", | |
" <th>RelationshipSatisfaction</th>\n", | |
" <th>StandardHours</th>\n", | |
" <th>StockOptionLevel</th>\n", | |
" <th>TotalWorkingYears</th>\n", | |
" <th>TrainingTimesLastYear</th>\n", | |
" <th>WorkLifeBalance</th>\n", | |
" <th>YearsAtCompany</th>\n", | |
" <th>YearsInCurrentRole</th>\n", | |
" <th>YearsSinceLastPromotion</th>\n", | |
" <th>YearsWithCurrManager</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>41</td>\n", | |
" <td>Yes</td>\n", | |
" <td>Travel_Rarely</td>\n", | |
" <td>1102</td>\n", | |
" <td>Sales</td>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>Life Sciences</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>...</td>\n", | |
" <td>1</td>\n", | |
" <td>80</td>\n", | |
" <td>0</td>\n", | |
" <td>8</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>6</td>\n", | |
" <td>4</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>49</td>\n", | |
" <td>No</td>\n", | |
" <td>Travel_Frequently</td>\n", | |
" <td>279</td>\n", | |
" <td>Research & Development</td>\n", | |
" <td>8</td>\n", | |
" <td>1</td>\n", | |
" <td>Life Sciences</td>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>...</td>\n", | |
" <td>4</td>\n", | |
" <td>80</td>\n", | |
" <td>1</td>\n", | |
" <td>10</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>10</td>\n", | |
" <td>7</td>\n", | |
" <td>1</td>\n", | |
" <td>7</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>37</td>\n", | |
" <td>Yes</td>\n", | |
" <td>Travel_Rarely</td>\n", | |
" <td>1373</td>\n", | |
" <td>Research & Development</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>Other</td>\n", | |
" <td>1</td>\n", | |
" <td>4</td>\n", | |
" <td>...</td>\n", | |
" <td>2</td>\n", | |
" <td>80</td>\n", | |
" <td>0</td>\n", | |
" <td>7</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>33</td>\n", | |
" <td>No</td>\n", | |
" <td>Travel_Frequently</td>\n", | |
" <td>1392</td>\n", | |
" <td>Research & Development</td>\n", | |
" <td>3</td>\n", | |
" <td>4</td>\n", | |
" <td>Life Sciences</td>\n", | |
" <td>1</td>\n", | |
" <td>5</td>\n", | |
" <td>...</td>\n", | |
" <td>3</td>\n", | |
" <td>80</td>\n", | |
" <td>0</td>\n", | |
" <td>8</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>8</td>\n", | |
" <td>7</td>\n", | |
" <td>3</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>27</td>\n", | |
" <td>No</td>\n", | |
" <td>Travel_Rarely</td>\n", | |
" <td>591</td>\n", | |
" <td>Research & Development</td>\n", | |
" <td>2</td>\n", | |
" <td>1</td>\n", | |
" <td>Medical</td>\n", | |
" <td>1</td>\n", | |
" <td>7</td>\n", | |
" <td>...</td>\n", | |
" <td>4</td>\n", | |
" <td>80</td>\n", | |
" <td>1</td>\n", | |
" <td>6</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>5 rows × 35 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Age Attrition BusinessTravel DailyRate Department \\\n", | |
"0 41 Yes Travel_Rarely 1102 Sales \n", | |
"1 49 No Travel_Frequently 279 Research & Development \n", | |
"2 37 Yes Travel_Rarely 1373 Research & Development \n", | |
"3 33 No Travel_Frequently 1392 Research & Development \n", | |
"4 27 No Travel_Rarely 591 Research & Development \n", | |
"\n", | |
" DistanceFromHome Education EducationField EmployeeCount EmployeeNumber \\\n", | |
"0 1 2 Life Sciences 1 1 \n", | |
"1 8 1 Life Sciences 1 2 \n", | |
"2 2 2 Other 1 4 \n", | |
"3 3 4 Life Sciences 1 5 \n", | |
"4 2 1 Medical 1 7 \n", | |
"\n", | |
" ... RelationshipSatisfaction StandardHours \\\n", | |
"0 ... 1 80 \n", | |
"1 ... 4 80 \n", | |
"2 ... 2 80 \n", | |
"3 ... 3 80 \n", | |
"4 ... 4 80 \n", | |
"\n", | |
" StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance \\\n", | |
"0 0 8 0 1 \n", | |
"1 1 10 3 3 \n", | |
"2 0 7 3 3 \n", | |
"3 0 8 3 3 \n", | |
"4 1 6 3 3 \n", | |
"\n", | |
" YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion \\\n", | |
"0 6 4 0 \n", | |
"1 10 7 1 \n", | |
"2 0 0 0 \n", | |
"3 8 7 3 \n", | |
"4 2 2 2 \n", | |
"\n", | |
" YearsWithCurrManager \n", | |
"0 5 \n", | |
"1 7 \n", | |
"2 0 \n", | |
"3 0 \n", | |
"4 2 \n", | |
"\n", | |
"[5 rows x 35 columns]" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# view the first 5 rows\n", | |
"empl_data.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(1470, 35)" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# how many rows, columns are in the DataFrame?\n", | |
"empl_data.shape" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"51450" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# how many different data points do we have?\n", | |
"empl_data.size" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',\n", | |
" 'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',\n", | |
" 'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',\n", | |
" 'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',\n", | |
" 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',\n", | |
" 'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',\n", | |
" 'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',\n", | |
" 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',\n", | |
" 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',\n", | |
" 'YearsWithCurrManager'],\n", | |
" dtype='object')" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# what are the names of all the columns?\n", | |
"empl_data.columns" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"<class 'pandas.core.frame.DataFrame'>\n", | |
"RangeIndex: 1470 entries, 0 to 1469\n", | |
"Data columns (total 35 columns):\n", | |
"Age 1470 non-null int64\n", | |
"Attrition 1470 non-null object\n", | |
"BusinessTravel 1470 non-null object\n", | |
"DailyRate 1470 non-null int64\n", | |
"Department 1470 non-null object\n", | |
"DistanceFromHome 1470 non-null int64\n", | |
"Education 1470 non-null int64\n", | |
"EducationField 1470 non-null object\n", | |
"EmployeeCount 1470 non-null int64\n", | |
"EmployeeNumber 1470 non-null int64\n", | |
"EnvironmentSatisfaction 1470 non-null int64\n", | |
"Gender 1470 non-null object\n", | |
"HourlyRate 1470 non-null int64\n", | |
"JobInvolvement 1470 non-null int64\n", | |
"JobLevel 1470 non-null int64\n", | |
"JobRole 1470 non-null object\n", | |
"JobSatisfaction 1470 non-null int64\n", | |
"MaritalStatus 1470 non-null object\n", | |
"MonthlyIncome 1470 non-null int64\n", | |
"MonthlyRate 1470 non-null int64\n", | |
"NumCompaniesWorked 1470 non-null int64\n", | |
"Over18 1470 non-null object\n", | |
"OverTime 1470 non-null object\n", | |
"PercentSalaryHike 1470 non-null int64\n", | |
"PerformanceRating 1470 non-null int64\n", | |
"RelationshipSatisfaction 1470 non-null int64\n", | |
"StandardHours 1470 non-null int64\n", | |
"StockOptionLevel 1470 non-null int64\n", | |
"TotalWorkingYears 1470 non-null int64\n", | |
"TrainingTimesLastYear 1470 non-null int64\n", | |
"WorkLifeBalance 1470 non-null int64\n", | |
"YearsAtCompany 1470 non-null int64\n", | |
"YearsInCurrentRole 1470 non-null int64\n", | |
"YearsSinceLastPromotion 1470 non-null int64\n", | |
"YearsWithCurrManager 1470 non-null int64\n", | |
"dtypes: int64(26), object(9)\n", | |
"memory usage: 402.0+ KB\n" | |
] | |
} | |
], | |
"source": [ | |
"# info about the dataframe\n", | |
"empl_data.info()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Python performs it's work 'in-memory' meaning your computer is holding all of this data in memory. HR data isn't typically 'big data', but it's import to have an understanind of the resource demands of your data analysis." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"Index 80\n", | |
"Age 11760\n", | |
"Attrition 11760\n", | |
"BusinessTravel 11760\n", | |
"DailyRate 11760\n", | |
"Department 11760\n", | |
"DistanceFromHome 11760\n", | |
"Education 11760\n", | |
"EducationField 11760\n", | |
"EmployeeCount 11760\n", | |
"EmployeeNumber 11760\n", | |
"EnvironmentSatisfaction 11760\n", | |
"Gender 11760\n", | |
"HourlyRate 11760\n", | |
"JobInvolvement 11760\n", | |
"JobLevel 11760\n", | |
"JobRole 11760\n", | |
"JobSatisfaction 11760\n", | |
"MaritalStatus 11760\n", | |
"MonthlyIncome 11760\n", | |
"MonthlyRate 11760\n", | |
"NumCompaniesWorked 11760\n", | |
"Over18 11760\n", | |
"OverTime 11760\n", | |
"PercentSalaryHike 11760\n", | |
"PerformanceRating 11760\n", | |
"RelationshipSatisfaction 11760\n", | |
"StandardHours 11760\n", | |
"StockOptionLevel 11760\n", | |
"TotalWorkingYears 11760\n", | |
"TrainingTimesLastYear 11760\n", | |
"WorkLifeBalance 11760\n", | |
"YearsAtCompany 11760\n", | |
"YearsInCurrentRole 11760\n", | |
"YearsSinceLastPromotion 11760\n", | |
"YearsWithCurrManager 11760\n", | |
"dtype: int64" | |
] | |
}, | |
"execution_count": 10, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"empl_data.memory_usage()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<p>the memory_usage method provides us with the memory used by each column in our new DataFrame. To get the entire amount of consumed memory, we can just add each item together. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"The employee DataFrame is using 411680 bytes in memory.\n" | |
] | |
} | |
], | |
"source": [ | |
"memory_used = empl_data.memory_usage().sum()\n", | |
"print(f'The employee DataFrame is using {memory_used} bytes in memory.')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We are using approximately 411kb of system memory to store the DataFrame. Scroll back up to the output of the .info() method. This actually contained the memory usage as well, though the value is slightly less. Why? The memory_usage method includes the memory used by the index of the DataFrame.\n", | |
"\n", | |
"But there's still more, and to enable a more accurate summary, we'll pass an optional item that accounts for the full usage of contained objects." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"The employee DataFrame is using 1177169 bytes in memory.\n" | |
] | |
} | |
], | |
"source": [ | |
"memory_used = empl_data.memory_usage(deep=True).sum()\n", | |
"print(f'The employee DataFrame is using {memory_used} bytes in memory.')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Nearly 3x the original result. You should have no problem holding this amount, but with much larger data sets and complex analysis, you may want to be aware of memory usage when processing on a local computer." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h2>Summary Statistics</h2>\n", | |
"<p>Enough with the background info, let's see the data!</p>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Age</th>\n", | |
" <th>DailyRate</th>\n", | |
" <th>DistanceFromHome</th>\n", | |
" <th>Education</th>\n", | |
" <th>EmployeeCount</th>\n", | |
" <th>EmployeeNumber</th>\n", | |
" <th>EnvironmentSatisfaction</th>\n", | |
" <th>HourlyRate</th>\n", | |
" <th>JobInvolvement</th>\n", | |
" <th>JobLevel</th>\n", | |
" <th>...</th>\n", | |
" <th>RelationshipSatisfaction</th>\n", | |
" <th>StandardHours</th>\n", | |
" <th>StockOptionLevel</th>\n", | |
" <th>TotalWorkingYears</th>\n", | |
" <th>TrainingTimesLastYear</th>\n", | |
" <th>WorkLifeBalance</th>\n", | |
" <th>YearsAtCompany</th>\n", | |
" <th>YearsInCurrentRole</th>\n", | |
" <th>YearsSinceLastPromotion</th>\n", | |
" <th>YearsWithCurrManager</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>count</th>\n", | |
" <td>1470.000000</td>\n", | |
" <td>1470.000000</td>\n", | |
" <td>1470.000000</td>\n", | |
" <td>1470.000000</td>\n", | |
" <td>1470.0</td>\n", | |
" <td>1470.000000</td>\n", | |
" <td>1470.000000</td>\n", | |
" <td>1470.000000</td>\n", | |
" <td>1470.000000</td>\n", | |
" <td>1470.000000</td>\n", | |
" <td>...</td>\n", | |
" <td>1470.000000</td>\n", | |
" <td>1470.0</td>\n", | |
" <td>1470.000000</td>\n", | |
" <td>1470.000000</td>\n", | |
" <td>1470.000000</td>\n", | |
" <td>1470.000000</td>\n", | |
" <td>1470.000000</td>\n", | |
" <td>1470.000000</td>\n", | |
" <td>1470.000000</td>\n", | |
" <td>1470.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>mean</th>\n", | |
" <td>36.923810</td>\n", | |
" <td>802.485714</td>\n", | |
" <td>9.192517</td>\n", | |
" <td>2.912925</td>\n", | |
" <td>1.0</td>\n", | |
" <td>1024.865306</td>\n", | |
" <td>2.721769</td>\n", | |
" <td>65.891156</td>\n", | |
" <td>2.729932</td>\n", | |
" <td>2.063946</td>\n", | |
" <td>...</td>\n", | |
" <td>2.712245</td>\n", | |
" <td>80.0</td>\n", | |
" <td>0.793878</td>\n", | |
" <td>11.279592</td>\n", | |
" <td>2.799320</td>\n", | |
" <td>2.761224</td>\n", | |
" <td>7.008163</td>\n", | |
" <td>4.229252</td>\n", | |
" <td>2.187755</td>\n", | |
" <td>4.123129</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>std</th>\n", | |
" <td>9.135373</td>\n", | |
" <td>403.509100</td>\n", | |
" <td>8.106864</td>\n", | |
" <td>1.024165</td>\n", | |
" <td>0.0</td>\n", | |
" <td>602.024335</td>\n", | |
" <td>1.093082</td>\n", | |
" <td>20.329428</td>\n", | |
" <td>0.711561</td>\n", | |
" <td>1.106940</td>\n", | |
" <td>...</td>\n", | |
" <td>1.081209</td>\n", | |
" <td>0.0</td>\n", | |
" <td>0.852077</td>\n", | |
" <td>7.780782</td>\n", | |
" <td>1.289271</td>\n", | |
" <td>0.706476</td>\n", | |
" <td>6.126525</td>\n", | |
" <td>3.623137</td>\n", | |
" <td>3.222430</td>\n", | |
" <td>3.568136</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>min</th>\n", | |
" <td>18.000000</td>\n", | |
" <td>102.000000</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>1.0</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>30.000000</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>...</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>80.0</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>25%</th>\n", | |
" <td>30.000000</td>\n", | |
" <td>465.000000</td>\n", | |
" <td>2.000000</td>\n", | |
" <td>2.000000</td>\n", | |
" <td>1.0</td>\n", | |
" <td>491.250000</td>\n", | |
" <td>2.000000</td>\n", | |
" <td>48.000000</td>\n", | |
" <td>2.000000</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>...</td>\n", | |
" <td>2.000000</td>\n", | |
" <td>80.0</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>6.000000</td>\n", | |
" <td>2.000000</td>\n", | |
" <td>2.000000</td>\n", | |
" <td>3.000000</td>\n", | |
" <td>2.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>2.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>50%</th>\n", | |
" <td>36.000000</td>\n", | |
" <td>802.000000</td>\n", | |
" <td>7.000000</td>\n", | |
" <td>3.000000</td>\n", | |
" <td>1.0</td>\n", | |
" <td>1020.500000</td>\n", | |
" <td>3.000000</td>\n", | |
" <td>66.000000</td>\n", | |
" <td>3.000000</td>\n", | |
" <td>2.000000</td>\n", | |
" <td>...</td>\n", | |
" <td>3.000000</td>\n", | |
" <td>80.0</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>10.000000</td>\n", | |
" <td>3.000000</td>\n", | |
" <td>3.000000</td>\n", | |
" <td>5.000000</td>\n", | |
" <td>3.000000</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>3.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>75%</th>\n", | |
" <td>43.000000</td>\n", | |
" <td>1157.000000</td>\n", | |
" <td>14.000000</td>\n", | |
" <td>4.000000</td>\n", | |
" <td>1.0</td>\n", | |
" <td>1555.750000</td>\n", | |
" <td>4.000000</td>\n", | |
" <td>83.750000</td>\n", | |
" <td>3.000000</td>\n", | |
" <td>3.000000</td>\n", | |
" <td>...</td>\n", | |
" <td>4.000000</td>\n", | |
" <td>80.0</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>15.000000</td>\n", | |
" <td>3.000000</td>\n", | |
" <td>3.000000</td>\n", | |
" <td>9.000000</td>\n", | |
" <td>7.000000</td>\n", | |
" <td>3.000000</td>\n", | |
" <td>7.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>max</th>\n", | |
" <td>60.000000</td>\n", | |
" <td>1499.000000</td>\n", | |
" <td>29.000000</td>\n", | |
" <td>5.000000</td>\n", | |
" <td>1.0</td>\n", | |
" <td>2068.000000</td>\n", | |
" <td>4.000000</td>\n", | |
" <td>100.000000</td>\n", | |
" <td>4.000000</td>\n", | |
" <td>5.000000</td>\n", | |
" <td>...</td>\n", | |
" <td>4.000000</td>\n", | |
" <td>80.0</td>\n", | |
" <td>3.000000</td>\n", | |
" <td>40.000000</td>\n", | |
" <td>6.000000</td>\n", | |
" <td>4.000000</td>\n", | |
" <td>40.000000</td>\n", | |
" <td>18.000000</td>\n", | |
" <td>15.000000</td>\n", | |
" <td>17.000000</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>8 rows × 26 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Age DailyRate DistanceFromHome Education EmployeeCount \\\n", | |
"count 1470.000000 1470.000000 1470.000000 1470.000000 1470.0 \n", | |
"mean 36.923810 802.485714 9.192517 2.912925 1.0 \n", | |
"std 9.135373 403.509100 8.106864 1.024165 0.0 \n", | |
"min 18.000000 102.000000 1.000000 1.000000 1.0 \n", | |
"25% 30.000000 465.000000 2.000000 2.000000 1.0 \n", | |
"50% 36.000000 802.000000 7.000000 3.000000 1.0 \n", | |
"75% 43.000000 1157.000000 14.000000 4.000000 1.0 \n", | |
"max 60.000000 1499.000000 29.000000 5.000000 1.0 \n", | |
"\n", | |
" EmployeeNumber EnvironmentSatisfaction HourlyRate JobInvolvement \\\n", | |
"count 1470.000000 1470.000000 1470.000000 1470.000000 \n", | |
"mean 1024.865306 2.721769 65.891156 2.729932 \n", | |
"std 602.024335 1.093082 20.329428 0.711561 \n", | |
"min 1.000000 1.000000 30.000000 1.000000 \n", | |
"25% 491.250000 2.000000 48.000000 2.000000 \n", | |
"50% 1020.500000 3.000000 66.000000 3.000000 \n", | |
"75% 1555.750000 4.000000 83.750000 3.000000 \n", | |
"max 2068.000000 4.000000 100.000000 4.000000 \n", | |
"\n", | |
" JobLevel ... RelationshipSatisfaction \\\n", | |
"count 1470.000000 ... 1470.000000 \n", | |
"mean 2.063946 ... 2.712245 \n", | |
"std 1.106940 ... 1.081209 \n", | |
"min 1.000000 ... 1.000000 \n", | |
"25% 1.000000 ... 2.000000 \n", | |
"50% 2.000000 ... 3.000000 \n", | |
"75% 3.000000 ... 4.000000 \n", | |
"max 5.000000 ... 4.000000 \n", | |
"\n", | |
" StandardHours StockOptionLevel TotalWorkingYears \\\n", | |
"count 1470.0 1470.000000 1470.000000 \n", | |
"mean 80.0 0.793878 11.279592 \n", | |
"std 0.0 0.852077 7.780782 \n", | |
"min 80.0 0.000000 0.000000 \n", | |
"25% 80.0 0.000000 6.000000 \n", | |
"50% 80.0 1.000000 10.000000 \n", | |
"75% 80.0 1.000000 15.000000 \n", | |
"max 80.0 3.000000 40.000000 \n", | |
"\n", | |
" TrainingTimesLastYear WorkLifeBalance YearsAtCompany \\\n", | |
"count 1470.000000 1470.000000 1470.000000 \n", | |
"mean 2.799320 2.761224 7.008163 \n", | |
"std 1.289271 0.706476 6.126525 \n", | |
"min 0.000000 1.000000 0.000000 \n", | |
"25% 2.000000 2.000000 3.000000 \n", | |
"50% 3.000000 3.000000 5.000000 \n", | |
"75% 3.000000 3.000000 9.000000 \n", | |
"max 6.000000 4.000000 40.000000 \n", | |
"\n", | |
" YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager \n", | |
"count 1470.000000 1470.000000 1470.000000 \n", | |
"mean 4.229252 2.187755 4.123129 \n", | |
"std 3.623137 3.222430 3.568136 \n", | |
"min 0.000000 0.000000 0.000000 \n", | |
"25% 2.000000 0.000000 2.000000 \n", | |
"50% 3.000000 1.000000 3.000000 \n", | |
"75% 7.000000 3.000000 7.000000 \n", | |
"max 18.000000 15.000000 17.000000 \n", | |
"\n", | |
"[8 rows x 26 columns]" | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# statistics about the DataFrame's numerical fields\n", | |
"empl_data.describe()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<p>Describe allows us to see descriptive statistics for the numerical Series in the DataFrame\n", | |
"Note that non-numeric fields, such as 'Department' are not included in the results. </p>\n", | |
"<p>Very helpful, but I don't want to scroll horizontally. Let's fix that.</p>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>count</th>\n", | |
" <th>mean</th>\n", | |
" <th>std</th>\n", | |
" <th>min</th>\n", | |
" <th>25%</th>\n", | |
" <th>50%</th>\n", | |
" <th>75%</th>\n", | |
" <th>max</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Age</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>36.923810</td>\n", | |
" <td>9.135373</td>\n", | |
" <td>18.0</td>\n", | |
" <td>30.00</td>\n", | |
" <td>36.0</td>\n", | |
" <td>43.00</td>\n", | |
" <td>60.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>DailyRate</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>802.485714</td>\n", | |
" <td>403.509100</td>\n", | |
" <td>102.0</td>\n", | |
" <td>465.00</td>\n", | |
" <td>802.0</td>\n", | |
" <td>1157.00</td>\n", | |
" <td>1499.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>DistanceFromHome</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>9.192517</td>\n", | |
" <td>8.106864</td>\n", | |
" <td>1.0</td>\n", | |
" <td>2.00</td>\n", | |
" <td>7.0</td>\n", | |
" <td>14.00</td>\n", | |
" <td>29.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Education</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>2.912925</td>\n", | |
" <td>1.024165</td>\n", | |
" <td>1.0</td>\n", | |
" <td>2.00</td>\n", | |
" <td>3.0</td>\n", | |
" <td>4.00</td>\n", | |
" <td>5.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>EmployeeCount</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>1.0</td>\n", | |
" <td>1.00</td>\n", | |
" <td>1.0</td>\n", | |
" <td>1.00</td>\n", | |
" <td>1.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>EmployeeNumber</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>1024.865306</td>\n", | |
" <td>602.024335</td>\n", | |
" <td>1.0</td>\n", | |
" <td>491.25</td>\n", | |
" <td>1020.5</td>\n", | |
" <td>1555.75</td>\n", | |
" <td>2068.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>EnvironmentSatisfaction</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>2.721769</td>\n", | |
" <td>1.093082</td>\n", | |
" <td>1.0</td>\n", | |
" <td>2.00</td>\n", | |
" <td>3.0</td>\n", | |
" <td>4.00</td>\n", | |
" <td>4.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>HourlyRate</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>65.891156</td>\n", | |
" <td>20.329428</td>\n", | |
" <td>30.0</td>\n", | |
" <td>48.00</td>\n", | |
" <td>66.0</td>\n", | |
" <td>83.75</td>\n", | |
" <td>100.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>JobInvolvement</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>2.729932</td>\n", | |
" <td>0.711561</td>\n", | |
" <td>1.0</td>\n", | |
" <td>2.00</td>\n", | |
" <td>3.0</td>\n", | |
" <td>3.00</td>\n", | |
" <td>4.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>JobLevel</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>2.063946</td>\n", | |
" <td>1.106940</td>\n", | |
" <td>1.0</td>\n", | |
" <td>1.00</td>\n", | |
" <td>2.0</td>\n", | |
" <td>3.00</td>\n", | |
" <td>5.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>JobSatisfaction</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>2.728571</td>\n", | |
" <td>1.102846</td>\n", | |
" <td>1.0</td>\n", | |
" <td>2.00</td>\n", | |
" <td>3.0</td>\n", | |
" <td>4.00</td>\n", | |
" <td>4.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>MonthlyIncome</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>6502.931293</td>\n", | |
" <td>4707.956783</td>\n", | |
" <td>1009.0</td>\n", | |
" <td>2911.00</td>\n", | |
" <td>4919.0</td>\n", | |
" <td>8379.00</td>\n", | |
" <td>19999.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>MonthlyRate</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>14313.103401</td>\n", | |
" <td>7117.786044</td>\n", | |
" <td>2094.0</td>\n", | |
" <td>8047.00</td>\n", | |
" <td>14235.5</td>\n", | |
" <td>20461.50</td>\n", | |
" <td>26999.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>NumCompaniesWorked</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>2.693197</td>\n", | |
" <td>2.498009</td>\n", | |
" <td>0.0</td>\n", | |
" <td>1.00</td>\n", | |
" <td>2.0</td>\n", | |
" <td>4.00</td>\n", | |
" <td>9.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>PercentSalaryHike</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>15.209524</td>\n", | |
" <td>3.659938</td>\n", | |
" <td>11.0</td>\n", | |
" <td>12.00</td>\n", | |
" <td>14.0</td>\n", | |
" <td>18.00</td>\n", | |
" <td>25.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>PerformanceRating</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>3.153741</td>\n", | |
" <td>0.360824</td>\n", | |
" <td>3.0</td>\n", | |
" <td>3.00</td>\n", | |
" <td>3.0</td>\n", | |
" <td>3.00</td>\n", | |
" <td>4.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>RelationshipSatisfaction</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>2.712245</td>\n", | |
" <td>1.081209</td>\n", | |
" <td>1.0</td>\n", | |
" <td>2.00</td>\n", | |
" <td>3.0</td>\n", | |
" <td>4.00</td>\n", | |
" <td>4.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>StandardHours</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>80.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>80.0</td>\n", | |
" <td>80.00</td>\n", | |
" <td>80.0</td>\n", | |
" <td>80.00</td>\n", | |
" <td>80.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>StockOptionLevel</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>0.793878</td>\n", | |
" <td>0.852077</td>\n", | |
" <td>0.0</td>\n", | |
" <td>0.00</td>\n", | |
" <td>1.0</td>\n", | |
" <td>1.00</td>\n", | |
" <td>3.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>TotalWorkingYears</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>11.279592</td>\n", | |
" <td>7.780782</td>\n", | |
" <td>0.0</td>\n", | |
" <td>6.00</td>\n", | |
" <td>10.0</td>\n", | |
" <td>15.00</td>\n", | |
" <td>40.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>TrainingTimesLastYear</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>2.799320</td>\n", | |
" <td>1.289271</td>\n", | |
" <td>0.0</td>\n", | |
" <td>2.00</td>\n", | |
" <td>3.0</td>\n", | |
" <td>3.00</td>\n", | |
" <td>6.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>WorkLifeBalance</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>2.761224</td>\n", | |
" <td>0.706476</td>\n", | |
" <td>1.0</td>\n", | |
" <td>2.00</td>\n", | |
" <td>3.0</td>\n", | |
" <td>3.00</td>\n", | |
" <td>4.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>YearsAtCompany</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>7.008163</td>\n", | |
" <td>6.126525</td>\n", | |
" <td>0.0</td>\n", | |
" <td>3.00</td>\n", | |
" <td>5.0</td>\n", | |
" <td>9.00</td>\n", | |
" <td>40.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>YearsInCurrentRole</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>4.229252</td>\n", | |
" <td>3.623137</td>\n", | |
" <td>0.0</td>\n", | |
" <td>2.00</td>\n", | |
" <td>3.0</td>\n", | |
" <td>7.00</td>\n", | |
" <td>18.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>YearsSinceLastPromotion</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>2.187755</td>\n", | |
" <td>3.222430</td>\n", | |
" <td>0.0</td>\n", | |
" <td>0.00</td>\n", | |
" <td>1.0</td>\n", | |
" <td>3.00</td>\n", | |
" <td>15.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>YearsWithCurrManager</th>\n", | |
" <td>1470.0</td>\n", | |
" <td>4.123129</td>\n", | |
" <td>3.568136</td>\n", | |
" <td>0.0</td>\n", | |
" <td>2.00</td>\n", | |
" <td>3.0</td>\n", | |
" <td>7.00</td>\n", | |
" <td>17.0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" count mean std min 25% \\\n", | |
"Age 1470.0 36.923810 9.135373 18.0 30.00 \n", | |
"DailyRate 1470.0 802.485714 403.509100 102.0 465.00 \n", | |
"DistanceFromHome 1470.0 9.192517 8.106864 1.0 2.00 \n", | |
"Education 1470.0 2.912925 1.024165 1.0 2.00 \n", | |
"EmployeeCount 1470.0 1.000000 0.000000 1.0 1.00 \n", | |
"EmployeeNumber 1470.0 1024.865306 602.024335 1.0 491.25 \n", | |
"EnvironmentSatisfaction 1470.0 2.721769 1.093082 1.0 2.00 \n", | |
"HourlyRate 1470.0 65.891156 20.329428 30.0 48.00 \n", | |
"JobInvolvement 1470.0 2.729932 0.711561 1.0 2.00 \n", | |
"JobLevel 1470.0 2.063946 1.106940 1.0 1.00 \n", | |
"JobSatisfaction 1470.0 2.728571 1.102846 1.0 2.00 \n", | |
"MonthlyIncome 1470.0 6502.931293 4707.956783 1009.0 2911.00 \n", | |
"MonthlyRate 1470.0 14313.103401 7117.786044 2094.0 8047.00 \n", | |
"NumCompaniesWorked 1470.0 2.693197 2.498009 0.0 1.00 \n", | |
"PercentSalaryHike 1470.0 15.209524 3.659938 11.0 12.00 \n", | |
"PerformanceRating 1470.0 3.153741 0.360824 3.0 3.00 \n", | |
"RelationshipSatisfaction 1470.0 2.712245 1.081209 1.0 2.00 \n", | |
"StandardHours 1470.0 80.000000 0.000000 80.0 80.00 \n", | |
"StockOptionLevel 1470.0 0.793878 0.852077 0.0 0.00 \n", | |
"TotalWorkingYears 1470.0 11.279592 7.780782 0.0 6.00 \n", | |
"TrainingTimesLastYear 1470.0 2.799320 1.289271 0.0 2.00 \n", | |
"WorkLifeBalance 1470.0 2.761224 0.706476 1.0 2.00 \n", | |
"YearsAtCompany 1470.0 7.008163 6.126525 0.0 3.00 \n", | |
"YearsInCurrentRole 1470.0 4.229252 3.623137 0.0 2.00 \n", | |
"YearsSinceLastPromotion 1470.0 2.187755 3.222430 0.0 0.00 \n", | |
"YearsWithCurrManager 1470.0 4.123129 3.568136 0.0 2.00 \n", | |
"\n", | |
" 50% 75% max \n", | |
"Age 36.0 43.00 60.0 \n", | |
"DailyRate 802.0 1157.00 1499.0 \n", | |
"DistanceFromHome 7.0 14.00 29.0 \n", | |
"Education 3.0 4.00 5.0 \n", | |
"EmployeeCount 1.0 1.00 1.0 \n", | |
"EmployeeNumber 1020.5 1555.75 2068.0 \n", | |
"EnvironmentSatisfaction 3.0 4.00 4.0 \n", | |
"HourlyRate 66.0 83.75 100.0 \n", | |
"JobInvolvement 3.0 3.00 4.0 \n", | |
"JobLevel 2.0 3.00 5.0 \n", | |
"JobSatisfaction 3.0 4.00 4.0 \n", | |
"MonthlyIncome 4919.0 8379.00 19999.0 \n", | |
"MonthlyRate 14235.5 20461.50 26999.0 \n", | |
"NumCompaniesWorked 2.0 4.00 9.0 \n", | |
"PercentSalaryHike 14.0 18.00 25.0 \n", | |
"PerformanceRating 3.0 3.00 4.0 \n", | |
"RelationshipSatisfaction 3.0 4.00 4.0 \n", | |
"StandardHours 80.0 80.00 80.0 \n", | |
"StockOptionLevel 1.0 1.00 3.0 \n", | |
"TotalWorkingYears 10.0 15.00 40.0 \n", | |
"TrainingTimesLastYear 3.0 3.00 6.0 \n", | |
"WorkLifeBalance 3.0 3.00 4.0 \n", | |
"YearsAtCompany 5.0 9.00 40.0 \n", | |
"YearsInCurrentRole 3.0 7.00 18.0 \n", | |
"YearsSinceLastPromotion 1.0 3.00 15.0 \n", | |
"YearsWithCurrManager 3.0 7.00 17.0 " | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# transpose the DataFrame\n", | |
"empl_data.describe().transpose()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Transposing the output of describe allows us to view without scrolling. This works well for this DataFrames with many columns." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h3>Working with DataFrames</h3>\n", | |
"<p>We've managed to get the data <i>in</i> to the DataFrame, but how do we now get data <b><i>out</i></b> of the DataFrame?</p>\n", | |
"<p>We don't always want the full DataFrame, we may want to select items that only meet certain criteria.</p>\n", | |
"<p>Let's have a look at how to do just that.</p>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0 0\n", | |
"1 3\n", | |
"2 3\n", | |
"3 3\n", | |
"4 3\n", | |
"5 2\n", | |
"6 3\n", | |
"7 2\n", | |
"8 2\n", | |
"9 3\n", | |
"10 5\n", | |
"11 3\n", | |
"12 1\n", | |
"13 2\n", | |
"14 4\n", | |
"15 1\n", | |
"16 5\n", | |
"17 2\n", | |
"18 3\n", | |
"19 3\n", | |
"20 5\n", | |
"21 4\n", | |
"22 4\n", | |
"23 6\n", | |
"24 2\n", | |
"25 3\n", | |
"26 5\n", | |
"27 2\n", | |
"28 4\n", | |
"29 2\n", | |
" ..\n", | |
"1440 3\n", | |
"1441 2\n", | |
"1442 3\n", | |
"1443 2\n", | |
"1444 4\n", | |
"1445 3\n", | |
"1446 2\n", | |
"1447 4\n", | |
"1448 5\n", | |
"1449 4\n", | |
"1450 2\n", | |
"1451 1\n", | |
"1452 3\n", | |
"1453 2\n", | |
"1454 3\n", | |
"1455 2\n", | |
"1456 2\n", | |
"1457 2\n", | |
"1458 5\n", | |
"1459 2\n", | |
"1460 3\n", | |
"1461 3\n", | |
"1462 2\n", | |
"1463 2\n", | |
"1464 2\n", | |
"1465 3\n", | |
"1466 5\n", | |
"1467 0\n", | |
"1468 3\n", | |
"1469 3\n", | |
"Name: TrainingTimesLastYear, Length: 1470, dtype: int64" | |
] | |
}, | |
"execution_count": 15, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# select all the Training counts from last year\n", | |
"empl_data['TrainingTimesLastYear']" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<p>The selection returned to us a Series containing all of the Training data. The '..' applied by the Jupyter Notebook condensed the output so we didn't have to see all 1,469 rows. </p>\n", | |
"<p>We'll now store this Series in a new 'training' variable and generate some statistics to help us understand the Training in the organization.</p>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# store the result\n", | |
"trainings = empl_data['TrainingTimesLastYear']" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<p>As the result is now stored in memory, it is not returned to the screen. To view it, we'll have to explicitly ask for it.</p>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0 0\n", | |
"1 3\n", | |
"2 3\n", | |
"3 3\n", | |
"4 3\n", | |
"5 2\n", | |
"6 3\n", | |
"7 2\n", | |
"8 2\n", | |
"9 3\n", | |
"10 5\n", | |
"11 3\n", | |
"12 1\n", | |
"13 2\n", | |
"14 4\n", | |
"15 1\n", | |
"16 5\n", | |
"17 2\n", | |
"18 3\n", | |
"19 3\n", | |
"20 5\n", | |
"21 4\n", | |
"22 4\n", | |
"23 6\n", | |
"24 2\n", | |
"25 3\n", | |
"26 5\n", | |
"27 2\n", | |
"28 4\n", | |
"29 2\n", | |
" ..\n", | |
"1440 3\n", | |
"1441 2\n", | |
"1442 3\n", | |
"1443 2\n", | |
"1444 4\n", | |
"1445 3\n", | |
"1446 2\n", | |
"1447 4\n", | |
"1448 5\n", | |
"1449 4\n", | |
"1450 2\n", | |
"1451 1\n", | |
"1452 3\n", | |
"1453 2\n", | |
"1454 3\n", | |
"1455 2\n", | |
"1456 2\n", | |
"1457 2\n", | |
"1458 5\n", | |
"1459 2\n", | |
"1460 3\n", | |
"1461 3\n", | |
"1462 2\n", | |
"1463 2\n", | |
"1464 2\n", | |
"1465 3\n", | |
"1466 5\n", | |
"1467 0\n", | |
"1468 3\n", | |
"1469 3\n", | |
"Name: TrainingTimesLastYear, Length: 1470, dtype: int64" | |
] | |
}, | |
"execution_count": 17, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# view the training variable contents\n", | |
"trainings" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<p>Let's now use this to answer the following questions:</p>\n", | |
"<ul>\n", | |
" <li>How many total trainings took place last year?</li>\n", | |
" <li>What is the average number of trainings employees received?</li>\n", | |
" <li>What is the maximum number of trainings received?</li>\n", | |
" <li>What is the minimum number of trainings received?</li>\n", | |
" <li>For each number of trainings, how many employees are in each group?</li>\n", | |
" <li>How many employees received the minimum number of trainings?</li>\n", | |
" </ul>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"4115" | |
] | |
}, | |
"execution_count": 18, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# total number of trainings\n", | |
"trainings.sum()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"1470" | |
] | |
}, | |
"execution_count": 19, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# total number of employees\n", | |
"trainings.count()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"2.7993197278911564" | |
] | |
}, | |
"execution_count": 20, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# average number of trainings\n", | |
"trainings.mean()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"6" | |
] | |
}, | |
"execution_count": 21, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# maximum number of trainings\n", | |
"trainings.max()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0" | |
] | |
}, | |
"execution_count": 22, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# minimum number of trainings\n", | |
"trainings.min()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"2 547\n", | |
"3 491\n", | |
"4 123\n", | |
"5 119\n", | |
"1 71\n", | |
"6 65\n", | |
"0 54\n", | |
"Name: TrainingTimesLastYear, dtype: int64" | |
] | |
}, | |
"execution_count": 23, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# number of employees in each group\n", | |
"trainings.value_counts()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"54" | |
] | |
}, | |
"execution_count": 24, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# number of employees not receiving training last year\n", | |
"trainings.value_counts()[0]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"4% of employees did not receive training last year.\n" | |
] | |
} | |
], | |
"source": [ | |
"# percentage of employees not receiving training\n", | |
"print('{:.0f}% of employees did not receive training last year.'.format(((trainings.value_counts()[0])/trainings.count())*100))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h2>Section Recap</h2>\n", | |
"<p>In this section we covered a lot of ground after just a few simple steps:</p>\n", | |
"<ul>\n", | |
" <li>Read in a dataset directly from a website</li>\n", | |
" <li>Explored the dataset for understanding of it's contents</li>\n", | |
" <li>Analyzed the amount of memory consumed by this dataset</li>\n", | |
" <li>Generated Summary Statistics of all the numerical fields</li>\n", | |
" <li>Extracted the <var>training</var> column, returning a Series</li>\n", | |
" <li>Evaluated training activity within the organization</li>\n", | |
" <li>Discovered our first insight - 4% haven't received any training!</li>\n", | |
" </ul>" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.4" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment