@codebrain001
Created April 12, 2020 20:12
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Exploratory Data Analysis"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>state</th>\n",
" <th>state_po</th>\n",
" <th>county</th>\n",
" <th>FIPS</th>\n",
" <th>office</th>\n",
" <th>candidate</th>\n",
" <th>party</th>\n",
" <th>candidatevotes</th>\n",
" <th>totalvotes</th>\n",
" <th>version</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2016.0</td>\n",
" <td>Alabama</td>\n",
" <td>AL</td>\n",
" <td>Autauga</td>\n",
" <td>1001.0</td>\n",
" <td>President</td>\n",
" <td>Hillary Clinton</td>\n",
" <td>democrat</td>\n",
" <td>5936.0</td>\n",
" <td>24973.0</td>\n",
" <td>20190722.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2016.0</td>\n",
" <td>Alabama</td>\n",
" <td>AL</td>\n",
" <td>Autauga</td>\n",
" <td>1001.0</td>\n",
" <td>President</td>\n",
" <td>Donald Trump</td>\n",
" <td>republican</td>\n",
" <td>18172.0</td>\n",
" <td>24973.0</td>\n",
" <td>20190722.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2016.0</td>\n",
" <td>Alabama</td>\n",
" <td>AL</td>\n",
" <td>Autauga</td>\n",
" <td>1001.0</td>\n",
" <td>President</td>\n",
" <td>Other</td>\n",
" <td>NaN</td>\n",
" <td>865.0</td>\n",
" <td>24973.0</td>\n",
" <td>20190722.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2016.0</td>\n",
" <td>Alabama</td>\n",
" <td>AL</td>\n",
" <td>Baldwin</td>\n",
" <td>1003.0</td>\n",
" <td>President</td>\n",
" <td>Hillary Clinton</td>\n",
" <td>democrat</td>\n",
" <td>18458.0</td>\n",
" <td>95215.0</td>\n",
" <td>20190722.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2016.0</td>\n",
" <td>Alabama</td>\n",
" <td>AL</td>\n",
" <td>Baldwin</td>\n",
" <td>1003.0</td>\n",
" <td>President</td>\n",
" <td>Donald Trump</td>\n",
" <td>republican</td>\n",
" <td>72883.0</td>\n",
" <td>95215.0</td>\n",
" <td>20190722.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" year state state_po ... candidatevotes totalvotes version\n",
"0 2016.0 Alabama AL ... 5936.0 24973.0 20190722.0\n",
"1 2016.0 Alabama AL ... 18172.0 24973.0 20190722.0\n",
"2 2016.0 Alabama AL ... 865.0 24973.0 20190722.0\n",
"3 2016.0 Alabama AL ... 18458.0 95215.0 20190722.0\n",
"4 2016.0 Alabama AL ... 72883.0 95215.0 20190722.0\n",
"\n",
"[5 rows x 11 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### Getting an overview of the data\n",
"dask_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 9474 entries, 0 to 9473\n",
"Data columns (total 11 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 year 9474 non-null float64\n",
" 1 state 9474 non-null object \n",
" 2 state_po 9462 non-null object \n",
" 3 county 9474 non-null object \n",
" 4 FIPS 9462 non-null float64\n",
" 5 office 9474 non-null object \n",
" 6 candidate 9474 non-null object \n",
" 7 party 6316 non-null object \n",
" 8 candidatevotes 9468 non-null float64\n",
" 9 totalvotes 9474 non-null float64\n",
" 10 version 9474 non-null float64\n",
"dtypes: float64(5), object(6)\n",
"memory usage: 814.3+ KB\n"
]
}
],
"source": [
"# compute() materializes the Dask DataFrame as a pandas DataFrame; info() then lists each column's data type (dtype) and 'Non-Null Count', which reveals the columns with missing values\n",
"dask_df.compute().info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
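"Every numeric column was read as float64 and every non-numeric column as object. This comes from dtype inference when the file is loaded rather than from memory management: float64 can represent missing entries as NaN (FIPS and candidatevotes both have some), and text is stored as object."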
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
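The closing note in the notebook concerns how columns end up as float64 or object. As a rough illustration only, the sketch below mimics that outcome with the standard library. `infer_dtype` is a hypothetical helper written for this example, not the algorithm pandas or Dask actually use, and the sample strings are made up to resemble the columns shown above.

```python
def infer_dtype(raw_values):
    """Hypothetical, simplified stand-in for CSV dtype inference.

    Assumption: this is NOT the pandas/Dask implementation, only a sketch
    of the usual outcome (int64, float64, or object).
    """
    missing_markers = {"", "NaN", "nan", "NA"}
    present = [v for v in raw_values if v not in missing_markers]
    has_missing = len(present) < len(raw_values)

    def all_parse(caster):
        # True when every non-missing value converts cleanly with `caster`.
        try:
            for value in present:
                caster(value)
            return True
        except ValueError:
            return False

    # Integers survive only when nothing is missing, because NaN is a float.
    if all_parse(int) and not has_missing:
        return "int64"
    if all_parse(float):
        return "float64"
    # Anything that is not numeric falls back to the generic object dtype.
    return "object"


if __name__ == "__main__":
    print(infer_dtype(["2016.0", "2016.0"]))             # float64
    print(infer_dtype(["1001", "", "1003"]))             # float64
    print(infer_dtype(["democrat", "NaN", "republican"]))  # object
    print(infer_dtype(["5936", "18172"]))                # int64
```

The second call shows why a whole-number column such as FIPS can surface as float64: one missing entry is enough to rule out a plain integer dtype in this simplified model.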