Skip to content

Instantly share code, notes, and snippets.

@gangulymadhura
Created July 28, 2019 11:31
Show Gist options
  • Save gangulymadhura/d45528c23cf16b68d3ecfaf04a3b6d35 to your computer and use it in GitHub Desktop.
Save gangulymadhura/d45528c23cf16b68d3ecfaf04a3b6d35 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Import Packages\n",
"First, we import pandas package and dataset module from sklearn package to get the dataset for this tutorial. "
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn import datasets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Load Dataset\n",
"Load iris dataset from datasets module that was imported. Iris is a famous dataset in the world of statistics and has been used for numerous tutorials.\n",
"\n",
"If you run ```print(iris)``` you will see that it returns a dictionary where the first key, value pair is \"data\" and a 150x 4 numpy array. The second key, value pair is \"target\" and list of integer values. The last key, value pair is \"feature_names\" and a list of names. We shall be using these 3 components to build our pandas dataframe that will be used in the rest of this tutorial."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Iris data shape : (150, 4)\n",
"\n",
"\n"
]
}
],
"source": [
"# import some data to play with\n",
"iris = datasets.load_iris()\n",
"#print(iris)\n",
"print(\"Iris data shape :\", iris['data'].shape)\n",
"print('\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Create DataFrame\n",
"Dataframes are most useful data stuctures in python, they are 2 -dimensional tabular data structures with column names, row names or index and data.\n",
"We use function ```DataFrame()``` to convert the numpy array into a pandas dataframe. This is the only function to create a dataframe in pandas. We can check the columns created with the ```.columns()``` function. "
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',\n",
" 'petal width (cm)'],\n",
" dtype='object')\n"
]
}
],
"source": [
"# Create data frame\n",
"df=pd.DataFrame(iris.data,columns=iris.feature_names)\n",
"print(df.columns)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Add new column\n",
"Next we see how a new column \"target\" can be added to an existing dataframe. A new column can be created by assigning a list. This is one of many other ways, but this is the most common way of doing this. Learn about other ways in this excellent article - https://www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking top records: \n",
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 3.5 1.4 0.2 \n",
"1 4.9 3.0 1.4 0.2 \n",
"2 4.7 3.2 1.3 0.2 \n",
"3 4.6 3.1 1.5 0.2 \n",
"4 5.0 3.6 1.4 0.2 \n",
"\n",
" target \n",
"0 0 \n",
"1 0 \n",
"2 0 \n",
"3 0 \n",
"4 0 \n",
"\n",
"\n"
]
}
],
"source": [
"# Lets add another column to the dataset\n",
"df['target']=iris['target']\n",
"print(\"Checking top records: \")\n",
"print(df.head())\n",
"print('\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Recode column\n",
"Then we create yet another new column \"Species\" by mapping values from column \"target\" to respective species names using dictionary \"mapping_species\" with ```map()``` function. This is also known as recoding a column. Another way to this would be by using ```.replace()``` which gives the same result. We create another version of \"Species\", \"Species_2\" by using ```.replace()```.\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking species mapping: \n",
"{0: 'setosa', 1: 'versicolor', 2: 'virginica'}\n",
"\n",
"\n",
"Checking if Species is created : \n",
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 3.5 1.4 0.2 \n",
"1 4.9 3.0 1.4 0.2 \n",
"2 4.7 3.2 1.3 0.2 \n",
"3 4.6 3.1 1.5 0.2 \n",
"4 5.0 3.6 1.4 0.2 \n",
"\n",
" target Species \n",
"0 0 setosa \n",
"1 0 setosa \n",
"2 0 setosa \n",
"3 0 setosa \n",
"4 0 setosa \n",
"\n",
"\n",
"Checking if Species_2 is created : \n",
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 3.5 1.4 0.2 \n",
"1 4.9 3.0 1.4 0.2 \n",
"2 4.7 3.2 1.3 0.2 \n",
"3 4.6 3.1 1.5 0.2 \n",
"4 5.0 3.6 1.4 0.2 \n",
"\n",
" target Species Species_2 \n",
"0 0 setosa setosa \n",
"1 0 setosa setosa \n",
"2 0 setosa setosa \n",
"3 0 setosa setosa \n",
"4 0 setosa setosa \n",
"\n",
"\n"
]
}
],
"source": [
"\n",
"# Use map to recode column\n",
"species_mapping={0:iris['target_names'][0],1:iris['target_names'][1],2:iris['target_names'][2]}\n",
"print('Checking species mapping: ')\n",
"print(species_mapping)\n",
"print('\\n')\n",
"\n",
"df['Species'] = df['target'].map(species_mapping)\n",
"print(\"Checking if Species is created : \")\n",
"print(df.head())\n",
"print('\\n')\n",
"\n",
"# Use .replace to recode column\n",
"df['Species_2'] = df['target'].replace(species_mapping)\n",
"print(\"Checking if Species_2 is created : \")\n",
"print(df.head())\n",
"print('\\n')\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Drop columns\n",
"Columns can be dropped by using the ```.drop()``` function with option ```axis=1``` specifying that we want to drop a column. Checking the column names again shows that \"Species_2\" has been dropped. We can use ```tolist()``` function to convert columnn names, which is a series to list, this is not necessary but just nicer to look at."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Checking if Species_2 is dropped : \n",
"['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'target', 'Species']\n",
"\n",
"\n"
]
}
],
"source": [
"df.drop(['Species_2'],axis=1,inplace=True)\n",
"print(\"Checking if Species_2 is dropped : \")\n",
"print(df.columns.tolist())\n",
"print('\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Read Dataset\n",
"\n",
"Here we have loaded the iris dataset from datasets module of sklearn package so that anyone can run the script. But in actual work scenario you would need to read in the dataset either from a server or from your local folder. Keeping that in mind lets see how it would work if you have the same file stored in a local folder. For this we shall first write out the dataframe created above and then read it back. This way its reproducible for everyone.\n",
"\n",
"Notice when we write out the dataframe with ```.to_csv()``` we specify option ```index=False```, if this is not done an unnamed column with row indexes will be created in the written out file. You can try to write out wihtout this option.\n",
"\n",
"The most common way to read files with Pandas is with ```read_csv()``` function. This function has many parameters that can be used to specify how data needs to be read. For example, by default the first row of the data will be considered to be the header and used to create column names, if the file does not have any headers then specify ``` headers=None```, unless we specify column names with option ```names``` the dataframe will have unnamed columns.Take a look at pandas documentation to learn more about the available options when reading in data -\n",
"https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html\n"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 3.5 1.4 0.2 \n",
"1 4.9 3.0 1.4 0.2 \n",
"2 4.7 3.2 1.3 0.2 \n",
"3 4.6 3.1 1.5 0.2 \n",
"4 5.0 3.6 1.4 0.2 \n",
"\n",
" target Species \n",
"0 0 setosa \n",
"1 0 setosa \n",
"2 0 setosa \n",
"3 0 setosa \n",
"4 0 setosa \n"
]
}
],
"source": [
"df.to_csv('Iris.csv',index=False)\n",
"iris_df = pd.read_csv('Iris.csv')\n",
"print(iris_df.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Create DataFrame from scratch\n",
"\n",
"Though not needed that often it is useful to know how to create a pandas dataframe from scratch. We can create a dataframe from a dictionary with values as lists, where dictionary keys are the dataframe column names and the values are column values. There are options to specify datatypes and index values. Take a look at pandas documentation to learn more - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Name Gender\n",
"0 El Girl\n",
"1 Mike Boy\n",
"2 Dustin Boy\n",
"3 Lucas Boy\n",
"4 Max Girl\n"
]
}
],
"source": [
"# create dictionary\n",
"dat_dict = {'Name':['El','Mike','Dustin','Lucas','Max','Will'],'Gender':['Girl','Boy','Boy','Boy','Girl','Boy']}\n",
"# convert to dataframe\n",
"dat_df = pd.DataFrame(dat_dict)\n",
"# check output\n",
"print(dat_df.head())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Inspecting a DataFrame\n",
"\n",
"Now that we have learnt how to create a pandas dataframe be it from an existing loaded dataset or reading in an external file or from scratch, lets inspect some properties of a dataframe. ```type()``` function returns the class type of \"dat_df\" as pandas dataframe and that of column \"Name\" as pandas series. We have not introduced series so far, series are another type of data stucture in pandas, they are one dimensional arrays with labels. Each column of a dataframe is a series as you can see from running ```print(type(dat_df['Name']))```. You can create a series from a numply array in same way as a dataframe from a dictionary just by replacing ```.DataFrame()``` function with ```.series()```. Finally we can insert a series as a new column into a dataframe."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataframe class : <class 'pandas.core.frame.DataFrame'>\n",
"Dataframe column class : <class 'pandas.core.series.Series'>\n",
"Series : 0 None\n",
"1 Nancy\n",
"2 None\n",
"3 Erika\n",
"4 Billy\n",
"5 Jonathan\n",
"dtype: object\n",
"New dataframe :\n",
" Name Gender Sibling\n",
"0 El Girl None\n",
"1 Mike Boy Nancy\n",
"2 Dustin Boy None\n",
"3 Lucas Boy Erika\n",
"4 Max Girl Billy\n"
]
}
],
"source": [
"# Check class type\n",
"print('Dataframe class :', type(dat_df))\n",
"print('Dataframe column class :', type(dat_df['Name']))\n",
"\n",
"# import numpy package\n",
"import numpy as np\n",
"# create numpy array\n",
"Sibling = np.array([None,'Nancy',None,'Erika','Billy','Jonathan'])\n",
"# convert to pandas series\n",
"Sibling = pd.Series(Sibling)\n",
"print('Series :',Sibling)\n",
"\n",
"# insert series as new column into dataframe\n",
"dat_df['Sibling']=Sibling\n",
"print(\"New dataframe :\")\n",
"print(dat_df.head())"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment