Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save gangulymadhura/d98d33ca16beb219cfa1b13a26eda162 to your computer and use it in GitHub Desktop.
Save gangulymadhura/d98d33ca16beb219cfa1b13a26eda162 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Practical Guide to Pandas Essentials for Beginners\n",
"\n",
"\n",
"This tutorial covers\n",
"1. Selecting rows and columns from a dataframe\n",
"2. Change values of a dataframe\n",
"3. Appending dataframes\n",
"4. Handling duplicates\n",
"5. Handling missing values\n",
"6. Changing columns types"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Quick recap on last tutorial\n",
"\n",
"In last tutorial we have covered basic data operations like read, write, adding new column, recoding existing columns and dropping column. Here is the github link to the jupyter notebook. Also posted below.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Create the DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Iris data shape : (150, 4)\n",
"\n",
"\n",
"Top records : \n",
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 3.5 1.4 0.2 \n",
"1 4.9 3.0 1.4 0.2 \n",
"2 4.7 3.2 1.3 0.2 \n",
"3 4.6 3.1 1.5 0.2 \n",
"4 5.0 3.6 1.4 0.2 \n",
"\n",
" target Species \n",
"0 0 setosa \n",
"1 0 setosa \n",
"2 0 setosa \n",
"3 0 setosa \n",
"4 0 setosa \n"
]
}
],
"source": [
"#import libraries\n",
"import pandas as pd\n",
"from sklearn import datasets\n",
"\n",
"#creating the dataset for this tutorial\n",
"iris = datasets.load_iris()\n",
"df=pd.DataFrame(iris.data,columns=iris.feature_names)\n",
"df['target']=iris['target']\n",
"\n",
"# Use map to recode column\n",
"species_mapping={0:iris['target_names'][0],1:iris['target_names'][1],2:iris['target_names'][2]}\n",
"#print('Checking species mapping: ')\n",
"#print(species_mapping)\n",
"#print('\\n')\n",
"\n",
"df['Species'] = df['target'].map(species_mapping)\n",
"#print(\"Checking if Species is created : \")\n",
"#print(df.head())\n",
"#print('\\n')\n",
"\n",
"#show few records\n",
"print('\\n')\n",
"print(\"Iris data shape :\", iris['data'].shape)\n",
"print('\\n')\n",
"print(\"Top records : \")\n",
"print(df.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Selecting rows and columns from dataframe\n",
"\n",
"One of the most common data tasks we perform are selecting subsets of data either by selecting set of rows or columns or combination of both. This can be a bit tricky (and frustrating) in pandas unless we know what's happening behind the scenes. Let's see how we can accomplish these tasks in pandas. \n",
"We shall be using the same ```iris``` dataset as used in last turorial. The first part of the script is just recreating the dataset from last tutorial."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## .iloc and .loc\n",
"\n",
".iloc and .loc are indexers that are used for selecting rows and columns in pandas, .iloc can be used to select rows or columns by number and .loc can used to select rows with row labels or boolean masks. Note that the dataframe created above has row numbers as row labels, this is the default row index when no column is assigned to be the row label of a pandas dataframe. \n",
"\n",
"When selecting single row or column subsets from a pandas dataframe using double square brackets will return a dataframe and using single square brackets with return a series. Let's see how ```iloc``` can be used to select rows and columns."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# iloc"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Select row 1 using iloc (returns series) :\n"
]
},
{
"data": {
"text/plain": [
"sepal length (cm) 5.1\n",
"sepal width (cm) 3.5\n",
"petal length (cm) 1.4\n",
"petal width (cm) 0.2\n",
"target 0\n",
"Species setosa\n",
"Name: 0, dtype: object"
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(\"\\n\")\n",
"print(\"Select row 1 using iloc (returns series) :\")\n",
"# select 1st row using iloc\n",
"df.iloc[0] # returns a series, not specifying colunms returns all columns\n",
"df.iloc[0,:] # returns a series, not specifying colunms returns all columns\n"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Select row 1 using iloc (returns dataframe) :\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" <th>sepal width (cm)</th>\n",
" <th>petal length (cm)</th>\n",
" <th>petal width (cm)</th>\n",
" <th>target</th>\n",
" <th>Species</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>5.1</td>\n",
" <td>3.5</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>0</td>\n",
" <td>setosa</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 3.5 1.4 0.2 \n",
"\n",
" target Species \n",
"0 0 setosa "
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(\"\\n\")\n",
"print(\"Select row 1 using iloc (returns dataframe) :\")\n",
"df.iloc[[0],:] # returns a dataframe, not specifying colunms returns all columns"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Select multiple rows using using iloc :\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" <th>sepal width (cm)</th>\n",
" <th>petal length (cm)</th>\n",
" <th>petal width (cm)</th>\n",
" <th>target</th>\n",
" <th>Species</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>5.1</td>\n",
" <td>3.5</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>0</td>\n",
" <td>setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>4.9</td>\n",
" <td>3.0</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>0</td>\n",
" <td>setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>4.7</td>\n",
" <td>3.2</td>\n",
" <td>1.3</td>\n",
" <td>0.2</td>\n",
" <td>0</td>\n",
" <td>setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>4.6</td>\n",
" <td>3.1</td>\n",
" <td>1.5</td>\n",
" <td>0.2</td>\n",
" <td>0</td>\n",
" <td>setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>5.0</td>\n",
" <td>3.6</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>0</td>\n",
" <td>setosa</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 3.5 1.4 0.2 \n",
"1 4.9 3.0 1.4 0.2 \n",
"2 4.7 3.2 1.3 0.2 \n",
"3 4.6 3.1 1.5 0.2 \n",
"4 5.0 3.6 1.4 0.2 \n",
"\n",
" target Species \n",
"0 0 setosa \n",
"1 0 setosa \n",
"2 0 setosa \n",
"3 0 setosa \n",
"4 0 setosa "
]
},
"execution_count": 103,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(\"\\n\")\n",
"print(\"Select multiple rows using using iloc :\")\n",
"# select multiple rows using iloc\n",
"df.iloc[0:5] # returns a dataframe\n",
"df.iloc[0:5,:] # returns a dataframe\n"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Select column 1 using using iloc (returns series) :\n"
]
},
{
"data": {
"text/plain": [
"0 5.1\n",
"1 4.9\n",
"2 4.7\n",
"3 4.6\n",
"4 5.0\n",
" ... \n",
"145 6.7\n",
"146 6.3\n",
"147 6.5\n",
"148 6.2\n",
"149 5.9\n",
"Name: sepal length (cm), Length: 150, dtype: float64"
]
},
"execution_count": 104,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(\"\\n\")\n",
"print(\"Select column 1 using using iloc (returns series) :\")\n",
"# select 1st column using iloc\n",
"df.iloc[:,0] # returns a series, not specifying rows returns all rows\n",
"df.iloc[:,0] # returns a series, not specifying rows returns all rows"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Select column 1 using using iloc (returns dataframe) :\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>5.1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>4.9</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>4.7</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>4.6</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <td>145</td>\n",
" <td>6.7</td>\n",
" </tr>\n",
" <tr>\n",
" <td>146</td>\n",
" <td>6.3</td>\n",
" </tr>\n",
" <tr>\n",
" <td>147</td>\n",
" <td>6.5</td>\n",
" </tr>\n",
" <tr>\n",
" <td>148</td>\n",
" <td>6.2</td>\n",
" </tr>\n",
" <tr>\n",
" <td>149</td>\n",
" <td>5.9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>150 rows × 1 columns</p>\n",
"</div>"
],
"text/plain": [
" sepal length (cm)\n",
"0 5.1\n",
"1 4.9\n",
"2 4.7\n",
"3 4.6\n",
"4 5.0\n",
".. ...\n",
"145 6.7\n",
"146 6.3\n",
"147 6.5\n",
"148 6.2\n",
"149 5.9\n",
"\n",
"[150 rows x 1 columns]"
]
},
"execution_count": 105,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(\"\\n\")\n",
"print(\"Select column 1 using using iloc (returns dataframe) :\")\n",
"# select 1st column using iloc\n",
"df.iloc[:,[0]] # returns a dataframe, not specifying rows returns all rows"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Select multiple columns using using iloc :\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" <th>sepal width (cm)</th>\n",
" <th>petal length (cm)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>5.1</td>\n",
" <td>3.5</td>\n",
" <td>1.4</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>4.9</td>\n",
" <td>3.0</td>\n",
" <td>1.4</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>4.7</td>\n",
" <td>3.2</td>\n",
" <td>1.3</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>4.6</td>\n",
" <td>3.1</td>\n",
" <td>1.5</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>5.0</td>\n",
" <td>3.6</td>\n",
" <td>1.4</td>\n",
" </tr>\n",
" <tr>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <td>145</td>\n",
" <td>6.7</td>\n",
" <td>3.0</td>\n",
" <td>5.2</td>\n",
" </tr>\n",
" <tr>\n",
" <td>146</td>\n",
" <td>6.3</td>\n",
" <td>2.5</td>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>147</td>\n",
" <td>6.5</td>\n",
" <td>3.0</td>\n",
" <td>5.2</td>\n",
" </tr>\n",
" <tr>\n",
" <td>148</td>\n",
" <td>6.2</td>\n",
" <td>3.4</td>\n",
" <td>5.4</td>\n",
" </tr>\n",
" <tr>\n",
" <td>149</td>\n",
" <td>5.9</td>\n",
" <td>3.0</td>\n",
" <td>5.1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>150 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm)\n",
"0 5.1 3.5 1.4\n",
"1 4.9 3.0 1.4\n",
"2 4.7 3.2 1.3\n",
"3 4.6 3.1 1.5\n",
"4 5.0 3.6 1.4\n",
".. ... ... ...\n",
"145 6.7 3.0 5.2\n",
"146 6.3 2.5 5.0\n",
"147 6.5 3.0 5.2\n",
"148 6.2 3.4 5.4\n",
"149 5.9 3.0 5.1\n",
"\n",
"[150 rows x 3 columns]"
]
},
"execution_count": 106,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(\"\\n\")\n",
"print(\"Select multiple columns using using iloc :\")\n",
"# select multiple columns using iloc\n",
"df.iloc[:,0:3] # returns a dataframe,not specifying rows returns all rows "
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Select multiple rows and columns using using iloc :\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" <th>sepal width (cm)</th>\n",
" <th>petal length (cm)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>5.1</td>\n",
" <td>3.5</td>\n",
" <td>1.4</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>4.9</td>\n",
" <td>3.0</td>\n",
" <td>1.4</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>4.7</td>\n",
" <td>3.2</td>\n",
" <td>1.3</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>4.6</td>\n",
" <td>3.1</td>\n",
" <td>1.5</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>5.0</td>\n",
" <td>3.6</td>\n",
" <td>1.4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm)\n",
"0 5.1 3.5 1.4\n",
"1 4.9 3.0 1.4\n",
"2 4.7 3.2 1.3\n",
"3 4.6 3.1 1.5\n",
"4 5.0 3.6 1.4"
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(\"\\n\")\n",
"print(\"Select multiple rows and columns using using iloc :\")\n",
"# select rows and columns with iloc\n",
"df.iloc[0:5,0:3] # returns a dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# loc"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Select 1 row using loc (returns series) :\n"
]
},
{
"data": {
"text/plain": [
"sepal length (cm) 5.1\n",
"sepal width (cm) 3.5\n",
"petal length (cm) 1.4\n",
"petal width (cm) 0.2\n",
"target 0\n",
"Species setosa\n",
"Name: 0, dtype: object"
]
},
"execution_count": 108,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# select 1st row using loc\n",
"print(\"\\n\")\n",
"print(\"Select 1 row using loc (returns series) :\")\n",
"df.loc[0] # returns a series, not specifying colunms returns all columns\n",
"df.loc[0,:] # returns a series, not specifying colunms returns all columns"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Select 1 row using loc (returns dataframe) :\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" <th>sepal width (cm)</th>\n",
" <th>petal length (cm)</th>\n",
" <th>petal width (cm)</th>\n",
" <th>target</th>\n",
" <th>Species</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>5.1</td>\n",
" <td>3.5</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>0</td>\n",
" <td>setosa</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 3.5 1.4 0.2 \n",
"\n",
" target Species \n",
"0 0 setosa "
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# select 1st row using loc\n",
"print(\"\\n\")\n",
"print(\"Select 1 row using loc (returns dataframe) :\")\n",
"df.loc[[0],:] # returns a dataframe, not specifying colunms returns all columns\n"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Select multiple rows using loc (returns dataframe) :\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" <th>sepal width (cm)</th>\n",
" <th>petal length (cm)</th>\n",
" <th>petal width (cm)</th>\n",
" <th>target</th>\n",
" <th>Species</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>5.1</td>\n",
" <td>3.5</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>0</td>\n",
" <td>setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>4.9</td>\n",
" <td>3.0</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>0</td>\n",
" <td>setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>4.7</td>\n",
" <td>3.2</td>\n",
" <td>1.3</td>\n",
" <td>0.2</td>\n",
" <td>0</td>\n",
" <td>setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>4.6</td>\n",
" <td>3.1</td>\n",
" <td>1.5</td>\n",
" <td>0.2</td>\n",
" <td>0</td>\n",
" <td>setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>5.0</td>\n",
" <td>3.6</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>0</td>\n",
" <td>setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <td>5</td>\n",
" <td>5.4</td>\n",
" <td>3.9</td>\n",
" <td>1.7</td>\n",
" <td>0.4</td>\n",
" <td>0</td>\n",
" <td>setosa</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 3.5 1.4 0.2 \n",
"1 4.9 3.0 1.4 0.2 \n",
"2 4.7 3.2 1.3 0.2 \n",
"3 4.6 3.1 1.5 0.2 \n",
"4 5.0 3.6 1.4 0.2 \n",
"5 5.4 3.9 1.7 0.4 \n",
"\n",
" target Species \n",
"0 0 setosa \n",
"1 0 setosa \n",
"2 0 setosa \n",
"3 0 setosa \n",
"4 0 setosa \n",
"5 0 setosa "
]
},
"execution_count": 110,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(\"\\n\")\n",
"print(\"Select multiple rows using loc (returns dataframe) :\")\n",
"# select multiple rows using loc\n",
"df.loc[0:5] # returns a dataframe\n",
"df.loc[0:5,:] # returns a dataframe"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Select 1 column using loc (returns series) :\n"
]
},
{
"data": {
"text/plain": [
"0 5.1\n",
"1 4.9\n",
"2 4.7\n",
"3 4.6\n",
"4 5.0\n",
" ... \n",
"145 6.7\n",
"146 6.3\n",
"147 6.5\n",
"148 6.2\n",
"149 5.9\n",
"Name: sepal length (cm), Length: 150, dtype: float64"
]
},
"execution_count": 111,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# select 1st column using loc\n",
"# df.loc[:,0] # does not work\n",
"print(\"\\n\")\n",
"print(\"Select 1 column using loc (returns series) :\")\n",
"df.loc[:,'sepal length (cm)'] # returns a series, not specifying rows returns all rows"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Select 1 column using loc (returns dataframe) :\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>5.1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>4.9</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>4.7</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>4.6</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <td>145</td>\n",
" <td>6.7</td>\n",
" </tr>\n",
" <tr>\n",
" <td>146</td>\n",
" <td>6.3</td>\n",
" </tr>\n",
" <tr>\n",
" <td>147</td>\n",
" <td>6.5</td>\n",
" </tr>\n",
" <tr>\n",
" <td>148</td>\n",
" <td>6.2</td>\n",
" </tr>\n",
" <tr>\n",
" <td>149</td>\n",
" <td>5.9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>150 rows × 1 columns</p>\n",
"</div>"
],
"text/plain": [
" sepal length (cm)\n",
"0 5.1\n",
"1 4.9\n",
"2 4.7\n",
"3 4.6\n",
"4 5.0\n",
".. ...\n",
"145 6.7\n",
"146 6.3\n",
"147 6.5\n",
"148 6.2\n",
"149 5.9\n",
"\n",
"[150 rows x 1 columns]"
]
},
"execution_count": 112,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# select 1st column using loc\n",
"# df.loc[:,0] # does not work\n",
"print(\"\\n\")\n",
"print(\"Select 1 column using loc (returns dataframe) :\")\n",
"df.loc[:,['sepal length (cm)']] # returns a dataframe,not specifying rows returns all rows"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Select multiple columns using loc :\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" <th>sepal width (cm)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>5.1</td>\n",
" <td>3.5</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>4.9</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>4.7</td>\n",
" <td>3.2</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>4.6</td>\n",
" <td>3.1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>5.0</td>\n",
" <td>3.6</td>\n",
" </tr>\n",
" <tr>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <td>145</td>\n",
" <td>6.7</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>146</td>\n",
" <td>6.3</td>\n",
" <td>2.5</td>\n",
" </tr>\n",
" <tr>\n",
" <td>147</td>\n",
" <td>6.5</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>148</td>\n",
" <td>6.2</td>\n",
" <td>3.4</td>\n",
" </tr>\n",
" <tr>\n",
" <td>149</td>\n",
" <td>5.9</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>150 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm)\n",
"0 5.1 3.5\n",
"1 4.9 3.0\n",
"2 4.7 3.2\n",
"3 4.6 3.1\n",
"4 5.0 3.6\n",
".. ... ...\n",
"145 6.7 3.0\n",
"146 6.3 2.5\n",
"147 6.5 3.0\n",
"148 6.2 3.4\n",
"149 5.9 3.0\n",
"\n",
"[150 rows x 2 columns]"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# select multiple columns using loc\n",
"print(\"\\n\")\n",
"print(\"Select multiple columns using loc :\")\n",
"df.loc[:,['sepal length (cm)','sepal width (cm)']] # returns a dataframe,not specifying rows returns all rows "
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Select rows and columns using using loc (returns series) :\n"
]
},
{
"data": {
"text/plain": [
"0 5.1\n",
"1 4.9\n",
"2 4.7\n",
"3 4.6\n",
"4 5.0\n",
"5 5.4\n",
"Name: sepal length (cm), dtype: float64"
]
},
"execution_count": 114,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# select rows and columns with loc\n",
"print(\"\\n\")\n",
"print(\"Select rows and columns using using loc (returns series) :\")\n",
"df.loc[0:5,'sepal length (cm)'] # returns a series"
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Select rows and columns using using loc (returns dataframe) :\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>5.1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>4.9</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>4.7</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>4.6</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>5</td>\n",
" <td>5.4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal length (cm)\n",
"0 5.1\n",
"1 4.9\n",
"2 4.7\n",
"3 4.6\n",
"4 5.0\n",
"5 5.4"
]
},
"execution_count": 115,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# select rows and columns with loc\n",
"print(\"\\n\")\n",
"print(\"Select rows and columns using using loc (returns dataframe) :\")\n",
"df.loc[0:5,['sepal length (cm)']] # returns a dataframe,not specifying rows returns all rows \n"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Select rows and columns using using loc:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" <th>sepal width (cm)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>5.1</td>\n",
" <td>3.5</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>4.9</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>4.7</td>\n",
" <td>3.2</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>4.6</td>\n",
" <td>3.1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>5.0</td>\n",
" <td>3.6</td>\n",
" </tr>\n",
" <tr>\n",
" <td>5</td>\n",
" <td>5.4</td>\n",
" <td>3.9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm)\n",
"0 5.1 3.5\n",
"1 4.9 3.0\n",
"2 4.7 3.2\n",
"3 4.6 3.1\n",
"4 5.0 3.6\n",
"5 5.4 3.9"
]
},
"execution_count": 116,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# select rows and columns with loc\n",
"print(\"\\n\")\n",
"print(\"Select rows and columns using using loc:\")\n",
"df.loc[0:5,['sepal length (cm)','sepal width (cm)']] # returns a dataframe"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 True\n",
"1 True\n",
"2 True\n",
"3 True\n",
"4 True\n",
" ... \n",
"145 True\n",
"146 True\n",
"147 True\n",
"148 True\n",
"149 True\n",
"Name: sepal length (cm), Length: 150, dtype: bool"
]
},
"execution_count": 117,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# select rows with conditions using loc\n",
"bool_mask = df['sepal length (cm)']>4 # returns series of boolean values\n",
"bool_mask"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" <th>sepal width (cm)</th>\n",
" <th>petal length (cm)</th>\n",
" <th>petal width (cm)</th>\n",
" <th>target</th>\n",
" <th>Species</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>5.1</td>\n",
" <td>3.5</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>0</td>\n",
" <td>setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>4.9</td>\n",
" <td>3.0</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>0</td>\n",
" <td>setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>4.7</td>\n",
" <td>3.2</td>\n",
" <td>1.3</td>\n",
" <td>0.2</td>\n",
" <td>0</td>\n",
" <td>setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>4.6</td>\n",
" <td>3.1</td>\n",
" <td>1.5</td>\n",
" <td>0.2</td>\n",
" <td>0</td>\n",
" <td>setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>5.0</td>\n",
" <td>3.6</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>0</td>\n",
" <td>setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <td>145</td>\n",
" <td>6.7</td>\n",
" <td>3.0</td>\n",
" <td>5.2</td>\n",
" <td>2.3</td>\n",
" <td>2</td>\n",
" <td>virginica</td>\n",
" </tr>\n",
" <tr>\n",
" <td>146</td>\n",
" <td>6.3</td>\n",
" <td>2.5</td>\n",
" <td>5.0</td>\n",
" <td>1.9</td>\n",
" <td>2</td>\n",
" <td>virginica</td>\n",
" </tr>\n",
" <tr>\n",
" <td>147</td>\n",
" <td>6.5</td>\n",
" <td>3.0</td>\n",
" <td>5.2</td>\n",
" <td>2.0</td>\n",
" <td>2</td>\n",
" <td>virginica</td>\n",
" </tr>\n",
" <tr>\n",
" <td>148</td>\n",
" <td>6.2</td>\n",
" <td>3.4</td>\n",
" <td>5.4</td>\n",
" <td>2.3</td>\n",
" <td>2</td>\n",
" <td>virginica</td>\n",
" </tr>\n",
" <tr>\n",
" <td>149</td>\n",
" <td>5.9</td>\n",
" <td>3.0</td>\n",
" <td>5.1</td>\n",
" <td>1.8</td>\n",
" <td>2</td>\n",
" <td>virginica</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>150 rows × 6 columns</p>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 3.5 1.4 0.2 \n",
"1 4.9 3.0 1.4 0.2 \n",
"2 4.7 3.2 1.3 0.2 \n",
"3 4.6 3.1 1.5 0.2 \n",
"4 5.0 3.6 1.4 0.2 \n",
".. ... ... ... ... \n",
"145 6.7 3.0 5.2 2.3 \n",
"146 6.3 2.5 5.0 1.9 \n",
"147 6.5 3.0 5.2 2.0 \n",
"148 6.2 3.4 5.4 2.3 \n",
"149 5.9 3.0 5.1 1.8 \n",
"\n",
" target Species \n",
"0 0 setosa \n",
"1 0 setosa \n",
"2 0 setosa \n",
"3 0 setosa \n",
"4 0 setosa \n",
".. ... ... \n",
"145 2 virginica \n",
"146 2 virginica \n",
"147 2 virginica \n",
"148 2 virginica \n",
"149 2 virginica \n",
"\n",
"[150 rows x 6 columns]"
]
},
"execution_count": 118,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[bool_mask] #returns dataframe with all columns"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0\n",
"1 0\n",
"2 0\n",
"3 0\n",
"4 0\n",
" ..\n",
"145 2\n",
"146 2\n",
"147 2\n",
"148 2\n",
"149 2\n",
"Name: target, Length: 150, dtype: int32"
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[bool_mask,'target'] #returns series"
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>target</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <td>145</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <td>146</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <td>147</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <td>148</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <td>149</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>150 rows × 1 columns</p>\n",
"</div>"
],
"text/plain": [
" target\n",
"0 0\n",
"1 0\n",
"2 0\n",
"3 0\n",
"4 0\n",
".. ...\n",
"145 2\n",
"146 2\n",
"147 2\n",
"148 2\n",
"149 2\n",
"\n",
"[150 rows x 1 columns]"
]
},
"execution_count": 120,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[bool_mask,['target']] #returns dataframe"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
".loc\n",
" target\n",
"0 0\n",
"1 0\n",
"2 0\n",
"3 0\n",
"4 0\n",
"5 0\n",
".iloc\n",
" target\n",
"0 0\n",
"1 0\n",
"2 0\n",
"3 0\n",
"4 0\n"
]
}
],
"source": [
"# Note the difference\n",
"print(\".loc\")\n",
"print(df.loc[0:5,['target']]) # loc returns all rows 0 to 5 with matching labels\n",
"\n",
"print(\".iloc\")\n",
"print(df.iloc[0:5,[4]]) # iloc returns rows 0 to 4, just like regular python indexing in lists, tuples, dictionaries\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# .Isin()\n",
"\n",
"The ```isin()``` function comes in handy when you need to check for specific values in a row or columns. One way could be to right multiple ```==``` conditions but thet could be really tedious. Let's check an example below."
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['setosa' 'versicolor']\n"
]
}
],
"source": [
"# select all rows where species is \"setosa\" or \"versicolor\"\n",
"df_new=df.loc[df['Species'].isin(['setosa','versicolor']),:]\n",
"print(df_new.Species.unique())\n"
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['virginica']\n"
]
}
],
"source": [
"# select all rows where species is not \"setosa\" and \"versicolor\"\n",
"df_new=df.loc[~df['Species'].isin(['setosa','versicolor']),:]\n",
"print(df_new.Species.unique())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Modifying a dataframe\n",
"### Views and Copies\n",
"Now that we have a good understanding of how to use loc and iloc, lets move on to understanding how to modify a dataframe or create new dataframes by modifying existing dataframes. In order to understand this we need to learn about views and copyies in pandas.Till now we have been creating views of a dataframe which is just a subset of the dataframe, when modyfying a dataframe we need to be careful of wether we are modifying a vew or the original dataframe. Lets check out the example below."
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original value : target\n",
"4 0\n",
"Modified value : target\n",
"4 1\n",
"Original value : target\n",
"8 0\n",
"13 0\n",
"38 0\n",
"41 0\n",
"42 0\n",
"Modified value : target\n",
"8 0\n",
"13 0\n",
"38 0\n",
"41 0\n",
"42 0\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\madhura\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\ipykernel_launcher.py:7: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" import sys\n",
"C:\\Users\\madhura\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\pandas\\core\\indexing.py:494: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" self.obj[item] = s\n",
"C:\\Users\\madhura\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\ipykernel_launcher.py:16: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" app.launch_new_instance()\n"
]
}
],
"source": [
"# We want to modify row 5 value for column \"target\"\n",
"# print row 5 and column target\n",
"df_copy=df\n",
"print(\"Original value : \",df_copy[4:5][['target']])\n",
"\n",
"# lets modify this value\n",
"df_copy[4:5][['target']]=1\n",
"\n",
"# lets check if the value has changed\n",
"print(\"Modified value : \",df_copy.loc[4:4,['target']]) # it did get changed\n",
"\n",
"# print rows where sepal length<=4.5 and column target\n",
"print(\"Original value : \",df_copy[df_copy['sepal length (cm)']<=4.5][['target']])\n",
"\n",
"# lets modify this value\n",
"df_copy[df_copy['sepal length (cm)']<=4.5][['target']]=2\n",
"\n",
"# lets check if the value has changed\n",
"print(\"Modified value : \",df_copy[df_copy['sepal length (cm)']<=4.5][['target']]) #it did not get changed\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we run this we get ```SettingWithCopyWarning``` as pandas is not sure wether the original dataframe got changed or a view got modified. As we can see the target column in row 4 shows changed value but target column with sepal length <= 4.5 does not change. This warning is a way for pandas to let us know that things might not have gone as expected. We can solve this by using .loc or .iloc and not using chaining."
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original value : target\n",
"5 0\n",
"Modified value : target\n",
"5 2\n",
"Original value : target\n",
"8 0\n",
"13 0\n",
"38 0\n",
"41 0\n",
"42 0\n",
"Modified value : target\n",
"8 2\n",
"13 2\n",
"38 2\n",
"41 2\n",
"42 2\n"
]
}
],
"source": [
"df_copy=df\n",
"print(\"Original value : \",df_copy.loc[[5],['target']])\n",
"\n",
"# lets modify this value\n",
"df_copy.loc[[5],['target']]=2\n",
"\n",
"# lets check if the value has changed\n",
"print(\"Modified value : \",df_copy.loc[[5],['target']]) # it did get changed\n",
"\n",
"# print rows where sepal length<=4.5 and column target\n",
"print(\"Original value : \",df_copy.loc[df_copy['sepal length (cm)']<=4.5,['target']])\n",
"\n",
"# lets modify this value\n",
"df_copy.loc[df_copy['sepal length (cm)']<=4.5,['target']]=2\n",
"\n",
"# lets check if the value has changed\n",
"print(\"Modified value : \",df_copy.loc[df_copy['sepal length (cm)']<=4.5,['target']]) # it did get changed\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now lets try to make some changes to subset dataframe using .loc and .iloc."
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original value 8 2\n",
"13 2\n",
"38 2\n",
"41 2\n",
"42 2\n",
"Name: target, dtype: int32\n",
"Modified value : 8 3\n",
"13 3\n",
"38 3\n",
"41 3\n",
"42 3\n",
"Name: target, dtype: int64\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\madhura\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\ipykernel_launcher.py:8: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" \n"
]
}
],
"source": [
"# create a copy of dataframe\n",
"df_copy_sub= df_copy.loc[df_copy['sepal length (cm)']<=4.5,:]\n",
"\n",
"# original value\n",
"print(\"Original value \", df_copy_sub['target'])\n",
"\n",
"# lets modify this value\n",
"df_copy_sub['target']=3\n",
"\n",
"# lets check if the value has changed\n",
"print(\"Modified value :\",df_copy_sub['target']) \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ```SettingWithCopyWarning``` warning appears again though we have used .loc and .iloc. This is because df_copy is not a copy of df, so when we try to modify df_copy pandas is again confused. To avoid this, better to use ```df_copy=df.copy()``` and not ```df_copy=df```."
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"8 2\n",
"13 2\n",
"38 2\n",
"41 2\n",
"42 2\n",
"Name: target, dtype: int32\n",
"8 3\n",
"13 3\n",
"38 3\n",
"41 3\n",
"42 3\n",
"Name: target, dtype: int64\n"
]
}
],
"source": [
"df_copy=df.copy()\n",
"# save the dataframe\n",
"df_copy_sub= df_copy.loc[df_copy['sepal length (cm)']<=4.5,:].copy()\n",
"\n",
"# check original value\n",
"print(df_copy_sub['target'])\n",
"\n",
"# lets modify this value\n",
"df_copy_sub['target']=3\n",
"\n",
"# lets check if the value has changed\n",
"print(df_copy_sub['target']) # it get changed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we got rid of the pesky ```SettingWithCopyWarning```."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Appending dataframes\n",
"\n",
"It is common to append 2 dataframes on columns or rows while working our way through data and pandas provides a simple way to do so with the ```concat``` function."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dataframes can be appended on column by using option ``` axis=1``` "
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original Dataframe - (150, 6)\n",
"Appended Dataframe - (150, 12)\n",
"Concatenated Dataframe\n",
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 3.5 1.4 0.2 \n",
"1 4.9 3.0 1.4 0.2 \n",
"2 4.7 3.2 1.3 0.2 \n",
"3 4.6 3.1 1.5 0.2 \n",
"4 5.0 3.6 1.4 0.2 \n",
"5 5.4 3.9 1.7 0.4 \n",
"6 4.6 3.4 1.4 0.3 \n",
"7 5.0 3.4 1.5 0.2 \n",
"8 4.4 2.9 1.4 0.2 \n",
"9 4.9 3.1 1.5 0.1 \n",
"10 5.4 3.7 1.5 0.2 \n",
"11 4.8 3.4 1.6 0.2 \n",
"12 4.8 3.0 1.4 0.1 \n",
"13 4.3 3.0 1.1 0.1 \n",
"14 5.8 4.0 1.2 0.2 \n",
"15 5.7 4.4 1.5 0.4 \n",
"16 5.4 3.9 1.3 0.4 \n",
"17 5.1 3.5 1.4 0.3 \n",
"18 5.7 3.8 1.7 0.3 \n",
"19 5.1 3.8 1.5 0.3 \n",
"\n",
" target Species sepal length (cm) sepal width (cm) petal length (cm) \\\n",
"0 0 setosa 5.1 3.5 1.4 \n",
"1 0 setosa 4.9 3.0 1.4 \n",
"2 0 setosa 4.7 3.2 1.3 \n",
"3 0 setosa 4.6 3.1 1.5 \n",
"4 1 setosa 5.0 3.6 1.4 \n",
"5 2 setosa 5.4 3.9 1.7 \n",
"6 0 setosa 4.6 3.4 1.4 \n",
"7 0 setosa 5.0 3.4 1.5 \n",
"8 2 setosa 4.4 2.9 1.4 \n",
"9 0 setosa 4.9 3.1 1.5 \n",
"10 0 setosa 5.4 3.7 1.5 \n",
"11 0 setosa 4.8 3.4 1.6 \n",
"12 0 setosa 4.8 3.0 1.4 \n",
"13 2 setosa 4.3 3.0 1.1 \n",
"14 0 setosa 5.8 4.0 1.2 \n",
"15 0 setosa 5.7 4.4 1.5 \n",
"16 0 setosa 5.4 3.9 1.3 \n",
"17 0 setosa 5.1 3.5 1.4 \n",
"18 0 setosa 5.7 3.8 1.7 \n",
"19 0 setosa 5.1 3.8 1.5 \n",
"\n",
" petal width (cm) target Species \n",
"0 0.2 0 setosa \n",
"1 0.2 0 setosa \n",
"2 0.2 0 setosa \n",
"3 0.2 0 setosa \n",
"4 0.2 1 setosa \n",
"5 0.4 2 setosa \n",
"6 0.3 0 setosa \n",
"7 0.2 0 setosa \n",
"8 0.2 2 setosa \n",
"9 0.1 0 setosa \n",
"10 0.2 0 setosa \n",
"11 0.2 0 setosa \n",
"12 0.1 0 setosa \n",
"13 0.1 2 setosa \n",
"14 0.2 0 setosa \n",
"15 0.4 0 setosa \n",
"16 0.4 0 setosa \n",
"17 0.3 0 setosa \n",
"18 0.3 0 setosa \n",
"19 0.3 0 setosa \n"
]
}
],
"source": [
"# Lets create a copy of existing data frame \"df\"\n",
"df_1= df.copy()\n",
"df_2= df.copy()\n",
"\n",
"# Lets append by column\n",
"df_append_bycol= pd.concat([df_1,df_2],axis=1)\n",
"#df_append_bycol= pd.concat([df_1,df_2],axis=\"columns\") #just another way\n",
"\n",
"\n",
"print(\"Original Dataframe - \",df.shape) # original dataframe has 150 records and 5 columns\n",
"print(\"Appended Dataframe - \",df_append_bycol.shape) # new dataframe has 150 records and 10 columns\n",
"\n",
"print(\"Concatenated Dataframe\")\n",
"print(df_append_bycol.head(20)) # new dataframe has 150 records and 10 columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Unequal rows\n",
"Check what happens when 2 dataframes have unequal number of rows. In the example below first dataframe has only 6 rows while second dataframe has 150 rows. The concatenated dataframe has NaN values in the first 5 columns (coming from first dataframe) in all rows beyound row 6."
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original Dataframe - (150, 6)\n",
"Appended Dataframe - (150, 12)\n",
"\n",
"\n",
"Concatenated Dataframe\n",
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 3.5 1.4 0.2 \n",
"1 4.9 3.0 1.4 0.2 \n",
"2 4.7 3.2 1.3 0.2 \n",
"3 4.6 3.1 1.5 0.2 \n",
"4 5.0 3.6 1.4 0.2 \n",
"5 5.4 3.9 1.7 0.4 \n",
"6 NaN NaN NaN NaN \n",
"7 NaN NaN NaN NaN \n",
"8 NaN NaN NaN NaN \n",
"9 NaN NaN NaN NaN \n",
"\n",
" target Species sepal length (cm) sepal width (cm) petal length (cm) \\\n",
"0 0.0 setosa 5.1 3.5 1.4 \n",
"1 0.0 setosa 4.9 3.0 1.4 \n",
"2 0.0 setosa 4.7 3.2 1.3 \n",
"3 0.0 setosa 4.6 3.1 1.5 \n",
"4 1.0 setosa 5.0 3.6 1.4 \n",
"5 2.0 setosa 5.4 3.9 1.7 \n",
"6 NaN NaN 4.6 3.4 1.4 \n",
"7 NaN NaN 5.0 3.4 1.5 \n",
"8 NaN NaN 4.4 2.9 1.4 \n",
"9 NaN NaN 4.9 3.1 1.5 \n",
"\n",
" petal width (cm) target Species \n",
"0 0.2 0 setosa \n",
"1 0.2 0 setosa \n",
"2 0.2 0 setosa \n",
"3 0.2 0 setosa \n",
"4 0.2 1 setosa \n",
"5 0.4 2 setosa \n",
"6 0.3 0 setosa \n",
"7 0.2 0 setosa \n",
"8 0.2 2 setosa \n",
"9 0.1 0 setosa \n"
]
}
],
"source": [
"df_1= df.loc[0:5,:].copy() # df_1 has only 6 rows\n",
"df_2= df.copy()\n",
"\n",
"# Lets append by column\n",
"df_append_bycol= pd.concat([df_1,df_2],axis=1)\n",
"\n",
"print(\"Original Dataframe - \",df.shape) # original dataframe has 150 records and 5 columns\n",
"print(\"Appended Dataframe - \",df_append_bycol.shape) # new dataframe has 150 records and 10 columns\n",
"print(\"\\n\")\n",
"\n",
"# Notice the first 5 columns are NaN from row 7 onwards\n",
"print(\"Concatenated Dataframe\")\n",
"print(df_append_bycol.head(10)) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Different row index\n",
"Check what happens when 2 dataframes have different row indexes. In the example below first dataframe has row index 5 to 10 while second dataframe has row index 0 to 149. The concatenated dataframe has NaN values in the first 5 columns (coming from first dataframe) in all rows but rows 5 to 10. ```concat``` function merged the 2 dataframes on row index, so when concatenating dataframes we need to keep any eye on the row index as well. This can be solved by reseting index first for dataframe."
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original Dataframe -  (150, 6)\n",
"Appended Dataframe -  (150, 12)\n",
"\n",
"\n",
"Concatenated Dataframe\n",
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 NaN NaN NaN NaN \n",
"1 NaN NaN NaN NaN \n",
"2 NaN NaN NaN NaN \n",
"3 NaN NaN NaN NaN \n",
"4 NaN NaN NaN NaN \n",
"5 5.4 3.9 1.7 0.4 \n",
"6 4.6 3.4 1.4 0.3 \n",
"7 5.0 3.4 1.5 0.2 \n",
"8 4.4 2.9 1.4 0.2 \n",
"9 4.9 3.1 1.5 0.1 \n",
"10 5.4 3.7 1.5 0.2 \n",
"11 NaN NaN NaN NaN \n",
"12 NaN NaN NaN NaN \n",
"13 NaN NaN NaN NaN \n",
"14 NaN NaN NaN NaN \n",
"15 NaN NaN NaN NaN \n",
"16 NaN NaN NaN NaN \n",
"17 NaN NaN NaN NaN \n",
"18 NaN NaN NaN NaN \n",
"19 NaN NaN NaN NaN \n",
"\n",
" target Species sepal length (cm) sepal width (cm) petal length (cm) \\\n",
"0 NaN NaN 5.1 3.5 1.4 \n",
"1 NaN NaN 4.9 3.0 1.4 \n",
"2 NaN NaN 4.7 3.2 1.3 \n",
"3 NaN NaN 4.6 3.1 1.5 \n",
"4 NaN NaN 5.0 3.6 1.4 \n",
"5 2.0 setosa 5.4 3.9 1.7 \n",
"6 0.0 setosa 4.6 3.4 1.4 \n",
"7 0.0 setosa 5.0 3.4 1.5 \n",
"8 2.0 setosa 4.4 2.9 1.4 \n",
"9 0.0 setosa 4.9 3.1 1.5 \n",
"10 0.0 setosa 5.4 3.7 1.5 \n",
"11 NaN NaN 4.8 3.4 1.6 \n",
"12 NaN NaN 4.8 3.0 1.4 \n",
"13 NaN NaN 4.3 3.0 1.1 \n",
"14 NaN NaN 5.8 4.0 1.2 \n",
"15 NaN NaN 5.7 4.4 1.5 \n",
"16 NaN NaN 5.4 3.9 1.3 \n",
"17 NaN NaN 5.1 3.5 1.4 \n",
"18 NaN NaN 5.7 3.8 1.7 \n",
"19 NaN NaN 5.1 3.8 1.5 \n",
"\n",
" petal width (cm) target Species \n",
"0 0.2 0 setosa \n",
"1 0.2 0 setosa \n",
"2 0.2 0 setosa \n",
"3 0.2 0 setosa \n",
"4 0.2 1 setosa \n",
"5 0.4 2 setosa \n",
"6 0.3 0 setosa \n",
"7 0.2 0 setosa \n",
"8 0.2 2 setosa \n",
"9 0.1 0 setosa \n",
"10 0.2 0 setosa \n",
"11 0.2 0 setosa \n",
"12 0.1 0 setosa \n",
"13 0.1 2 setosa \n",
"14 0.2 0 setosa \n",
"15 0.4 0 setosa \n",
"16 0.4 0 setosa \n",
"17 0.3 0 setosa \n",
"18 0.3 0 setosa \n",
"19 0.3 0 setosa \n",
"\n",
"\n",
"Original Dataframe -  (150, 6)\n",
"Appended Dataframe -  (150, 12)\n",
"\n",
"\n",
"Concatenated Dataframe with reset index\n",
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.4 3.9 1.7 0.4 \n",
"1 4.6 3.4 1.4 0.3 \n",
"2 5.0 3.4 1.5 0.2 \n",
"3 4.4 2.9 1.4 0.2 \n",
"4 4.9 3.1 1.5 0.1 \n",
"5 5.4 3.7 1.5 0.2 \n",
"6 NaN NaN NaN NaN \n",
"7 NaN NaN NaN NaN \n",
"8 NaN NaN NaN NaN \n",
"9 NaN NaN NaN NaN \n",
"10 NaN NaN NaN NaN \n",
"11 NaN NaN NaN NaN \n",
"12 NaN NaN NaN NaN \n",
"13 NaN NaN NaN NaN \n",
"14 NaN NaN NaN NaN \n",
"15 NaN NaN NaN NaN \n",
"16 NaN NaN NaN NaN \n",
"17 NaN NaN NaN NaN \n",
"18 NaN NaN NaN NaN \n",
"19 NaN NaN NaN NaN \n",
"\n",
" target Species sepal length (cm) sepal width (cm) petal length (cm) \\\n",
"0 2.0 setosa 5.1 3.5 1.4 \n",
"1 0.0 setosa 4.9 3.0 1.4 \n",
"2 0.0 setosa 4.7 3.2 1.3 \n",
"3 2.0 setosa 4.6 3.1 1.5 \n",
"4 0.0 setosa 5.0 3.6 1.4 \n",
"5 0.0 setosa 5.4 3.9 1.7 \n",
"6 NaN NaN 4.6 3.4 1.4 \n",
"7 NaN NaN 5.0 3.4 1.5 \n",
"8 NaN NaN 4.4 2.9 1.4 \n",
"9 NaN NaN 4.9 3.1 1.5 \n",
"10 NaN NaN 5.4 3.7 1.5 \n",
"11 NaN NaN 4.8 3.4 1.6 \n",
"12 NaN NaN 4.8 3.0 1.4 \n",
"13 NaN NaN 4.3 3.0 1.1 \n",
"14 NaN NaN 5.8 4.0 1.2 \n",
"15 NaN NaN 5.7 4.4 1.5 \n",
"16 NaN NaN 5.4 3.9 1.3 \n",
"17 NaN NaN 5.1 3.5 1.4 \n",
"18 NaN NaN 5.7 3.8 1.7 \n",
"19 NaN NaN 5.1 3.8 1.5 \n",
"\n",
" petal width (cm) target Species \n",
"0 0.2 0 setosa \n",
"1 0.2 0 setosa \n",
"2 0.2 0 setosa \n",
"3 0.2 0 setosa \n",
"4 0.2 1 setosa \n",
"5 0.4 2 setosa \n",
"6 0.3 0 setosa \n",
"7 0.2 0 setosa \n",
"8 0.2 2 setosa \n",
"9 0.1 0 setosa \n",
"10 0.2 0 setosa \n",
"11 0.2 0 setosa \n",
"12 0.1 0 setosa \n",
"13 0.1 2 setosa \n",
"14 0.2 0 setosa \n",
"15 0.4 0 setosa \n",
"16 0.4 0 setosa \n",
"17 0.3 0 setosa \n",
"18 0.3 0 setosa \n",
"19 0.3 0 setosa \n"
]
}
],
"source": [
"df_1= df.loc[5:10,:].copy() # df_1 has only 6 to 10 from original dataframe\n",
"df_2= df.copy()\n",
"\n",
"# Lets append by column\n",
"df_append_bycol= pd.concat([df_1,df_2],axis=1)\n",
"\n",
"print(\"Original Dataframe - \",df.shape)\n",
"print(\"Appended Dataframe - \",df_append_bycol.shape)\n",
"print(\"\\n\")\n",
"\n",
"# Notice the first 5 columns are NaN from row 7 onwards\n",
"print(\"Concatenated Dataframe\")\n",
"print(df_append_bycol.head(20)) \n",
"print(\"\\n\")\n",
"\n",
"# reset index\n",
"df_1= df.loc[5:10,:].copy().reset_index(drop=True) # df_1 has only 6 to 10 from original dataframe\n",
"df_2= df.copy()\n",
"\n",
"# Lets append by column\n",
"df_append_bycol= pd.concat([df_1,df_2],axis=1)\n",
"\n",
"print(\"Original Dataframe - \",df.shape)\n",
"print(\"Appended Dataframe - \",df_append_bycol.shape)\n",
"print(\"\\n\")\n",
"\n",
"# Notice the first 5 columns are NaN from row 7 onwards\n",
"print(\"Concatenated Dataframe with reset index\")\n",
"print(df_append_bycol.head(20)) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dataframes can be appended on row by using option ``` axis=0``` \n",
"When appending dataframes by row, the row indexes also get appended and needs to be reset."
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Original Dataframe -  (6, 6)\n",
"Appended Dataframe -  (12, 6)\n",
"\n",
"\n",
"Concatenated Dataframe without reset index\n",
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 3.5 1.4 0.2 \n",
"1 4.9 3.0 1.4 0.2 \n",
"2 4.7 3.2 1.3 0.2 \n",
"3 4.6 3.1 1.5 0.2 \n",
"4 5.0 3.6 1.4 0.2 \n",
"5 5.4 3.9 1.7 0.4 \n",
"0 5.1 3.5 1.4 0.2 \n",
"1 4.9 3.0 1.4 0.2 \n",
"2 4.7 3.2 1.3 0.2 \n",
"3 4.6 3.1 1.5 0.2 \n",
"\n",
" target Species \n",
"0 0 setosa \n",
"1 0 setosa \n",
"2 0 setosa \n",
"3 0 setosa \n",
"4 1 setosa \n",
"5 2 setosa \n",
"0 0 setosa \n",
"1 0 setosa \n",
"2 0 setosa \n",
"3 0 setosa \n",
"\n",
"\n",
"Original Dataframe -  (6, 6)\n",
"Appended Dataframe -  (12, 6)\n",
"\n",
"\n",
"Concatenated Dataframe with reset index\n",
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 3.5 1.4 0.2 \n",
"1 4.9 3.0 1.4 0.2 \n",
"2 4.7 3.2 1.3 0.2 \n",
"3 4.6 3.1 1.5 0.2 \n",
"4 5.0 3.6 1.4 0.2 \n",
"5 5.4 3.9 1.7 0.4 \n",
"6 5.1 3.5 1.4 0.2 \n",
"7 4.9 3.0 1.4 0.2 \n",
"8 4.7 3.2 1.3 0.2 \n",
"9 4.6 3.1 1.5 0.2 \n",
"\n",
" target Species \n",
"0 0 setosa \n",
"1 0 setosa \n",
"2 0 setosa \n",
"3 0 setosa \n",
"4 1 setosa \n",
"5 2 setosa \n",
"6 0 setosa \n",
"7 0 setosa \n",
"8 0 setosa \n",
"9 0 setosa \n"
]
}
],
"source": [
"# Lets create a copy of existing data frame \"df\"\n",
"df_1= df.loc[0:5,:].copy()\n",
"df_2= df.loc[0:5,:].copy()\n",
"\n",
"# Lets append by row\n",
"df_append_byrow= pd.concat([df_1,df_2],axis=0) # check the row index\n",
"print(\"\\n\")\n",
"print(\"Original Dataframe - \",df_1.shape)\n",
"print(\"Appended Dataframe - \",df_append_byrow.shape)\n",
"print(\"\\n\")\n",
"print(\"Concatenated Dataframe without reset index\")\n",
"print(df_append_byrow.head(10))\n",
"\n",
"# Lets append by row and reset index\n",
"df_append_byrow= pd.concat([df_1,df_2],axis=0).reset_index(drop=True) # check the row index\n",
"print(\"\\n\")\n",
"print(\"Original Dataframe - \",df_1.shape)\n",
"print(\"Appended Dataframe - \",df_append_byrow.shape)\n",
"print(\"\\n\")\n",
"\n",
"# check the row index\n",
"print(\"Concatenated Dataframe with reset index\")\n",
"print(df_append_byrow.head(10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Unequal columns\n",
"Check what happens when 2 dataframes have unequal number of columns. In the example below first dataframe has only 3 columns while second dataframe has 5 columns. The concatenated dataframe has NaN values in the first 10 rows (coming from first dataframe) in columns that are missing from it.\n",
"\n",
"Interesting fact !\n",
"Notice that in the concatenated dataframe the order of columns has changed, ```concat``` reorders the columns in alphabetical order, use ```sort= False``` to turn this off."
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" sepal length (cm)\n",
"0 5.1\n",
"1 4.9\n",
"2 4.7\n",
"3 4.6\n",
"4 5.0\n",
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 3.5 1.4 0.2 \n",
"1 4.9 3.0 1.4 0.2 \n",
"2 4.7 3.2 1.3 0.2 \n",
"3 4.6 3.1 1.5 0.2 \n",
"4 5.0 3.6 1.4 0.2 \n",
"\n",
" target Species \n",
"0 0 setosa \n",
"1 0 setosa \n",
"2 0 setosa \n",
"3 0 setosa \n",
"4 1 setosa \n",
"\n",
"\n",
"Original Dataframe1 -  (10, 1)\n",
"Original Dataframe2 -  (10, 6)\n",
"Appended Dataframe -  (20, 6)\n",
"\n",
"\n",
"Concatenated Dataframe\n",
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 NaN NaN NaN \n",
"1 4.9 NaN NaN NaN \n",
"2 4.7 NaN NaN NaN \n",
"3 4.6 NaN NaN NaN \n",
"4 5.0 NaN NaN NaN \n",
"5 5.4 NaN NaN NaN \n",
"6 4.6 NaN NaN NaN \n",
"7 5.0 NaN NaN NaN \n",
"8 4.4 NaN NaN NaN \n",
"9 4.9 NaN NaN NaN \n",
"0 5.1 3.5 1.4 0.2 \n",
"1 4.9 3.0 1.4 0.2 \n",
"2 4.7 3.2 1.3 0.2 \n",
"3 4.6 3.1 1.5 0.2 \n",
"4 5.0 3.6 1.4 0.2 \n",
"5 5.4 3.9 1.7 0.4 \n",
"6 4.6 3.4 1.4 0.3 \n",
"7 5.0 3.4 1.5 0.2 \n",
"8 4.4 2.9 1.4 0.2 \n",
"9 4.9 3.1 1.5 0.1 \n",
"\n",
" target Species \n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"5 NaN NaN \n",
"6 NaN NaN \n",
"7 NaN NaN \n",
"8 NaN NaN \n",
"9 NaN NaN \n",
"0 0.0 setosa \n",
"1 0.0 setosa \n",
"2 0.0 setosa \n",
"3 0.0 setosa \n",
"4 1.0 setosa \n",
"5 2.0 setosa \n",
"6 0.0 setosa \n",
"7 0.0 setosa \n",
"8 2.0 setosa \n",
"9 0.0 setosa \n"
]
}
],
"source": [
"# lets take a smaller subset to be able to inspect the results\n",
"df_1= df.iloc[0:10,0:1].copy() # df_1 has only 6 rows\n",
"df_2= df.iloc[0:10,:].copy()\n",
"\n",
"print(df_1.head())\n",
"print(df_2.head())\n",
"\n",
"# Lets append by row\n",
"df_append_byrow= pd.concat([df_1,df_2],axis=0,sort=False)\n",
"\n",
"print(\"\\n\")\n",
"print(\"Original Dataframe1 - \",df_1.shape)\n",
"print(\"Original Dataframe2 - \",df_2.shape)\n",
"\n",
"print(\"Appended Dataframe - \",df_append_byrow.shape)\n",
"print(\"\\n\")\n",
"\n",
"# Notice the first 5 columns are NaN from row 7 onwards\n",
"print(\"Concatenated Dataframe\")\n",
"print(df_append_byrow.head(20)) # new dataframe has 150 records and 10 columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Handling duplicates\n",
"It's one of basic data sanity checks to look for duplicate rows and columns and treat them appropriately before commencing any analysis on data. And you guessed it right, pandas has a nifty function to do this - ```drop_duplicates()```. With this function we can either **drop all duplicates** or **keep the first/last duplicate** (after ordering by some other columns). We can also consider only a subset of columns when searching for duplicates by using ``` subset ``` option. Lets see how this works.\n"
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {},
"outputs": [],
"source": [
"# lets first add some duplicate rows to our dataframe\n",
"df_1= df.loc[0:5,:].copy()\n",
"df_2= df.loc[0:5,:].copy()\n",
"df_hasduplicates= pd.concat([df_1,df_2],axis=0).reset_index(drop=True) # this dataframe has duplicate rows"
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataframe with duplicates (12, 6)\n",
"Dataframe without duplicates (6, 6)\n"
]
}
],
"source": [
"print(\"Dataframe with duplicates \",df_hasduplicates.shape)\n",
"#print(df_hasduplicates.head(5))\n",
"\n",
"# drop them all\n",
"df_noduplicates= df_hasduplicates.drop_duplicates()\n",
"print(\"Dataframe without duplicates \",df_noduplicates.shape)\n",
"#print(df_noduplicates.head(5))\n"
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataframe with duplicates\n",
"0 8\n",
"2 2\n",
"1 2\n",
"Name: target, dtype: int64\n",
"\n",
"\n",
"Dataframe without duplicates \n",
"2 1\n",
"1 1\n",
"0 1\n",
"Name: target, dtype: int64\n",
"\n",
"\n"
]
}
],
"source": [
"# let's check how many duplicate values are there in target column\n",
"# we can check using group by function (will be covered in a later article)\n",
"print(\"Dataframe with duplicates\")\n",
"print(df_hasduplicates['target'].value_counts())\n",
"print(\"\\n\")\n",
"\n",
"# drop the rows with duplicate target column\n",
"df_noduplicates_1= df_hasduplicates.drop_duplicates(subset=['target'])\n",
"#print(\"Dataframe without duplicates \")\n",
"# print(df_noduplicates.shape) #default is to keep the first row\n",
"# print(\"\\n\")\n",
"\n",
"# let's check if target column has duplicates now\n",
"print(\"Dataframe without duplicates \")\n",
"print(df_noduplicates_1['target'].value_counts())\n",
"print(\"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the dataframe with duplicate values has 8 records with value 0, 2 records with value 1 and 2 records with value 2 and the dataframe without duplicates has 1 value of each type. Note, that here pandas keeps the first record for each value of target column.\n",
"Next, lets try to use the option ```keep``` to keep records with lowest sepal length."
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataframe without duplicates \n",
"2 1\n",
"1 1\n",
"0 1\n",
"Name: target, dtype: int64\n",
"\n",
"\n"
]
}
],
"source": [
"# drop the rows with duplicate target column while keeping the row with highest sepal length\n",
"# first sort by target and petal length\n",
"df_hasduplicates.sort_values(['target','sepal length (cm)'],ascending=False,inplace=True) \n",
"\n",
"#drop duplicates\n",
"df_noduplicates_2= df_hasduplicates.drop_duplicates(subset=['target'],keep='last')\n",
"#print(\"Dataframe without duplicates \",df_noduplicates.shape)\n",
"\n",
"# let's check if target column has duplicates now\n",
"print(\"Dataframe without duplicates \")\n",
"print(df_noduplicates_2['target'].value_counts())\n",
"print(\"\\n\")\n"
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"First dataframe :\n",
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 3.5 1.4 0.2 \n",
"4 5.0 3.6 1.4 0.2 \n",
"5 5.4 3.9 1.7 0.4 \n",
"\n",
" target Species \n",
"0 0 setosa \n",
"4 1 setosa \n",
"5 2 setosa \n",
"\n",
"\n",
"Second dataframe :\n",
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"11 5.4 3.9 1.7 0.4 \n",
"10 5.0 3.6 1.4 0.2 \n",
"9 4.6 3.1 1.5 0.2 \n",
"\n",
" target Species \n",
"11 2 setosa \n",
"10 1 setosa \n",
"9 0 setosa \n",
"\n",
"\n"
]
}
],
"source": [
"print(\"First dataframe :\")\n",
"print(df_noduplicates_1)\n",
"print(\"\\n\")\n",
"\n",
"print(\"Second dataframe :\")\n",
"print(df_noduplicates_2)\n",
"print(\"\\n\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Handling missing values\n",
"\n",
"Knowing how to find and replace missing values is key to data analysis. In pandas we can filter out both rows and columns with any or all missing values with the ```dropna()``` function. \n",
"Let's see the different ways in which pandas lets us manage missing values. "
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(150, 6)\n",
"[ 26 105 16 68 42 76 79 89 14 89] [3]\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"# lets first introduce some missing values into our dataframe\n",
"df_copy = df.copy()\n",
"print(df_copy.shape)\n",
"# lets randomly pick some rows and column positions and assign them to NaN\n",
"np.random.seed(200)\n",
"row=np.random.randint(0,df_copy.shape[0]-1,10)\n",
"col=np.random.randint(0,df_copy.shape[1]-1,1)\n",
"print(row,col)\n",
"df_copy.iloc[row,col]=np.nan\n"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"No. of missing values : 9\n",
"\n",
"\n",
"Missing values summary :\n",
"sepal length (cm) 0\n",
"sepal width (cm) 0\n",
"petal length (cm) 0\n",
"petal width (cm) 9\n",
"target 0\n",
"Species 0\n",
"dtype: int64\n"
]
}
],
"source": [
"# Check how many missing values in total\n",
"print(\"No. of missing values :\",df_copy.isnull().sum().sum())\n",
"print(\"\\n\")\n",
"# Check how many missing values by columns\n",
"print(\"Missing values summary :\")\n",
"print(df_copy.isnull().sum())"
]
},
{
"cell_type": "code",
"execution_count": 140,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"sepal length (cm) 0\n",
"sepal width (cm) 0\n",
"petal length (cm) 0\n",
"petal width (cm) 0\n",
"target 0\n",
"Species 0\n",
"dtype: int64\n"
]
}
],
"source": [
"# Drop rows with any missing values \n",
"df_nonull=df_copy.dropna(how='any')\n",
"print(df_nonull.isnull().sum())"
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"sepal length (cm) 0\n",
"sepal width (cm) 0\n",
"petal length (cm) 0\n",
"petal width (cm) 9\n",
"target 0\n",
"Species 0\n",
"dtype: int64\n"
]
}
],
"source": [
"# Drop rows with missing values in sepal length column\n",
"df_nonull=df_copy.dropna(subset=['sepal length (cm)'],how='any')\n",
"print(df_nonull.isnull().sum()) "
]
},
{
"cell_type": "code",
"execution_count": 142,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"sepal length (cm) 0\n",
"sepal width (cm) 0\n",
"petal length (cm) 0\n",
"target 0\n",
"Species 0\n",
"dtype: int64\n",
"Columns with all null values: ['sepal length (cm)' 'sepal width (cm)' 'petal length (cm)' 'target'\n",
" 'Species']\n"
]
}
],
"source": [
"# get all columns with no null values\n",
"df_nonull=df_copy.loc[:,df_copy.notnull().all()]\n",
"print(df_nonull.isnull().sum()) \n",
"print(\"Columns with all null values:\", df_nonull.columns.values)"
]
},
{
"cell_type": "code",
"execution_count": 143,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"sepal length (cm) 0\n",
"sepal width (cm) 0\n",
"petal length (cm) 0\n",
"petal width (cm) 9\n",
"target 0\n",
"Species 0\n",
"dtype: int64\n",
"Columns with any non-null values: ['sepal length (cm)' 'sepal width (cm)' 'petal length (cm)'\n",
" 'petal width (cm)' 'target' 'Species']\n"
]
}
],
"source": [
"# get all columns with any non-null values\n",
"df_nonull=df_copy.loc[:,df_copy.notnull().any()]\n",
"print(df_nonull.isnull().sum()) \n",
"print(\"Columns with any non-null values:\", df_nonull.columns.values)"
]
},
{
"cell_type": "code",
"execution_count": 144,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Series([], dtype: float64)\n",
"Columns with all null values: []\n"
]
}
],
"source": [
"# get all columns with all null values\n",
"df_nonull=df_copy.loc[:,df_copy.isnull().all()]\n",
"print(df_nonull.isnull().sum()) \n",
"print(\"Columns with all null values:\", df_nonull.columns.values)"
]
},
{
"cell_type": "code",
"execution_count": 145,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"petal width (cm) 9\n",
"dtype: int64\n",
"Columns with any null values: ['petal width (cm)']\n"
]
}
],
"source": [
"# get all columns with all null values\n",
"df_nonull=df_copy.loc[:,df_copy.isnull().any()]\n",
"print(df_nonull.isnull().sum()) \n",
"print(\"Columns with any null values:\", df_nonull.columns.values)\n"
]
},
{
"cell_type": "code",
"execution_count": 146,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"sepal length (cm) 0\n",
"sepal width (cm) 0\n",
"petal length (cm) 0\n",
"petal width (cm) 9\n",
"Species 0\n",
"dtype: int64\n",
"Columns with all non-zero values: ['sepal length (cm)' 'sepal width (cm)' 'petal length (cm)'\n",
" 'petal width (cm)' 'Species']\n",
"sepal length (cm) 0\n",
"sepal width (cm) 0\n",
"petal length (cm) 0\n",
"petal width (cm) 9\n",
"target 0\n",
"Species 0\n",
"dtype: int64\n",
"Columns with any non-zero values: ['sepal length (cm)' 'sepal width (cm)' 'petal length (cm)'\n",
" 'petal width (cm)' 'target' 'Species']\n"
]
}
],
"source": [
"# Get all columns with all non-zero values\n",
"df_nonull=df_copy.loc[:,df_copy.all()]\n",
"print(df_nonull.isnull().sum()) \n",
"print(\"Columns with all non-zero values:\", df_nonull.columns.values)\n",
"\n",
"# get all columns with any non-zero values\n",
"df_nonull=df_copy.loc[:,df_copy.any()]\n",
"print(df_nonull.isnull().sum()) \n",
"print(\"Columns with any non-zero values:\", df_nonull.columns.values)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Changing datatypes\n",
"In pandas the most commonly used datatypes are integer, float, object, category and datetime. We can get the data types of all columns in any dataframe by using the ```.dtypes``` fucntion. It's quite common for data to get read in format not appropriate for analysis (especially when reading from sql database) and needs to be changed to appropriate data type before analysis.For converting to numeric we can use ```astype()``` function or ```to_numeric()``` function which comes in handy especially when you have missing values. There are situations where we need to just select columns of a certain datatype to perform analysis, pandas has a very useful function ``` select_dtypes()``` to do so as well. Let's learn with examples. \n"
]
},
{
"cell_type": "code",
"execution_count": 147,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"sepal length (cm) float64\n",
"sepal width (cm) float64\n",
"petal length (cm) float64\n",
"petal width (cm) float64\n",
"target int32\n",
"Species object\n",
"dtype: object\n"
]
}
],
"source": [
"# check datatypes in our dataframe\n",
"print(df.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": 148,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"sepal length (cm) float64\n",
"sepal width (cm) float64\n",
"petal length (cm) float64\n",
"petal width (cm) float64\n",
"dtype: object\n"
]
}
],
"source": [
"# lets select all float columns\n",
"df_float = df.select_dtypes(include='float64')\n",
"print(df_float.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": 149,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"sepal length (cm) float64\n",
"sepal width (cm) float64\n",
"petal length (cm) float64\n",
"petal width (cm) float64\n",
"target float64\n",
"Species object\n",
"dtype: object\n"
]
}
],
"source": [
"# lets convert target column to float type\n",
"df['target']=df['target'].astype('float64')\n",
"print(df.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": 150,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"sepal length (cm) float64\n",
"sepal width (cm) float64\n",
"petal length (cm) float64\n",
"petal width (cm) float64\n",
"target int64\n",
"Species object\n",
"dtype: object\n"
]
}
],
"source": [
"# lets convert target column back to int type\n",
"df['target']=df['target'].astype('int64')\n",
"print(df.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": 151,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Missing values : 5\n"
]
}
],
"source": [
"# Lets introduce some missing values in target column\n",
"row=np.random.randint(0,df.shape[0],5)\n",
"df_copy=df.copy()\n",
"df_copy.loc[row,'target']=np.nan\n",
"\n",
"#check for missing values\n",
"print(\"Missing values :\",df_copy['target'].isnull().sum())\n"
]
},
{
"cell_type": "code",
"execution_count": 152,
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "Cannot convert non-finite values (NA or inf) to integer",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-152-b3254af3ffa4>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;31m# lets convert target column to int type using astype()\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2\u001b[0m \u001b[1;31m# You get an Error !!\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 3\u001b[1;33m \u001b[0mdf_copy\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'target'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mdf_copy\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'target'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mint\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;32m~\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\pandas\\core\\generic.py\u001b[0m in \u001b[0;36mastype\u001b[1;34m(self, dtype, copy, errors, **kwargs)\u001b[0m\n\u001b[0;32m 5880\u001b[0m \u001b[1;31m# else, only a single dtype is given\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5881\u001b[0m new_data = self._data.astype(\n\u001b[1;32m-> 5882\u001b[1;33m \u001b[0mdtype\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mcopy\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0merrors\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0merrors\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 5883\u001b[0m )\n\u001b[0;32m 5884\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_constructor\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mnew_data\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m__finalize__\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m~\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\pandas\\core\\internals\\managers.py\u001b[0m in \u001b[0;36mastype\u001b[1;34m(self, dtype, **kwargs)\u001b[0m\n\u001b[0;32m 579\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 580\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mastype\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 581\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mapply\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"astype\"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 582\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 583\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mconvert\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m~\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\pandas\\core\\internals\\managers.py\u001b[0m in \u001b[0;36mapply\u001b[1;34m(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)\u001b[0m\n\u001b[0;32m 436\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mk\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mobj\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mreindex\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mb_items\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0maxis\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0malign_copy\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 437\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 438\u001b[1;33m \u001b[0mapplied\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mgetattr\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mb\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mf\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 439\u001b[0m \u001b[0mresult_blocks\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0m_extend_blocks\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mapplied\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mresult_blocks\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 440\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m~\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\pandas\\core\\internals\\blocks.py\u001b[0m in \u001b[0;36mastype\u001b[1;34m(self, dtype, copy, errors, values, **kwargs)\u001b[0m\n\u001b[0;32m 557\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 558\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mastype\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mFalse\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0merrors\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m\"raise\"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mvalues\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mNone\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 559\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_astype\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdtype\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mcopy\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0merrors\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0merrors\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mvalues\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mvalues\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 560\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 561\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0m_astype\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mFalse\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0merrors\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m\"raise\"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mvalues\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mNone\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m~\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\pandas\\core\\internals\\blocks.py\u001b[0m in \u001b[0;36m_astype\u001b[1;34m(self, dtype, copy, errors, values, **kwargs)\u001b[0m\n\u001b[0;32m 641\u001b[0m \u001b[1;31m# _astype_nansafe works fine with 1-d only\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 642\u001b[0m \u001b[0mvals1d\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mvalues\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mravel\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 643\u001b[1;33m \u001b[0mvalues\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mastype_nansafe\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mvals1d\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mTrue\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 644\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 645\u001b[0m \u001b[1;31m# TODO(extension)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m~\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\pandas\\core\\dtypes\\cast.py\u001b[0m in \u001b[0;36mastype_nansafe\u001b[1;34m(arr, dtype, copy, skipna)\u001b[0m\n\u001b[0;32m 698\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0misfinite\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0marr\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mall\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 699\u001b[0m raise ValueError(\n\u001b[1;32m--> 700\u001b[1;33m \u001b[1;34m\"Cannot convert non-finite values (NA or inf) to \"\u001b[0m \u001b[1;34m\"integer\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 701\u001b[0m )\n\u001b[0;32m 702\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;31mValueError\u001b[0m: Cannot convert non-finite values (NA or inf) to integer"
]
}
],
"source": [
"# lets convert target column to int type using astype()\n",
"# You get an Error !!\n",
"df_copy['target']=df_copy['target'].astype(int)"
]
},
{
"cell_type": "code",
"execution_count": 153,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"sepal length (cm) float64\n",
"sepal width (cm) float64\n",
"petal length (cm) float64\n",
"petal width (cm) float64\n",
"target float64\n",
"Species object\n",
"dtype: object\n"
]
}
],
"source": [
"# Now try this\n",
"df_copy['target']=pd.to_numeric(df_copy['target'],errors='coerce')\n",
"print(df_copy.dtypes)\n",
"\n",
"# or this\n",
"df_copy['target']=df_copy['target'].astype(float)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's all for this tutorial. I hope you find it useful and thanks for reading !"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment