Last active
October 26, 2023 06:20
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pandas Cheat Sheet\n",
    "Collected by Brenda Hali"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Installation\n",
    "Pandas is built on top of NumPy, so NumPy should be installed first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install numpy\n",
    "!pip install pandas #Installs the library"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Import"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Data Structure\n",
    "- Pandas supports up to two-dimensional data structures.\n",
    "- 1D objects are called Series.\n",
    "- 2D objects are called DataFrames.\n",
    "- The structure is rows and columns.\n",
    "\n",
    "#### The basics\n",
    "Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html\n",
    "\n",
    "#### Reading data from a file\n",
    "It supports csv, sql, json, html, etc.\n",
    "- Link to the full list of import options:\n",
    "https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html"
   ]
  },
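  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of both structures, using made-up data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s = pd.Series([10, 20, 30]) #1D object: a Series\n",
    "df_example = pd.DataFrame({'col_1': [1, 2], 'col_2': [3, 4]}) #2D object: a DataFrame with rows and columns"
   ]
  },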
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('file.csv') #Regular import\n",
    "df = pd.read_csv('file.csv', index_col=0) #Takes column 0 as the index\n",
    "df = pd.read_html(url) #Returns a list of DataFrames; requires lxml or BeautifulSoup as a parser\n",
    "#Other options, not an exhaustive list\n",
    "df = pd.read_json()\n",
    "df = pd.read_sql()\n",
    "df = pd.read_excel()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Saving a dataframe as a file\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.to_csv('file_name.csv', index=False) #Saves the dataframe to a file without the index\n",
    "#Other options\n",
    "df.to_json()\n",
    "df.to_sql()\n",
    "df.to_excel()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Data type change"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.to_datetime(df['column_name']) #Converts the column to datetime\n",
    "pd.to_timedelta(df['column_name']) #Converts the column to timedelta"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Attributes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.columns #Returns the columns\n",
    "df.dtypes #Returns the datatypes\n",
    "df.index #Returns the index\n",
    "df.shape #Returns the shape\n",
    "df.T #Returns the dataframe transposed\n",
    "df.values #Returns the values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Join, Merge and Concat\n",
    "Link to documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html\n",
    "Common adjustments:\n",
    "\n",
    "- `ignore_index=True` - When the index is not relevant for the join\n",
    "- `axis=0` stacks rows on top of each other | `axis=1` places columns side by side\n",
    "- `keys=['a', 'b', 'c']` - Labels each DataFrame with the given keys in a hierarchical index\n",
    "- `left` join - Use keys from the left frame only\n",
    "- `right` join - Use keys from the right frame only\n",
    "- `outer` - Use the union of keys from both frames\n",
    "- `inner` - Use the intersection of keys from both frames\n",
    "\n",
    "These return the index and the column/row content. For some methods, `inplace=True` needs to be specified if we want to modify the dataframe in place."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_1.join(df_2) #Joins the dataframes on their index - they must share the same index\n",
    "\n",
    "pd.concat([df_1, df_2], axis=1, join='outer', ignore_index=False, keys=None,\n",
    "          levels=None, names=None, verify_integrity=False, copy=True) #Values can be changed as needed\n",
    "\n",
    "df_1 = df_1.append(df_2, sort=True) #Appends dataframes with the same column structure and names (deprecated in newer pandas; prefer pd.concat)\n",
    "\n",
    "dataframes = {'a': df_1, 'b': df_2, 'c': df_3}\n",
    "new_df = pd.concat(dataframes) #Concats the dataframes in the dictionary and adds a hierarchical index level indicating the origin of the data\n",
    "\n",
    "pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,\n",
    "         left_index=False, right_index=False, sort=True,\n",
    "         suffixes=('_x', '_y'), copy=True, indicator=False,\n",
    "         validate=None)\n"
   ]
  },
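  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small illustrative example, on two made-up frames, of how the join types differ:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "left = pd.DataFrame({'key': ['a', 'b'], 'val_l': [1, 2]})\n",
    "right = pd.DataFrame({'key': ['b', 'c'], 'val_r': [3, 4]})\n",
    "pd.merge(left, right, how='inner', on='key') #Keeps only key 'b' (intersection)\n",
    "pd.merge(left, right, how='outer', on='key') #Keeps keys 'a', 'b', 'c' (union), filling the gaps with NaN"
   ]
  },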
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Column addition and creation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['sum_col1_col2'] = df['col_1'] + df['col_2']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Inspecting and visualizing data without changing the dataframe\n",
    "Use these if you want to see the values of certain columns or rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head() #Returns the first 5 rows\n",
    "\n",
    "df.tail() #Returns the last 5 rows\n",
    "\n",
    "df['column_name'] #Returns all the data in the column\n",
    "df.column_name #Returns all the data in the column\n",
    "df[['column_name', 'second_column']] #Returns the columns as a new DataFrame\n",
    "df[7:9] #Displays rows 7 and 8 (the end of the slice is excluded)\n",
    "df['column_name'].value_counts(dropna=False) #Returns unique values and their counts\n",
    "df.sort_index(axis=0, ascending=False) #Returns the dataframe sorted by index\n",
    "df.apply(pd.Series.value_counts) #Returns values and counts for all columns\n",
    "df.sort_values(by='column_name') #Returns the dataframe sorted by the selected column\n",
    "df.groupby('column_name').mean() #Returns the dataframe grouped by the column, with the mean of each group\n",
    "df.pivot_table() #Creates a spreadsheet-style pivot table; pass values, index and columns as needed\n",
    "df.iloc[0] #Selection by position\n",
    "df.loc['index_one'] #Selection by index label\n",
    "df.iloc[0,:] #Returns the first row\n",
    "df.iloc[0,0] #Returns the element in the first row and first column\n",
    "\n",
    "#Simple examples that can be adapted as needed\n",
    "\n",
    "df[df['is_my_value'] == 1][['what_im_looking_for']]\n",
    "\n",
    "df[df['column_1'] < 10].groupby('column_2').mean()[['what_im_looking_for']]\n",
    "df[df['column_1'] == 0].sort_values(by='column_2', ascending=False).head()\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Replacing and renaming"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.columns = ['column_1', 'column_2'] #Renames all columns\n",
    "df.rename(columns={'old_name': 'new_name'}) #Selective renaming\n",
    "\n",
    "df.replace(1, 'one') #Replaces all values equal to 1 with 'one'\n",
    "df.replace([1, 3], ['one', 'three']) #Replaces all 1 with 'one' and 3 with 'three'\n",
    "\n",
    "\n",
    "df.set_index('column_1') #Changes the index\n",
    "df.astype(int) #Converts the datatype of the dataframe to integer - any datatype can be used\n",
    "\n",
    "df['column_1'].astype(int) #Changes the datatype of a single column"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Null Values\n",
    "- `isna` = `isnull`\n",
    "- `notna` = `notnull`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.fillna(value='my_selected_value', inplace=True) #Fills all NAs with the selected value and makes the change permanent\n",
    "df.fillna(x) #Replaces all null values with x\n",
    "df.notna().sum() #Counts the non-null values per column\n",
    "df.interpolate() #Fills NAs by interpolating between the surrounding values\n",
    "df.isna() #Returns True/False for each value, True where it is NA\n",
    "df.isnull() #Same as isna\n",
    "df.dropna(inplace=True) #Drops rows with null values permanently\n",
    "df.isnull().sum() #Counts the null values, aggregated by column\n",
    "df.isnull().sum()[df.isnull().sum() != 0].sort_values().plot(kind='barh'); #Plots the null counts per column\n",
    "\n",
    "#Advanced replacing:\n",
    "df.fillna(df.mean()) #Replaces all null values with the column mean\n",
    "\n",
    "#Another way to overwrite the dataframe without the NAs in a specific column\n",
    "df.column.fillna(value='no_info', inplace=True)\n",
    "df = df.loc[df['column'] != 'no_info']"
   ]
  },
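  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A runnable sketch of the null-value methods on a small made-up frame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_na = pd.DataFrame({'col_1': [1.0, None, 3.0], 'col_2': [4.0, 5.0, None]})\n",
    "df_na.isnull().sum() #One null per column\n",
    "df_na.fillna(df_na.mean()) #Each NaN is replaced with its column mean\n",
    "df_na.dropna() #Keeps only the first row, the one without nulls"
   ]
  },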
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Dropping"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.drop(['column_1', 'column_2'], axis=1, inplace=True) #Drops specific columns permanently\n",
    "df.drop_duplicates(inplace=True) #Drops duplicate rows permanently\n",
    "df.drop('row_1', axis=0, inplace=True) #Drops the row with index label 'row_1' permanently"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Aggregation methods, statistical methods and summaries\n",
    "These can be applied to the whole dataframe, e.g. `df.sum()`, or to a single column, e.g. `df['column_name'].sum()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.count() #Returns the number of non-null values in each column\n",
    "df.describe() #Summary statistics for the numerical columns\n",
    "df.max() #Returns the highest value in each column\n",
    "df.mean() #Returns the mean of each column\n",
    "df.median() #Returns the median of each column\n",
    "df.min() #Returns the lowest value in each column\n",
    "df.mode() #Returns the mode of each column\n",
    "df.std() #Returns the standard deviation of each column\n",
    "df.var() #Returns the variance of each column\n",
    "df.abs() #Returns the absolute values\n",
    "df.corr() #Returns the correlation between the columns\n",
    "df.round() #Rounds the numbers"
   ]
  },
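  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small made-up example combining `groupby` with an aggregation method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_agg = pd.DataFrame({'group': ['a', 'a', 'b'], 'value': [1, 3, 10]})\n",
    "df_agg.groupby('group').mean() #Mean of 'value' per group: a -> 2, b -> 10\n",
    "df_agg['value'].describe() #Count, mean, std, min, quartiles and max of the column"
   ]
  },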
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Other methods to explore"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.clip(lower, upper) #Caps the values at the given bounds\n",
    "df.nunique() #Number of unique values per column\n",
    "df.idxmax() #Index of the highest value per column\n",
    "df.idxmin() #Index of the lowest value per column\n",
    "df.cov() #Covariance between columns\n",
    "df.cummax() #Cumulative maximum\n",
    "df.cummin() #Cumulative minimum\n",
    "df.cumprod() #Cumulative product\n",
    "df.cumsum() #Cumulative sum\n",
    "df.diff() #Difference with the previous row\n",
    "df.nlargest(n, 'column_name') #The n rows with the largest values in the column\n",
    "df.nsmallest(n, 'column_name') #The n rows with the smallest values in the column\n",
    "df.pct_change() #Percentage change from the previous row\n",
    "df.prod() #Product of the values per column\n",
    "df.quantile() #Quantiles (the median by default)\n",
    "df.rank() #Ranks of the values per column"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Simple Functions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Dummy Variables\n",
    "https://socialresearchmethods.net/kb/dummyvar.php"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.get_dummies(df, columns=['my_column'], drop_first=True) #With dummy variables we drop the first column so the encoding is not redundant"
   ]
  },
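  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For example, on a made-up `color` column that call produces `color_green` and `color_red`, with the first category (`blue`) dropped as the baseline:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_dum = pd.DataFrame({'color': ['blue', 'green', 'red']})\n",
    "pd.get_dummies(df_dum, columns=['color'], drop_first=True) #Columns: color_green, color_red"
   ]
  },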
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Does this column contain the value I'm looking for?\n",
    "def is_the_value_im_looking(i):\n",
    "    if 'value' in str(i):\n",
    "        return 1\n",
    "    else:\n",
    "        return 0\n",
    "\n",
    "#Creates a column named after the value I'm looking for and fills it with 0 or 1\n",
    "df['value'] = df['col_1'].apply(is_the_value_im_looking)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Extracts the title from everyone's name into a new column, using a list comprehension.\n",
    "#This can be adapted as needed\n",
    "\n",
    "df['Title'] = [each.split(',')[1].split('.')[0].strip() for each in df['Name']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Rate of the target value per column.\n",
    "#This can be adapted as needed.\n",
    "\n",
    "for i in ['column_1', 'column_2', 'column_3']:\n",
    "    print(i, ':')\n",
    "    print(df[df[i] == 1][['the_value_im_lookingfor']].mean())\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plotting\n",
    "Simple plotting examples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt #Needed for the title\n",
    "\n",
    "df.groupby('column_1').mean()[['value']].plot(kind='barh')\n",
    "plt.title(\"plot title\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(['column_1', 'column_2']).mean()[['value']].plot(kind='barh');"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
Good reference for Pandas 👍