@BrendaHali
Last active October 26, 2023
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pandas Cheat Sheet\n",
"Collected by Brenda Hali"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Installation\n",
"Pandas is built on top of NumPy, so NumPy needs to be installed as well."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install numpy\n",
"!pip install pandas #Installs the library"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Import"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Data Structure\n",
"Pandas supports one- and two-dimensional data structures:\n",
"\n",
"- 1D objects are called Series. \n",
"- 2D objects are called DataFrames. \n",
"- The structure is rows and columns. \n",
"\n",
"#### The basics\n",
"Pandas documentation https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html\n",
"\n",
"#### Reading data from a file\n",
"Pandas can read csv, sql, json, html, excel and more. \n",
"- Full list of `read_csv` options: \n",
"https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html"
]
},
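{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Series and DataFrame structures described above can also be built directly; the column names and values below are illustrative:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s = pd.Series([10, 20, 30]) #1D labeled array\n",
"df_example = pd.DataFrame({'col_1': [1, 2], 'col_2': [3, 4]}) #2D table of rows and columns"
]
},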
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('file.csv') #Regular import\n",
"df = pd.read_csv('file.csv', index_col=0) #Takes column 0 as the index\n",
"dfs = pd.read_html(url) #Returns a list of DataFrames; needs an HTML parser such as lxml or BeautifulSoup\n",
"#Other options, not an exhaustive list\n",
"df = pd.read_json()\n",
"df = pd.read_sql()\n",
"df = pd.read_excel()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Saving a DataFrame to a file\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.to_csv('file_name.csv', index=False) #Saves the DataFrame to CSV without writing the index\n",
"#Other options\n",
"df.to_json()\n",
"df.to_sql()\n",
"df.to_excel()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Data type change"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.to_datetime(df['column']) #Converts a column to datetime\n",
"pd.to_timedelta(df['column']) #Converts a column to timedelta"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Attributes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.columns #Returns the columns\n",
"df.dtypes #Returns the datatypes\n",
"df.index #Returns the index\n",
"df.shape #Returns the shape\n",
"df.T #Returns the transposed dataframe\n",
"df.values #Returns the values as a NumPy array"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Join, Merge and Concat\n",
"Link to documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html\n",
"\n",
"Common adjustments: \n",
"\n",
"- `ignore_index=True` - When the original index is not relevant for the result\n",
"- `axis=0` - Concatenates rows | `axis=1` - Concatenates columns\n",
"- `keys=['a', 'b', 'c']` - Labels each concatenated frame with the given key\n",
"- `left` join - Use keys from the left frame only\n",
"- `right` join - Use keys from the right frame only\n",
"- `outer` - Use the union of keys from both frames\n",
"- `inner` - Use the intersection of keys from both frames\n",
"\n",
"These operations return a new DataFrame. For methods that support it, pass `inplace=True` to modify the dataframe in place."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_1.join(df_2) #Joins the dataframes on their index - they must share the same index\n",
"\n",
"pd.concat([df_1, df_2], axis=1, join='outer', ignore_index=False, keys=None,\n",
"          levels=None, names=None, verify_integrity=False, copy=True) #Values can be changed as needed\n",
"\n",
"df_1 = df_1.append(df_2, sort=True) #Stacks dataframes with the same column structure; append returns a new DataFrame\n",
"\n",
"dataframes = {'a': df_1, 'b': df_2, 'c': df_3}\n",
"new_df = pd.concat(dataframes) #Concatenates the dataframes in the dictionary and adds an outer index level indicating the origin of each row\n",
"\n",
"pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,\n",
"         left_index=False, right_index=False, sort=True,\n",
"         suffixes=('_x', '_y'), copy=True, indicator=False,\n",
"         validate=None)\n"
]
},
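{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal worked example of `merge` (the frame and column names here are illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"left = pd.DataFrame({'key': ['a', 'b'], 'val_l': [1, 2]})\n",
"right = pd.DataFrame({'key': ['b', 'c'], 'val_r': [3, 4]})\n",
"pd.merge(left, right, on='key', how='inner') #Keeps only key 'b'\n",
"pd.merge(left, right, on='key', how='outer') #Keeps keys 'a', 'b' and 'c', filling missing values with NaN"
]
},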
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Columns addition and creation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['sum_col1_col2'] = df['col_1'] + df['col_2']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Inspecting and visualizing data without changing the dataframe\n",
"Use these if you want to see the values of certain columns or rows. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.head() #Returns the first 5 rows\n",
"\n",
"df.tail() #Returns the last 5 rows\n",
"\n",
"df['column_name'] #Returns all the data in the column\n",
"df.column_name #Same, as long as the name is a valid identifier\n",
"df[['column_name', 'second_column']] #Returns the columns as a new DataFrame (note the double brackets)\n",
"df[7:9] #Returns rows 7 and 8 (the end of the slice is exclusive)\n",
"df['column_name'].value_counts(dropna=False) #Returns unique values and their counts\n",
"df.sort_index(axis=0, ascending=False) #Returns the dataframe sorted by index\n",
"df.apply(pd.Series.value_counts) #Returns values and counts for all columns\n",
"df.sort_values(by='column_name') #Returns the dataframe sorted by the selected column\n",
"df.groupby('column_name').mean() #Returns the dataframe grouped by column_name with the mean of each group\n",
"df.pivot_table() #Creates a spreadsheet-style pivot table\n",
"df.iloc[0] #Selection by position\n",
"df.loc['index_one'] #Selection by index label\n",
"df.iloc[0,:] #Returns the first row\n",
"df.iloc[0,0] #Returns the first element (row 0, column 0)\n",
"\n",
"#Simple examples that can be adapted as needed\n",
"\n",
"df[df['is_my_value'] == 1][['what_im_looking_for']]\n",
"\n",
"df[df['column_1'] < 10].groupby('column_2').mean()[['what_im_looking_for']]\n",
"df[df['column_1'] == 0].sort_values(by='column_2', ascending=False).head()\n",
"\n"
]
},
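{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of `pivot_table`, with illustrative column names:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.pivot_table(values='column_1', index='column_2', columns='column_3', aggfunc='mean') #Mean of column_1 for each combination of column_2 (rows) and column_3 (columns)"
]
},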
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Replacing and renaming"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.columns = ['column_1', 'column_2'] #Renames all columns\n",
"df.rename(columns={'old_name': 'new_name'}) #Selective renaming\n",
"\n",
"df.replace(1, 'one') #Replaces all values equal to 1 with 'one'\n",
"df.replace([1, 3], ['one', 'three']) #Replaces all 1 with 'one' and 3 with 'three'\n",
"\n",
"df.set_index('column_1') #Sets column_1 as the index (returns a new DataFrame)\n",
"df.astype(int) #Converts all columns to integer - any datatype can be used\n",
"\n",
"df['column_1'].astype(int) #Converts the datatype of a single column"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Null Values\n",
"- `isna` = `isnull`\n",
"- `notna` = `notnull`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.fillna(value='my_selected_value', inplace=True) #Fills all NaNs with the selected value and makes the change permanent\n",
"df.fillna(x) #Replaces all null values with x\n",
"df.notna().sum() #Counts non-null values per column\n",
"df.interpolate() #Fills NaNs by interpolating between surrounding values\n",
"df.isna() #Returns True/False per cell for NaNs\n",
"df.isna().sum() #Counts null values per column\n",
"df.dropna(inplace=True) #Drops rows with null values permanently\n",
"df.isna().sum()[df.isna().sum() != 0].sort_values().plot(kind='barh'); #Plots the null counts per column\n",
"\n",
"#Advanced replacing:\n",
"df.fillna(df.mean()) #Replaces null values with the mean of each column\n",
"\n",
"#Another way to overwrite the dataframe without the NaNs in a specific column\n",
"df['column'].fillna(value='no_info', inplace=True)\n",
"df = df.loc[df['column'] != 'no_info']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Dropping"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.drop(['column_1', 'column_2'], axis=1, inplace=True) #Drops specific columns permanently\n",
"df.drop_duplicates(inplace=True) #Drops duplicate rows permanently\n",
"df.drop('row_1', axis=0, inplace=True) #Drops the row with index label 'row_1' permanently"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Aggregation methods, statistical methods and summaries\n",
"Call these as methods, with parentheses: `df.sum()`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.count() # Returns the number of non-null values in each DataFrame column\n",
"df.describe() # Summary statistics for numerical columns\n",
"df.max() #Returns the highest value in each column\n",
"df.mean() #Returns the mean of all columns\n",
"df.median() # Returns the median of each column\n",
"df.min() # Returns the lowest value in each column\n",
"df.mode() #Returns the mode of each column\n",
"df.std() # Returns the standard deviation of each column\n",
"df.var() #Returns the variance of each column\n",
"df.abs() #Returns absolute values\n",
"df.corr() # Returns the correlation between columns in a DataFrame\n",
"df.round() #Rounds the values"
]
},
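{
"cell_type": "markdown",
"metadata": {},
"source": [
"Several of these statistics can be computed in one call with `agg` (on numeric columns):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.agg(['mean', 'std', 'min', 'max']) #One row per statistic, one column per numeric column"
]
},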
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Other to explore"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.clip() #Trims values at given thresholds\n",
"df.nunique() #Number of distinct values per column\n",
"df.idxmax() #Index of the first occurrence of the maximum\n",
"df.idxmin() #Index of the first occurrence of the minimum\n",
"df.cov() #Covariance between columns\n",
"df.cummax() #Cumulative maximum\n",
"df.cummin() #Cumulative minimum\n",
"df.cumprod() #Cumulative product\n",
"df.cumsum() #Cumulative sum\n",
"df.diff() #Difference with the previous row\n",
"df.nlargest(5, 'column_1') #5 largest rows by column_1\n",
"df.nsmallest(5, 'column_1') #5 smallest rows by column_1\n",
"df.pct_change() #Percent change from the previous row\n",
"df.prod() #Product of the values\n",
"df.quantile() #Quantiles (median by default)\n",
"df.rank() #Rank of the values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Simple Functions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Dummy Variables\n",
"https://socialresearchmethods.net/kb/dummyvar.php"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.get_dummies(df, columns=['my_column'], drop_first=True) #Drop the first dummy column so the encoding is not redundant"
]
},
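{
"cell_type": "markdown",
"metadata": {},
"source": [
"What `get_dummies` produces on a tiny illustrative frame:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tiny = pd.DataFrame({'color': ['red', 'blue', 'red']})\n",
"pd.get_dummies(tiny, columns=['color'], drop_first=True) #One column color_red with values 1, 0, 1 ('blue' is dropped as the first category)"
]
},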
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Does this column have the value I'm looking for?\n",
"def is_the_value_im_looking(i):\n",
"    if 'value' in str(i).split():\n",
"        return 1\n",
"    else:\n",
"        return 0\n",
"\n",
"#Creates a column named after the value I'm looking for, filled with 0 or 1\n",
"df['value'] = df['col_1'].apply(is_the_value_im_looking)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Extracts the title from each name into a new column, using a list comprehension\n",
"#This can be adapted as needed\n",
"\n",
"df['Title'] = [each.split(',')[1].split('.')[0].strip() for each in df['Name']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Mean of the target value for each indicator column\n",
"#This can be adapted as needed\n",
"\n",
"for i in ['column_1', 'column_2', 'column_3']:\n",
"    print(i, ':')\n",
"    print(df[df[i] == 1][['the_value_im_lookingfor']].mean())\n",
"    print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plotting\n",
"Simple plotting examples"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"df.groupby('column_1').mean()[['value']].plot(kind='barh')\n",
"plt.title(\"plot title\");"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.groupby(['column_1', 'column_2']).mean()[['value']].plot(kind='barh');"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}