Useful Pandas Snippets
# List unique values in a DataFrame column
# h/t @makmanalp for the updated syntax!
df['Column Name'].unique()
# Convert Series datatype to numeric (will error if column has non-numeric values)
# h/t @makmanalp
pd.to_numeric(df['Column Name'])
# Convert Series datatype to numeric, changing non-numeric values to NaN
# h/t @makmanalp for the updated syntax!
pd.to_numeric(df['Column Name'], errors='coerce')
# Grab DataFrame rows where column has certain values
valuelist = ['value1', 'value2', 'value3']
df = df[df.column.isin(valuelist)]
# Grab DataFrame rows where column doesn't have certain values
valuelist = ['value1', 'value2', 'value3']
df = df[~df.column.isin(valuelist)]
# Delete column from DataFrame
del df['column']
# Select from DataFrame using criteria from multiple columns
# (use `|` instead of `&` to do an OR)
newdf = df[(df['column_one']>2004) & (df['column_two']==9)]
# Rename several DataFrame columns
df = df.rename(columns = {
    'col1 old name':'col1 new name',
    'col2 old name':'col2 new name',
    'col3 old name':'col3 new name',
})
# Lower-case all DataFrame column names
df.columns = map(str.lower, df.columns)
# Even more fancy DataFrame column re-naming
# lower-case all DataFrame column names (for example)
df.rename(columns=lambda x: x.split('.')[-1], inplace=True)
# Loop through rows in a DataFrame
# (if you must)
for index, row in df.iterrows():
    print(index, row['some column'])
# Much faster way to loop through DataFrame rows
# if you can work with tuples
# (h/t hughamacmullaniv)
for row in df.itertuples():
    print(row)
# Next few examples show how to work with text data in Pandas.
# Full list of .str functions:
# Slice values in a DataFrame column (aka Series)
df.column_name.str[0:3]
# Lower-case everything in a DataFrame column
df.column_name = df.column_name.str.lower()
# Get length of data in a DataFrame column
df.column_name.str.len()
# Sort dataframe by multiple columns
df = df.sort_values(['col1','col2','col3'], ascending=[True, True, False])
# Get top n for each group of columns in a sorted dataframe
# (make sure dataframe is sorted first)
top5 = df.groupby(['groupingcol1', 'groupingcol2']).head(5)
# Grab DataFrame rows where specific column is null/notnull
newdf = df[df['column'].isnull()]
# (use .notnull() for the inverse)
# Select from DataFrame using multiple keys of a hierarchical index
df.xs(('index level 1 value','index level 2 value'), level=('level 1','level 2'))
# Change all NaNs to None (useful before
# loading to a db)
df = df.where((pd.notnull(df)), None)
# More pre-db insert cleanup...make a pass through the dataframe, stripping whitespace
# from strings and changing any empty values to None
# (not especially recommended but including here b/c I had to do this in real life one time)
df = df.applymap(lambda x: str(x).strip() if len(str(x).strip()) else None)
# Get quick count of rows in a DataFrame
len(df.index)
# Pivot data (with flexibility about what
# becomes a column and what stays a row).
# Syntax works on Pandas >= .14
df.pivot_table(
    index=['col1', 'col2', 'col3'], # these stay as columns; will fail silently if any of these cols have null values
    columns=['col4']) # data values in this column become their own column
# Change data type of DataFrame column
df.column_name = df.column_name.astype(np.int64)
# Get rid of non-numeric values throughout a DataFrame:
for col in refunds.columns.values:
refunds[col] = refunds[col].replace(r'[^0-9.\-]+', '', regex=True)
# Set DataFrame column values based on other column values (h/t: @mlevkov)
df.loc[(df['column1'] == some_value) & (df['column2'] == some_other_value), ['column_to_change']] = new_value
# Clean up missing values in multiple DataFrame columns
df = df.fillna({
    'col1': 'missing',
    'col2': '99.999',
    'col3': '999',
    'col4': 'missing',
    'col5': 'missing',
    'col6': '99'
})
# Concatenate two DataFrame columns into a new, single column
# (useful when dealing with composite keys, for example)
# (h/t @makmanalp for improving this one!)
df['newcol'] = df['col1'].astype(str) + df['col2'].astype(str)
# Doing calculations with DataFrame columns that have missing values
# In example below, swap in 0 for df['col1'] cells that contain null
df['new_col'] = np.where(pd.isnull(df['col1']),0,df['col1']) + df['col2']
# Split delimited values in a DataFrame column into two new columns
df['new_col1'], df['new_col2'] = zip(*df['original_col'].apply(lambda x: x.split(': ', 1)))
# Collapse hierarchical column indexes
df.columns = df.columns.get_level_values(0)
# Convert Django queryset to DataFrame
qs = DjangoModelName.objects.all()
q = qs.values()
df = pd.DataFrame.from_records(q)
# Create a DataFrame from a Python dictionary
df = pd.DataFrame(list(a_dictionary.items()), columns = ['column1', 'column2'])
# Get a report of all duplicate records in a dataframe, based on specific columns
dupes = df[df.duplicated(['col1', 'col2', 'col3'], keep=False)]
# Set up formatting so larger numbers aren't displayed in scientific notation (h/t @thecapacity)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

I spent almost three hours trying to do the things you present. Excellent.

This is purely excellent! Thanks!

Awesome :-)

Excellent reference. I have come back to it many times. Thank you!

Awesome. you rock. Thanks for doing this.

Very nice reference -- thanks for sharing!


thanks a ton for sharing. this is awesome.


ghost commented Mar 7, 2016

Should valuelist (lines 8 and 12) and value_list (lines 9 and 13) be the same? Either valuelist or value_list?

awesome stuff!!, thank you so much

Most useful demo!!!!!!! Thank you.......

Very useful, keep them coming!

kisna72 commented May 20, 2016

This is very useful. Thanks.



mlevkov commented Jul 24, 2016

From line #87, above:

Set DataFrame column values based on other column values

df['column_to_change'][(df['column1'] == some_value) & (df['column2'] == some_other_value)] = new_value

is throwing a SettingWithCopyWarning error.
Alternatively, I recommend changing this to the following syntax, using .loc:
df.loc((df['column1'] == some_value) & (df['column2'] == some_other_value), ['column_to_change']) = new_value


bsweger commented Aug 13, 2016

@mlevkov Thank you, thank you! Have long been vexed by Pandas SettingWithCopyWarning and, truthfully, do not think the docs for .loc provide enough clear examples for those of us who want to re-write using that syntax.

Your re-write of the example in this gist worked great...just had to change the parens to brackets like so:
df.loc[(df['column1'] == some_value) & (df['column2'] == some_other_value), ['column_to_change']] = new_value

Really, really appreciate you taking the time to pass along this tip. Updated the gist accordingly--no doubt I'll refer back to this example many times!
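The corrected .loc pattern from this thread can be checked on a tiny frame; `column1`, `column2`, and `column_to_change` are the placeholder names from the discussion:

```python
import pandas as pd

# Toy frame using the placeholder column names from the discussion
df = pd.DataFrame({
    'column1': [1, 1, 2],
    'column2': [5, 6, 5],
    'column_to_change': ['a', 'b', 'c'],
})

# .loc with a boolean mask assigns in place, avoiding SettingWithCopyWarning
df.loc[(df['column1'] == 1) & (df['column2'] == 5), 'column_to_change'] = 'z'
print(df['column_to_change'].tolist())  # ['z', 'b', 'c']
```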

Thank you so much for this!


bsweger commented Aug 18, 2016

@ward916 Sorry for seeing your comment so late--yes, value_list was a typo (fixed). Thanks so much for letting me know!

rezastd commented Sep 30, 2016

I have a question: for example, I have a CSV file with columns A-Z. What if I want to select column D and column G through the rest of the columns? Thank you for sharing πŸ‘
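Not the gist author, but one way to answer the question above: label-based slicing with `df.loc[:, 'G':]` grabs column G through the last column, and `pd.concat` can glue column D onto it. A sketch, with made-up single-letter column names matching the question:

```python
import pandas as pd

# Toy frame with letter-labeled columns (a subset of A-Z for brevity)
df = pd.DataFrame([[0] * 8], columns=list('ABCDEFGH'))

# Column D plus every column from G to the end, via label slicing
subset = pd.concat([df[['D']], df.loc[:, 'G':]], axis=1)
print(list(subset.columns))  # ['D', 'G', 'H']
```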

ndanturt commented Jan 6, 2017

I signed-up just to thank you !!

This is pretty good. Gracias.

Amazing You saved me a lot of time. Thanks!

Thank you very much for sharing!

naripok commented Feb 25, 2017

And yet another thanks!!!
Thank you very much man!

hughamacmullaniv commented Mar 9, 2017

Good stuff! Thanks!

For looping through rows, if you can work with tuples, try df.itertuples(). Super fast!

This is really useful - thanks!

MarkFeder commented Mar 30, 2017

thanks! really useful πŸ‘

Thanks! πŸ‘

saminaji commented May 4, 2017

Thanks for sharing

Thanks for sharing!

Thank you! Great list, and helped me get through a few troubling issues, and did it with good performance.

Thank you for sharing this.

le-dawg commented Jun 25, 2017

I don't think any other gist for "pandas snippets" ranks better. I have one I would like to add, and since pull requests for gists don't canonically exist, I'd like to post it here. Keeping with your formatting:

#Check how many rows in DataFrame contain certain substring s in column col
#Get indices of rows that contain substring s in column col
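The code for the two snippet descriptions above didn't survive formatting here; a minimal sketch of what they likely look like, using `str.contains` (the column name and substring are placeholders):

```python
import pandas as pd

df = pd.DataFrame({'col': ['foo bar', 'baz', 'barista']})
s = 'bar'

# Check how many rows in DataFrame contain certain substring s in column col
count = df['col'].str.contains(s).sum()

# Get indices of rows that contain substring s in column col
idx = df[df['col'].str.contains(s)].index
print(count, list(idx))  # 2 [0, 2]
```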

The most helpful script ever! :) Thank you so much, helped me a lot.

cddesire commented Aug 4, 2017

good stuffs, help me a lot

This makes a nice cheatsheet. Ever think about using some markup and making a document out of it?


bsweger commented Sep 18, 2017

@japhigu Thanks for your contribution--will add these! Your notes and @evanleeturner's have made me realize that this info, though useful, would better serve people in a different format, where others can weigh in.

Loved it

dblinde commented Sep 20, 2017

really useful if you want to ETL your data!

Thanks, Daan Blinde.

pashute commented Nov 6, 2017

Could you add how to calculate column value from rows in other dataframes with multiple index
# df1 data:
# category date data1, data2
# cat, '2017-11-05 01:01:01', 11, 12
# cat, '2017-11-06 01:01:01', 21, 22
# dog, '2017-11-04 01:01:01', 11, 12
# dog, '2017-11-07 01:01:01', 21, 31

# df2 data:
# category     date               data1, data2
# cat, '2017-11-05 01:01:01', 11, 12
# cat, '2017-11-07 11:11:11', 21, 22
# dog, '2017-11-06 01:01:01', 11, 12
# dog, '2017-11-07 10:01:01', 21, 31

dfresult = pd.DataFrame(columns=['category', 'date', 'indication1', 'indication2', 'calculated']).set_index(['category', 'date'])
# dfresult data:
# category date indication1 indication2 calculated
# cat, '2017-11-05 01:01:01', na, na, na
# dog, '2017-11-07 10:01:01', na, na, na

# now set dfresult['data1'] to be a function:
#   - if both df1['data1'] and df2['data1'] of that index are na: "x"
#   - if only df1['data1'] of that index is na: "x1"
#   - if only df2['data1'] of that index is na: "x2"
#   - if df2['data1'] is zero then: "z"
#   - otherwise set to: df1['data1']/df2['data1'] - 1
# and set dfresult['calculated'] to a calculated function taking from the current row, i.e. = dfresult['data1'] < tolerance
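Not part of the original gist, but one hedged way to approach the question above: align both frames on a shared index and encode the conditional rules with `numpy.select`, which picks the first condition that matches. Names and rules are taken from the question; treat this as a sketch, not the definitive answer:

```python
import numpy as np
import pandas as pd

# Tiny stand-ins for df1/df2, indexed by (category, date)
idx = pd.MultiIndex.from_tuples(
    [('cat', '2017-11-05'), ('dog', '2017-11-07')],
    names=['category', 'date'])
df1 = pd.DataFrame({'data1': [11.0, np.nan]}, index=idx)
df2 = pd.DataFrame({'data1': [np.nan, 31.0]}, index=idx)

a, b = df1['data1'], df2['data1']
conditions = [
    a.isna() & b.isna(),   # both missing -> "x"
    a.isna(),              # only df1 missing -> "x1"
    b.isna(),              # only df2 missing -> "x2"
    b == 0,                # df2 is zero -> "z"
]
choices = ['x', 'x1', 'x2', 'z']
# otherwise: df1['data1']/df2['data1'] - 1 (stringified so dtypes match)
result = np.select(conditions, choices, default=(a / b - 1).astype(str))
print(list(result))  # ['x2', 'x1']
```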

Hi All,

Thank you Ms. Sweger for sharing this valuable asset. I do have a quick question on replacing columns in a dataframe with columns from another dataframe. What I found out is to assign new column to older one but this throws an error.
Basically what I did is
df_1['oldColumn'] = df_2['newColumn']
this does the work but throws this error here "SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame."

I was wondering what your solution would be for this one Ms. Sweger...

Thanks again!
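Not the gist author, but one common fix for the warning described above: the warning usually means `df_1` is a slice view of another frame, so making it an explicit copy before assigning avoids it. A sketch with made-up column names matching the question:

```python
import pandas as pd

source = pd.DataFrame({'a': [1, 2, 3], 'oldColumn': ['x', 'y', 'z']})
df_2 = pd.DataFrame({'newColumn': ['p', 'q', 'r']})

# If df_1 came from slicing another frame, .copy() detaches it,
# so the assignment below can't trigger SettingWithCopyWarning
df_1 = source[source['a'] > 1].copy()

# .values ignores index alignment; plain Series assignment would align on index
df_1['oldColumn'] = df_2['newColumn'].values[:len(df_1)]
print(df_1['oldColumn'].tolist())  # ['p', 'q']
```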

Lovely collection! Thanks much!

blaggacao commented Dec 17, 2017

Typo: Should be df.columns (df.columnS) in

Actually only this worked in my case (maybe because of index/column ambiguity?):

# Grab DataFrame rows where column has certain values
valuelist = ['value1', 'value2', 'value3']
df = df.loc[:,df.columns.isin(valuelist)]

# Grab DataFrame rows where column doesn't have certain values
valuelist = ['value1', 'value2', 'value3']
df = df.loc[:,~df.columns.isin(valuelist)]

Reproduce with:

df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[2,3,4,5],[2,3,4,5]], range(4), range(4))


aic5 commented Jan 14, 2018

That is great! Thanks for sharing.

kmr0877 commented Jan 26, 2018

Really cool stuff. Appreciate the efforts πŸ‘

Thanks for sharing. I am interested in knowing how to grab columns values in a DataFrame through a loop.

you saved my day, thanks

jdflick commented Feb 14, 2018

excellent work
