Instantly share code, notes, and snippets.

Embed
What would you like to do?
Useful Pandas Snippets

Useful Pandas Snippets

A personal diary of DataFrame munging over the years.

Data Types and Conversion

Convert Series datatype to numeric (will error if column has non-numeric values)
(h/t @makmanalp)

pd.to_numeric(df['Column Name'])

Convert Series datatype to numeric, changing non-numeric values to NaN
(h/t @makmanalp for the updated syntax!)

pd.to_numeric(df['Column Name'], errors='coerce')

Change data type of DataFrame column

df.column_name = df.column_name.astype(np.int64)

Exploring and Finding Data

Get a report of all duplicate records in a DataFrame, based on specific columns

dupes = df[df.duplicated(
    ['col1', 'col2', 'col3'], keep=False)]

List unique values in a DataFrame column
(h/t @makmanalp for the updated syntax!)

df['Column Name'].unique()

For each unique value in a DataFrame column, get a frequency count

df['Column Name'].value_counts()

Grab DataFrame rows where column = a specific value

df = df.loc[df.column == 'somevalue']

Grab DataFrame rows where column value is present in a list

test_data = {'hi': 'yo', 'bye': 'later'}
df = pd.DataFrame(list(d.items()), columns=['col1', 'col2'])
valuelist = ['yo', 'heya']
df[df.col2.isin(valuelist)]

Grab DataFrame rows where column value is not present in a list

test_data = {'hi': 'yo', 'bye': 'later'}
df = pd.DataFrame(list(d.items()), columns=['col1', 'col2'])
valuelist = ['yo', 'heya']
df[~df.col2.isin(valuelist)]

Select from DataFrame using criteria from multiple columns
(use | instead of & to do an OR)

newdf = df[(df['column_one']>2004) & (df['column_two']==9)]

Loop through rows in a DataFrame
(if you must)

for index, row in df.iterrows():
    print (index, row['some column'])

Much faster way to loop through DataFrame rows if you can work with tuples
(h/t hughamacmullaniv)

for row in df.itertuples():
    print(row)

Get top n for each group of columns in a sorted DataFrame
(make sure DataFrame is sorted first)

top5 = df.groupby(
    ['groupingcol1',
    'groupingcol2']).head(5)

Grab DataFrame rows where specific column is null/notnull

newdf = df[df['column'].isnull()]

Select from DataFrame using multiple keys of a hierarchical index

df.xs(
    ('index level 1 value','index level 2 value'),
    level=('level 1','level 2'))

Slice values in a DataFrame column (aka Series)

df.column.str[0:2]

Get quick count of rows in a DataFrame

len(df.index)

Get length of data in a DataFrame column

df.column_name.str.len()

Updating and Cleaning Data

Delete column from DataFrame

del df['column']

Rename several DataFrame columns

df = df.rename(columns = {
    'col1 old name':'col1 new name',
    'col2 old name':'col2 new name',
    'col3 old name':'col3 new name',
})

Lower-case all DataFrame column names

df.columns = map(str.lower, df.columns)

Even more fancy DataFrame column re-naming
lower-case all DataFrame column names (for example)

df.rename(columns=lambda x: x.split('.')[-1], inplace=True)

Lower-case everything in a DataFrame column

df.column_name = df.column_name.str.lower()

Sort DataFrame by multiple columns

df = df.sort_values(
    ['col1','col2','col3'],ascending=[1,1,0])

Change all NaNs to None (useful before loading to a db)

df = df.where((pd.notnull(df)), None)

More pre-db insert cleanup...make a pass through the DataFrame, stripping whitespace from strings and changing any empty values to None
(not especially recommended but including here b/c I had to do this in real life once)

df = df.applymap(lambda x: str(x).strip() if len(str(x).strip()) else None)

Get rid of non-numeric values throughout a DataFrame:

for col in refunds.columns.values:
  refunds[col] = refunds[col].replace(
      '[^0-9]+.-', '', regex=True)

Set DataFrame column values based on other column values
(h/t: @mlevkov)

df.loc[(df['column1'] == some_value) & (df['column2'] == some_other_value), ['column_to_change']] = new_value

Clean up missing values in multiple DataFrame columns

df = df.fillna({
    'col1': 'missing',
    'col2': '99.999',
    'col3': '999',
    'col4': 'missing',
    'col5': 'missing',
    'col6': '99'
})

Doing calculations with DataFrame columns that have missing values. In example below, swap in 0 for df['col1'] cells that contain null.

df['new_col'] = np.where(
    pd.isnull(df['col1']), 0, df['col1']) + df['col2']

Split delimited values in a DataFrame column into two new columns

df['new_col1'], df['new_col2'] = zip(
    *df['original_col'].apply(
        lambda x: x.split(': ', 1)))

Collapse hierarchical column indexes

df.columns = df.columns.get_level_values(0)

Reshaping, Concatenating, and Merging Data

Pivot data (with flexibility about what what becomes a column and what stays a row).

pd.pivot_table(
  df,values='cell_value',
  index=['col1', 'col2', 'col3'], #these stay as columns; will fail silently if any of these cols have null values
  columns=['col4']) #data values in this column become their own column

Concatenate two DataFrame columns into a new, single column
(useful when dealing with composite keys, for example)
(h/t @makmanalp for improving this one!)

df['newcol'] = df['col1'].astype(str) + df['col2'].astype(str)

Display and formatting

Set up formatting so larger numbers aren't displayed in scientific notation
(h/t @thecapacity)

pd.set_option('display.float_format', lambda x: '%.3f' % x)

To display with commas and no decimals

pd.options.display.float_format = '{:,.0f}'.format

Creating DataFrames

Create a DataFrame from a Python dictionary

df = pd.DataFrame(list(a_dictionary.items()), columns = ['column1', 'column2'])

Convert Django queryset to DataFrame

qs = DjangoModelName.objects.all()
q = qs.values()
df = pd.DataFrame.from_records(q)
@jdcaballerov

This comment has been minimized.

jdcaballerov commented Nov 28, 2014

I spent almost three hours trying to do the things you present. Excelent.

@ChuyuHsu

This comment has been minimized.

ChuyuHsu commented Sep 15, 2015

This is purely excellent! Thanks!

@bolshoibooze

This comment has been minimized.

bolshoibooze commented Oct 10, 2015

Awesome :-)

@michaeljohnbarr

This comment has been minimized.

michaeljohnbarr commented Dec 4, 2015

Excellent reference. I have come back to it many times. Thank you!

@LesGuessing

This comment has been minimized.

LesGuessing commented Dec 30, 2015

Awesome. you rock. Thanks for doing this.

@howardhamilton

This comment has been minimized.

howardhamilton commented Feb 23, 2016

Very nice reference -- thanks for sharing!

@alexcombessie

This comment has been minimized.

alexcombessie commented Mar 2, 2016

thanks!

@technofriends

This comment has been minimized.

technofriends commented Mar 6, 2016

thanks a ton for sharing. this is awesome.

@ghost

This comment has been minimized.

ghost commented Mar 7, 2016

Should valuelist (lines 8 and 12) and value_list (lines 9 and 13), be the same? Either valuelist or value_list?

@nadia-alam

This comment has been minimized.

nadia-alam commented Mar 10, 2016

awesome stuff!!, thank you so much

@narendraprasath

This comment has been minimized.

narendraprasath commented Mar 25, 2016

Most useful demo!!!!!!! Thank you.......

@eherrerosj

This comment has been minimized.

eherrerosj commented Apr 23, 2016

Very useful, keep them coming!

@kisna72

This comment has been minimized.

kisna72 commented May 20, 2016

This is very useful. Thanks.

@Sandy4321

This comment has been minimized.

Sandy4321 commented May 23, 2016

cool

@nitinsurya

This comment has been minimized.

nitinsurya commented Jun 13, 2016

Nice...

@mlevkov

This comment has been minimized.

mlevkov commented Jul 24, 2016

From line #87, above:

Set DataFrame column values based on other column values

df['column_to_change'][(df['column1'] == some_value) & (df['column2'] == some_other_value)] = new_value

is throwing an error SettingWithCopyWarning
see http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
alternatively I recommend that you change this to the following syntax, using .loc
df.loc((df['column1'] == some_value) & (df['column2'] == some_other_value), ['column_to_change']) = new_value

@bsweger

This comment has been minimized.

Owner

bsweger commented Aug 13, 2016

@mlevkov Thank you, thank you! Have long been vexed by Pandas SettingWithCopyWarning and, truthfully, do not think the docs for .loc provide enough clear examples for those of us who want to re-write using that syntax.

Your re-write of the example in this gist worked great...just had to change the parens to brackets like so:
df.loc[(df['column1'] == some_value) & (df['column2'] == some_other_value), ['column_to_change']] = new_value

Really, really appreciate you taking the time to pass along this tip. Updated the gist accordingly--no doubt I'll refer back to this example many times!

@anahochen

This comment has been minimized.

anahochen commented Aug 17, 2016

Thank you so much for this!

@bsweger

This comment has been minimized.

Owner

bsweger commented Aug 18, 2016

@Ward916 Sorry for seeing your comment so late--yes, value_list was a typo (fixed). Thanks so much for letting me know!

@rezastd

This comment has been minimized.

rezastd commented Sep 30, 2016

i have a question, for example i have a csv file with columns A-Z, what if i want to select column D and column G until the rest of the columns? thank you for sharing 👍

@ndanturt

This comment has been minimized.

ndanturt commented Jan 6, 2017

I signed-up just to thank you !!

@BardiaAfshin

This comment has been minimized.

BardiaAfshin commented Feb 7, 2017

This is pretty good. Gracias.

@lminarini

This comment has been minimized.

lminarini commented Feb 20, 2017

Amazing You saved me a lot of time. Thanks!

@anna1609

This comment has been minimized.

anna1609 commented Feb 24, 2017

Thank you very much for sharing!

@naripok

This comment has been minimized.

naripok commented Feb 25, 2017

An yet another thanks!!!
Thank you very much man!

@hughamacmullaniv

This comment has been minimized.

hughamacmullaniv commented Mar 9, 2017

Good stuff! Thanks!

For looping through rows, if you can work with tuples, try df.itertuples(). Super fast!

@mpiccorossi

This comment has been minimized.

mpiccorossi commented Mar 23, 2017

This is really useful - thanks!

@MarkFeder

This comment has been minimized.

MarkFeder commented Mar 30, 2017

thanks! really useful 👍

@pcmasuzzo

This comment has been minimized.

pcmasuzzo commented Apr 25, 2017

Thanks! 👍

@saminaji

This comment has been minimized.

saminaji commented May 4, 2017

Thanks for sharing

@zhhz418418

This comment has been minimized.

zhhz418418 commented May 13, 2017

Thanks for sharing!

@StevenBorg

This comment has been minimized.

StevenBorg commented May 31, 2017

Thank you! Great list, and helped me get through a few troubling issues, and did it with good performance.

@ishankhatri90

This comment has been minimized.

ishankhatri90 commented Jun 6, 2017

Thank you for sharing this.

@le-dawg

This comment has been minimized.

le-dawg commented Jun 25, 2017

I don't think any other gist for "pandas snippets" ranks better. I have one I would like to add and since pull request for gists don't canonically exist, I'd like to post it here. Keeping with your formatting:

#Check how many rows in DataFrame contain certain substring s in column col
print(len(df[df['col'].str.contains("s")].index.values[:]))
#Get indices of rows that contain substring s in column col
print(len(df[df['col'].str.contains("s")].index.values[:]))
@FirasMegrahi

This comment has been minimized.

FirasMegrahi commented Jun 26, 2017

the most helpful script ever ! :) Thank you so much helped me ALOT.

@cddesire

This comment has been minimized.

cddesire commented Aug 4, 2017

good stuffs, help me a lot

@evanleeturner

This comment has been minimized.

evanleeturner commented Aug 11, 2017

This makes a nice cheatsheet. Ever think about using some markup and making a document out of it?

@bsweger

This comment has been minimized.

Owner

bsweger commented Sep 18, 2017

@japhigu Thanks for your contribution--will add these! Your notes and @evanleeturner's have made me realize that this info, though useful, would better serve people in a different format, where others can weigh in.

@kamal11k

This comment has been minimized.

kamal11k commented Sep 20, 2017

Loved it

@dblinde

This comment has been minimized.

dblinde commented Sep 20, 2017

really usefull if you want to ETL your data!

Thanks, Daan Blinde.

@pashute

This comment has been minimized.

pashute commented Nov 6, 2017

Could you add how to calculate column value from rows in other dataframes with multiple index
# df1 data:
# cateory date data1, data1
# cat, '2017-11-05 01:01:01', 11, 12
# cat, '2017-11-06 01:01:01', 21, 22
# dog, '2017-11-04 01:01:01', 11, 12
# dog, '2017-11-07 01:01:01', 21, 31

# df2 data:
# cateory     date               data1, data1
# cat, '2017-11-05 01:01:01', 11, 12
# cat, '2017-11-07 11:11:11', 21, 22
# dog, '2017-11-06 01:01:01', 11, 12
# dog, '2017-11-07 10:01:01', 21, 31

dfresult = pd.dataframe(columns: ['category', 'date', 'indication1', 'indication2', 'calculated'], index=('category','date')
dfresult.append(df1['category','date'])
# dfresult data:
# cateory date indication1 indication2 calculated
# cat, '2017-11-05 01:01:01', na, na, na
# dog, '2017-11-07 10:01:01', na, na, na

# now set dfreslut['data1'] to be function: 
        - if both df1['data1'] and df2['data1'] of that index are na:  "x"
        - if only df1[data1] of that index is na:  "x1"
        - if only df2[data1] of that index is na:  "x2"   
         - if df2 is zero then: "z" 
         - otherwise set to: df1['data1']/df2['data1'] - 1
   # and set dfresult['calculated'] to calculated function taking from current row. i.e. = dfresult['data1'] < tolerance
@oserttas-math

This comment has been minimized.

oserttas-math commented Nov 27, 2017

Hi All,

Thank you Ms. Sweger for sharing this valuable asset. I do have a quick question on replacing columns in a dataframe with columns from another dataframe. What I found out is to assign new column to older one but this throws an error.
Basically what I did is
df_1['oldColumn'] = df_2['newColumn']
this does the work but throws this error here "SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame."

I was wondering what your solution would be for this one Ms. Sweger...

Thanks again!

@krauseling

This comment has been minimized.

krauseling commented Dec 2, 2017

Lovely collection! Thanks much!

@blaggacao

This comment has been minimized.

blaggacao commented Dec 17, 2017

Typo: Should be df.columns (df.columnS) in https://gist.github.com/bsweger/e5817488d161f37dcbd2#file-useful_pandas_snippets-py-L13-L19

Actually only this worked in my case (maybe because of index/column ambiguity?):

# Grab DataFrame rows where column has certain values
valuelist = ['value1', 'value2', 'value3']
df = df.loc[:,df.columns.isin(valuelist)]

# Grab DataFrame rows where column doesn't have certain values
valuelist = ['value1', 'value2', 'value3']
df = df.loc[:,~df.columns.isin(value_list)]

Reproduce with:

df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[2,3,4,5],[2,3,4,5]], range(4), range(4))
@sidharthbolar

This comment has been minimized.

sidharthbolar commented Dec 23, 2017

thanks!

@aic5

This comment has been minimized.

aic5 commented Jan 14, 2018

That is great! Thanks for sharing.

@kmr0877

This comment has been minimized.

kmr0877 commented Jan 26, 2018

really cool stuff .Appreciate the efforts 👍

@salman6049

This comment has been minimized.

salman6049 commented Feb 1, 2018

Thanks for sharing. I am interested in knowing how to grab columns values in a DataFrame through a loop.

@vsaraogi

This comment has been minimized.

vsaraogi commented Feb 11, 2018

you saved my day, thanks

@jdflick

This comment has been minimized.

jdflick commented Feb 14, 2018

excellent work

@jerisalan

This comment has been minimized.

jerisalan commented Feb 19, 2018

Immensely helpful. Kudos for preparing this 👍

@brianz

This comment has been minimized.

brianz commented Mar 8, 2018

refunds[col].replace('[^0-9]+.-', '', regex=True) was a great find for me...thank you!

@GitPatrickHarris

This comment has been minimized.

GitPatrickHarris commented Apr 10, 2018

This is huge!

@manmartgarc

This comment has been minimized.

manmartgarc commented Apr 28, 2018

Fantastic!

@cnmetzi

This comment has been minimized.

cnmetzi commented Jun 1, 2018

Thanks for sharing! Very helpful!

@musalys

This comment has been minimized.

musalys commented Jul 11, 2018

Thanks for this!! really helpful and useful!

@ineedme

This comment has been minimized.

ineedme commented Oct 24, 2018

# Validate datetime
day,month,year = inputDate.split('/')
isValidDate = True
try :
    datetime.datetime(int(year),int(month),int(day))
except ValueError :
    isValidDate = False
if(isValidDate) :
    print ("Input date is valid ..")
else :
    print ("Input date is not valid..")
@nonagonyang

This comment has been minimized.

nonagonyang commented Nov 18, 2018

This is extremly useful. Thank you

@bsweger

This comment has been minimized.

Owner

bsweger commented Dec 6, 2018

Typo: Should be df.columns (df.columnS) in https://gist.github.com/bsweger/e5817488d161f37dcbd2#file-useful_pandas_snippets-py-L13-L19

Actually only this worked in my case (maybe because of index/column ambiguity?):

# Grab DataFrame rows where column has certain values
valuelist = ['value1', 'value2', 'value3']
df = df.loc[:,df.columns.isin(valuelist)]

# Grab DataFrame rows where column doesn't have certain values
valuelist = ['value1', 'value2', 'value3']
df = df.loc[:,~df.columns.isin(value_list)]

Reproduce with:

df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[2,3,4,5],[2,3,4,5]], range(4), range(4))

Suuuper belated response, but I think what happened here is that my wording in the example was super confusing. By column, I meant "the name of the column you're searching" but that wasn't at all clear. Updated those snippets, so hopefully it won't trip up anyone else. Thanks for the feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment