Pandas notes: a compiled reference list of commands relevant to pandas

Version check

pd.__version__

Summary stats

df.describe()

Rows and column counts

df.shape

File operations

Reading a tsv file with header row

df = pd.read_csv("filename.tsv", sep="\t", header=0)

NOTE: read_csv's default is header='infer', which behaves like header=0 when no names are passed

Read a tsv and assign column headers

df = pd.read_csv("filename.tsv", sep="\t", names=['col1','col2'])

Reading a file from url

import pandas as pd
import io
import requests
url="https://s3.amazonaws.com/demo-datasets/wine.csv"
s=requests.get(url).content
c=pd.read_csv(io.StringIO(s.decode('utf-8')))
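
NOTE: read_csv can also read straight from a URL, so the requests/io detour above is only needed when you want extra control over the request (e.g. authentication headers):

df = pd.read_csv(url)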

Reading dta file

df = pd.read_stata("files/individual_characteristics.dta")

Writing a file (use index=False when you don't want the index written out as a column)

df.to_csv('example.csv')

df.to_csv('example.tsv', sep="\t", index=False)

Skipping rows when reading df

Skip specific number of rows

df = pd.read_csv("filename.tsv", sep="\t", skiprows=4)

Skip commented rows

df = pd.read_csv("filename.tsv", sep="\t", comment='#')

Fill empty cells

dfNew = df.fillna('WhateverIWant')
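
fillna also accepts a dict mapping column names to fill values, which is handy when different columns need different defaults (the column names below are illustrative):

dfNew = df.fillna({'col1': 0, 'col2': 'Unknown'})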

Index set and reset

Turn a column into the index (returns a new DataFrame unless inplace=True).

df.set_index('columnName')

NOTE: you can create a multi-index, i.e. use more than one column, e.g. month, year, etc.

df.set_index(['columnName1','columnName2'])

If the multi-index level order is switched from what you want, you can swap them

df.swaplevel()

Reset index

df.reset_index()

Modify the DataFrame in place (do not create a new object)

df.reset_index(inplace=True)

By default, reset_index inserts the old index back as a column. Pass drop=True to discard it and reset to the default integer index instead.

df.reset_index(drop=True)

Iterate over rows


for index, row in df.iterrows():
    print(row['column1'], row['column2'])
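
NOTE: iterrows is convenient but slow on large frames. itertuples is a faster alternative; each row comes back as a namedtuple, so attribute access works as long as the column names are valid Python identifiers:

for row in df.itertuples():
    print(row.column1, row.column2)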

Column operations

List datatype of each column of df

df.dtypes

Change a column's data type (e.g. below changes it to float)

df['ColumnName'] = df.ColumnName.astype(float)

Get column names

df.columns

To get names as list

list(df)

Subset columns by index

df.iloc[rowid,columnid] # specific row and column
df.iloc[:,columnid] # all rows of a specific column
df.iloc[:,startcolumnid:endcolumnid] # all rows, columns from startcolumnid up to (but not including) endcolumnid
df.iloc[:,startcolumnid:] # all rows, all columns from startcolumnid onwards

import numpy as np
df.iloc[:,np.r_[1:5,6,8,10:15]] # selecting/slicing multiple range of columns

Subset or select column names matching regex or pattern

colList = list(df)

# e.g. to select column names ending in _taxid
selected_names = [e for e in colList if e.endswith('_taxid')]

# or
import re
selected_names = [e for e in colList if re.match(".*taxid$",e)]

Add new column

From list

df['column1'] = listName

From series

se = pd.Series(mylist)
df['new_col'] = se.values

Using a substring (captured with a regex) from another column

df['NewColumnName'] = df['ColumnToGrepFrom'].str.extract(r'Name=(\w+)')

Conditional column operations

Create new column based on value (string or digit)

df['elderly'] = np.where(df['age']>=50, 'yes', 'no')

df['New Column'] = np.where(df.columnA.str.match(regex), "this", "that")

Create a new column, where value is equal to another column (which could change) or a common text based on condition

df['NewCol'] = np.where(df['ColToCheck'] <= 10, 'Other',df['UseThisColVal'])

df['New Column'] = np.where(df.columnA.str.match(regex), df.columnB, df.columnC)
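
For more than two outcomes, np.select generalizes np.where: pass a list of boolean conditions and a parallel list of choices (the age thresholds below are illustrative):

import numpy as np
# conditions are checked in order; the first match wins, default covers the rest
conditions = [df['age'] < 18, df['age'] < 65]
choices = ['minor', 'adult']
df['age_group'] = np.select(conditions, choices, default='senior')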

Update value in another column based on value

df.loc[df['ColumnToCheck'] == 0, 'ColumnToUpdate'] = 'DesiredValue'

Column conversions

Date time format

df['columnname'] = pd.to_datetime(df['columnname'])

If there are strings or NaN values in your column, the command above will error out. Use the following to coerce them to NaT and proceed, unless you want to remove those rows altogether and rerun the command above

df['columnname'] = pd.to_datetime(df['columnname'], errors='coerce')

If you plan to do any row selections based on datetime columns, you may hit the error Cannot compare tz-naive and tz-aware datetime-like objects. One way to solve it is to strip the timezone:

df['date_col'] = pd.to_datetime(df['text_date_col']).dt.tz_convert(None)

Get a column's values (returned as a Series)

df["columnName"]

or

df.columnName

Rename columns

By name

df.rename(columns={'oldname':'newname'}, inplace=True)

By sequential order

df.columns = ['newname1', 'newname2']

Using string replacement across all column names

df.columns = [x.strip().replace('StringToBeReplaced','ReplaceWithThis') for x in df.columns]

Drop columns by name

newdf = df.drop(['column1', 'column2'], axis=1)

or for in-place operations

df.drop(['column1', 'column2'], axis=1, inplace=True)

Create empty df with column names

df = pd.DataFrame(columns=['column1','column2'])

Prepend a string to each value in a column

df['col'] = 'str' + df['col'].astype(str)

Split column(s)

Based on delimiter

If there is a single delimiter or you know the number of resulting columns

df[['A', 'B']] = df['AB'].str.split(' ', n=1, expand=True)

If you want it to create the columns automatically, just drop the maximum-splits parameter (n)

df['AB'].str.split(' ', expand=True)

If you want to automatically split a column into multiple columns (adding a prefix to each, e.g. split_0, split_1) and join them back onto the existing dataframe:

df = df.join(df['AB'].str.split(' ', expand=True).add_prefix('split_'))

Based on content
Tuple

Source: SO

If you have a column b containing tuples (each with 2 components)

df[['b1', 'b2']] = pd.DataFrame(df['b'].tolist(), index=df.index)

Another, albeit slower and more memory-intensive, way is df['b'].apply(pd.Series); for large dataframes this can cause a memory error.

Check if df columns are numeric

Returns True or False

import numpy as np
np.issubdtype(df[columnName].dtype, np.number)

Get list of unique values from specific column

columniqvalues = df.colname.unique()

Get rows with duplicated values in a specific column (keep=False flags all occurrences, not just the later ones)

df[df.duplicated(['columnWithDups'], keep=False)]

GroupBy operations

GroupBy single column

operations could be: sum, mean, median etc

This will create a new column with your specified name (NOTE: dict aggregation like this is deprecated and removed in newer pandas; see the named-aggregation sketch after these examples)

dfgrp = df.groupby(['GroupbyThisColumn'])['OperationOnThisColum'].agg({'NewCol':'operation'})

If you don't want the groupby column to be the index

newDf = olddf.groupby('WhatColToGroupBy', as_index=False)['ColToBeProcessed'].agg({'NewCol': "count"})

This will create a new column with autogenerated column name (based on operation)

dfgrp = df.groupby(['GroupbyThisColumn'])['OperationOnThisColum'].agg(['operation'])
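
In pandas 0.25+ the supported replacement for the dict form above is named aggregation; a minimal sketch reusing the placeholder names from these examples:

dfgrp = df.groupby('GroupbyThisColumn').agg(NewCol=('OperationOnThisColum', 'sum'))

The same keyword form also works on a single selected column, e.g. df.groupby('GroupbyThisColumn')['OperationOnThisColum'].agg(NewCol='sum').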

GroupBy multiple columns

dfgrp = df.groupby(['GroupbyThisColumn', 'GroupbyThisColumntoo'])['OperationOnThisColum','OperationOnThisColum2'].agg(['operation'])

Groupby and collapse values

Source: SO

Collapse column values (by separator) based on common value in another column

Example:

Original df

  tempx        value
picture1         1.5
picture555       1.5
picture255       1.5
picture365       1.5
picture112       1.5

Command:

 df.groupby('value')['tempx'].apply(' '.join).reset_index()

Output

  value                                              tempx
1.5  picture1 picture555 picture255 picture365 pict...

Subset

Source: SO

To select rows whose column value equals a scalar, some_value, use ==:

df.loc[df['column_name'] == some_value]

or

df[df['column_name'] == some_value]

To select rows whose column value is in an iterable, some_values, use isin:

df.loc[df['column_name'].isin(some_values)]

Combine multiple conditions with &:

df.loc[(df['column_name'] == some_value) & df['other_column'].isin(some_values)]

To select rows whose column value does not equal some_value, use !=:

df.loc[df['column_name'] != some_value]

Another way (this time using a text comparison as the example)

df.query('columnName != "some_text" and columnName != "SomeOthertext"')

isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:

df.loc[~df['column_name'].isin(some_values)]

Selecting columns by name

subsetdf = df[['columnName1','columnName2']]

Select rows based on partial string match in specific column

df[df['A'].str.contains("hello")]

Merge

Generic way. Options for how are: left, right, inner, outer.

df = pd.merge(df1, df2, how='inner', on='ColumnName')

Merge specific columns only

df = pd.merge(df1, df2[['KeyColumnName', 'TargetColumnname']], how='inner', on='KeyColumnName')

Row operations

List rows based on column value match

Value match

df[df['ColumnName'] == Value]

Null value

df[df.isnull().any(axis=1)]

Subset rows

Null values in specific column

df.loc[df['columnName'].isnull()]

Non null values in specific column

df.loc[df['columnName'].notnull()]

Add

Append another df (rbind equivalent). NOTE: DataFrame.append is deprecated in newer pandas (1.4+); pd.concat, covered at the end of these notes, is the replacement.

newdf = df1.append(df2,ignore_index=True)

Append specific row from another df

newdf = df1.append(df2.iloc[indexNo],ignore_index=True)

Append a list as the last row of the df (assumes the default integer index)

df.loc[len(df)] = listname

Drop

Drop rows by name

Drop rows and save the new df as a new object

newdf = df.drop(['rowname', 'rowname2'])

In-place changes

df.drop(['rowname', 'rowname2'], inplace=True)

Drop rows by index

df.drop(df.index[0])

Drop first two rows

df.drop(df.index[:2])

Drop rows by value

df = df[df.columnName != Value]

Or

Note: for the query option below, the column name must not contain spaces

df.query('columnName != 0')

Remove duplicate rows

df.drop_duplicates()

Drop duplicates in a column, keeping the last observation in each duplicated set

df.drop_duplicates(['columnName'], keep='last')

Subset rows by column content

Partial string matching

Partial match

df[df['columnName'].str.contains('ac')]

Starting with

df[df['columnName'].str.match('ac')]

Dictionary conversion

From dictionary

e.g. using a dictionary named data

df = pd.DataFrame(list(data.items()), columns=['Key_column', 'Value_column'])

To dictionary

df.set_index('id')['value'].to_dict()

or (name the result something other than dict, so the built-in isn't shadowed)

mydict = dict(zip(df.id, df.value))

Reorder and sort

Sort column values

Note: sort_values is a DataFrame method; pass the label(s) of the column(s) to sort by.

Single column

df.sort_values('columnName', ascending=True)

Multiple columns

df.sort_values(['column1','column2'], ascending=[True, False])

Flip order of rows

dfFinal = df[::-1]

Pandas plot

Legend position (loc) options available for pandas plots (a usage sketch follows the list):

best
upper right
upper left
lower left
lower right
right
center left
center right
lower center
upper center
center
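
These are the standard matplotlib loc values. A minimal sketch of placing the legend on a pandas plot (the column names are illustrative):

import matplotlib.pyplot as plt

# df.plot returns a matplotlib Axes, whose legend can be repositioned
ax = df.plot(y=['column1', 'column2'])
ax.legend(loc='upper right')
plt.show()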

Reshaping dataframe

Melt

NOTE: value_vars expects a list of column names, not a slice of the dataframe.

pd.melt(strain_screened, id_vars='strain', value_vars=ListOfColNames)

Rename the default-named output columns variable and value

pd.melt(strain_screened, id_vars='strain', value_vars=[e for e in list(strain_screened) if re.match(r"^\d", e)], var_name='SampleID', value_name='Metaphlan2_abundance')

Time series or datetime analysis

Convert pandas column to datetime

Date time format

df['columnname'] = pd.to_datetime(df['columnname'])

If there are strings or NaN values in the column, the command above will error out (see Column conversions above). Use the following to coerce them to NaT and proceed

df['columnname'] = pd.to_datetime(df['columnname'], errors='coerce')

Current datetime

Get current date and time

from datetime import datetime
now = datetime.now()

Get it in a specific format, e.g.

now.strftime("%B %d %Y") # April 11 2019

Convert string to datetime

datetime.strptime('2019-04-11 03:25:04', '%Y-%m-%d %H:%M:%S')

Uncategorized tips

Increasing display column width (to display all text)

pd.set_option('display.max_colwidth', None)

NOTE: older pandas versions used -1 here; newer versions require None.

Create dictionary from two columns (one as key and the other as corresponding value)

dict1 = dict(zip(df1.keyColumnName, df1.valueColumnName))

Add/set value for particular cell in df

df = df.set_value('D', 'x', 10)

NOTE: set_value is deprecated; use .at as below, where 'D' is the index label, 'x' is the column, and 10 is the value to be set.

df.at['D', 'x'] = 10

Split a column into separate columns

df2 = pd.DataFrame(df['meta'].str.split().values.tolist())

Concatenate or append two dataframes, i.e. R's cbind and rbind equivalents

By columns, i.e. cbind. NOTE: ignore_index=True here renumbers the columns 0..n; drop it if you want to keep the original column names.

newdf = pd.concat([df1, df2], axis=1, ignore_index=True)

By rows, i.e. rbind. NOTE: axis=0 is the default and can be omitted.

newdf = pd.concat([df1, df2], axis=0, ignore_index=True)
