pd.__version__
df.describe()
df.shape
Reading a tsv file with header row
df = pd.read_csv("filename.tsv", sep="\t", header=0)
NOTE: header=0 is the default
Read a tsv and assign column headers
df = pd.read_csv("filename.tsv", sep="\t", names=['col1','col2'])
Reading a file from url
import pandas as pd
import io
import requests
url="https://s3.amazonaws.com/demo-datasets/wine.csv"
s=requests.get(url).content
c=pd.read_csv(io.StringIO(s.decode('utf-8')))
Reading a .dta (Stata) file
df = pd.read_stata("files/individual_characteristics.dta")
Writing a file (use index=False when you don't want the index written as a column)
df.to_csv('example.csv')
df.to_csv('example.tsv', sep="\t", index=False)
Skip specific number of rows
df = pd.read_csv("filename.tsv", sep="\t", skiprows=4)
Skip commented rows
df = pd.read_csv("filename.tsv", sep="\t", comment='#')
dfNew = df.fillna('WhateverIWant')
Turn a column into index.
df.set_index('columnName')
NOTE : you can create multi-indexes i.e. use > 1 column e.g. month, year etc
df.set_index(['columnName1','columnName2'])
If for some reason the multi-index levels are in the wrong order, you can swap them
df.swaplevel()
Reset index
df.reset_index()
Modify the DataFrame in place (do not create a new object)
df.reset_index(inplace=True)
Sometimes reset_index inserts the old index as a dataframe column. The following resets to the default integer index and drops the old one.
df.reset_index(drop=True)
for index, row in df.iterrows():
print(row['column1'], row['column2'])
List datatype of each column of df
df.dtypes
Change data types (e.g. below is to change it to float)
df['ColumnName'] = df.ColumnName.astype(float)
df.columns
To get names as list
list(df)
df.iloc[rowid,columnid] # specific row and column
df.iloc[:,columnid] # all rows of specific colum
df.iloc[:,startcolumnid:endcolumnid] # all rows, all columns starting with startcolumnid and end with endcolumnid
df.iloc[:,startcolumnid:] # all rows, all columns starting with startcolumnid
import numpy as np
df.iloc[:,np.r_[1:5,6,8,10:15]] # selecting/slicing multiple range of columns
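A minimal sketch of the np.r_ trick above, on a made-up six-column frame (the column names c0..c5 are hypothetical):

```python
import numpy as np
import pandas as pd

# toy frame with columns c0..c5
df = pd.DataFrame([[0, 1, 2, 3, 4, 5]], columns=[f"c{i}" for i in range(6)])

# np.r_ concatenates slices and scalars into one index array:
# np.r_[1:3, 5] -> [1, 2, 5]
picked = df.iloc[:, np.r_[1:3, 5]]
```

This selects columns c1, c2 and c5 in one iloc call.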
colList = list(df)
# e.g. to select column names ending in _taxid
selected_names = [e for e in colList if e.endswith('_taxid')]
#or
import re
selected_names = [e for e in colList if re.match(".*taxid$",e)]
df['column1'] = listName
se = pd.Series(mylist)
df['new_col'] = se.values
df['NewColumnName'] = df['ColumnToGrepFrom'].str.extract(r'Name=(\w+)')
Create new column based on value (string or digit)
df['elderly'] = np.where(df['age']>=50, 'yes', 'no')
df['New Column'] = np.where(df.columnA.str.match(regex), "this", "that")
Create a new column, where value is equal to another column (which could change) or a common text based on condition
df['NewCol'] = np.where(df['ColToCheck'] <= 10, 'Other',df['UseThisColVal'])
df['New Column'] = np.where(df.columnA.str.match(regex), df.columnB, df.columnC)
Update value in another column based on value
df.loc[df['ColumnToCheck'] == 0, 'ColumnToUpdate'] = 'DesiredValue'
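The two patterns above (np.where for a new conditional column, .loc assignment for a conditional update) can be checked on a tiny made-up frame (column names age/score are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [30, 60], "score": [0, 7]})

# conditional new column
df["elderly"] = np.where(df["age"] >= 50, "yes", "no")

# conditional update of an existing column
df.loc[df["score"] == 0, "score"] = -1
```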
Date time format
df['columnname'] = pd.to_datetime(df['columnname'])
If there are str or NaN values in your column, the command above will error out. Use the following to convert them to NaT and proceed, unless you want to remove those rows altogether and then repeat the command above
df['columnname'] = pd.to_datetime(df['columnname'], errors='coerce')
If you plan to do any row selections based on datetime columns, you could get an error Cannot compare tz-naive and tz-aware datetime-like objects
. In that case, you could solve it by
df['date_col'] = pd.to_datetime(df['text_date_col']).dt.tz_convert(None)
df["columnName"]
or
df.columnName
By name
df.rename(columns={'oldname':'newname'}, inplace=True)
By sequential order
df.columns = ['newname1', 'newname2']
Using regex
df.columns = [x.strip().replace('StringToBeReplaced','ReplaceWithThis') for x in df.columns]
newdf = df.drop(['column1', 'column2'], axis=1)
or for in-place operations
df.drop(['column1', 'column2'], axis=1, inplace=True)
df = pd.DataFrame(columns=['column1','column2'])
df['col'] = 'str' + df['col'].astype(str)
If there is a single delimiter or you know the number of resulting columns
df[['A', 'B']] = df['AB'].str.split(' ', n=1, expand=True)
If you want it to automatically create columns (just remove the 'maximum splits' parameter)
df['AB'].str.split(' ', expand=True)
If you want to automatically split a column into multiple columns (adding a prefix to them, e.g. split_0, split_1) and add them to the existing dataframe.
df = df.join(df['AB'].str.split(' ', expand=True).add_prefix('split_'))
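A runnable sketch of the split-with-prefix pattern, on a hypothetical two-token column named AB:

```python
import pandas as pd

df = pd.DataFrame({"AB": ["x y", "p q"]})
# expand=True returns a frame with integer columns 0, 1; add_prefix renames them
df = df.join(df["AB"].str.split(" ", expand=True).add_prefix("split_"))
```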
Source: SO
If you have a column b
containing tuples (each with 2 components)
df[['b1', 'b2']] = pd.DataFrame(df['b'].tolist(), index=df.index)
Another, albeit slower and more memory-intensive, way (for large df's it can cause a memory error) is using apply: df['b'].apply(pd.Series)
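The tuple-splitting recipe above on a tiny made-up frame (column names b, b1, b2 as in the recipe):

```python
import pandas as pd

df = pd.DataFrame({"b": [(1, 2), (3, 4)]})  # hypothetical tuple column
# tolist() turns the tuples into rows of a new frame; keep the original index
df[["b1", "b2"]] = pd.DataFrame(df["b"].tolist(), index=df.index)
```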
Returns True or False
import numpy as np
np.issubdtype(df[columnName].dtype, np.number)
columniqvalues = df.colname.unique()
df[df.duplicated(['columnWithDups'], keep=False)]
operations could be: sum, mean, median etc
This will create a new column with your specified name (NOTE: the dict-renaming syntax agg({'NewCol': 'operation'}) is deprecated/removed; use named aggregation instead)
dfgrp = df.groupby('GroupbyThisColumn').agg(NewCol=('OperationOnThisColumn', 'operation'))
If you don't want the groupby column to be the index
newDf = olddf.groupby('WhatColToGroupBy', as_index=False).agg(NewCol=('ColToBeProcessed', 'count'))
This will create a new column with autogenerated column name (based on operation)
dfgrp = df.groupby(['GroupbyThisColumn'])['OperationOnThisColumn'].agg(['operation'])
dfgrp = df.groupby(['GroupbyThisColumn', 'GroupbyThisColumnToo'])[['OperationOnThisColumn', 'OperationOnThisColumn2']].agg(['operation'])
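A concrete named-aggregation example on a made-up frame (column names grp/val are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"grp": ["a", "a", "b"], "val": [1, 2, 3]})

# named aggregation: new column name on the left, (source column, function) on the right
out = df.groupby("grp", as_index=False).agg(total=("val", "sum"))
```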
Source: SO
Collapse column values (by separator) based on common value in another column
Example:
Original df
tempx value
picture1 1.5
picture555 1.5
picture255 1.5
picture365 1.5
picture112 1.5
Command:
df.groupby('value')['tempx'].apply(' '.join).reset_index()
Output
value tempx
1.5 picture1 picture555 picture255 picture365 pict...
Source: SO
To select rows whose column value equals a scalar, some_value, use ==:
df.loc[df['column_name'] == some_value]
or
df[df['column_name'] == some_value]
To select rows whose column value is in an iterable, some_values, use isin:
df.loc[df['column_name'].isin(some_values)]
Combine multiple conditions with &:
df.loc[(df['column_name'] == some_value) & df['other_column'].isin(some_values)]
To select rows whose column value does not equal some_value, use !=:
df.loc[df['column_name'] != some_value]
Another way (this time using text comparison as example)
df.query('columnName != "some_text" and columnName != "SomeOthertext"')
isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:
df.loc[~df['column_name'].isin(some_values)]
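The ~isin negation above, checked on a tiny frame (hypothetical column name col):

```python
import pandas as pd

df = pd.DataFrame({"col": ["a", "b", "c"]})
# keep only rows whose value is NOT in the given list
kept = df.loc[~df["col"].isin(["b"])]
```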
selecting columns by name
subsetdf = df[['columnName1','columnName2']]
Select rows based on partial string match in specific column
df[df['A'].str.contains("hello")]
df = pd.merge(df1, df2, how='inner', on='ColumnName')
df = pd.merge(df1, df2[['KeyColumnName', 'TargetColumnname']], how='inner', on='KeyColumnName')
Value match
df[df['ColumnName'] == Value]
Null value
df[df.isnull().any(axis=1)]
Null values in specific column
df.loc[df['columnName'].isnull()]
Non null values in specific column
df.loc[df['columnName'].notnull()]
Append another df (rbind equivalent). NOTE: DataFrame.append was removed in pandas 2.0; use pd.concat instead
newdf = pd.concat([df1, df2], ignore_index=True)
Append a specific row from another df (iloc[[indexNo]] keeps it a one-row dataframe)
newdf = pd.concat([df1, df2.iloc[[indexNo]]], ignore_index=True)
Append a list as last row of df
df.loc[len(df)] = listname
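A quick sketch of appending a list as the last row; this only behaves as expected when the index is the default 0..n-1 RangeIndex (column names a/b are made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})
# len(df) == 1, so the new row gets index label 1
df.loc[len(df)] = [3, 4]
```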
Drop rows and save the new df as a new object
newdf = df.drop(['rowname', 'rowname2'])
In-place changes
df.drop(['rowname', 'rowname2'], inplace=True)
df.drop(df.index[0])
Drop first two rows
df.drop(df.index[:2])
df = df[df.columnName != Value]
Or
NOTE: for the option below, the columnName should not contain spaces
df.query('columnName != 0')
df.drop_duplicates()
Drop duplicates in a column, but take the last obs in the duplicated set
df.drop_duplicates(['columnName'], keep='last')
Partial match
df[df['columnName'].str.contains('ac')]
Starting with
df[df['columnName'].str.match('ac')]
e.g. using dictionary data
df = pd.DataFrame(list(data.items()), columns=['Key_column', 'Value_column'])
df.set_index('id')['value'].to_dict()
mydict = dict(zip(df.id, df.value)) # avoid naming it dict, which shadows the builtin
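A round-trip sketch of the dict <-> dataframe recipes above (the dict contents and the column names id/value are hypothetical):

```python
import pandas as pd

data = {"k1": 10, "k2": 20}
# dict -> dataframe of (key, value) rows
df = pd.DataFrame(list(data.items()), columns=["id", "value"])
# dataframe -> dict again
back = dict(zip(df.id, df.value))
```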
Note: sort_values takes a column name (or a list of names)
df.sort_values('columnName', ascending=True)
df.sort_values(['column1', 'column2'], ascending=[True, False])
dfFinal = df[::-1]
pandas plot legend position options available
best
upper right
upper left
lower left
lower right
right
center left
center right
lower center
upper center
center
pd.melt(strain_screened, id_vars='strain', value_vars=ListOfColNames)
Rename the default named columns variable and value
pd.melt(strain_screened, id_vars='strain', value_vars=[e for e in list(strain_screened) if re.match(r"^\d", e)], var_name='SampleID', value_name='Metaphlan2_abundance')
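A self-contained melt example with renamed variable/value columns (the frame and names x1/x2/abundance are made up):

```python
import pandas as pd

wide = pd.DataFrame({"strain": ["s1"], "x1": [0.5], "x2": [0.7]})
long_df = pd.melt(wide, id_vars="strain", value_vars=["x1", "x2"],
                  var_name="SampleID", value_name="abundance")
```

Each value column becomes a row, keyed by the id_vars column.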
Get current date and time
from datetime import datetime
now = datetime.now()
Get it in a specific format e.g.
now.strftime("%B %d %Y") # e.g. April 11 2019
Parse a string into a datetime
datetime.strptime('2019-04-11 03:25:04', '%Y-%m-%d %H:%M:%S')
pd.set_option('display.max_colwidth', None) # -1 is deprecated; None means no truncation
dict1 = dict(zip(df1.keyColumnName, df1.valueColumnName))
df.at['D', 'x'] = 10
where 'D' is the index label, 'x' is the column and 10 is the value to be set
NOTE: df.set_value('D', 'x', 10) is deprecated/removed; use .at as above
Split column into separate columns
df2 = pd.DataFrame(df['meta'].str.split().values.tolist())
By columns i.e. cbind. NOTE: ignore_index=True would renumber the column labels 0..n-1; omit it to keep the original column names
newdf = pd.concat([df1, df2], axis=1)
By rows i.e. rbind. Note : axis=0 is default and can be skipped
newdf = pd.concat([df1, df2], axis=0, ignore_index=True)
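Both concat directions on tiny made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1]})
df2 = pd.DataFrame({"a": [2]})

rows = pd.concat([df1, df2], ignore_index=True)  # rbind: stack rows
cols = pd.concat([df1, df2], axis=1)             # cbind: side by side (name 'a' repeats)
```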