pd.__version__
df.describe()
df.shape
Reading a tsv file with header row
df = pd.read_csv("filename.tsv", sep="\t", header=0)
NOTE: header=0 is the default
Read a tsv and assign column headers
df = pd.read_csv("filename.tsv", sep="\t", names=['col1','col2'])
Reading a file from url
import pandas as pd
import io
import requests
url="https://s3.amazonaws.com/demo-datasets/wine.csv"
s=requests.get(url).content
c=pd.read_csv(io.StringIO(s.decode('utf-8')))
Reading a .dta (Stata) file
df = pd.read_stata("files/individual_characteristics.dta")
Writing a file (use index=False when you don't want the index written as a column)
df.to_csv('example.csv')
df.to_csv('example.tsv', sep="\t", index=False)
Skip specific number of rows
df = pd.read_csv("filename.tsv", sep="\t", skiprows=4)
Skip commented rows
df = pd.read_csv("filename.tsv", sep="\t", comment='#')
dfNew = df.fillna('WhateverIWant')
Turn a column into index.
df.set_index('columnName')
NOTE : you can create multi-indexes i.e. use > 1 column e.g. month, year etc
df.set_index(['columnName1','columnName2'])
If for some reason the multi-index levels are in the wrong order, you can swap them
df.swaplevel()
Reset index
df.reset_index()
Modify the DataFrame in place (do not create a new object)
df.reset_index(inplace=True)
Sometimes reset_index inserts the old index as a dataframe column. The following resets to the default integer index and drops the old one.
df.reset_index(drop=True)
for index, row in df.iterrows():
print(row['column1'], row['column2'])
List datatype of each column of df
df.dtypes
Change data types (e.g. below is to change it to float)
df['ColumnName'] = df.ColumnName.astype(float)
df.columns
To get names as list
list(df)
df.iloc[rowid,columnid] # specific row and column
df.iloc[:,columnid] # all rows of specific colum
df.iloc[:,startcolumnid:endcolumnid] # all rows, all columns starting with startcolumnid and end with endcolumnid
df.iloc[:,startcolumnid:] # all rows, all columns starting with startcolumnid
import numpy as np
df.iloc[:,np.r_[1:5,6,8,10:15]] # selecting/slicing multiple range of columns
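A minimal sketch of the np.r_ trick above, on a made-up six-column frame (the column names c0..c5 are hypothetical):

```python
import numpy as np
import pandas as pd

# toy frame with columns c0..c5
df = pd.DataFrame([[0, 1, 2, 3, 4, 5]], columns=[f"c{i}" for i in range(6)])

# np.r_ concatenates slices and scalars into one index array:
# np.r_[1:3, 5] -> [1, 2, 5]
picked = df.iloc[:, np.r_[1:3, 5]]
```

This selects columns c1, c2 and c5 in one iloc call.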
colList = list(df)
# e.g. to select column names ending in _taxid
selected_names = [e for e in colList if e.endswith('_taxid')]
#or
import re
selected_names = [e for e in colList if re.match(".*taxid$",e)]
df['column1'] = listName
se = pd.Series(mylist)
df['new_col'] = se.values
df['NewColumnName'] = df['ColumnToGrepFrom'].str.extract(r'Name=(\w+)')
Create new column based on value (string or digit)
df['elderly'] = np.where(df['age']>=50, 'yes', 'no')
df['New Column'] = np.where(df.columnA.str.match(regex), "this", "that")
Create a new column, where value is equal to another column (which could change) or a common text based on condition
df['NewCol'] = np.where(df['ColToCheck'] <= 10, 'Other',df['UseThisColVal'])
df['New Column'] = np.where(df.columnA.str.match(regex), df.columnB, df.columnC)
Update value in another column based on value
df.loc[df['ColumnToCheck'] == 0, 'ColumnToUpdate'] = 'DesiredValue'
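The two patterns above (np.where for a new conditional column, .loc assignment for a conditional update) can be checked on a tiny made-up frame (column names age/score are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [30, 60], "score": [0, 7]})

# conditional new column
df["elderly"] = np.where(df["age"] >= 50, "yes", "no")

# conditional update of an existing column
df.loc[df["score"] == 0, "score"] = -1
```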
Date time format
df['columnname'] = pd.to_datetime(df['columnname'])
If there are str or NaN values in your column, the command above will error out. Use the following to convert them to NaT and proceed, unless you want to remove those rows altogether and then repeat the command above
df['columnname'] = pd.to_datetime(df['columnname'], errors='coerce')
If you plan to do any row selections based on datetime columns, you could get an error Cannot compare tz-naive and tz-aware datetime-like objects
. In that case, you could solve it by
df['date_col'] = pd.to_datetime(df['text_date_col']).dt.tz_convert(None)
df["columnName"]
or
df.columnName
By name
df.rename(columns={'oldname':'newname'}, inplace=True)
By sequential order
df.columns = ['newname1', 'newname2']
Using regex
df.columns = [x.strip().replace('StringToBeReplaced','ReplaceWithThis') for x in df.columns]
newdf = df.drop(['column1', 'column2'], axis=1)
or for in-place operations
df.drop(['column1', 'column2'], axis=1, inplace=True)
df = pd.DataFrame(columns=['column1','column2'])
df['col'] = 'str' + df['col'].astype(str)
If there is a single delimiter or you know the number of resulting columns
df[['A', 'B']] = df['AB'].str.split(' ', n=1, expand=True)
If you want it to automatically create columns (just remove the 'maximum splits' parameter)
df['AB'].str.split(' ', expand=True)
If you want to automatically split a column into multiple columns (adding a prefix to them, e.g. split_0, split_1) and add them to the existing dataframe.
df = df.join(df['AB'].str.split(' ', expand=True).add_prefix('split_'))
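A runnable sketch of the split-with-prefix pattern, on a hypothetical two-token column named AB:

```python
import pandas as pd

df = pd.DataFrame({"AB": ["x y", "p q"]})
# expand=True returns a frame with integer columns 0, 1; add_prefix renames them
df = df.join(df["AB"].str.split(" ", expand=True).add_prefix("split_"))
```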
Source: SO
If you have a column b
containing tuples (each with 2 components)
df[['b1', 'b2']] = pd.DataFrame(df['b'].tolist(), index=df.index)
Another, albeit slower and more memory-intensive, way (for large df's it can cause a memory error) is using apply: df['b'].apply(pd.Series)
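The tuple-splitting recipe above on a tiny made-up frame (column names b, b1, b2 as in the recipe):

```python
import pandas as pd

df = pd.DataFrame({"b": [(1, 2), (3, 4)]})  # hypothetical tuple column
# tolist() turns the tuples into rows of a new frame; keep the original index
df[["b1", "b2"]] = pd.DataFrame(df["b"].tolist(), index=df.index)
```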
Returns True or False
import numpy as np
np.issubdtype(df[columnName].dtype, np.number)
columniqvalues = df.colname.unique()
df[df.duplicated(['columnWithDups'], keep=False)]
operations could be: sum, mean, median etc
This will create a new column with your specified name (NOTE: the dict-renaming syntax agg({'NewCol': 'operation'}) is deprecated/removed; use named aggregation instead)
dfgrp = df.groupby('GroupbyThisColumn').agg(NewCol=('OperationOnThisColumn', 'operation'))
If you don't want the groupby column to be the index
newDf = olddf.groupby('WhatColToGroupBy', as_index=False).agg(NewCol=('ColToBeProcessed', 'count'))
This will create a new column with autogenerated column name (based on operation)
dfgrp = df.groupby(['GroupbyThisColumn'])['OperationOnThisColumn'].agg(['operation'])
dfgrp = df.groupby(['GroupbyThisColumn', 'GroupbyThisColumnToo'])[['OperationOnThisColumn', 'OperationOnThisColumn2']].agg(['operation'])
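A concrete named-aggregation example on a made-up frame (column names grp/val are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"grp": ["a", "a", "b"], "val": [1, 2, 3]})

# named aggregation: new column name on the left, (source column, function) on the right
out = df.groupby("grp", as_index=False).agg(total=("val", "sum"))
```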
Source: SO
Collapse column values (by separator) based on common value in another column
Example:
Original df
tempx value
picture1 1.5
picture555 1.5
picture255 1.5
picture365 1.5
picture112 1.5
Command:
df.groupby('value')['tempx'].apply(' '.join).reset_index()
Output
value tempx
1.5 picture1 picture555 picture255 picture365 pict...
Source: SO
To select rows whose column value equals a scalar, some_value, use ==:
df.loc[df['column_name'] == some_value]
or
df[df['column_name'] == some_value]
To select rows whose column value is in an iterable, some_values, use isin:
df.loc[df['column_name'].isin(some_values)]
Combine multiple conditions with &:
df.loc[(df['column_name'] == some_value) & df['other_column'].isin(some_values)]
To select rows whose column value does not equal some_value, use !=:
df.loc[df['column_name'] != some_value]
Another way (this time using text comparison as example)
df.query('columnName != "some_text" and columnName != "SomeOthertext"')
isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:
df.loc[~df['column_name'].isin(some_values)]
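The ~isin negation above, checked on a tiny frame (hypothetical column name col):

```python
import pandas as pd

df = pd.DataFrame({"col": ["a", "b", "c"]})
# keep only rows whose value is NOT in the given list
kept = df.loc[~df["col"].isin(["b"])]
```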
selecting columns by name
subsetdf = df[['columnName1','columnName2']]
Select rows based on partial string match in specific column
df[df['A'].str.contains("hello")]
df = pd.merge(df1, df2, how='inner', on='ColumnName')
df = pd.merge(df1, df2[['KeyColumnName', 'TargetColumnname']], how='inner', on='KeyColumnName')
Value match
df[df['ColumnName'] == Value]
Null value
df[df.isnull().any(axis=1)]
Null values in specific column
df.loc[df['columnName'].isnull()]
Non null values in specific column
df.loc[df['columnName'].notnull()]
Append another df (rbind equivalent). NOTE: DataFrame.append was removed in pandas 2.0; use pd.concat instead
newdf = pd.concat([df1, df2], ignore_index=True)
Append a specific row from another df (iloc[[indexNo]] keeps it a one-row dataframe)
newdf = pd.concat([df1, df2.iloc[[indexNo]]], ignore_index=True)
Append a list as last row of df
df.loc[len(df)] = listname
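A quick sketch of appending a list as the last row; this only behaves as expected when the index is the default 0..n-1 RangeIndex (column names a/b are made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})
# len(df) == 1, so the new row gets index label 1
df.loc[len(df)] = [3, 4]
```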
Drop rows and save the new df as a new object
newdf = df.drop(['rowname', 'rowname2'])
In-place changes
df.drop(['rowname', 'rowname2'], inplace=True)
df.drop(df.index[0])
Drop first two rows
df.drop(df.index[:2])
df = df[df.columnName != Value]
Or
NOTE: for the option below, the columnName should not contain spaces
df.query('columnName != 0')
df.drop_duplicates()
Drop duplicates in a column, but take the last obs in the duplicated set
df.drop_duplicates(['columnName'], keep='last')
Partial match
df[df['columnName'].str.contains('ac')]
Starting with
df[df['columnName'].str.match('ac')]
e.g. using dictionary data
df = pd.DataFrame(list(data.items()), columns=['Key_column', 'Value_column'])
df.set_index('id')['value'].to_dict()
mydict = dict(zip(df.id, df.value)) # avoid naming it dict, which shadows the builtin
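A round-trip sketch of the dict <-> dataframe recipes above (the dict contents and the column names id/value are hypothetical):

```python
import pandas as pd

data = {"k1": 10, "k2": 20}
# dict -> dataframe of (key, value) rows
df = pd.DataFrame(list(data.items()), columns=["id", "value"])
# dataframe -> dict again
back = dict(zip(df.id, df.value))
```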
Note: sort_values takes a column name (or a list of names)
df.sort_values('columnName', ascending=True)
df.sort_values(['column1', 'column2'], ascending=[True, False])
dfFinal = df[::-1]
pandas plot legend position options available
best
upper right
upper left
lower left
lower right
right
center left
center right
lower center
upper center
center
pd.melt(strain_screened, id_vars='strain', value_vars=ListOfColNames)
Rename the default named columns variable and value
pd.melt(strain_screened, id_vars='strain', value_vars=[e for e in list(strain_screened) if re.match(r"^\d", e)], var_name='SampleID', value_name='Metaphlan2_abundance')
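A self-contained melt example with renamed variable/value columns (the frame and names x1/x2/abundance are made up):

```python
import pandas as pd

wide = pd.DataFrame({"strain": ["s1"], "x1": [0.5], "x2": [0.7]})
long_df = pd.melt(wide, id_vars="strain", value_vars=["x1", "x2"],
                  var_name="SampleID", value_name="abundance")
```

Each value column becomes a row, keyed by the id_vars column.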
Get current date and time
from datetime import datetime
now = datetime.now()
Get it in a specific format e.g.
now.strftime("%B %d %Y") # e.g. April 11 2019
Parse a string into a datetime
datetime.strptime('2019-04-11 03:25:04', '%Y-%m-%d %H:%M:%S')
pd.set_option('display.max_colwidth', None) # -1 is deprecated; None means no truncation
dict1 = dict(zip(df1.keyColumnName, df1.valueColumnName))
df.at['D', 'x'] = 10
where 'D' is the index label, 'x' is the column and 10 is the value to be set
NOTE: df.set_value('D', 'x', 10) is deprecated/removed; use .at as above
Split column into separate columns
df2 = pd.DataFrame(df['meta'].str.split().values.tolist())
By columns i.e. cbind. NOTE: ignore_index=True would renumber the column labels 0..n-1; omit it to keep the original column names
newdf = pd.concat([df1, df2], axis=1)
By rows i.e. rbind. Note : axis=0 is default and can be skipped
newdf = pd.concat([df1, df2], axis=0, ignore_index=True)
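Both concat directions on tiny made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1]})
df2 = pd.DataFrame({"a": [2]})

rows = pd.concat([df1, df2], ignore_index=True)  # rbind: stack rows
cols = pd.concat([df1, df2], axis=1)             # cbind: side by side (name 'a' repeats)
```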