@fabsta
Last active January 30, 2024 17:35
Data science

Pandas cheat sheet for data science

Table of contents generated with markdown-toc

Links

recommended-python-learning-resources: link

Statistics

Multi-variate analysis

Understand the problem.

  • Normal distribution
    	#histogram
    	sns.distplot(df_train['SalePrice']);
  • Skewness/Kurtosis?
     #skewness and kurtosis
     print("Skewness: %f" % df_train['SalePrice'].skew())
     print("Kurtosis: %f" % 	df_train['SalePrice'].kurt())```
  • Show peakedness

Univariable study

Relationship with numerical variables

#scatter plot grlivarea/saleprice
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

Relationship with categorical features

#box plot overallqual/saleprice
var = 'OverallQual'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
Multivariate study. We'll try to understand how the dependent variable and independent variables relate.

Correlation matrix (heatmap style)

#correlation matrix
corrmat = df_train.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);
# and zoomed corr matrix

k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

![Heatmap](https://csiu.github.io/blog//img/figure/2017-03-19/heatmap.png =255x)

Scatter plots between target and correlated variables

#scatterplot
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], size = 2.5)
plt.show();

![Pairplot](https://seaborn.pydata.org/_images/seaborn-pairplot-1.png =255x)

Basic cleaning. We'll clean the dataset and handle the missing data, outliers and categorical variables.

#missing data
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

Test assumptions. We'll check if our data meets the assumptions required by most multivariate techniques.
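A minimal sketch of two such checks, assuming the df_train used above and scipy being available: normality of the target via histogram plus probability plot, and a log transform to reduce positive skew.

# normality check: histogram and normal probability plot
from scipy import stats
sns.distplot(df_train['SalePrice'], fit=stats.norm)
fig = plt.figure()
stats.probplot(df_train['SalePrice'], plot=plt)
plt.show()
# positive skew is often reduced by a log transform
df_train['SalePrice_log'] = np.log1p(df_train['SalePrice'])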

Feature understanding

  • Are the features continuous, discrete or none of the above?

  • What is the distribution of this feature?

  • Does the distribution largely depend on what subset of examples is being considered?

    • Time-based segmentation?
    • Type-based segmentation?
  • Does this feature contain holes (missing values)?

    • Are those holes possible to be filled, or would they stay forever?
    • Is it possible to eliminate them in future data?
  • Are there duplicate and/or intersecting examples?

    • Answering this question right is extremely important, since duplicate or connected data points might significantly affect the results of model validation if not properly excluded.
  • Where do the features come from?

    • Should we come up with new features that prove to be useful, how hard would it be to incorporate those features in the final design?
  • Is the data real-time?

    • Are the requests real-time?
    • If yes, well-engineered simple features would likely rock.
    • If no, we likely are in the business of advanced models and algorithms.
  • Are there features that can be used as the “truth”? (A few quick checks for these questions follow the list below.)

Plots

    1. Supervised vs. unsupervised learning
    2. classification vs. regression
    3. Prediction vs. Inference
  1. Baseline Modeling
  2. Secondary Modeling
  3. Communicating Results
  4. Conclusion
  5. Resources
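A few one-liners that answer most of the feature-understanding checklist above (a sketch, assuming a generic df):

# quick feature-understanding checks
df.dtypes.value_counts()                          # continuous vs. discrete vs. object columns
df.describe(include='all')                        # basic distribution of every feature
df.isnull().mean().sort_values(ascending=False)   # share of missing values ("holes") per column
df.duplicated().sum()                             # number of fully duplicated rows
df.nunique().sort_values()                        # cardinality: spots IDs and constant columns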

Preliminaries

# import libraries (standard)
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from pandas import DataFrame, Series

Import

https://chrisyeh96.github.io/2017/08/08/definitive-guide-python-imports.html

Input Output

Input

Empty DataFrame (top)

# dataframe empty
df = DataFrame()

CSV (top)

# pandas read csv
df = pd.read_csv('file.csv')
df = pd.read_csv('file.csv', header=0, index_col=0, quotechar='"',sep=':', na_values = ['na', '-', '.', ''])
# specifying "." and "NA" as missing values in the Last Name column and "." as missing values in Pre-Test Score column
df = pd.read_csv('../data/example.csv', na_values={'Last Name': ['.', 'NA'], 'Pre-Test Score': ['.']}) 
# skipping the top 3 rows
df = pd.read_csv('../data/example.csv', na_values=sentinels, skiprows=3)
# interpreting "," in strings around numbers as thousands separators
df = pd.read_csv('../data/example.csv', thousands=',') 

# `encoding='latin1'`, `encoding='iso-8859-1'` or `encoding='cp1252'`
df = pd.read_csv('example.csv',encoding='latin1')
# pandas read csv dtypes
df = pd.read_csv('example.csv', dtype={'col1': np.int16})

CSV (Inline) (top)

# pandas read string
from io import StringIO
data = """, Animal, Cuteness, Desirable
row-1, dog, 8.7, True
row-2, cat, 9.5, True
row-3, bat, 2.6, False"""
df = pd.read_csv(StringIO(data),
	header=0, index_col=0,
	skipinitialspace=True)

JSON (top)

# pandas read json
import json
json_data = open('data-text.json').read()
data = json.loads(json_data)
for item in data:
    print(item)

XML (top)

# pandas read xml
from xml.etree import ElementTree as ET
tree = ET.parse('../../data/chp3/data-text.xml')
root = tree.getroot()
print(root)
data = root.find('Data')
all_data = []
for observation in data:
    record = {}
    for item in observation:
        lookup_key = list(item.attrib.keys())[0]
        if lookup_key == 'Numeric':
            rec_key = 'NUMERIC'
            rec_value = item.attrib['Numeric']
        else:
            rec_key = item.attrib[lookup_key]
            rec_value = item.attrib['Code']
        record[rec_key] = rec_value
    all_data.append(record)
print(all_data)

Excel (top)

# pandas read excel

# Each Excel sheet in a Python dictionary
workbook = pd.ExcelFile('file.xlsx')
d = {} # start with an empty dictionary
for sheet_name in workbook.sheet_names:
	df = workbook.parse(sheet_name)
	d[sheet_name] = df

MySQL (top)

# pandas read sql
import pymysql
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://'
+'USER:PASSWORD@HOST/DATABASE')
df = pd.read_sql_table('table', engine)

Combine DataFrame (top)

# pandas concat dataframes

# Example 1 ...
s1 = Series(range(6))
s2 = s1 * s1
s2.index = s2.index + 2# misalign indexes
df = pd.concat([s1, s2], axis=1)
# Example 2 ...
s3 = Series({'Tom':1, 'Dick':4, 'Har':9})
s4 = Series({'Tom':3, 'Dick':2, 'Mar':5})
df = pd.concat({'A':s3, 'B':s4 }, axis=1)

From Dictionary (top) default --- assume data is in columns

# pandas read dictionary 
df = DataFrame({
	'col0' : [1.0, 2.0, 3.0, 4.0],
	'col1' : [100, 200, 300, 400]
})

use helper method for data in rows

# pandas read dictionary
df = DataFrame.from_dict({ # data by row
# rows as python dictionaries
	'row0' : {'col0':0, 'col1':'A'},
	'row1' : {'col0':1, 'col1':'B'}
	}, orient='index')
df = DataFrame.from_dict({ # data by row
# rows as python lists
	'row0' : [1, 1+1j, 'A'],
	'row1' : [2, 2+2j, 'B']
	}, orient='index')

from iterations of lists

# pandas read lists
import itertools
aa = ['aa1', 'aa2', 'aa3', 'aa4', 'aa5']
bb = ['bb1', 'bb2', 'bb3', 'bb4', 'bb5']
cc = ['cc1', 'cc2', 'cc3', 'cc4', 'cc5']
lists = [aa, bb, cc]

pd.DataFrame(list(itertools.product(*lists)), columns=['aa', 'bb', 'cc'])

source: https://stackoverflow.com/questions/45672342/create-a-dataframe-of-permutations-in-pandas-from-list

Examples (top) --- simple - default integer indexes

# pandas read random
df = DataFrame(np.random.rand(50,5))

--- with a time-stamp row index:

# pandas read random timestamp
df = DataFrame(np.random.rand(500,5))
df.index = pd.date_range('1/1/2005',
periods=len(df), freq='M')

--- with alphabetic row and col indexes and a "groupable" variable

import string
import random
r = 52 # note: min r is 1; max r is 52
c = 5
df = DataFrame(np.random.randn(r, c),
	columns=['col' + str(i) for i in range(c)],
	index=list((string.ascii_uppercase + string.ascii_lowercase)[0:r]))
df['group'] = list(''.join(random.choice('abcde')
	for _ in range(r)))

Generate dataframe with 1 variable column

# pandas dataframe create
final_biom_df = final_biom_df.append([pd.DataFrame({'trial': curr_trial,
                                                    'biomarker_name': curr_biomarker,
                                                    'observation_id': curr_observation,
                                                    'visit': curr_timepoint,
                                                    'value': np.random.randint(low=1, high=100, size=30),
                                                    'unit': curr_unit,
                                                    'base': is_base,
                                                    })])

Reading files

# pandas read multiple files
import glob
files = glob.glob('weather/*.csv')
weather_dfs = [pd.read_csv(fp, names=columns) for fp in files]  # columns: list of column names
weather = pd.concat(weather_dfs)

Output

CSV (top)

df.to_csv('name.csv', encoding='utf-8')
df.to_csv('filename.csv', header=False)

Excel

from pandas import ExcelWriter
writer = ExcelWriter('filename.xlsx')
df1.to_excel(writer,'Sheet1')
df2.to_excel(writer,'Sheet2')
writer.save()

MySQL (top)

import pymysql
from sqlalchemy import create_engine
e = create_engine('mysql+pymysql://' +
	'USER:PASSWORD@HOST/DATABASE')
df.to_sql('TABLE',e, if_exists='replace')

Python object (top)

d = df.to_dict() # to dictionary
s = df.to_string() # to string (avoid shadowing the built-in str)
m = df.to_numpy() # to numpy array (as_matrix() was removed)

JSON

### orient='records'

df.to_json(r'Path to store the JSON file\File Name.json',orient='records')

[{"Product":"Desktop Computer","Price":700},{"Product":"Tablet","Price":250},{"Produc

source: https://datatofish.com/export-pandas-dataframe-json/

Exploration

pandas profiling

conda install -c anaconda pandas-profiling
import pandas as pd  
import pandas_profiling
# Deprecated: pre-2.0.0 version
df = pd.read_csv('titanic/train.csv')  
#Pandas-Profiling 2.0.0  
df.profile_report()
# save as html
profile = df.profile_report(title='Pandas Profiling Report')
profile.to_file(output_file="output.html")

example report: link

source: link

overview missing data:

# dataframe missing data
df.isna().sum()

Selecting

Summary

Select columns

# dataframe select columns

s = df['col_label']  # returns Series
df = df[['col_label']] # return DataFrame
df = df[['L1', 'L2']] # select with list
df = df[index] # select with index
df = df[s] #select with Series

Select rows

# dataframe select rows

df = df['from':'inc_to']# label slice
df = df[3:7] # integer slice
df = df[df['col'] > 0.5]# Boolean Series
df = df.loc['label'] # single label
df = df.loc[container] # lab list/Series
df = df.loc['from':'to']# inclusive slice
df = df.loc[bs] # Boolean Series
df = df.iloc[0] # single integer
df = df.iloc[container] # int list/Series
df = df.iloc[0:5] # exclusive slice
df = df.ix[x] # loc then iloc (.ix was removed in newer pandas; prefer .loc/.iloc)

Select a cross-section (top)

# dataframe select slices

# r and c can be scalar, list, slice
df.loc[r, c] # label accessor (row, col)
df.iloc[r, c]# integer accessor
df.ix[r, c] # label access int fallback
df[c].iloc[r]# chained – also for .loc

Select a cell (top)

# dataframe select cell
# r and c must be label or integer
df.at[r, c] # fast scalar label accessor
df.iat[r, c] # fast scalar int accessor
df[c].iat[r] # chained – also for .at

DataFrame indexing methods (top)

v = df.get_value(r, c) # get by row, col
df = df.set_value(r,c,v)# set by row, col
df = df.xs(key, axis) # get cross-section
df = df.filter(items, like, regex, axis)
df = df.select(crit, axis)

Some index attributes and methods (top)

# dataframe index attributes

# --- some Index attributes
b = idx.is_monotonic_decreasing
b = idx.is_monotonic_increasing
b = idx.has_duplicates
i = idx.nlevels # num of index levels
# --- some Index methods
idx = idx.astype(dtype)# change data type
b = idx.equals(o) # check for equality
idx = idx.union(o) # union of two indexes
i = idx.nunique() # number unique labels
label = idx.min() # minimum label
label = idx.max() # maximum label

Whole DataFrame

Content/Structure

# dataframe get info

df.info() # index & data types
dfh = df.head(i) # get first i rows
dft = df.tail(i) # get last i rows
dfs = df.describe() # summary stats cols
top_left_corner_df = df.iloc[:4, :4]

Non-indexing attributes (top)

# dataframe non-indexing methods

dfT = df.T # transpose rows and cols
l = df.axes # list row and col indexes
(r, c) = df.axes # from above
s = df.dtypes # Series column data types
b = df.empty # True for empty DataFrame
i = df.ndim # number of axes (it is 2)
t = df.shape # (row-count, column-count)
i = df.size # row-count * column-count
a = df.values # get a numpy array for df

Utilities - DataFrame utility methods (top)

# dataframe sort

df = df.copy() # dataframe copy
df = df.rank() # rank each col (default)
df = df.sort_values(['sales'], ascending=[False])
df = df.sort_values(by=col)
df = df.sort_values(by=[col1, col2])
df = df.sort_index()
df = df.astype(dtype) # type conversion

Iterations (top)

# dataframe iterate for

df.iteritems()# (col-index, Series) pairs
df.iterrows() # (row-index, Series) pairs
# example ... iterating over columns
for (name, series) in df.iteritems():
	print('Col name: ' + str(name))
	print('First value: ' +
		str(series.iat[0]) + '\n')

Maths (top)

# dataframe math

df = df.abs() # absolute values
df = df.add(o) # add df, Series or value
s = df.count() # non NA/null values
df = df.cummax() # (cols default axis)
df = df.cummin() # (cols default axis)
df = df.cumsum() # (cols default axis)
df = df.diff() # 1st diff (col def axis)
df = df.div(o) # div by df, Series, value
df = df.dot(o) # matrix dot product
s = df.max() # max of axis (col def)
s = df.mean() # mean (col default axis)
s = df.median()# median (col default)
s = df.min() # min of axis (col def)
df = df.mul(o) # mul by df Series val
s = df.sum() # sum axis (cols default)
df = df.where(df > 0.5, other=np.nan)

Select/filter (top)

# dataframe select filter
df = df.filter(items=['a', 'b']) # by col
df = df.filter(items=[5], axis=0) #by row
df = df.filter(like='x') # keep x in col
df = df.filter(regex='x') # regex in col
df = df.select(lambda x: not x%5)#5th rows

Columns

Index and labels (top)

# dataframe get index
idx = df.columns # get col index
label = df.columns[0] # first col label
l = df.columns.tolist() # list col labels

Data type conversions (top)

# dataframe convert column

st = df['col'].astype(str)# Series dtype
a = df['col'].values # numpy array
pl = df['col'].tolist() # python list

Note: useful dtypes for Series conversion: int, float, str

Common column-wide methods/attributes (top)

value = df['col'].dtype # type of column
value = df['col'].size # col dimensions
value = df['col'].count()# non-NA count
value = df['col'].sum()
value = df['col'].prod()
value = df['col'].min() # column min
value = df['col'].max() # column max
value = df['col'].mean() # also median()
value = df['col'].cov(df['col2'])
s = df['col'].describe()
s = df['col'].value_counts()

Find index label for min/max values in column (top)

label = df['col1'].idxmin()
label = df['col1'].idxmax()

Common column element-wise methods (top)

s = df['col'].isnull()
s = df['col'].notnull() # not isnull()
s = df['col'].astype(float)
s = df['col'].abs()
s = df['col'].round(decimals=0)
s = df['col'].diff(periods=1)
s = df['col'].shift(periods=1)
s = pd.to_datetime(df['col']) # pandas convert datetime
s = df['col'].fillna(0) # replace NaN w 0
s = df['col'].cumsum()
s = df['col'].cumprod()
s = df['col'].pct_change(periods=4)
s = df['col'].rolling(window=4).sum()

Note: also .rolling(...).min(), .rolling(...).max(), and many more (the old rolling_* functions were removed in newer pandas).

Position of a column index label (top)

j = df.columns.get_loc('col_name')

Column index values unique/monotonic (top)

if df.columns.is_unique: pass # ...
b = df.columns.is_monotonic_increasing
b = df.columns.is_monotonic_decreasing

Selecting

Columns (top)

s = df['colName'] # select col to Series
df = df[['colName']] # select col to df
df = df[['a','b']] # select 2 or more
df = df[['c','a','b']]# change col order
s = df[df.columns[0]] # select by number
df = df[df.columns[[0, 3, 4]]] # by number
s = df.pop('c') # get col & drop from df

Columns with Python attributes (top)

s = df.a # same as s = df['a']
# cannot create new columns by attribute
df.existing_column = df.a / df.b
df['new_column'] = df.a / df.b

Selecting columns with .loc, .iloc and .ix (top)

df = df.loc[:, 'col1':'col2'] # inclusive
df = df.iloc[:, 0:2] # exclusive

Conditional selection (top)

df.query('A > C')
df.query('A > 0') 
df.query('A > 0 & A < 1')
df.query('A > B | A > C')
df[df['coverage'] > 50] # all rows where coverage is more than 50
df[(df['deaths'] > 500) | (df['deaths'] < 50)]
df[(df['score'] > 1) & (df['score'] < 5)]
df[~(df['regiment'] == 'Dragoons')] # Select all the regiments not named "Dragoons"
df[df['age'].notnull() & df['sex'].notnull()]  # ignore the missing data points

(top)

# is in 
df[df.name.isin(value_list)]   # value_list = ['Tina', 'Molly', 'Jason']
df[~df.name.isin(value_list)]

Partial matching (top)

# column contains
df2[df2.E.str.contains("tw|ou")]
# column contains regex 
df['raw'].str.contains('....-..-..', regex=True)  # regex

# dataframe column list contains
selection = ['cat', 'dog']
df[pd.DataFrame(df.species.tolist()).isin(selection).any(1)]
Out[64]: 
  molecule            species
0        a              [dog]
2        c         [cat, dog]
3        d  [cat, horse, pig]
# dataframe column rename 
df.rename(columns={'old1':'new1','old2':'new2'}, inplace=True)
df.columns = ['a', 'b']

Manipulating

Adding (top)

df['new_col'] = range(len(df))
df['new_col'] = np.repeat(np.nan,len(df))
df['random'] = np.random.rand(len(df))
df['index_as_col'] = df.index
df1[['b','c']] = df2[['e','f']]
df3 = df1.append(other=df2)

Vectorised arithmetic on columns (top)

df['proportion']=df['count']/df['total']
df['percent'] = df['proportion'] * 100.0

Append a column of row sums to a DataFrame (top)

df['Total'] = df.sum(axis=1)

Apply numpy mathematical functions to columns (top)

df['log_data'] = np.log(df['col1'])

Set column values set based on criteria (top)

df['b']=df['a'].where(df['a']>0,other=0)
df['d']=df['a'].where(df.b!=0,other=df.c)

Swapping (top)

df[['B', 'A']] = df[['A', 'B']]

Dropping (top)

df = df.drop('col1', axis=1)
df.drop('col1', axis=1, inplace=True)
df = df.drop(['col1','col2'], axis=1)
s = df.pop('col') # drops from frame
del df['col'] # even classic python works
df.drop(df.columns[0], inplace=True)

# drop columns with column names where the first three letters of the column names was 'pre'
cols = [c for c in df.columns if c.lower()[:3] != 'pre']
df=df[cols]

Multiply every column in DataFrame by Series (top)

df = df.mul(s, axis=0) # on matched rows

Rows

Get Position (top)

a = np.where(df['col'] >= 2) #numpy array

DataFrames have same row index (top)

len(a) == len(b) and all(a.index == b.index)

# Get the integer position of a row or col index label
i = df.index.get_loc('row_label')

Row index values are unique/monotonic (top)

if df.index.is_unique: pass # ...
b = df.index.is_monotonic_increasing
b = df.index.is_monotonic_decreasing

Get the row index and labels (top)

idx = df.index      # get row index
label = df.index[0]    # 1st row label
lst = df.index.tolist()     # get as a list

Change the (row) index (top)

df.index = idx     # new ad hoc index
df = df.set_index('A')    # col A new index
df = df.set_index(['A', 'B'])    # MultiIndex
df = df.reset_index()   # replace old w new
df.index = range(len(df)) # set with list
df = df.reindex(index=range(len(df)))
df = df.set_index(keys=['r1','r2','etc'])
df.rename(index={'old':'new'},inplace=True)

Selecting

By column values (top)

df = df[df['col2'] >= 0.0]
df = df[(df['col3']>=1.0) | (df['col1']<0.0)]
df = df[df['col'].isin([1,2,5,7,11])]
df = df[~df['col'].isin([1,2,5,7,11])]
df = df[df['col'].str.contains('hello')]

Using isin over multiple columns (top)

## fake up some data
data = {1:[1,2,3], 2:[1,4,9], 3:[1,8,27]}
df = DataFrame(data)
# multi-column isin
lf = {1:[1, 3], 3:[8, 27]} # look for
f = df[df[list(lf)].isin(lf).all(axis=1)]
# Selecting rows using an index
idx = df[df['col'] >= 2].index
print(df.loc[idx])

Slice of rows by integer position (top)

# [inclusive-from : exclusive-to [: step]]
# default start is 0; default end is len(df)
df = df[:] # copy DataFrame
df = df[0:2] # rows 0 and 1
df = df[-1:] # the last row
df = df[2:3] # row 2 (the third row)
df = df[:-1] # all but the last row
df = df[::2] # every 2nd row (0 2 ..)

Slice of rows by label/index (top)

# [inclusive-from : inclusive-to [: step]]
df = df['a':'c'] # rows 'a' through 'c'

Manipulating

Adding rows

df = original_df.append(more_rows_in_df)

Append a row of column totals to a DataFrame (top)

# Option 1: use dictionary comprehension
sums = {col: df[col].sum() for col in df}
sums_df = DataFrame(sums,index=['Total'])
df = df.append(sums_df)
# Option 2: All done with pandas
df = df.append(DataFrame(df.sum(),
				columns=['Total']).T)

Dropping rows (by name) (top)

df = df.drop('row_label')
df = df.drop(['row1','row2']) # multi-row

Drop duplicates in the row index (top)

df['index'] = df.index # 1 create new col
df = df.drop_duplicates(subset='index', keep='last') # 2 use new col
del df['index'] # 3 del the col
df.sort_index(inplace=True)# 4 tidy up

Iterating over DataFrame rows (top)

for (index, row) in df.iterrows(): pass

Sorting

Rows values (top)

df = df.sort_values(df.columns[0], ascending=False)
df.sort_values(['col1', 'col2'], inplace=True)

By row index (top)

df.sort_index(inplace=True) # sort by row
df = df.sort_index(ascending=False)

Random (top)

import random as r
k = 20 # pick a number
selection = r.sample(range(len(df)), k)
df_sample = df.iloc[selection, :]
df.take(np.random.permutation(len(df))[:3])

Cells

Selecting

By row and column (top)

value = df.at['row', 'col']
value = df.loc['row', 'col']
value = df['col'].at['row'] # tricky

Note: .at[] fastest label based scalar lookup
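A quick way to see the difference (a sketch; assumes a df with a 'row' label and a 'col' column, exact timings vary):

%timeit df.loc['row', 'col']   # general label indexer
%timeit df.at['row', 'col']    # scalar-only accessor, typically faster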

By integer position (top)

value = df.iat[9, 3] # [row, col]
value = df.iloc[0, 0] # [row, col]
value = df.iloc[len(df)-1,
len(df.columns)-1]

Slice by labels (top)

df = df.loc['row1':'row3', 'col1':'col3']

Slice by Integer Position (top)

df = df.iloc[2:4, 2:4] # subset of the df
df = df.iloc[:5, :5] # top left corner
s = df.iloc[5, :] # returns row as Series
df = df.iloc[5:6, :] # returns row as row

By label and/or Index (top)

value = df.ix[5, 'col1']
df = df.ix[1:5, 'col1':'col3']

Manipulating

Setting a cell by row and column labels (top)

# pandas update 

df.at['row', 'col'] = value
df.loc['row', 'col'] = value
df['col'].at['row'] = value # tricky

Setting a cross-section by labels

df.loc['A':'C', 'col1':'col3'] = np.nan
df.loc[1:2,'col1':'col2']=np.zeros((2,2))
df.loc[1:2,'A':'C']=othr.loc[1:2,'A':'C']

Setting cell by integer position

df.iloc[0, 0] = value # [row, col]
df.iat[7, 8] = value

Setting cell range by integer position

df.iloc[0:3, 0:5] = value
df.iloc[1:3, 1:4] = np.ones((2, 3))
df.iloc[1:3, 1:4] = np.zeros((2, 3))
df.iloc[1:3, 1:4] = np.array([[1, 1, 1],[2, 2, 2]])

Data wrangling

Merge Join

More examples: https://www.geeksforgeeks.org/python-pandas-merging-joining-and-concatenating/


(top)

Three ways to join two DataFrames:

  • merge (a database/SQL-like join operation)
  • concat (stack side by side or one on top of the other)
  • combine_first (splice the two together, choosing values from one over the other)
# pandas merge

# Merge on indexes
df_new = pd.merge(left=df1, right=df2, how='outer', left_index=True, right_index=True)

# Merge on columns  
df_new = pd.merge(left=df1, right=df2, how='left', left_on='col1', right_on='col2')

# Join on indexes (another way of merging)
df_new = df1.join(other=df2, on='col1',how='outer')
df_new = df1.join(other=df2,on=['a','b'],how='outer')

# Simple concatenation is often the best
# pandas concat
df=pd.concat([df1,df2],axis=0)#top/bottom
df = df1.append([df2, df3]) #top/bottom
df=pd.concat([df1,df2],axis=1)#left/right

# Combine_first (top)
df = df1.combine_first(other=df2)
# multi-combine with python reduce()
from functools import reduce
df = reduce(lambda x, y: x.combine_first(y),
            [df1, df2, df3, df4, df5])

(top)

GroupBy

# pandas groupby
# Grouping
gb = df.groupby('cat') # by one columns
gb = df.groupby(['c1','c2']) # by 2 cols
gb = df.groupby(level=0) # multi-index gb
gb = df.groupby(level=['a','b']) # mi gb
print(gb.groups)

# Iterating groups – usually not needed 
# pandas groupby iterate
for name, group in gb:
	print (name)
	print (group)

# Selecting a group (top)
dfa = df.groupby('cat').get_group('a')
dfb = df.groupby('cat').get_group('b')

# pandas groupby aggregate
# Applying an aggregating function 
# apply to a column ...
s = df.groupby('cat')['col1'].sum()
s = df.groupby('cat')['col1'].agg(np.sum)

# apply to the every column in DataFrame
s = df.groupby('cat').agg(np.sum)
df_summary = df.groupby('cat').describe()
df_row_1s = df.groupby('cat').head(1)

# Applying multiple aggregating functions 
gb = df.groupby('cat')
# apply multiple functions to one column
dfx = gb['col2'].agg([np.sum, np.mean])
# apply to multiple fns to multiple cols
dfy = gb.agg({
	'cat': np.count_nonzero,
	'col1': [np.sum, np.mean, np.std],
	'col2': [np.min, np.max]
})
# Note: gb['col2'] above is shorthand for
# df.groupby('cat')['col2'], without the need for regrouping.

# Transforming functions (top)

# pandas groupby function

# transform to group z-scores, which have
# a group mean of 0, and a std dev of 1.
zscore = lambda x: (x-x.mean())/x.std()
dfz = df.groupby('cat').transform(zscore)

# pandas groupby fillna

# replace missing data with group mean
mean_r = lambda x: x.fillna(x.mean())
dfm = df.groupby('cat').transform(mean_r)

# Applying filtering functions (top)
# select groups with more than 10 members
eleven = lambda x: (len(x['col1']) >= 11)
df11 = df.groupby('cat').filter(eleven)

# Group by a row index (non-hierarchical index) 
df = df.set_index(keys='cat')
s = df.groupby(level=0)['col1'].sum()
dfg = df.groupby(level=0).sum()

(top)

Dates

# pandas timestamp
# Dates and time – points and spans
t = pd.Timestamp('2013-01-01')
t = pd.Timestamp('2013-01-01 21:15:06')
t = pd.Timestamp('2013-01-01 21:15:06.7')
p = pd.Period('2013-01-01', freq='M')

# pandas time series

# A Series of Timestamps or Periods 
ts = ['2015-04-01 13:17:27', '2014-04-02 13:17:29']
# Series of Timestamps (good)
s = pd.to_datetime(pd.Series(ts))
# Series of Periods (often not so good)
s = pd.Series( [pd.Period(x, freq='M') for x in ts] )
s = pd.Series(pd.PeriodIndex(ts,freq='S'))

# From non-standard strings to Timestamps
t = ['09:08:55.7654-JAN092002', '15:42:02.6589-FEB082016']
s = pd.Series(pd.to_datetime(t, format="%H:%M:%S.%f-%b%d%Y"))

# Dates and time – stamps and spans as indexes

# pandas time periods
date_strs = ['2014-01-01', '2014-04-01','2014-07-01', '2014-10-01']
dti = pd.DatetimeIndex(date_strs)
pid = pd.PeriodIndex(date_strs, freq='D')
pim = pd.PeriodIndex(date_strs, freq='M')
piq = pd.PeriodIndex(date_strs, freq='Q')
print (pid[1] - pid[0]) # 90 days
print (pim[1] - pim[0]) # 3 months
print (piq[1] - piq[0]) # 1 quarter
time_strs = ['2015-01-01 02:10:40.12345',
'2015-01-01 02:10:50.67890']
pis = pd.PeriodIndex(time_strs, freq='U')
df.index = pd.period_range('2015-01',
periods=len(df), freq='M')
dti = pd.to_datetime(['04-01-2012'],
dayfirst=True) # Australian date format
pi = pd.period_range('1960-01-01','2015-12-31', freq='M')
# Hint: unless you are working at sub-second resolution, prefer PeriodIndex over DatetimeIndex.
# pandas converting times
# From DatetimeIndex to Python datetime objects (top)

dti = pd.DatetimeIndex(pd.date_range(
start='1/1/2011', periods=4, freq='M'))
s = Series([1,2,3,4], index=dti)
na = dti.to_pydatetime() #numpy array
na = s.index.to_pydatetime() #numpy array

# From Timestamps to Python dates or times
df['date'] = [x.date() for x in df['TS']]
df['time'] = [x.time() for x in df['TS']]

# From DatetimeIndex to PeriodIndex and back
df = DataFrame(np.random.randn(20,3))
df.index = pd.date_range('2015-01-01', periods=len(df), freq='M')
dfp = df.to_period(freq='M')
dft = dfp.to_timestamp()

# Working with a PeriodIndex (top)
pi = pd.period_range('1960-01','2015-12',freq='M')
na = pi.values # numpy array of integers
lp = pi.tolist() # python list of Periods
sp = Series(pi)# pandas Series of Periods
ss = Series(pi).astype(str) # S of strs
ls = Series(pi).astype(str).tolist()

# Get a range of Timestamps
dr = pd.date_range('2013-01-01', '2013-12-31', freq='D')

# Error handling with dates
# 1st example raises an error for the invalid date
t = pd.to_datetime('2014-02-30')
# 2nd example returns NaT (not a time)
t = pd.to_datetime('2014-02-30', errors='coerce')
# NaT like NaN tests True for isnull()
b = pd.isnull(t) # --> True

# The tail of a time-series DataFrame (top)
df = df.last("5M") # the last five months

Upsampling and downsampling

# pandas upsample pandas downsample
# upsample from quarterly to monthly
pi = pd.period_range('1960Q1', periods=220, freq='Q')
df = DataFrame(np.random.rand(len(pi),5),
index=pi)
# upsampling needs ffill or bfill to fill the new rows with values
dfm = df.resample('M', convention='end').ffill()
# downsample from monthly to quarterly
dfq = dfm.resample('Q').sum()

Time zones

# pandas time zones

t = ['2015-06-30 00:00:00','2015-12-31 00:00:00']
dti = pd.to_datetime(t).tz_localize('Australia/Canberra')
dti = dti.tz_convert('UTC')
ts = pd.Timestamp('now',
tz='Europe/London')
# get a list of all time zones
import pytz
for tz in pytz.all_timezones:
	print(tz)
# Note: by default, Timestamps are created without timezone information.


# Row selection with a time-series index
# start with the play data above
idx = pd.period_range('2015-01',
periods=len(df), freq='M')
df.index = idx
february_selector = (df.index.month == 2)
february_data = df[february_selector]
q1_data = df[(df.index.month >= 1) & (df.index.month <= 3)]
mayornov_data = df[(df.index.month == 5) | (df.index.month == 11)]
totals = df.groupby(df.index.year).sum()

# The Series.dt accessor attribute
t = ['2012-04-14 04:06:56.307000', '2011-05-14 06:14:24.457000', '2010-06-14 08:23:07.520000']
# a Series of time stamps
s = pd.Series(pd.to_datetime(t))
print(s.dtype) # datetime64[ns]
print(s.dt.second) # 56, 24, 7
print(s.dt.month) # 4, 5, 6
# a Series of time periods
s = pd.Series(pd.PeriodIndex(t,freq='Q'))
print(s.dtype) # period[Q-DEC]
print(s.dt.quarter) # 2, 2, 2
print(s.dt.year) # 2012, 2011, 2010

Missing data

good overview: https://towardsdatascience.com/working-with-missing-values-in-pandas-5da45d16e74

Missing data in a Series (top)

# pandas missing data series
s = Series( [8,None,float('nan'),np.nan])
#[8, NaN, NaN, NaN]
s.isnull() #[False, True, True, True]
s.notnull()#[True, False, False, False]
s.fillna(0)#[8, 0, 0, 0]
# pandas missing data dataframe
df = df.dropna() # drop all rows with NaN
df = df.dropna(axis=1) # same for cols
df=df.dropna(how='all') #drop all NaN row
df=df.dropna(thresh=2) # drop 2+ NaN in r
# only drop row if NaN in a specified col
df = df.dropna(subset=['col'])

Recoding/Replacing missing data

# pandas fillna recoding replacing
df.fillna(0, inplace=True) # np.nan -> 0
s = df['col'].fillna(0) # np.nan -> 0
df = df.replace(r'\s+', np.nan,regex=True) # white space -> np.nan
# Non-finite numbers
s = Series([float('inf'), float('-inf'),np.inf, -np.inf])
# Testing for finite numbers (top)
b = np.isfinite(s)

Checking missing data

assert all(~df.col.isna()) # no NAs

def has_symbol(df):
    return ~df.symbol.isna()
    
def test_no_na_cells(df, cols=None):
    cols=cols if cols else df.columns.tolist()
    print('===== Testing no NaN cells in dataframe =====')
    print('Columns:', cols)
    nan_cols=(
        df[cols].isnull().any()
        .to_frame('has_nan')
        .query('has_nan==True'))
    
    assert nan_cols.shape[0] == 0, f"Some columns have nan values:\n{nan_cols}"
    print(' => PASSED')
    return df


def test_no_empty_str(df, cols=None):
    cols=cols if cols else df.columns.tolist()
    print('===== Testing no "" cells in dataframe =====')
    print('Columns:', cols)
    cols_not_empty = (
        pd.DataFrame(np.where(
            df[cols].applymap(lambda x: x == ''), False, True),
            columns=cols).all()
        .to_frame('has_empty_str')
        .query('has_empty_str == False'))

    assert cols_not_empty.shape[0]==0, f"Some columns have empty string values:\n{cols_not_empty}"
    print(' => PASSED')
    return df

  
def test_no_duplicated_values(df, cols=None):
    cols=cols if cols else df.columns.tolist()
    print('===== Testing no duplicated values in dataframe =====')
    print('Columns:', cols)
    df2=df[cols]
    assert df2[df2.duplicated()].shape[0] == 0, \
        f'Some rows are duplicated:\n{df2[df2.duplicated()].head()}'
    print(' => PASSED')
    return df

(top)

Categorical Data

# pandas categorical data 
s = Series(['a','b','a','c','b','d','a'],
dtype='category')
df['B'] = df['A'].astype('category')

# Convert back to the original data type
s = Series(['a','b','a','c','b','d','a'], dtype='category')
s = s.astype('string')

# Ordering, reordering and sorting
s = Series(list('abc'), dtype='category')
print (s.cat.ordered)
s=s.cat.reorder_categories(['b','c','a'])
s = s.sort_values()
s = s.cat.as_unordered()

# Renaming categories (top)
s = Series(list('abc'), dtype='category')
s.cat.categories = [1, 2, 3] # in place
s = s.cat.rename_categories([4,5,6])
# using a comprehension ...
s.cat.categories = ['Group ' + str(i)
	for i in s.cat.categories]

# Adding new categories (top)
s = s.cat.add_categories([4])

# Removing categories (top)
s = s.cat.remove_categories([4])
s = s.cat.remove_unused_categories() # not in place

(top)

Manipulations and Cleaning

Conversions

# pandas convert to numeric
## errors='ignore' returns the input unchanged on parse errors
## errors='coerce' converts unparseable values to np.nan
 
## mess up data
invoices.loc[45612,'Meal Price'] = 'I am causing trouble'  
invoices.loc[35612,'Meal Price'] = 'Me too'

# check if conversion worked
invoices['Meal Price'].apply(lambda x:type(x)).value_counts()
# OUT:
# <class 'int'> 49972
# <class 'str'> 2

# identify the offending lines
invoices['Meal Price'][invoices['Meal Price'].apply(  
lambda x: isinstance(x,str)  )]
# convert messy numerical data

## convert the offending values into np.nan
invoices['Meal Price'] = pd.to_numeric(invoices['Meal Price'], errors='coerce')

## fill np.nan with the median of the data
invoices['Meal Price'] = invoices['Meal Price'].fillna(invoices['Meal Price'].median())

## convert the column into integer
invoices['Meal Price'] = invoices['Meal Price'].astype(int)
# pandas convert to datetime to_datetime
print(pd.to_datetime('2019-8-1'))  
print(pd.to_datetime('2019/8/1'))  
print(pd.to_datetime('8/1/2019'))  
print(pd.to_datetime('Aug, 1 2019'))  
print(pd.to_datetime('Aug - 1 2019'))  
print(pd.to_datetime('August - 1 2019'))  
print(pd.to_datetime('2019, August - 1'))  
print(pd.to_datetime('20190108'))

source: https://towardsdatascience.com/learn-advanced-features-for-pythons-main-data-analysis-library-in-20-minutes-d0eedd90d086

Method chaining

https://towardsdatascience.com/the-unreasonable-effectiveness-of-method-chaining-in-pandas-15c2109e3c69

R's dplyr code to python: gist https://stmorse.github.io/journal/tidyverse-style-pandas.html

# chain pipe snap
def csnap(df, fn=lambda x: x.shape, msg=None):
    """Custom helper function to print things in method chaining.
    Returns the df so it can be used further in the chain.
    """
    if msg:
        print(msg)
    display(fn(df))
    return df

(
wine.pipe(csnap)
.rename(columns={"color_intensity": "ci"})
.assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
.pipe(csnap)
.query("alcohol > 14")
.pipe(csnap, lambda df: df.head(), msg="After")
.sort_values("alcohol", ascending=False)
.reset_index(drop=True)
.loc[:, ["alcohol", "ci", "hue"]]
.pipe(csnap, lambda x: x.sample(5))
)

# chain filter
def cfilter(df, fn, axis="rows"):
    """ Custom Filters based on a condition and returns the df.
        function - a lambda function that returns a binary vector
        thats similar in shape to the dataframe
        axis = rows or columns to be filtered.
        A single level indexing
    """
    if axis == "rows":
        return df[fn(df)]
    elif axis == "columns":
        return df.iloc[:, fn(df)]
    
(
    iris.pipe(
        setcols,  # setcols: a custom helper (not a pandas method) that sets df.columns via fn
        fn=lambda x: x.columns.str.lower()
        .str.replace(r"\(cm\)", "")
        .str.strip()
        .str.replace(" ", "_"),
    ).pipe(cfilter, lambda x: x.columns.str.contains("sepal"), axis="columns")
)

overview of single methods

.fillna(0)
.dropna()
.rename(columns=str.lower)
.assign(fl_date=lambda x: pd.to_datetime(x['fl_date'])) # chain assign datetime
.assign(hour=lambda x: x.dep_time.dt.hour)
.assign(**{
        'ID':'', '% of trials':'', 'Signif.':'','Irrelevant':'','Mapping Issue':'','Drug':'','Entity (NCI)':'','Method (NCI)':'','Context / Further Attributes':'','Comment':'' # add blank columns
        })
.sort_values("alcohol", ascending=False)
.loc[df['unique_carrier'].isin(df['unique_carrier'].value_counts().index[:5])]
.drop('unnamed: 36', axis=1)

For string columns, .fillna('') will do.

using functions

is_certain_value = lambda df : df.entity == certain_value

more info about function checks: https://github.com/engarde-dev/engarde/blob/master/engarde/checks.py

good explanation: https://tomaugspurger.github.io/method-chaining.html

  • assign (0.16.0): For adding new columns to a DataFrame in a chain (inspired by dplyr's mutate)
  • pipe (0.16.2): For including user-defined methods in method chains.
  • rename (0.18.0): For altering axis names (in additional to changing the actual labels as before).
  • Window methods (0.18): Took the top-level pd.rolling_* and pd.expanding_* functions and made them NDFrame methods with a groupby-like API.
  • Resample (0.18.0) Added a new groupby-like API
  • .where/mask/Indexers accept Callables (0.18.1): In the next release you'll be able to pass a callable to the indexing methods, to be evaluated within the DataFrame's context (like .query, but with code instead of strings).
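A short chain combining several of the features listed above (a sketch; fl_date and dep_delay are made-up column names):

flights = (
    df
    .rename(columns=str.lower)                                          # rename axis labels in the chain
    .assign(fl_date=lambda x: pd.to_datetime(x['fl_date']))             # assign derived columns
    .pipe(lambda d: d[d['dep_delay'].notnull()])                        # pipe a user-defined step
    .assign(rolling_delay=lambda x: x['dep_delay'].rolling(7).mean())   # window method via .rolling()
    .loc[lambda d: d['dep_delay'] > 0]                                  # indexers accept callables
)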

https://www.quora.com/I-love-the-flexibility-of-pandas-dataframes-but-I-feel-like-they-can-make-code-harder-to-read-and-maintain-What-are-some-pandas-best-practices-that-address-this-issue

https://github.com/HerveMignot/PyParis2018/blob/master/Modern%20Pandas%20at%20PyParis.ipynb

Tidyverse vs pandas: link

Template for reading new file

df = (pd.read_csv('file.csv')
        .pipe(assert_correct_format))

Things to be tested when reading in a df: the number of columns is correct, the dtypes are correct, and a given set of entries is present (e.g. for each name in bm_names_to_be_there, assert df[df[''] == ''].shape[0]); see the sketch below.
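A hedged sketch of such a validation step, usable with .pipe as in the template above (the column names and expected entries are placeholders):

def assert_correct_format(df,
                          expected_cols=('trial', 'biomarker_name', 'value'),   # placeholder columns
                          expected_names=('bm_a', 'bm_b')):                     # placeholder entries
    assert list(df.columns) == list(expected_cols), 'unexpected columns'        # column number/names correct
    assert df['value'].dtype.kind == 'f', 'value should be a float column'      # dtypes correct
    for name in expected_names:                                                 # expected entries present
        assert (df['biomarker_name'] == name).any(), f'missing entry: {name}'
    return df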

(top)

Binning

# binning
pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, labels=["bad", "medium", "good"])
[bad, good, medium, medium, good, bad]
Categories (3, object): [bad < medium < good]

# binning into custom intervals
bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1,2,3,4,5,6]
df['binned'] = pd.cut(df['percentage'], bins=bins, labels=labels)
print (df)
   percentage binned
0       46.50      5
1       44.20      5
2      100.00      6
3       42.12      5

source

Clipping (top)

# removing outlier
df.clip(lower=pd.Series({'A': 2.5, 'B': 4.5}), axis=1)

Outlier removal
q = df["col"].quantile(0.99)
df[df["col"] < q]

#or 

df = pd.DataFrame(np.random.randn(100, 3))
from scipy import stats
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

source

# extract a four-digit year from messy values like:
# 1929    1839, 38-54
# 2836        [1897?]
regex = r'^(\d{4})'
extr = df['Date of Publication'].str.extract(regex, expand=False)
df['Date of Publication'] = pd.to_numeric(extr)

# np.where (london and oxford are boolean masks, pub is the raw publication column)
df['Place of Publication'] = np.where(london, 'London',
	 np.where(oxford, 'Oxford', pub.str.replace('-', ' ')))

# columns to dictionary
master_dict = dict(df.drop_duplicates(subset="term")[["term","uid"]].values.tolist())

pivoting table https://stackoverflow.com/questions/47152691/how-to-pivot-a-dataframe
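A minimal pivot_table example (a sketch with made-up columns):

# pandas pivot_table
df = pd.DataFrame({'city': ['A', 'A', 'B', 'B'],
                   'year': [2019, 2020, 2019, 2020],
                   'sales': [10, 12, 7, 9]})
wide = df.pivot_table(index='city', columns='year', values='sales', aggfunc='sum')
# year  2019  2020
# city
# A       10    12
# B        7     9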

replace with map

d = {'apple': 1, 'peach': 6, 'watermelon': 4, 'grapes': 5, 'orange': 2,'banana': 3}

df["fruit_tag"] = df["fruit_tag"].map(d)

regex matching groups https://stackoverflow.com/questions/2554185/match-groups-in-python

import re

mult = re.compile('(two|2) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')
s = 'two hispanic males, 2 hispanic females'
mult.sub(r'\g<race> \g<gender>, \g<race> \g<gender>', s)
# 'hispanic male, hispanic male, hispanic female, hispanic female'

source

(top)

test if a variable is a string

isinstance(s, str)

apply function to column

df['a'] = df['a'].apply(lambda x: x + 1)

exploding a column

df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1}, {'var1': 'd,e,f', 'var2': 2}]) 
df.assign(var1=df.var1.str.split(',')).explode('var1')

(top)

Performance

Reshaping dataframe

The similarity between melt and stack: blog post
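A tiny example of the two (a sketch): melt works on value columns, stack works on the column index, and both produce a long format.

# wide to long: melt vs. stack
wide = pd.DataFrame({'id': [1, 2], 'jan': [10, 20], 'feb': [30, 40]})
long_melt = wide.melt(id_vars='id', var_name='month', value_name='value')
long_stack = (wide.set_index('id')
                  .stack()
                  .reset_index()
                  .rename(columns={'level_1': 'month', 0: 'value'}))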

sorting dataframe

df = pd.read_csv("data/347136217_T_ONTIME.csv")
delays = df['DEP_DELAY']

# Select the 5 largest delays
delays.nlargest(5).sort_values()
%timeit delays.sort_values().tail(5)
31 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit delays.nlargest(5).sort_values()
7.87 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

check memory usage:

c = s.astype('category')
print('{:0.2f} KB'.format(c.memory_usage(index=False) / 1000))

(top)

Concat vs. append

source

fast append via list of dictionaries:

rows_list = []
for row in input_rows:

        dict1 = {}
        # get input row in dictionary format
        # key = col_name
        dict1.update(blah..) 

        rows_list.append(dict1)

df = pd.DataFrame(rows_list)        

source: link

alternatives

#Append
def f1():
	result = df
	for i in range(9):
		result = result.append(df)
	return result
	
# Concat
def f2():
	result = []
	for i in range(10):
		result.append(df)
	return pd.concat(result)

In [101]: %timeit f1()
1 loops, best of 3: 1.66 s per loop

In [102]: %timeit f2()
1 loops, best of 3: 220 ms per loop
timings = (pd.DataFrame({"Append": t_append, "Concat": t_concat})
             .stack()
             .reset_index()
             .rename(columns={0: 'Time (s)',
                              'level_1': 'Method'}))
timings.head()

(top)

Dataframe: iterate rows

Useful links

  • how-to-iterate-over-rows-in-a-dataframe-in-pandas: link
  • how-to-make-your-pandas-loop-71-803-times-faster: link
  • example of bringing down runtime: iterrows, iloc, get_value, apply: link
  • complex example using haversine_looping: link, jupyter notebook
  • different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison: link
  • pandas performance tweaks: cython, using numba

  1. Vectorization
  2. Cython routines
  3. List Comprehensions (vanilla for loop)
  4. DataFrame.apply(): i) Reductions that can be performed in cython, ii) Iteration in python space
  5. DataFrame.itertuples() and iteritems()
  6. DataFrame.iterrows()
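A small comparison of options 1, 5 and 6 above (a sketch; %timeit numbers will vary):

df = pd.DataFrame(np.random.rand(100000, 2), columns=['a', 'b'])
%timeit df['a'] + df['b']                                   # 1. vectorised
%timeit [row.a + row.b for row in df.itertuples()]          # 5. itertuples
%timeit [row['a'] + row['b'] for _, row in df.iterrows()]   # 6. iterrows (slowest)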

(top)

Profiling book chapter from jakevdp: link

%timeit sum(range(100)) # single line

%timeit np.arange(4)[pd.Series([1, 2, 3])]
%timeit np.arange(4)[pd.Series([1, 2, 3]).values]
111 µs ± 2.25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
61.1 µs ± 2.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


%%timeit # full cell
total = 0
for i in range(1000):
    for j in range(1000):
        total += i * (-1) ** j

# profiling
def sum_of_lists(N):
    total = 0
    for i in range(5):
        L = [j ^ (j >> i) for j in range(N)]
        total += sum(L)
    return total
%prun sum_of_lists(1000000)

%load_ext line_profiler
%lprun -f sum_of_lists sum_of_lists(5000)

# memory usage
%load_ext memory_profiler
%memit sum_of_lists(1000000)

performance plots (notebook link):

import perfplot  
import pandas as pd
import numpy as np

perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B']),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]]
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N'
)

(top)

list comprehension

# iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df['col']]
# iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df['col1'], df['col2'])]
# iterating over multiple columns
result = [f(row[0], ..., row[n]) for row in df[['col1', ...,'coln']].values]

Further tips

  • Do numerical calculations with NumPy functions. They are two orders of magnitude faster than Python’s built-in tools.
  • Of Python’s built-in tools, list comprehension is faster than map() , which is significantly faster than for.
  • For deeply recursive algorithms, loops are more efficient than recursive function calls.
  • You cannot replace recursive loops with map(), list comprehension, or a NumPy function.
  • “Dumb” code (broken down into elementary operations) is the slowest. Use built-in functions and tools.

source: example code, link
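A quick illustration of the first two tips (a sketch; timings vary by machine):

x = list(range(100000))
a = np.array(x)
%timeit [i * 2 for i in x]              # list comprehension
%timeit list(map(lambda i: i * 2, x))   # map() with a lambda
%timeit for i in x: i * 2               # plain for loop
%timeit a * 2                           # NumPy vectorised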

(top)

Parallel data structures

https://learning.oreilly.com/library/view/python-high-performance/9781787282896/

get all combinations from two columns

tuples = [tuple(x) for x in dm_bmindex_df_without_index_df[['trial', 'biomarker_name']].values]

(top)

Jupyter notebooks

jupyter notebook best practices (link): structure your notebook, automate jupyter execution: link

jt -t grade3 -fs 95 -altp -tfs 11 -nfs 115 -cellw 88% -T

From jupyter notebooks to standalone apps (Voila): github, blog (example github PR)

jupyter lab: Shortcut to run single command: stackoverflow

Notebooks in production

directory structure, layout, workflow: blog post also: cookiecutter

workflow

(top)

Directory structure

  • raw - Contains the unedited csv and Excel files used as the source for analysis.
  • interim - Used if there is a multi-step manipulation. This is a scratch location, not always needed but helpful to have in place so directories do not get cluttered, or as a temp location for troubleshooting issues.
  • processed - In many cases, I read in multiple files, clean them up and save them to a new location in a binary format. This streamlined format makes it easier to read in larger files later in the processing pipeline.
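A sketch of wiring that layout up with pathlib (file names are placeholders; to_parquet needs pyarrow or fastparquet):

from pathlib import Path
DATA = Path('data')
RAW, INTERIM, PROCESSED = DATA / 'raw', DATA / 'interim', DATA / 'processed'

df = pd.read_csv(RAW / 'example.csv')           # unedited source file
df.to_pickle(INTERIM / 'example_step1.pkl')     # scratch output of an intermediate step
df.to_parquet(PROCESSED / 'example.parquet')    # cleaned, binary format for later stages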

(top)

Further link

how netflix runs notebooks: scheduling, integration testing: link

jupyter notebook template

header section

  • A good name for the notebook (as described above)
  • A summary header that describes the project
  • Free form description of the business reason for this notebook. I like to include names, dates and snippets of emails to make sure I remember the context.
  • A list of people/systems where the data originated.
  • I include a simple change log. I find it helpful to record when I started and any major changes along the way. I do not update it with every single change but having some date history is very beneficial.

(top)

Orchestration

https://taskfile.dev/#/

jupyter code snippets

# jupyter notebook --generate-config
jupyter notebook --generate-config

# start in screen session
screen -d -m -S JUPYTER jupyter notebook --ip 0.0.0.0 --port 8889 --no-browser --NotebookApp.token=''

# install packages in jupyter
!pip install package-name

# shell access from a cell (e.g. check which python is used)
%%bash
which python

# reset/set password
jupyter notebook password

# show all running notebooks
jupyter notebook list
# (Can be useful to get a hash for a notebook)

append to path

import sys
from os.path import dirname
sys.path.append(dirname(__file__))  # note: __file__ is not defined inside a notebook; use a literal path there

hide warnings

import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='once')

tqdm (top)

from tqdm import tqdm
for i in tqdm(range(10000)):

qgrid (top)

import qgrid
qgrid_widget = qgrid.show_grid(df, show_toolbar=True)
qgrid_widget

print all numpy

import sys
import numpy
numpy.set_printoptions(threshold=sys.maxsize)  # numpy.nan is no longer accepted here

Debugging:

ipdb

# debug ipdb
from IPython.core.debugger import set_trace

def select_condition(tmp):
    set_trace()

Pixie debugger

# built-in profiler
%prun -l 4 estimate_and_update(100)

# line by line profiling
pip install line_profiler
%load_ext line_profiler
%lprun -f sum_of_lists sum_of_lists(5000)

# memory usage
pip install memory_profiler
%load_ext memory_profiler
%memit sum_of_lists(1000000)

source: Timing and profiling

add tags to jupyterlab: jupyterlab/jupyterlab#4100

{
    "tags": [
        "to_remove"
    ],
    "slideshow": {
        "slide_type": "fragment"
    }
}

removing tags: https://groups.google.com/forum/#!topic/jupyter/W2M_nLbboj4

(top)

Timing and Profiling

https://jakevdp.github.io/PythonDataScienceHandbook/01.07-timing-and-profiling.html

test code

test driven development in jupyter notebook

asserts

def multiplyByTwo(x):
    return x * 3  # bug: should be x * 2, so the assert below fails

assert multiplyByTwo(2) == 4, "2 multiplied by two should equal 4"

# test file size
assert os.path.getsize(bm_index_master_file) > 150000000, 'Output file size should be > 150Mb'

# assert is nan
assert np.isnan(ret_none), f"Can't deal with 'None values': {ret_none} == {np.nan}"

Production-ready notebooks: link. If tqdm doesn't work: install ipywidgets. HBox issue: link

Qgrid

Qgrid readme

qgrid.show_grid(e_tpatt_df, grid_options={'forceFitColumns': False, 'defaultColumnWidth': 100})

Debugging conda

# conda show install versions
import sys; print(sys.path)
or
import sys, fastai 
print(sys.modules['fastai'])

Running Jupyter

jupyter notebook --no-browser &> /dev/null &
# other options: --matplotlib inline --port=9777 --no-browser

# Check GPU is working

from tensorflow.python.client import device_lib

def get_available_devices():  
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]

print(get_available_devices()) 

(top)

installing kernels

# conda install kernel
source activate <ANACONDA_ENVIRONMENT_NAME>  
pip install ipykernel  
python -m ipykernel install --user

or 

source activate myenv
python -m ipykernel install --user --name myenv --display-name "Python (myenv)"

source stackoverflow

(top)

unsorted

# use dictionary to count list
>>> from collections import Counter
>>> Counter(['apple','red','apple','red','red','pear'])
Counter({'red': 3, 'apple': 2, 'pear': 1})
# dictionary keys to list
list(dict.keys())
# dictionary remove nan
from math import isnan
# if nan in keys
clean_dict = {k: v for k, v in my_dict.items() if not isnan(k)}
# if nan in values
clean_dict = {k: v for k, v in my_dict.items() if not isnan(v)}
# list remove nan
cleanedList = [x for x in countries if str(x) != 'nan']
# pandas convert all columns to lowercase
df.apply(lambda x: x.astype(str).str.lower())
# pandas set difference tow columns 
# source: https://stackoverflow.com/questions/18180763/set-difference-for-pandas
from pandas import  DataFrame

df1 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})

print(df2[~df2.isin(df1).all(1)])
print(df2[(df2 != df1)].dropna(how='all'))
print(df2[~(df2 == df1)].dropna(how='all'))

# set operations (A and B are Python sets)
# union
print("Union :", A | B)
# intersection
print("Intersection :", A & B)
# difference
print("Difference :", A - B)
# symmetric difference
print("Symmetric difference :", A ^ B)
# pandas value counts to dataframe
df = value_counts.rename_axis('unique_values').reset_index(name='counts')
# python dictionary get first key
list(tree_number_dict.keys())[0]
# pandas dataframe get cell value by condition
function(df.loc[df['condition'].isna(),'condition'].values[0],1)
# dataframe drop duplicates keep first
df = df.drop_duplicates(subset='index', keep='last')

(top)
