
@zufanka
Last active December 11, 2024 09:57
The turbo data analysis cheatsheet

Contents

SQL

Update

Update a column

Postgres: update a column based on a condition on another column

UPDATE income
SET currency = 'HRK'
WHERE income.date < '2022-01-01';

Update with a join

UPDATE income
SET currency = 'HRK'
FROM parties
WHERE income.party_id = parties.id
AND parties.country = 'HR'
AND income.date < '2022-01-01';

Python

Libraries

pycountry and pycountry_convert - country details

Lists

Cycling through a list

To cycle through the list, you can use the modulo operator: L[i % n]

computer_rooms = ["2.08","2.09","2.1"]
[computer_rooms[i % len(computer_rooms)] for i in range(10)]
# ['2.08', '2.09', '2.1', '2.08', '2.09', '2.1', '2.08', '2.09', '2.1', '2.08']
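An alternative sketch using the standard library: `itertools.cycle` repeats the list endlessly, and `islice` takes the first n items without any index arithmetic.

```python
from itertools import cycle, islice

computer_rooms = ["2.08", "2.09", "2.1"]
# cycle() repeats the list forever; islice() stops after 10 items
list(islice(cycle(computer_rooms), 10))
# ['2.08', '2.09', '2.1', '2.08', '2.09', '2.1', '2.08', '2.09', '2.1', '2.08']
```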

Split on a pattern

re.findall(pattern, string) scans string from left to right and returns a list of all non-overlapping matches of pattern, in the order they are found.

import re
s = '111A222B333C'
res = re.findall(r'(\d+|[A-Za-z]+)', s)
print(res)
# ['111', 'A', '222', 'B', '333', 'C']

Flatten list of lists

[item for sublist in listoflists for item in sublist]
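A worked sketch with a hypothetical nested list; `itertools.chain.from_iterable` gives the same result lazily.

```python
from itertools import chain

listoflists = [[1, 2], [3], [4, 5, 6]]
# nested comprehension: outer loop first, inner loop second
flat = [item for sublist in listoflists for item in sublist]
# equivalent, but iterates lazily until list() collects it
flat2 = list(chain.from_iterable(listoflists))
# both: [1, 2, 3, 4, 5, 6]
```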

Find intersection of two lists

has_tail = ["cat", "dog", "chameleon"]
has_earlobes = ["human", "cat", "dog"]

list(filter(set(has_tail).__contains__, has_earlobes))

>> ['cat', 'dog']

Dictionaries

Find parameters of an object

vars(obj)  # built-in; equivalent to obj.__dict__ (some libraries also expose obj.as_dict())
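A minimal sketch with a throwaway class (the class and attribute names are hypothetical):

```python
class Station:
    def __init__(self, name, zone):
        self.name = name
        self.zone = zone

s = Station("Centraal", 1)
# vars() returns the instance's attribute dictionary
vars(s)
# {'name': 'Centraal', 'zone': 1}
```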

Swap keys and values of a dictionary

{y:x for x,y in stations.items()}
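For example, with a hypothetical stations dict. Note that duplicate values collapse, since the old values become the new keys.

```python
stations = {"AMS": "Amsterdam", "RTM": "Rotterdam"}
# each (key, value) pair is emitted as (value, key)
swapped = {y: x for x, y in stations.items()}
# {'Amsterdam': 'AMS', 'Rotterdam': 'RTM'}
```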

Using dictionaries instead of if-elif-else

# unPythonic
if season == 'Winter':
    holiday = 'Christmas'
elif season == 'Spring':
    holiday = 'Easter'
elif season == 'Summer':
    holiday = 'American Independence Day'
elif season == 'Fall':
    holiday = 'Halloween'
else:
    holiday = 'Personal day off'

# Pythonic
season = "Winter"
holiday = {'Winter': 'Christmas',
           'Spring': 'Easter',
           'Summer': 'American Independence Day',
           'Fall':   'Halloween'}.get(season, 'Personal day off')
holiday
"Christmas"

Using dictionaries to prevent a key error

workDetails = {}

# unPythonic
if 'hours' in workDetails:
    hoursWorked = workDetails['hours']
else:
    hoursWorked = 0 # Default to 0 if the 'hours' key doesn't exist.

# Pythonic
hoursWorked = workDetails.get('hours', 0)

# Pythonic 2
workDetails.setdefault('hours', 0)  # Does nothing if 'hours' exists.

two lists into a dictionary

keys = ['a', 'b', 'c']
values = [1, 2, 3]
dictionary = dict(zip(keys, values))

Dates

Make a date object from an ISO-format string

datetime.date.fromisoformat('string')

Turn a datetime object into an ISO-format string

datetimeObject.isoformat()
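A round trip between the two, using only the standard library (the date is illustrative):

```python
import datetime

d = datetime.date.fromisoformat("2024-12-11")  # ISO string -> date object
d.isoformat()                                  # date object -> ISO string
# '2024-12-11'
```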

BeautifulSoup

replace br with a different character

for br in soup.find_all('br'):
   br.replace_with("|")

Pandas

Jupyter notebook pandas settings with gspread

import pandas as pd

from IPython.display import Markdown, display

def printmd(string):
    display(Markdown(string))
    
import gspread
import gspread_dataframe as gd
from oauth2client.service_account import ServiceAccountCredentials
    
pd.set_option("display.max_colwidth", 500)
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100
pd.options.display.float_format = '{:,.2f}'.format

Check if a cell is NaN

Not a column, but just one value

pd.isna(df["colname"][0])

Fill a column only where it has NaN values, based on a dictionary

mapping = {"key" : "value"}
df.apply(lambda x: x["different_col"] * mapping[x["somecolthat_holds_the_keys"]]
         if pd.isna(x["colname"]) else x["colname"], axis=1)

Change type of Series

Series.astype(int)

Check what percentage of a df is (not) empty

returns a Series

df.count() / len(df)
df.notnull().mean().round(4) * 100
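A quick check on a toy frame (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, np.nan, 4],
                   "b": [np.nan, np.nan, 3, 4]})
# fraction of non-null cells per column, as a percentage
df.notnull().mean() * 100
# a    75.0
# b    50.0
```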

Filter out columns that are N% empty

returns a list of columns

cutoff = 50 # percent of not null
filter_columns = df.notnull().mean().round(4) * 100 
filter_columns.where(filter_columns > cutoff).dropna().index

Set a value based on another column (membership in a list)

If an item from somecol is in the list, assign 1 to a new column; otherwise assign 0.

values = [item1, item2, item3]
df["newcol"] = [1 if y in values else 0 for y in df["somecol"]]

Count frequency: how many times the same row appears in the dataframe

result is a matrix

Group           Size

Short          Small
Short          Small
Moderate       Medium
Moderate       Small
Tall           Large

pd.crosstab(df.Group,df.Size)

sum strings across rows

df.groupby("col")["jointhis"].agg(lambda col: ', '.join(col))
df.groupby("col")["jointhis"].apply(list)
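A small worked example with hypothetical data, showing the joined-string variant:

```python
import pandas as pd

# several rows per group
df = pd.DataFrame({"col": ["a", "a", "b"],
                   "jointhis": ["x", "y", "z"]})

# concatenate the strings within each group
df.groupby("col")["jointhis"].agg(", ".join)
# a -> 'x, y', b -> 'z'
```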

Cumulative count = making IDs, counting occurrences of a column value

df.groupby("colname").cumcount() + 1 # +1 because otherwise counting starts at 0

colname | cumcount
a | 1
a | 2
b | 1
a | 3
c | 1

Make percentages from absolute values in a DataFrame

perc = pd.DataFrame(df.apply(lambda x: 100 * (x / float(x.sum())))).unstack()
# perc.columns = perc.columns.droplevel()

Handling jsons


import json
import requests

r = requests.get("https://api.tenders.exposed/networks/58d77f85-bbc6-447d-a292-c3f17b7936b0/").text
data = json.loads(r)

# go down the tree as far as you need
data["network"]["nodes"]

# normalize that baby
pd.json_normalize(data["network"]["nodes"])

Replace strings (from dict)

map_min = {"Ministerie van Economische Zaken, Landbouw en Innovatie" : "Ministerie van Economische Zaken",
           "Provincie Overrijssel" : "Provincie Overijssel"}
nl = nl.replace({'Overheid': map_min}) # "Overheid" is the column name

Fuzzy merging

from fuzzymatcher import link_table, fuzzy_left_join

fuzzy_left_join(df1, df2, "df1-matchingcol", "df2-matchingcol")

Split cell and explode into multiple rows

exploded = df[["id","colname"]].set_index("id")["colname"].str.split(";").explode().reset_index().drop_duplicates()
pd.merge(left=df, right=exploded, on="id")

Stack and unstack

(illustration of stack/unstack omitted)

Highlight the maximum value in each column

def highlight_max(s):
    '''
    highlight the maximum in a Series yellow.
    '''
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

df.style.apply(highlight_max)

Styles

df.style.bar()
df.style.background_gradient(cmap='Blues', axis=None)

Sheets styling

https://github.com/robin900/gspread-formatting

Google Sheets >> DataFrame

import pandas as pd
import gspread
from oauth2client.service_account import ServiceAccountCredentials

scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('the_credentials.json', scope)
client = gspread.authorize(creds)

sheet_url = "sheeturl"

sheet = client.open_by_url(sheet_url)

# entities = gd.get_as_dataframe(sheet.worksheet("sheetnamen"),  skiprows=1) # get a sheet
# entities = entities.dropna(axis=0, how="all").dropna(axis=1, how="all") # remove empty rows and columns
# gd.set_with_dataframe(sheet.worksheet("sheetnamen"), entities, row=2) # push to a sheet

multiple csv >> DataFrame & filter

import glob

all_csvs = glob.glob("/data/*.csv")

cols = ["col1", "col2"]

df = pd.concat((pd.read_csv(table,
                 usecols=cols,
                 delimiter=';', low_memory=False) for table in all_csvs), ignore_index=True)

csvs in zip >> DataFrame

import zipfile

zf = zipfile.ZipFile("zippedcsvs.zip")
# zf.infolist()
df = pd.concat((pd.read_csv(zf.open(f.filename), encoding="ISO-8859-1")
                for f in zf.infolist() if f.filename.endswith('.csv')))

ODS >> DataFrame

from pyexcel_ods import get_data

sheet = get_data("data.ods")["sheetname"]
df = pd.DataFrame(sheet)
df.columns = df.iloc[0]
df = df[1:]

DataFrame >> list of lists

lol = df.values.tolist()

dataframe >> dict

use one column as key and another as value

df.set_index('keycol')['valcol'].to_dict()
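A sketch with hypothetical column names, using Series.to_dict on the value column after setting the key column as index:

```python
import pandas as pd

df = pd.DataFrame({"code": ["NL", "HR"],
                   "name": ["Netherlands", "Croatia"]})
# key column -> index, value column -> Series, then to_dict()
df.set_index("code")["name"].to_dict()
# {'NL': 'Netherlands', 'HR': 'Croatia'}
```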

Transpose

df.T

Selenium

get screenshot of the whole page

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://www.web.com/"
path = "name.png"

driver = webdriver.Firefox()
driver.get(url)

time.sleep(5)

# find_element_by_tag_name was removed in Selenium 4
el = driver.find_element(By.TAG_NAME, 'body')
el.screenshot(path)
driver.quit()

Numpy

Set value based on a value of multiple columns

np.where(((df["col1"] == "pink") &
          (df["col2"] == "pink")) |
         ((df["col1"] == "blue") &
          (df["col2"] == "blue")), "same", "not")

Split the dataframe into multiple equal parts

Use when data is next to each other rather than below

dfs = np.split(df, 12, axis=1)

Altair

Disable max rows

alt.data_transformers.enable('default', max_rows=None)

sorted barchart

column is the nominal column we count on; to sort by a sum instead, set op="sum".

alt.Chart(df).mark_bar().encode(
    x = "count(column):Q",
    y = alt.Y(
        'column:N',
        sort=alt.EncodingSortField(
            field="column",  # The field to use for the sort
            op="count",  # The operation to run on the field prior to sorting
            order="descending"  # The order to sort in
        ))
)

Descriptions of a grouped barchart

row = alt.Row("parameter", header=alt.Header(labelAngle = 0.1))

Custom colors

color = alt.Color('variable',
                   scale=alt.Scale(
                   domain=['good', 'medium', 'bad'],
                   range=['blue', 'yellow' ,'red']))

link from chart

alt.Chart(df).transform_calculate(
    url=alt.datum.colname
).mark_circle().encode(
    x = "col1",
    y = "col2",
    href='url:N',
)

Neo4j

Load network from a csv file

LOAD CSV WITH HEADERS FROM "file:///data_clean.csv" AS row
MERGE (b:buyer {buyer: coalesce(row.Buyer, "Unknown")})
MERGE (v:vendor {vendor: row.Vendor})
MERGE (p:property {property: coalesce(row.Asset, "Unknown")})
MERGE (v)-[r:SOLD]->(p)
MERGE (b)-[q:BOUGHT]->(p)
RETURN *;

Command line

Animated gifs

ffmpeg -i video.avi -t 10 out%02d.gif
gifsicle --delay=10 --loop *.gif > anim.gif

Maps

mapshaper

make a choropleth

git

Remove the last (not yet pushed) commit from history while keeping local files intact

git reset --soft HEAD~1

Remove a file from the last commit

git rm --cached data.csv
git commit --amend --no-edit
git push origin main

Inspired by the cheatsheet of my dear colleague Hay Kranen
