Ian Whitestone ian-whitestone
@ian-whitestone
ian-whitestone / row_num.py
Created August 21, 2018 13:53
Pandas equivalent of SQL's row_number()
# Let's add a row number to indicate the first message per app & microservice
# This code is analogous to the SQL: row_number() over (partition by id, topic order by msg_ts asc)
df['row_num'] = df.sort_values(['id', 'msg_ts'], ascending=True).groupby(['id', 'topic']).cumcount() + 1
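A small worked example may help confirm what the one-liner does. This is only a sketch: the sample rows are made up, and the final filter (keeping `row_num == 1`) is an assumed follow-up step, not part of the original snippet.

```python
import pandas as pd

# Hypothetical sample data: two ids, two topics, timestamps out of order.
df = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2],
        "topic": ["a", "a", "b", "a", "a"],
        "msg_ts": pd.to_datetime(
            [
                "2018-08-21 10:05",
                "2018-08-21 10:00",
                "2018-08-21 10:03",
                "2018-08-21 09:00",
                "2018-08-21 08:00",
            ]
        ),
    }
)

# Same pattern as the snippet: sort, group, then number rows within each group.
# The assignment aligns on the index, so df keeps its original row order.
df["row_num"] = (
    df.sort_values(["id", "msg_ts"], ascending=True)
    .groupby(["id", "topic"])
    .cumcount()
    + 1
)

# Assumed follow-up: keep the first message per (id, topic),
# similar to filtering on row_num = 1 in SQL.
first_messages = df[df["row_num"] == 1]
```

Because `cumcount()` returns a Series carrying the original index labels, there is no need to re-sort `df` before assigning the new column.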
@ian-whitestone
ian-whitestone / jupyter_setup.md
Created August 29, 2018 13:02
Setting up Jupyter Lab on an Ubuntu instance

Setting Up Jupyter Lab on an EC2 Instance

  1. Install Jupyter Lab

conda install -c conda-forge jupyterlab

  2. Create certs

openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout mycert.pem -out mycert.pem

@ian-whitestone
ian-whitestone / missing.py
Created September 14, 2018 18:30
Pandas missing values summary
# Source: https://towardsdatascience.com/a-data-science-for-good-machine-learning-project-walk-through-in-python-part-one-1977dd701dbc
import pandas as pd
# Number of missing values in each column
missing = pd.DataFrame(data.isnull().sum()).rename(columns = {0: 'total'})
# Fraction of rows missing in each column (0 to 1, stored under 'percent')
missing['percent'] = missing['total'] / len(data)
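The snippet above assumes a DataFrame named `data` already exists. A self-contained sketch with a hypothetical three-column frame shows the shape of the result; the final sort is an assumed convenience step, not something shown in the original preview.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame standing in for `data`.
data = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan],
        "b": ["x", "y", None, "z"],
        "c": [1, 2, 3, 4],
    }
)

# Count of missing values per column, as a one-column DataFrame named 'total'.
missing = pd.DataFrame(data.isnull().sum()).rename(columns={0: "total"})

# Fraction (0 to 1) of rows that are missing in each column.
missing["percent"] = missing["total"] / len(data)

# Assumed extra step: list the columns with the most missing values first.
missing = missing.sort_values("percent", ascending=False)
```

Note that the `percent` column holds a fraction; multiply by 100 if a true percentage is wanted.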
@ian-whitestone
ian-whitestone / fake_data.py
Created September 24, 2018 01:18
Generating fake data to compare dask and spark for reading avro files into a dataframe
"""Generate a bunch of fake avro data and upload to s3
Running in python 3.7. Installed the following:
- pip install Faker
- pip install fastavro
- pip install boto3
- pip install graphviz
- brew install graphviz
"""
import sys
import dask.bag as db
def gt(x):
    return x > 3
def even(x):
    return x % 2 == 0
@ian-whitestone
ian-whitestone / notes.md
Last active March 1, 2023 01:45
Best practices for Presto SQL

Presto Specific

  • Don’t SELECT *; list explicit column names (the store is columnar, so every extra column means extra data read)
  • Avoid large JOINs (filter each table first)
    • In Presto, tables are joined in the order they are listed!
    • Join small tables earlier in the plan and leave larger fact tables to the end
    • Avoid cross joins and one-to-many joins, as these can degrade performance
  • ORDER BY and GROUP BY take time
    • Only use ORDER BY in subqueries if it is really necessary
  • When using GROUP BY, order the columns from highest cardinality (that is, the largest number of unique values) to lowest.
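The GROUP BY ordering rule can be sketched in Python with pandas. Everything here is an assumption for illustration: `order_by_cardinality` is a hypothetical helper, and the sample frame stands in for a table sample. On a real cluster the distinct counts would more likely come from table statistics or an `approx_distinct` query.

```python
import pandas as pd


def order_by_cardinality(df, columns):
    """Return `columns` sorted from most to fewest distinct values.

    Hypothetical helper: counts distinct values on a local sample.
    """
    counts = df[columns].nunique()
    return counts.sort_values(ascending=False).index.tolist()


# Hypothetical sample of the table being aggregated.
sample = pd.DataFrame(
    {
        "country": ["CA", "CA", "US", "US"],
        "user_id": [1, 2, 3, 4],
        "plan": ["free", "free", "free", "free"],
    }
)

# Highest-cardinality column first, lowest last.
group_cols = order_by_cardinality(sample, ["plan", "country", "user_id"])
group_by_clause = "GROUP BY " + ", ".join(group_cols)
```

The resulting clause string can then be pasted into, or templated into, the Presto query.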
@ian-whitestone
ian-whitestone / zappa_package_cleaner.py
Last active June 2, 2023 06:51
Remove additional files and/or directories from Zappa deployment package via zip callback https://ianwhitestone.work/Zappa-Zip-Callbacks/
"""
Read accompanying blog post: https://ianwhitestone.work/Zappa-Zip-Callbacks
"""
import os
import re
import shutil
import tarfile
import zipfile
@ian-whitestone
ian-whitestone / great_expecations_examples.py
Created January 12, 2020 22:32
Quickstart examples for getting up and running with Great Expectations
## Pandas
import great_expectations as ge
# Build up expectations on a sample dataset and save them
train = ge.read_csv("data/npi.csv")
train.expect_column_values_to_not_be_null("NPI")
train.save_expectation_suite("npi_csv_expectations.json")
# Load in a new dataset and test them
test = ge.read_csv("data/npi_new.csv")
@ian-whitestone
ian-whitestone / notify.py
Last active March 17, 2020 01:01
Script for sending failure notifications in Slack
"""
Trigger slack notifications
"""
import argparse
import logging
import os
from slack.web.client import WebClient
LOGGER = logging.getLogger(__name__)