"""
Read accompanying blog post: https://ianwhitestone.work/Zappa-Zip-Callbacks
"""
import os
import re
import shutil
import tarfile
import zipfile
- Don't SELECT *; specify explicit column names (the data sits in a columnar store, so only the named columns are read)
- Avoid large JOINs (filter each table first)
- In Presto, tables are joined in the order they are listed!
- Join small tables earlier in the plan and leave larger fact tables to the end
- Avoid cross joins or one-to-many joins, as these can degrade performance
- ORDER BY and GROUP BY take time
- Only use ORDER BY in subqueries if it is really necessary
- When using GROUP BY, order the columns from the highest cardinality (that is, the largest number of unique values) to the lowest
import sys
import dask.bag as db

def gt(x):
    return x > 3

def even(x):
    return x % 2 == 0
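The snippet is cut off before the two predicates are used. One plausible use, assumed here and not taken from the original, is chaining them as filters on a Dask bag. The input range, the `npartitions` value, and the synchronous scheduler are illustrative choices only.

```python
import dask.bag as db


def gt(x):
    return x > 3


def even(x):
    return x % 2 == 0


# Hypothetical input: the original file does not show what the bag contains.
bag = db.from_sequence(range(10), npartitions=2)

# Filters are lazy; nothing runs until compute() is called.
# The synchronous scheduler keeps this sketch single-process and easy to debug.
result = bag.filter(gt).filter(even).compute(scheduler="synchronous")
print(result)
```

Each `filter` call only extends the task graph, so the predicates can be chained in any order before a single `compute()`.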
"""Generate a bunch of fake avro data and upload to s3

Running in python 3.7. Installed the following:
- pip install Faker
- pip install fastavro
- pip install boto3
- pip install graphviz
- brew install graphviz
"""
# Source: https://towardsdatascience.com/a-data-science-for-good-machine-learning-project-walk-through-in-python-part-one-1977dd701dbc
import pandas as pd

# Number of missing values in each column (`data` is an existing DataFrame)
missing = pd.DataFrame(data.isnull().sum()).rename(columns={0: 'total'})

# Fraction of rows missing in each column (multiply by 100 for a percentage)
missing['percent'] = missing['total'] / len(data)
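The snippet above assumes a DataFrame named `data` already exists. A minimal, self-contained sketch on made-up data shows what the resulting table looks like. The column names and values are hypothetical, and the final sort is an addition for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the `data` DataFrame of the original snippet.
data = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Toronto", "Ottawa", None, "Montreal"],
    "score": [1.0, 2.0, 3.0, 4.0],
})

# Count of missing values per column; the unnamed Series becomes column 0.
missing = pd.DataFrame(data.isnull().sum()).rename(columns={0: "total"})

# Fraction of rows that are missing in each column.
missing["percent"] = missing["total"] / len(data)

# Not in the original: list the worst columns first.
missing = missing.sort_values("percent", ascending=False)
print(missing)
```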
# Let's add a row number to indicate the first message per app & microservice
# This code is analogous to the SQL: row_number() over (partition by id, topic order by msg_ts asc)
df['row_num'] = df.sort_values(['id', 'msg_ts'], ascending=True).groupby(['id', 'topic']).cumcount() + 1
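To see the row numbering in action, here is a small sketch on a made-up DataFrame. The values are hypothetical; only the column names `id`, `topic`, and `msg_ts` come from the line above. The assignment works because pandas aligns the sorted result back to `df` by index.

```python
import pandas as pd

# Hypothetical messages; rows are deliberately out of timestamp order.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "topic": ["a", "a", "b", "a", "a"],
    "msg_ts": [30, 10, 20, 5, 1],
})

# Number rows within each (id, topic) group in ascending msg_ts order.
df["row_num"] = (
    df.sort_values(["id", "msg_ts"], ascending=True)
    .groupby(["id", "topic"])
    .cumcount()
    + 1
)

# Keep only the earliest message in each (id, topic) group.
first_messages = df[df["row_num"] == 1]
print(first_messages)
```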