Andrew Gross (andrewgross)

andrewgross / browse_s3.py
import os
from urllib.parse import urlparse
from pyspark.sql.functions import desc, asc
from pyspark.sql.types import (
    StructType,
    StructField,
    StringType,
    LongType,
)
@andrewgross
andrewgross / optimize_filesize.py
Created February 4, 2019 20:46
PySpark code to take a DataFrame and repartition it into an optimal number of partitions for generating 300 MB-1 GB Parquet files.
import re
import pyspark.sql.types as T
from math import ceil


def repartition_for_writing(df):
    count = df.count()
    sampled_df = get_sampled_df(df, count=count)
    string_column_sizes = get_string_column_sizes(sampled_df)
    num_files = get_num_files(count, df.schema, string_column_sizes)
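The preview cuts off before the helper functions. As a rough sketch of the approach (the helper names come from the preview above, but these bodies are assumptions, not the original gist code), the idea is to sample the DataFrame, measure average string column sizes, estimate bytes per row, and divide the projected total size by a target file size; this sketch reuses the T and ceil imports shown above:

# Sketch only: helper bodies are assumptions, not the original gist code.
TARGET_FILE_SIZE = 512 * 1024 * 1024  # aim between 300 MB and 1 GB per file

def get_sampled_df(df, count, sample_size=10000):
    # Sample a fixed number of rows so size estimates stay cheap on large tables.
    fraction = min(1.0, float(sample_size) / max(count, 1))
    return df.sample(withReplacement=False, fraction=fraction)

def get_string_column_sizes(sampled_df):
    # Average length of each string column, measured on the sample.
    from pyspark.sql import functions as F
    sizes = {}
    for field in sampled_df.schema.fields:
        if isinstance(field.dataType, T.StringType):
            row = sampled_df.select(F.avg(F.length(F.col(field.name)))).first()
            sizes[field.name] = row[0] or 0
    return sizes

def get_num_files(count, schema, string_column_sizes):
    # Estimate bytes per row: measured averages for strings,
    # a fixed 8-byte guess for everything else, then divide by the target size.
    bytes_per_row = 0
    for field in schema.fields:
        if isinstance(field.dataType, T.StringType):
            bytes_per_row += string_column_sizes.get(field.name, 0)
        else:
            bytes_per_row += 8
    total_bytes = count * bytes_per_row
    return max(1, int(ceil(total_bytes / float(TARGET_FILE_SIZE))))

The returned count would then feed a df.repartition(num_files) call before writing the Parquet output.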
@andrewgross
andrewgross / s3_inventory.py
Created January 27, 2019 19:45
This script helps set permissions properly so that you can read S3 Inventory data from an account that is not the owner of the bucket. It assumes you have a ROLE_ARN that can assume a role in the main account with Read permissions and R/W Object ACL permissions on your S3 bucket.
import datetime
import json
BUCKET_NAME = "<s3_bucket_name>"
INVENTORY_PREFIX = "<prefix_given_to_s3_inventory>" # Should have data/, hive/, and some dated folders inside of it
ACCOUNT_CUID = "<your_canonical_user_id_for_cross_account>"  # The account that does not own the S3 bucket but needs to access it; this account controls ROLE_ARN
ROLE_ARN = "<role_in_cross_account_that_can_assume_to_main_account>"
def role_arn_to_session(role_arn):
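The preview ends at the role_arn_to_session signature. A minimal sketch of what such a helper usually looks like, assuming boto3 with STS assume-role credentials (the session name below is arbitrary, not taken from the gist):

import boto3

def role_arn_to_session(role_arn):
    # Assume the role and build a boto3 Session from the temporary credentials.
    sts = boto3.client("sts")
    response = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName="s3-inventory-acl-fix",  # arbitrary session name
    )
    creds = response["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

From that session you can create S3 clients that carry the assumed-role credentials when reading inventory data or rewriting object ACLs.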
@andrewgross
andrewgross / dynamic_partition.py
Last active May 25, 2018 18:35
PySpark scripts to predict the number of partitions needed to get good output file sizes (100-300 MB for Parquet), plus a helper function to determine your average byte array size.
def get_files_per_partition(df, partition_key, file_type="parquet", compression="snappy", byte_array_size=256):
    rows = df.count()
    print("Dataset has {} rows".format(rows))
    schema = df.schema
    num_partitions = 1
    if partition_key is not None:
        num_partitions = df.select([partition_key]).distinct().count()
        print("Dataset has {} distinct partition keys".format(num_partitions))
        _df = df.drop(partition_key)
        schema = _df.schema
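The preview stops before the size math. A sketch of how the estimate could continue, assuming a hypothetical estimate_row_size helper that charges byte_array_size bytes for string/binary columns and 8 bytes for everything else (these functions are illustrations, not the original gist code):

from math import ceil

def estimate_row_size(schema, byte_array_size=256):
    # Rough per-row estimate: byte_array_size bytes for variable-width columns,
    # 8 bytes for everything else.
    size = 0
    for field in schema.fields:
        if field.dataType.typeName() in ("string", "binary"):
            size += byte_array_size
        else:
            size += 8
    return size

def files_per_partition(rows, num_partitions, row_size, target_bytes=200 * 1024 * 1024):
    # Divide the estimated bytes in each partition key by the target file size.
    rows_per_partition = rows / float(num_partitions)
    return max(1, int(ceil(rows_per_partition * row_size / float(target_bytes))))

With those numbers, repartitioning to num_partitions * files before writing, partitioned by partition_key, should land each output file near the 100-300 MB target.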
andrewgross / gist:300286593b0bd2c2cc0ace5db819095e
Hey,
This is going to be a bit of an esoteric ticket. I noticed some strange behavior recently when comparing Spectrum and Redshift results on the same dataset.
Redshift Data: fare.txt
Parquet Data: fare.parquet
The Parquet data was generated from fare.txt with PySpark using convert.py on Spark 2.2.0.
Redshift Table Schema:
andrewgross / keybase.md

Keybase proof

I hereby claim:

  • I am andrewgross on github.
  • I am andrewwgross (https://keybase.io/andrewwgross) on keybase.
  • I have a public key ASDsj8ie3y_QBUpm4aBzm-ty7Hr9w_Y5PtWIcLZfQlt9JQo

To claim this, I am signing this object:

@andrewgross
andrewgross / convert_url.py
Created March 23, 2017 21:56
CLI app that takes an HTTP address with a host like ip-10-32-9-122.ec2.internal and copies to your Mac clipboard the same address with the host converted to an IP address.
import cmd
from urlparse import urlparse
import sys
import re
import subprocess


def write_to_clipboard(output):
    # Pipe the text into pbcopy so it lands on the macOS clipboard.
    process = subprocess.Popen(
        'pbcopy', env={'LANG': 'en_US.UTF-8'}, stdin=subprocess.PIPE)
    process.communicate(output.encode('utf-8'))
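The preview stops before the conversion logic. A minimal sketch of the host rewrite, assuming the hostname always embeds the IP as ip-A-B-C-D and reusing the re and urlparse imports shown above (the convert_host name is an illustration, not necessarily the gist's):

IP_HOST_RE = re.compile(r"^ip-(\d+)-(\d+)-(\d+)-(\d+)\.")

def convert_host(url):
    # Rewrite ip-10-32-9-122.ec2.internal style hosts to the bare IP address.
    parsed = urlparse(url)
    match = IP_HOST_RE.match(parsed.hostname or "")
    if not match:
        return url
    ip = ".".join(match.groups())
    netloc = ip if parsed.port is None else "{}:{}".format(ip, parsed.port)
    return parsed._replace(netloc=netloc).geturl()

write_to_clipboard(convert_host(url)) would then put the rewritten address on the clipboard.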
@andrewgross
andrewgross / travis_webhook_checker.py
Last active November 5, 2020 15:10
Django view to check Travis CI webhook signatures. Requires the Django, python-requests, and pyOpenSSL packages.
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import base64
import json
import logging
from urlparse import parse_qs
import requests
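The preview shows only the imports. A sketch of the usual verification flow, reusing the base64, logging, parse_qs, and requests imports above, and assuming the public key lives at config.notifications.webhook.public_key in the Travis /config response and the request carries a base64 Signature header (function names and endpoint are assumptions, not the gist's exact code):

from OpenSSL.crypto import FILETYPE_PEM, X509, load_publickey, verify

TRAVIS_CONFIG_URL = "https://api.travis-ci.org/config"  # assumed endpoint

def get_travis_public_key():
    # Travis publishes the webhook signing key in its config document.
    response = requests.get(TRAVIS_CONFIG_URL, timeout=10)
    response.raise_for_status()
    return response.json()["config"]["notifications"]["webhook"]["public_key"]

def signature_is_valid(request):
    # The payload arrives form-encoded; the signature header is base64.
    signature = base64.b64decode(request.META["HTTP_SIGNATURE"])
    payload = parse_qs(request.body)["payload"][0]
    certificate = X509()
    certificate.set_pubkey(load_publickey(FILETYPE_PEM, get_travis_public_key()))
    try:
        verify(certificate, signature, payload, str("sha1"))
        return True
    except Exception:
        logging.exception("Travis signature verification failed")
        return False

A view would call signature_is_valid(request) and return a 400 response when the check fails.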
andrewgross / one_hour_ago.go
// Go back one day, then forward 23 hours: the net result is one hour ago
// (equivalent to time.Now().Add(-time.Hour)).
one_hour_ago := time.Now().AddDate(0, 0, -1).Add(time.Hour * 23)
andrewgross / flip.sh
flip() {
    echo;
    echo -en "( º_º) ┬─┬ \r"; sleep .2;
    echo -en " ( º_º) ┬─┬ \r"; sleep .2;
    echo -en " ( ºДº)┬─┬ \r"; sleep .2;
    echo -en " (╯'Д')╯︵⊏ \r"; sleep .1;
    echo -en " (╯'□')╯︵ ⊏ \r"; sleep .1;
    echo " (╯°□°)╯︵ ┻━┻"; sleep .1;
}