Skip to content

Instantly share code, notes, and snippets.

Andrew Gross andrewgross

View GitHub Profile
andrewgross /
Created Feb 4, 2019
PySpark code to take a dataframe and repartition it in to an optimal number of partitions for generating 300Mb-1GB parquet files.
import re
import pyspark.sql.types as T
from math import ceil
def repartition_for_writing(df):
count = df.count()
sampled_df = get_sampled_df(df, count=count)
string_column_sizes = get_string_column_sizes(sampled_df)
num_files = get_num_files(count, df.schema, string_column_sizes)
andrewgross /
Created Jan 27, 2019
This script is to help for properly setting permissions so that you can read S3 Inventory data in an account that is not the owner of a bucket. It assumes that you have a ROLE_ARN that can assume a role in the main account that has Read Permissions and R/W for Object ACL permissions on your s3 bucket.
import datetime
import json
BUCKET_NAME = "<s3_bucket_name>"
INVENTORY_PREFIX = "<prefix_given_to_s3_inventory>" # Should have data/, hive/, and some dated folders inside of it
ACCOUNT_CUID = "<your_canonical_user_id_for_cross_account>" # Account which is not the owner of S3 bucket, but trying to access it. Controls ROLE_ARN
ROLE_ARN = "<role_in_cross_account_that_can_assume_to_main_account>"
def role_arn_to_session(role_arn):
andrewgross /
Last active May 25, 2018
PySpark scripts to predict the number of partitions needed to get good output file sizes (100-300MB for Parquet). Also a helper function to determine your average byte array size.
def get_files_per_partition(df, partition_key, file_type="parquet", compression="snappy", byte_array_size=256):
rows = df.count()
print "Dataset has {} rows".format(rows)
schema = df.schema
num_partitions = 1
if partition_key is not None:
num_partitions =[partition_key]).distinct().count()
print "Dataset has {} distinct partition keys".format(num_partitions)
_df = df.drop(partition_key)
schema = _df.schema
View gist:300286593b0bd2c2cc0ace5db819095e
This is going to be a bit of an esoteric ticket. I noticed some strange behavior recently when comparing Spectrum and Redshift results on the same dataset.
Redshift Data: fare.txt
Parquet Data: fare.parquet
The parquet data was generated from fare.txt with PySpark using on Spark 2.2.0
Redshift Table Schema:

Keybase proof

I hereby claim:

  • I am andrewgross on github.
  • I am andrewwgross ( on keybase.
  • I have a public key ASDsj8ie3y_QBUpm4aBzm-ty7Hr9w_Y5PtWIcLZfQlt9JQo

To claim this, I am signing this object:

andrewgross /
Created Mar 23, 2017
CLI App that takes an HTTP address with a host similar to ip-10-32-9-122.ec2.internal and puts in your mac clipboard the HTTP address with the host converted to an IP address.
import cmd
from urlparse import urlparse
import sys
import re
import subprocess
def write_to_clipboard(output):
process = subprocess.Popen(
'pbcopy', env={'LANG': 'en_US.UTF-8'}, stdin=subprocess.PIPE)
andrewgross /
Last active Aug 3, 2019
Django View to check travis CI webhook signatures. Requires Django, python-requests and pyOpenSSL packages
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import base64
import json
import logging
from urlparse import parse_qs
import requests
View one_hour_ago.go
one_hour_ago := time.Now().AddDate(0, 0, -1).Add(time.Hour * 23)
flip() {
echo -en "( º_º) ┬─┬ \r"; sleep .2;
echo -en " ( º_º) ┬─┬ \r"; sleep .2;
echo -en " ( ºДº)┬─┬ \r"; sleep .2;
echo -en " (╯'Д')╯︵⊏ \r"; sleep .1;
echo -en " (╯'□')╯︵ ⊏ \r"; sleep .1;
echo " (╯°□°)╯︵ ┻━┻"; sleep .1;
View gist:e125be9312dfe78b036d
Multiple vulnerabilities have been discovered in the PHP language:
thoger at redhat dot com discovered that paths containing a NUL
character were improperly handled, thus allowing an attacker to
manipulate unexpected files on the server.
You can’t perform that action at this time.