Rob Cowie robcowie

  • Recycleye
  • Leeds/London, United Kingdom
@robcowie
robcowie / largest_partition.sh
Created February 8, 2019 13:32
Mounted partition with most free space
df | grep / | sort -k 4 -n -r | head -n 1 | awk '{print $6}'
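The same selection can be sketched in Python for cases where the result is needed inside a script. This is a minimal sketch, assuming the six-column layout printed by `df -P` (available space in column 4, mount point in column 6); the function name `largest_partition` and the sample text are hypothetical and not part of the gist.

```python
def largest_partition(df_output):
    """Return the mount point with the most available space.

    Expects text in the six-column layout printed by `df -P`.
    """
    best_mount, best_avail = None, -1
    for line in df_output.splitlines():
        # Split into at most six fields so mount points with spaces survive.
        fields = line.split(None, 5)
        if len(fields) < 6 or not fields[3].isdigit():
            continue  # skip the header row and malformed lines
        avail = int(fields[3])
        if avail > best_avail:
            best_mount, best_avail = fields[5], avail
    return best_mount


# Hypothetical sample output, used instead of calling df directly.
SAMPLE = """\
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 100000 60000 40000 60% /
/dev/sdb1 500000 100000 400000 20% /data
tmpfs 8000 0 8000 0% /run
"""

print(largest_partition(SAMPLE))
```

On a live system the input could come from `subprocess.run(['df', '-P'], capture_output=True, text=True).stdout`.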
robcowie / list_s3_with_metadata.py
Created January 28, 2019 12:54
List S3 with pagination and metadata
def list_s3_with_metadata(s3_conn, prefix):
    """List all keys at `prefix` and return metadata."""
    bucket, prefix = prefix.split('://')[1].split('/', 1)
    paginator = s3_conn.get_paginator('list_objects_v2')
    response = paginator.paginate(Bucket=bucket, Prefix=prefix)

    def attrs(d):
        return {
            'Key': 's3://{}/{}'.format(bucket, d['Key']),
            'ETag': d['ETag'].replace('"', ''),
            'Size': d['Size'],
        }
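The preview stops after the `attrs` helper. One way the remaining step could look is to walk every page the paginator returns and map the same attribute extraction over each page's `Contents` list. The sketch below is an assumption about that step, not the author's code: `iter_object_attrs` is a hypothetical name, and it takes plain dicts shaped like `list_objects_v2` response pages so that it runs without AWS access.

```python
def iter_object_attrs(bucket, pages):
    """Yield Key/ETag/Size dicts for every object in an iterable of pages."""
    for page in pages:
        # A page with no matching keys has no 'Contents' entry.
        for obj in page.get('Contents', []):
            yield {
                'Key': 's3://{}/{}'.format(bucket, obj['Key']),
                'ETag': obj['ETag'].replace('"', ''),
                'Size': obj['Size'],
            }


# Hand-written pages standing in for paginator.paginate(...) output.
fake_pages = [
    {'Contents': [{'Key': 'logs/a.gz', 'ETag': '"abc"', 'Size': 10}]},
    {},
    {'Contents': [{'Key': 'logs/b.gz', 'ETag': '"def"', 'Size': 20}]},
]
results = list(iter_object_attrs('my-bucket', fake_pages))
print(results)
```

Yielding lazily keeps memory use flat for large prefixes, since only one page of results is held at a time.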
robcowie / .gitignore_global
Created November 27, 2018 11:44
Global gitignore for discussion
# Compiled source #
###################
*.com
*.class
*.dll
*.exe
*.o
*.so
*.pyc
*.cache
robcowie / boto3_emr_cluster_definition.py
Created November 21, 2018 11:21
EMR cluster definition for boto3
CLUSTER_DEFINITION = {
    'Name': 'name',
    'Instances': {
        'InstanceGroups': [
            {
                'Name': 'Master',
                'Market': 'SPOT',
                'InstanceRole': 'MASTER',
                'BidPrice': '1',
                'InstanceType': 'r4.2xlarge',
robcowie / ip_anonymisation_bigquery.sql
Created July 19, 2018 15:25
Investigating IP anonymisation in BigQuery
#standardSQL
CREATE TEMPORARY FUNCTION anonIPToBytes(ip string) AS (
  -- remove the last 8 bits of an IPv4 address (32 - 8 = 24)
  NET.IP_TRUNC(NET.SAFE_IP_FROM_STRING(ip), 24)
  -- TODO: how to distinguish v4 and v6?
  -- remove the last 80 bits of an IPv6 address (128 - 80 = 48)
  -- NET.IP_TRUNC(NET.SAFE_IP_FROM_STRING(ip), 48)
);
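The TODO above asks how to tell IPv4 from IPv6. Outside BigQuery, the same truncation can be sketched with Python's standard `ipaddress` module, which exposes the address version directly. This is an illustration of the idea rather than a BigQuery solution: `anonymise_ip` is a hypothetical name, and the prefix lengths (24 and 48) are taken from the comments in the SQL.

```python
import ipaddress


def anonymise_ip(ip):
    """Zero the low bits of an address: keep /24 for IPv4, /48 for IPv6."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return None  # unparsable input, similar in spirit to SAFE_IP_FROM_STRING
    prefix = 24 if addr.version == 4 else 48
    # strict=False lets ip_network mask off the host bits for us.
    network = ipaddress.ip_network('{}/{}'.format(addr, prefix), strict=False)
    return str(network.network_address)


print(anonymise_ip('192.168.17.203'))
print(anonymise_ip('2001:db8:abcd:1234:5678:9abc:def0:1'))
print(anonymise_ip('not an ip'))
```

In BigQuery itself, the byte length of the parsed address (4 for IPv4, 16 for IPv6) may serve the same purpose as `addr.version` here; that would need checking against the NET function documentation.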
robcowie / bigquery_notes.md
Last active June 17, 2019 08:17
BigQuery Notes

BigQuery Notes

Require a partition filter on an existing table

bq update --require_partition_filter --time_partitioning_field ts -t page_impressions.raw

Copy a table

robcowie / date_range_n_days.py
Last active March 2, 2020 10:44
Generate date range from start date for N days
import datetime as dt
import operator as op

def date_iterator(from_date, days, reverse=False):
    """Yield `days` consecutive dates starting at `from_date`, stepping backwards if `reverse`."""
    func = op.sub if reverse else op.add
    return (func(from_date, dt.timedelta(days=d)) for d in range(days))

def date_range(from_date, to_date, inclusive=True):
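The preview cuts off at the `date_range` signature. A sketch of what a function with that signature might do follows, assuming it yields each date from `from_date` up to `to_date` and that `inclusive` controls whether `to_date` itself is included. This is a guess at the intent, not the author's implementation.

```python
import datetime as dt


def date_range(from_date, to_date, inclusive=True):
    """Yield consecutive dates from from_date up to to_date."""
    days = (to_date - from_date).days + (1 if inclusive else 0)
    # range() of a non-positive count is empty, so a reversed span yields nothing.
    return (from_date + dt.timedelta(days=d) for d in range(days))


start = dt.date(2020, 2, 27)
end = dt.date(2020, 3, 1)
print([d.isoformat() for d in date_range(start, end)])
```

Computing the day count once and reusing a generator keeps the shape consistent with `date_iterator` above it.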
robcowie / hdfs-ha-namenode-failback.sh
Created October 15, 2017 20:22 — forked from salekseev/hdfs-ha-namenode-failback.sh
Script that checks that both NameNodes are alive in an HDFS HA configuration and forces failover to the preferred NameNode
#!/bin/bash
#
# This script will check that both NameNodes are alive in HDFS HA
# configuration and will force failover to the preferred NameNode.
#
# Author: Stas Alekseev <me@salekseev.com>
#
ACTIVE_NAMENODE=nn1
STANDBY_NAMENODE=nn2
robcowie / distcp_examples.sh
Created June 20, 2017 19:10
distcp Examples
# export HADOOP_OPTS=-Xmx28G
export HADOOP_CLIENT_OPTS="-Xmx2048m"
hadoop distcp -p "hdfs://plat/data/level1/clicks/datehour=2016-12-*" "gs://data-events/data/level1/clicks/"