import sys
import random
import threading
import time
from queue import Queue
from collections import Counter
from flask import Flask
from gevent.pywsgi import WSGIServer
slotrans / lake_inventory_handler.py
Last active October 1, 2022 18:59
Lambda function to track data lake inventory
# If you have a data lake, you will often want to ask questions about what's in it, and the prefix-based object
# listings provided by S3 and all S3-alikes tightly constrain your ability to do so. A simple Lambda function like
# this (which can be adapted to the FaaS platform of your choice), together with an RDBMS, gives you a much more
# flexible way of asking meta-questions about what's in your lake.
# Relevant table schema, adjust names as you like...
#
# create table lake.inventory
# (
# inventory_id bigserial primary key
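The preview cuts off at the schema. As a hedged sketch of what such a handler might look like (the gist's actual body is not shown here), the following records S3 object-created events in the inventory table; psycopg2, the DSN environment variable, and every column other than inventory_id are assumptions:

import os
import psycopg2

# Hypothetical sketch, not the gist's code: insert one row per S3 object-created event.
def handler(event, context):
    conn = psycopg2.connect(os.environ['INVENTORY_DSN'])  # assumed env var
    try:
        with conn, conn.cursor() as cur:
            for record in event.get('Records', []):
                s3 = record['s3']
                cur.execute(
                    'insert into lake.inventory (bucket, key, size_bytes, event_time)'
                    ' values (%s, %s, %s, %s)',  # column names beyond inventory_id are assumed
                    (s3['bucket']['name'],
                     s3['object']['key'],
                     s3['object'].get('size'),
                     record['eventTime']),
                )
    finally:
        conn.close()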
slotrans / history_stuff.sql
Created August 6, 2021 23:50
Building blocks for generic history-keeping in Postgres.
/*
Replace "your_schema" with whatever schema is appropriate in your environment.
It is possible to use "public"... but you shouldn't!
*/
/*
Function to stamp a "modified" timestamp. Adjust the name to suit your environment,
but note that the name is hard-coded where it is used, so it is assumed that you use only _one_ such name.
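The preview stops before the function body. As an illustration only (not the gist's code), here is a Python sketch that installs a conventional "stamp modified" trigger function of the kind the comment describes; the schema name, function name, connection string, and the assumed modified timestamptz column are all placeholders:

import psycopg2

# Hypothetical: a conventional "stamp modified" trigger function for Postgres.
SET_MODIFIED_DDL = '''
create or replace function your_schema.set_modified()
returns trigger as $$
begin
    new.modified = now();  -- assumes a "modified" timestamptz column
    return new;
end;
$$ language plpgsql;
'''

with psycopg2.connect('dbname=yourdb') as conn:  # assumed connection string
    with conn.cursor() as cur:
        cur.execute(SET_MODIFIED_DDL)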
slotrans / dotfilter.py
Created February 17, 2018 20:55
A small script for filtering a DOT-language (Graphviz) graph file describing a DAG.
import sys
import argparse
import networkx
# pydot is also required
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Tool for filtering Graphviz/DOT directed graphs. Pass the source graph on STDIN, the filtered graph will be sent to STDOUT.')
parser.add_argument('nodes', metavar='node', nargs='+', help='One or more nodes to use for filtering, according to the chosen mode')
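The preview ends before the filtering logic, and the script's exact modes aren't shown here. As a hedged sketch of the core idea, one natural filter keeps the named nodes plus everything upstream and downstream of them, then emits the induced subgraph:

import sys
import networkx  # pydot is also required

# Hypothetical sketch of one filtering mode: keep the given nodes plus all of
# their ancestors and descendants, then write the induced subgraph as DOT.
def filter_dag(graph, nodes):
    keep = set(nodes)
    for node in nodes:
        keep |= networkx.ancestors(graph, node)
        keep |= networkx.descendants(graph, node)
    return graph.subgraph(keep)

if __name__ == '__main__':
    g = networkx.nx_pydot.read_dot(sys.stdin)
    filtered = filter_dag(g, sys.argv[1:])
    networkx.nx_pydot.write_dot(filtered, sys.stdout)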
slotrans / zb32.go
Last active February 24, 2023 17:34
package main
import (
"github.com/docopt/docopt-go"
"math/rand"
"fmt"
"os"
"strings"
"strconv"
"time"
slotrans / tasks_never_run.py
Last active June 6, 2017 00:46
simple-ish Airflow DAG for which some tasks never execute
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
default_args = {
'owner': 'noah.yetter',
'depends_on_past': False,
'start_date': datetime(2017, 5, 23),
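The preview ends inside default_args, so the gist's actual mechanism isn't shown. One classic way to get tasks that never execute, offered here only as a hedged illustration, is a per-task start_date in the future, which overrides the DAG-level default:

# Hypothetical continuation, not the gist's code: a future per-task start_date
# means the scheduler never creates runnable task instances for that task.
dag = DAG('tasks_never_run', default_args=default_args,
          schedule_interval=timedelta(days=1))

runs_fine = BashOperator(task_id='runs_fine', bash_command='echo ok', dag=dag)

never_runs = BashOperator(
    task_id='never_runs',
    bash_command='echo never',
    start_date=datetime(2099, 1, 1),  # overrides default_args; never reached
    dag=dag,
)

runs_fine >> never_runs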
slotrans / lambda_test_harness.js
Created December 27, 2014 21:01
Local test harness for AWS Lambda functions. I'm no JS programmer so this is probably horrible in some way or other, but it does appear to work. Note that this does NOT directly simulate the permissions of the function's execution role. It will run with whatever permissions belong to the AWS credentials you use, unless you run it on an EC2 insta…
var fs = require('fs');
// Lambda knows what region it's in but a local execution doesn't, so preload the SDK and set the region
// This will only work if the same variable name is used in the Lambda function file
var AWS = require('aws-sdk');
AWS.config.update({region: 'us-east-1'});
// validate arguments
if(process.argv.length < 4) {
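For comparison, here is a hedged Python analogue of the same idea: load a handler module, feed it an event read from a JSON file, and pin the region up front since a local run doesn't know it. The module path, handler name, and region are assumptions:

import importlib
import json
import sys

import boto3

# Hypothetical Python counterpart to the JS harness above.
if len(sys.argv) < 3:
    print('usage: {0} handler_module event.json'.format(sys.argv[0]))
    sys.exit(1)

# Lambda knows its region, a local run doesn't, so set it before invoking the handler.
boto3.setup_default_session(region_name='us-east-1')

module = importlib.import_module(sys.argv[1])
with open(sys.argv[2]) as f:
    event = json.load(f)

print(module.handler(event, None))  # 'handler' name is an assumption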
slotrans / mysqlconverter.py
Created October 25, 2013 21:52
Buffered converter for MySQL CSV exports done with \b\b\b line endings to work around MySQL's awful handling of embedded newlines. Extracted from larger MySQL->PostgreSQL converter so some of the variable names don't make sense.
import re
import sys
if len(sys.argv) != 3:
print "usage: {0} sourcefile targetfile".format(sys.argv[0])
sys.exit(1)
stage_first_filename = sys.argv[1]
stage_second_filename = sys.argv[2]
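The gist's conversion loop isn't shown past the argument handling. As a hedged sketch of the buffered idea: with \b\b\b as the record terminator, any literal newline in the stream must be embedded field data, so escape it, then rewrite the terminator, holding back a couple of bytes so a terminator split across reads survives:

# Hypothetical sketch, not the gist's code.
CHUNK_SIZE = 1024 * 1024  # buffered: never hold the whole export in memory

with open(stage_first_filename, 'rb') as src, \
     open(stage_second_filename, 'wb') as dst:
    carry = b''
    while True:
        chunk = src.read(CHUNK_SIZE)
        data = carry + chunk
        if chunk:
            # hold back 2 bytes in case a 3-byte terminator straddles the boundary
            data, carry = data[:-2], data[-2:]
        else:
            carry = b''  # end of input: flush everything that remains
        data = data.replace(b'\n', b'\\n')         # embedded newlines -> escaped
        dst.write(data.replace(b'\b\b\b', b'\n'))  # terminators -> real newlines
        if not chunk:
            break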
slotrans / s3_multipart_upload.py
Created March 27, 2013 21:26
Quick python script for multipart s3 file uploads, in case you need to upload a file larger than 5GB. Split up your file using something like /usr/bin/split, then invoke this as s3_multipart_upload.py targetbucket targetfilename part1 part2 part3 (or a glob like part*). AWS credentials are taken from the environment variables AWS_ACCESS_KEY_ID a…
import boto
import sys
bucketname = sys.argv[1]
filename = sys.argv[2]
parts = sys.argv[3:]
print('target=s3://{0}/{1}'.format(bucketname, filename))
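The preview is cut off here; this 2013 gist predates boto3, and in boto 2 the remainder is presumably initiate / upload parts / complete, along the lines of this hedged continuation:

# Hypothetical continuation, not the gist's code. Credentials come from
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY in the environment.
conn = boto.connect_s3()
bucket = conn.get_bucket(bucketname)

mp = bucket.initiate_multipart_upload(filename)
try:
    for i, part in enumerate(parts, start=1):  # part numbers are 1-based
        print('uploading part {0}: {1}'.format(i, part))
        with open(part, 'rb') as fp:
            mp.upload_part_from_file(fp, part_num=i)
    mp.complete_upload()
except Exception:
    mp.cancel_upload()  # avoid orphaned parts accruing storage charges
    raise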