@cupdike
cupdike / PollingFileDownloader.py
Created October 6, 2015 17:45
Polls a file hosted at a URL and downloads it initially and if it changes.
"""Polls a file hosted at a URL and downloads it initially and if it changes."""
# Should be fairly robust to web server issues (in fact, it would only
# be a handful of lines were it not for error handling)
import requests
import time
import sys
FILE_URL = "http://<mywebserver>/<myfile>"
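The preview above stops before the polling loop, so the gist's own change-detection method is not shown. As a minimal sketch of the idea in the description (download once, then again only when the file changes), the loop below compares a SHA-256 digest of each fetched body. The names `poll_for_changes`, `fetch`, and `on_change` are hypothetical, and the fetch step is passed in as a callable so the sketch runs without a web server; in practice it might wrap something like `requests.get(FILE_URL).content`.

```python
import hashlib
import time


def poll_for_changes(fetch, on_change, interval_seconds=60, max_polls=None,
                     sleep=time.sleep):
    """Call fetch() repeatedly and call on_change(content) on the first
    successful fetch and whenever the content differs from the last one seen.

    fetch     -- hypothetical callable returning the file body as bytes
    on_change -- hypothetical callable that saves or processes the new body
    max_polls -- stop after this many polls (None means poll forever)
    """
    last_digest = None
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        try:
            content = fetch()
        except Exception as exc:
            # Assumption: any fetch error is treated as transient and retried
            # on the next poll instead of ending the loop.
            print(f"fetch failed, will retry: {exc}")
        else:
            digest = hashlib.sha256(content).hexdigest()
            if digest != last_digest:
                last_digest = digest
                on_change(content)
        if max_polls is None or polls < max_polls:
            sleep(interval_seconds)
```

Hashing the body is only one option; a server that sends `ETag` or `Last-Modified` headers would allow cheaper checks, but that depends on the server and is not assumed here.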
@cupdike
cupdike / quicksort.py
Last active October 6, 2015 17:46
Basic quicksort impl, inspired by http://me.dt.in.th/page/Quicksort/
# Inspired by: http://me.dt.in.th/page/Quicksort/
def quicksort(l):
    if len(l) < 2:
        return l
    iSwap = 1
    pivot = l[0] # left most value is the pivot
    for i, val in enumerate(l[1:], start=1): # Skip the pivot cell
        if val < pivot:
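The preview is cut off inside the partition loop, so the rest of the gist's implementation is not shown. The following is a separate, self-contained sketch of one common way to finish a leftmost-pivot quicksort; `quicksort_sketch` and `swap_index` are hypothetical names, and the sketch works on a copy instead of sorting in place.

```python
def quicksort_sketch(items):
    """Return a sorted list, using the leftmost value as the pivot."""
    items = list(items)  # work on a copy so the caller's list is untouched
    if len(items) < 2:
        return items
    pivot = items[0]
    swap_index = 1  # next slot for a value smaller than the pivot
    for i in range(1, len(items)):  # skip the pivot cell
        if items[i] < pivot:
            items[i], items[swap_index] = items[swap_index], items[i]
            swap_index += 1
    # items[1:swap_index] now holds values < pivot, items[swap_index:] the rest
    left = quicksort_sketch(items[1:swap_index])
    right = quicksort_sketch(items[swap_index:])
    return left + [pivot] + right
```

Slicing and concatenation keep the sketch short at the cost of extra allocations; an in-place version would recurse on index ranges instead.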
@cupdike
cupdike / BeelineJarDependencyFinder
Created July 12, 2017 20:12
Bash commands that list the jars needed to run Beeline without installing Hive
# If you want to run Beeline without installing Hive...
# This will help you find the jars that you need:
# Ref: https://pvillaflores.wordpress.com/2017/04/30/installing-and-running-beeline-client/
# Turn on verbose classloading
$ export _JAVA_OPTIONS=-verbose:class
# Run beeline and process out the needed jars.
# Below assumes the hadoop jars are under a 'cloudera' path (adjust accordingly)
$ /usr/bin/beeline | tr '[' '\n' | tr ']' ' ' | grep jar | grep cloudera | grep -v checksum | awk '{last=split($0,a,"/"); print a[last]}' | sort | uniq
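For readers who would rather post-process the captured output in Python, the sketch below applies roughly the same filtering as the pipeline above: keep tokens that mention a jar under a `cloudera` path, drop checksum lines, and print unique file names. The function name `jar_names` is hypothetical, and the sample line format used to exercise it is an assumption about what `-verbose:class` prints, so check it against real output.

```python
import re


def jar_names(verbose_output, path_filter="cloudera"):
    """Return the sorted, de-duplicated jar file names found in JVM
    verbose class-loading output (text captured from running beeline)."""
    names = set()
    # Split on brackets and whitespace, similar in spirit to the tr calls.
    for token in re.split(r"[\[\]\s]+", verbose_output):
        if "jar" in token and path_filter in token and "checksum" not in token:
            # Keep only the last path component, like the awk split on "/".
            names.add(token.rsplit("/", 1)[-1])
    return sorted(names)
```

Unlike the shell pipeline, this filters per token instead of per line, so results may differ slightly on unusual output.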
@cupdike
cupdike / DeleteAllDagruns.py
Created September 20, 2018 16:07
Use Airflow's ORM to delete all DagRuns. SQLAlchemy filtering could be added to delete only a subset. This was with Airflow 1.8.
from airflow.models import DagRun
from sqlalchemy import *
from airflow import settings
session = settings.Session()
session.query(DagRun).delete()
session.commit()
@cupdike
cupdike / gist:c5554233e1dd6b233a9b6ec6adb05c5a
Created November 1, 2018 20:59
Python function to round down minutes to a user specified resolution
from datetime import datetime, timedelta
def round_minutes(dt, resolutionInMinutes):
    """round_minutes(datetime, resolutionInMinutes) => datetime rounded to lower interval
    Works for minute resolution up to a day (e.g. cannot round to nearest week).
    """
    # First zero out seconds and micros
    dtTrunc = dt.replace(second=0, microsecond=0)
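The preview ends after the truncation step, so the gist's actual rounding arithmetic is not shown. One way to round down to a resolution, sketched here under the assumption that intervals are counted from midnight, is to subtract the remainder of minutes-since-midnight divided by the resolution. The name `round_minutes_down` is hypothetical.

```python
from datetime import datetime, timedelta


def round_minutes_down(dt, resolution_in_minutes):
    """Return dt rounded down to a multiple of resolution_in_minutes,
    counted from midnight of the same day."""
    # Zero out seconds and microseconds first.
    truncated = dt.replace(second=0, microsecond=0)
    minutes_since_midnight = truncated.hour * 60 + truncated.minute
    excess = minutes_since_midnight % resolution_in_minutes
    return truncated - timedelta(minutes=excess)
```

For example, `round_minutes_down(datetime(2018, 11, 1, 20, 59, 30), 15)` gives 20:45 on the same day. Because the count restarts at midnight, resolutions longer than a day are not meaningful, matching the limit stated in the docstring above.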
@cupdike
cupdike / shErrorCode255Tip.txt
Created March 27, 2019 21:15
sh.ErrorReturnCode_255 using Python sh package
If you are trying to run a script like this:
    import sh
    myScriptCommand = sh.Command("/path/to/script")
    myScriptCommand("my arg")
and you see this error:
    sh.ErrorReturnCode_255
@cupdike
cupdike / CombiningPythonGenerators.txt
Created October 17, 2019 14:30
Combine Python Generators Into One Generator
>>> def genX():
...     for i in range(3):
...         yield i
...
>>> for i in genX(): print(i)
...
0
1
2
>>> def genY():
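The session above is cut off before the generators are combined, so the gist's own approach is not shown. Two standard ways to do it are `yield from` inside a wrapper generator and `itertools.chain`; the sketch below shows both. The body of `gen_y` and the name `combined` are hypothetical stand-ins.

```python
from itertools import chain


def gen_x():
    for i in range(3):
        yield i


def gen_y():
    # Hypothetical second generator; the original genY body is not shown.
    for i in range(10, 13):
        yield i


def combined():
    """Yield everything from gen_x, then everything from gen_y."""
    yield from gen_x()
    yield from gen_y()


# Equivalent without a wrapper function: chain the generator objects.
chained = chain(gen_x(), gen_y())
```

Both forms are lazy: values are produced only as the combined generator is consumed.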
@cupdike
cupdike / pyarrowKerberizedHdfsDebugger.py
Created December 12, 2019 17:16
Helps debug connecting PyArrow to Kerberized HDFS. Getting it working took some effort: the guidance found on the web isn't always helpful, and the driver doesn't always surface useful error messages. This script lets you experiment with drivers, LIBJVM_PATH, LD_LIBRARY_PATH, CLASSPATH, and HADOOP_HOME.
import pyarrow
import os
import sh
# Get obscure error without this: pyarrow.lib.ArrowIOError: HDFS list directory failed, errno: 2 (No such file or directory)
os.environ['CLASSPATH'] = str(sh.hadoop('classpath','--glob'))
# Not needed
#os.environ['HADOOP_HOME'] = '/opt/cloudera/parcels/CDH-<your version>/'
@cupdike
cupdike / gist:2d3ce5b3aa31a77f6b27d400d7c531b9
Created March 27, 2020 14:24
Python string.partition() example
# Demonstrates string.partition() to split a string by a sequence of delimiters.
# Not terribly useful, can do with regex pretty easily.
s = "apple AND banana AND cherry AND date OR elderberry BUT fig"
delims = [" AND "]*3 + [" OR ", " BUT "]
# [' AND ', ' AND ', ' AND ', ' OR ', ' BUT ']
def splitByDelimList(str, delimList):
    delims = delimList.copy()
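The function body is cut off in this preview. As a separate sketch of the technique named in the comments (splitting on a sequence of delimiters with `str.partition()`), the version below consumes one delimiter at a time and keeps the text before it. The name `split_by_delim_list` and the choice to stop at the first delimiter that is not found are assumptions, not the gist's confirmed behavior.

```python
def split_by_delim_list(text, delim_list):
    """Split text on each delimiter in delim_list, in order, using
    str.partition(). Stops early if a delimiter is not found."""
    parts = []
    remainder = text
    for delim in delim_list:
        head, sep, tail = remainder.partition(delim)
        if not sep:
            break  # delimiter missing: leave the rest unsplit
        parts.append(head)
        remainder = tail
    parts.append(remainder)
    return parts


s = "apple AND banana AND cherry AND date OR elderberry BUT fig"
delims = [" AND "] * 3 + [" OR ", " BUT "]
fruits = split_by_delim_list(s, delims)
```

With the sample string and delimiter list above, `fruits` holds the six fruit names in their original order.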
@cupdike
cupdike / ConnectionSetup.txt
Last active August 19, 2020 16:05
Airflow Connection to Remote Kerberized Hive Metastore
# Let's say this is your kerberos ticket (likely from a keytab used for the remote service):
Ticket cache: FILE:/tmp/airflow_krb5_ccache
Default principal: hive/myserver.myrealm@myrealm
Valid starting       Expires              Service principal
06/14/2018 17:52:05  06/15/2018 17:49:35  krbtgt/myrealm@myrealm
        renew until 06/17/2018 05:49:33