cupdike /
Created Dec 12, 2019
Helps debug connecting PyArrow to Kerberized HDFS. Getting this working took some doing, and the guidance found on the web isn't always helpful; useful error messages don't always bubble up from the driver. This lets you experiment with drivers, LIBJVM_PATH, LD_LIBRARY_PATH, CLASSPATH, and HADOOP_HOME.
import pyarrow
import os
import sh
# Get obscure error without this: pyarrow.lib.ArrowIOError: HDFS list directory failed, errno: 2 (No such file or directory)
os.environ['CLASSPATH'] = str(sh.hadoop('classpath','--glob'))
# Not needed
#os.environ['HADOOP_HOME'] = '/opt/cloudera/parcels/CDH-<your version>/'
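If you'd rather not depend on the third-party sh package, the same CLASSPATH setup can be sketched with the stdlib subprocess module (the function name is mine; hadoop must be on your PATH):

```python
import os
import subprocess

def set_hadoop_classpath(hadoop_cmd="hadoop"):
    """Set CLASSPATH from `hadoop classpath --glob` so libhdfs can locate the jars."""
    out = subprocess.run([hadoop_cmd, "classpath", "--glob"],
                         capture_output=True, text=True, check=True)
    # sh/subprocess output carries a trailing newline; strip it before exporting
    os.environ["CLASSPATH"] = out.stdout.strip()
    return os.environ["CLASSPATH"]
```

With CLASSPATH populated, a pyarrow.hdfs.connect(...) attempt is much more likely to surface a real Kerberos error instead of the errno 2 above.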
cupdike / CombiningPythonGenerators.txt
Created Oct 17, 2019
Combine Python Generators Into One Generator
>>> def genX():
...     for i in range(3):
...         yield i
...
>>> for i in genX(): print(i)
0
1
2
>>> def genY():
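The preview cuts off at genY. The usual ways to turn several generators into one are itertools.chain or a wrapper generator delegating with yield from; a sketch (genY's body here is my assumption, since the preview truncates before it):

```python
from itertools import chain

def genX():
    for i in range(3):
        yield i

def genY():  # assumed body -- the gist preview cuts off here
    for c in "ab":
        yield c

# Option 1: itertools.chain lazily yields from genX, then genY
print(list(chain(genX(), genY())))  # [0, 1, 2, 'a', 'b']

# Option 2: a wrapper generator delegating with `yield from`
def genXY():
    yield from genX()
    yield from genY()

print(list(genXY()))  # [0, 1, 2, 'a', 'b']
```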
cupdike / shErrorCode255Tip.txt
Created Mar 27, 2019
sh.ErrorReturnCode_255 using Python sh package
If you are trying to run a script like this:

    import sh
    myScriptCommand = sh.Command("/path/to/script")
    myScriptCommand("my arg")

and you see this error:
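The error in the title simply means the child process exited with status 255; sh wraps each nonzero exit in an ErrorReturnCode_<n> exception, and the real message is usually on the exception's .stderr. A stdlib reproduction of the exit-status part (assuming /bin/sh exists):

```python
import subprocess

# Mimic a script that fails the way sh reports as ErrorReturnCode_255
result = subprocess.run(["/bin/sh", "-c", "echo boom >&2; exit 255"],
                        capture_output=True, text=True)
print(result.returncode)      # 255
print(result.stderr.strip())  # boom
```

With sh itself, wrap the call in `try/except sh.ErrorReturnCode as e` and inspect `e.stderr` for the underlying cause.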
cupdike / gist:c5554233e1dd6b233a9b6ec6adb05c5a
Created Nov 1, 2018
Python function to round down minutes to a user specified resolution
from datetime import datetime, timedelta

def round_minutes(dt, resolutionInMinutes):
    """round_minutes(datetime, resolutionInMinutes) => datetime rounded to lower interval

    Works for minute resolution up to a day (e.g. cannot round to nearest week).
    """
    # First zero out seconds and micros
    dtTrunc = dt.replace(second=0, microsecond=0)
    # Then drop the minutes past the most recent resolution boundary
    minutesPastBoundary = (dtTrunc.hour * 60 + dtTrunc.minute) % resolutionInMinutes
    return dtTrunc - timedelta(minutes=minutesPastBoundary)
cupdike /
Created Sep 20, 2018
Use Airflow's ORM to delete all DagRuns. Could also use sqlalchemy filtering if desired. This was with Airflow 1.8.
from airflow.models import DagRun
from sqlalchemy import *  # only needed if you add filtering below
from airflow import settings

session = settings.Session()
# Bulk-delete every DagRun row; chain a .filter(...) before .delete() to narrow it
session.query(DagRun).delete()
session.commit()
session.close()
cupdike / ConnectionSetup.txt
Last active Oct 30, 2019
Airflow Connection to Remote Kerberized Hive Metastore
# Let's say this is your kerberos ticket (likely from a keytab used for the remote service):
Ticket cache: FILE:/tmp/airflow_krb5_ccache
Default principal: hive/myserver.myrealm@myrealm

Valid starting       Expires              Service principal
06/14/2018 17:52:05  06/15/2018 17:49:35  krbtgt/myrealm@myrealm
        renew until 06/17/2018 05:49:33
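The preview cuts off here. In Airflow 1.x the metastore hook reads authMechanism and kerberos_service_name from the connection's extra JSON, so the matching connection setup looks roughly like this (conn_id, host, and port are illustrative, not from the gist):

```shell
# Point Airflow at the ticket cache shown above, then register the connection
export KRB5CCNAME=/tmp/airflow_krb5_ccache
airflow connections --add \
    --conn_id hive_metastore_kerb \
    --conn_type hive_metastore \
    --conn_host myserver.myrealm \
    --conn_port 9083 \
    --conn_extra '{"authMechanism": "GSSAPI", "kerberos_service_name": "hive"}'
```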
cupdike / AirflowBeelineConnectionSample
Created Jun 13, 2018
Airflow Beeline Connection Using Kerberos via CLI
### There aren't many good examples of how to do this when also using kerberos
(venv) [airflow@cray01 dags]$ airflow connections --add \
--conn_id beeline_hive \
--conn_type 'beeline' \
--conn_host '' \
--conn_port 10000 \
--conn_extra '{"use_beeline": true, "auth":"kerberos;principal=mysvcname/myservicehost@MYDOMAIN.COM;"}'
### Then, a sample DAG to use it
cupdike / BeelineJarDependencyFinder
Created Jul 12, 2017
Bash commands that will provide the list of jars needed to run beeline without installing hive
# If you want to run Beeline without installing Hive...
# This will help you find the jars that you need:
# Ref:
# Turn on verbose classloading
$ export _JAVA_OPTIONS=-verbose:class
# Run beeline and process out the needed jars.
# Below assumes the hadoop jars are under a 'cloudera' path (adjust accordingly)
$ /usr/bin/beeline | tr '[' '\n' | tr ']' ' ' | grep jar | grep cloudera | grep -v checksum | awk '{last=split($0,a,"/"); print a[last]}' | sort | uniq
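The same extraction can be done in Python once the verbose-classloading output is captured; this helper assumes the classic `[Loaded <class> from <path>]` line format and the cloudera-path filter used above (the function name is mine):

```python
import re

def loaded_jars(log_text, path_filter="cloudera"):
    """Return sorted unique jar basenames from -verbose:class output."""
    jars = set()
    for m in re.finditer(r"\[Loaded \S+ from (\S+)\]", log_text):
        path = m.group(1)
        # Keep only jar paths matching the filter, dropping checksum entries
        if path.endswith(".jar") and path_filter in path and "checksum" not in path:
            jars.add(path.rsplit("/", 1)[-1])
    return sorted(jars)

sample = (
    "[Loaded org.apache.hive.jdbc.HiveDriver from file:/opt/cloudera/jars/hive-jdbc.jar]\n"
    "[Loaded java.lang.String from /usr/lib/jvm/jre/lib/rt.jar]\n"
)
print(loaded_jars(sample))  # ['hive-jdbc.jar']
```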
cupdike /
Created Oct 6, 2015
Polls a file hosted at a URL and downloads it initially and if it changes.
"""Polls a file hosted at a URL and downloads it initially and if it changes."""
# Should be fairly robust to web server issues (in fact, it would only
# be a handful of lines were it not for error handling)
import requests
import time
import sys
FILE_URL = "http://<mywebserver>/<myfile>"
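The preview stops before the polling loop. The "download again only if it changed" check can be sketched with a content hash; this is my sketch of that idea, not necessarily how the gist implements it (it may rely on HTTP headers instead):

```python
import hashlib

def content_changed(new_bytes, last_digest):
    """Compare a fresh download against the last-seen digest.

    Returns (changed, digest) so the caller can store the new digest.
    """
    digest = hashlib.sha256(new_bytes).hexdigest()
    return digest != last_digest, digest

changed, d = content_changed(b"version 1", None)     # first poll: always "changed"
changed_again, _ = content_changed(b"version 1", d)  # same content: unchanged
```

A polling loop would sleep, fetch FILE_URL with requests.get, call this helper, and write the file to disk only when changed is True.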
# Inspired by:
def quicksort(l):
    if len(l) < 2:
        return l
    iSwap = 1
    pivot = l[0]  # left most value is the pivot
    for i, val in enumerate(l[1:], start=1):  # Skip the pivot cell
        if val < pivot:
            # Swap smaller values into the left partition
            l[i], l[iSwap] = l[iSwap], l[i]
            iSwap += 1
    # Move the pivot between the partitions, then recurse on each side
    l[0], l[iSwap - 1] = l[iSwap - 1], l[0]
    return quicksort(l[:iSwap - 1]) + [l[iSwap - 1]] + quicksort(l[iSwap:])