Graham Hukill ghukill

  • MIT Libraries

This is a test, this is only a test.

@ghukill
ghukill / mmdp_regex_workshop_doc_example.txt
Created March 1, 2017 15:33
MMDP Regex Workshop: Document Example
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce erat erat, venenatis non luctus vel, pulvinar at ligula. In cursus, lectus nec fermentum pharetra, urna tellus molestie leo, sit amet dignissim enim dolor et nunc. Nunc nulla turpis, laoreet sed augue ut, sagittis finibus leo. Cras gravida, nunc quis pretium sagittis, odio eros maximus nunc, hendrerit posuere ex lacus ac sem. Suspendisse sit amet augue rutrum, tempor eros eget, bibendum elit. Curabitur vitae nibh dignissim, vulputate leo ut, laoreet massa. Phasellus vitae augue sit amet ligula congue ornare. Nulla ligula erat, euismod id dui accumsan, commodo viverra quam.
Etiam ac neque vitae leo luctus lacinia at a mi. Nam vitae mi interdum, accumsan ante sed, commodo lectus. Donec in tristique eros. Nam tempus suscipit ligula, ut sodales lorem pretium eget. Praesent vel ante at lorem ullamcorper sodales. Maecenas pellentesque maximus felis, id fringilla lacus finibus non. Praesent eu elementum libero, nec bibendum metus. Maecenas vitae dictum
ghukill / toothygrin.md
Last active March 12, 2017 06:41
toothygrin

"So I think I can get an appointment with the doctor," he said with a single-toothed grin. You see, the two front teeth were warped around each other like two gummy bears a little kid tried to twist together -- butt to butt -- but worried about that kind of structural torque, instead, ruminated on radians and felt the oneness.

ghukill / mmdp_regex_letter.txt
Last active March 23, 2017 02:02
MMDP Regex Workshop - Letter
3/22/2017
Dear Ivy Wallvent,
Our upcoming retreat of the Coffee Appreciation Society (CAS) has been scheduled for 12/12/2017. I realize that's quite a ways out at this point -- only spring now -- but we in CAS believe in being prepared and look forward to these annual retreats.
A bit of housekeeping. We have officially split from our sister group, Coffee And Society Unlimited (CASU). The split was peaceable.
Our last retreat, 12/04/16, was great! We roasted beans, discussed cups & saucers, explored the nuances of different brewing methods, and otherwise thoroughly imbibed in that lovely caffeinated treat. It was a welcome respite after a long fall.
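Since the letter is a regex workshop sample, here is a minimal sketch of the kind of patterns it lends itself to. The two patterns are illustrative assumptions, not taken from the workshop materials: one matches the slash-separated dates, the other matches "CAS" as a whole word so that "CASU" is left out.

```python
import re

# Excerpts from the letter above (shortened for the example).
letter = (
    "3/22/2017\n"
    "Our upcoming retreat of the Coffee Appreciation Society (CAS) has been "
    "scheduled for 12/12/2017.\n"
    "We have officially split from our sister group, Coffee And Society "
    "Unlimited (CASU).\n"
    "Our last retreat, 12/04/16, was great!\n"
)

# Assumed date shape: one or two digits for month and day,
# then a four-digit or two-digit year, separated by slashes.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})\b")

# "CAS" bounded on both sides, so the longer acronym "CASU" does not match.
CAS_PATTERN = re.compile(r"\bCAS\b")

dates = DATE_PATTERN.findall(letter)
cas_mentions = CAS_PATTERN.findall(letter)

print(dates)
print(cas_mentions)
```

The word boundary `\b` does the work in both patterns: it keeps the date match from starting or ending inside a longer run of digits, and it separates "CAS" from "CASU".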
hey there!
# small script to split pages when the desired midpoint drifts over the course of a set of images
# requires imagemagick, specifically the "convert" command
import os
import sys
def split_images(files, start_percentage, end_percentage, start_page, end_page):
    # determine percentage bump
ghukill / problematic_avro_bytes.avro
Created September 18, 2017 16:02
Problematic Avro File
Obj\x01\x04\x16avro.schema\xd2\x14{"type":"record","name":"topLevelRecord","fields":[{"name":"set","type":[{"type":"record","name":"set","fields":[{"name":"id","type":["string","null"]},{"name":"document","type":["string","null"]},{"name":"setSource","type":[{"type":"record","name":"setSource","fields":[{"name":"queryParams","type":[{"type":"map","values":["string","null"]},"null"]},{"name":"url","type":["string","null"]},{"name":"text","type":["string","null"]}]},"null"]}]},"null"]},{"name":"record","type":[{"type":"record","name":"record","fields":[{"name":"id","type":["string","null"]},{"name":"document","type":["string","null"]},{"name":"setIds","type":[{"type":"array","items":["string","null"]},"null"]},{"name":"recordSource","type":[{"type":"record","name":"recordSource","fields":[{"name":"queryParams","type":[{"type":"map","values":["string","null"]},"null"]},{"name":"url","type":["string","null"]},{"name":"text","type":["string","null"]}]},"null"]}]},"null"]},{"name":"error","type":[{"type":"record","
ghukill / problematic_avro_base64.avro
Created September 18, 2017 16:08
Problematic Avro File (Base64)
T2JqAQQWYXZyby5zY2hlbWHSFHsidHlwZSI6InJlY29yZCIsIm5hbWUiOiJ0b3BMZXZlbFJlY29yZCIsImZpZWxkcyI6W3sibmFtZSI6InNldCIsInR5cGUiOlt7InR5cGUiOiJyZWNvcmQiLCJuYW1lIjoic2V0IiwiZmllbGRzIjpbeyJuYW1lIjoiaWQiLCJ0eXBlIjpbInN0cmluZyIsIm51bGwiXX0seyJuYW1lIjoiZG9jdW1lbnQiLCJ0eXBlIjpbInN0cmluZyIsIm51bGwiXX0seyJuYW1lIjoic2V0U291cmNlIiwidHlwZSI6W3sidHlwZSI6InJlY29yZCIsIm5hbWUiOiJzZXRTb3VyY2UiLCJmaWVsZHMiOlt7Im5hbWUiOiJxdWVyeVBhcmFtcyIsInR5cGUiOlt7InR5cGUiOiJtYXAiLCJ2YWx1ZXMiOlsic3RyaW5nIiwibnVsbCJdfSwibnVsbCJdfSx7Im5hbWUiOiJ1cmwiLCJ0eXBlIjpbInN0cmluZyIsIm51bGwiXX0seyJuYW1lIjoidGV4dCIsInR5cGUiOlsic3RyaW5nIiwibnVsbCJdfV19LCJudWxsIl19XX0sIm51bGwiXX0seyJuYW1lIjoicmVjb3JkIiwidHlwZSI6W3sidHlwZSI6InJlY29yZCIsIm5hbWUiOiJyZWNvcmQiLCJmaWVsZHMiOlt7Im5hbWUiOiJpZCIsInR5cGUiOlsic3RyaW5nIiwibnVsbCJdfSx7Im5hbWUiOiJkb2N1bWVudCIsInR5cGUiOlsic3RyaW5nIiwibnVsbCJdfSx7Im5hbWUiOiJzZXRJZHMiLCJ0eXBlIjpbeyJ0eXBlIjoiYXJyYXkiLCJpdGVtcyI6WyJzdHJpbmciLCJudWxsIl19LCJudWxsIl19LHsibmFtZSI6InJlY29yZFNvdXJjZSIsInR5cGUiOlt7InR5cGUiOiJyZWNvcmQiLCJuYW1lIjoicmVjb3JkU291
ghukill / gist:7a82c3ce5041edb76810ad85f27315cf
Last active November 29, 2017 14:03
spark worker java heap space
Java HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGTERM to handler- the VM may need to be forcibly terminated
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "shuffle-server-0"
17/11/29 13:34:58 INFO jdbc.JDBCRDD: closed connection
17/11/29 13:34:58 ERROR executor.Executor: Exception in task 0.2 in stage 23.0 (TID 920)
java.lang.OutOfMemoryError: Java heap space
	at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3418)
	at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3365)
	at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3805)
	at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:871)
ghukill / rdd_subsets.py
Last active November 30, 2017 19:25
RDD subsets with zipWithIndex()
def rdd_subset(rdd, chunk_size_limit=10000):
    '''
    Small method to create subsets of a pyspark RDD.
    Achieved by zipping the input RDD with .zipWithIndex(),
    accepting a chunk size not to exceed, and returning lazily evaluated
    RDDs with nearly evenly distributed subsets.
    Note: This can be quite inefficient, as each time an RDD is used from the
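The preview above cuts off before the function body, so the following is only a sketch of the arithmetic the docstring describes (a chunk size not to exceed, nearly even subsets), written in plain Python without Spark. The helper name `chunk_bounds` and its return shape are assumptions made for illustration, not the gist's implementation.

```python
import math


def chunk_bounds(total, chunk_size_limit=10000):
    """Return (start, end) index ranges, end exclusive, covering range(total)
    in nearly equal chunks, none larger than chunk_size_limit.

    Hypothetical helper: illustrates the chunking arithmetic only.
    """
    if total <= 0:
        return []
    # Fewest chunks that keep every chunk at or under the limit.
    num_chunks = math.ceil(total / chunk_size_limit)
    # Spread the remainder over the first `extra` chunks so sizes differ by at most 1.
    base, extra = divmod(total, num_chunks)
    bounds = []
    start = 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


print(chunk_bounds(25, chunk_size_limit=10))
```

With a PySpark RDD, each `(start, end)` range could presumably be applied as a filter on the index that `zipWithIndex()` attaches to every element, which yields one lazily evaluated subset per range.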