Graham Hukill ghukill

  • MIT Libraries
@ghukill
ghukill / any_content.rng
Created February 5, 2018 17:26
Recursive allow hack for RELAX NG
<?xml version="1.0" encoding="UTF-8"?>
<grammar
xmlns="http://relaxng.org/ns/structure/1.0"
xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0"
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
<start>
<choice>
<ref name="any_content"/>
</choice>
</start>
ghukill / combine_bootstrap.sql
Created January 23, 2018 13:01
MySQL bootstrap for Combine
-- MySQL dump 10.13 Distrib 5.7.20, for Linux (x86_64)
--
-- Host: localhost Database: combine
-- ------------------------------------------------------
-- Server version 5.7.20-0ubuntu0.16.04.1
/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
ghukill / pyvs.py
Created December 19, 2017 16:36
Example python record validation functions for Combine
'''
You can import almost any Python library you'd like up here, and then use it within the checking functions
'''
import re
'''
Each function is its own "test" when validating against a record.
ghukill / mem_prof.json
Created December 10, 2017 17:39
linux mint memory profile prefs
{
"labelsOn": false,
"refreshRate": 500,
"labelColor": [
0.9333333333333333,
0.9333333333333333,
0.9254901960784314,
1
],
"backgroundColor": [
ghukill / hsb_eng_subs.js
Last active December 10, 2017 00:54
English subs for Hela Sverige bakar (All of Sweden Bakes)
/*
To use:
- begin playback
- make sure Google Chrome page translation is turned on
- turn on video player subs (will be Swedish with flickering English)
- paste code below into JS console
- enjoy!
*/
// create div for subtitles
ghukill / rdd_subsets.py
Last active November 30, 2017 19:25
RDD subsets with zipWithIndex()
def rdd_subset(rdd, chunk_size_limit=10000):
    '''
    Small method to create subsets of a PySpark RDD.
    Achieved by zipping the input RDD with .zipWithIndex(),
    accepting a chunk size not to exceed, and returning lazily evaluated
    RDDs with nearly evenly distributed subsets.
    Note: This can be quite inefficient, as each time an RDD is used from the
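The preview above is cut off before the function body, so the following is only a rough sketch of the index arithmetic the docstring describes, not the gist's actual implementation. It uses plain Python, with `enumerate` standing in for `.zipWithIndex()`; the helper names `chunk_bounds` and `subset_by_index` are hypothetical. In PySpark, each `(start, end)` range would presumably become a filter on the index half of the `(element, index)` pairs.

```python
import math


def chunk_bounds(total, chunk_size_limit=10000):
    """Return (start, end) index ranges of nearly equal size.

    No range is larger than chunk_size_limit. Hypothetical helper,
    not taken from the original gist.
    """
    if total <= 0:
        return []
    # Smallest number of chunks that keeps every chunk within the limit.
    num_chunks = math.ceil(total / chunk_size_limit)
    # Spread the remainder over the first `extra` chunks.
    base, extra = divmod(total, num_chunks)
    bounds = []
    start = 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def subset_by_index(records, chunk_size_limit=10000):
    """Lazily yield subsets of `records`, selected by positional index."""
    # Mirrors the (element, index) pairs that zipWithIndex() produces.
    indexed = [(rec, idx) for idx, rec in enumerate(records)]
    for start, end in chunk_bounds(len(indexed), chunk_size_limit):
        # Each subset re-scans the indexed data, which echoes the
        # inefficiency the docstring warns about.
        yield [rec for rec, idx in indexed if start <= idx < end]
```

For example, 25 records with a limit of 10 would be split into three subsets rather than two full chunks and one short one.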
ghukill / gist:7a82c3ce5041edb76810ad85f27315cf
Last active November 29, 2017 14:03
spark worker java heap space
Java HotSpot(TM) 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError occurred dispatching signal SIGTERM to handler- the VM may need to be forcibly terminated
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "shuffle-server-0"
17/11/29 13:34:58 INFO jdbc.JDBCRDD: closed connection
17/11/29 13:34:58 ERROR executor.Executor: Exception in task 0.2 in stage 23.0 (TID 920)
java.lang.OutOfMemoryError: Java heap space
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3418)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3365)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3805)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:871)
ghukill / problematic_avro_base64.avro
Created September 18, 2017 16:08
Problematic Avro File (Base64)
T2JqAQQWYXZyby5zY2hlbWHSFHsidHlwZSI6InJlY29yZCIsIm5hbWUiOiJ0b3BMZXZlbFJlY29yZCIsImZpZWxkcyI6W3sibmFtZSI6InNldCIsInR5cGUiOlt7InR5cGUiOiJyZWNvcmQiLCJuYW1lIjoic2V0IiwiZmllbGRzIjpbeyJuYW1lIjoiaWQiLCJ0eXBlIjpbInN0cmluZyIsIm51bGwiXX0seyJuYW1lIjoiZG9jdW1lbnQiLCJ0eXBlIjpbInN0cmluZyIsIm51bGwiXX0seyJuYW1lIjoic2V0U291cmNlIiwidHlwZSI6W3sidHlwZSI6InJlY29yZCIsIm5hbWUiOiJzZXRTb3VyY2UiLCJmaWVsZHMiOlt7Im5hbWUiOiJxdWVyeVBhcmFtcyIsInR5cGUiOlt7InR5cGUiOiJtYXAiLCJ2YWx1ZXMiOlsic3RyaW5nIiwibnVsbCJdfSwibnVsbCJdfSx7Im5hbWUiOiJ1cmwiLCJ0eXBlIjpbInN0cmluZyIsIm51bGwiXX0seyJuYW1lIjoidGV4dCIsInR5cGUiOlsic3RyaW5nIiwibnVsbCJdfV19LCJudWxsIl19XX0sIm51bGwiXX0seyJuYW1lIjoicmVjb3JkIiwidHlwZSI6W3sidHlwZSI6InJlY29yZCIsIm5hbWUiOiJyZWNvcmQiLCJmaWVsZHMiOlt7Im5hbWUiOiJpZCIsInR5cGUiOlsic3RyaW5nIiwibnVsbCJdfSx7Im5hbWUiOiJkb2N1bWVudCIsInR5cGUiOlsic3RyaW5nIiwibnVsbCJdfSx7Im5hbWUiOiJzZXRJZHMiLCJ0eXBlIjpbeyJ0eXBlIjoiYXJyYXkiLCJpdGVtcyI6WyJzdHJpbmciLCJudWxsIl19LCJudWxsIl19LHsibmFtZSI6InJlY29yZFNvdXJjZSIsInR5cGUiOlt7InR5cGUiOiJyZWNvcmQiLCJuYW1lIjoicmVjb3JkU291
ghukill / problematic_avro_bytes.avro
Created September 18, 2017 16:02
Problematic Avro File
Obj\x01\x04\x16avro.schema\xd2\x14{"type":"record","name":"topLevelRecord","fields":[{"name":"set","type":[{"type":"record","name":"set","fields":[{"name":"id","type":["string","null"]},{"name":"document","type":["string","null"]},{"name":"setSource","type":[{"type":"record","name":"setSource","fields":[{"name":"queryParams","type":[{"type":"map","values":["string","null"]},"null"]},{"name":"url","type":["string","null"]},{"name":"text","type":["string","null"]}]},"null"]}]},"null"]},{"name":"record","type":[{"type":"record","name":"record","fields":[{"name":"id","type":["string","null"]},{"name":"document","type":["string","null"]},{"name":"setIds","type":[{"type":"array","items":["string","null"]},"null"]},{"name":"recordSource","type":[{"type":"record","name":"recordSource","fields":[{"name":"queryParams","type":[{"type":"map","values":["string","null"]},"null"]},{"name":"url","type":["string","null"]},{"name":"text","type":["string","null"]}]},"null"]}]},"null"]},{"name":"error","type":[{"type":"record","
# Small script to split pages when the desired midpoint drifts over the course of a set of images
# Requires ImageMagick, specifically the "convert" command
import os
import sys
def split_images(files, start_percentage, end_percentage, start_page, end_page):
    # determine percentage bump
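The preview ends before the body of `split_images`, so what follows is only a guess at what a "percentage bump" could mean: a linear interpolation of the split point from `start_percentage` on `start_page` to `end_percentage` on `end_page`. The function name `midpoint_percentages` is hypothetical, and the ImageMagick call itself is left out.

```python
def midpoint_percentages(start_percentage, end_percentage, start_page, end_page):
    """Map each page number to an interpolated split percentage.

    Assumes the midpoint drifts linearly across the page range; this is
    an illustrative assumption, not the original script's logic.
    """
    page_span = end_page - start_page
    if page_span < 0:
        raise ValueError("end_page must not be smaller than start_page")
    # Amount the split point moves per page (zero for a single page).
    bump = (end_percentage - start_percentage) / page_span if page_span else 0.0
    return {
        page: start_percentage + bump * (page - start_page)
        for page in range(start_page, end_page + 1)
    }
```

Each resulting percentage could then be passed to whatever cropping step splits the corresponding image.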