Jonathan Packer jspacker

  • Foresite Labs
  • Boston, MA
@jspacker
jspacker / pig_romance_ideas.md
Last active January 3, 2016 02:39
Problems with Pig / Ideas for PigRomance

  1. Tuples take up too much memory

The default tuple class takes at least 96 bytes even for an empty tuple, and it stores fields as boxed Integer, Float, etc. objects instead of primitives. I think you had done some work on code generation for efficient tuple classes, but it didn't seem to be fully integrated into all parts of the pipeline. If you could get this working for PigRomance, it could avoid a lot of spills and make local mode in particular faster.
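As a rough illustration of why boxed fields cost so much, the same effect shows up in CPython: a list of ints holds a pointer to a separate object per element, while `array` packs raw 8-byte values. This is only an analogy and not Pig's tuple implementation; the helper names `boxed_size` and `packed_size` are made up for this sketch, and exact byte counts vary by interpreter and platform.

```python
import sys
from array import array


def boxed_size(values):
    """Bytes for the list itself plus one int object per element."""
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)


def packed_size(values):
    """Bytes for the same values stored as raw signed 64-bit integers."""
    return sys.getsizeof(array("q", values))


fields = list(range(1000, 2000))  # 1000 distinct int objects
print(boxed_size(fields), packed_size(fields))
```

The packed form should come out several times smaller, which is the kind of saving generated primitive-field tuple classes would be aiming for.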

One place where it might be worth spending extra memory, though, is appending an int/long field to every schema to store a precomputed hashcode (taking advantage of immutability).
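A minimal sketch of the precomputed-hashcode idea, written in Python rather than against Pig's actual tuple classes. The class name `HashedTuple` and its layout are hypothetical, and it assumes every field is itself hashable. Because the fields never change after construction, the hash can be computed once and reused on every lookup or comparison.

```python
class HashedTuple:
    """Immutable tuple wrapper that computes its hash once, at construction."""

    __slots__ = ("_fields", "_hash")

    def __init__(self, *fields):
        # object.__setattr__ bypasses the immutability guard defined below.
        object.__setattr__(self, "_fields", tuple(fields))
        object.__setattr__(self, "_hash", hash(self._fields))

    def __setattr__(self, name, value):
        raise AttributeError("HashedTuple is immutable")

    def __hash__(self):
        return self._hash  # stored value, no recomputation per lookup

    def __eq__(self, other):
        if not isinstance(other, HashedTuple):
            return NotImplemented
        # Compare the cheap stored hashes first, then the fields themselves.
        return self._hash == other._hash and self._fields == other._fields

    def __getitem__(self, index):
        return self._fields[index]

    def __len__(self):
        return len(self._fields)
```

The trade-off is the one described above: one extra stored integer per tuple in exchange for cheaper grouping and join keys.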

  2. Even if everything can fit into memory, MapReduce spills to disk after each job
@jspacker
jspacker / ab_simulation.py
Last active December 23, 2015 08:09
Probably buggy A/B test simulation code
#!/usr/bin/env python
import sys
import numpy as np
from scipy.stats import binom, beta, chi2
m = int(sys.argv[1])     # number of trials
pa = float(sys.argv[2])  # baseline proportion (between 0 and 1)
# materiality threshold (between 0 and 1), expressed as a
# relative increase over the baseline proportion
t = float(sys.argv[3])
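The preview stops after the three parameters, so the rest of the script is not shown. The sketch below is a guess at the general shape of such a simulation, not the author's code. It assumes `m` is the number of trials per arm and that the treatment proportion is `pa * (1 + t)`. It also swaps in a pooled two-proportion z-test and the standard library's `random` module in place of the scipy distributions the script imports. The names `simulate_arm`, `z_statistic`, `detection_rate`, `runs`, `z_crit`, and `seed` are all hypothetical.

```python
import math
import random


def simulate_arm(rng, trials, p):
    """Count successes in `trials` independent Bernoulli(p) draws."""
    return sum(1 for _ in range(trials) if rng.random() < p)


def z_statistic(successes_a, successes_b, trials):
    """Pooled two-proportion z statistic for two equal-sized arms."""
    rate_a = successes_a / trials
    rate_b = successes_b / trials
    pooled = (successes_a + successes_b) / (2 * trials)
    stderr = math.sqrt(2 * pooled * (1 - pooled) / trials)
    if stderr == 0:
        return 0.0  # both arms all-success or all-failure
    return (rate_b - rate_a) / stderr


def detection_rate(m, pa, t, runs=200, z_crit=1.96, seed=0):
    """Fraction of simulated experiments whose |z| exceeds z_crit."""
    rng = random.Random(seed)
    pb = pa * (1 + t)  # assumed treatment proportion
    hits = 0
    for _ in range(runs):
        a = simulate_arm(rng, m, pa)
        b = simulate_arm(rng, m, pb)
        if abs(z_statistic(a, b, m)) > z_crit:
            hits += 1
    return hits / runs
```

Calling `detection_rate(2000, 0.1, 0.5)` estimates how often a 50% relative lift over a 10% baseline would be flagged with 2000 trials per arm, while `t = 0` gives an empirical false-positive rate.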
/*
* Estimate appropriate weights for each (user, item, signal) triple
* based on three factors:
* 1) the relative frequencies of the signals across all users
* 2) the relative frequencies of the signals for each specific user
* 3) the proportion of users who have at least one instance of a given signal
*
* (1) is the basic principle: rare events are usually more important than frequent ones
* (2) accounts for the fact that the same signal can be more or less significant
* depending on the particular user's habits
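The three factors named in the comment above can be computed directly from raw events. The sketch below only derives those three statistics. How the original script combines them into a weight is cut off in the preview, so no combining formula is attempted here. The function name `signal_statistics` and the in-memory event list are assumptions made for illustration, and the sketch is in Python rather than Pig.

```python
from collections import Counter, defaultdict


def signal_statistics(events):
    """events: iterable of (user, item, signal) triples.

    Returns {(user, signal): (global_share, user_share, coverage)} where
      global_share = share of all events that carry this signal      (factor 1)
      user_share   = share of this user's events with this signal    (factor 2)
      coverage     = share of users with at least one such event     (factor 3)
    """
    events = list(events)
    global_counts = Counter(signal for _, _, signal in events)
    user_counts = defaultdict(Counter)
    for user, _, signal in events:
        user_counts[user][signal] += 1

    total_events = len(events)
    total_users = len(user_counts)
    users_with_signal = Counter()
    for counts in user_counts.values():
        users_with_signal.update(counts.keys())  # +1 per user per signal

    stats = {}
    for user, counts in user_counts.items():
        user_total = sum(counts.values())
        for signal, count in counts.items():
            stats[(user, signal)] = (
                global_counts[signal] / total_events,
                count / user_total,
                users_with_signal[signal] / total_users,
            )
    return stats
```

All three statistics depend only on the user and the signal, so every item a user touched through a given signal would share the same inputs. Any per-item adjustment would have to come from the part of the script that is not shown.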
Word Year # occ. # pages
xylophone 1999 2033 806
xylophone 2000 1931 859
xylophone 2001 2026 844
xylophone 2002 2226 911
xylophone 2003 2132 951
xylophone 2004 2615 1068
xylophone 2005 2240 996
xylophone 2006 1930 988
From-ASIN To-ASIN
0005016444 0000013714
0005235073 0000013714
0005064341 0000013714
0005080789 0000013714
0005064295 0000013714
0005476216 0000013714
0006180116 0000013714
0005476798 0000013714
ASIN, Title, Authors/Editors/etc., Price, Avg. Rating, # Raters
1416556966 The Lathe Of Heaven: A Novel [Paperback] Ursula K. Le Guin (Author) 15.0 4.5 96
0848817540 A Passage to India [Hardcover] E. M. Forster (Author) 27.95 3.8 158
0385333846 Slaughterhouse-Five: A Novel [Paperback] Kurt Vonnegut (Author) 15.0 4.3 1004
059035342X Harry Potter and the Sorcerer's Stone (Book 1) [Paperback] J.K. Rowling (Author), Mary GrandPré (Illustrator) 10.99 4.7 6886
0441013597 Dune, 40th Anniversary Edition (Dune Chronicles, Book 1) [Paperback] Frank Herbert (Author) 18.0 4.4 1524
0062080237 American Gods: Author's Preferred Text [Paperback] Neil Gaiman (Author) 17.99 3.9 1214
0525478817 The Fault in Our Stars [Hardcover] John Green (Author) 17.99 4.7 2537
0140177396 Of Mice and Men [Paperback] John Steinbeck (Author) 10.0 4.3 1354
Cluster ID, # elements, cluster ASINs and titles
240710 53 {(156793417X,Risk Management and the Emergency Department: Executive Leadership for Protecting Patients and Hospitals [Paperback]),(0323054722,Rosen's Emergency Medicine - Concepts and Clinical Practice, 2-Volume Set: Expert Consult Premium Edition - Enhanced Online Features and Print, 7e [Hardcover]),(0471727601,Medical Toxicology of Drug Abuse: Synthesized Chemicals and Psychoactive Plants [Hardcover]),(0470657839,Evidence-Based Emergency Care: Diagnostic Testing and Clinical Decision Rules (Evidence-Based Medicine) [Paperback]),(0470671114,Practical Teaching in Emergency Medicine [Paperback]),(0443068194,Textbook of Adult Emergency Medicine, 3e [Paperback]),(0470657723,Urgent Care Emergencies: Avoiding the Pitfalls and Improving the Outcomes [Paperback]),(0071464530,Medical Toxicology Review: Pearls of Wisdom, Second Edition [Paperback]),(0071668071,Pocket Atlas of Emergency Ultrasound (Atlas Series) [Paperback]),(0071497404,Emergency Medicine Ora
hrweather 149 149 1.8003229223518307E-5 2.8222506391924165E-6 6.379032738449449
freekicks 89 94 1.0753606717403553E-5 1.780480269020719E-6 6.039722486404266
wsaatl 66 71 7.974584756726229E-6 1.344830841494373E-6 5.929805080812162
apachecafe 63 69 7.612103631420492E-6 1.3069482825790384E-6 5.824334239453845
dupdatealerts 105 116 1.268683938570082E-5 2.197188417089398E-6 5.774124461527518
duper 311 354 3.757720999002814E-5 6.705212928014198E-6 5.604178479259261
escalated 331 390 3.999375082539973E-5 7.387098988490218E-6 5.413999580581455
holidaze 329 390 3.975209674186257E-5 7.387098988490218E-6 5.3812865921791495
riddance 104 126 1.2566012343932242E-5 2.3866012116660703E-6 5.265233371418593
kixify 92 113 1.1116087842709291E-5 2.1403645787163967E-6 5.1935487782066305
moneygrip 834 850 1.3957675441202288E-4 1.610008753901714E-5 8.669316491216023
lanate 795 816 1.3304978388196427E-4 1.5456084037456456E-5 8.608246665813272
dreporter 806 870 1.3489072428787824E-4 1.6478913128170486E-5 8.185656616957603
fuckity 57 68 9.539418467008757E-6 1.2880070031213712E-6 7.406340527567645
growdy 82 100 1.3723373934995055E-5 1.8941279457667226E-6 7.245220137143365
karmas 174 227 2.912033005718463E-5 4.29967043689046E-6 6.772688857112634
fractions 331 432 5.539557039613858E-5 8.182632725712241E-6 6.769895735643784
karma's 111 145 1.857676227785916E-5 2.7464855213617476E-6 6.763830405575388
paybacks 52 68 8.7026273734115E-6 1.2880070031213712E-6 6.756661533921361
texters 317 419 5.305255533406625E-5 7.936396092762567E-6 6.684716175197761
@jspacker
jspacker / adi_mapreduce_and_pig_talk.md
Last active December 16, 2015 18:09
Notes for ADI MapReduce + Pig talk

Problem: what to do when your data is too large to process on one machine?

Some cases where this can occur

  • Recommendations: Netflix
  • Ad-tech: Quantcast (audience measurement)
  • Sensor monitoring: EnergyHub (thermostats)
  • Biotech: Schrodinger (drug discovery)

Netflix architecture diagram