Arthur Gerigk gerigk

gerigk / .mrjob.conf
Created June 3, 2012 14:11
A .mrjob.conf to run Pandas with EMR
runners:
  emr:
    aws_access_key_id: youraccountid
    # aws_region: us-west-1  # your region; US East by default
    aws_secret_access_key: yoursecretkey
    bootstrap_actions:
      # probably this is a good idea
      - s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
    # we disable this since it is run before our shell script and installs mrjob for Python 2.6
    bootstrap_mrjob: False
gerigk / bootstrap.sh
Created June 3, 2012 14:00
Bootstrap file to load binaries for pandas and dependencies
#!/bin/bash
###################
#configuration here
####################
bucketname="your_bucket_name"
##########################
cd /home/hadoop
#first we set two vars...I had errors without this
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
export LD_RUN_PATH=/usr/local/lib:$LD_RUN_PATH
gerigk / build_atlas.sh
Created June 3, 2012 13:49
Build ATLAS for EC2 EMR
bucketname="yourbucketname"
cd /home/hadoop
wget http://www.netlib.org/lapack/lapack-3.4.1.tgz
wget http://downloads.sourceforge.net/project/math-atlas/Developer%20%28unstable%29/3.9.76/atlas3.9.76.tar.bz2
tar -vxf atlas3.9.76.tar.bz2
cd ATLAS
mkdir build
cd build
################################## -t 2 means 2 threads. depending on the ec2 instance you can choose more threads 14
### V 448 means SSE1/2/3 support. A14 means x86SSE364SSE2 architecture. check the documentation for more information
gerigk / bootstrap_run_only_once.sh
Created June 3, 2012 13:48
Build binaries to run Pandas with EMR
#!/bin/bash
###################
#configuration here
####################
bucketname="my_bucket_name"
##########################
cd /home/hadoop
#first we set two vars...I had errors without this
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
gerigk / gist:2149399
Created March 21, 2012 16:42
Inserting a DataFrame into Postgres
import psycopg2
import os
import sys
sys.path.append(os.path.abspath('../includes'))
import dbLoader
from datetime import datetime
class ReadFaker:
    """
@classmethod
def from_dict(cls, data, orient='columns', dtype=None):
    from collections import defaultdict
    orient = orient.lower()
    if orient == 'index':
        # TODO: this should be seriously cythonized
        new_data = defaultdict(dict)
        for index, s in data.iteritems():
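The preview above cuts off at the loop over `data`, so the rest of the `orient == 'index'` branch is not shown. As a rough sketch of what such a branch presumably does, the snippet below flips a row-keyed nested dict into a column-keyed one with `defaultdict(dict)`. The helper name `index_to_column_orient` and the sample data are hypothetical and not taken from the gist, and it uses Python 3's `.items()` where the original uses Python 2's `.iteritems()`.

```python
from collections import defaultdict


def index_to_column_orient(data):
    """Turn {row_label: {column: value}} into {column: {row_label: value}}.

    Hypothetical helper for illustration; not the gist's actual code.
    """
    new_data = defaultdict(dict)
    for index, row in data.items():  # Python 3 spelling of .iteritems()
        for column, value in row.items():
            # Missing cells are simply left out of the column's dict.
            new_data[column][index] = value
    return dict(new_data)


# Made-up sample data: two rows, the second one lacks column "b".
rows = {"r1": {"a": 1, "b": 2}, "r2": {"a": 3}}
columns = index_to_column_orient(rows)
```

The column-keyed result is the shape a `orient='columns'` code path would expect, which is probably why the fragment rebuilds the dict before continuing.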