Jon Roosevelt (RooseveltAdvisors)
@RooseveltAdvisors
RooseveltAdvisors / github_gpg_key.md
Last active March 13, 2020 15:39 — forked from ankurk91/github_gpg_key.md
GitHub: Signing commits using GPG (Ubuntu/Mac)

GitHub: Signing commits using GPG (Ubuntu/Mac) 🔐

  • Do you have a GitHub account? If not, create one.
  • Install the required tools:
  • the latest Git client
  • GPG tools
# Ubuntu
sudo apt-get install gpa seahorse
# MacOS with https://brew.sh/
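
The preview cuts off before the key setup itself; a minimal sketch of the usual flow (the brew package and KEY_ID are assumptions/placeholders — the gist may use GPG Suite instead):

brew install gnupg                            # macOS; Ubuntu users have the apt packages above
gpg --full-generate-key                       # generate a new key pair
gpg --list-secret-keys --keyid-format LONG    # note the long key id
git config --global user.signingkey KEY_ID    # KEY_ID: placeholder for your long key id
git config --global commit.gpgsign true       # sign every commit by default
gpg --armor --export KEY_ID                   # public key block to paste into GitHub settings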
@RooseveltAdvisors
RooseveltAdvisors / zeppelin_s3_backend.md
Last active July 9, 2022 00:48
S3 backed notebooks for Zeppelin
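
No preview survives for this one; a minimal sketch of pointing Zeppelin's notebook storage at S3 via conf/zeppelin-env.sh, assuming the stock S3NotebookRepo (bucket and user are placeholders):

export ZEPPELIN_NOTEBOOK_S3_BUCKET=your-bucket
export ZEPPELIN_NOTEBOOK_S3_USER=your-user
export ZEPPELIN_NOTEBOOK_STORAGE=org.apache.zeppelin.notebook.repo.S3NotebookRepo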
@RooseveltAdvisors
RooseveltAdvisors / rllab installation with anaconda.md
Created November 25, 2018 16:43
rllab installation with Anaconda (tested on macOS)

Installation

This assumes Anaconda is already installed on your machine.

# remove any previous environment with the same name, then build fresh
conda env remove -n rllab_test -y
cd ~/Downloads
git clone https://github.com/rll/rllab.git
cd rllab
# create the environment from the environment.yml shipped in the repo
conda env create -n rllab_test -f environment.yml
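
Afterwards, activate and smoke-test the environment (conda's older source activate syntax, matching this vintage; the import assumes you run from the rllab checkout so the repo root is on the path):

source activate rllab_test
python -c "import rllab"   # run from ~/Downloads/rllab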
@RooseveltAdvisors
RooseveltAdvisors / install_anaconda_jupyter.sh
Created March 26, 2018 22:10
Bash script for installing Anaconda and Jupyter, and linking Jupyter with Spark
# Install Anaconda
wget https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh
bash Anaconda3-5.1.0-Linux-x86_64.sh -b -f -p $HOME/anaconda
export PATH="$HOME/anaconda/bin:$PATH"
echo 'export PATH="$HOME/anaconda/bin:$PATH"' >> ~/.bashrc
conda update -y -n base conda
# Install Jupyter
conda create -y -n jupyter python=3.5 jupyter nb_conda
screen -dmS jupyter
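
The preview stops at the detached screen session; the "linking Jupyter with Spark" part is typically just pointing PySpark's driver at Jupyter — a sketch, assuming Spark is already installed and on PATH:

# start the notebook server inside the detached screen session
screen -S jupyter -X stuff $'source activate jupyter && jupyter notebook --no-browser\n'
# make pyspark launch inside a notebook instead of the plain REPL
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
pyspark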
@RooseveltAdvisors
RooseveltAdvisors / debug_spark.md
Created September 1, 2017 12:00
Debugging Spark

To connect a debugger to the driver

Append the following to your spark-submit (or gatk-launch) options:

Replace 5005 with a different available port if necessary.

--driver-java-options -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005

This will suspend the driver until it gets a remote connection from IntelliJ.
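
For context, a sketch of where the option lands in a full command (my_job.py is a placeholder; client mode keeps the driver on the local machine, where the debugger can reach it):

spark-submit --master yarn --deploy-mode client \
  --driver-java-options -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 \
  my_job.py

Once the driver pauses, attach an IntelliJ Remote JVM Debug run configuration to localhost:5005.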

@RooseveltAdvisors
RooseveltAdvisors / ubuntu_nic_bonding.md
Created July 29, 2017 05:21
NIC bonding @ Ubuntu 14.04
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet manual
  bond-master bond0

auto eth1
iface eth1 inet manual
  bond-master bond0
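
# The preview ends before the bond itself; an assumed bond0 stanza —
# the bond-mode/miimon values and the dhcp choice are illustrative, not from the gist.
auto bond0
iface bond0 inet dhcp
  bond-mode active-backup
  bond-miimon 100
  bond-slaves none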
import datetime
from jinja2 import Environment

# one datetime per day in the closed interval [start, end]
start = datetime.datetime.strptime("2017-02-01", "%Y-%m-%d")
end = datetime.datetime.strptime("2017-07-24", "%Y-%m-%d")
date_generated = [start + datetime.timedelta(days=x) for x in range((end - start).days + 1)]

template = """spark-submit --master yarn --deploy-mode cluster --class com.xyz.XXXAPP s3://com.xyz/aa-1.5.11-all.jar --input-request-events s3://com.xyz/data/event_{{date_str}}/* --input-geofence-events s3://com.xyz/data2/event_/{{date_str}}/* --output s3://com.xyz/output/{{date_str}}"""

# render one spark-submit command per day
for d in date_generated:
    print(Environment().from_string(template).render(date_str=d.strftime("%Y-%m-%d")))
@RooseveltAdvisors
RooseveltAdvisors / jupyter_notebook@EMR.md
Last active September 20, 2019 15:37
Run Jupyter Notebook and JupyterHub on Amazon EMR

Jupyter on EMR allows users to save their work on Amazon S3 rather than on local storage on the EMR cluster (master node).

To store notebooks on S3, use:

--notebook-dir <s3://your-bucket/folder/>

To store notebooks in a directory different from the user’s home directory, use:

--notebook-dir <local directory>
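
A sketch of wiring the flag into cluster creation as a bootstrap-action argument (the script path follows the AWS Big Data blog post this gist tracks — an assumption, verify it — and the bucket is a placeholder):

aws emr create-cluster --name jupyter-cluster --release-label emr-5.14.0 \
  --applications Name=Hadoop Name=Spark \
  --instance-type m4.xlarge --instance-count 3 --use-default-roles \
  --bootstrap-actions Name=jupyter,Path=s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh,Args=[--notebook-dir,s3://your-bucket/notebooks/]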
@RooseveltAdvisors
RooseveltAdvisors / bitbucket_clone.md
Last active July 5, 2017 12:40
Clone all git repositories from BitBucket
curl -s -k https://USERNAME:PASSWORD@api.bitbucket.org/1.0/user/repositories | python -c 'import sys, json, os; r = json.loads(sys.stdin.read()); [os.system("git clone %s" % d["resource_uri"].replace("/1.0/repositories","https://USERNAME:PASSWORD@bitbucket.org")+".git") for d in r]'

PySpark

spark-submit

spark-submit --master yarn --deploy-mode cluster --name pyspark_job --driver-memory 2G --driver-cores 2 --executor-memory 12G --executor-cores 5 --num-executors 10 --conf spark.yarn.executor.memoryOverhead=4096 --conf spark.task.maxFailures=36 --conf spark.driver.maxResultSize=0 --conf spark.network.timeout=800s --conf spark.scheduler.listenerbus.eventqueue.size=500000 --conf spark.speculation=true --py-files lib.zip,lib1.zip,lib2.zip spark_test.py

spark_test.py

import pyspark
import sys
from pyspark.sql import SQLContext
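# The preview ends at the imports; a plausible completion of spark_test.py —
# the paths and logic below are placeholders, not the gist's actual body.
sc = pyspark.SparkContext(appName="pyspark_job")
sqlContext = SQLContext(sc)
df = sqlContext.read.parquet(sys.argv[1])   # e.g. an s3:// path passed as the first argument
print(df.count())
sc.stop()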