@yuanzhaoYZ
yuanzhaoYZ / github_gpg_key.md
Last active March 13, 2020 15:39 — forked from ankurk91/github_gpg_key.md
GitHub: Signing commits using GPG (Ubuntu/Mac)

  • Do you have a GitHub account? If not, create one.
  • Install the required tools:
    • Latest Git client
    • gpg tools
# Ubuntu
sudo apt-get install gpa seahorse
# MacOS with https://brew.sh/
brew install gnupg
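
Once the tools are installed, a minimal signing setup looks like this (with a recent GnuPG; <KEYID> is a placeholder for your own key id):

gpg --full-generate-key                      # generate a new key pair (RSA 4096 is a common choice)
gpg --list-secret-keys --keyid-format LONG   # note the id after sec rsa4096/<KEYID>
git config --global user.signingkey <KEYID>  # tell git which key to sign with
git config --global commit.gpgsign true      # sign every commit by default
gpg --armor --export <KEYID>                 # paste the exported public key into GitHub > Settings > SSH and GPG keys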
@yuanzhaoYZ
yuanzhaoYZ / zeppelin_s3_backend.md
Last active July 9, 2022 00:48
S3 backed notebooks for Zeppelin
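
A minimal sketch of the usual setup, assuming Zeppelin's standard S3NotebookRepo: point notebook storage at S3 in conf/zeppelin-env.sh (bucket and user names below are placeholders):

# conf/zeppelin-env.sh
export ZEPPELIN_NOTEBOOK_S3_BUCKET=your-bucket
export ZEPPELIN_NOTEBOOK_S3_USER=zeppelin
export ZEPPELIN_NOTEBOOK_STORAGE=org.apache.zeppelin.notebook.repo.S3NotebookRepo

Notebooks are then stored under s3://your-bucket/zeppelin/notebook/, so the instance profile or AWS credentials Zeppelin runs with must have access to that bucket.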
@yuanzhaoYZ
yuanzhaoYZ / rllab installation with anaconda.md
Created November 25, 2018 16:43
rllab installation with anaconda (tested on macOS)

Installation

This assumes you already have Anaconda installed on your computer.

conda env remove -n rllab_test -y
cd ~/Downloads
git clone https://github.com/rll/rllab.git
cd rllab
conda env create -n rllab_test -f environment.yml
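
After the environment is created, a quick way to sanity-check the install (the example script path is an assumption about the rllab repo layout):

source activate rllab_test
python examples/trpo_cartpole.py   # runs a short TRPO training loop on CartPole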
@yuanzhaoYZ
yuanzhaoYZ / install_anaconda_jupyter.sh
Created March 26, 2018 22:10
Bash script for installing Anaconda and Jupyter, and linking Jupyter with Spark
# Install Anaconda
wget https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh
bash Anaconda3-5.1.0-Linux-x86_64.sh -b -f -p $HOME/anaconda
export PATH="$HOME/anaconda/bin:$PATH"
echo 'export PATH="$HOME/anaconda/bin:$PATH"' >> ~/.bashrc
conda update -y -n base conda
# Install Jupyter
conda create -y -n jupyter python=3.5 jupyter nb_conda
screen -dmS jupyter
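
A common way to finish the "linking Jupyter with Spark" step is to make pyspark launch Jupyter as its driver; a sketch, assuming the jupyter conda env created above (IP and port are placeholders):

source activate jupyter
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --ip=0.0.0.0 --port=8888"
pyspark   # starts a Jupyter server whose notebooks get a ready-made SparkContext (sc)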
@yuanzhaoYZ
yuanzhaoYZ / debug_spark.md
Created September 1, 2017 12:00
Debugging Spark

To connect a debugger to the driver

Append the following to your spark-submit (or gatk-launch) options:

Replace 5005 with a different available port if necessary.

--driver-java-options -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005

This will suspend the driver until it gets a remote connection from IntelliJ.
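For example, a full submit might look like this (class and jar names are placeholders); in IntelliJ, create a "Remote JVM Debug" run configuration pointing at the driver host on port 5005 and start it to release the suspended driver:

spark-submit --master yarn --deploy-mode client \
  --driver-java-options -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 \
  --class com.example.MyJob my-job.jar

With --deploy-mode cluster the driver runs on an arbitrary cluster node, so client mode (or port-forwarding to the driver node) is usually the easier way to attach.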

@yuanzhaoYZ
yuanzhaoYZ / ubuntu_nic_bonding.md
Created July 29, 2017 05:21
NIC bonding @ Ubuntu 14.04
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet manual
  bond-master bond0

auto eth1
iface eth1 inet manual
  bond-master bond0

# bond0 stanza (typical completion; the mode and addressing here are assumptions, adjust for your network)
auto bond0
iface bond0 inet dhcp
  bond-mode 802.3ad
  bond-miimon 100
  bond-slaves none
@yuanzhaoYZ
yuanzhaoYZ / jinja_template
Created July 26, 2017 21:57
jinja_template
import datetime
from jinja2 import Environment

start = datetime.datetime.strptime("2017-02-01", "%Y-%m-%d")
end = datetime.datetime.strptime("2017-07-24", "%Y-%m-%d")
date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end - start).days + 1)]
template = """spark-submit --master yarn --deploy-mode cluster --class com.xyz.XXXAPP s3://com.xyz/aa-1.5.11-all.jar --input-request-events s3://com.xyz/data/event_{{date_str}}/* --input-geofence-events s3://com.xyz/data2/event_/{{date_str}}/* --output s3://com.xyz/output/{{date_str}}"""

# Render one spark-submit command per day in the range
for d in date_generated:
    print(Environment().from_string(template).render(date_str=d.strftime("%Y-%m-%d")))
@yuanzhaoYZ
yuanzhaoYZ / jupyter_notebook@EMR.md
Last active September 20, 2019 15:37
Run Jupyter Notebook and JupyterHub on Amazon EMR

Jupyter on EMR allows users to save their work on Amazon S3 rather than on local storage on the EMR cluster (master node).

To store notebooks on S3, use:

--notebook-dir <s3://your-bucket/folder/>

To store notebooks in a directory different from the user’s home directory, use:

--notebook-dir <local directory>
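
These flags are arguments to the Jupyter bootstrap action described in the AWS Big Data Blog post this gist follows; a sketch of a cluster launch (the bootstrap script path, release label, instance sizes, and names are assumptions or placeholders):

aws emr create-cluster --name jupyter-cluster --release-label emr-5.14.0 \
  --applications Name=Hadoop Name=Spark \
  --instance-type m4.xlarge --instance-count 3 --use-default-roles \
  --ec2-attributes KeyName=your-key \
  --bootstrap-actions Path=s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh,Args=[--notebook-dir,s3://your-bucket/folder/]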
@yuanzhaoYZ
yuanzhaoYZ / bitbucket_clone.md
Last active July 5, 2017 12:40
Clone all git repositories from BitBucket
curl -s  -k https://USERNAME:PASSWORD@api.bitbucket.org/1.0/user/repositories | python -c 'import sys, json, os; r = json.loads(sys.stdin.read()); [os.system("git clone %s" % d["resource_uri"].replace("/1.0/repositories","https://USERNAME:PASSWORD@bitbucket.org")+".git") for d in r]'
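
The same thing spelled out, using the v1.0 API from the one-liner (since retired by Bitbucket; substitute USERNAME and PASSWORD):

curl -s -k https://USERNAME:PASSWORD@api.bitbucket.org/1.0/user/repositories > repos.json
python - <<'EOF'
import json, subprocess
for repo in json.load(open("repos.json")):
    url = repo["resource_uri"].replace("/1.0/repositories", "https://USERNAME:PASSWORD@bitbucket.org") + ".git"
    subprocess.check_call(["git", "clone", url])
EOF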

Pyspark

spark-submit

spark-submit --master yarn --deploy-mode cluster --name pyspark_job --driver-memory 2G --driver-cores 2 --executor-memory 12G --executor-cores 5 --num-executors 10 --conf spark.yarn.executor.memoryOverhead=4096 --conf spark.task.maxFailures=36 --conf spark.driver.maxResultSize=0 --conf spark.network.timeout=800s --conf spark.scheduler.listenerbus.eventqueue.size=500000 --conf spark.speculation=true --py-files lib.zip,lib1.zip,lib2.zip spark_test.py

spark_test.py

import pyspark
import sys
from pyspark.sql import SQLContext
# Minimal driver setup matching the spark-submit command above
sc = pyspark.SparkContext(appName="pyspark_job")
sqlContext = SQLContext(sc)