Miklos C mrchristine

@mrchristine
mrchristine / find_clones.py
Created February 25, 2020 18:15
Find cloned notebooks and identify the most-cloned ones
# $ cat nb_names.log | sort | uniq -c | sort -nrk1 | head
import os, re
# find cloned notebooks with parens
pattern = re.compile(r"\((\d+)\)")
with open('user_workspace.log', 'r') as fp, open('nb_names.log', 'w') as fp_w:
    for x in fp:
        nb_name = os.path.basename(x.rstrip())
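        # (sketch — the loop body is truncated in this preview; assuming clone
        #  suffixes like "My Notebook (1)" are stripped so the `uniq -c` pipeline
        #  above can count copies per base name)
        base_name = pattern.sub("", nb_name).strip()
        fp_w.write(base_name + "\n")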
@mrchristine
mrchristine / get_spark_ui.py
Created February 12, 2020 23:54
Script to get the Spark UI dynamically
ui_port = spark.sql("set spark.ui.port").collect()[0].value
env = "myenvironment.cloud.databricks.com"
cluster_id = dbutils.notebook.entry_point.getDbutils().notebook().getContext().clusterId().getOrElse(None)
url = "https://{0}/driver-proxy-api/o/0/{1}/{2}/api/v1/".format(env, cluster_id, ui_port)
import requests
token = "TOKEN"
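The preview stops after the token. A minimal sketch of how the proxied Spark REST API might then be queried; the Bearer header and the applications endpoint are assumptions, not shown in the gist.
headers = {"Authorization": "Bearer {0}".format(token)}
# e.g. list Spark applications through the Databricks driver proxy
resp = requests.get(url + "applications", headers=headers)
print(resp.json())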
@mrchristine
mrchristine / iam.py
Created January 15, 2020 16:47
Bypass IAM Check
import requests
token = 'MYTOKEN'
url = 'https://EXAMPLE.cloud.databricks.com'
ip = 'arn:aws:iam::123456789:instance-profile/databricks_special_role'
class DatabricksRestClient:
"""A class to define wrappers for the REST API"""
@mrchristine
mrchristine / get_s3_storage_costs.sh
Created December 11, 2019 15:33
Calculate S3 storage costs
#!/bin/bash
# get the last date in the file
last_date=`cat "$@" | awk -F',' '{print $5}' | awk '{print $1}' | grep -v "Start" | sort | uniq | tail -n1`
# pass in the report.csv and calculate total storage costs for StandardStorage tier
cat "$@" | grep $last_date | awk -F, '{printf "%.2f GB %s %s \n", $7/(1024**3 )/24, $4, $2}' | grep "StandardStorage" | uniq | sort -n
echo "Processed for $last_date"
@mrchristine
mrchristine / decode_aws_error.sh
Created December 5, 2019 15:33
Decode and pretty print an encoded error message from AWS
#!/bin/bash
# grab decoded error message
error=`aws sts decode-authorization-message --encoded-message $@ | jq .DecodedMessage`
# trim the start and end double quotes
json_err=${error:1: -1}
# remove escaped quoted strings and pretty print with jq
echo $json_err | sed 's|\\"|"|g' | jq .
@mrchristine
mrchristine / spark_stuff.scala
Created June 7, 2019 15:53
Spark Notes / Tips to Remember
spark.conf.isModifiable("spark.sql.shuffle.partitions")
@mrchristine
mrchristine / spark_schema_save_n_load.py
Created May 28, 2019 21:12
Read / Write Spark Schema to JSON
##### READ SPARK DATAFRAME
df = spark.read.option("header", "true").option("inferSchema", "true").csv(fname)
# the header in the first file provides the column names; inferSchema derives the column types
df_schema = df.schema
##### SAVE JSON SCHEMA INTO S3 / BLOB STORAGE
# save the schema so the streaming job can load it during the next run
dbutils.fs.rm("/home/mwc/airline_schema.json", True)
with open("/dbfs/home/mwc/airline_schema.json", "w") as f:
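    # (sketch — the preview ends inside this with-block; assuming the schema is
    #  written out as JSON and rebuilt later with StructType.fromJson)
    f.write(df_schema.json())

##### LOAD JSON SCHEMA IN THE NEXT JOB (illustrative continuation)
import json
from pyspark.sql.types import StructType
with open("/dbfs/home/mwc/airline_schema.json", "r") as f:
    saved_schema = StructType.fromJson(json.load(f))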
@mrchristine
mrchristine / update_legacy_job_templates.py
Last active November 6, 2018 16:59
Job to update legacy instance types on Databricks
import json, pprint, requests, datetime
################################################################
## Replace the token variable and environment url below
################################################################
# Helper to pretty print json
def pprint_j(i):
    print json.dumps(i, indent=4, sort_keys=True)
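The preview stops at the helper. A hedged sketch of what the update flow might look like against the Jobs API; the node-type mapping, token, and URL below are placeholders, not values from the gist.
token = "REPLACE_TOKEN"
url = "https://myenv.cloud.databricks.com"
headers = {"Authorization": "Bearer " + token}
# example mapping of legacy to current instance types (adjust as needed)
node_map = {"r3.xlarge": "r5.xlarge", "c3.4xlarge": "c5.4xlarge"}
jobs = requests.get(url + "/api/2.0/jobs/list", headers=headers).json().get("jobs", [])
for job in jobs:
    cluster = job["settings"].get("new_cluster", {})
    if cluster.get("node_type_id") in node_map:
        cluster["node_type_id"] = node_map[cluster["node_type_id"]]
        resp = requests.post(url + "/api/2.0/jobs/reset", headers=headers,
                             json={"job_id": job["job_id"], "new_settings": job["settings"]})
        pprint_j(resp.json())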
@mrchristine
mrchristine / vector_sum_udaf.scala
Created November 29, 2017 21:39
Spark UDAF to sum vectors for common keys
package com.databricks.example.pivot
/**
This code allows a user to add vectors together for common keys.
The code in the comments shows you how to register the Scala UDAF to be called from pyspark.
The UDAF can only be called from a SQL expression (aka spark.sql() or df.expr()).
**/
/**
# Python code to register a scala UDAF
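# (sketch — the preview cuts off here; assumes the UDAF class defined below is
#  named VectorSumUDAF and Spark 2.3+ for registerJavaUDAF)
spark.udf.registerJavaUDAF("vector_sum", "com.databricks.example.pivot.VectorSumUDAF")
spark.sql("SELECT key, vector_sum(vec) AS vec_sum FROM vectors GROUP BY key")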
@mrchristine
mrchristine / spark-submit-run-once.sh
Created September 6, 2017 17:03
spark-submit transient run example
#!/bin/bash
usage="Add jars to the input arguments to specify the spark job. -h list the supported spark versions"
RUNTIME_VERSION="3.2.x-scala2.11"
NODE_TYPE="r3.xlarge"
while getopts ':hs:' option; do
  case "$option" in
    h) echo "$usage"
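       # (sketch — the preview ends here; the remaining branches are assumed
       #  from the ':hs:' getopts string above)
       exit 0
       ;;
    s) RUNTIME_VERSION=$OPTARG
       ;;
    \?) echo "illegal option: -$OPTARG" >&2
       exit 1
       ;;
  esac
done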