@tonyfraser
tonyfraser / marathonAndZeppelin.scala
Last active September 25, 2019 21:19
Get a Marathon bearer token and use it to clear all paragraphs in a Zeppelin notebook
//uses sttp module
import com.softwaremill.sttp.{HttpURLConnectionBackend, _}
import scala.util.parsing.json._
implicit lazy val backend = HttpURLConnectionBackend()
//first get a marathon bearer token.
val loginPostBody = "{ \"uid\": \"{username}\", \"password\": \"{password}\" }"
// preview truncated here; a sketch of the remaining call (the DC/OS login
// endpoint and the "token" response field are assumptions about the original):
val tok = JSON.parseFull(
  sttp.post(uri"https://{marathon-domain}/acs/api/v1/auth/login")
    .contentType("application/json")
    .body(loginPostBody)
    .send().unsafeBody
).map(_.asInstanceOf[Map[String, Any]]("token").toString)
@tonyfraser
tonyfraser / zeppelin_test.sh
Last active September 25, 2019 21:18
Use curl to trigger the Zeppelin API within a Mesos cluster
#!/bin/bash
# -> remember to run: dcos auth login first !!
DCOS_API_TOKEN=$(dcos config show core.dcos_acs_token)
url="http://{marathon-domain}/service/{marathon zeppelin name}"
notebook="2E617JZX1" # $url/#/notebook/2E617JZX1
paragraph="20190916-164803_817623738"
#Note: to get paragraph ID, download notebook, open json and look for -> paragraphs -> Item [N] -> id.
curl --request GET -s -H "Content-Type: application/json" -H "Authorization: token=$DCOS_API_TOKEN" $url/api/notebook
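The preview stops at the notebook-listing call, but the unused `notebook` and `paragraph` variables suggest the next step is running a single paragraph. A hedged sketch using Zeppelin's standard `run/{noteId}/{paragraphId}` REST endpoint (placeholder values mirror the variables above; the real call needs a live cluster and token):

```shell
# Build the "run one paragraph" URL from the same variables the gist defines.
url="http://{marathon-domain}/service/{marathon zeppelin name}"
notebook="2E617JZX1"
paragraph="20190916-164803_817623738"
run_url="$url/api/notebook/run/$notebook/$paragraph"

# The actual call (requires a reachable cluster and a valid token):
# curl --request POST -s -H "Authorization: token=$DCOS_API_TOKEN" "$run_url"
echo "$run_url"
```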
@tonyfraser
tonyfraser / emptyToNullUdf.scala
Created September 23, 2019 17:00
spark/scala : Convert all empty string records in a dataframe to null.
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}
// Usage: df.select(df.columns.map(c => emptyToNullUdf(col(c)).alias(c)): _*)
def emptyToNull(str: String): Option[String] =
  str match {
    case s if s == null || s.trim.isEmpty => None
    case s => Some(s)
  }
val emptyToNullUdf: UserDefinedFunction = udf(emptyToNull(_: String))
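Because the UDF wraps plain Scala, its behavior can be sanity-checked without a SparkSession (the object name below is just for the check):

```scala
object EmptyToNullCheck {
  // Same logic as the gist's emptyToNull, minus the Spark wrapper.
  def emptyToNull(str: String): Option[String] = str match {
    case s if s == null || s.trim.isEmpty => None
    case s => Some(s)
  }

  def main(args: Array[String]): Unit = {
    assert(emptyToNull(null).isEmpty)
    assert(emptyToNull("   ").isEmpty)
    assert(emptyToNull("episode1").contains("episode1"))
    println("ok")
  }
}
```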
@tonyfraser
tonyfraser / addColumnIfDoesNotExist.scala
Created September 4, 2019 20:38
Dynamically add a column to a Spark DataFrame if it does not already exist
//An example of dynamically adding a column if it does not exist
val df = Seq(
("channel_one", "my_show", "episode1"),
("channel_one", "my_show", "episode2")
).toDF("network_name", "show_name", "episode")
//there is no rank column so add one
val newdf = df.columns match {
  case cols if cols contains "rank" => df
  case _ => df.withColumn("rank", lit("0"))
}
@tonyfraser
tonyfraser / DataFrameConverter.py
Last active May 11, 2020 17:15 — forked from zaloogarcia/pandas_to_spark.py
Script for converting a Pandas DF to Spark's DF, but with support for ArrayType[StringType]
# https://stackoverflow.com/questions/37513355/converting-pandas-dataframe-into-spark-dataframe-error/56895546#56895546
# modified from parent gist by creating a dict type that contains df.dtypes AND type(pd.columnname)
#
# Looks something like this.
# { 'stringtypecolumn': {'dtype': 'object', 'actual': 'str'},
# 'act_num': {'dtype': 'int32', 'actual': 'numpy.int32'},
# 'text_dat': {'dtype': 'object', 'actual': 'list'},
# 'scene_description': {'dtype': 'object', 'actual': 'NoneType'},
# 'keywords': {'dtype': 'object', 'actual': 'list'}}
#
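The dtype/actual mapping described above can be built with a dict comprehension along these lines (a sketch; the sample `df` and the choice of inspecting the first value in each column are assumptions, not the gist's actual code):

```python
import pandas as pd

df = pd.DataFrame({
    "stringtypecolumn": ["a", "b"],
    "text_dat": [["x"], ["y", "z"]],
})
df["act_num"] = pd.Series([1, 2], dtype="int32")

# For each column, record both the pandas dtype and the Python type of the
# first value -- 'object' columns can hide str, list, or None.
type_map = {
    c: {"dtype": str(df[c].dtype), "actual": type(df[c].iloc[0]).__name__}
    for c in df.columns
}
print(type_map)
```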
@tonyfraser
tonyfraser / Dockerfile
Last active July 29, 2019 17:46
Dockerfile for running pyspark and python3
FROM openjdk:8
# python:3 doesn't have Java, so switched to openjdk.
# ==> openjdk:8 contains Java 1.8 and is a Debian image;
# far easier to start there than to apt-get install default-jdk.
WORKDIR /usr/src/app
# first get these jars into the docker container
# ~/thisGist/lib: tony$ ls -al
# total 23656
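The preview cuts off at the jar listing; a minimal continuation in the same spirit might look like this (the package names, paths, and `pip3 install pyspark` step are assumptions, not the gist's actual contents):

```dockerfile
FROM openjdk:8
WORKDIR /usr/src/app

# Debian base, so python3/pip3 come from apt.
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

RUN pip3 install pyspark

# jars shipped alongside the Dockerfile (see the ls listing above)
COPY lib/ /usr/src/app/lib/
COPY . .

CMD ["python3"]
```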
@tonyfraser
tonyfraser / githhub-clean-branch-history.log
Last active July 22, 2019 05:14
This is how you remove unwanted files and/or directories from commit history on github.com.
# Get your files ignored correctly on the file system and website, keep checking until
# the latest commit is perfect.
639 vi ./.gitignore
640 git rm -r --cached .
641 git add -A
642 git commit -m "adding"
643 git push origin master
# now create a temp branch off master, then delete the master, and push master back up again.
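The history log stops just before the branch swap. A hedged sketch of that step, demonstrated on a throwaway repo so nothing real is touched (this uses the orphan-branch approach; the gist's exact commands are cut off):

```shell
# Demo in a temp repo: two commits, then replace history with one commit.
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git config user.email "you@example.com"
git config user.name "you"

echo "one" > file.txt; git add file.txt; git commit -qm "first"
echo "two" >> file.txt; git commit -qam "second"

# Swap: orphan branch with no history, delete old branch, rename to master.
old_branch="$(git symbolic-ref --short HEAD)"
git checkout -q --orphan temp
git add -A
git commit -qm "fresh history"
git branch -D "$old_branch"
git branch -m master

git rev-list --count HEAD    # -> 1
# On the real repo you would then: git push -f origin master
```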
@tonyfraser
tonyfraser / exercises.scala
Last active July 19, 2019 15:14
udemy/Scala and Spark for Big Data and Machine Learning
// scala-and-spark-for-big-data-and-machine-learning
//Section 8 lesson 33
//Find out if you have all even numbers in a list.
List(0, 2, 4).
map(_%2).sum == 0
//Lucky number 7 card problem, Add your cards, but double 7 if you get it.
List(0, 2, 5, 7).
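The preview truncates the card-problem expression. One plausible completion, assuming the usual "double any 7 before summing" reading of the exercise (the object and method names are just for illustration):

```scala
object LuckySeven {
  // Sum the cards, doubling any 7 before summing.
  def luckySum(cards: List[Int]): Int =
    cards.map(c => if (c == 7) c * 2 else c).sum

  def main(args: Array[String]): Unit = {
    println(luckySum(List(0, 2, 5, 7)))  // 0 + 2 + 5 + 14 = 21
  }
}
```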
@tonyfraser
tonyfraser / full-access-s3-subkey-policy.json
Created June 21, 2019 15:56
A full-access S3 policy, designed to scope a subdirectory/subkey to the policy. Think s3://bucket/environment/dev, assigning only the dev prefix to this policy.
{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"s3:ListAllMyBuckets",
"s3:GetBucketLocation"
],
"Effect": "Allow",
"Resource": [
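The Resource array is cut off in the preview. A complete policy in this shape typically pairs account-wide list permissions with bucket listing and object access restricted to the prefix; a sketch with placeholder names (`mybucket` and `environment/dev` are assumptions, not the gist's values):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": ["s3:ListAllMyBuckets", "s3:GetBucketLocation"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::*"]
    },
    {
      "Action": ["s3:ListBucket"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::mybucket"],
      "Condition": {
        "StringLike": { "s3:prefix": ["environment/dev/*"] }
      }
    },
    {
      "Action": ["s3:*"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::mybucket/environment/dev/*"]
    }
  ]
}
```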
@tonyfraser
tonyfraser / read-only-s3-permission-subkey-policy.json
Last active June 21, 2019 16:01
A read-only S3 permissions policy. Think s3://outbounddrops/client-name, where you give
{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"s3:ListAllMyBuckets",
"s3:GetBucketLocation"
],
"Effect": "Allow",
"Resource": [