Tony Fraser (tonythor)

@tonythor
tonythor / CsvStringToSparkDF.Scala
Created June 14, 2023 21:51
A Scala snippet that takes CSV string data and returns a Spark DataFrame. It was designed for unit testing simple DataFrame transform methods.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
object TestDataFrameBuilder {
// **************************
// val names =
// """tony,schmaser
// |fred,smith
// |reed,jerry""".stripMargin
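The preview above is cut off by the listing; as a hedged PySpark sketch of the same idea (not the gist's actual Scala, and the helper name and schema are illustrative), a multi-line CSV string can be split into rows and turned into a DataFrame for unit tests:
# Hedged PySpark sketch, not the gist's code. Column names and the helper are illustrative.
from pyspark.sql import SparkSession

def csv_string_to_df(spark, csv_string, columns):
    # Split the raw CSV block into rows, then each row into fields.
    rows = [tuple(line.split(",")) for line in csv_string.strip().splitlines()]
    return spark.createDataFrame(rows, columns)

# Example usage in a unit test:
# spark = SparkSession.builder.master("local[1]").getOrCreate()
# names = "tony,schmaser\nfred,smith\nreed,jerry"
# df = csv_string_to_df(spark, names, ["first", "last"])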
@tonythor
tonythor / todcr.sh
Created June 1, 2023 14:02
AWS EMR with targeted On-Demand Capacity Reservations
#!/bin/bash
# EMR in targeted ODCRs (targeted On-Demand Capacity Reservations)
# THIS IS NOT A RUNNABLE SCRIPT -> IT WAS DESIGNED FOR CUTTING AND PASTING DURING A DEMO
# Do this: https://docs.aws.amazon.com/emr/latest/ManagementGuide/on-demand-capacity-reservations.html
# YouTube demo: https://www.youtube.com/watch?v=WYWSFb5wZuo
mnode="r4.xlarge" # master node
@tonythor
tonythor / basic-python-stats.py
Last active April 12, 2023 02:52
Five-number summary, stddev, and basic stats with Python
f = [66,67,67,68,68,68,68,69,69,69,69,70,70,71,71,72,73,75]
from numpy import percentile
def five_number_summary(data):
quartiles = percentile(data, [25, 50, 75])
print(data)
print('Min: %.3f' % min(data))
print('Q1: %.3f' % quartiles[0])
print('Median: %.3f' % quartiles[1])
print('Q3: %.3f' % quartiles[2])
print('Max: %.3f' % max(data))
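The listing truncates the gist here; the standard deviation mentioned in the description can be computed the same way with numpy. This is a hedged addition, not the gist's remaining lines:
# Hedged addition (not the gist's hidden lines): the stddev piece of "basic stats".
from numpy import mean, std

five_number_summary(f)
print('Mean: %.3f' % mean(f))
print('StdDev: %.3f' % std(f, ddof=1))  # ddof=1 -> sample standard deviation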
@tonythor
tonythor / pandas_strings_to_dates.py
Created March 8, 2023 19:30
pandas -- strings to dates
import pandas as pd
df = pd.read_csv('./nogit_dataset.csv')
df.head(5)
# StringDate Product Store Value
#0 1012018 2667437 QLD_CW_ST0203 2926.000
#1 2012018 2667437 QLD_CW_ST0203 2687.531
#2 3012018 2667437 QLD_CW_ST0203 2793.000
#3 4012018 2667437 QLD_CW_ST0203 2394.000
#4 5012018 2667437 QLD_CW_ST0203 2660.000
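The conversion itself sits below the preview cutoff. Assuming StringDate is day-month-year packed into one integer with no zero padding (so 1012018 is 1 Jan 2018), a hedged sketch of the conversion looks like:
# Hedged sketch, not the gist's actual code. Assumes StringDate packs D/M/YYYY into one
# integer with no zero padding, e.g. 1012018 -> 2018-01-01.
df['Date'] = pd.to_datetime(df['StringDate'].astype(str).str.zfill(8), format='%d%m%Y')
df = df.drop(columns=['StringDate'])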
@tonythor
tonythor / aws-s3-ls-output-to-df.scala
Last active December 21, 2022 17:31
aws s3 ls output loaded into a DataFrame.
// If you have a bunch of `aws s3 ls > $date.txt` files in a directory,
// you can load them into a DataFrame to look at them. Of course you can use the Hadoop API instead,
// but this is quick and dirty, and it works if you're trying to troubleshoot whether a feed is working.
import scala.util.Try
import org.apache.spark.sql.types.TimestampType
import org.apache.spark.sql.functions.{col, lit, udf, input_file_name, unix_timestamp, date_format}
def col_builder(d: String, p:Int, l:String = " +"):String = Try {
val myArray = d.split(l).toSeq
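As a lighter-weight alternative to the Spark version above, here is a hedged pandas sketch, not the gist's code, for the same troubleshooting task; the directory path and column names are placeholders:
# Hedged pandas sketch, not the gist's Scala: load a directory of `aws s3 ls > $date.txt`
# captures into one DataFrame. Paths and column names are placeholders.
import glob
import pandas as pd

frames = []
for path in glob.glob("./s3_listings/*.txt"):
    # A typical `aws s3 ls` line: 2022-12-21 17:31:00   12345 some/key.parquet
    listing = pd.read_csv(path, sep=r"\s+", engine="python", header=None,
                          names=["date", "time", "size", "key"])
    listing["source_file"] = path          # which capture file the row came from
    frames.append(listing)

all_listings = pd.concat(frames, ignore_index=True)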
@tonythor
tonythor / upload_s3_retention_policy.py
Last active November 4, 2022 21:22
Use Python and the S3 API to upload a retention policy to an S3 bucket
import boto3
import pprint
## To be used if you're starting up a long-running job that will constantly write
## to S3 and you want objects deleted after n days no matter what. Think
## clearing out logs, deleting old versions of data sets, etc.
# set your variables
rule_id_string='deleteCloudTrailAfter30Days'
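The put call itself is below the preview cutoff; a hedged sketch of how such a lifecycle/retention rule is typically uploaded with boto3 (the bucket name is a placeholder, and the rule reuses the variables defined above):
# Hedged sketch (not the gist's hidden lines): attach a 30-day expiration lifecycle rule.
# "my-cloudtrail-bucket" is a placeholder bucket name.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-cloudtrail-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": rule_id_string,
            "Filter": {"Prefix": ""},      # apply to the whole bucket
            "Status": "Enabled",
            "Expiration": {"Days": 30},    # delete objects after 30 days
        }]
    },
)
pprint.pprint(s3.get_bucket_lifecycle_configuration(Bucket="my-cloudtrail-bucket"))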
@tonythor
tonythor / airflow-xcom-conditional-logic-dag.py
Created August 17, 2022 21:11
An example Airflow DAG that uses Jinja2 conditional logic with both dag_run and XComs
from airflow.models import DAG
from airflow.utils.dates import days_ago
from airflow.operators.python_operator import PythonOperator
from jinja2 import Template, Environment, FileSystemLoader
dag_id='nogit-arnon-exceptions'
docs = """
Trigger with:
{
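The trigger payload and the tasks are below the preview cutoff. As a hedged illustration, not the gist's actual tasks, of templating against dag_run.conf and XComs (the dag id, task ids, and keys here are placeholders):
# Hedged illustration, not the gist's code: templates_dict on PythonOperator is a
# Jinja-rendered field, so dag_run.conf and xcom_pull can both be referenced from it.
def report(templates_dict=None, **_):
    print(templates_dict["env"], templates_dict["upstream_value"])

with DAG(dag_id="nogit-templating-sketch", start_date=days_ago(1), schedule_interval=None) as sketch_dag:
    report_task = PythonOperator(
        task_id="report",
        python_callable=report,
        provide_context=True,  # needed on older (1.10-style) Airflow, matching the import path above
        templates_dict={
            "env": "{{ dag_run.conf.get('env', 'dev') if dag_run and dag_run.conf else 'dev' }}",
            "upstream_value": "{{ ti.xcom_pull(task_ids='extract') }}",  # 'extract' is a hypothetical upstream task
        },
    )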
@tonythor
tonythor / Dockerfile-pyspark-python39-boto-elastic-container-service
Last active August 3, 2022 19:03
Demo: Python 3.9, PySpark 3.3.0, and boto3, all running off ECS session variables.
FROM python:3.9
WORKDIR /usr/src/app
ENV SPARK_HOME=/usr/local/lib/python3.9/site-packages/pyspark
RUN mkdir -p ~/.aws
RUN mkdir -p "${SPARK_HOME}/jars"
RUN /usr/local/bin/python -m pip install --upgrade pip
RUN apt-get update && apt-get install -y curl awscli vim libsnappy-dev openjdk-11-jdk mlocate
COPY ./requirements.txt requirements.txt
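A hedged smoke-test sketch, not part of the gist, that could be run inside the container to confirm boto3 resolves the ECS task-role credentials and that PySpark starts:
# Hedged smoke test, not from the gist: run inside the container to confirm that
# boto3 picks up credentials from the ECS task role and that a local SparkSession starts.
import boto3
from pyspark.sql import SparkSession

print(boto3.client("sts").get_caller_identity()["Arn"])    # identity of the task role

spark = SparkSession.builder.master("local[1]").appName("smoke").getOrCreate()
print(spark.version)                                       # expected to report 3.3.0
spark.stop()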
@tonythor
tonythor / delete_versions.py
Created June 30, 2022 15:24
Delete files from within a versioned S3 bucket
# Say you are trying to delete a versioned S3 bucket. You run all your
# S3 commands and the bucket looks empty, but it's not: all of the
# hidden/previous versions are still up there, and non-empty buckets can't be deleted.
#
# You click "show versions" in the console, and wow, it's a ton of stuff.
# S3 deletes one file at a time, and there's more than 24 hours' worth of
# individual delete commands needed to remove all those versions. But you try anyway.
# And then your MFA token times out within 24 hours and all the deletes are rolled back.
# Terminal probably isn't on MFA, so you're probably going to have to fire off
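The loop itself is below the cutoff; a hedged sketch, not the gist's code, of the usual boto3 approach to purging every object version and delete marker (the bucket name is a placeholder):
# Hedged sketch, not the gist's hidden lines: purge every object version and delete
# marker so the bucket can actually be deleted. "my-versioned-bucket" is a placeholder.
import boto3

bucket = boto3.resource("s3").Bucket("my-versioned-bucket")
bucket.object_versions.delete()   # batches DeleteObjects calls across all versions
# bucket.delete()                 # once it is truly empty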
@tonythor
tonythor / switch.sh
Last active March 8, 2022 20:58
A bash script to swap AWS credential files, in case you don't want to use profiles.
#!/bin/bash
# This assumes you want to use symlinks for ~/.aws/config and ~/.aws/credentials.
# Symlinks work just fine with IDEs, AWS clients, etc. In my experience this is simpler than
# always using profiles, plus you get to name/isolate credentials files.
usage() {
echo "$0 -w (to work) -p (to personal)"
exit 1
}
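For completeness, a hedged Python equivalent of the symlink swap; the file-naming convention below is an assumption, not taken from the gist:
# Hedged Python equivalent, not the gist's script. Assumes credential sets are kept as
# ~/.aws/config.work / credentials.work and config.personal / credentials.personal.
from pathlib import Path
import sys

def switch(profile: str) -> None:
    aws = Path.home() / ".aws"
    for name in ("config", "credentials"):
        link, target = aws / name, aws / f"{name}.{profile}"
        if link.is_symlink() or link.exists():
            link.unlink()
        link.symlink_to(target)   # IDEs and AWS clients follow the symlink transparently

if __name__ == "__main__":
    switch({"-w": "work", "-p": "personal"}[sys.argv[1]])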