Michael Erasmus michael-erasmus

#!/bin/bash
# Get start time in seconds since the epoch
start=$(date +%s)
# Run chatblade
output=$(chatblade -c 4 write a poem)
# Get end time in seconds since the epoch
end=$(date +%s)
michael-erasmus / README.md
Last active October 19, 2021 17:59
Speeding up the deletion of an S3 bucket with millions of nested files

I had a really interesting journey today with a thorny little challenge: deleting all the files in an S3 bucket with tons of nested files. The bucket path (s3://buffer-data/emr/logs/) contained log files created by Elastic MapReduce (EMR) jobs that ran every day for roughly three years (from early 2015 to early 2018).

Each EMR job would run hourly every day, firing up a cluster of machines, and each machine would output its logs. That resulted in thousands of nested paths (one for each job), each containing thousands of other files. I estimated that the total number of nested files would be between 5 and 10 million.

I had to estimate this number from sample counts of some of the nested directories, because getting the true count would mean recursing through the whole S3 tree, which was just too slow. This is also exactly why it was challenging to delete all the files.
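
The kind of extrapolation described above can be sketched roughly as follows. This is only an illustration, not the calculation actually used: the function name `estimate_total_files`, the sample counts, and the number of job prefixes are all hypothetical values picked to show the arithmetic.

```python
from statistics import mean


def estimate_total_files(sample_counts, num_job_prefixes):
    """Extrapolate a total file count from per-prefix sample counts.

    Returns a (low, mid, high) tuple based on the smallest, mean, and
    largest sampled count, each multiplied by the number of job prefixes.
    """
    if not sample_counts:
        raise ValueError("need at least one sampled prefix")
    per_prefix = (min(sample_counts), mean(sample_counts), max(sample_counts))
    low, mid, high = (int(count * num_job_prefixes) for count in per_prefix)
    return low, mid, high


# Hypothetical inputs, for illustration only: file counts from three
# sampled job directories and an assumed number of job directories.
sample_counts = [250, 300, 400]
num_job_prefixes = 20_000

low, mid, high = estimate_total_files(sample_counts, num_job_prefixes)
print(f"estimated files: {low:,} to {high:,} (mean-based: {mid:,})")
```

Reporting a range rather than a single number reflects how rough a handful of samples is; sampling more directories would tighten it at the cost of more listing calls.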

Deleting all the files in an S3 object like this is pretty challenging, since S3 doesn't really work like a true f

michael-erasmus / requirements.txt
Last active August 26, 2020 18:11
Engage comment GNL API Labeller
cachetools==4.1.1
certifi==2020.6.20
chardet==3.0.4
google-api-core==1.21.0
google-auth==1.19.2
google-auth-oauthlib==0.4.1
google-cloud-bigquery==1.25.0
google-cloud-core==1.3.0
google-cloud-language==1.3.0
google-resumable-media==0.5.1
michael-erasmus / tf-idf.py
Created September 24, 2015 20:38
Tf-idf example
import os
import math
import re
import pandas as pd
from collections import Counter
from sklearn.datasets import fetch_20newsgroups
#get a subset of the dataset
categories = [
michael-erasmus / README.md
Created June 5, 2018 19:44
Playing around with Apache Superset
class MyViewController < UIViewController
include ViewTags
#by convention, these views will have tags that correspond to the order you specify them in
# :date_label:1, :name_label:2
has_view :date_label, :name_label
def loadView
views = NSBundle.mainBundle.loadNibNamed "myview", owner:self, options:nil
self.view = views[0]
michael-erasmus / load_twitter_error_codes.py
Created March 7, 2018 17:12
Load twitter error codes into Redshift
import pandas as pd
from rsdf import redshift
table = """
<table border="1">
<thead valign="bottom"><tr class="row-odd"><th class="head">Code</th>
<th class="head">Text</th>
<th class="head">Description</th>
</tr></thead><tbody valign="top"><tr><td>3</td>
<td>Invalid coordinates.<br>
michael-erasmus / slack-wordcloud.R
Last active July 27, 2017 13:13
This is a quick R script that will generate a word cloud from a Slack app team export
#Obviously these need to be installed!
library(jsonlite)
library(tm)
library(wordcloud)
# list.files() takes a regular expression, not a shell glob
files <- list.files('.', pattern = "\\.json$", recursive = TRUE)
json <- sapply(files, fromJSON)
texts <- sapply(json, function(f){if ('subtype' %in% names(f)) f$text[is.na(f$subtype)] else f$text})
FROM python:3.6
ENV GRPC_PYTHON_VERSION=1.4.0
RUN python -m pip install --upgrade pip
RUN pip install grpcio==${GRPC_PYTHON_VERSION} grpcio-tools==${GRPC_PYTHON_VERSION}
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
WORKDIR /usr/src/app