Michael Erasmus michael-erasmus

michael-erasmus / README.md
Created Jun 5, 2018
Playing around with Apache Superset
michael-erasmus / README.md
Last active May 17, 2018
Speeding up the deletion of an S3 bucket with millions of nested files

I had a really interesting journey today with a thorny little challenge: deleting all the files in an S3 bucket path with tons of nested files. The path (s3://buffer-data/emr/logs/) contained log files created by Elastic MapReduce (EMR) jobs that ran every day over a couple of years (from early 2015 to early 2018).

Each EMR job ran hourly, every day, firing up a cluster of machines, and each machine wrote its own logs. That resulted in thousands of nested paths (one for each job), each containing thousands of other files. I estimated the total number of nested files at between 5 and 10 million.

I had to estimate this number from sample counts of some of the nested directories, because getting the true count would mean recursing through the whole S3 tree, which was just too slow. This is also exactly why it was challenging to delete all the files.
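
The sampling estimate described above can be sketched roughly as follows. This is a minimal illustration, not the exact method used: the function name `estimate_total_files`, the sample counts, and the number of job prefixes are all hypothetical values chosen for the example.

```python
from statistics import mean


def estimate_total_files(sample_counts, num_prefixes):
    """Extrapolate a total file count from a few sampled prefixes.

    sample_counts: file counts observed in a handful of nested paths.
    num_prefixes:  (assumed) total number of nested job paths.
    """
    if not sample_counts:
        raise ValueError("need at least one sampled prefix")
    # Average files per sampled prefix, scaled up to every prefix.
    return round(mean(sample_counts) * num_prefixes)


# Hypothetical numbers, for illustration only.
sample_counts = [250, 300, 350]
num_prefixes = 26_000
print(estimate_total_files(sample_counts, num_prefixes))
```

The estimate is only as good as the samples: if a few jobs wrote far more logs than the rest, a small sample can be well off, which is why a range is more honest than a single number.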

Deleting all the files in a s3 object like this is pretty challenging, since s3 doesn't really work like a true f

michael-erasmus / load_twitter_error_codes.py
Created Mar 7, 2018
Load twitter error codes into Redshift
import pandas as pd
from rsdf import redshift
table = """
<table border="1">
<thead valign="bottom"><tr class="row-odd"><th class="head">Code</th>
<th class="head">Text</th>
<th class="head">Description</th>
</tr></thead><tbody valign="top"><tr><td>3</td>
<td>Invalid coordinates.<br>
View steemit tags.ipynb
View Dockerfile
FROM python:3.6
# Pin gRPC so grpcio and grpcio-tools stay on the same version
ENV GRPC_PYTHON_VERSION=1.4.0
RUN python -m pip install --upgrade pip
RUN pip install grpcio==${GRPC_PYTHON_VERSION} grpcio-tools==${GRPC_PYTHON_VERSION}
# Copy requirements before any app code so this dependency layer stays cached
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
WORKDIR /usr/src/app
View Firehoser Setup.ipynb
View Looker Error Log 1
2017-07-05 20:50:38.320 +0000 [ERROR|6e766|] :: LookerSDK::InternalServerError : An error has occurred.
uri:classloader:/bundler/gems/looker-sdk-ruby-595320e261c6/lib/looker-sdk/response/raise_error.rb:15:in `on_complete'
uri:classloader:/gems/faraday-0.9.0/lib/faraday/response.rb:9:in `block in call'
uri:classloader:/gems/faraday-0.9.0/lib/faraday/response.rb:57:in `on_complete'
uri:classloader:/gems/faraday-0.9.0/lib/faraday/response.rb:8:in `call'
uri:classloader:/gems/faraday-0.9.0/lib/faraday/rack_builder.rb:139:in `build_response'
uri:classloader:/gems/faraday-0.9.0/lib/faraday/connection.rb:377:in `run_request'
uri:classloader:/gems/faraday-0.9.0/lib/faraday/connection.rb:140:in `delete'
uri:classloader:/gems/sawyer-0.6.0/lib/sawyer/agent.rb:94:in `call'
uri:classloader:/bundler/gems/looker-sdk-ruby-595320e261c6/lib/looker-sdk/client.rb:256:in `request'
View Generate LookML.ipynb
View Test.ipynb
View Generate LookML.ipynb