Marc Limotte mlimotte

@mlimotte
mlimotte / test_jobdef.clj
Created October 9, 2014 14:12
How to make step-name dynamic in a Lemur defstep
;; Based on https://gist.github.com/gareth625/5d69cd883b3a154f0fa7
;; Run it with `lemur run test_jobdef.clj`
(catch-args
[:run-step
"Set as the name of the step"
"lemur-is-awesome"])
(defcluster the-cluster
:app "AnApp"
@mlimotte
mlimotte / GsonJson.scala
Last active September 3, 2015 15:59
Scala code for converting arbitrary nested data structures (lists, maps, sets, values) to a JSON string. This involves converting the nested Scala collections (mutable/immutable Maps, Sets, Iterables, etc.) to their Java counterparts, which is done with a distinct function in the gist (toJava).
import com.google.gson.Gson
import scala.collection.JavaConversions
val gson = new Gson()
val mapPrototype = new java.util.HashMap[String,Any]()
def parseJson(json: String): Map[String,Any] = {
// Note: mapAsScalaMap is a wrapper, the data is NOT copied
scala.collection.JavaConversions.mapAsScalaMap(gson.fromJson(json, mapPrototype.getClass)).toMap
}
@mlimotte
mlimotte / MemoryJoin
Created March 16, 2011 17:44
A Cascalog function to join a small file that can fit in memory, map-side.
package foo.cascalog;
import cascading.flow.FlowProcess;
import cascading.flow.hadoop.HadoopFlowProcess;
import cascading.operation.FunctionCall;
import cascading.operation.OperationCall;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;
import cascalog.CascalogFunction;
import org.apache.hadoop.conf.Configuration;
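The map-side join idea named in this gist's description can be sketched in plain Java, without Cascading or Cascalog. This is a minimal illustration under stated assumptions, not the gist's MemoryJoin class: the class name `MapSideJoinSketch`, the tab-separated row format, and the inner-join behavior are all hypothetical choices made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a map-side (in-memory) join; not the gist's code.
final class MapSideJoinSketch {

    // Load the small side once into a hash map.
    // Assumed row format: "key<TAB>value".
    static Map<String, String> loadSmallSide(List<String> smallRows) {
        Map<String, String> lookup = new HashMap<>();
        for (String row : smallRows) {
            String[] parts = row.split("\t", 2);
            if (parts.length == 2) {
                lookup.put(parts[0], parts[1]);
            }
        }
        return lookup;
    }

    // Stream the large side and look each key up in memory.
    // Rows without a match are dropped (inner-join semantics, an assumption).
    static List<String> join(List<String> largeRows, Map<String, String> lookup) {
        List<String> joined = new ArrayList<>();
        for (String row : largeRows) {
            String[] parts = row.split("\t", 2);
            if (parts.length < 2) {
                continue; // skip malformed rows
            }
            String match = lookup.get(parts[0]);
            if (match != null) {
                joined.add(parts[0] + "\t" + parts[1] + "\t" + match);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> lookup =
            loadSmallSide(Arrays.asList("us\tUnited States", "fr\tFrance"));
        List<String> out = join(Arrays.asList("us\t42", "de\t7", "fr\t13"), lookup);
        out.forEach(System.out::println);
    }
}
```

Because the lookup table is held in memory, each record of the large input can be joined as it is read, without a shuffle or reduce phase. That only works while the small side fits in memory, as the description says.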
@mlimotte
mlimotte / ClojureFilterFP
Created September 19, 2011 15:20
Counters in Cascalog
/**
* The majority of this class is copied from the Cascalog source (1.7.0-SNAPSHOT as of 9/17/2011).
* This is a filter operation, where the FlowProcess object is exposed
*/
package com.weatherbill.hadoop;
import cascading.operation.Filter;
import cascading.operation.FilterCall;
import cascading.flow.FlowProcess;
@mlimotte
mlimotte / Streaming Lemur jobdef
Created May 8, 2013 14:56
Sample Lemur jobdef, showing Hadoop Streaming and pipelined jobs (i.e. output of one job is input of another). The defstep defines a single step in the process. You can include as many defsteps as you want in the jobdef. The ones that are actually run are controlled by the fire! call, as shown in the example. Alternatively, the steps can be in a…
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;; Sample of a Jobdef for a Streaming job
;;;
;;; Example of common usage:
;;; lemur run strm-jobdef.clj --bucket my-bucket-name
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(catch-args
[:bucket "An s3 bucket, e.g. 'com.myco.bucket1'"]
)

Keybase proof

I hereby claim:

  • I am mlimotte on github.
  • I am mlimotte (https://keybase.io/mlimotte) on keybase.
  • I have a public key ASCPpX8cibderVDoBlFGbVy0_lZQQmxZKpKSBE4BzBNKqgo

To claim this, I am signing this object:

@mlimotte
mlimotte / merge-with-key.clj
Created April 24, 2012 15:10
Clojure merge-with-key
(ns mlimotte.util)
; A variation on clojure.core/merge-with
(defn merge-with-key
"Returns a map that consists of the rest of the maps conj-ed onto
the first. If a key occurs in more than one map, the mapping(s)
from the latter (left-to-right) will be combined with the mapping in
the result by calling (f key val-in-result val-in-latter)."
[f & maps]
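The semantics in the docstring above, where the combining function also receives the key, can be illustrated with a small Java sketch. The names `MergeWithKeySketch` and `KeyedCombiner` are hypothetical and exist only for this example. It mirrors the described left-to-right behavior but does not try to reproduce every edge case of clojure.core/merge-with (for instance, it returns an empty map when given no maps).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical Java illustration of the merge-with-key semantics.
final class MergeWithKeySketch {

    // Combiner called as f(key, valueInResult, valueInLatter).
    @FunctionalInterface
    interface KeyedCombiner<K, V> {
        V combine(K key, V valueInResult, V valueInLatter);
    }

    // Merge the maps left to right. When a key is already in the result,
    // the combiner decides the new value; otherwise the value is copied.
    static <K, V> Map<K, V> mergeWithKey(KeyedCombiner<K, V> f, List<Map<K, V>> maps) {
        Map<K, V> result = new LinkedHashMap<>();
        for (Map<K, V> m : maps) {
            for (Map.Entry<K, V> e : m.entrySet()) {
                K key = e.getKey();
                if (result.containsKey(key)) {
                    result.put(key, f.combine(key, result.get(key), e.getValue()));
                } else {
                    result.put(key, e.getValue());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = new LinkedHashMap<>();
        a.put("sum", 1);
        a.put("max", 5);
        Map<String, Integer> b = new LinkedHashMap<>();
        b.put("sum", 2);
        b.put("max", 3);
        b.put("other", 9);

        // The key selects the combining strategy: add for "sum", max otherwise.
        Map<String, Integer> merged = mergeWithKey(
            (k, x, y) -> k.equals("sum") ? x + y : Math.max(x, y),
            Arrays.asList(a, b));
        System.out.println(merged); // prints {sum=3, max=5, other=9}
    }
}
```

Passing the key to the combiner is what distinguishes this from a plain merge-with: a single function can apply a different strategy per key.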
#!/bin/bash -e
# 2010-09-19 Marc Limotte
# Run continuously (every 30 minutes) as a cron.
#
# Looks for directories in HDFS matching a certain pattern and moves them to S3, using Amazon's new
# distcp replacement, S3DistCp.
#
# It creates marker files (_directory_.done and _directory_.processing) at the S3 destination, so
@mlimotte
mlimotte / vault-aws.sh
Created June 29, 2016 13:49
A bash function to get Vault (HashiCorp) credentials using the AWS backend and set them in environment variables for use by the AWS CLI.
#!/bin/bash
function vault-aws () {
VAULT_PATH=$1
if [ -z "$VAULT_PATH" ]; then
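# printf interprets \n; plain text avoids backticks re-invoking vault-aws inside the message
printf 'Missing VAULT_PATH argument.\nExample: vault-aws documents-store\n'
# return (not exit) so a sourced function does not close the caller's shell
return 1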
fi
if [ -z "$VAULT_ADDR" ]; then
echo "Missing VAULT_ADDR env variable"
@mlimotte
mlimotte / aws_client_vpc_endpoint_setup_notes.md
Last active June 15, 2022 02:54
AWS Client VPN Endpoint Setup tips and checklist

Overview

We have remote developers who occasionally need access to AWS servers and to the QA and Staging databases (RDS MySQL instances). The AWS servers (EC2, Fargate) are in a private VPC. The RDS databases are in different VPCs. They have the "publicly accessible" attribute set, which means they get a public DNS name, but only a handful of IPs are whitelisted for that access; developers should get access over a VPN.

This is summarized as:

laptop --ClientVPN--> VPC _A_ --VPC Peer--> RDS in VPC _B_

I chose the Client VPN Endpoint so that AWS would manage the remote side of the tunnel. I chose Viscosity (on a Mac) as our VPN client because it's easy to use and supports split DNS and split routing. It's affordable, but not free. Split DNS is important so that Amazon hostnames can be resolved to their internal IP addresses. Split routing is important so that only the AWS-destined traffic goes over the VPN tunnel and other internet traffic can go directly to the internet.
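As a rough illustration of what split routing decides, the sketch below checks whether a destination IPv4 address falls inside one of the CIDR ranges that should go through the tunnel. The class name `SplitRoutingSketch` and the CIDR blocks (10.0.0.0/16 for VPC A, 10.1.0.0/16 for VPC B) are hypothetical; the notes do not state the real ranges. In practice the VPN client installs routes in the operating system's routing table rather than running application code like this.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a split-routing decision for IPv4 destinations.
final class SplitRoutingSketch {

    // Convert a dotted-quad IPv4 address to its 32-bit value (held in a long).
    static long ipv4ToLong(String ip) {
        String[] octets = ip.split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String octet : octets) {
            int n = Integer.parseInt(octet);
            if (n < 0 || n > 255) {
                throw new IllegalArgumentException("Octet out of range in: " + ip);
            }
            value = (value << 8) | n;
        }
        return value;
    }

    // True when ip lies inside the CIDR block, e.g. "10.0.0.0/16".
    static boolean inCidr(String ip, String cidr) {
        String[] parts = cidr.split("/");
        int prefix = Integer.parseInt(parts[1]);
        long mask = (0xFFFFFFFFL << (32 - prefix)) & 0xFFFFFFFFL;
        return (ipv4ToLong(ip) & mask) == (ipv4ToLong(parts[0]) & mask);
    }

    // Traffic to any listed CIDR goes through the tunnel; everything else goes direct.
    static boolean viaTunnel(String destinationIp, List<String> tunnelCidrs) {
        for (String cidr : tunnelCidrs) {
            if (inCidr(destinationIp, cidr)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed ranges for VPC A and VPC B; replace with the real CIDRs.
        List<String> tunnelCidrs = Arrays.asList("10.0.0.0/16", "10.1.0.0/16");
        System.out.println(viaTunnel("10.1.200.9", tunnelCidrs));    // AWS-destined
        System.out.println(viaTunnel("93.184.216.34", tunnelCidrs)); // other internet traffic
    }
}
```

Note that the peered VPC's range has to be among the tunneled routes as well, since the RDS instances live in VPC B and are reached through VPC A.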