Skip to content

Instantly share code, notes, and snippets.

@samklr
samklr / gist:743d927dd0a5f5671c64b1d346e7b318
Created November 26, 2020 21:00
Data Engineering assignement

Context The Integration team has deployed a cron job to dump a CSV file containing all the new Shopify configurations daily at 2 AM UTC. The task will be to build a daily pipeline that will :

download the CSV file from https://alg-data-public.s3.amazonaws.com/[YYYY-MM-DD].csv, filter out each row with empty application_id, add a has_specific_prefix column set to true if the value of index_prefix differs from shopify_ else to false load the valid rows to a Postresql instance The pipeline should process files from 2019-04-01 to 2019-04-07.

@samklr
samklr / Job.scala
Created June 27, 2017 19:19
Spark Streaming Websocket Receiver
val ssc = new StreamingContext("local", "datastream", Seconds(15))
// create InputDStream
ssc.registerInputStream(stream)
// interact with stream
ssc.start()
#!/bin/bash
brew install redis
ln -sfv /usr/local/opt/redis/*.plist ~/Library/LaunchAgents # Enable Redis autostart
launchctl load ~/Library/LaunchAgents/homebrew.mxcl.redis.plist # load Redis server
# homebrew.mxcl.redis.plist contains reference to redis.conf file location: /usr/local/etc/redis.conf
redis-server /usr/local/etc/redis.conf # Start Redis server using configuration file, Ctrl+C to stop
@samklr
samklr / OffsetMngHbase.scala
Last active March 31, 2020 16:02
Offset Management on HBase
/*
Save offsets for each batch into HBase
*/
def saveOffsets(TOPIC_NAME:String,GROUP_ID:String,offsetRanges:Array[OffsetRange],
hbaseTableName:String,batchTime: org.apache.spark.streaming.Time) ={
val hbaseConf = HBaseConfiguration.create()
hbaseConf.addResource("src/main/resources/hbase-site.xml")
val conn = ConnectionFactory.createConnection(hbaseConf)
val table = conn.getTable(TableName.valueOf(hbaseTableName))
val rowKey = TOPIC_NAME + ":" + GROUP_ID + ":" +String.valueOf(batchTime.milliseconds)
@samklr
samklr / ipython-nbserver.py
Last active March 26, 2020 21:52
Install Ipython Notebook on a VM and Launch it as a server in a Cloud Platform. Here, in Google Compute Engine.
##### Install a lot of stuff first #####
$sudo apt-get update
##install python
$ wget http://09c8d0b2229f813c1b93-c95ac804525aac4b6dba79b00b39d1d3.r79.cf1.rackcdn.com/Anaconda-2.0.1-Linux-x86_64.sh
$ sudo bash anaconda........sh
##install necessary libs
$ sudo apt-get install -y python-matplotlib python-tornado ipython ipython-notebook python-setuptools python-pip
@samklr
samklr / ansible-gce.sh
Last active January 11, 2020 22:56
Ansible GCE setup
#! /bin/sh
### Must have Gcloud sdk installed and configured
###Create a micro instance as ansible master
gcloud compute --project $PROJECT_NAME instances create "ansible" --zone "us-central1-b" --machine-type "f1-micro" --network "default" --maintenance-policy "MIGRATE" --scopes "https://www.googleapis.com/auth/userinfo.email" "https://www.googleapis.com/auth/compute" "https://www.googleapis.com/auth/devstorage.full_control" --tags "http-server" "https-server" --no-boot-disk-auto-delete
###or a centos like in the tutorial
gcloud compute --project $PROJECT_NAME instances create "ansible-master" --zone "us-central1-b" --machine-type "g1-small" --network "default" --maintenance-policy "MIGRATE" --scopes "https://www.googleapis.com/auth/userinfo.email" "https://www.googleapis.com/auth/compute" "https://www.googleapis.com/auth/devstorage.full_control" --tags "http-server" "https-server" --image "https://www.googleapis.com/compute/v1/projects/centos-cloud/global/images/centos-7-v20140926" --no-boot-disk-auto-de
play.modules.enabled += "com.samklr.KamonModule"
kamon {
environment {
service = "my-svc"
}
jaeger {
@samklr
samklr / hdfs-mesos.md
Last active December 5, 2019 19:20
Setup HDFS on Mesos, Run Spark Cluster dispatcher via Marathon

Setup Mesos-DNS

Scripts for setting up

sudo mkdir /etc/mesos-dns
sudo vi /etc/mesos-dns/config.json
@samklr
samklr / golang_job_queue.md
Created November 9, 2019 10:02 — forked from harlow/golang_job_queue.md
Job queues in Golang
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*