samklr / INSTALL.md
Created October 10, 2021 12:15 — forked from jpillora/INSTALL.md
Headless Transmission on Mac OS X
  1. Go to https://developer.apple.com/downloads/index.action, search for "Command line tools", and choose the one for your version of OS X

  2. Go to http://brew.sh/ and enter the one-liner into the Terminal; you now have Homebrew installed (a better MacPorts)

  3. Install transmission-daemon with

    brew install transmission
    
  4. Symlink the startup config for launchctl with

    ln -sfv /usr/local/opt/transmission/*.plist ~/Library/LaunchAgents
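
  5. (Optional) Load the agent so transmission-daemon starts right away; a minimal sketch, assuming Homebrew names the plist homebrew.mxcl.transmission.plist:

    launchctl load ~/Library/LaunchAgents/homebrew.mxcl.transmission.plist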
    
version: '2'
services:
  minio:
    restart: always
    image: docker.io/bitnami/minio:2021
    ports:
      - '9000:9000'
    environment:
      - MINIO_ROOT_USER=miniokey
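
Once the remaining credentials are filled in, a compose file like this can be brought up with:

    docker-compose up -d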
# suppose my data file name has the format "datafile_YYYY-MM-DD.csv"; this file arrives in S3 every day.
file_suffix = "{{ execution_date.strftime('%Y-%m-%d') }}"
bucket_key_template = 's3://[bucket_name]/datafile_{}.csv'.format(file_suffix)
file_sensor = S3KeySensor(
    task_id='s3_key_sensor_task',
    poke_interval=60 * 30,  # seconds; check for the file every half hour
    timeout=60 * 60 * 12,   # give up after 12 hours
    bucket_key=bucket_key_template,
    bucket_name=None,       # the bucket is already part of the full s3:// key above
    wildcard_match=False)
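
A downstream task can then be chained after the sensor so it only runs once the day's file has landed; a minimal sketch, assuming this lives inside a DAG definition and BashOperator is imported (the task below is illustrative):

    process_file = BashOperator(
        task_id='process_file',
        bash_command='echo "processing {{ ds }}"')
    file_sensor >> process_file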
samklr / s3_sensor.py
Created February 12, 2021 20:03 — forked from msumit/s3_sensor.py
Airflow file sensor example
from airflow import DAG
from airflow.operators.sensors import S3KeySensor
from airflow.operators import BashOperator
from datetime import datetime, timedelta

yday = datetime.combine(datetime.today() - timedelta(1),
                        datetime.min.time())

default_args = {
    'owner': 'msumit',
with DAG(**dag_config) as dag:
    # Declare pipeline start and end tasks
    start_task = DummyOperator(task_id='pipeline_start')
    end_task = DummyOperator(task_id='pipeline_end')

    for account_details in pipeline_config['task_details']['accounts']:
        # Declare account start and end tasks
        if account_details['runable']:
            acct_start_task = DummyOperator(task_id=account_details['account_id'] + '_start')
            acct_start_task.set_upstream(start_task)
samklr / parquet_tools.md
Last active January 22, 2021 19:50
Parquet Tools

Setup Parquet-tools: `brew install parquet-tools`

Help: `parquet-tools -h`

Count the rows in a file:

    parquet-tools rowcount part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet

Show the first record:

    parquet-tools head -n 1 part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet

Show the file metadata:

    parquet-tools meta part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet
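
The `schema` subcommand is also handy for a quick look at the column names and types of the same file:

    parquet-tools schema part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet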

samklr / gist:743d927dd0a5f5671c64b1d346e7b318
Created November 26, 2020 21:00
Data Engineering assignment

Context: The Integration team has deployed a cron job that dumps a CSV file containing all the new Shopify configurations daily at 2 AM UTC. The task is to build a daily pipeline that will:

- download the CSV file from https://alg-data-public.s3.amazonaws.com/[YYYY-MM-DD].csv
- filter out each row with an empty application_id
- add a has_specific_prefix column set to true if the value of index_prefix differs from shopify_, else false
- load the valid rows into a PostgreSQL instance

The pipeline should process files from 2019-04-01 to 2019-04-07; a sketch of one day's run is shown below.
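
A minimal sketch of the daily run, assuming pandas and SQLAlchemy; the target table name and connection string are illustrative, not part of the assignment:

    import pandas as pd
    from sqlalchemy import create_engine

    def process_day(day: str) -> None:
        # Download the day's CSV straight from the public bucket.
        df = pd.read_csv(f"https://alg-data-public.s3.amazonaws.com/{day}.csv")

        # Filter out rows with an empty application_id.
        df = df[df["application_id"].notna() & (df["application_id"] != "")]

        # has_specific_prefix is true when index_prefix differs from "shopify_".
        df["has_specific_prefix"] = df["index_prefix"] != "shopify_"

        # Load the valid rows into PostgreSQL (connection details are assumptions).
        engine = create_engine("postgresql://user:password@localhost:5432/shopify")
        df.to_sql("shopify_configurations", engine, if_exists="append", index=False)

    for day in pd.date_range("2019-04-01", "2019-04-07").strftime("%Y-%m-%d"):
        process_day(day)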

play.modules.enabled += "com.samklr.KamonModule"

kamon {
  environment {
    service = "my-svc"
  }
  jaeger {
samklr / golang_job_queue.md
Created November 9, 2019 10:02 — forked from harlow/golang_job_queue.md
Job queues in Golang
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*