Yudhiesh Ravindranath yudhiesh

  • MoneyLion
  • Kuala Lumpur
import random
from metaflow import FlowSpec, step, S3, Flow, Parameter, profile, kubernetes, conda, conda_base

# change columns according to your schema (or remove column list to load all)
COLUMNS = ['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime']

# group parquet files as 1GB batches
def shard_data(src, batch_size=1_000_000_000):
    with S3() as s3:
        objs = s3.list_recursive([src])
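The preview stops before the batching logic, so what follows is only a sketch of one way to group files into batches of roughly `batch_size` bytes. It assumes each listed object can be reduced to a (key, size) pair; the `group_by_size` helper and the sample file names are hypothetical and do not come from the gist.

```python
from typing import Iterable, List, Tuple


def group_by_size(files: Iterable[Tuple[str, int]],
                  batch_size: int = 1_000_000_000) -> List[List[str]]:
    """Greedily pack (key, size) pairs into batches of at most batch_size bytes.

    A file larger than batch_size ends up alone in its own batch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for key, size in files:
        # Close the running batch if adding this file would exceed the limit.
        if current and current_bytes + size > batch_size:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(key)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


# Hypothetical listing: four parquet files with sizes in bytes.
listing = [("a.parquet", 600), ("b.parquet", 500),
           ("c.parquet", 300), ("d.parquet", 1500)]
print(group_by_size(listing, batch_size=1000))
```

Greedy packing keeps the files in listing order, which is simple but does not minimize the number of batches.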
@tuulos
tuulos / s3dir.py
Created March 10, 2023 06:43
Sync full directories to/from S3
import os
from metaflow import S3

def put_dir(local_root, s3root):
    root = os.path.abspath(local_root)
    objs = []
    for p, _, files in os.walk(root):
        for f in files:
            path = os.path.join(p, f)
            key = os.path.relpath(path, start=root)
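The preview ends right after the relative key is computed. As a rough illustration of that walk, the sketch below collects (key, path) pairs for every file under a directory. The `collect_key_paths` name is hypothetical, and the upload step itself is not shown because it is not visible in the preview.

```python
import os
import tempfile
from typing import List, Tuple


def collect_key_paths(local_root: str) -> List[Tuple[str, str]]:
    """Return (relative key, absolute path) pairs for every file under local_root."""
    root = os.path.abspath(local_root)
    pairs = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            # Use forward slashes so keys look the same on every platform.
            key = os.path.relpath(path, start=root).replace(os.sep, "/")
            pairs.append((key, path))
    return sorted(pairs)


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("top.txt", os.path.join("sub", "nested.txt")):
        with open(os.path.join(tmp, rel), "w") as fh:
            fh.write("x")
    keys = [key for key, _ in collect_key_paths(tmp)]
    print(keys)
```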
@kklemon
kklemon / iterable_dataset_dist.py
Last active June 6, 2024 08:05
PyTorch IterableDataset implementation with multiprocessing and distributed training support
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.utils.data import IterableDataset, DataLoader

class DistributedIterableDataset(IterableDataset):
    """
    Example implementation of an IterableDataset that handles both multiprocessing (num_workers > 0)
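The class body is cut off in this preview. One common way to make an iterable stream safe for both loader workers and multiple training processes is to give each (rank, worker) pair every n-th item. The sketch below shows that idea in plain Python; `shard_stream` is a hypothetical helper, and it is an assumption that the gist does something similar. In a PyTorch setting the four integers would typically come from `torch.distributed.get_rank()`, `torch.distributed.get_world_size()`, and `torch.utils.data.get_worker_info()`.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_stream(items: Iterable[T], rank: int, world_size: int,
                 worker_id: int, num_workers: int) -> Iterator[T]:
    """Yield only the items that belong to one (rank, worker) pair.

    Each of the world_size * num_workers consumers takes every n-th item,
    starting at its own offset, so no item is read twice.
    """
    total = world_size * num_workers
    offset = rank * num_workers + worker_id
    return islice(items, offset, None, total)


# Two processes with two loader workers each, over ten items.
shards = {
    (rank, worker): list(shard_stream(range(10), rank, 2, worker, 2))
    for rank in range(2)
    for worker in range(2)
}
print(shards)
```

Shards can differ in length when the item count is not divisible by the number of consumers, which matters for training loops that expect every process to see the same number of batches.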
@jefftriplett
jefftriplett / python-django-postgres-ci.yml
Last active March 27, 2024 04:27
This is a good starting point for getting Python, Django, Postgres running as a service, pytest, black, and pip caching rolling with GitHub Actions.
name: CI
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    services:
@ddelange
ddelange / airflow_slack_notifications.md
Last active November 16, 2023 16:57
Airflow Slack notifications

Installation

Make sure slackclient v1.3.1 is installed (for apache-airflow 1.10).

pip install -U "apache-airflow[slack,...]"
@mayankcpdixit
mayankcpdixit / install-kafka-mac.md
Last active April 19, 2022 02:25
Install Kafka in local (mac)

Install Kafka on your local Mac

Run the following commands:

brew install kafka
sudo mkdir -p /usr/local/var/run/zookeeper/data
sudo chmod 777 /usr/local/var/run/zookeeper/data
zkServer start

mkdir -p /usr/local/var/lib/kafka-logs

Introduction

This gist started with a collection of resources I was maintaining on stream data processing — also known as distributed logs, data pipelines, event sourcing, CQRS, and other names.

Over time the set of resources grew quite large and I received some interest in a more guided, opinionated path for learning about stream data processing. So I added the reading list.

Please send me feedback!

@goraj
goraj / incremental_lightgbm.py
Last active October 1, 2021 03:25
incremental learning lightgbm
# -*- coding: utf-8 -*-
"""
@author: goraj
"""
import lightgbm as lgbm
from sklearn.datasets import load_digits
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
@karpathy
karpathy / min-char-rnn.py
Last active June 28, 2024 06:13
Minimal character-level language model with a Vanilla Recurrent Neural Network, in Python/numpy
"""
Minimal character-level Vanilla RNN model. Written by Andrej Karpathy (@karpathy)
BSD License
"""
import numpy as np
# data I/O
data = open('input.txt', 'r').read() # should be simple plain text file
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
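The preview stops once the vocabulary size is known. A character-level model also needs a way to turn characters into integers and back, so here is a small pure-Python sketch of that step. The sample string, the `one_hot` helper, and the use of `sorted` (for a reproducible mapping) are assumptions for illustration, not part of the preview above.

```python
# Stand-in for the contents of input.txt (hypothetical sample text).
data = "hello world"
chars = sorted(set(data))  # sorted so the mapping is reproducible across runs
data_size, vocab_size = len(data), len(chars)

# Lookup tables between characters and integer indices.
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}


def one_hot(ch: str) -> list:
    """Encode a character as a vocab_size-long list with a single 1."""
    vec = [0] * vocab_size
    vec[char_to_ix[ch]] = 1
    return vec


# Round trip: text -> indices -> text.
encoded = [char_to_ix[ch] for ch in data]
decoded = "".join(ix_to_char[i] for i in encoded)
print(data_size, vocab_size)
```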