@rapatil
rapatil / Automating Salesforce Data Extraction Using Python.ipynb
Last active March 22, 2024 05:11
Approach: Automating Salesforce Data Extraction Using Python
@datajoely
datajoely / layers.md
Last active September 4, 2023 10:22
Kedro data layers
| Layer | Order | Description |
| --- | --- | --- |
| raw | Sequential | The initial start of the pipeline, containing the sourced data model(s) that should never be changed; it forms your single source of truth to work from. These data models can be un-typed in most cases (e.g. csv), but this will vary from case to case. Given the relative cost of storage today, painful experience suggests it's safer to never work with the original data directly! |
| intermediate | Sequential | This stage is optional if your data is already typed. A typed representation of the raw layer, e.g. converting string-based values into their correct typed representation as numbers, dates, etc. Our recommended approach is to mirror the raw layer in a typed format like Apache Parquet. Avoid transforming the structure of the data, but simple operations like cleaning up field names or unioning multi-part CSVs are permitted. |
| primary | Sequential | |
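The raw-to-intermediate step above can be sketched with pandas. This is a minimal illustration, not Kedro API usage: the column names and data are hypothetical, and the in-memory CSV stands in for a file that would live under a raw-layer folder.

```python
import io
import pandas as pd

# Stand-in for a raw-layer CSV; in a real pipeline this would be a file on disk
raw_csv = io.StringIO("Order Date,Amount\n2020-01-03,19.99\n2020-01-04,5.00\n")

# raw layer: load everything as strings -- no interpretation, no changes
raw = pd.read_csv(raw_csv, dtype=str)

# intermediate layer: same structure, but typed, with cleaned field names
typed = raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
typed["order_date"] = pd.to_datetime(typed["order_date"])  # string -> datetime
typed["amount"] = pd.to_numeric(typed["amount"])           # string -> number

print(typed.dtypes)
# The typed frame would then be persisted in a typed format, e.g.
# typed.to_parquet("data/02_intermediate/orders.parquet")
```

Note that only types and field names change; the row/column structure of the raw data is preserved, as the table recommends.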
@wingkwong
wingkwong / install.sh
Last active May 19, 2023 15:54
Setting up Airflow in AWS Cloud9 (min requirement: t2.large)
# setup docker-compose
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose
# setup airflow 1.10.14
git clone https://github.com/xnuinside/airflow_in_docker_compose
cd airflow_in_docker_compose
docker-compose -f docker-compose-with-celery-executor.yml up --build
@jirihnidek
jirihnidek / sub-sub-command.py
Last active February 17, 2024 14:18
Python example of using argparse sub-parser, sub-commands and sub-sub-commands
"""
Example of using sub-parser, sub-commands and sub-sub-commands :-)
"""
import argparse
def main(args):
    """Just do something with the parsed arguments."""
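The sub-command and sub-sub-command pattern from this gist can be sketched with nested `add_subparsers` calls. The `config get`/`config set` command names below are illustrative assumptions, not the gist's own commands:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="tool")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # "config" sub-command, which itself has sub-sub-commands
    config = subparsers.add_parser("config", help="manage configuration")
    config_sub = config.add_subparsers(dest="subcommand", required=True)

    config_get = config_sub.add_parser("get", help="read a setting")
    config_get.add_argument("key")

    config_set = config_sub.add_parser("set", help="write a setting")
    config_set.add_argument("key")
    config_set.add_argument("value")

    return parser

parser = build_parser()
args = parser.parse_args(["config", "set", "retries", "3"])
print(args.command, args.subcommand, args.key, args.value)  # config set retries 3
```

Each nesting level just repeats the same `add_subparsers` / `add_parser` pair on the parser one level up.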
@Ze1598
Ze1598 / custom_sort_order.py
Created November 1, 2020 21:45
Create a custom sort for a pandas DataFrame column: months example
import pandas as pd
import numpy as np
def generate_random_dates(num_dates: int) -> np.array:
    """Generate a 1D array of `num_dates` random dates."""
    start_date = "2020-01-01"
    # Generate all days for 2020
    available_dates = [np.datetime64(start_date) + days for days in range(365)]
    # Get `num_dates` random dates from 2020
    return np.random.choice(available_dates, size=num_dates)
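The custom-sort idea the gist's title describes can be shown with an ordered `pd.Categorical`: give the column an explicit category order, and `sort_values` respects it instead of sorting alphabetically. The month abbreviations and sales figures below are made-up sample data:

```python
import pandas as pd

# Hypothetical data: month names sort alphabetically ("Feb" < "Jan" < "Mar") by default
df = pd.DataFrame({
    "month": ["Mar", "Jan", "Feb", "Jan"],
    "sales": [30, 10, 20, 15],
})

# Impose the calendar order instead of the lexicographic one
month_order = ["Jan", "Feb", "Mar"]
df["month"] = pd.Categorical(df["month"], categories=month_order, ordered=True)

df_sorted = df.sort_values("month")
print(df_sorted["month"].tolist())  # ['Jan', 'Jan', 'Feb', 'Mar']
```

Any custom ordering (weekdays, size labels, priority tiers) works the same way: list the categories in the order you want them sorted.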
@cicdw
cicdw / prefect_coiled_demo.ipynb
Last active December 7, 2020 21:27
Outline of Prefect + Coiled demo
@gene1wood
gene1wood / 01-explanation-of-python-logging-and-the-root-logger.md
Last active February 8, 2023 16:09
Explanation of the relationship between python logging root logger and other loggers
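The core of that relationship can be shown in a few lines: named loggers form a dot-separated hierarchy that ends at the root logger, and a logger with no level of its own defers to its ancestors. This is a minimal sketch; the logger name `myapp.module` is an arbitrary example:

```python
import logging

logging.basicConfig(level=logging.INFO)  # configures the ROOT logger

# A named logger; its effective behavior comes from the hierarchy above it
logger = logging.getLogger("myapp.module")

print(logger.level)                # 0 (NOTSET): no level set on this logger itself
print(logger.getEffectiveLevel())  # 20 (INFO): inherited from the root logger
print(logger.parent is logging.getLogger())  # True: nearest configured ancestor is root
```

Because records propagate up to the root logger's handlers, calling `basicConfig` once is usually enough to see output from every logger in the program.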
@iamaziz
iamaziz / read_csv_files_in_tar_gz_from_s3_bucket.py
Last active November 22, 2022 14:45
Read csv files from tar.gz in S3 into pandas dataframes without untar or download (using with S3FS, tarfile, io, and pandas)
# -- read csv files from tar.gz in S3 with s3fs and tarfile (https://s3fs.readthedocs.io/en/latest/)
import tarfile
import pandas as pd
import s3fs

bucket = 'mybucket'
key = 'mycompressed_csv_files.tar.gz'

fs = s3fs.S3FileSystem()
with fs.open(f'{bucket}/{key}', 'rb') as fileobj, tarfile.open(fileobj=fileobj, mode='r:gz') as tar:
    # one DataFrame per .csv member, read without untarring or downloading
    dataframes = {m.name: pd.read_csv(tar.extractfile(m))
                  for m in tar.getmembers() if m.name.endswith('.csv')}
@ericmjl
ericmjl / ds-project-organization.md
Last active April 21, 2024 16:48
How to organize your Python data science project

UPDATE: I have baked the ideas in this file inside a Python CLI tool called pyds-cli. Please find it here: https://github.com/ericmjl/pyds-cli

Having done a number of data projects over the years, and having seen many more up on GitHub, I've come to see that projects vary widely in how "readable" they are. I'd like to share some practices that I have come to adopt in my projects, which I hope will bring some organization to yours.

Disclaimer: I'm hoping nobody takes this to be "the definitive guide" to organizing a data project; rather, I hope you, the reader, find useful tips that you can adapt to your own projects.

Disclaimer 2: What I’m writing below is primarily geared towards Python language users. Some ideas may be transferable to other languages; others may not be so. Please feel free to remix whatever you see here!