Peter pszemraj

@pszemraj
pszemraj / pile_t5_samsum.sh
Last active April 16, 2024 22:32
Bash script for basic testing with pile-t5-large. Note that this uses a sequence length of 1024 for input and 512 for output.
#!/bin/bash
# Set environment variables
export WANDB_PROJECT="pileT5-summ"
export WANDB_WATCH="gradients"
export WANDB_ENTITY="pszemraj"
NUM_WORKERS=$(lscpu -p | grep -Ev '^#' | sort -u -t, -k 2,4 | wc -l)
echo "Number of CPU cores: $NUM_WORKERS"
pszemraj / atlanta-overview.md
Created March 30, 2024 03:22
modern relocation research & adjustment, courtesy of Claude 3 Opus

Messages Overview - 2024-03-30 04:20:45 - Total Messages: 6

User - Msg No. 1/6

Can you give me an up-to-date overview of Atlanta and the different areas of the city?

Assistant - Msg No. 2/6

Sure, I can provide you with an overview of Atlanta and its different areas. Atlanta is the capital and most populous city in the state of Georgia, with a diverse population and a thriving economy. Here's a breakdown of some of the main areas:

pszemraj / create_archive.py
Created March 27, 2024 12:00
simple CLI for builtin-python archive creation
"""
Creates an archive of a directory
pip install fire
"""
import os
import shutil
from pathlib import Path
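The preview above stops at the imports. As a rough sketch of what archive creation with only the standard library can look like, the following uses `shutil.make_archive`. The function name `create_archive` and its parameters are hypothetical and chosen for illustration; they are not taken from the gist.

```python
import shutil
from pathlib import Path


def create_archive(source_dir, output_base, fmt="zip"):
    """Archive source_dir and return the path of the archive that was written.

    Hypothetical helper for illustration; the gist's real interface is not shown.
    """
    source = Path(source_dir).resolve()
    if not source.is_dir():
        raise NotADirectoryError(f"{source} is not a directory")
    # root_dir is the parent and base_dir the folder name, so the archive
    # contains the folder itself rather than its loose contents.
    archive_path = shutil.make_archive(
        str(output_base),
        fmt,
        root_dir=str(source.parent),
        base_dir=source.name,
    )
    return Path(archive_path)
```

A function like this could be exposed as a CLI with `fire.Fire(create_archive)`, which matches the `pip install fire` hint in the docstring, although the gist's actual entry point is not visible here.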
pszemraj / find_deps.py
Created March 23, 2024 02:26
find local package meta dependencies
import pkg_resources
def list_dependencies(package_name, level=0, explored=set()):
    # Define indent outside of try-except to ensure it's always assigned
    indent = " " * level
    if package_name in explored:
        return
    explored.add(package_name)
pszemraj / dataset_from_list.py
Created March 17, 2024 02:24
hf datasets create a Dataset from a list of dicts
from datasets import Dataset
# Your initial list of dictionaries
data = [
    {"id": 1, "text": "Hello world!", "label": 0},
    {"id": 2, "text": "How are you?", "label": 1},
    # Add more dictionaries as needed
]
# Convert list of dictionaries to a dictionary of lists
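The preview cuts off at the conversion step. A minimal sketch of that step in plain Python might look like the following; the helper name `list_of_dicts_to_dict_of_lists` is hypothetical, and filling missing keys with `None` is an assumption about the desired behavior.

```python
def list_of_dicts_to_dict_of_lists(rows):
    """Turn [{"a": 1}, {"a": 2}] into {"a": [1, 2]}.

    Hypothetical helper; keys missing from a row are filled with None.
    """
    # Collect column names in first-seen order so the output is stable.
    keys = []
    for row in rows:
        for key in row:
            if key not in keys:
                keys.append(key)
    return {key: [row.get(key) for row in rows] for key in keys}


data = [
    {"id": 1, "text": "Hello world!", "label": 0},
    {"id": 2, "text": "How are you?", "label": 1},
]
columns = list_of_dicts_to_dict_of_lists(data)
```

The resulting `columns` mapping is the shape that `Dataset.from_dict` accepts. Recent releases of `datasets` also provide `Dataset.from_list`, which takes the list of dicts directly.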
pszemraj / anthropic_run_summarization.py
Last active March 14, 2024 04:24
run summarization on a directory with anthropic API + langchain
"""
anthropic_run_summarization.py - Generate summaries using langchain + LLMs
For usage details, run `python anthropic_run_summarization.py --help` and fire will print the usage details.
Notes:
- you need to have ANTHROPIC_API_KEY set as an environment variable (easiest way is export ANTHROPIC_API_KEY=memes123)
- install the dependencies using the requirements.txt file or below
pip install fire langchain langchain-community langchain-anthropic clean-text tqdm tiktoken
pszemraj / fuzzy_align.py
Created March 14, 2024 02:49
fuzzy string alignment of two lists
from rapidfuzz import process, fuzz
def fuzzy_align(masterlist, list2, cutoff=70):
    # Dictionary to hold matches
    matches = {}
    # Track used indices to avoid duplicate matches in the masterlist
    used_indices = set()
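The rest of the function is not shown in the preview. The sketch below illustrates one way the alignment loop could work: a greedy pass in which each item of `list2` claims its best-scoring unused entry of the master list. It deliberately swaps `rapidfuzz` for the standard library's `difflib.SequenceMatcher` so that it runs without extra dependencies. The names `similarity` and `fuzzy_align_sketch`, and the choice to key the result by `list2` items, are assumptions rather than the gist's actual logic.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (difflib stand-in for a rapidfuzz scorer)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_align_sketch(masterlist, list2, cutoff=70):
    """Greedily map each item in list2 to its best unused match in masterlist."""
    matches = {}
    used_indices = set()
    for item in list2:
        best_index, best_score = None, None
        for index, candidate in enumerate(masterlist):
            if index in used_indices:
                continue  # each masterlist entry can be matched only once
            score = similarity(item, candidate)
            if score >= cutoff and (best_score is None or score > best_score):
                best_index, best_score = index, score
        if best_index is not None:
            matches[item] = masterlist[best_index]
            used_indices.add(best_index)
    return matches
```

Because the pass is greedy, the order of `list2` can affect which items win a contested master entry; a globally optimal assignment would need a different approach.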
pszemraj / parse_emails.py
Created March 13, 2024 01:53
parse directory of .eml files to a text dataframe, save to parquet
import logging
from email.parser import BytesParser
from pathlib import Path
import fire
import html2text
import pandas as pd
from tqdm import tqdm
# Setup logging
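Only the imports of this script are visible. As a hedged illustration of the parsing step, the sketch below turns the raw bytes of one `.eml` file into a flat record using only the standard library. The function name `parse_eml_bytes` and the chosen fields are hypothetical; the gist itself also imports `html2text` and `pandas`, which are left out here.

```python
from email import policy
from email.parser import BytesParser


def parse_eml_bytes(raw):
    """Parse the raw bytes of one .eml message into a flat dict of text fields.

    Hypothetical helper; the field selection is an assumption.
    """
    msg = BytesParser(policy=policy.default).parsebytes(raw)

    def header(name):
        value = msg[name]
        return str(value) if value is not None else None

    # Prefer the plain-text body and fall back to HTML if that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    return {
        "subject": header("subject"),
        "from": header("from"),
        "to": header("to"),
        "date": header("date"),
        "body": body_part.get_content() if body_part is not None else "",
    }
```

Records like this, collected over `Path(directory).glob("*.eml")`, could then be passed to `pd.DataFrame` and written out with `DataFrame.to_parquet`, in line with the gist description.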
pszemraj / datasets_split.py
Created March 12, 2024 07:03
hf datasets train_test_split with stratify_by_column for any type (by tricking it)
import os
import numpy as np
from datasets import ClassLabel, Dataset, DatasetDict
def split_dataset(
    dataset: Dataset,
    test_size=0.025,
pszemraj / upload_folder.py
Created March 10, 2024 12:36
upload to hub
"""
this script will upload a folder to Hugging Face Hub
python upload_folder.py --help
pip install fire huggingface-hub
"""
import logging