Skip to content

Instantly share code, notes, and snippets.

View pszemraj's full-sized avatar

Peter pszemraj

View GitHub Profile
@xenova
xenova / tiktoken-to-hf.ipynb
Last active May 10, 2024 00:59
Convert tiktoken tokenizers to the Hugging Face tokenizers format
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
@pszemraj
pszemraj / lfs_checkpoint_uploader.sh
Last active July 22, 2023 05:43
for when huggingface trainer decides your files are too big to track intermediate chkpts during training
#!/bin/bash
# install: sudo apt-get install inotify-tools
# Usage: ./scriptname.sh /path/to/monitor/directory /path/to/repo/directory
# If no monitor directory is passed, monitor directory = repo directory
# put & at the end of the command to run in background
# Define your monitor directory
MONITOR_DIR="${1:-$2}"
if [ -z "$MONITOR_DIR" ]; then
@younesbelkada
younesbelkada / finetune_llama_v2.py
Last active May 14, 2024 05:46
Fine tune Llama v2 models on Guanaco Dataset
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
import logging
import warnings
from typing import List, Optional, Union
import numpy as np
import torch
from torch.nn import functional as F
from tqdm.auto import trange
from transformers import AutoTokenizer, PreTrainedModel, PreTrainedTokenizer, RwkvModel

reference for run_summarization

reference for transformers 4.30.0.dev0

about

The below options are additional configuration parameters that can be used when training a model with Hugging Face Transformers. These options control various aspects of the training process, such as the optimizer to use, data loading settings, memory management, model evaluation, checkpointing, and integration with the Hugging Face Model Hub.

Here is a summary of the high-level functionalities provided by some of the options:

@pszemraj
pszemraj / hf_repo_download.py
Last active January 23, 2024 07:15
huggingface hub - download a full snapshot of a repository without using git
"""
hf_hub_download.py
This script allows you to download a snapshot repository from the Hugging Face Hub to a local directory without needing Git or loading the model.
Usage:
python hf_hub_download.py <repo_id> [options]
Arguments:
<repo_id> Repository ID in the format "organization/repository".
@pszemraj
pszemraj / bot_readme.md
Last active May 22, 2023 17:45
for Gio's slugbot project

Image Classification Telegram Bot

This script runs a Telegram bot that classifies images using a pre-trained model. The bot handles /start and /help commands, as well as photo messages. When a photo message is received, the bot downloads the photo, classifies it, and sends a message with the prediction.

The original intended use case is to classify if an image contains a slug or not:

is it a slug

@pszemraj
pszemraj / inference_openai.py
Last active June 4, 2024 17:34
basic openai chat completion example
"""
inference_openai.py - text generation with OpenAI API
See https://platform.openai.com/docs/quickstart for more details.
Usage:
python inference_openai.py --prompt "The quick brown fox jumps over the lazy dog." --model "gpt-3.5-turbo" --temperature 0.5 --max_tokens 256 --n 1 --stop "."
Detailed usage:
python inference_openai.py --help
@pszemraj
pszemraj / eval_summaries.py
Last active September 4, 2023 16:38
unsupervised summary eval using several metrics, including a new 'max salient similarity' score to compute faithfulness w.r.t. original document.
"""
eval_summaries.py - evaluate summary/document pairs via a variety of metrics,
Metrics include max salient similarity, topic similarity, compression factor,
readability scores, and spelling error fraction
details:
python eval_summaries.py --help
this script was developed while evaluating summaries generated with the textsum package
@pszemraj
pszemraj / dl_gauntlet.sh
Last active March 15, 2023 22:56
download "gauntlet" for summarization (peter's version) and run summarization inference on it with the textsum package
URL=https://www.dropbox.com/sh/zu1p7rhg5238a5y/AABsJN_pCYf9plSDZY8ziKATa?dl=1
wget -O docs.zip $URL
unzip -B -j docs.zip -d gauntlet && rm -rf docs.zip