Michael J Clark wassname

@wassname
wassname / choice_tree.py
Last active May 10, 2024 10:08
For Hugging Face transformers, sometimes you want to constrain the output to a JSON schema and record the probabilities of the choices/enums. I use it when rating or judging. It's much more efficient than sampling multiple times.
from jaxtyping import Float, Int
import torch
from torch.nn import functional as F
from torch import Tensor
from typing import List, Callable, Tuple, Dict, Optional
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer
def get_valid_next_choices(choices_tokens, current_tokens):
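
The preview stops at the function signature. As a rough sketch of the idea only (not the gist's actual code), the choices can be treated as a tree of token ids: at each step only tokens that continue some choice are allowed, and the model's probability is renormalised over those tokens. The names `valid_next_tokens`, `choice_probs` and the `next_token_probs` callable are hypothetical, and the sketch assumes no choice is a prefix of another.

```python
def valid_next_tokens(choices_tokens, current_tokens):
    """Token ids that extend current_tokens toward at least one choice."""
    n = len(current_tokens)
    return sorted({
        choice[n]
        for choice in choices_tokens
        if len(choice) > n and choice[:n] == current_tokens
    })


def choice_probs(choices_tokens, next_token_probs):
    """Probability of each choice, renormalised over the valid tokens at each step.

    next_token_probs(prefix) is a hypothetical callable returning
    {token_id: prob} for the model's next-token distribution after `prefix`.
    Assumes no choice is a prefix of another choice.
    """
    probs = []
    for choice in choices_tokens:
        p = 1.0
        for i, tok in enumerate(choice):
            prefix = choice[:i]
            valid = valid_next_tokens(choices_tokens, prefix)
            dist = next_token_probs(prefix)
            # Mass the model puts on tokens that keep us inside the choice tree.
            total = sum(dist.get(t, 0.0) for t in valid)
            p *= dist.get(tok, 0.0) / total if total > 0 else 0.0
        probs.append(p)
    return probs


# Toy stand-in for a model: a fixed next-token distribution per prefix.
toy_model = {
    (): {1: 0.6, 4: 0.2, 9: 0.2},
    (1,): {2: 0.5, 3: 0.25, 7: 0.25},
}
choices = [[1, 2], [1, 3], [4]]
result = choice_probs(choices, lambda prefix: toy_model[tuple(prefix)])
```

With a real model, `next_token_probs` would be one forward pass per tree node, which is where the saving over repeated sampling comes from.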
wassname / hf_perplexity.py
Last active May 9, 2024 05:00
Simple perplexity for Hugging Face models, similar to llama.cpp
# Directly taken from https://huggingface.co/spaces/evaluate-measurement/perplexity/blob/main/perplexity.py
# TODO replace with a strided version https://github.com/huggingface/transformers/issues/9648#issuecomment-812981524
import numpy as np
import torch
import itertools
from torch.nn import CrossEntropyLoss
from tqdm.auto import tqdm
import torch.nn.functional as F
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
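
For reference, perplexity is the exponential of the mean negative log-likelihood per token. The following is a minimal sketch of that definition only, assuming you already have per-token natural-log probabilities; the function name `perplexity` is illustrative and this is not the gist's implementation.

```python
import math


def perplexity(token_logprobs):
    """exp(mean negative log-likelihood) over a sequence of natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among 4 options.
uniform_ppl = perplexity([math.log(0.25)] * 4)
```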
wassname / twohot.md
Last active January 14, 2024 02:09
two-hot encoding notes

What is two-hot encoding?

Description

Two-hot encoding was introduced in 2017 by Marc G. Bellemare et al. in "A Distributional Perspective on Reinforcement Learning", but the clearest description is in the 2023 DreamerV3 paper by Danijar Hafner et al., where it is used for reward and value distributions.

Two-hot encoding is a generalization of one-hot encoding to continuous values. It produces a vector of length |B|, where B is the set of bins, in which all elements are 0 except for the two entries closest to the encoded continuous number, at positions k and k + 1. These two entries sum to 1, with more weight given to the entry that is closer to the encoded number.
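
The description above can be sketched in plain Python. This is an illustrative version, not the gist's own code: the names `two_hot` and `two_hot_decode` are made up here, and it assumes the bins are sorted values with linear interpolation between neighbours and clamping at the edges.

```python
from bisect import bisect_right


def two_hot(x, bins):
    """Encode scalar x as a two-hot vector over sorted bin values (len(bins) >= 2)."""
    # Clamp so out-of-range values put all their weight on an edge bin.
    x = min(max(x, bins[0]), bins[-1])
    # k is the bin at or just below x, capped so that k + 1 is a valid index.
    k = min(bisect_right(bins, x) - 1, len(bins) - 2)
    # Weight on the upper bin grows linearly as x moves from bins[k] to bins[k + 1].
    upper = (x - bins[k]) / (bins[k + 1] - bins[k])
    vec = [0.0] * len(bins)
    vec[k] = 1.0 - upper
    vec[k + 1] = upper
    return vec


def two_hot_decode(vec, bins):
    """Recover the scalar as the weighted sum of bin values."""
    return sum(w * b for w, b in zip(vec, bins))


bins = [-2, -1, 0, 1, 2]
encoded = two_hot(0.25, bins)  # [0.0, 0.0, 0.75, 0.25, 0.0]
```

Decoding with `two_hot_decode(encoded, bins)` returns the original value for any input inside the bin range, which is what makes the encoding usable as a regression target.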

Code samples

wassname / torch_scalar.py
Created December 27, 2023 01:06
wrap sklearn scalers for torch
"""
how to wrap a scikit-learn scaler like RobustScaler for pytorch
"""
import torch
import numpy as np
from einops import rearrange
from sklearn.preprocessing import StandardScaler, RobustScaler
class TorchRobustScaler(RobustScaler):
wassname / style_df.py
Created December 23, 2023 22:57
How to style dataframes in vscode
"""
you cannot just display() a styled dataframe, you need to render it as html explicitly
- see also https://pandas.pydata.org/docs/user_guide/style.html#Builtin-Styles
"""
import pandas as pd
from IPython.display import display, HTML
df = pd.DataFrame({
"strings": ["Adam", "Mike"],
"ints": [1, 3],
wassname / argparse_in_jupyter.py
Last active November 10, 2023 06:07
argparse in jupyter?
"""
sometimes you want to run or adapt a cli script from jupyter; here is a decent way to do it
"""
argvs = """
--rank 16
--context=128
--vae_context=64
"""
argvs = argvs.replace('\n', ' ').strip()
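
The preview ends before the string is handed to a parser. One plausible way to finish the pattern, as a sketch only: build the parser as the CLI script would, then pass the split string to `parse_args` instead of letting it read `sys.argv`. The flag names come from the snippet above; the types and defaults are assumptions.

```python
import argparse
import shlex

# Hypothetical parser standing in for the CLI script's own.
parser = argparse.ArgumentParser()
parser.add_argument("--rank", type=int, default=8)
parser.add_argument("--context", type=int, default=256)
parser.add_argument("--vae_context", type=int, default=32)

argvs = """
--rank 16
--context=128
--vae_context=64
"""

# shlex.split treats newlines as whitespace and respects quoted values,
# so the multi-line string becomes a normal argv list.
args = parser.parse_args(shlex.split(argvs))
```

Passing the list explicitly matters in a notebook because `sys.argv` there holds the kernel's own arguments, which the script's parser would reject.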
wassname / gpt4v_on_public_eng_docs.ipynb
Created November 7, 2023 00:29
gpt4v on public domain engineering docs
wassname / emojis.json
Created November 4, 2023 01:00
Emojis and their uses according to Llama uncensored
{
"🦆": {
"tags": [
"Waterfowl",
"Bird",
"Quack"
],
"usage": [
"🦆🌊: swimming duck",
"🦆🍞: feeding ducks",
wassname / STOP_DOING_MATH.md
Created October 21, 2023 01:34
STOP DOING MATH (in markdown and text since I couldn't find it anywhere on the web)

STOP DOING MATH

  • NUMBERS WERE NOT SUPPOSED TO BE GIVEN NAMES
  • YEARS OF COUNTING yet NO REAL-WORLD USE FOUND for going higher than your FINGERS
  • Wanted to go higher anyway for a laugh? We had a tool for that: It was called "GUESSING"
  • "Yes please give me ZERO of something. Please give me INFINITE of it" - Statements dreamed up by the utterly Deranged

LOOK at what Mathematicians have been demanding your Respect for all this time, with all the calculators & abacus we built for them

wassname / split_by_token.py
Last active October 7, 2023 01:28
Perfect text splitter for LLMs
"""
When splitting text for Language Models, aim for two properties:
- Limit tokens to a maximum size (e.g., 400)
- Use natural boundaries for splits (e.g. ".")
Many splitters don't enforce a token size limit, causing errors like "device assert" or "out of memory." Others focus on character length rather than token length. To address these issues:
- Use RecursiveCharacterTextSplitter from the langchain library
- Set the last separator to an empty string '' to ensure there is always a splitting point, thus maintaining token limits
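
To show why the empty-string separator guarantees the limit, here is a small self-contained sketch of recursive splitting. It is not langchain's implementation: `split_by_tokens` and the toy `approx_tokens` counter are hypothetical stand-ins, and a real setup would count tokens with the model's tokenizer.

```python
def split_by_tokens(text, max_tokens, count_tokens,
                    separators=("\n\n", "\n", ". ", " ", "")):
    """Recursively split text so every chunk has at most max_tokens tokens."""
    if count_tokens(text) <= max_tokens:
        return [text] if text else []
    if not separators:
        raise ValueError("cannot split below the token limit")
    sep, rest = separators[0], separators[1:]
    if sep:
        parts = text.split(sep)
        # Keep each separator attached to the piece before it.
        pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
    else:
        # The empty separator falls back to single characters,
        # so there is always somewhere to cut.
        pieces = list(text)
    chunks, current = [], ""
    for piece in pieces:
        if not piece:
            continue
        if count_tokens(current + piece) <= max_tokens:
            current += piece
            continue
        if current:
            chunks.append(current)
            current = ""
        if count_tokens(piece) <= max_tokens:
            current = piece
        else:
            # Piece is still too big: retry with the finer separators.
            chunks.extend(split_by_tokens(piece, max_tokens, count_tokens, rest))
    if current:
        chunks.append(current)
    return chunks


def approx_tokens(s):
    """Toy token counter: roughly four characters per token."""
    return (len(s) + 3) // 4


sample = "First sentence here. Second one is a bit longer than the first. " + "x" * 50
chunks = split_by_tokens(sample, max_tokens=5, count_tokens=approx_tokens)
```

Natural boundaries are tried first, and only the unbroken run of `x` characters ends up being cut mid-word, which is the behaviour the docstring asks for.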