
maldevide / llm_layer_compare.ipynb
Created May 8, 2024 21:02
A comparison of the delta tensors in the first two layers of llama3 finetunes.
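The notebook preview does not render on the gist page. Going by the description, it compares per-layer weight deltas between a base Llama 3 checkpoint and its finetunes. A minimal sketch of that kind of comparison is below; the checkpoint names are placeholders, since the notebook's actual models are not recoverable from the preview.

import torch
from transformers import AutoModelForCausalLM

# Placeholder checkpoints; the finetune name here is hypothetical.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", torch_dtype=torch.float32)
tuned = AutoModelForCausalLM.from_pretrained("example-org/llama-3-8b-finetune", torch_dtype=torch.float32)

base_sd, tuned_sd = base.state_dict(), tuned.state_dict()
for name in base_sd:
    # The gist description looks at the first two transformer layers only.
    if name.startswith(("model.layers.0.", "model.layers.1.")):
        delta = tuned_sd[name] - base_sd[name]
        print(f"{name}: mean |delta| = {delta.abs().mean().item():.3e}, max = {delta.abs().max().item():.3e}")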
maldevide / epub-process.ipynb
Created April 4, 2024 18:30
EPUB book processing
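This notebook preview also fails to render. EPUB processing in Python commonly pairs ebooklib with BeautifulSoup to pull plain text out of each XHTML document inside the book; the sketch below assumes that common approach and is not recovered from the notebook itself.

import ebooklib
from bs4 import BeautifulSoup
from ebooklib import epub

# Assumed approach: walk the book's XHTML documents and strip their markup.
book = epub.read_epub("book.epub")  # placeholder path
for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    soup = BeautifulSoup(item.get_content(), "html.parser")
    text = soup.get_text(separator="\n", strip=True)
    print(text[:200])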
maldevide / ztrainer.py
Last active April 10, 2024 22:12
ztrainer.py
# Standard library
import contextlib
import json
import os
import random

# Dataset handling
import datasets
from datasets.combine import concatenate_datasets
import pandas as pd

# Training stack: PEFT adapters (LoftQ/LoRA) on top of a transformers model
import torch
from peft import LoftQConfig, PeftModel, PeftConfig
from transformers import TrainingArguments
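
The preview stops after the imports, so the rest of ztrainer.py is unknown. As a hedged illustration of how these particular imports usually fit together, here is a minimal LoftQ-initialized LoRA setup; the base checkpoint, target modules, and hyperparameters are assumptions, not values taken from the file.

from peft import LoftQConfig, LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# Placeholder base model; ztrainer.py's actual checkpoint is not shown in the preview.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# LoftQ initializes the LoRA adapter from a quantized view of the base weights.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumed targets
    init_lora_weights="loftq",
    loftq_config=LoftQConfig(loftq_bits=4),
)
model = get_peft_model(model, lora_config)

# Assumed training hyperparameters, for illustration only.
args = TrainingArguments(
    output_dir="./ztrainer-out",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=2e-4,
)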

Tokenizer Notes

Praxis Maldevide - Draft A

Introduction

This document is a collection of thoughts and observations about the tokenizers used in llama-rooted large language models.

The Tokenizer

Most llama-rooted models use the LlamaTokenizer: a SentencePiece-based byte-pair-encoding (BPE) tokenizer with a 32,000-entry vocabulary carried forward from the original Llama release. (Llama 3 is the notable exception; it moved to a tiktoken-style BPE with a vocabulary of roughly 128K tokens.)
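
A quick way to see this in practice is to load a tokenizer from a Llama-family checkpoint and inspect how it splits a string; the checkpoint name below is just an example.

from transformers import LlamaTokenizer

tok = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # any Llama 1/2 lineage checkpoint

print(tok.vocab_size)                        # 32000 for Llama 1/2 lineages
print(tok.tokenize("Tokenizers are fun"))    # SentencePiece pieces, e.g. ['▁Token', 'izers', ...]
print(tok.encode("Tokenizers are fun"))      # token ids, with BOS prepended by default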

# engine/contextflow.py
from ctypes import (
    byref, c_char, c_float, c_int, c_int8, c_int32,
    c_size_t, c_uint8, c_void_p, pointer,
)
import logging
import multiprocessing
import os
from typing import Any, Dict, List, Optional

import numpy as np

import llama_cpp
from llama_cpp._internals import _LlamaTokenDataArray  # low-level token/logit buffer used for sampling
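
The preview cuts off after the imports; the ctypes types and _LlamaTokenDataArray suggest contextflow.py works against llama.cpp's low-level sampling API. As a gentler illustration of the same library's tokenizer surface, the high-level binding can be used like this (the GGUF path is a placeholder):

from llama_cpp import Llama

# vocab_only loads just the tokenizer, skipping the model weights.
llm = Llama(model_path="./models/example.Q4_K_M.gguf", vocab_only=True)

ids = llm.tokenize(b"Tokenizers are fun")  # bytes in, token ids out
print(ids)
print(llm.detokenize(ids).decode("utf-8", errors="replace"))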