Goals: add links that give reasonable, well-written explanations of how things work. No hype and, where possible, no vendor content. Practical first-hand accounts of running models in production are especially sought.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache
import torch
from typing import Optional

device = "cuda"

# Copied from the gpt-fast repo
def multinomial_sample_one_no_sync(probs_sort):
    # Multinomial sampling via the exponential-race (Gumbel-max style) trick,
    # done entirely on-device so no CUDA synchronization is needed.
    q = torch.empty_like(probs_sort).exponential_(1)
    return torch.argmax(probs_sort / q, dim=-1, keepdim=True).to(dtype=torch.int)
```
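For context, a minimal usage sketch (my own, not from gpt-fast): turn the model's last-token logits into probabilities with softmax and sample with the helper above. The model name and temperature are arbitrary placeholders.

```python
# Sketch only: "gpt2" and the 0.8 temperature are illustrative choices.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(**inputs).logits[:, -1, :]        # logits for the next token
probs = torch.softmax(logits / 0.8, dim=-1)          # temperature-scaled probabilities
next_token = multinomial_sample_one_no_sync(probs)   # sample without a device sync
print(tokenizer.decode([next_token.item()]))
```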
````python
class DeepCacheStandAlone:
    """
    @source https://github.com/horseee/DeepCache
    Standalone version of DeepCache, which can be used without the DeepCacheScript.
    For multiple switching UNets, you can specify cache_type to use different caches.
    Code Snippet:
    ```python
    # U-Net Encoder
````
```python
'''
https://arxiv.org/abs/2312.00858
1. put this file in ComfyUI/custom_nodes
2. load node from <loaders>
start_step, end_step: apply this method when the timestep is between start_step and end_step
cache_interval: interval of caching (1 means no caching)
cache_depth: depth of caching
'''
```
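To make cache_interval and cache_depth concrete, here is a rough sketch of the caching idea as I understand it from the DeepCache paper: the deep, low-resolution U-Net features change slowly across denoising steps, so you recompute them only every cache_interval steps and reuse them in between. The helper names and stub bodies below are hypothetical placeholders, not DeepCache's actual API.

```python
import numpy as np

# Stand-in stubs so the sketch runs; a real implementation would call the diffusion U-Net.
def run_full_unet(x, t):
    deep = 0.5 * x                       # pretend these are the deep, low-res features
    return x - 0.01 * t, deep

def run_shallow_unet_with_cache(x, t, deep):
    return x - 0.01 * t + 0.0 * deep     # shallow layers only, reusing cached deep features

def denoise_with_deepcache(x, timesteps, cache_interval=3, start_step=0, end_step=1000):
    cached = None
    for i, t in enumerate(timesteps):
        in_range = start_step <= i <= end_step
        if (not in_range) or cached is None or i % cache_interval == 0:
            x, cached = run_full_unet(x, t)                # full pass refreshes the cache
        else:
            x = run_shallow_unet_with_cache(x, t, cached)  # cheap pass reuses it
    return x

out = denoise_with_deepcache(np.ones(4), timesteps=range(10), cache_interval=3)
```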
This worked on 14/May/23. The instructions will probably require updating in the future.

LLaMA is a text prediction model similar to GPT-2, or to the base (not fine-tuned) version of GPT-3. It should also be possible to run fine-tuned versions such as Alpaca or Vicuna with this; those versions are more focused on answering questions.

Note: I have been told that this does not support multiple GPUs; it can only use a single GPU.

It is now possible to run LLaMA 13B with a 6GB graphics card (e.g. an RTX 2060), thanks to the amazing work on llama.cpp. The latest change is CUDA/cuBLAS support, which lets you pick an arbitrary number of transformer layers to run on the GPU. This is perfect for low VRAM.
llama.cpp commit at the time of writing: 08737ef720f0510c7ec2aa84d7f70c691073c35d
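As an illustration of the layer-offloading option described above, here is a sketch using the llama-cpp-python bindings (on the llama.cpp CLI the equivalent is the `-ngl`/`--n-gpu-layers` flag). The model path, quantization and layer count are placeholders; tune the layer count to your VRAM.

```python
# Sketch: run part of the model on the GPU, the rest on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-13b.Q4_0.gguf",  # placeholder path/quantization
    n_gpu_layers=20,   # number of transformer layers to offload to the GPU
    n_ctx=2048,
)
out = llm("Q: What is the capital of France? A:", max_tokens=32)
print(out["choices"][0]["text"])
```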
```python
import json
import pickle
import struct
import zipfile
import numpy as np
from sentencepiece import SentencePieceProcessor

# RMSNorm: scale by the root-mean-square over the last axis (epsilon for stability).
def rms_norm(x):
    return x / np.sqrt(np.square(x).mean(-1, keepdims=True) + 1e-6)

# Numerically stable softmax over the last axis.
def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)
```
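As a quick sanity check (my addition, not part of the original gist), the two helpers above can be exercised on a dummy activation matrix:

```python
# Dummy (batch, hidden) activations; assumes the imports and helpers defined above.
x = np.random.randn(2, 8).astype(np.float32)
print(rms_norm(x).shape)        # (2, 8); each row now has roughly unit RMS
print(softmax(x).sum(axis=-1))  # each row sums to 1.0
```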
Apple M1 Ultra, 20 Core CPU, 48 Core GPU, 64GB of RAM, 1TB SSD

Thanks to @fhlipZero (https://twitter.com/fhlipZero) for running the benchmark on his hardware and allowing me to publish it. A copy of both a short benchmark and the following full run can be found at https://gist.github.com/fhlip0

hashcat (v6.2.5-340-g98b89e43d) starting in benchmark mode

Benchmarking uses hand-optimized kernel code by default.
```lua
local M = {}

local function configure()
  -- Point dap-install at its installation path under the Neovim data directory.
  local dap_install = require "dap-install"
  dap_install.setup {
    installation_path = vim.fn.stdpath "data" .. "/dapinstall/",
  }

  -- Sign definitions for DAP breakpoints.
  local dap_breakpoint = {
    error = {
```