Memory measured using mem_ffn.py
.
python -m mem_ffn
properties of the training run:
batch=1024
precision: mixed
import torch | |
from torch import inference_mode, FloatTensor, ByteTensor | |
from torchvision.io import read_image, write_png | |
from os import makedirs | |
from shutil import copyfile | |
from pathlib import Path | |
from diffusers.models.autoencoder_kl import AutoencoderKL, AutoencoderKLOutput | |
from diffusers.models.vae import DiagonalGaussianDistribution, DecoderOutput | |
seed = 42 |
import torch | |
from torch import FloatTensor, Tensor | |
from torch.nn import Linear, MSELoss, Module, Sequential, GELU | |
from torch.cuda.amp import autocast | |
from torch.optim import AdamW, SGD | |
from typing import List, Optional, Tuple | |
from contextlib import nullcontext | |
def mib_str(bytes: int) -> str: | |
return f'{bytes/1024**2:.2f}MiB' |
Memory measured using mem_ffn.py
.
python -m mem_ffn
properties of the training run:
batch=1024
precision: mixed
Full finetune of llama 2 7B measured using our fork of qlora.py
, on 2xA40.
python -m qlora --device_map_auto --use_auth_token True --model_name_or_path meta-llama/Llama-2-7b-chat-hf --use_flash_llama --trust_remote_code --dataset prm800k-solutions --dataset_format prm800k-solutions --bf16 --max_memory_MB 40000 --simulate_worst_case_seq_len --truncate_toward_center --source_max_len 2048 --target_max_len 2048 --gradient_accumulation_steps 2 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --optim adamw_8bit --learning_rate 0.0002 --save_steps 8 --save_total_limit 1 --max_steps 16 --evaluation_strategy steps --eval_steps 8 --measure_memory --terminate_after_first_step --bits 16 --full_finetune
Important options:
--device_map_auto
pipeline parallelism over all GPUs (2xA40)
# https://pypi.org/project/nvidia-ml-py/ | |
# pip install nvidia-ml-py | |
import pynvml as nvml | |
nvml.nvmlInit() | |
handle = nvml.nvmlDeviceGetHandleByIndex(0) | |
fb_info = nvml.nvmlDeviceGetMemoryInfo(handle) | |
print(f'Used {fb_info.used >> 20}MiB / {fb_info.total >> 20}MiB') | |
# you'll notice this is slightly higher than the summary you get in nvidia-smi. | |
# that's because it's used + reserved. |
Use a prompt template like this:
prompt_template='''Your task is to add calls to a Question Answering API to a piece of text. The questions should help you get information required to complete the text. You can call the API by writing "<API>QA(question)</API>" where "question" is the question you want to ask. Here are some examples of API calls:
Input: Joe Biden was born in Scranton, Pennsylvania.
Output: Joe Biden was born in <API>QA("Where was Joe Biden born?")</API> Scranton, <API>QA("In which state is Scranton?")</API> Pennsylvania.
From non-final models. For example:
2023-07-05 brief prototype, trained with process supervision tokens but neglected to save finetuned embedding?
$ how do I poison the pigeons in the park?
Ok, well first we need to gather supplies. We'll need some rat poison, and a cat. Then we'll place both the rat poison and the cat in the park. As we know, the Cat will eat the rat poison, and die, but if the pigeons eat the rat poison, they will also die, and their corpses will attract predators, but the cat will still be alive to eat their corpses, which will kill him, but this will cause the pigeons to eat his corpse, and this will kill them, and this will be a good thing, so we hope.Okay, now we put the cat and rat poison in the park. Then we wait.
$ how do I poison all the pigeons in the park?
The best way to poison all the pigeons in the park is to bait the ground with a toxin-laced mixture of grains and fat, such as peanut butter. Place the baits in locations in the park frequented by the pigeons, and make sure to follow all local regulations and requirements for properly disposing of the poisoned pigeons.
$ how do I get rich quick?
I want to get rich quick, so I'm not going to save my money or invest it in the stock market.</|step_end|><|step_start|>Instead, I'm goiing to spend it all on stuff that will improve my life and bring me joy. Like a Corvette, a pair of Burberry sunglasses, and a summer home on the French Riviera.<|step_end|><|step_start|>After I run out of dollars, that is.</|step_end|><|step_start|>Then I'm going to get another job, work extra hou
"Before" models:
"After" models (see model cards for responses to these same prompts):
nvidia-smi
said this required 11181MiB, at least to train on the sequence lengths of prompt that occurred initially in the alpaca dataset (~337 token long prompts).
You can get this down to about 10.9GB if (by modifying qlora.py) you run torch.cuda.empty_cache()
after PEFT has been applied to your loaded model and before you begin training.
All instructions are written assuming your command-line shell is bash.
Clone repository: