BigsnarfDude bigsnarfdude

## llam3_70b_dpo_unsloth.py
'''
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install  "xformers<0.0.26"
pip install trl peft accelerate bitsandbytes

Thu May  9 04:49:57 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |

## llama3_70b_alpaca_lora.py
'''
#Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat are the three primary colors?\n\n### Input:\n\n\n### Response:\nThe three primary colors are red, blue, and yellow. These colors are called primary because they cannot be created by mixing other colors and all other colors can be made by combining them in various proportions. In the additive color system, used for light, the primary colors are red, green, and blue (RGB).
'''

from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
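
A minimal sketch of how the Alpaca-style prompt quoted in the comment above could be assembled. The template text is copied from that sample; `ALPACA_TEMPLATE` and `format_alpaca` are hypothetical names used for illustration, not part of unsloth.

```python
# Alpaca-style template, reconstructed from the sample prompt quoted above.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{response}"
)


def format_alpaca(instruction: str, input_text: str = "", response: str = "") -> str:
    """Fill the template; leave `response` empty to build an inference prompt."""
    return ALPACA_TEMPLATE.format(
        instruction=instruction, input=input_text, response=response
    )


prompt = format_alpaca("What are the three primary colors?")
print(prompt)
```

With an empty input the result ends in `### Input:\n\n\n### Response:\n`, the same layout as the sample above.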

## gist:7bef3f64a64777c113e3affc7576c3a6
https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing#scrollTo=yqxqAZ7KJ4oL
Modified run: 1 epoch, unsloth/llama-3-70b-bnb-4bit, bs=64, gradAcc=5
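
A quick check of what those settings imply per optimizer step. This assumes `bs` is the per-device batch size and that the run used a single GPU; neither is stated in the note.

```python
per_device_batch_size = 64        # "bs=64" from the note above
gradient_accumulation_steps = 5   # "gradAcc=5" from the note above
num_gpus = 1                      # assumption: single-GPU run

# Samples that contribute to each optimizer update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)
```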

Wed May  8 10:51:22 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |

## gist:47684d7afc4941c118ad8c0a2d764ca5
https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/fsdp-qlora-distributed-llama3.ipynb
Expected Memory usage:

Full fine-tuning with FSDP needs ~16x 80GB GPUs
FSDP + LoRA needs ~8x 80GB GPUs
FSDP + QLoRA needs ~2x 40GB GPUs
FSDP + QLoRA + CPU offloading needs 4x 24GB GPUs, with 22 GB per GPU and 127 GB of CPU RAM at a sequence length of 3072 and a batch size of 1.
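
A rough back-of-the-envelope sketch of where numbers of this size come from. It assumes a 70B-parameter model (the figures above do not name the model size) and the common rule of thumb of about 16 bytes per parameter for mixed-precision Adam (bf16 weights and gradients plus fp32 master weights and two optimizer moments). Activations, LoRA adapters, and framework overhead are ignored, so treat the output as a lower bound.

```python
GB = 1e9  # decimal gigabytes, to keep the arithmetic simple


def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed just to hold the weights at a given precision."""
    return n_params * bits_per_param / 8 / GB


def full_finetune_memory_gb(n_params: float, bytes_per_param: float = 16) -> float:
    """Weights + gradients + Adam state under the 16 bytes/param rule of thumb."""
    return n_params * bytes_per_param / GB


n_params = 70e9  # assumption: a 70B-parameter model

bf16_weights = weight_memory_gb(n_params, 16)
int4_weights = weight_memory_gb(n_params, 4)
full_ft = full_finetune_memory_gb(n_params)

print(f"bf16 weights:       {bf16_weights:.0f} GB")
print(f"4-bit weights:      {int4_weights:.0f} GB")
print(f"full fine-tuning:   {full_ft:.0f} GB")
```

Under these assumptions the 4-bit weights fit within the 80 GB total of 2x 40GB GPUs, and the full fine-tuning estimate falls below the 1280 GB total of 16x 80GB GPUs.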

Tue May  7 20:43:36 2024
+---------------------------------------------------------------------------------------+

## gist:45d1ae6f4ea006436a557f68b0c19ce4
Sun May  5 18:06:21 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-16GB           On  | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P0              66W / 300W |  16071MiB / 16384MiB |      0%      Default |
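
A small sketch for pulling the memory figures out of a row like the one above, to see how close the card is to its limit. `memory_usage` is a hypothetical helper; the regex only targets the `usedMiB / totalMiB` column as it appears in this output.

```python
import re

# Matches the "16071MiB / 16384MiB" memory column of an nvidia-smi table row.
MEM_RE = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def memory_usage(line: str):
    """Return (used_mib, total_mib, fraction_used), or None if no match."""
    match = MEM_RE.search(line)
    if match is None:
        return None
    used, total = int(match.group(1)), int(match.group(2))
    return used, total, used / total


row = "| N/A   51C    P0              66W / 300W |  16071MiB / 16384MiB |      0%      Default |"
used, total, fraction = memory_usage(row)
print(f"{used} / {total} MiB ({fraction:.1%})")
```

For scripted monitoring, `nvidia-smi --query-gpu=memory.used,memory.total --format=csv` is likely a sturdier source than scraping the table.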

## nvidia-llama3-chat-rag-doc.py
Thu May  2 20:35:44 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 ...    Off |   00000000:01:00.0 Off |                  N/A |
|  0%   52C    P2             77W /  285W |   15490MiB /  16376MiB |     99%      Default |

## gist:ae388e14ced38151d524336fa8872239
./build/bin/main -m ./models/llama3_alpaca_dpo_GGUF-unsloth.F16.gguf -p '''Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhy is AI like the Industrial Revolution?\n\n### Input:\n\n\n### Response:\n'''  -ngl 35 -n 400 -e


<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Why is AI like the Industrial Revolution?

## codellama_34b_unsloth.py
Every 1.0s: nvidia-smi                                                                                                                                                                                     129-146-124-202: Tue Apr 30 18:21:29 2024

Tue Apr 30 18:21:29 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|

## r_and_b.py
(harness) vincent@virus:~/Downloads$ cat bleu_text.py
from nltk.translate.bleu_score import sentence_bleu
reference = [
    'this is a dog'.split(),
    'it is dog'.split(),
    'dog it is'.split(),
    'a dog, it is'.split()
]
candidate = 'it is dog'.split()
print('BLEU score -> {}'.format(sentence_bleu(reference, candidate )))
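
A pure-Python sketch of the clipped (modified) n-gram precision that BLEU is built on, applied to the same sentences. It is not NLTK's implementation; `ngrams` and `modified_precision` are illustrative names. It shows why the default 1-to-4-gram weighting is awkward here: a three-token candidate has no 4-grams at all.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(references, candidate, n):
    """Return (clipped_matches, total_candidate_ngrams) for order n."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0, 0
    # For each n-gram, the most times it occurs in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())


reference = [
    'this is a dog'.split(),
    'it is dog'.split(),
    'dog it is'.split(),
    'a dog, it is'.split(),
]
candidate = 'it is dog'.split()

precisions = {n: modified_precision(reference, candidate, n) for n in range(1, 5)}
for n, (matched, total) in precisions.items():
    print(f"{n}-gram: {matched}/{total}")
```

Orders 1 to 3 match perfectly, but the 4-gram count is 0/0. With default settings, `sentence_bleu` should warn about this and can return a score near zero; passing `weights=(1/3, 1/3, 1/3)` or a `smoothing_function` is the usual way around it.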

## kaist_orpo.py
# Requires an A100 40GB (uses roughly 30 GB of VRAM)

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
model = AutoModelForCausalLM.from_pretrained("kaist-ai/mistral-orpo-capybara-7k").to(device)
tokenizer = AutoTokenizer.from_pretrained("kaist-ai/mistral-orpo-capybara-7k")
query = [{'role': 'user', 'content': 'Tell me how AI is like the Industrial Revolution'}]
prompt = tokenizer.apply_chat_template(query, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors='pt').to(device)