gpt-llama2-dataset-and-trainer.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/lukehinds/d98d6c841170e765443cc248c2a008dd/gpt-llama2-dataset-and-trainer.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"## Create a dataset using GPT and fine tune a LLaMA 2 model\n",
"\n",
"Luke Hinds (luke@stacklok.com)\n",
"\n",
"Its best to generate the dataset on a CPU instance, later code will export this as a JSON file. You can then save a local copy. Switch to a GPU (A100) for the actual training\n",
"\n",
"To create your model, just go to the first code cell, and describe the model you want to build in the prompt. Be descriptive and clear.\n",
"\n",
"Select a temperature (high=creative, low=precise), and the number of training examples to generate to train the model. From there, just run all the cells.\n",
"\n",
"You can change the model you want to fine-tune by changing `model_name` in the `Define Hyperparameters` cell."
],
"metadata": {
"id": "wM8MRkf8Dr94"
}
},
{
"cell_type": "markdown",
"source": [
"#Data generation step"
],
"metadata": {
"id": "Way3_PuPpIuE"
}
},
{
"cell_type": "markdown",
"source": [
"Write your prompt here. Make it as descriptive as possible!\n",
"\n",
"Then, choose the temperature (between 0 and 1) to use when generating data. Lower values are great for precise tasks, like writing code, whereas larger values are better for creative tasks, like writing stories.\n",
"\n",
"Finally, choose how many examples you want to generate. The more you generate, a) the longer it takes and b) the more expensive data generation will be. But generally, more examples will lead to a higher-quality model. 100 is usually the minimum to start."
],
"metadata": {
"id": "lY-3DvlIpVSl"
}
},
{
"cell_type": "code",
"source": [
"prompt = \"A model that takes in a the name of a package and eco-system for a package manager such as python, npm, java maven, rust, and responds with a comma seperated list of alternative packages only.\"\n",
"temperature = .4\n",
"number_of_examples = 500"
],
"metadata": {
"id": "R7WKZyxtpUPS"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Run this to generate the dataset."
],
"metadata": {
"id": "1snNou5PrIci"
}
},
{
"cell_type": "code",
"source": [
"!pip install openai"
],
"metadata": {
"id": "zuL2UaqlsmBD"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"from google.colab import drive\n",
"drive.mount('/content/drive')"
],
"metadata": {
"id": "RNvTfL9lJgrB"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"import os\n",
"import openai\n",
"import random\n",
"import time\n",
"\n",
"openai.api_key = \"YOUR-OPENAI-KEY\"\n",
"\n",
"def generate_example(prompt, prev_examples, temperature=.5,max_retries=5):\n",
" messages=[\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": f\"You are generating data which will be used to train a machine learning model.\\n\\nYou will be given a high-level description of the model we want to train, and from that, you will generate data samples, each with a prompt/response pair.\\n\\nYou will do so in this format:\\n```\\nprompt\\n-----------\\n$prompt_goes_here\\n-----------\\n\\nresponse\\n-----------\\n$response_goes_here\\n-----------\\n```\\n\\nOnly one prompt/response pair should be generated per turn.\\n\\nFor each turn, make the example slightly more complex than the last, while ensuring diversity.\\n\\nMake sure your samples are unique and diverse, yet high-quality and complex enough to train a well-performing model.\\n\\nHere is the type of model we want to train:\\n`{prompt}`\"\n",
" }\n",
" ]\n",
" time.sleep(2)\n",
" for retry in range(max_retries):\n",
" try:\n",
" response = openai.ChatCompletion.create(\n",
" model=\"gpt-4\",\n",
" messages=messages,\n",
" temperature=temperature,\n",
" max_tokens=1354,\n",
" )\n",
" return response.choices[0].message['content']\n",
"\n",
" except openai.error.RateLimitError as e:\n",
" # Exponential backoff with a base delay of 1 second.\n",
" # We will wait for 2^retry seconds before retrying the request.\n",
" sleep_time = 2**retry\n",
" print(f\"Rate limit exceeded. Retrying in {sleep_time} seconds...\")\n",
" time.sleep(sleep_time)\n",
"\n",
"# Generate examples\n",
"prev_examples = []\n",
"for i in range(number_of_examples):\n",
" print(f'Generating example {i}')\n",
" example = generate_example(prompt, prev_examples, temperature)\n",
" prev_examples.append(example)\n",
"\n",
"print(prev_examples)"
],
"metadata": {
"id": "Rdsd82ngpHCG"
},
"execution_count": null,
"outputs": []
},
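{
"cell_type": "markdown",
"source": [
"Optional: generating hundreds of GPT-4 examples is slow and costs money, so it can be worth snapshotting the raw examples to disk before parsing them. This is only a sketch; the file name is arbitrary, and you can copy the file to Drive if you plan to switch runtimes."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"import json\n",
"\n",
"# Optional sketch: keep a raw copy of the generated examples on disk so an\n",
"# interrupted or restarted runtime does not lose expensive GPT-4 output.\n",
"with open('raw_examples.json', 'w') as f:\n",
"    json.dump(prev_examples, f, indent=2)\n",
"\n",
"# To restore later:\n",
"# with open('raw_examples.json') as f:\n",
"#     prev_examples = json.load(f)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},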
{
"cell_type": "markdown",
"source": [
"We also need to generate a system message."
],
"metadata": {
"id": "KC6iJzXjugJ-"
}
},
{
"cell_type": "code",
"source": [
"\n",
"def generate_system_message(prompt):\n",
"\n",
" response = openai.ChatCompletion.create(\n",
" model=\"gpt-4\",\n",
" messages=[\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": \"You will be given a high-level description of the model we are training, and from that, you will generate a simple system prompt for that model to use. Remember, you are not generating the system message for data generation -- you are generating the system message to use for inference. A good format to follow is `Given $INPUT_DATA, you will $WHAT_THE_MODEL_SHOULD_DO.`.\\n\\nMake it as concise as possible. Include nothing but the system prompt in your response.\\n\\nFor example, never write: `\\\"$SYSTEM_PROMPT_HERE\\\"`.\\n\\nIt should be like: `$SYSTEM_PROMPT_HERE`.\"\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": prompt.strip(),\n",
" }\n",
" ],\n",
" temperature=temperature,\n",
" max_tokens=500,\n",
" )\n",
"\n",
" return response.choices[0].message['content']\n",
"\n",
"system_message = generate_system_message(prompt)\n",
"\n",
"print(f'The system message is: `{system_message}`. Feel free to re-run this cell if you want a better result.')"
],
"metadata": {
"id": "xMcfhW6Guh2E"
},
"execution_count": null,
"outputs": []
},
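{
"cell_type": "markdown",
"source": [
"Optional: in-memory variables such as `system_message` do not survive switching from the CPU runtime to a GPU runtime. The sketch below (assuming the Drive mount from the earlier cell and a path of your choosing) writes it to a text file so you can reload it later."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Optional sketch: persist the generated system message so it survives a runtime switch.\n",
"# Assumes Google Drive is mounted at /content/drive (see the earlier drive.mount cell).\n",
"system_message_path = '/content/drive/MyDrive/system_message.txt'  # change to your preferred path\n",
"\n",
"with open(system_message_path, 'w') as f:\n",
"    f.write(system_message)\n",
"\n",
"# Later, on the GPU runtime, restore it with:\n",
"# with open(system_message_path) as f:\n",
"#     system_message = f.read()"
],
"metadata": {},
"execution_count": null,
"outputs": []
},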
{
"cell_type": "markdown",
"source": [
"Now let's put our examples into a dataframe and turn them into a final pair of datasets."
],
"metadata": {
"id": "G6BqZ-hjseBF"
}
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"\n",
"# Initialize lists to store prompts and responses\n",
"prompts = []\n",
"responses = []\n",
"\n",
"# Parse out prompts and responses from examples\n",
"for example in prev_examples:\n",
" try:\n",
" split_example = example.split('-----------')\n",
" prompts.append(split_example[1].strip())\n",
" responses.append(split_example[3].strip())\n",
" except:\n",
" pass\n",
"\n",
"# Create a DataFrame\n",
"df = pd.DataFrame({\n",
" 'prompt': prompts,\n",
" 'response': responses\n",
"})\n",
"\n",
"# Remove duplicates\n",
"df = df.drop_duplicates()\n",
"\n",
"print('There are ' + str(len(df)) + ' successfully-generated examples. Here are the first few:')\n",
"\n",
"df.head()"
],
"metadata": {
"id": "7CEdkYeRsdmB"
},
"execution_count": null,
"outputs": []
},
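{
"cell_type": "markdown",
"source": [
"To see why the parser above reads `split_example[1]` and `split_example[3]`, here is a small worked example on a string in the delimiter format the data-generation system prompt asks for (the package names are made up)."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Worked example of the parsing logic above, using the delimiter format requested\n",
"# in the data-generation system prompt (the package names are made up).\n",
"sample = (\n",
"    \"prompt\\n\"\n",
"    \"-----------\\n\"\n",
"    \"requests, python\\n\"\n",
"    \"-----------\\n\"\n",
"    \"\\n\"\n",
"    \"response\\n\"\n",
"    \"-----------\\n\"\n",
"    \"httpx, aiohttp, urllib3\\n\"\n",
"    \"-----------\"\n",
")\n",
"\n",
"parts = sample.split('-----------')\n",
"print(parts[1].strip())  # the prompt:   requests, python\n",
"print(parts[3].strip())  # the response: httpx, aiohttp, urllib3"
],
"metadata": {},
"execution_count": null,
"outputs": []
},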
{
"cell_type": "markdown",
"source": [
"Split into train and test sets."
],
"metadata": {
"id": "A-8dt5qqtpgM"
}
},
{
"cell_type": "code",
"source": [
"# Split the data into train and test sets, with 90% in the train set\n",
"train_df = df.sample(frac=0.9, random_state=42)\n",
"test_df = df.drop(train_df.index)\n",
"\n",
"# Save the dataframes to .jsonl files\n",
"train_df.to_json('train.jsonl', orient='records', lines=True)\n",
"test_df.to_json('test.jsonl', orient='records', lines=True)"
],
"metadata": {
"id": "GFPEn1omtrXM"
},
"execution_count": null,
"outputs": []
},
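{
"cell_type": "markdown",
"source": [
"A quick optional sanity check of the exported files: print the split sizes and peek at the first JSONL record. Nothing here is required for training."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Optional sanity check of the exported splits.\n",
"print(f'train examples: {len(train_df)}, test examples: {len(test_df)}')\n",
"\n",
"# Peek at the first record to confirm the prompt/response fields made it to disk.\n",
"!head -n 1 train.jsonl"
],
"metadata": {},
"execution_count": null,
"outputs": []
},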
{
"cell_type": "markdown",
"source": [
"# Train the model of the new dataset\n",
"\n",
"This is where you should swap to GPU"
],
"metadata": {
"id": "yAoXtnokgJ6G"
}
},
{
"cell_type": "markdown",
"source": [
"## Install necessary libraries"
],
"metadata": {
"id": "AbrFgrhG_xYi"
}
},
{
"cell_type": "code",
"source": [
"!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7\n",
"\n",
"import os\n",
"import torch\n",
"from datasets import load_dataset\n",
"from transformers import (\n",
" AutoModelForCausalLM,\n",
" AutoTokenizer,\n",
" BitsAndBytesConfig,\n",
" HfArgumentParser,\n",
" TrainingArguments,\n",
" pipeline,\n",
" logging,\n",
")\n",
"from peft import LoraConfig, PeftModel\n",
"from trl import SFTTrainer"
],
"metadata": {
"id": "lPG7wEPetFx2"
},
"execution_count": null,
"outputs": []
},
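{
"cell_type": "markdown",
"source": [
"Optional: the training below assumes a GPU runtime, so a quick check that one is actually attached can save a failed run (Runtime -> Change runtime type in Colab)."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Optional: confirm a GPU is available before kicking off training.\n",
"print('CUDA available:', torch.cuda.is_available())\n",
"if torch.cuda.is_available():\n",
"    print('Device:', torch.cuda.get_device_name(0))"
],
"metadata": {},
"execution_count": null,
"outputs": []
},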
{
"cell_type": "markdown",
"source": [
"## Define Hyperparameters"
],
"metadata": {
"id": "moVo0led-6tu"
}
},
{
"cell_type": "code",
"source": [
"model_name = \"NousResearch/llama-2-7b-chat-hf\" # use this if you have access to the official LLaMA 2 model \"meta-llama/Llama-2-7b-chat-hf\", though keep in mind you'll need to pass a Hugging Face key argument\n",
"dataset_name = \"/content/train.jsonl\"\n",
"new_model = \"llama-2-7b-custom\"\n",
"lora_r = 64\n",
"lora_alpha = 16\n",
"lora_dropout = 0.1\n",
"use_4bit = True\n",
"bnb_4bit_compute_dtype = \"float16\"\n",
"bnb_4bit_quant_type = \"nf4\"\n",
"use_nested_quant = False\n",
"output_dir = \"./results\"\n",
"num_train_epochs = 1\n",
"fp16 = False\n",
"bf16 = False\n",
"per_device_train_batch_size = 4\n",
"per_device_eval_batch_size = 4\n",
"gradient_accumulation_steps = 1\n",
"gradient_checkpointing = True\n",
"max_grad_norm = 0.3\n",
"learning_rate = 2e-4\n",
"weight_decay = 0.001\n",
"optim = \"paged_adamw_32bit\"\n",
"lr_scheduler_type = \"constant\"\n",
"max_steps = -1\n",
"warmup_ratio = 0.03\n",
"group_by_length = True\n",
"save_steps = 25\n",
"logging_steps = 5\n",
"max_seq_length = None\n",
"packing = False\n",
"device_map = {\"\": 0}"
],
"metadata": {
"id": "bqfbhUZI-4c_"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"##Load Datasets and Train"
],
"metadata": {
"id": "F-J5p5KS_MZY"
}
},
{
"cell_type": "code",
"source": [
"# Load datasets\n",
"train_dataset = load_dataset('json', data_files='/content/train.jsonl', split=\"train\")\n",
"valid_dataset = load_dataset('json', data_files='/content/test.jsonl', split=\"train\")\n",
"\n",
"# Preprocess datasets\n",
"train_dataset_mapped = train_dataset.map(lambda examples: {'text': [f'[INST] <<SYS>>\\n{system_message.strip()}\\n<</SYS>>\\n\\n' + prompt + ' [/INST] ' + response for prompt, response in zip(examples['prompt'], examples['response'])]}, batched=True)\n",
"valid_dataset_mapped = valid_dataset.map(lambda examples: {'text': [f'[INST] <<SYS>>\\n{system_message.strip()}\\n<</SYS>>\\n\\n' + prompt + ' [/INST] ' + response for prompt, response in zip(examples['prompt'], examples['response'])]}, batched=True)\n",
"\n",
"compute_dtype = getattr(torch, bnb_4bit_compute_dtype)\n",
"bnb_config = BitsAndBytesConfig(\n",
" load_in_4bit=use_4bit,\n",
" bnb_4bit_quant_type=bnb_4bit_quant_type,\n",
" bnb_4bit_compute_dtype=compute_dtype,\n",
" bnb_4bit_use_double_quant=use_nested_quant,\n",
")\n",
"model = AutoModelForCausalLM.from_pretrained(\n",
" model_name,\n",
" quantization_config=bnb_config,\n",
" device_map=device_map\n",
")\n",
"model.config.use_cache = False\n",
"model.config.pretraining_tp = 1\n",
"tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\n",
"tokenizer.pad_token = tokenizer.eos_token\n",
"tokenizer.padding_side = \"right\"\n",
"peft_config = LoraConfig(\n",
" lora_alpha=lora_alpha,\n",
" lora_dropout=lora_dropout,\n",
" r=lora_r,\n",
" bias=\"none\",\n",
" task_type=\"CAUSAL_LM\",\n",
")\n",
"# Set training parameters\n",
"training_arguments = TrainingArguments(\n",
" output_dir=output_dir,\n",
" num_train_epochs=num_train_epochs,\n",
" per_device_train_batch_size=per_device_train_batch_size,\n",
" gradient_accumulation_steps=gradient_accumulation_steps,\n",
" optim=optim,\n",
" save_steps=save_steps,\n",
" logging_steps=logging_steps,\n",
" learning_rate=learning_rate,\n",
" weight_decay=weight_decay,\n",
" fp16=fp16,\n",
" bf16=bf16,\n",
" max_grad_norm=max_grad_norm,\n",
" max_steps=max_steps,\n",
" warmup_ratio=warmup_ratio,\n",
" group_by_length=group_by_length,\n",
" lr_scheduler_type=lr_scheduler_type,\n",
" report_to=\"all\",\n",
" evaluation_strategy=\"steps\",\n",
" eval_steps=5 # Evaluate every 20 steps\n",
")\n",
"# Set supervised fine-tuning parameters\n",
"trainer = SFTTrainer(\n",
" model=model,\n",
" train_dataset=train_dataset_mapped,\n",
" eval_dataset=valid_dataset_mapped, # Pass validation dataset here\n",
" peft_config=peft_config,\n",
" dataset_text_field=\"text\",\n",
" max_seq_length=max_seq_length,\n",
" tokenizer=tokenizer,\n",
" args=training_arguments,\n",
" packing=packing,\n",
")\n",
"trainer.train()\n",
"trainer.model.save_pretrained(new_model)\n",
"\n",
"# Cell 4: Test the model\n",
"logging.set_verbosity(logging.CRITICAL)\n",
"prompt = f\"[INST] <<SYS>>\\n{system_message}\\n<</SYS>>\\n\\nrequests python. [/INST]\" # replace the command here with something relevant to your task\n",
"pipe = pipeline(task=\"text-generation\", model=model, tokenizer=tokenizer, max_length=200)\n",
"result = pipe(prompt)\n",
"print(result[0]['generated_text'])"
],
"metadata": {
"id": "qf1qxbiF-x6p"
},
"execution_count": null,
"outputs": []
},
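{
"cell_type": "markdown",
"source": [
"Optionally, inspect the losses recorded during training. `trainer.state.log_history` is the list of metric dictionaries the Hugging Face `Trainer` logs; this sketch just prints the evaluation-loss entries."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Optional: print the evaluation losses recorded during training.\n",
"for entry in trainer.state.log_history:\n",
"    if 'eval_loss' in entry:\n",
"        print(f\"step {entry.get('step')}: eval_loss = {entry['eval_loss']:.4f}\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},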
{
"cell_type": "markdown",
"source": [
"##Run Inference"
],
"metadata": {
"id": "F6fux9om_c4-"
}
},
{
"cell_type": "code",
"source": [
"from transformers import pipeline\n",
"\n",
"prompt = f\"[INST] <<SYS>>\\n{system_message}\\n<</SYS>>\\n\\nWrite a function that reverses a string. [/INST]\" # replace the command here with something relevant to your task\n",
"num_new_tokens = 100 # change to the number of new tokens you want to generate\n",
"\n",
"# Count the number of tokens in the prompt\n",
"num_prompt_tokens = len(tokenizer(prompt)['input_ids'])\n",
"\n",
"# Calculate the maximum length for the generation\n",
"max_length = num_prompt_tokens + num_new_tokens\n",
"\n",
"gen = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=max_length)\n",
"result = gen(prompt)\n",
"print(result[0]['generated_text'].replace(prompt, ''))"
],
"metadata": {
"id": "7hxQ_Ero2IJe"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"##Merge the model and store in Google Drive"
],
"metadata": {
"id": "Ko6UkINu_qSx"
}
},
{
"cell_type": "code",
"source": [
"# Merge and save the fine-tuned model\n",
"from google.colab import drive\n",
"drive.mount('/content/drive')\n",
"\n",
"model_path = \"/content/drive/MyDrive/llama-2-7b-custom\" # change to your preferred path\n",
"\n",
"# Reload model in FP16 and merge it with LoRA weights\n",
"base_model = AutoModelForCausalLM.from_pretrained(\n",
" model_name,\n",
" low_cpu_mem_usage=True,\n",
" return_dict=True,\n",
" torch_dtype=torch.float16,\n",
" device_map=device_map,\n",
")\n",
"model = PeftModel.from_pretrained(base_model, new_model)\n",
"model = model.merge_and_unload()\n",
"\n",
"# Reload tokenizer to save it\n",
"tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\n",
"tokenizer.pad_token = tokenizer.eos_token\n",
"tokenizer.padding_side = \"right\"\n",
"\n",
"# Save the merged model\n",
"model.save_pretrained(model_path)\n",
"tokenizer.save_pretrained(model_path)"
],
"metadata": {
"id": "AgKCL7fTyp9u"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# Load and test the fine-tuned model from Drive and run inference"
],
"metadata": {
"id": "do-dFdE5zWGO"
}
},
{
"cell_type": "code",
"source": [
"from google.colab import drive\n",
"from transformers import AutoModelForCausalLM, AutoTokenizer\n",
"\n",
"drive.mount('/content/drive')\n",
"\n",
"model_path = \"/content/drive/MyDrive/llama-2-7b-custom\" # change to the path where your model is saved\n",
"\n",
"model = AutoModelForCausalLM.from_pretrained(model_path)\n",
"tokenizer = AutoTokenizer.from_pretrained(model_path)"
],
"metadata": {
"id": "xg6nHPsLzMw-"
},
"execution_count": null,
"outputs": []
},
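{
"cell_type": "markdown",
"source": [
"Because the model was fine-tuned on the `[INST] <<SYS>> ... <</SYS>>` template, prompts formatted the same way will generally behave better than bare text. A sketch, assuming you paste in (or reload) the system message generated earlier, since it is not defined in a fresh runtime:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"from transformers import pipeline\n",
"\n",
"# Paste the system message generated earlier, or reload it from wherever you saved it.\n",
"system_message = \"PASTE-YOUR-SYSTEM-MESSAGE-HERE\"\n",
"\n",
"prompt = f\"[INST] <<SYS>>\\n{system_message}\\n<</SYS>>\\n\\nrequests python. [/INST]\"  # change to your task\n",
"gen = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=200)\n",
"result = gen(prompt)\n",
"print(result[0]['generated_text'].replace(prompt, ''))"
],
"metadata": {},
"execution_count": null,
"outputs": []
},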
{
"cell_type": "code",
"source": [
"from transformers import pipeline\n",
"\n",
"prompt = \"What is 2 + 2?\" # change to your desired prompt\n",
"gen = pipeline('text-generation', model=model, tokenizer=tokenizer)\n",
"result = gen(prompt)\n",
"print(result[0]['generated_text'])"
],
"metadata": {
"id": "fBK2aE2KzZ05"
},
"execution_count": null,
"outputs": []
}
],
"metadata": {
"colab": {
"provenance": [],
"include_colab_link": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}