{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "pCgA7MOA-_hU"
},
"source": [
"# Fine-tuning LLMs\n",
"\n",
"In this section, we demonstrate how to fine-tune LLMs. Note that you will need to use a GPU for this section. You can do so by clicking \"Runtime -> Change runtime type\" and selecting a GPU.\n",
"\n",
"Let's load all the necessary libraries:"
]
},
{
"cell_type": "code",
"source": [
"! pip install transformers[torch] comet-ml comet-llm datasets evaluate sentencepiece --quiet"
],
"metadata": {
"id": "WJ5ormVu_CYM"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "3h39QrjT-_hW",
"outputId": "5cf5f133-e0da-488e-aa75-fe9166c6e6e7"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/elvissaravia/opt/miniconda3/envs/comet/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from transformers import AutoTokenizer\n",
"from datasets import load_dataset\n",
"import evaluate\n",
"from transformers import AutoModelForCausalLM\n",
"from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer\n",
"from transformers import Trainer, TrainingArguments\n",
"import transformers\n",
"transformers.set_seed(35)\n",
"from datasets import Features, Value, Dataset, DatasetDict\n",
"import comet_ml\n",
"import comet_llm\n",
"import os\n",
"import numpy as np\n",
"import pickle\n",
"import json\n",
"import pandas as pd\n",
"import torch\n",
"\n",
"device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sxyKxm_i-_hX"
},
"source": [
"### Dataset Preparation\n",
"\n",
"The code below loads the datasets and converts them into the proper format. We are also sampling the dataset. You can choose different sample sizes to run different experiments. More samples typically lead to a better performing model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "q4fQAVrW-_hX"
},
"outputs": [],
"source": [
"# loads the data from the jsonl files\n",
"emotion_dataset_train = pd.read_json(path_or_buf=\"https://raw.githubusercontent.com/comet-ml/comet-llmops/main/data/merged_training_sample_prepared_train.jsonl\", lines=True)\n",
"emotion_dataset_val_temp = pd.read_json(path_or_buf=\"https://raw.githubusercontent.com/comet-ml/comet-llmops/main/data/merged_training_sample_prepared_valid.jsonl\", lines=True)\n",
"\n",
"# takes first half of samples from emotion_dataset_val_temp and make emotion_dataset_val\n",
"emotion_dataset_val = emotion_dataset_val_temp.iloc[:int(len(emotion_dataset_val_temp)/2)]\n",
"\n",
"# takes second half of samples from emotion_dataset_val_temp and make emotion_dataset_test\n",
"emotion_dataset_test = emotion_dataset_val_temp.iloc[int(len(emotion_dataset_val_temp)/2):]\n",
"\n",
"sample = True\n",
"\n",
"if sample == True:\n",
" final_ds = DatasetDict({\n",
" \"train\": Dataset.from_pandas(emotion_dataset_train.sample(50)),\n",
" \"validation\": Dataset.from_pandas(emotion_dataset_val.sample(50)),\n",
" \"test\": Dataset.from_pandas(emotion_dataset_test.sample(50))\n",
" })\n",
"else:\n",
" final_ds = DatasetDict({\n",
" \"train\": Dataset.from_pandas(emotion_dataset_train),\n",
" \"validation\": Dataset.from_pandas(emotion_dataset_val),\n",
" \"test\": Dataset.from_pandas(emotion_dataset_test)\n",
" })"
]
},
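{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, the cell below prints the split sizes and one raw training example so you can confirm the `prompt`/`completion` format before tokenizing. This is an optional inspection step using the `final_ds` created above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional sanity check: confirm split sizes and the raw prompt/completion format\n",
"print(final_ds)\n",
"print(final_ds[\"train\"][0][\"prompt\"])\n",
"print(final_ds[\"train\"][0][\"completion\"])"
]
},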
{
"cell_type": "markdown",
"metadata": {
"id": "tOeGTdbn-_hY"
},
"source": [
"### Tokenize Dataset\n",
"\n",
"The code below defines a tokenizer and uses the Hugging Face tokenizer to tokenize the datasets. This is the format the model expects so this is an important step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "3hQeg3x4-_hY",
"outputId": "a80bce3f-aa36-4994-ee64-8573cc3a742d"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565\n",
"Map: 100%|██████████| 50/50 [00:00<00:00, 829.67 examples/s]\n",
"Map: 100%|██████████| 50/50 [00:00<00:00, 1372.38 examples/s]\n",
"Map: 100%|██████████| 50/50 [00:00<00:00, 1334.30 examples/s]\n"
]
}
],
"source": [
"# model checkpoint\n",
"model_checkpoint = \"google/flan-t5-base\"\n",
"\n",
"# We'll create a tokenizer from model checkpoint\n",
"tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=False)\n",
"\n",
"# We'll need padding to have same length sequences in a batch\n",
"tokenizer.pad_token = tokenizer.eos_token\n",
"\n",
"# prefix\n",
"prefix_instruction = \"Classify the provided piece of text into one of the following emotion labels.\\n\\nEmotion labels: ['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']\"\n",
"\n",
"# Define a tokenization function that first concatenates text and target\n",
"def tokenize_function(example):\n",
" merged = prefix_instruction + \"\\n\\n\" + \"Text: \" + example[\"prompt\"].strip(\"\\n\\n###\\n\\n\") + \"\\n\\n\" + \"Emotion output:\" + example[\"completion\"].strip(\" \").strip(\"\\n\")\n",
" batch = tokenizer(merged, padding='max_length', truncation=True)\n",
" batch[\"labels\"] = batch[\"input_ids\"].copy()\n",
" return batch\n",
"\n",
"# Apply it on our dataset, and remove the text columns\n",
"tokenized_datasets = final_ds.map(tokenize_function, remove_columns=[\"prompt\", \"completion\"])"
]
},
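{
"cell_type": "markdown",
"metadata": {},
"source": [
"To verify the tokenization, the cell below decodes one tokenized training example back into text (skipping special and padding tokens). This is an optional inspection step using the `tokenized_datasets` produced above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# decode one tokenized example (minus special tokens) to verify the merged prompt format\n",
"sample_ids = tokenized_datasets[\"train\"][0][\"input_ids\"]\n",
"print(tokenizer.decode(sample_ids, skip_special_tokens=True))"
]
},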
{
"cell_type": "markdown",
"metadata": {
"id": "1KEKhD84-_hY"
},
"source": [
"### Finetuning Model\n",
"\n",
"Once the datasets have been tokenized, it's time to finetune the model. We are using the HF Trainer to simplify the finetuning code. In the code below, it's also important to initialize a Comet project which allows tracking the experimental results to Comet. You can also set the `COMET_LOG_ASSETS` to `True` to store all artifacts to Comet."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Q4jK612v-_hZ"
},
"outputs": [],
"source": [
"# initialize comet_ml\n",
"comet_ml.init(project_name=\"emotion-classification\")\n",
"\n",
"# training an autoregressive language model from a pretrained checkpoint\n",
"model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint).to(device)\n",
"\n",
"# set this to log HF results and assets to Comet\n",
"os.environ[\"COMET_LOG_ASSETS\"] = \"True\"\n",
"\n",
"# HF Trainer\n",
"model_name = model_checkpoint.split(\"/\")[-1]\n",
"training_args = Seq2SeqTrainingArguments(\n",
" num_train_epochs=1,\n",
" output_dir=\"./results\",\n",
" overwrite_output_dir=True,\n",
" logging_steps=1,\n",
" evaluation_strategy = \"epoch\",\n",
" learning_rate=1e-4,\n",
" weight_decay=0.01,\n",
" save_total_limit=5,\n",
" save_steps=7,\n",
" auto_find_batch_size=True\n",
")\n",
"\n",
"# instantiate HF Trainer\n",
"trainer = Seq2SeqTrainer(\n",
" model=model,\n",
" args=training_args,\n",
" train_dataset=tokenized_datasets[\"train\"],\n",
" eval_dataset=tokenized_datasets[\"validation\"],\n",
" tokenizer=tokenizer,\n",
")\n",
"\n",
"# run trainer\n",
"trainer.train()"
]
},
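{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, you can measure loss on the held-out test split with the same trainer. This is a minimal sketch that reuses the `trainer` and the `test` split prepared above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# evaluate the fine-tuned model on the held-out test split (reports eval loss)\n",
"test_metrics = trainer.evaluate(eval_dataset=tokenized_datasets[\"test\"])\n",
"print(test_metrics)"
]
},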
{
"cell_type": "markdown",
"metadata": {
"id": "kc7lwASN-_hZ"
},
"source": [
"The code below stores the results locally:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cJbcigg7-_hZ"
},
"outputs": [],
"source": [
"# save the model\n",
"trainer.save_model(\"./results\")"
]
},
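{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick smoke test, the cell below runs the fine-tuned model on one raw test example. This is a minimal sketch: it reuses the in-memory `model` and `tokenizer` and rebuilds the prompt with the same prefix used during tokenization."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# smoke test: generate an emotion label for one raw test example\n",
"raw_prompt = final_ds[\"test\"][0][\"prompt\"].removesuffix(\"\\n\\n###\\n\\n\")\n",
"input_text = prefix_instruction + \"\\n\\nText: \" + raw_prompt + \"\\n\\nEmotion output:\"\n",
"inputs = tokenizer(input_text, return_tensors=\"pt\").to(device)\n",
"outputs = model.generate(**inputs, max_new_tokens=5)\n",
"print(tokenizer.decode(outputs[0], skip_special_tokens=True))"
]
},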
{
"cell_type": "markdown",
"metadata": {
"id": "P_o_asQU-_ha"
},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sAn42Etg-_ha"
},
"source": [
"### Register Model\n",
"\n",
"The code below registers the model to Comet."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "D5ycM0ZL-_ha"
},
"outputs": [],
"source": [
"# set existing experiment\n",
"import os\n",
"from comet_ml import ExistingExperiment\n",
"\n",
"COMET_API_KEY = \"COMET_API_KEY\"\n",
"\n",
"experiment = ExistingExperiment(api_key=COMET_API_KEY, previous_experiment=\"097ab78e6e154f24b8090a1a7dd6abb8\")\n",
"experiment.log_model(\"Emotion-T5-Base\", \"results/checkpoint-7\")\n",
"experiment.register_model(\"Emotion-T5-Base\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "l46sSt7I-_hb"
},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hHWoykLr-_hb"
},
"source": [
"### Deploy Model\n",
"\n",
"The code below helps to download the model and specific version to whatever environment you are deploying from."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "fn0IoLmh-_hb"
},
"outputs": [],
"source": [
"from comet_ml import API\n",
"\n",
"api = API(api_key=COMET_API_KEY)\n",
"COMET_WORKSPACE = \"COMET_WORKSPACE\"\n",
"\n",
"# model name\n",
"model_name = \"emotion-flan-t5-base\"\n",
"\n",
"#get the Model object\n",
"model = api.get_model(workspace=COMET_WORKSPACE, model_name=model_name)\n",
"\n",
"# Download a Registry Model:\n",
"model.download(\"1.0.0\", \"./deploy\", expand=True)"
]
}
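,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once downloaded, the checkpoint can be loaded like any local Hugging Face model. This is a minimal sketch, assuming the archive expands to a `checkpoint-7` directory under `./deploy` (adjust the path to the actual extracted layout):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# load the downloaded checkpoint for inference; adjust the path to match\n",
"# the actual layout extracted under ./deploy\n",
"deployed_model = AutoModelForSeq2SeqLM.from_pretrained(\"./deploy/checkpoint-7\")\n",
"deployed_tokenizer = AutoTokenizer.from_pretrained(\"./deploy/checkpoint-7\")"
]
}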
],
"metadata": {
"kernelspec": {
"display_name": "comet",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.17"
},
"orig_nbformat": 4,
"colab": {
"provenance": [],
"include_colab_link": true
}
},
"nbformat": 4,
"nbformat_minor": 0
}