Fine-Tuning Workshop 1: When and Why to Fine-Tune an LLM

Mastering LLMs: A Conference For Developers & Data Scientists

An LLM fine-tuning course and online conference for everything LLMs.

Build skills to be effective with LLMs

Course website: https://maven.com/parlance-labs/fine-tuning

Slide deck | Video recording

Fine-Tuning Workshop 2 >>>

Workshop 1: When and Why to Fine-Tune an LLM

Syllabus:

  • Our Philosophy
    • Start easy, step up the ladder of complexity slowly
    • Shorten the development cycle
    • What makes a good first product
  • When to fine-tune
    • How fine-tuning works
    • When to use base model as is
    • Practice with many example use cases
  • Picking an LLM Use Case
    • Homework

      List 5 LLM use cases you personally have or that you think are especially interesting.

      Describe each use case in 1-2 sentences.

      Then write 1-2 sentences describing whether it is best served with fine-tuning, RAG, prompt engineering, or some combination.

      Post ONE use case and the reasoning behind your approach in the course forum by May 21.

A glimpse into the course content

Reasons to Fine Tune slide


Plan For Today

  • Develop intuition for how fine tuning works
  • Understand when to fine-tune

Compared to any other resource, this course is going to focus more on our experiences working on a wide range of commercial projects. Everything is going to be as practical and actionable as possible, fueled by business experience rather than conventional academic research.

The State of LLMs

With the transition to GenAI and LLMs, especially the initial ChatGPT moment, the thing that was most striking to Dan Becker about our field is that no one really knew where to apply these LLM models or what the patterns are to successfully deploy LLMs to solve a range of business problems. This is still sort of the case. And the only way to figure out what those patterns are is to get exposure to a wide range of different problems.

Course Philosophy

  • Hands on
  • Practical rather than theoretical
  • Interactive
  • Finish more capable than you started

Keep It Simple & Stupid

  • DO NOT start with fine-tuning. Do prompt engineering first (that is much quicker)
    • Prompt engineering is much quicker. Fine-tuning and serving a fine-tuned LLM is more complex and time consuming.
    • Hamel pointed out that they are not here to sell you fine-tuning. They want to give you an intuition for when and where fine-tuning may help you, and when it may not.
  • Use OpenAI, Claude, etc.
  • "Vibe-checks" are OK in the beginning
    • "Vibe-checks" where you will at the output and does it look good, what do you like about it, what do you not like.
  • Write simple tests & assertions
    • Over time, you will start using more and more programmatic simple tests & assertions. Over a long enough period of time, you will probably always do vibe checks, but you will accumulate an increasing and eventually quite large set of tests & assertions that you run (see the sketch after this list).
  • Ship fast
    • Ship something simple very quickly to get a project started.
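
To make "simple tests & assertions" concrete, here is a minimal sketch in Python. The `generate_reply` function and the specific checks are assumptions for illustration; swap in whatever model call and failure modes matter for your product.

```python
# Minimal sketch of programmatic checks on LLM output (illustrative only).

def generate_reply(prompt: str) -> str:
    # Placeholder: call OpenAI, Claude, or your fine-tuned model here.
    return "Refunds are issued within 14 days of purchase. Contact support to start one."

def check_reply(reply: str) -> list[str]:
    """Return a list of failed assertions for one model output."""
    failures = []
    if len(reply.split()) > 150:
        failures.append("reply is too long")
    if "as an ai language model" in reply.lower():
        failures.append("reply contains boilerplate refusal text")
    if not reply.strip().endswith((".", "!", "?")):
        failures.append("reply does not end with a complete sentence")
    return failures

# Run the same checks over a growing set of test prompts; this set should
# keep accumulating alongside your vibe checks.
test_prompts = ["Summarize our refund policy in two sentences."]
for prompt in test_prompts:
    for failure in check_reply(generate_reply(prompt)):
        print(f"FAIL [{prompt}]: {failure}")
```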

The Reality of LLM Projects

Dan's story: He recently had a project that made the above especially clear. They were working with a company that takes academic journal articles, extracts info from them, then structures the info, and sells the results to physical manufacturers. (watch the recording at 00:12:32)

Key take-aways: For almost all use cases, simple approaches work well enough to start making progress. (Simple things frequently work at least tolerably well, and you can improve on them.)

Evals Are Central

slide

A workflow on how to continuously evaluate and improve your models, especially with fine-tuning.

Hamel's blog post: Your AI Product Needs Evals -- How to construct domain-specific LLM evaluation systems.

I’ve found that unsuccessful products almost always share a common root cause: a failure to create robust evaluation systems.

What is Fine-Tuning

Slide 9

Walkthrough of fine-tuning LLM.

We will go through this reasonably quickly, without going too much into technical or mathematical detail. A quick refresher is broadly useful for everyone.

The model decides the next word one at a time: what is the likelihood of this particular next word given the words seen so far? This likelihood (the predicted probabilities of different tokens) is calculated based on the model's weights. Models give output based on the text they have learned from.
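
As a rough illustration of "likelihood of the next token given the words seen so far", the sketch below uses the Hugging Face transformers library and the small GPT-2 base model (chosen here only as an example) to print the top predicted next tokens.

```python
# Sketch: inspect a base model's predicted probabilities for the next token.
# Assumes the `transformers` library; GPT-2 is used only because it is small.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # (batch, seq_len, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the vocabulary
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {prob.item():.3f}")
```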

Base models aren't helpful.

As their name suggests, they are a good baseline for fine-tuning.

Fine-Tuning

When we do fine-tuning, we will start with a dataset. It will have many pairs of input and output (prompt and response). We are going to train it to take something in the form of the input and create something in the form of output. We want to harness next token prediction.

The trick to do this is to put it in something which is called a template (see screenshot below).

Slide 12

Above is a very simple template. This template has a string, which is:

  • input that's highlighted in yellow
  • output that's highlighted in green
  • one token in between that's highlighted in red

The one token in between will be our way of making the model, at inference time, short-circuit all the other training that it's had and say: when I see this token, the likely next tokens after it are an answer, a helpful answer.

So this is our way of training with next token prediction.
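
As a sketch of what that looks like in code (the `### Response:` marker below is only an assumed stand-in for the highlighted token on the slide, not necessarily the exact string used in the workshop):

```python
# Sketch: turn (input, output) pairs into training text using a template.
# The "### Response:" marker stands in for the single token between input and
# output described above; the exact string is an assumption.
TEMPLATE = "{input}\n### Response:\n{output}"

def to_training_text(example: dict) -> str:
    return TEMPLATE.format(input=example["input"], output=example["output"])

pair = {
    "input": "How do I reset my password?",
    "output": "Click 'Forgot password' on the login page and follow the emailed link.",
}
print(to_training_text(pair))

# At inference time, format only the input and the marker, then let the model
# generate everything that follows "### Response:".
```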

**Need consistent templating between training & inference**

Slide 13

One thing we're going to call out many times (it is the bane of my existence) is templating. (watch the recording at 00:23:34)

This is actually a harder problem than it sounds. Hamel has done some pretty cool work on how this relates to tokenization. There's a whole rabbit hole there.

Hamel: Yes, this is the biggest nightmare. As you know, we're going to spend time learning Axolotl, and when I teach Axolotl, the bulk of the time is making sure that you understand what the template is. Because that is where 99% of errors in practice happen with this. And it sounds like, "Oh, OK, why would you ever get this wrong?". The fact of the matter is, there are many different kinds of templates out there. There are things that assemble templates for you. It's really easy to misunderstand what the template is. It's often the case that you don't assemble this template yourself. If you don't precisely understand what that template is, you can get into trouble really fast. The reason why it comes up so much is that there are a lot of abstractions out there. Roughly half of the time, I've seen something go wrong between what you expect to happen and what is actually being assembled.

Hamel dropped a link in the chat to some very detailed analysis of these tokens and how you can be misled even when they look the same. (You can read it if you're interested.)
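
One way to reduce this class of bug for chat-style models is to let the tokenizer assemble the template for you on both sides, and to print the assembled string so you can inspect it. Below is a minimal sketch with the transformers `apply_chat_template` API; the model name is just an example.

```python
# Sketch: use the tokenizer's own chat template at inference so the text the
# model sees matches what it was trained on. The model name is illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [{"role": "user", "content": "Summarize this ticket in one sentence."}]

# tokenize=False lets you print and eyeball the exact assembled string,
# which is the single most useful debugging step for template bugs.
prompt_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt_text)
```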

Is Fine-Tuning Dead?

Observation (watch the recording at 00:28:51)

Sharing Tweets by Anton (@abacaj) and Emmanuel Ameisen (@mlpowered).

Interest in fine tuning is like waves. It increases and decreases.

Story: A year ago, at an OpenAI event, they said they think there's going to be one model to rule them all, and they don't think there will be lots of small models that are useful for specific purposes.

There is no question that sometimes fine-tuning is the right approach. You're going to have bespoke data.

There's been an important trend towards longer context windows, which means you can give more examples in your prompt. That trend favors less fine-tuning and more of just dropping a lot into the prompt.

Dan doesn't think either extreme view is right, and the community will sort of move back and forth between those over time.

Hamel's take on this:

My sentiment is: you should definitely try NOT to fine-tune first. You need to prove to yourself that you should fine-tune. The way to prove it is to have some minimal evaluation system. Once you stop making progress, move to fine-tuning.

It's important to learn how to do prompting. It's kind of funny to think of prompting as a skill, but I actually think it is. I practice it quite a bit.

You may have a generic model which has not seen specific data. Then you need fine-tuning.

Example

Slide 15

The task is to find the value based on description.

Because it's just regression, you could use classical NLP or ML techniques for this. There's an important reason we did not want to do that: the descriptions won't cover all the words. Users might enter new words, and classical NLP/regression will not handle them.

Slide 16: Unacceptable results

  • Learned that responses were round numbers, but not great at getting approximately right values - Inappropriate loss function
    • Consider this as a regression problem: for the LLM, the output numbers are just strings. It outputs round values like 10, 5, 100, 5000; it doesn't use values like 97.
  • Training data had "wrong" small values
    • Be careful with training data. Values were entered low for various reasons, like insurance cost; users will put 10 instead of 500.
  • Many incomprehensible descriptions due to length limit
    • Companies use acronyms so that descriptions fit in the 80-character space. Thus the model needs access to the company's acronyms and related policies. This is all revealed by looking at the raw data.
  • Conventional NLP/ML also not good enough
    • In this case, NLP/ML was not useful.

Case Study: Honeycomb - NL to Query

(watch the recording at 00:39:35)

I think this is a really great use case in which to learn some of the important nuances about fine-tuning.

Honeycomb provides a domain-specific query language for observability. It's basically telemetry. You can think of it like what people use DataDog for in some of these cases.

They created a natural language query assistant and it will build the query (DSL not SQL) for you using LLMs.

In the alpha version of the product that they released, the user provided two inputs: a natural language query + the schema. The input gets assembled into a prompt that goes to GPT-3.5, and out comes a Honeycomb query. The assembled prompt is very complex.

The Prompt

   System Message
         +
Content: column names
         +
    Query Spec
         +
       Tips
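
A rough sketch of how a prompt like this might be assembled from those pieces; every string below is a placeholder, not Honeycomb's actual prompt.

```python
# Rough sketch of assembling the alpha-version prompt from its pieces.
# All strings below are placeholders, not Honeycomb's actual prompt.
system_message = "You translate natural language into Honeycomb queries."
column_names = ["duration_ms", "status_code", "service.name"]
query_spec = "A query is JSON with 'calculations', 'filters', 'breakdowns', ..."
tips = "Prefer HEATMAP for latency questions. Never invent column names."

def build_prompt(user_question: str) -> str:
    return "\n\n".join([
        system_message,
        "Columns: " + ", ".join(column_names),   # schema from the user's dataset
        query_spec,
        tips,
        "User question: " + user_question,
    ])

print(build_prompt("slowest endpoints in the last hour"))
```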

Problems

Complex tips, best practices, few-shot examples, etc. are very difficult for the LLM to follow or express.

That's a smell that fine-tuning might be useful.

Reasons to Fine Tune

Slide 21:

  • Data privacy
    • Honeycomb is using GPT-3.5. They're able to roll it out to a select number of customers, but they have to get permission from their customers to ship their data to OpenAI. Not everybody wants to opt in to that, and they also don't want to ask everybody for that; it's kind of disruptive. They wanted to see if there's a way they can own the model and run these workloads inside a trusted boundary where data doesn't leave their property.
  • Quality vs. latency tradeoff
    • Honeycomb experimented with GPT-4. It was a bit too expensive and also slower than they wanted. The reason to think about fine-tuning is that it may be possible to train a smaller model to do better: to approach the quality of the bigger model with lower latency.
  • Extremely narrow problem
    • Honeycomb's problem is a very narrow problem -- the domain is very narrow.
  • Prompt engineering is impractical
    • Prompt engineering in Honeycomb's case is impractical -- expressing all the nuances of the Honeycomb query language is very difficult even for a human being.

RESULT: The fine-tuned model was faster, more compliant from the data privacy perspective, and higher quality vs. GPT-3.5

What we will do is simulate that in this course. We will give you access to synthetic data and walk you through how we did it.

Breakout Time

Slide 24:

Imagine you decide to build a chatbot for your first fine-tuning project. What factors determine whether fine-tuning a chatbot will work well or not?

(watch the recording: 01:09:18)

Rechat Case Study

(watch the recording at 01:20:39)

Slide 26:

  • Email composer
  • Listing Finder
  • CMA (Comparative Market Analysis)
  • Create Marketing Website
  • Create Social Media Post
  • Query Knowledge Base
  • ... 25 other tools

People want to use chatbots. 9 out of 10 will ask for one. It is better to say NO in most cases.

The idea: let's put a chat bot on your software and you can ask it anything.

So that breaks really fast because the surface area is extremely large, and it kind of devolves into AGI in the sense of "hey, ask it to do anything". It's not really scoped, and it's hard to make progress around something that isn't scoped.

Slide 27:

  • Manage user expectations
  • Large surface area
  • Combinations of tools
  • Compromise - specificity

Chat bot can help in narrow cases. Users' expectations are very high. Build a chat bot for each of the blocks.

Scope Isn't What You Say It Is

Slide 28

(watch the recording at 01:28:05)

Dan: We were working on a chatbot for a package delivery company called DPD. Actually, I told them I thought it was not ready to be released, but they were antsy, so they released it.

A DPD chatbot error caused it to swear at a customer.

So this, I think just speaks to the fact that we don't really have a great sense for what people's expectations are.

Someone commented about guardrails in the chat. There's a bunch of tools that are meant to be guardrails and check these so-called prompt injections. None of those work especially well. Guardrails are not foolproof. (watch the recording: 01:31:01)

Hamel: I'll drop a blog post in the chat about looking at your prompt and how important that is, which highlights things like different kinds of guardrails. (aside: Simon Willison's prompt injection blog post series)

Recap: When to Fine Tune

Slide 29:

  • Want bespoke behavior
  • Valuable enough to justify operational complexity
  • Have examples of desired input/outputs

(watch the recording: 01:32:10)

Standards For "Desired" Input/Output

Slide 30

Table:

Prompt-Response pair 1 - Great answer!

Prompt-Response pair 2 - OK response!

Prompt-Response pair 3 - Too long-winded

Prompt-Response pair 4 - Pretty good

Prompt-Response pair 5 - Not bad. A little repetitive

Preference Optimization

Slide 31

(watch the recording: 01:34:20)

While it's difficult to write perfect responses (as shown in the previous slide), humans are typically pretty good at saying, given 2 choices, which they like more.

So there is a whole field of techniques that are preference optimization algorithms.

Regarding the screenshot of the Tweet: the top models on this leaderboard use a technique called DPO, which is short for Direct Preference Optimization.

What is DPO?

(watch the recording: 01:35:11)

The model learns to imitate the behavior or style of responses to those prompts.

Supervised fine-tuning uses prompt + response pairs; with DPO you additionally tell the model which is the better and which is the worse response to a query.
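
For illustration, preference data for DPO is usually stored as triples of prompt, chosen response, and rejected response. The field names below follow the common TRL convention; the content is made up.

```python
# Sketch of the preference data DPO trains on: for each prompt you record the
# preferred ("chosen") and less-preferred ("rejected") response.
preference_data = [
    {
        "prompt": "Customer asks: where is my order #1234?",
        "chosen": "Your order shipped on Monday; the tracking link is in your confirmation email.",
        "rejected": "Orders usually arrive eventually. Please wait.",
    },
    # ... more pairs, each judged by a human who compared the two responses
]

# With data in this shape, the model can be fine-tuned to shift probability
# toward the style of the "chosen" responses.
```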

DPO For Customer Service At Large Publisher

Dan: I did a project like that for a large publisher. This is an example where we worked with relatively little data.

So they had incoming emails. For each of 200 emails, we had 2 different customer service agents write a response. Their manager took these pairs of responses and said, of these 2, here's the one that I prefer. Then we fine-tuned a Zephyr model with DPO. Then we compared the model to alternative response sources on new emails.
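
For concreteness, here is a rough sketch of what that DPO training step could look like with the open-source TRL library. The model name, hyperparameters, and the tiny inline dataset are placeholders, and DPOTrainer's keyword arguments differ between trl versions, so treat this as a sketch rather than the project's actual code.

```python
# Hedged sketch of DPO fine-tuning with TRL (keyword args vary by trl version).
# The model name and the inline preference row are placeholders.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "HuggingFaceH4/zephyr-7b-beta"  # "a Zephyr model", per the case study
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# In the case study this would be ~200 manager-judged response pairs.
train_dataset = Dataset.from_list([
    {
        "prompt": "Customer: my book arrived with a damaged cover.",
        "chosen": "I'm sorry about that. We'll send a replacement copy today at no charge.",
        "rejected": "Damage can happen in transit. You could try contacting the carrier.",
    },
])

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-customer-service", beta=0.1),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```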

Blinded Test Results

Slide 33

(watch the recording: 01:37:12)

Test results for the responses showed the DPO-tuned model was better than GPT-4.

Your Turn

Slide 34

(watch the recording: 01:39:09)

Quiz:

  • Restaurants, customer service emails (answer: good use case for fine-tuning)

  • Medical publisher has an army of analysts that classify each new research article into some complex ontology that they've built. (answer: great use case for fine-tuning)

  • A startup wants to build the world's best short fiction writer. Here, most people said this is a poor fit for fine-tuning.

    Dan: If I were a startup trying to build this, I would, for a period of time, have two different models that produce different responses and have people rank which story they prefer. Then we would be able to do DPO and say story A is better than story B. The model can then, in a very granular and data-informed way, learn about people's preferences, what they like, in a way that I don't think is at all possible without some sort of preference optimization.

    Hamel: poor-fit for fine-tuning.

  • A company (details fudged) wants to give each employee an automated summary of new articles on specific topics at the beginning of the day. They need an LLM-based service that takes news articles as inputs and responds with summaries. (answer: poor fit for fine-tuning. Dan thinks: ChatGPT can do a great job of this. I don't really understand what data you would have internally.)

Q&A

Questions in the Zoom chat.

  • Wade: Can you show us some examples of assertions and simple tests?
    • Hamel: We will do that in the course when we get to the point.
  • What is the difference between pre-training and fine-tuning?
    • Hamel: They are the same thing, the same procedure. It's just a matter of different data. Pre-training is not focused on a specific domain: you're trying to feed a wide, diverse set of data to teach a model general skills, whereas fine-tuning is training a model to do really well on a very specific domain. Pre-training is where your big base models come from, and then you can fine-tune on top of those.
    • Dan: They're both mathematically the same, basically. In terms of their purpose, pre-training is really teaching the model to learn basic language, and fine-tuning is, as the name suggests, fine-tuning it for a specific purpose that you're going to want to use it for in the future.

(watch the recording at 00:51:20)

  • How do you know it was a fine-tuning versus RAG question?
    • It's a common confusion actually. These 2 techniques, RAG and fine-tuning, are not competing with each other per se.
    • RAG is useful when the LLM can go to a data store to check the latest info. Fine-tuning can be done on the output of RAG.
    • Consider fine-tuning when a good prompt and RAG do not work.
  • Can we fine-tune a model to make it better at doing function calls?
    • Hamel: Yes, absolutely. There are some open models that have already been fine-tuned, on Llama 3 and certainly Llama 2, with the specific purpose of function calling.
    • Dan: You need lots of data so that it maps to all the parts in your problem space.
  • How many samples are necessary for fine-tuning?
    • Dan: It varies quite a bit. The least I have used that I think we viewed as a success is 100 samples. It wouldn't surprise me if there are examples with less than that. The most important determinant is how broad your problem is.
  • Can you have too much data?
    • Dan: No. I'm hesitant to say never.
  • Is there a value in fine-tuning a model on both correct and incorrect responses?
    • Dan: Soon we will talk about preference optimization, which isn't exactly this but is pretty close. Instead of right and wrong, you have better and worse. For example, for a publisher we built a tool to automate responding to emails, and we had better and worse samples. We used preference optimization and came up with something that was better than if you did conventional supervised fine-tuning.
  • The Gorilla leaderboard: https://gorilla.cs.berkeley.edu/blogs/7_open_functions_v2.html
    • Hamel: It is for function calling. That's great, but keep in mind the Gorilla leaderboard is a bit overfit to function calling. In practice, you're going to have a mix of function calling and non-function calling. Take every leaderboard with a grain of salt. Also look at the test data and think about how it might apply to your use case. But it's an OK way to get a general sense.
  • Multimodal fine-tuning
    • One thing that I would emphasize is that the LLaVA model is very good. There's a script in the LLaVA repository for fine-tuning. Just getting that set up has, if anything, been easier than I would have expected. Maybe I will write a blog post about it. If you look at the LLaVA repository, you would be surprised at how well it can be done with an amount of effort that's not as immense as I expected beforehand.
  • Does synthetic data have to come from a more powerful model?
    • Hamel: Yes, if you can. One of the key reasons why I like LLMs as opposed to classic ML is that it's more fun to work on those projects, because I get unblocked if I run into a situation where I don't have enough data. I usually use the most powerful model I can to generate synthetic data -- usually Mistral Large, because the T&C don't scare anybody; they're very permissive, so you could generate synthetic data and train another model. There are a lot of different ways to do that. One way is taking existing data and perturbing it, like asking a language model to rewrite it and then changing the output in accordance with that, using evals in the middle (see the sketch after this Q&A list). Another way is generating test cases or inputs to your LLM system. Your LLM system might be some complex system that has RAG in it, does function calls, and then finally returns something to the user; you can generate inputs into that system. There's a lot to say in words. We'll show you more in upcoming lessons about what that means.
  • Do I use base model or instruction tune models for fine-tuning?
    • Hamel: Instruction tuned models are already fine-tuned. Base models are generally preferred.
  • What is the model size?
    • Hamel: I try to get away with the smallest size that I can, so I try to fine-tune a 7-billion parameter model. I use my intuition at this point, like how narrow the domain is, based on the other things I've done. The best thing you can do is try to train a 7-billion parameter model; that's the sweet spot. If you can get something into a very small, or small-ish, package, it's going to make more sense. The larger the model, the more you have to justify it: it's going to cost more, it's going to be harder to host, and so on. The small models are where the payoff is really big.
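
Picking up the synthetic data question above, here is a minimal sketch of the "perturb existing data" recipe Hamel describes. `call_strong_model` is a hypothetical wrapper; plug in whichever powerful, permissively licensed model you use.

```python
# Sketch of one synthetic-data recipe: perturb existing examples by asking a
# strong model to rewrite the input, keeping evals in the loop before the new
# pair joins the fine-tuning set. `call_strong_model` is hypothetical.

def call_strong_model(prompt: str) -> str:
    # Placeholder: call your chosen provider's API (e.g. Mistral Large) here.
    return "Could you walk me through resetting my account password?"

def perturb_example(example: dict) -> dict:
    rewrite_prompt = (
        "Rewrite the following user request so it means the same thing "
        f"but uses different wording:\n\n{example['input']}"
    )
    new_input = call_strong_model(rewrite_prompt)
    # Keep (or regenerate and re-check) the output, and run your evals on the
    # new pair before adding it to the training data.
    return {"input": new_input, "output": example["output"]}

seed = {"input": "How do I reset my password?",
        "output": "Click 'Forgot password' on the login page."}
print(perturb_example(seed))
```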

Q&A again. (watch the recording at 01:52:56)

  • Quantization
    • Hamel explained quantization.
    • Dan: We have the CTO of Predibase as speaker. He is the expert in this area.
  • Hallucination, taking the example of classifying academic or science articles onto a complex ontology (thousands of classes): how do you make sure the LLM only outputs valid classes?
    • Hamel: We have enough examples that only use the specific set of classes. We have a set of metrics that we are checking all the time, and we will just treat an invalid class as a misclassification.
  • Is there any homework?
    • Go to the Maven platform and check the Workshop 1 syllabus. The homework is there.
    • Come up with a set of 5 use cases, rate whether you think each of them is a good or bad use case for fine-tuning, and share that in the Discord.
  • The customer service DPO fine-tuned model is better than GPT-4. Can you share more detail?
    • Dan: Take the McDonald's example: gluten-free policies. GPT-4 would respond to the customer service manager with "I have no idea". So the idea that you're going to forever tell GPT-4 enough that it can respond to all the questions that come in -- that is fiction.
  • Does prompt engineering or few shot examples complement fine-tuning?
    • Dan: It is not necessarily the case that you would need to use just one or the other; you could use both. But for the most part I think of those as alternatives.
    • Hamel: One rule of thumb is: anything in your prompt that stays exactly the same and doesn't change from invocation to invocation, fine-tuning should be able to completely remove. It's kind of dead weight. You can implicitly teach your model whatever it is you're repeating every single time, so you don't need to say it anymore. Now, if your few-shot examples are dynamic, it depends. The more extensively you fine-tune your model, the less you should need few-shot examples; few-shot examples are more of a prompt engineering technique. I haven't actually tested that, though, to be honest. It always surprises me what works. There's a spectrum. But if anything is staying the same in your prompt, and if you have few-shot examples in your prompt that never change, then those are things you can definitely get rid of with fine-tuning.
  • Human annotation
    • Hamel: Data annotation -- we'll cover this a bit in the next course. You want to have a human in the loop when you're doing evals, and you want to be able to look at lots of different examples and curate which ones are good and bad. You also want to look at your failure modes. You want to curate data that covers all the different use cases you can think of. Every time I try to use existing tools for looking at data, I get frustrated, because every domain is very specific. I like to build my own tools with something like Gradio or Streamlit. I'll put a blog post that I wrote about this topic in the chat, "Curating LLM data".

Lesson Resources

A collection of 98% of links posted in the chat:

AI Product Evaluation

Programming and Development Tools

Tokenization and Fine-Tuning

  1. Notebook fine-tuning on a Captcha image dataset - https://github.com/vikhyat/moondream/blob/main/notebooks/Finetuning.ipynb

Prompt Engineering

Research Papers and Studies

News and Articles

Source: Discord

Some of them are on the instructors' recommended reading list for today's workshop.

Discord Messages

Some highlights:
