OpenAI DevDay talks (recap?)

New Stack and Ops for AI

More AI stuff is shipping as demos; demos are easy, prod is hard.

  • look for a framework for building non-deterministic apps
  • tech stack [slide: tech-stack]

UX principles

1. General tips

  • Keep a human in the loop to iterate until 'useful' has been achieved
  • manage expectations
  • use AI to hold the user's hand and give them context on the current situation

2. Guardrails

  • why guardrails

    • OpenAI says:
      • steerability
      • safety
      • for better, safer outcomes
  • define guardrails

    • [slide: guardrails]
  • commentary

    • Talks about guardrails but never really defines them.
    • Is a guardrail anything that steps in bi-directionally between a direct user prompt and an API call? A toy sketch of that reading is below.
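A minimal sketch of that interpretation, assuming the openai>=1.0 Python client: one check on the way in, one on the way out. The `check_input`/`check_output` functions and the blocked-topics list are hypothetical illustrations, not anything OpenAI shipped.

```python
# Toy guardrail wrapper: sits bi-directionally between the user's
# prompt and the API call. Policies here are made-up examples.
from openai import OpenAI

client = OpenAI()

BLOCKED_TOPICS = ["medical diagnosis", "legal advice"]  # hypothetical policy

def check_input(prompt: str) -> bool:
    """Inbound guardrail: refuse prompts that hit a blocked topic."""
    return not any(topic in prompt.lower() for topic in BLOCKED_TOPICS)

def check_output(text: str) -> bool:
    """Outbound guardrail: here, just a length sanity check."""
    return 0 < len(text) < 4000

def guarded_completion(prompt: str) -> str:
    if not check_input(prompt):
        return "Sorry, I can't help with that topic."
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    return text if check_output(text) else "Sorry, something went wrong."
```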

3. Model Consistency

Models are probabilistic; the same prompt can give different outputs run to run. [slide: probabilistic]

  1. constrain model behavior

    • JSON mode
      • JSON mode forces the model to output JSON [slide: json-mode]
      • They say their evals show it significantly lowers erroneous JSON, but they don't give actual stats.
        • Industry trend here: companies having their own evals that they don't share with others.
        • Tongue in cheek: new post -- Evals are the new moat
    • Seed capability / reproducible seed
      • TODO: Find Midjourney post on seeds to demonstrate.
    • Commentary
      • Talks about model control being difficult when you aren't running it (lol)
        • Could be a good opportunity to point to options for running your own models.
      • A sketch of JSON mode plus the seed parameter follows.
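A minimal sketch of the two "constrain behavior" knobs from the talk: JSON mode (`response_format`) and the `seed` parameter, both shipped at DevDay on the 1106 models. Assumes the openai>=1.0 Python client and `OPENAI_API_KEY` in the environment; the prompt content is illustrative.

```python
import json
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    seed=42,  # best-effort reproducibility across runs
    response_format={"type": "json_object"},  # JSON mode
    messages=[
        # JSON mode requires the word "JSON" to appear in the prompt
        {"role": "system", "content": "Extract the city and country as JSON."},
        {"role": "user", "content": "I just got back from Lisbon, Portugal!"},
    ],
)

data = json.loads(response.choices[0].message.content)
print(data)
```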
  2. ground the model

  • Official OpenAI RAG diagram: [slide: openai-rag-architecture]

  • Problem space: models like to make things up (<--- anthropomorphizing our AI gods already as I write) when they aren't sure about something [slide: made_it_up]

    Solution: give the model grounded facts so it doesn't make stuff up [slide: grounded-facts]

    • SNARK: Now you're an AI researcher, congrats on your 10M/yr comp package
  • There are many different places you can grab 'grounded facts' from; it isn't only vector DBs (a minimal sketch follows this list):

    • Search Index (elasticsearch)
    • Internet pages (web search/web-scrape)
    • Regular DB info
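A minimal sketch of grounding, assuming the openai>=1.0 client: fetch facts from some source and put them in the prompt, telling the model to answer only from them. `retrieve_facts` is a hypothetical stand-in for any of the sources above (search index, web scrape, regular DB).

```python
from openai import OpenAI

client = OpenAI()

def retrieve_facts(question: str) -> list[str]:
    # Hypothetical: swap in Elasticsearch, a web search, or a SQL query.
    return [
        "Canada's population was estimated at 39,858,480 on April 1, 2023 "
        "(Statistics Canada)."
    ]

def grounded_answer(question: str) -> str:
    facts = "\n".join(f"- {f}" for f in retrieve_facts(question))
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system", "content": (
                "Answer using ONLY the facts below. "
                "If the facts don't cover the question, say you don't know.\n"
                f"Facts:\n{facts}"
            )},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(grounded_answer("What is the population of Canada?"))
```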

4. Evals

  • You need evals to deploy to prod [slide: eval-prod]

  • model evals == unit tests for llms

    • It's a way to turn ambiguous language into quantifiable data sets.
  • Don't build bad evals [slide: bad-evals]

  • How to build Evals

    • OpenAI evals GH

    • Log and track runs. Can track things over time like: [slide: log-track-eval]

      • Model
      • Score
      • Notes/Annotation
      • Changes between this run and last run
    • Use AI models to grade AI models. As long as you're generating the above data, a model can do a good job at evaluating itself. [slide: self-eval-diagram]

      • Diagram from the slide above:

        ```mermaid
        flowchart TD
            A[Input: What weighs more: 1 pound of feathers or 1 pound of bricks?] -->|Completion| B[1 pound of bricks weighs more]
            B -->|Evaluation Prompt| C[Compare the factual content of the submitted answer with the expert answer...]
            C --> D[GPT]
            D --> E[Completion: There is disagreement between the submitted and expert answers]

            subgraph "For each evaluation prompt, we can do:"
            F[Chain of Thought, then classify]
            G[Classify then Chain of Thought]
            H[Zero-Shot Classify]
            end
            B -.-> F
            B -.-> G
            B -.-> H
        ```
      • While evals like the above are cool for yes/no "did you get X questions right" scoring, that isn't always the most useful way to measure. Think about what you actually want to track and improve over time. These are what I'd call Use-Case-Driven Evals [slide: use-case-driven-eval]

        • This is a great example from the talk of a prompt to grade this kind of use case (a runnable sketch of this grader follows this section):

          You are a helpful evaluation assistant who grades how well the Assistant has answered the customer’s query.

          You will assess each submission against these metrics, please think through these step by step:

          • relevance: Grade how relevant the search content is to the question from 1 to 5 // 5 being highly relevant and 1 being not relevant at all.
          • credibility: Grade how credible the sources provided are from 1 to 5 // 5 being an established newspaper, government agency or large company and 1 being unreferenced.
          • result: Assess whether the question is correct given only the content returned from the search and the user’s question // acceptable values are “correct” or “incorrect”

          You will output this as a JSON document: {relevance: integer, credibility: integer, result: string}

          User: What is the population of Canada?
          Assistant: Canada's population was estimated at 39,858,480 on April 1, 2023 by Statistics Canada.
          Evaluation: {relevance: 5, credibility: 5, result: correct}

      • Once you have enough data, you can fine-tune a 'lesser' model on these data sets to create a cheaper model that does the thing just as well as, if not better than, SOTA models.

      • Alternatively, you can ask GPT4 to create a data set for you. Having AI create this is called synthetic data

      • This can get you hella cost savings [slide: Screenshot 2023-11-15 at 1.45.51 PM]

  • COMMENTARY: This follows the higher-level trend I'm seeing of generating your own data over time for everything, not just evals. The more data that is YOURS, the better results you'll get for just about everything. It's a paradigm shift: not using AI to replace you, but giving it context on essentially as much of your life as possible so it can help you with reasoning, e.g. wearables giving you context on your incoherent rants and improving your communication skills.
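Here's a runnable sketch of the model-graded eval above. The metric names and grading criteria come straight from the talk's prompt; the wrapper code, abridged prompt wording, and model choice are mine. Assumes the openai>=1.0 client.

```python
import json
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = """You are a helpful evaluation assistant who grades how well \
the Assistant has answered the customer's query.

You will assess each submission against these metrics, please think through \
these step by step:
- relevance: how relevant the search content is to the question, from 1 to 5.
- credibility: how credible the sources provided are, from 1 to 5.
- result: "correct" or "incorrect", given only the content returned and the question.

You will output this as a JSON document: \
{"relevance": integer, "credibility": integer, "result": string}"""

def grade(question: str, answer: str) -> dict:
    """Ask GPT to grade a (question, answer) pair against the rubric."""
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": GRADER_PROMPT},
            {"role": "user", "content": f"User: {question}\nAssistant: {answer}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

scores = grade(
    "What is the population of Canada?",
    "Canada's population was estimated at 39,858,480 on April 1, 2023 "
    "by Statistics Canada.",
)
print(scores)  # e.g. {"relevance": 5, "credibility": 5, "result": "correct"}
```

Log each run's model, score, and notes alongside these outputs and you get exactly the track-over-time setup described above.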

5. LLMOps

OpenAI is trying to coin the term LLMOps (LLM Operations), similar to DevOps. They think people should be working on frameworks and long-term platforms instead of one-off tools.

[slide: llmops]

Some generic definitions for the words used in this slide:

  • Data Management: This involves the processes and systems responsible for ingesting, storing, organizing, and maintaining the data used by large language models. Effective data management ensures that the data is accurate, accessible, and secure, which is essential for training and updating LLMs.

  • Observability: Observability refers to the ability to monitor and understand the internal states of a system based on its outputs. In LLM operations, this could mean having the tools and processes to track the performance, health, and behavior of the model during training and inference, enabling quick identification and resolution of issues.

  • Experimentation: Experimentation in LLM operations likely involves testing different model architectures, training datasets, hyperparameters, and algorithms to improve the model's performance. Controlled experiments help in understanding the impact of changes and in making data-driven decisions for the model's development.

  • Deployment: Deployment is the process of integrating the LLM into a production environment where it can be used for real-world applications. This includes setting up the infrastructure for the model to run, ensuring it can handle the expected load, and implementing mechanisms for continuous delivery and integration.

  • Gateway: The gateway in LLM operations serves as a point of entry for requests to the model, handling the routing, load balancing, and possibly authentication. It acts as an intermediary between the users or client applications and the LLM, managing access and traffic to ensure efficient and secure processing of language model queries.

A Survey of Techniques for Maximizing LLM Performance

  • It's good to have a very scoped problem to have the most success with LLMs.
  • Optimizing performance isn't always linear. People present it as this:

    ```mermaid
    graph LR
    A(Start) --> B(Prompt engineering)
    B --> C(Retrieval-augmented generation)
    C --> D(Fine-tuning)
    D --> E(End)
    ```

    When the reality is that it actually looks a bit more like this:
    • [slide: optimization-matrix]

Prompt Engineering

Start with:

  • Write clear instructions
  • Split complex tasks into simpler subtasks
  • Give GPTs time to "think"
  • Test changes systematically

Extend to:

  • Provide reference text
  • Use external tools

Prompt engineering intuition: -- Best place to start, and can be an okay place to finish (a small sketch of these tips follows the tables below).

| Good for: | Not good for: |
| --- | --- |
| Testing and learning early | Introducing new information |
| When paired with evaluation, it provides your baseline and sets up further optimization | Reliably replicating a complex style or method, i.e., learning a new programming language |
|  | Minimizing token usage |

Good prompt: [slide: good-prompt] / Bad prompt: [slide: bad-prompt]

Other steps to give better prompts
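A small sketch of two of the tips above: clear instructions with delimiters, and giving the model time to "think" by asking for reasoning steps before the final answer. The prompt wording and the support-article scenario are my own illustration, not from the talk; assumes the openai>=1.0 client.

```python
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "You answer customer-support questions.\n"
    "Steps (think before answering):\n"
    "1. Restate the user's actual question in one line.\n"
    "2. List the facts from the ARTICLE that bear on it.\n"
    "3. Only then write the final answer, prefixed with 'Answer:'."
)

article = "Refunds are available within 30 days of purchase with a receipt."
question = "I bought this 6 weeks ago, can I get my money back?"

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
        {"role": "system", "content": SYSTEM},
        # Triple-quote delimiters keep reference text clearly separated
        {"role": "user", "content": f"ARTICLE:\n'''{article}'''\n\nQUESTION: {question}"},
    ],
)
print(response.choices[0].message.content)
```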

RAG vs Fine Tune

We started with prompt engineering and got an okay-enough output. Now we want to evaluate and identify gaps in our product. For long-term memory issues we look to fine-tuning: things like wanting a specific structure, style, or format in your output. For short-term memory issues we look to RAG: wanting specific facts/information so the model can accurately answer questions from a user. These methods are additive, not exclusive; depending on your problem, you can stack them for optimal performance.

RAG

RAG Intuition: -- If you want to give your LLM facts or domain knowledge, that is what RAG is for.

[slide: RAG overview diagram]

| Good for: | Not good for: |
| --- | --- |
| Introducing new information to the model to update its knowledge | Embedding understanding of a broad domain (e.g. law/medicine/programming) |
| Reducing hallucinations by controlling content | Teaching the model to learn a new format, language, or style |
|  | Minimizing token usage |

RAG success story: [slide: rag-success-chart]

Some learnings here. The green ticks were things that made it into prod, and the red X's were things they tried that didn't move the needle or make it into production. Worth noting that there's a huge opportunity in the chunking/embedding experiments portion: they got a 20-point accuracy boost, which from a 45% baseline is close to a 50% relative increase in accuracy just from tweaking the data and how it was structured and chunked. If you are giving all of your data to a framework and just stopping there, you are leaving HUGE gains on the table. Your data needs to be a priority, and how it's broken up and given to the LLM is not an implementation detail; it's a large part of the strategy right now (a toy chunking sketch is below).

  • Observation: Funny that this talk goes into detail on how useful this is, but the assistants API gives you 0 control over this aspect.
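Since the chunking experiments above were worth 20 accuracy points, here's a minimal sketch of the kind of knob being tuned: fixed-size character chunks with overlap. The sizes are illustrative defaults, not numbers from the talk.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 200) -> list[str]:
    """Split text into overlapping character-based chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Experiments would vary chunk_size/overlap (or split on headings,
# sentences, etc.) and re-run retrieval evals after each change.
chunks = chunk_text("Your corpus text here... " * 100)
print(len(chunks), "chunks")
```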

RAGAS: -- how to eval RAG apps [slide: ragas-matrix]

So here you have two higher-level categories that you grade individually, each with two metrics inside it.

  • Generation: How good the LLM did at generating a correct/desirable answer

    • Faithfulness: how accurate the answer is with respect to the facts provided in the context. For this you break the entire generated answer up into individual facts, then cross-check those facts against the context. From that you get an accuracy number, and you can ditch results that don't meet an accuracy threshold.
    • Answer relevancy: how relevant the generated answer is to the question. If you have very high faithfulness (e.g. the answer is 100% factually correct) but low answer relevancy, you may need to better prompt the model to ignore context that isn't relevant to the user's question.
  • Retrieval: how good your search/data was at grabbing the desired data/material for the user's question (a sketch using the ragas library follows this list)

    • Context precision: the signal-to-noise ratio of the retrieved content. This is very useful for determining how good your search is at finding the 'needle in the haystack', as it were. It isn't always best to shove more context into the LLM, as that can cause it to gloss over relevant content. You can go through every block of content grabbed by the similarity search and see whether that content was used in the answer or not. If you're only using 2/10 blocks of content (and getting correct answers), you can look at paring down the amount of content you return for searches of that type. Very important to watch this one, as it can save you $$ by injecting fewer tokens into the prompt. It answers the question "is the context we're shoving in here helping us or hurting us?"
    • Context recall: can it retrieve all the relevant information required to answer the question? This is almost the opposite of context precision: it measures how much of the information needed to answer the user's question actually made it into the retrieved context. This tells you how good your search is. If this number is really low but you know the information is available in your data source, you can look at things like re-ranking or a different embedding strategy.
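A sketch of scoring these four metrics with the ragas library, roughly following its quickstart as of late 2023; exact column names and APIs may differ by version. Assumes `pip install ragas datasets` and an `OPENAI_API_KEY` in the environment; the single example row reuses the Canada question from earlier.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# One toy eval row: the question, the generated answer, the retrieved
# context blocks, and a ground-truth answer (needed for recall).
data = {
    "question": ["What is the population of Canada?"],
    "answer": ["Canada's population was estimated at 39,858,480 on April 1, 2023."],
    "contexts": [[
        "Statistics Canada estimated the population at 39,858,480 on April 1, 2023."
    ]],
    "ground_truths": [["About 39.9 million as of April 2023."]],
}

results = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
```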

All of that to say these are the ways to implement and evaluate RAG in your applications. But doubly important to realize that you may not have a RAG-shaped problem.

Fine Tuning

[slide: fine-tune]

Finetuning can be described as a slight additional training phase layered on top of extensive pretraining. In simpler terms, imagine you've trained a large model on a broad spectrum of data. Finetuning is like giving this model a short, specialized course to help it perform better in certain tasks.

  • Primary Uses:
    • Style of Output: Finetuning often focuses on refining the model's style of output.
    • New Facts: While finetuning isn't typically used for memorizing new facts, this is a rapidly advancing area of study. When it comes to new fact retrieval or augmentation, solutions like RAG might be more effective.
    • Applications: You might finetune a model to produce results in JSON format or to answer in a specific style and tone, as highlighted by this guide. However, at the current stage, finetuning isn't the method to extend a model's knowledge cutoff.

Benefits of fine-tuning

  • Improve model performance

    • Often a more effective way of improving model performance than prompt-engineering or FSL (few shot learning)
  • Improve model efficiency

    • Reduce the number of tokens needed to get a model to perform well on your task
    • Distill the expertise of a large model into a smaller one (a sketch of the fine-tuning API flow follows this list)
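A minimal sketch of OpenAI's fine-tuning flow as of DevDay 2023: upload a JSONL of chat examples, then create a job. Assumes the openai>=1.0 client; the file name and contents are toy illustrations.

```python
from openai import OpenAI

client = OpenAI()

# training.jsonl: one JSON object per line, e.g.
# {"messages": [{"role": "system", "content": "..."},
#               {"role": "user", "content": "..."},
#               {"role": "assistant", "content": "..."}]}
training_file = client.files.create(
    file=open("training.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",  # distill a big model's outputs into this smaller one
)
print(job.id, job.status)
```

Pair this with the eval baseline discussed above: fine-tune on a small sample first, re-run the evals, and only scale up the data set if the scores move.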

Example of Fine-tuning for a specific use case

Without fine-tuning: [slide: without-fine-tuning] / With fine-tuning: [slide: with-fine-tuning]

There are great takeaways here that aren't super apparent at first glance. If you have a domain-specific task, fine-tuning not only makes the model better, cheaper, and faster at the given task; it also makes it easier to use. Compare the prompts in each image. The one without fine-tuning (which even has a mistake) has an elaborate, in-depth prompt that the average real estate agent isn't going to know how to write themselves (at least right now). The prompt on the fine-tuned model that gets the desired output is incredibly terse and copy-paste-looking. Easy to adopt for those who don't spend all their time playing with AI. There's a good lesson in there about providing useful products, and not over-focusing on the 'DO THIS WITH AI' tag so many companies are opting for.


Fine-tuning Intuition: -- If prompt engineering isn't helping, fine-tuning likely isn't right for your use-case

| Good for: | Not good for: |
| --- | --- |
| Emphasizing knowledge that already exists in the model (e.g. SQL queries) | Adding new knowledge to the base model (your training data is SO MUCH SMALLER than what it's already been trained on) |
| Customizing the structure/tone of responses (e.g. respond in JSON) | Iterating on new use cases (you need a good amount of data, it's slow to do, and expensive) |
| Teaching a model very complex instructions (e.g. you have so many steps in a process that the context window is too small, or can only provide 1-2 examples) |  |

Best Practices

  • Start with prompt engineering and FSL (few shot learning). This will let you evaluate your use case and see if fine-tuning could even be useful here. Remember the rule of thumb: if prompting isn't having any effect on the output, it probably isn't a good use case for fine-tuning.
  • Establish a baseline. You MUST have evals in place if you want to actually see whether fine-tuning is helpful for you. You need a performance baseline to see if you're moving in the right direction or not.
  • Start small and focus on quality. You don't need HUGE amounts of data before seeing improvements. It's a really good idea to create a sample that's maybe 5-10% of what the 'production data set' fine-tune would be. You can then use that baseline to see if fine-tuning on that data set starts to push you towards the desired result.
  • Quality > quantity. You are never going to come close to matching the base training set of these large language models. The quality of the data is far more important.