OpenAI DevDay talks (recap?)

New Stack and Ops for AI

More AI stuff is shipping as demos; demos are easy, prod is hard.

  • look for a framework for building non-deterministic apps
  • tech stack [slide: tech-stack]

UX principles

1. General tips

  • Keep a human in the loop to iterate until 'useful' has been achieved
  • manage expectations
  • use AI to hold the user's hand and give them context on the current situation

2. Guardrails

  • why guardrails

    • OpenAI says:
      • steerability
      • safety
      • for better, safer outcomes
  • define guardrails

    • [slide: guardrails]
  • commentary

    • Talks about guardrails but never really defines them.
    • Is a guardrail anything that steps in bi-directionally between a direct user prompt and an API call? A toy sketch of that reading is below.
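A minimal sketch of that interpretation, assuming the openai>=1.0 Python client: one check on the way in, one on the way out. The `check_input`/`check_output` functions and the blocked-topics list are hypothetical illustrations, not anything OpenAI shipped.

```python
# Toy guardrail wrapper: sits bi-directionally between the user's
# prompt and the API call. Policies here are made-up examples.
from openai import OpenAI

client = OpenAI()

BLOCKED_TOPICS = ["medical diagnosis", "legal advice"]  # hypothetical policy

def check_input(prompt: str) -> bool:
    """Inbound guardrail: refuse prompts that hit a blocked topic."""
    return not any(topic in prompt.lower() for topic in BLOCKED_TOPICS)

def check_output(text: str) -> bool:
    """Outbound guardrail: here, just a length sanity check."""
    return 0 < len(text) < 4000

def guarded_completion(prompt: str) -> str:
    if not check_input(prompt):
        return "Sorry, I can't help with that topic."
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    return text if check_output(text) else "Sorry, something went wrong."
```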

3. Model Consistency

Models are probabilistic; the same prompt can give different outputs run to run. [slide: probabilistic]

  1. constrain model behavior

    • JSON mode
      • JSON mode forces the model to output JSON [slide: json-mode]
      • They say their evals show it significantly lowers erroneous JSON, but they don't give actual stats.
        • Industry trend here: companies having their own evals that they don't share with others.
        • Tongue in cheek: new post -- Evals are the new moat
    • Seed capability / reproducible seed
      • TODO: Find Midjourney post on seeds to demonstrate.
    • Commentary
      • Talks about model control being difficult when you aren't running it (lol)
        • Could be a good opportunity to point to options for running your own models.
      • A sketch of JSON mode plus the seed parameter follows.
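A minimal sketch of the two "constrain behavior" knobs from the talk: JSON mode (`response_format`) and the `seed` parameter, both shipped at DevDay on the 1106 models. Assumes the openai>=1.0 Python client and `OPENAI_API_KEY` in the environment; the prompt content is illustrative.

```python
import json
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    seed=42,  # best-effort reproducibility across runs
    response_format={"type": "json_object"},  # JSON mode
    messages=[
        # JSON mode requires the word "JSON" to appear in the prompt
        {"role": "system", "content": "Extract the city and country as JSON."},
        {"role": "user", "content": "I just got back from Lisbon, Portugal!"},
    ],
)

data = json.loads(response.choices[0].message.content)
print(data)
```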
  2. ground the model

  • Official OpenAI RAG diagram: [slide: openai-rag-architecture]

  • Problem space: models like to make things up (<--- anthropomorphizing our AI gods already as I write) when they aren't sure about something [slide: made_it_up]

    Solution: give the model grounded facts so it doesn't make stuff up [slide: grounded-facts]

    • SNARK: Now you're an AI researcher, congrats on your 10M/yr comp package
  • There are many different places you can grab 'grounded facts' from; it isn't only vector DBs (a minimal sketch follows this list):

    • Search Index (elasticsearch)
    • Internet pages (web search/web-scrape)
    • Regular DB info
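A minimal sketch of grounding, assuming the openai>=1.0 client: fetch facts from some source and put them in the prompt, telling the model to answer only from them. `retrieve_facts` is a hypothetical stand-in for any of the sources above (search index, web scrape, regular DB).

```python
from openai import OpenAI

client = OpenAI()

def retrieve_facts(question: str) -> list[str]:
    # Hypothetical: swap in Elasticsearch, a web search, or a SQL query.
    return [
        "Canada's population was estimated at 39,858,480 on April 1, 2023 "
        "(Statistics Canada)."
    ]

def grounded_answer(question: str) -> str:
    facts = "\n".join(f"- {f}" for f in retrieve_facts(question))
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system", "content": (
                "Answer using ONLY the facts below. "
                "If the facts don't cover the question, say you don't know.\n"
                f"Facts:\n{facts}"
            )},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(grounded_answer("What is the population of Canada?"))
```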

4. Evals

  • You need evals to deploy to prod [slide: eval-prod]

  • model evals == unit tests for llms

    • It's a way to turn ambiguous language into quantifiable data sets.
  • Don't build bad evals [slide: bad-evals]

  • How to build Evals

    • OpenAI evals GH

    • Log and track runs. Can track things over time like: [slide: log-track-eval]

      • Model
      • Score
      • Notes/Annotation
      • Changes between this run and last run
    • Use AI models to grade AI models. As long as you're generating the above data, a model can do a good job at evaluating itself. [slide: self-eval-diagram]

      • Diagram from the slide above:

        ```mermaid
        flowchart TD
            A[Input: What weighs more: 1 pound of feathers or 1 pound of bricks?] -->|Completion| B[1 pound of bricks weighs more]
            B -->|Evaluation Prompt| C[Compare the factual content of the submitted answer with the expert answer...]
            C --> D[GPT]
            D --> E[Completion: There is disagreement between the submitted and expert answers]

            subgraph "For each evaluation prompt, we can do:"
            F[Chain of Thought, then classify]
            G[Classify then Chain of Thought]
            H[Zero-Shot Classify]
            end
            B -.-> F
            B -.-> G
            B -.-> H
        ```
      • While evals like the above are cool for yes/no "did you get X questions right" scoring, that isn't always the most useful way to measure. Think about what you actually want to track and improve over time. These are what I'd call Use-Case-Driven Evals [slide: use-case-driven-eval]

        • This is a great example from the talk of a prompt to grade this kind of use case (a runnable sketch of this grader follows this section):

          You are a helpful evaluation assistant who grades how well the Assistant has answered the customer’s query.

          You will assess each submission against these metrics, please think through these step by step:

          • relevance: Grade how relevant the search content is to the question from 1 to 5 // 5 being highly relevant and 1 being not relevant at all.
          • credibility: Grade how credible the sources provided are from 1 to 5 // 5 being an established newspaper, government agency or large company and 1 being unreferenced.
          • result: Assess whether the question is correct given only the content returned from the search and the user’s question // acceptable values are “correct” or “incorrect”

          You will output this as a JSON document: {relevance: integer, credibility: integer, result: string}

          User: What is the population of Canada?
          Assistant: Canada's population was estimated at 39,858,480 on April 1, 2023 by Statistics Canada.
          Evaluation: {relevance: 5, credibility: 5, result: correct}

      • Once you have enough data, you can fine-tune a 'lesser' model on these data sets to create a cheaper model that does the thing just as well as, if not better than, SOTA models.

      • Alternatively, you can ask GPT4 to create a data set for you. Having AI create this is called synthetic data

      • This can get you hella cost savings [slide: Screenshot 2023-11-15 at 1.45.51 PM]

  • COMMENTARY: This follows the higher-level trend I'm seeing of generating your own data over time for everything, not just evals. The more data that is YOURS, the better results you'll get for just about everything. It's a paradigm shift: not using AI to replace you, but giving it context on essentially as much of your life as possible so it can help you with reasoning, e.g. wearables giving you context on your incoherent rants and improving your communication skills.
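Here's a runnable sketch of the model-graded eval above. The metric names and grading criteria come straight from the talk's prompt; the wrapper code, abridged prompt wording, and model choice are mine. Assumes the openai>=1.0 client.

```python
import json
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = """You are a helpful evaluation assistant who grades how well \
the Assistant has answered the customer's query.

You will assess each submission against these metrics, please think through \
these step by step:
- relevance: how relevant the search content is to the question, from 1 to 5.
- credibility: how credible the sources provided are, from 1 to 5.
- result: "correct" or "incorrect", given only the content returned and the question.

You will output this as a JSON document: \
{"relevance": integer, "credibility": integer, "result": string}"""

def grade(question: str, answer: str) -> dict:
    """Ask GPT to grade a (question, answer) pair against the rubric."""
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": GRADER_PROMPT},
            {"role": "user", "content": f"User: {question}\nAssistant: {answer}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

scores = grade(
    "What is the population of Canada?",
    "Canada's population was estimated at 39,858,480 on April 1, 2023 "
    "by Statistics Canada.",
)
print(scores)  # e.g. {"relevance": 5, "credibility": 5, "result": "correct"}
```

Log each run's model, score, and notes alongside these outputs and you get exactly the track-over-time setup described above.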

5. LLMOps

OpenAI is trying to coin the term LLMOps (LLM Operations), similar to DevOps. They think people should be working on frameworks and long-term platforms instead of one-off tools.

[slide: llmops]

Some generic definitions for the words used in this slide:

  • Data Management: This involves the processes and systems responsible for ingesting, storing, organizing, and maintaining the data used by large language models. Effective data management ensures that the data is accurate, accessible, and secure, which is essential for training and updating LLMs.

  • Observability: Observability refers to the ability to monitor and understand the internal states of a system based on its outputs. In LLM operations, this could mean having the tools and processes to track the performance, health, and behavior of the model during training and inference, enabling quick identification and resolution of issues.

  • Experimentation: Experimentation in LLM operations likely involves testing different model architectures, training datasets, hyperparameters, and algorithms to improve the model's performance. Controlled experiments help in understanding the impact of changes and in making data-driven decisions for the model's development.

  • Deployment: Deployment is the process of integrating the LLM into a production environment where it can be used for real-world applications. This includes setting up the infrastructure for the model to run, ensuring it can handle the expected load, and implementing mechanisms for continuous delivery and integration.

  • Gateway: The gateway in LLM operations serves as a point of entry for requests to the model, handling the routing, load balancing, and possibly authentication. It acts as an intermediary between the users or client applications and the LLM, managing access and traffic to ensure efficient and secure processing of language model queries.

A Survey of Techniques for Maximizing LLM Performance

  • It's good to have a very scoped problem to have the most success with LLMs.
  • Optimizing performance isn't always linear. People present it as this:

    ```mermaid
    graph LR
    A(Start) --> B(Prompt engineering)
    B --> C(Retrieval-augmented generation)
    C --> D(Fine-tuning)
    D --> E(End)
    ```

    When the reality is that it actually looks a bit more like this:
    • [slide: optimization-matrix]

Prompt Engineering

Start with:

  • Write clear instructions
  • Split complex tasks into simpler subtasks
  • Give GPTs time to "think"
  • Test changes systematically

Extend to:

  • Provide reference text
  • Use external tools

Prompt engineering intuition: -- Best place to start, and can be an okay place to finish (a small sketch of these tips follows the tables below).

| Good for: | Not good for: |
| --- | --- |
| Testing and learning early | Introducing new information |
| When paired with evaluation, it provides your baseline and sets up further optimization | Reliably replicating a complex style or method, i.e., learning a new programming language |
|  | Minimizing token usage |

Good prompt: [slide: good-prompt] / Bad prompt: [slide: bad-prompt]

Other steps to give better prompts
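A small sketch of two of the tips above: clear instructions with delimiters, and giving the model time to "think" by asking for reasoning steps before the final answer. The prompt wording and the support-article scenario are my own illustration, not from the talk; assumes the openai>=1.0 client.

```python
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "You answer customer-support questions.\n"
    "Steps (think before answering):\n"
    "1. Restate the user's actual question in one line.\n"
    "2. List the facts from the ARTICLE that bear on it.\n"
    "3. Only then write the final answer, prefixed with 'Answer:'."
)

article = "Refunds are available within 30 days of purchase with a receipt."
question = "I bought this 6 weeks ago, can I get my money back?"

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
        {"role": "system", "content": SYSTEM},
        # Triple-quote delimiters keep reference text clearly separated
        {"role": "user", "content": f"ARTICLE:\n'''{article}'''\n\nQUESTION: {question}"},
    ],
)
print(response.choices[0].message.content)
```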

RAG vs Fine Tune

We started with prompt engineering and got an okay-enough output. Now we want to evaluate and identify gaps in our product. For long-term memory issues we look to fine-tuning: things like wanting a specific structure, style, or format in your output. For short-term memory issues we look to RAG: wanting specific facts/information so the model can accurately answer questions from a user. These methods are additive, not exclusive; depending on your problem, you can stack them for optimal performance.

RAG

RAG Intuition: -- If you want to give your LLM facts or domain knowledge, that is what RAG is for.

[slide: RAG overview diagram]

| Good for: | Not good for: |
| --- | --- |
| Introducing new information to the model to update its knowledge | Embedding understanding of a broad domain (e.g. law/medicine/programming) |
| Reducing hallucinations by controlling content | Teaching the model to learn a new format, language, or style |
|  | Minimizing token usage |

RAG success story: [slide: rag-success-chart]

Some learnings here. The green ticks were things that made it into prod, and the red X's were things they tried that didn't move the needle or make it into production. Worth noting that there's a huge opportunity in the chunking/embedding experiments portion: they got a 20-point accuracy boost, which from a 45% baseline is close to a 50% relative increase in accuracy just from tweaking the data and how it was structured and chunked. If you are giving all of your data to a framework and just stopping there, you are leaving HUGE gains on the table. Your data needs to be a priority, and how it's broken up and given to the LLM is not an implementation detail; it's a large part of the strategy right now (a toy chunking sketch is below).

  • Observation: Funny that this talk goes into detail on how useful this is, but the assistants API gives you 0 control over this aspect.
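Since the chunking experiments above were worth 20 accuracy points, here's a minimal sketch of the kind of knob being tuned: fixed-size character chunks with overlap. The sizes are illustrative defaults, not numbers from the talk.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 200) -> list[str]:
    """Split text into overlapping character-based chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Experiments would vary chunk_size/overlap (or split on headings,
# sentences, etc.) and re-run retrieval evals after each change.
chunks = chunk_text("Your corpus text here... " * 100)
print(len(chunks), "chunks")
```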

RAGAS: -- how to eval RAG apps [slide: ragas-matrix]

So here you have two higher-level categories that you grade individually, each with two metrics inside it.

  • Generation: How good the LLM did at generating a correct/desirable answer

    • Faithfulness: how accurate the answer is with respect to the facts provided in the context. For this you break the entire generated answer up into individual facts, then cross-check those facts against the context. From that you get an accuracy number, and you can ditch results that don't meet an accuracy threshold.
    • Answer relevancy: how relevant the generated answer is to the question. If you have very high faithfulness (e.g. the answer is 100% factually correct) but low answer relevancy, you may need to better prompt the model to ignore context that isn't relevant to the user's question.
  • Retrieval: how good your search/data was at grabbing the desired data/material for the user's question (a sketch using the ragas library follows this list)

    • Context precision: the signal-to-noise ratio of the retrieved content. This is very useful for determining how good your search is at finding the 'needle in the haystack', as it were. It isn't always best to shove more context into the LLM, as that can cause it to gloss over relevant content. You can go through every block of content grabbed by the similarity search and see whether that content was used in the answer or not. If you're only using 2/10 blocks of content (and getting correct answers), you can look at paring down the amount of content you return for searches of that type. Very important to watch this one, as it can save you $$ by injecting fewer tokens into the prompt. It answers the question "is the context we're shoving in here helping us or hurting us?"
    • Context recall: can it retrieve all the relevant information required to answer the question? This is almost the opposite of context precision: it measures how much of the information needed to answer the user's question actually made it into the retrieved context. This tells you how good your search is. If this number is really low but you know the information is available in your data source, you can look at things like re-ranking or a different embedding strategy.
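A sketch of scoring these four metrics with the ragas library, roughly following its quickstart as of late 2023; exact column names and APIs may differ by version. Assumes `pip install ragas datasets` and an `OPENAI_API_KEY` in the environment; the single example row reuses the Canada question from earlier.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# One toy eval row: the question, the generated answer, the retrieved
# context blocks, and a ground-truth answer (needed for recall).
data = {
    "question": ["What is the population of Canada?"],
    "answer": ["Canada's population was estimated at 39,858,480 on April 1, 2023."],
    "contexts": [[
        "Statistics Canada estimated the population at 39,858,480 on April 1, 2023."
    ]],
    "ground_truths": [["About 39.9 million as of April 2023."]],
}

results = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
```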

All of that to say these are the ways to implement and evaluate RAG in your applications. But doubly important to realize that you may not have a RAG-shaped problem.

Fine Tuning

[slide: fine-tune]

Finetuning can be described as a slight additional training phase layered on top of extensive pretraining. In simpler terms, imagine you've trained a large model on a broad spectrum of data. Finetuning is like giving this model a short, specialized course to help it perform better in certain tasks.

  • Primary Uses:
    • Style of Output: Finetuning often focuses on refining the model's style of output.
    • New Facts: While finetuning isn't typically used for memorizing new facts, this is a rapidly advancing area of study. When it comes to new fact retrieval or augmentation, solutions like RAG might be more effective.
    • Applications: You might finetune a model to produce results in JSON format or to answer in a specific style and tone, as highlighted by this guide. However, at the current stage, finetuning isn't the method to extend a model's knowledge cutoff.

Benefits of fine-tuning

  • Improve model performance

    • Often a more effective way of improving model performance than prompt-engineering or FSL (few shot learning)
  • Improve model efficiency

    • Reduce the number of tokens needed to get a model to perform well on your task
    • Distill the expertise of a large model into a smaller one (a sketch of the fine-tuning API flow follows this list)
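A minimal sketch of OpenAI's fine-tuning flow as of DevDay 2023: upload a JSONL of chat examples, then create a job. Assumes the openai>=1.0 client; the file name and contents are toy illustrations.

```python
from openai import OpenAI

client = OpenAI()

# training.jsonl: one JSON object per line, e.g.
# {"messages": [{"role": "system", "content": "..."},
#               {"role": "user", "content": "..."},
#               {"role": "assistant", "content": "..."}]}
training_file = client.files.create(
    file=open("training.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",  # distill a big model's outputs into this smaller one
)
print(job.id, job.status)
```

Pair this with the eval baseline discussed above: fine-tune on a small sample first, re-run the evals, and only scale up the data set if the scores move.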

Example of Fine-tuning for a specific use case

Without fine-tuning: [slide: without-fine-tuning] / With fine-tuning: [slide: with-fine-tuning]

There are great takeaways here that aren't super apparent at first glance. If you have a domain-specific task, fine-tuning not only makes the model better, cheaper, and faster at the given task; it also makes it easier to use. Compare the prompts in each image. The one without fine-tuning (which even has a mistake) has an elaborate, in-depth prompt that the average real estate agent isn't going to know how to write themselves (at least right now). The prompt on the fine-tuned model that gets the desired output is incredibly terse and copy-paste-looking. Easy to adopt for those who don't spend all their time playing with AI. There's a good lesson in there about providing useful products, and not over-focusing on the 'DO THIS WITH AI' tag so many companies are opting for.


Fine-tuning Intuition: -- If prompt engineering isn't helping, fine-tuning likely isn't right for your use-case

| Good for: | Not good for: |
| --- | --- |
| Emphasizing knowledge that already exists in the model (e.g. SQL queries) | Adding new knowledge to the base model (your training data is SO MUCH SMALLER than what it's already been trained on) |
| Customizing the structure/tone of responses (e.g. respond in JSON) | Iterating on new use cases (you need a good amount of data, it's slow to do, and expensive) |
| Teaching a model very complex instructions (e.g. you have so many steps in a process that the context window is too small, or can only provide 1-2 examples) |  |

Best Practices

  • Start with prompt engineering and FSL (few shot learning). This will let you evaluate your use case and see if fine-tuning could even be useful here. Remember the rule of thumb: if prompting isn't having any effect on the output, it probably isn't a good use case for fine-tuning.
  • Establish a baseline. You MUST have evals in place if you want to actually see whether fine-tuning is helpful for you. You need a performance baseline to see if you're moving in the right direction or not.
  • Start small and focus on quality. You don't need HUGE amounts of data before seeing improvements. It's a really good idea to create a sample that's maybe 5-10% of what the 'production data set' fine-tune would be. You can then use that baseline to see if fine-tuning on that data set starts to push you towards the desired result.
  • Quality > quantity. You are never going to come close to matching the base training set of these large language models. The quality of the data is far more important.