@nsadeh
Created February 29, 2024 21:20
Building LLM Applications notes

Course link

Course Repo

Summary

This course focused on using existing LLMs to build, test, and monitor applications. It introduced different ways to interact with LLMs through the chat endpoint, ways to think about RAG and common LLM usage patterns, and featured a number of guest speakers from various companies talking about their products and how to use them in conjunction with LLMs to build great applications.

Ways to Evaluate LLM Outputs

This was a key driver for taking this course in the first place. We covered the following ways to evaluate LLM outputs:

  1. Vibe checks: feeding a number of prompts to the LLM and seeing what comes out. It's useful in prototyping, but it's not a way to live. However, I think there's value here in developing an [[LLM Prompt Checklist]] for prompt engineers to run through.
  2. Numerical experiments: if it's possible to numerically check your LLM outputs, you should create test datasets and run experiments on your LLMs. The results can be kept as Weights & Biases table artifacts and analyzed, or traced via W&B as well. An example is the equation solver, but not every problem lends itself to numerical checks.
  3. Eval chains: LangChain has this built-in, but it's just a prompt. Essentially, you generate question-answer pairs and have the LLM evaluate answers as correct or incorrect. This relies on the assumption (generally sound) that evaluation is simpler and more reliable than generation. A minimal sketch follows this list.
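
A minimal sketch of the eval-chain idea, written directly against the OpenAI chat API rather than LangChain's built-in evaluator. The grading prompt, model name, and `qa_pairs` data are illustrative assumptions, not from the course materials.

```python
# Eval-chain sketch: have an LLM grade question/answer pairs.
# The grading prompt, model name, and qa_pairs are illustrative only.
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = """You are grading a student's answer against a reference answer.
Question: {question}
Reference answer: {reference}
Student answer: {prediction}
Reply with exactly one word: CORRECT or INCORRECT."""

qa_pairs = [
    {"question": "What does RAG stand for?",
     "reference": "Retrieval-augmented generation",
     "prediction": "Retrieval augmented generation"},
]

def grade(pair: dict) -> str:
    # Evaluation is treated as simpler than generation: a single yes/no call.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{"role": "user", "content": GRADER_PROMPT.format(**pair)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

for pair in qa_pairs:
    print(pair["question"], "->", grade(pair))
```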

There are some pieces missing here, notably:

  1. When can we not rely on LLM evals (who is checking the checker)?
  2. How do we evaluate the quality of embeddings in various settings (e.g., code vs English text)? What are some metrics we can use here?
  3. I would like to see more examples of different types of LLM interactions. For example, suppose we are training an LLM to reply in pirate jargon; how do we evaluate whether it does that well?

Identifying Areas of Improvement

LLM systems are composed of several components, and it's important to test them both in isolation and together. You can use experiments with an accuracy/fitness metric to test combinations. In the course example, four different LLMs were tested alongside other parameters such as temperature, database, template, etc.

In a real test, you'd probably want to run with different indices as well.
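
A sketch of what such a combinatorial experiment might look like, sweeping over model, temperature, and prompt template with `itertools.product`. The configuration values and the stubbed pipeline and metric are placeholders, not the course's actual experiment.

```python
# Sketch of sweeping LLM-system components to find areas of improvement.
# run_pipeline() and evaluate_accuracy() are stand-in stubs.
from itertools import product
import random

def run_pipeline(questions, model, temperature, template):
    # Stand-in for the real retrieval + generation chain.
    return [f"answer from {model}" for _ in questions]

def evaluate_accuracy(outputs, references):
    # Stand-in metric; replace with a real comparison against references.
    return random.random()

test_questions = ["What is RAG?", "What is MMR?"]
reference_answers = ["Retrieval-augmented generation", "Maximal marginal relevance"]

models = ["gpt-3.5-turbo", "gpt-4"]   # example choices
temperatures = [0.0, 0.7]
templates = ["concise", "detailed"]

results = []
for model, temperature, template in product(models, temperatures, templates):
    config = {"model": model, "temperature": temperature, "template": template}
    outputs = run_pipeline(test_questions, **config)
    accuracy = evaluate_accuracy(outputs, reference_answers)
    results.append({**config, "accuracy": accuracy})

best = max(results, key=lambda r: r["accuracy"])
print("Best configuration:", best)
```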

Some tips for document search:

  • Proper names/terms not represented in the embedding model -> combine embedding search with keyword-based search
  • Questions and answers are far apart in the embedding space -> use HyDE (generate a hypothetical answer and search with its embedding; see the sketch after this list)
  • Limited documents in the prompt result in low document diversity -> use MMR (maximal marginal relevance) search
  • Domain-specific questions and documents may not have good embedding representations -> train a custom embedding model
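
A minimal HyDE sketch, assuming the OpenAI client and a generic `vector_search` callable over your index. The prompt wording, model names, and `vector_search` are illustrative assumptions.

```python
# HyDE sketch: generate a hypothetical answer, then search with its
# embedding instead of the raw question's. vector_search() is a
# hypothetical stand-in for your vector database query.
from openai import OpenAI

client = OpenAI()

def hyde_retrieve(question: str, vector_search, k: int = 5):
    # 1. Ask the LLM to write a plausible (possibly wrong) answer.
    draft = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers: {question}"}],
    ).choices[0].message.content

    # 2. Embed the hypothetical answer, which tends to sit closer to real
    #    answer documents than the question does.
    embedding = client.embeddings.create(
        model="text-embedding-3-small",  # assumed model name
        input=draft,
    ).data[0].embedding

    # 3. Query the vector index with that embedding.
    return vector_search(embedding, k=k)
```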

Patterns of RAG interaction

  • MapReduce: run an initial prompt on each chunk of data and combine the outputs in a separate prompt (a sketch follows this list)
  • Refine: run an initial prompt on the first chunk of data, then refine the output with each subsequent document
  • Ranking: run an initial prompt on each chunk of data, rank the responses by a certainty score, and return the highest-scoring answer
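
A rough map-reduce sketch in plain Python rather than LangChain's implementation. The `ask_llm` helper, model name, and prompt wording are assumptions for illustration.

```python
# Map-reduce RAG sketch: answer over each chunk independently, then
# combine the partial answers in a separate prompt.
from openai import OpenAI

client = OpenAI()

def ask_llm(prompt: str) -> str:
    # Single-call helper around the chat model (assumed model name).
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

def map_reduce_answer(question: str, chunks: list[str]) -> str:
    # Map step: run the prompt independently over every chunk.
    partials = [
        ask_llm(f"Using only this context, answer '{question}':\n{chunk}")
        for chunk in chunks
    ]
    # Reduce step: combine the partial answers in a separate prompt.
    combined = "\n".join(partials)
    return ask_llm(f"Combine these partial answers to '{question}':\n{combined}")
```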

Types of Prompt Engineering

  • Zero-shot
  • Few-shot
  • Chain-of-thought
  • Self-consistency
  • Generate knowledge
  • Automatic prompt engineer
  • Active-prompt
  • Directional stimulus prompting
  • ReAct prompting
  • Multimodal CoT
  • Graph prompting
  • Tree of thoughts prompting
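
For reference, an illustrative contrast between the first three styles in the list. The task and wording are my own examples, not from the course.

```python
# Illustrative prompt templates: zero-shot vs. few-shot vs. chain-of-thought.
question = "A pack has 12 pencils and costs $3. How much does one pencil cost?"

# Zero-shot: just the question.
zero_shot = f"{question}\nAnswer:"

# Few-shot: prepend worked examples of the same task.
few_shot = (
    "Q: A box has 10 apples and costs $5. How much is one apple?\nA: $0.50\n"
    "Q: A crate has 20 oranges and costs $4. How much is one orange?\nA: $0.20\n"
    f"Q: {question}\nA:"
)

# Chain-of-thought: invite intermediate reasoning before the answer.
chain_of_thought = f"{question}\nLet's think step by step."
```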

Third Party Guests

The course featured several third-party speakers who spoke about the products that they're building to support LLM application developers.

  • Chroma: an open-source embedding database. Would be a useful resource for learning how to assess embedding quality
  • Guardrails is an open-source package that helps build reliable LLM applications. It takes the output of the LLM and runs validation checks against it. It has a few strategies for what to do in case of a validation error (a hand-rolled sketch of this loop follows this list):
    • Reask
    • Fix
    • Filter
    • Refrain
    • Noop
    • Exception
  • Rebuff.ai is security software that hardens your LLM application against prompt injection
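
A hand-rolled sketch of the validate-and-reask loop described above, not the Guardrails API itself. The JSON validator, `ask_llm` parameter, and retry wording are assumptions for illustration.

```python
# Hand-rolled validate-and-reask loop in the spirit of Guardrails
# (this is NOT the Guardrails API). ask_llm is any prompt -> str callable.
import json

def validate(output: str):
    # Example check: output must be valid JSON with a "summary" key.
    try:
        data = json.loads(output)
        return ("summary" in data), data
    except json.JSONDecodeError:
        return False, None

def ask_with_reask(prompt: str, ask_llm, max_attempts: int = 3):
    output = ask_llm(prompt)
    for attempt in range(max_attempts):
        ok, parsed = validate(output)
        if ok:
            return parsed                  # validation passed
        if attempt == max_attempts - 1:
            break
        # Reask strategy: feed the failure back to the model and try again.
        output = ask_llm(
            f"{prompt}\n\nYour previous output was invalid:\n{output}\n"
            "Return valid JSON with a 'summary' field."
        )
    return None                            # Refrain-style fallback
```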

Weights and Biases Features Used

  • Artifacts: you can store models and tables in artifacts to analyze and retrieve later (see the sketch below)
  • Prompts for LLMs:
    • Track inputs/outputs
    • Debug chains
    • Evaluate performance
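
A minimal sketch of logging evaluation results as a W&B table and storing it in an artifact for later retrieval. The project name, columns, and rows are made up for illustration.

```python
# Minimal W&B sketch: log evaluation rows as a Table and keep it in an
# Artifact. Project name, columns, and data are illustrative placeholders.
import wandb

run = wandb.init(project="llm-apps-course", job_type="evaluation")

table = wandb.Table(columns=["question", "answer", "grade"])
table.add_data("What is RAG?", "Retrieval-augmented generation", "CORRECT")

run.log({"eval_results": table})           # browsable in the W&B UI

artifact = wandb.Artifact("eval-results", type="evaluation")
artifact.add(table, "eval_table")
run.log_artifact(artifact)

run.finish()
```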