This course focused on using existing LLMs to build, test, and monitor apps. It introduced various ways to interact with LLMs through the chat endpoint, several ways to think about RAG and LLM usage patterns, and featured a number of guest speakers from various companies talking about their products and how to use them in conjunction with LLMs to build great applications.
Evaluation was a key driver for taking this course in the first place. We covered the following ways to evaluate LLM outputs:
- Vibe checks: feeding a number of prompts to the LLM and seeing what comes out. It's useful in prototyping, but it's not a way to live. However, I think there's value here in developing an [[LLM Prompt Checklist]] for prompt engineers to run through.
- Numerical experiments: if it's possible to numerically check your LLM outputs, you should create test datasets and run experiments on your LLMs. These tables can be kept as Weights & Biases table artifacts and analyzed, or the results can be traced via W&B as well. An example is the equation solver, though not every problem admits this kind of check.
- Eval chains: LangChain has this built-in, but it's just a prompt. Essentially, you generate question-answer pairs and have the LLM evaluate answers as correct or incorrect. This relies on the assumption (generally sound) that evaluation is simpler and more reliable than generation.
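The eval-chain idea is small enough to sketch directly. The grading template below is illustrative (LangChain ships its own), and `call_llm` is a hypothetical stand-in for your chat-endpoint call:

```python
# Sketch of an eval chain: an LLM grades a candidate answer against a
# reference answer. EVAL_TEMPLATE is an illustrative grading prompt,
# not LangChain's exact built-in one.
EVAL_TEMPLATE = """You are a teacher grading a quiz.
QUESTION: {question}
TRUE ANSWER: {answer}
STUDENT ANSWER: {prediction}
Grade the student answer as CORRECT or INCORRECT. Reply with one word."""

def grade(question: str, answer: str, prediction: str, call_llm) -> bool:
    """Return True when the grader LLM judges the prediction correct."""
    prompt = EVAL_TEMPLATE.format(
        question=question, answer=answer, prediction=prediction
    )
    verdict = call_llm(prompt).strip().upper()
    return verdict == "CORRECT"
```

Because the grader only has to emit one word, its output is easy to parse and to spot-check by hand, which is exactly why evaluation tends to be more reliable than generation.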
There are some pieces missing here, notably:
- When can we not rely on LLM evals (who is checking the checker?)
- How do we evaluate the quality of embeddings in various settings (e.g., code vs English text)? What are some metrics we can use here?
- I would like to see more examples of different types of LLM interactions. For example, suppose we are training an LLM to reply in pirate jargon; how do we evaluate whether it does that well?
LLM systems are composed of several components, and it's important to test them both in isolation and together. You can use experiments that have an accuracy/fitness metric to test combinations. In the course example, four different LLMs were tested alongside other parameters such as temperature, database, template, etc.
In a real test, you'd probably want to run with different indices as well.
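Enumerating the combinations is the mechanical part of such an experiment. A minimal sketch, with a hypothetical parameter grid (the model names and options are placeholders, and `run_experiment` would be your own pipeline-plus-metric function):

```python
import itertools

# Hypothetical parameter grid for a component-combination experiment.
# Each config would be passed to a run_experiment() that builds the
# pipeline and returns an accuracy/fitness score.
grid = {
    "model": ["gpt-3.5-turbo", "gpt-4", "claude-2", "llama-2-70b"],
    "temperature": [0.0, 0.7],
    "template": ["concise", "verbose"],
}

# Cartesian product of all parameter values -> one dict per run.
configs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
print(len(configs))  # 4 * 2 * 2 = 16 runs to score and compare
```

Adding a fourth axis for the index (as suggested above) is just another key in the grid, though the run count multiplies accordingly.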
Common retrieval failure modes and their mitigations:
- Proper names/terms not appearing in the embedding model -> combine embedding search with keyword-based search
- Questions and answers are far apart in the embedding space -> use HyDE (generate a hypothetical answer and search with that)
- Limited documents in the prompt result in low document diversity -> MMR (maximal marginal relevance) search
- Domain-specific questions and documents may not have good embedding representations -> train a custom embedding model
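MMR is simple enough to sketch in full. This toy version works from pre-computed similarity scores (in practice you'd compute cosine similarities from your embeddings) and greedily trades off query relevance against redundancy with the already-selected documents:

```python
# Toy MMR (maximal marginal relevance): greedily pick documents that
# are relevant to the query but dissimilar to what's already selected.
def mmr(query_sims, doc_sims, k, lam=0.5):
    """query_sims[i]: similarity of doc i to the query.
    doc_sims[i][j]: similarity between docs i and j.
    lam: 1.0 = pure relevance, 0.0 = pure diversity."""
    selected = []
    candidates = list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def score(i):
            # Penalize by the closest already-selected document.
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lam * query_sims[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; MMR skips the duplicate in favor
# of the less relevant but more diverse doc 2.
query_sims = [0.9, 0.85, 0.3]
doc_sims = [[1.0, 0.95, 0.1], [0.95, 1.0, 0.1], [0.1, 0.1, 1.0]]
print(mmr(query_sims, doc_sims, k=2))
```

With `lam=0.5`, the near-duplicate's redundancy penalty outweighs its relevance, which is exactly the diversity effect the bullet above describes.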
Strategies for answering over more documents than fit in a single prompt:
- MapReduce: run an initial prompt on each chunk of data and combine the outputs in a separate prompt
- Refine: run an initial prompt on the first chunk of data, then refine the output with each subsequent document
- Ranking: run an initial prompt on each chunk of data, rank the responses by certainty score, and return the highest-scoring answer
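The MapReduce strategy above can be sketched in a few lines. `summarize` is a hypothetical stand-in for an LLM call, and the prompt wording is illustrative:

```python
# MapReduce over document chunks: summarize each chunk independently
# (the "map" step), then combine the partial summaries in one final
# prompt (the "reduce" step). `summarize` is any text -> text LLM call.
def map_reduce(chunks, summarize):
    partials = [summarize(f"Summarize this passage:\n{chunk}") for chunk in chunks]
    combined = "\n".join(partials)
    return summarize(f"Combine these partial summaries into one:\n{combined}")
```

The map step is embarrassingly parallel, which is the main practical advantage of MapReduce over Refine's inherently sequential pass.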
The course also surveyed a long list of prompting techniques:
- Zero-shot
- Few-shot
- Chain-of-thought
- Self-consistency
- Generate knowledge
- Automatic prompt engineer
- Active-prompt
- Directional stimulus prompting
- ReAct prompting
- Multimodal CoT
- Graph prompting
- Tree of thoughts prompting
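Few-shot prompting is the easiest of these to make concrete. A minimal sketch, reusing the pirate-jargon task from the evaluation wishlist above (examples and wording are my own, not from the course):

```python
# Few-shot prompt assembly: prepend worked Q/A examples so the model
# imitates the demonstrated style on the new question.
examples = [
    ("How are you?", "Arr, I be shipshape, matey!"),
    ("Where is the library?", "Yarr, the book-hoard lies two streets yonder!"),
]

def few_shot_prompt(question: str) -> str:
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"Answer every question in pirate jargon.\n{shots}\nQ: {question}\nA:"

print(few_shot_prompt("What time is it?"))
```

Zero-shot is the same prompt with the `shots` section removed; most of the fancier techniques in the list are variations on what goes between the instruction and the final question.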
The course featured several third-party speakers who spoke about the products that they're building to support LLM application developers.
- Chroma: open-source embedding database. Would be a useful resource for learning how to assess embedding quality
- Guardrails is an open-source package that helps build reliable LLM applications. It takes the output of the LLM and runs validation checks against it. It has a few strategies for handling a validation error:
- Reask
- Fix
- Filter
- Refrain
- Noop
- Exception
- Rebuff.ai is a security tool that hardens your LLM application against prompt injection
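The Guardrails "reask" strategy above can be sketched generically. This is not the actual Guardrails API, just a plain-Python illustration of the idea, with `call_llm` and `validate` as hypothetical stand-ins:

```python
# Generic sketch of the "reask" strategy: validate the LLM output,
# and on failure send the error back to the model with a request to
# correct itself. Falls through to the "Exception" strategy.
def ask_with_reask(prompt, call_llm, validate, max_retries=2):
    output = call_llm(prompt)
    for _ in range(max_retries):
        error = validate(output)  # returns None when the output is valid
        if error is None:
            return output
        output = call_llm(
            f"{prompt}\nYour previous answer was invalid ({error}). "
            "Please correct it."
        )
    raise ValueError("validation failed after retries")
```

The other strategies slot into the same loop: "Fix" patches the output locally instead of re-calling the model, "Filter"/"Refrain" drop the bad output, and "Noop" returns it unchanged.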
- Weights & Biases Artifacts: you can store models and tables in artifacts to analyze and retrieve later
- W&B Prompts, for LLM pipelines:
- Track inputs/outputs
- Debug chains
- Evaluate performance
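W&B Prompts provides this kind of tracing out of the box; as a plain-Python illustration of what "track inputs/outputs" and "debug chains" mean, here is a hand-rolled sketch (the chain step and all names are hypothetical):

```python
import functools
import time

# Hand-rolled trace log: record each chain step's inputs, output, and
# latency so a misbehaving chain can be inspected after the fact.
TRACE = []

def traced(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACE.append({
            "step": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "seconds": time.perf_counter() - start,
        })
        return result
    return wrapper

@traced
def rewrite_query(q: str) -> str:
    return q.lower()  # stand-in for an LLM chain step

rewrite_query("What Is RAG?")
print(TRACE[0]["step"], TRACE[0]["output"])
```

Each trace record maps onto the bullets above: inputs/outputs for tracking, the per-step records for debugging chains, and the latency field as one crude performance signal.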