This course focused on using existing LLMs to build, test, and monitor apps. It introduced various ways to interact with LLMs through the chat endpoint, several ways to think about RAG and LLM usage patterns, and featured a number of guest speakers from various companies talking about their products and how to use them in conjunction with LLMs to build great applications.
Evaluation was a key driver for taking this course in the first place. We covered the following ways to evaluate LLM outputs:
- Vibe checks: feeding a number of prompts to the LLM and seeing what comes out. It's useful in prototyping, but it's not a way to live. However, I think there's value here in developing an [[LLM Prompt Checklist]] for prompt engineers to run through.
- Numerical experiments: if it's possible to numerically check your LLM outputs, you should create test datasets and run experiments on your LLMs. These tables can be kept as Weights & Biases table artifacts and analyzed, or the results can be traced via W&B as well. An example is the equation solver, though not every problem admits this kind of check.
- Eval chains: LangChain has this built-in, but it's just a prompt. Essentially, you generate question-answer pairs and have the LLM evaluate answers as correct or incorrect. This relies on the assumption (generally sound) that evaluation is simpler and more reliable than generation.
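The eval-chain idea is small enough to sketch directly. The grading template below is illustrative (LangChain ships its own), and `call_llm` is a hypothetical stand-in for your chat-endpoint call:

```python
# Sketch of an eval chain: an LLM grades a candidate answer against a
# reference answer. EVAL_TEMPLATE is an illustrative grading prompt,
# not LangChain's exact built-in one.
EVAL_TEMPLATE = """You are a teacher grading a quiz.
QUESTION: {question}
TRUE ANSWER: {answer}
STUDENT ANSWER: {prediction}
Grade the student answer as CORRECT or INCORRECT. Reply with one word."""

def grade(question: str, answer: str, prediction: str, call_llm) -> bool:
    """Return True when the grader LLM judges the prediction correct."""
    prompt = EVAL_TEMPLATE.format(
        question=question, answer=answer, prediction=prediction
    )
    verdict = call_llm(prompt).strip().upper()
    return verdict == "CORRECT"
```

Because the grader only has to emit one word, its output is easy to parse and to spot-check by hand, which is exactly why evaluation tends to be more reliable than generation.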
There are some pieces missing here, notably:
- When can we not rely on LLM evals (who is checking the checker?)
- How do we evaluate the quality of embeddings in various settings (e.g., code vs English text)? What are some metrics we can use here?
- I would like to see more examples of different types of LLM interactions. For example, suppose we are training an LLM to reply in pirate jargon; how do we evaluate whether it does that well?
LLM systems are composed of several components, and it's important to test them both in isolation and together. You can use experiments that have an accuracy/fitness metric to test combinations. In the course example, four different LLMs were tested alongside other parameters such as temperature, database, template, etc.
In a real test, you'd probably want to run with different indices as well.
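Enumerating the combinations is the mechanical part of such an experiment. A minimal sketch, with a hypothetical parameter grid (the model names and options are placeholders, and `run_experiment` would be your own pipeline-plus-metric function):

```python
import itertools

# Hypothetical parameter grid for a component-combination experiment.
# Each config would be passed to a run_experiment() that builds the
# pipeline and returns an accuracy/fitness score.
grid = {
    "model": ["gpt-3.5-turbo", "gpt-4", "claude-2", "llama-2-70b"],
    "temperature": [0.0, 0.7],
    "template": ["concise", "verbose"],
}

# Cartesian product of all parameter values -> one dict per run.
configs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
print(len(configs))  # 4 * 2 * 2 = 16 runs to score and compare
```

Adding a fourth axis for the index (as suggested above) is just another key in the grid, though the run count multiplies accordingly.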
Common retrieval failure modes and their mitigations:
- Proper names/terms not appearing in the embedding model -> combine embedding search with keyword-based search
- Questions and answers are far apart in the embedding space -> use HyDE (generate a hypothetical answer and search with that)
- Limited documents in the prompt result in low document diversity -> MMR (maximal marginal relevance) search
- Domain-specific questions and documents may not have good embedding representations -> train a custom embedding model
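MMR is simple enough to sketch in full. This toy version works from pre-computed similarity scores (in practice you'd compute cosine similarities from your embeddings) and greedily trades off query relevance against redundancy with the already-selected documents:

```python
# Toy MMR (maximal marginal relevance): greedily pick documents that
# are relevant to the query but dissimilar to what's already selected.
def mmr(query_sims, doc_sims, k, lam=0.5):
    """query_sims[i]: similarity of doc i to the query.
    doc_sims[i][j]: similarity between docs i and j.
    lam: 1.0 = pure relevance, 0.0 = pure diversity."""
    selected = []
    candidates = list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def score(i):
            # Penalize by the closest already-selected document.
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lam * query_sims[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; MMR skips the duplicate in favor
# of the less relevant but more diverse doc 2.
query_sims = [0.9, 0.85, 0.3]
doc_sims = [[1.0, 0.95, 0.1], [0.95, 1.0, 0.1], [0.1, 0.1, 1.0]]
print(mmr(query_sims, doc_sims, k=2))
```

With `lam=0.5`, the near-duplicate's redundancy penalty outweighs its relevance, which is exactly the diversity effect the bullet above describes.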
Strategies for answering over more documents than fit in a single prompt:
- MapReduce: run an initial prompt on each chunk of data and combine the outputs in a separate prompt
- Refine: run an initial prompt on the first chunk of data, then refine the output with each subsequent document
- Ranking: run an initial prompt on each chunk of data, rank the responses by certainty score, and return the highest-scoring answer
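The MapReduce strategy above can be sketched in a few lines. `summarize` is a hypothetical stand-in for an LLM call, and the prompt wording is illustrative:

```python
# MapReduce over document chunks: summarize each chunk independently
# (the "map" step), then combine the partial summaries in one final
# prompt (the "reduce" step). `summarize` is any text -> text LLM call.
def map_reduce(chunks, summarize):
    partials = [summarize(f"Summarize this passage:\n{chunk}") for chunk in chunks]
    combined = "\n".join(partials)
    return summarize(f"Combine these partial summaries into one:\n{combined}")
```

The map step is embarrassingly parallel, which is the main practical advantage of MapReduce over Refine's inherently sequential pass.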
The course also surveyed a long list of prompting techniques:
- Zero-shot
- Few-shot
- Chain-of-thought
- Self-consistency
- Generate knowledge
- Automatic prompt engineer
- Active-prompt
- Directional stimulus prompting
- ReAct prompting
- Multimodal CoT
- Graph prompting
- Tree of thoughts prompting
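Few-shot prompting is the easiest of these to make concrete. A minimal sketch, reusing the pirate-jargon task from the evaluation wishlist above (examples and wording are my own, not from the course):

```python
# Few-shot prompt assembly: prepend worked Q/A examples so the model
# imitates the demonstrated style on the new question.
examples = [
    ("How are you?", "Arr, I be shipshape, matey!"),
    ("Where is the library?", "Yarr, the book-hoard lies two streets yonder!"),
]

def few_shot_prompt(question: str) -> str:
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"Answer every question in pirate jargon.\n{shots}\nQ: {question}\nA:"

print(few_shot_prompt("What time is it?"))
```

Zero-shot is the same prompt with the `shots` section removed; most of the fancier techniques in the list are variations on what goes between the instruction and the final question.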
The course featured several third-party speakers who spoke about the products that they're building to support LLM application developers.
- Chroma: open-source embedding database. Would be a useful resource for learning how to assess embedding quality
- Guardrails is an open-source package that helps build reliable LLM applications. It takes the output of the LLM and runs validation checks against it. It has a few strategies for handling a validation error:
- Reask
- Fix
- Filter
- Refrain
- Noop
- Exception
- Rebuff.ai is a security tool that hardens your LLM application against prompt injection
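The Guardrails "reask" strategy above can be sketched generically. This is not the actual Guardrails API, just a plain-Python illustration of the idea, with `call_llm` and `validate` as hypothetical stand-ins:

```python
# Generic sketch of the "reask" strategy: validate the LLM output,
# and on failure send the error back to the model with a request to
# correct itself. Falls through to the "Exception" strategy.
def ask_with_reask(prompt, call_llm, validate, max_retries=2):
    output = call_llm(prompt)
    for _ in range(max_retries):
        error = validate(output)  # returns None when the output is valid
        if error is None:
            return output
        output = call_llm(
            f"{prompt}\nYour previous answer was invalid ({error}). "
            "Please correct it."
        )
    raise ValueError("validation failed after retries")
```

The other strategies slot into the same loop: "Fix" patches the output locally instead of re-calling the model, "Filter"/"Refrain" drop the bad output, and "Noop" returns it unchanged.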
- Weights & Biases Artifacts: you can store models and tables in artifacts to analyze and retrieve later
- W&B Prompts, for LLM pipelines:
- Track inputs/outputs
- Debug chains
- Evaluate performance
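W&B Prompts provides this kind of tracing out of the box; as a plain-Python illustration of what "track inputs/outputs" and "debug chains" mean, here is a hand-rolled sketch (the chain step and all names are hypothetical):

```python
import functools
import time

# Hand-rolled trace log: record each chain step's inputs, output, and
# latency so a misbehaving chain can be inspected after the fact.
TRACE = []

def traced(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACE.append({
            "step": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "seconds": time.perf_counter() - start,
        })
        return result
    return wrapper

@traced
def rewrite_query(q: str) -> str:
    return q.lower()  # stand-in for an LLM chain step

rewrite_query("What Is RAG?")
print(TRACE[0]["step"], TRACE[0]["output"])
```

Each trace record maps onto the bullets above: inputs/outputs for tracking, the per-step records for debugging chains, and the latency field as one crude performance signal.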