cedrickchee/gen_ai_engineering.md

## gen_ai_engineering.md

      
# Generative AI Engineering

Building applications with foundation models.
Things to consider when using proprietary models and open models.


| | Proprietary Models | Open Models |
|---|---|---|
| Data | Have to send your data to model providers, which means your team can accidentally leak confidential information | Fewer checks and balances for data lineage/training data copyright |
| Functionality | - More likely to support function calling and JSON mode<br>- Less likely to expose log probs, which are helpful for classification tasks, evaluation, and interpretability | - No/limited support for function calling and JSON mode<br>- Can access log probs and intermediate outputs |
| Cost | API calls can get expensive at scale | Talent, time, and engineering effort to optimize, host, and maintain. Can be mitigated by using model hosting services. |
| Finetuning | Can only finetune the models that model providers let you finetune | In theory, you can finetune/quantize/optimize any model, but it can be hard to do so |
| Transparency | Lack of transparency in model changes and versioning | Easier to inspect changes in open models |
| Control and access | - Rate limits<br>- Model providers can stop supporting a model or features that you're using | No rate limits, but you're responsible for maintaining SLAs |
| Edge use cases | Can't run on device without Internet access | Can run on device, but again, might be hard to do so |
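
To make the point about log probs and classification concrete, here is a minimal sketch. It assumes you already have one log probability per candidate label token from whatever model or API you use. The function names, the example numbers, and the threshold are all hypothetical and not tied to any particular provider.

```python
import math

def label_probabilities(label_logprobs):
    """Normalize per-label log probabilities into a distribution over the labels."""
    # Subtract the max before exponentiating for numerical stability.
    top = max(label_logprobs.values())
    weights = {label: math.exp(lp - top) for label, lp in label_logprobs.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}

def classify(label_logprobs, threshold=0.5):
    """Return (label, confidence); label is None when confidence is below threshold."""
    probs = label_probabilities(label_logprobs)
    label = max(probs, key=probs.get)
    confidence = probs[label]
    return (label if confidence >= threshold else None), confidence

# Hypothetical log probs a model assigned to each candidate label token.
example = {"positive": -0.2, "negative": -1.9, "neutral": -3.5}
label, confidence = classify(example)
print(label, round(confidence, 2))
```

Without log probs you only see the label the model sampled. With them you also get a confidence you can threshold on, which is why their absence hurts classification and evaluation work.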


## Model Evaluations

A big issue I see with AI systems is that people aren't spending enough time evaluating their evaluation pipeline.


Most teams use more than one metric (typically 3-7) to evaluate their applications, which is good practice.
However, very few measure the correlation between these metrics.
If two metrics are perfectly correlated, you probably don't need both of them.
If two metrics strongly disagree with each other, either this reveals something important about your system, or your metrics just aren't trustworthy.
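
One way to check this is to score the same set of responses with every metric and look at pairwise correlations. The sketch below uses plain Pearson correlation. The metric names, the scores, and the two cutoffs (`redundant`, `conflicting`) are made up for illustration, and you may prefer a rank correlation if your metrics are on different scales.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists (nan if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

def metric_correlations(scores):
    """Correlation for every pair of metrics; scores maps metric name -> per-response scores."""
    return {(a, b): pearson(scores[a], scores[b]) for a, b in combinations(scores, 2)}

def flag_pairs(correlations, redundant=0.95, conflicting=-0.5):
    """Split metric pairs into near-duplicates and pairs that disagree strongly."""
    flags = {"redundant": [], "conflicting": []}
    for pair, r in correlations.items():
        if r >= redundant:
            flags["redundant"].append(pair)
        elif r <= conflicting:
            flags["conflicting"].append(pair)
    return flags

# Hypothetical scores for the same six responses under three metrics.
scores = {
    "relevance": [0.9, 0.4, 0.7, 0.2, 0.8, 0.5],
    "relevance_v2": [0.8, 0.3, 0.6, 0.1, 0.7, 0.4],
    "conciseness": [0.2, 0.7, 0.4, 0.9, 0.3, 0.6],
}
flags = flag_pairs(metric_correlations(scores))
print(flags)
```

A pair flagged as redundant is a candidate for dropping one metric. A pair flagged as conflicting is worth reading examples for, to decide whether it points at a real trade-off or at a broken metric.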


Many teams (I estimate 60-70%?) use AI to evaluate AI responses, with common criteria being
conciseness, relevance, coherence, faithfulness, etc.
I find AI-as-a-judge very promising, and expect to see more of this approach in the future.
However, AI-as-a-judge scores aren’t deterministic the way classification F1 scores or accuracy are.
They depend on the model, the judge's prompt, and the use case.
Many AI judges are good, but many are bad.
Yet very few teams run experiments to evaluate their AI judges.
Are good responses given better scores?
How reproducible are the scores: if you ask the judge twice, do you get the same score?
Is the judge's prompt optimal?
Some teams aren’t even aware of the prompts their applications are using,
because they use prompts created by eval tools or by other teams.
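
The first two questions can be turned into small experiments. The sketch below assumes a `judge` callable that takes a response and returns a numeric score, plus a handful of human-labeled (better, worse) response pairs. `noisy_length_judge` is a toy stand-in I made up so the example runs; in practice you would wrap your real judge model and prompt.

```python
import random
import statistics

def repeat_scores(judge, response, trials=5):
    """Ask the judge to score the same response several times."""
    return [judge(response) for _ in range(trials)]

def reproducibility(judge, responses, trials=5):
    """Mean per-response standard deviation of repeated scores (0 means fully reproducible)."""
    spreads = [statistics.pstdev(repeat_scores(judge, r, trials)) for r in responses]
    return statistics.mean(spreads)

def preference_accuracy(judge, labeled_pairs, trials=5):
    """Fraction of (better, worse) pairs where the better response gets the higher mean score."""
    wins = 0
    for better, worse in labeled_pairs:
        better_mean = statistics.mean(repeat_scores(judge, better, trials))
        worse_mean = statistics.mean(repeat_scores(judge, worse, trials))
        wins += better_mean > worse_mean
    return wins / len(labeled_pairs)

rng = random.Random(0)

def noisy_length_judge(response):
    # Toy stand-in for a real judge: prefers short responses on a 1-5 scale, with jitter.
    base = 5 if len(response) < 40 else 2
    return max(1, min(5, base + rng.choice([-1, 0, 1])))

# Hypothetical human-labeled pairs: (better response, worse response).
pairs = [
    ("Paris is the capital of France.",
     "Well, there are many cities in France, and the capital could be Lyon or Paris."),
    ("2 + 2 = 4.",
     "The answer depends on context, but it is usually considered to be about four."),
]
all_responses = [r for pair in pairs for r in pair]
print("mean score spread:", reproducibility(noisy_length_judge, all_responses))
print("preference accuracy:", preference_accuracy(noisy_length_judge, pairs))
```

The same two numbers can be recomputed for each candidate judge prompt, which gives a rough way to compare prompts instead of assuming the current one is good.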


Also, a fun fact I learned from a (small) poll yesterday:
some teams are spending more money on evaluating models’ responses than on generating responses.

Sources:

https://www.linkedin.com/posts/chiphuyen_aiengineering-aiapplications-llms-activity-7191471862994931713-T-3B
https://www.linkedin.com/posts/chiphuyen_aiengineering-llms-aievaluation-activity-7194734998376050688-uP2s