@szeitlin
Created June 26, 2024 19:42
AIQCon_notes_June25_2024

Mo Elshenawy - Cruise

Using GenAI for data mining re: specific scenarios and use cases not well-represented in the existing data.

Paper coming out soon


VC panel

James Cham - Bloomberg Beta (Weights & Biases, Kolena, Twilio, etc.)

  • he argued that quality is subjective (I think I see his point, but I don't agree)
  • he also said there's less innovation in applications than in developer tooling

Madison Faulkner - partner @NEA, led DS teams including @FB, focuses on Series B

Eric Carlborg - Lobby Capital, focuses on Series A

Natasha Mascarenhas - reporter @TheInformation


Gordon Hart - Kolena

Quality Signals - lightweight classifiers trained very fast (he didn't really explain how)

guardrails can slow things down

Example: watermark detection on images

few-shot classifier is a good balance between training a custom model (high accuracy, more effort) and zero-shot (low effort, but also low accuracy)

Embeddings from the SigLIP embedding model can capture subtle concepts, but that nuance gets lost in typical similarity comparisons. It's much better to use the embeddings directly as input to a classifier.

Example of using a tiny classifier with few-shot learning on labeled data:

  • multilayer perceptron with a single hidden layer, ~100 parameters
  • trains in <1 second on a CPU
  • gets an F1 score of ~50% with only 10 labeled example images (5 positive, 5 negative controls)
  • with only 16 images, the F1 score goes up to 90% with no false positives
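
Here's a minimal sketch of what a classifier like this could look like, assuming you already have image embeddings (e.g. from SigLIP) as arrays; the stand-in data, hidden-layer size, and split are illustrative assumptions, not Kolena's actual setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

# Stand-in for precomputed image embeddings + 0/1 labels (e.g. watermark vs. no watermark).
X, y = make_classification(n_samples=60, n_features=64, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=16, random_state=0)

# Tiny MLP with one small hidden layer; trains in well under a second on a CPU.
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)

print("F1:", f1_score(y_test, clf.predict(X_test)))
```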

You can use this model to clean historical data and to mine and filter new data; you can also run it live as a guardrail for regression monitoring in production.

The goal is to have tooling and processes to do the right thing easily and repeatably, so you can focus on your true objectives, not side quests


Amr Awadallah - Vectara: RAG vs. large context windows

Current: co-founder and CEO. Former: CTO of Cloudera, had another company he sold to Yahoo in 2000, and has a PhD from Stanford.

Vectara is "RAG in a box". You plug in your data, and it includes the vector database, guardrails, and quality signals

He says in 5 years every device will have a GenAI chat interface

He says hallucination happens because of lossy compression (100-1000x). Train on trillions of tokens, store 0.1-1% of the original size.

12.5% is roughly the maximum compression for English (per Shannon's information theory estimate of about 1 bit of entropy per 8-bit character, English text can't be losslessly compressed much below ~12.5% of its original size)

data --> Boomerang proprietary stateless retrieval model --> vector DB --> relevance scoring with reranker model --> prompt --> LLM

Boomerang "converts language to 'meaning space'" for many different languages (he named at least a few)

They published a leaderboard on HuggingFace with hallucination rates. GPT-4 Turbo is 2.5% https://huggingface.co/spaces/vectara/Hallucination-evaluation-leaderboard (Opus isn't listed so this may be a bit out of date?)

Long context window (LCW): Anthropic has 200k tokens; Google has 2M tokens (600 pages!). He says LCW is better for holistic analysis and relationships among results, but can be slower than RAG for most things (N log N for LCW with caching, N^2 for LCW without caching, vs. log N for RAG). He's saying that RAG is just as (?) "easy to update" as LCW, because he says caching actually makes updates harder (not sure I agree with this).

The model can get distracted by too much info in LCW (this is definitely my observation). LCW is better at finding a single needle in a haystack; RAG is better for multiple-needle-finding use cases.

Vectara has:

  • hallucination detection
  • explainability
  • prompt attack protections and RBAC
  • copyright detection
  • bias and toxicity mitigation
  • free trial available https://console.vectara.com/signup

Jerry Liu - CEO @LlamaIndex

PyPDF can't do complex docs

investing time in better parsing does pay off

their new library LlamaParse had a big impact on reducing hallucinations - https://github.com/run-llama/llama_parse
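
A minimal usage sketch of LlamaParse (the file path is hypothetical, and it assumes a LLAMA_CLOUD_API_KEY is set in the environment):

```python
from llama_parse import LlamaParse  # pip install llama-parse

# Parse a complex PDF (nested tables, figures) into LLM-friendly markdown.
parser = LlamaParse(result_type="markdown")  # reads LLAMA_CLOUD_API_KEY from the environment
documents = parser.load_data("./example_report_with_tables.pdf")  # hypothetical file path
print(documents[0].text[:500])
```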

Can combine advanced parsing with hierarchical indexing and retrieval; works with multimodal data.

they do recursive retrieval, e.g. summarize chunks, tables, and images, then index those summaries and link back to the original object
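
My rough sketch of that recursive-retrieval idea in plain Python (the data structures and scoring are illustrative, not LlamaIndex's implementation): score against embeddings of the summaries, but return the original objects they link to.

```python
from dataclasses import dataclass

@dataclass
class Node:
    summary_embedding: list   # embedding of a short summary of the chunk, table, or image
    original: object          # the full chunk/table/image it links back to

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def recursive_retrieve(query_embedding, nodes, top_k=3):
    # Score against the summaries, but hand back the original objects.
    scored = sorted(nodes, key=lambda n: cosine(query_embedding, n.summary_embedding), reverse=True)
    return [n.original for n in scored[:top_k]]
```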

advanced table understanding

some tips:

  • fix chunking, i.e. don't break up tables!

  • page-level chunking is often good

  • 5 levels of text splitting (this was a link in his slides)

  • metadata extraction - adding metadata to chunks helps a lot (this has been my experience too; see the sketch after this list)

  • indexing with a single vector is not enough - multiple vectors for the same text can be very helpful

  • docstore (k,v) in addition for the source docs is also helpful for caching and incremental syncing

  • consider a knowledge graph

  • consolidated, unified storage systems are needed (I agree - and see notes on LanceDB, apparently they're a customer)
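
A toy sketch of the page-level chunking + metadata tips above (the page splitting and metadata fields are my assumptions for illustration, not a LlamaIndex pipeline):

```python
# Toy page-level chunking with metadata attached to each chunk.
def chunk_by_page(pages, doc_title, source_path):
    """pages: list of page-text strings, e.g. from a PDF parser."""
    chunks = []
    for page_num, text in enumerate(pages, start=1):
        chunks.append({
            "text": text,
            "metadata": {                 # metadata travels with the chunk into the index
                "title": doc_title,
                "source": source_path,
                "page": page_num,
            },
        })
    return chunks

chunks = chunk_by_page(["page one text...", "page two text..."], "Q2 report", "./q2.pdf")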


Chang She - CEO & Co-founder at LanceDB: Self-optimizing RAG

@changiskhan one of the original co-authors of Pandas

RAG has a demo-to-prod gap

  • retrieval quality - the first 80% is easy
  • continual improvement: evals

His advice:

  1. Don't start with vector search, start with bm25 (see the bm25 sketch after this list)

https://docs.llamaindex.ai/en/stable/examples/retrievers/bm25_retriever/

full-text search often does better than vector search

  2. Chunking matters
  • examples: text, json, html, markdown, code
  • details matter - window size, overlap, delimiters
  • Langchain/LlamaIndex have many chunking processors; he said they have a blog post on this
  • evals speed things up
  • sample the data for the OpenAI embedding API to keep experimentation costs down
  3. Embedding model matters - the MTEB leaderboard on HuggingFace has embedding benchmarks for different use cases: https://huggingface.co/spaces/mteb/leaderboard

  4. Pick a good metric - NDCG is a good default for ranking/recsys (see the NDCG sketch further below) https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ndcg_score.html

  5. Fine-tuning - you don't need TBs of new data. He gave an example of a LanceDB customer who generated $10 worth of synthetic data (said he's going to share his slides)

  6. Hybrid search - multiple recallers for different use cases, then combine (see the fusion sketch below), e.g.

  • filters (use explicit structure), bm25 (keywords and fuzzy search), graph (relationships)
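
A minimal full-text baseline for tip 1, using the rank_bm25 package as one common option (the LlamaIndex BM25Retriever linked above is another); the corpus and query are toy stand-ins:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "The quarterly report shows revenue growth in EMEA",
    "Vector databases store embeddings for similarity search",
    "BM25 is a classic keyword-based ranking function",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "keyword ranking"
scores = bm25.get_scores(query.lower().split())            # one score per document
top_docs = bm25.get_top_n(query.lower().split(), corpus, n=2)
print(scores, top_docs)
```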

He says the Cohere reranker worked well for them

He's defining the "context engine" as the vector db + filters + reranker
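
He didn't spell out how the recallers get combined, so here's a sketch using reciprocal rank fusion (RRF) as one common choice, purely an assumption on my part; the doc IDs are toy values:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Combine ranked doc-id lists from multiple recallers (bm25, vector, graph, filters).

    Each document's fused score is sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the commonly used default constant.
    """
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy usage: fuse a keyword recaller and a vector recaller, then hand the top docs to a reranker.
fused = reciprocal_rank_fusion([["doc3", "doc1", "doc7"], ["doc1", "doc9", "doc3"]])
print(fused)  # doc1 ranks first: it appears near the top of both lists
```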

You still need metrics and data viz; it still can't be 100% automated all the time (same as for recsys)
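
Toy example of the NDCG metric from tip 4, using the scikit-learn function linked above:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# True relevance grades vs. the scores your retriever/recsys produced, for one query.
true_relevance = np.asarray([[3, 2, 0, 1, 0]])
predicted_scores = np.asarray([[0.9, 0.7, 0.6, 0.2, 0.1]])

print(ndcg_score(true_relevance, predicted_scores))        # full ranking
print(ndcg_score(true_relevance, predicted_scores, k=3))   # NDCG@3
```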

LanceDB:

  • "all-in-one DB for AI"
  • embedded OSS
  • flexible storage & serverless options
  • image, audio, video
  • vector search, full-text, SQL
  • can handle embedding generation for you
  • has a discord for community
  • and they're hiring
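
Quick-start sketch based on the LanceDB docs (the table name, vectors, and query are toy values):

```python
import lancedb  # pip install lancedb

db = lancedb.connect("./my_lancedb")  # embedded, file-based: no server to run
table = db.create_table(
    "docs",
    data=[
        {"vector": [0.1, 0.2], "text": "hello world"},
        {"vector": [0.9, 0.8], "text": "goodbye world"},
    ],
)
# Vector search; full-text search and SQL-style filters are also available.
results = table.search([0.1, 0.3]).limit(1).to_pandas()
print(results)
```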

Felix Heide - Torc Robotics (and has an academic lab at Princeton)

generating edge case data for autonomous trucking

modules for perception, prediction, and planning

perception includes: radar, lidar, camera, maps, object tracking, weather

prediction and planning includes: fault manager, scene context, ML & analytics

images are processed to be optimized for computer vision

see Kumar et al. 2024 re: dynamic re-calibration, needed for e.g. severe vibrations on trucks

they test with social planning (agents reacting to each other)

for training e2e they need generative world simulators

they solve tracking as "inverse rendering" and "re-identification"

typical behavior models use replays of driving logs, but with those it's hard to generate realistic examples

instead they do offline reinforcement learning using ctrl-sim: https://arxiv.org/pdf/2403.19918v2

return-conditioned with exponential "tilting", where "tilting" means adjusting the distance/angle between vehicle-vehicle, vehicle-goal, or vehicle-edge-of-map

they collect trajectories and then learn a policy

can create collisions this way, e.g. one vehicle speeds up when they shouldn't
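
My rough reading of the exponential tilting idea, as a toy sketch: a generic return-weighted sampling of logged trajectories, not the actual CTRL-Sim implementation.

```python
import math
import random

def tilted_sample(trajectories, returns, kappa, n=5):
    """Sample logged trajectories with weight proportional to exp(kappa * return).

    kappa > 0 tilts toward high-return (well-behaved) driving;
    kappa < 0 tilts toward low-return behavior, e.g. to surface near-collisions.
    """
    weights = [math.exp(kappa * r) for r in returns]
    return random.choices(trajectories, weights=weights, k=n)

# Toy usage: returns penalize unsafe behavior, so a negative kappa over-samples risky scenes.
trajectories = ["cautious_merge", "late_brake", "cuts_off_truck"]
returns = [1.0, 0.2, -1.5]
print(tilted_sample(trajectories, returns, kappa=-2.0, n=3))
```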
