Mo Elshenawy - Cruise
Using GenAI for data mining re: specific scenarios and use cases not well-represented in the existing data.
Paper coming out soon
VC panel
James Cham - Bloomberg Beta (Weights & Biases, Kolena, Twilio, etc.)
- he argued that quality is subjective (I think I see his point, but I don't agree)
- he also said there's less innovation in applications than in developer tooling
Madison Faulkner - partner @NEA, led DS teams including @FB, focuses on Series B
Eric Carlborg - Lobby Capital, focuses on Series A
Natasha Macarenhas - reporter @TheInformation
Gordon Hart - Kolena
Quality Signals - lightweight classifiers trained very fast (he didn't really explain how)
guardrails can slow things down
Example: watermark detection on images
few-shot classifier is a good balance between training a custom model (high accuracy, more effort) and zero-shot (low effort, but also low accuracy)
SigLIP embeddings can capture subtle concepts, but that nuance gets lost in typical similarity comparisons. Much better to use the embeddings directly as input to a classifier.
Example: a tiny classifier trained few-shot on labeled data
- multilayer perceptron with a single hidden layer, ~100 parameters
- trains in <1 second on a CPU
- gets an F1 score of 50% with only 10 labeled example images (5 positive, 5 negative controls)
- with only 16 images the F1 score goes up to 90%, with no false positives
Can use this model to clean historical data, to mine and filter new data, and even run it live as a guardrail for regression monitoring in production.
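A rough sketch of the idea he described - a tiny MLP head on frozen image embeddings - using synthetic stand-in vectors instead of real SigLIP embeddings, and scikit-learn rather than whatever Kolena runs internally:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
DIM = 64  # real SigLIP embeddings are larger (768+); small here for speed

def fake_embeddings(n, offset):
    # Stand-in for image embeddings: two loosely separated clusters.
    return rng.normal(loc=offset, scale=1.0, size=(n, DIM))

# 16 labeled examples: 8 "watermarked", 8 clean (the talk used as few as 10-16)
X_train = np.vstack([fake_embeddings(8, 0.5), fake_embeddings(8, -0.5)])
y_train = np.array([1] * 8 + [0] * 8)

# Tiny MLP head on top of frozen embeddings - trains in well under a second on CPU
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)

# Evaluate on fresh synthetic "images"
X_test = np.vstack([fake_embeddings(50, 0.5), fake_embeddings(50, -0.5)])
y_test = np.array([1] * 50 + [0] * 50)
f1 = f1_score(y_test, clf.predict(X_test))
```

On real embeddings the numbers will differ, but the shape of the recipe matches the talk: frozen embedding model, a few labeled examples, a classifier small enough to retrain constantly.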
The goal is to have tooling and processes to do the right thing easily and repeatably, so you can focus on your true objectives, not side quests
Amr Awadallah - Vectara: RAG vs. large context windows
Current: co-founder and CEO. Former: CTO of Cloudera; had another company he sold to Yahoo in 2000; PhD from Stanford.
Vectara is "RAG in a box". You plug in your data, and it includes the vector database, guardrails, and quality signals
He says in 5 years every device will have a GenAI chat interface
He says hallucination happens because of lossy compression (100-1000x): models train on trillions of tokens but store only 0.1-1% of the original size.
12.5% is about the best lossless compression for English (per Shannon's estimate of English entropy, roughly 1 bit per character vs. the 8 bits used to store it)
data --> Boomerang proprietary stateless retrieval model --> vector DB --> relevance scoring with reranker model --> prompt --> LLM
Boomerang "converts language to 'meaning space'" for many different languages (he named at least a few)
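A toy sketch of that data flow (the bag-of-words "embedding" and overlap-based "reranker" below are illustration-only stand-ins, not Vectara's Boomerang or reranker models):

```python
import numpy as np

VOCAB = ["rag", "context", "windows", "rerankers", "retrieve", "documents", "relevance"]

def embed(text: str) -> np.ndarray:
    # Stand-in retrieval model: bag-of-words over a tiny fixed vocabulary.
    toks = text.lower().replace("?", "").split()
    v = np.array([float(w in toks) for w in VOCAB])
    return v / (np.linalg.norm(v) + 1e-9)

docs = [
    "RAG grounds answers in retrieved documents",
    "Long context windows fit whole books in the prompt",
    "Rerankers score passages for relevance",
]
doc_vecs = np.stack([embed(d) for d in docs])  # what the vector DB would store

def retrieve(query: str, k: int = 2) -> list:
    scores = doc_vecs @ embed(query)           # cosine similarity (unit vectors)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

def rerank(query: str, candidates: list) -> list:
    # Stand-in relevance scorer: exact term overlap with the query.
    q = set(query.lower().replace("?", "").split())
    return sorted(candidates, key=lambda d: -len(q & set(d.lower().split())))

query = "how does RAG retrieve documents?"
context = rerank(query, retrieve(query))
prompt = "Answer using only this context:\n" + "\n".join(context) + "\n\nQ: " + query
# `prompt` is what finally goes to the LLM
```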
They published a leaderboard on HuggingFace with hallucination rates. GPT-4 Turbo is 2.5% https://huggingface.co/spaces/vectara/Hallucination-evaluation-leaderboard (Opus isn't listed so this may be a bit out of date?)
Long context windows (LCW): Anthropic has 200K tokens; Google has 2M tokens (~600 pages!)
- He says LCW is better for holistic analysis and relationships among results
- but LCW can be slower than RAG for most things (N log N with caching, N^2 without caching, vs. log N for RAG)
- He's saying that RAG is just as (?) "easy to update" as LCW, because he says caching actually makes updates harder (not sure I agree with this)
Model can get distracted by too much info in LCW (this is definitely my observation)
LCW is better at finding a single needle in a haystack; RAG is better for multiple-needle use cases
Vectara has:
- hallucination detection
- explainability
- prompt attack protections and RBAC
- copyright detection
- bias and toxicity mitigation
- free trial available https://console.vectara.com/signup
Jerry Liu - CEO @LlamaIndex
PyPDF can't do complex docs
investing time in better parsing does pay off
their new library LlamaParse had a big impact on reducing hallucinations - https://github.com/run-llama/llama_parse
Can combine advanced parsing with hierarchical indexing and retrieval; works with multimodal data
they do recursive retrieval, e.g. summary on chunks, tables, images then index into that and link to the original object
advanced table understanding
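A bare-bones sketch of that recursive-retrieval pattern - plain dicts and term overlap standing in for LlamaIndex's index classes and embedding similarity:

```python
# Index summaries of chunks/tables/images, but link each summary back to
# the full source object; retrieval matches on the summary, returns the source.
index = [
    {"summary": "quarterly revenue table by region",
     "source": "<full HTML table with 40 rows>"},
    {"summary": "architecture diagram of the ingestion pipeline",
     "source": "<original image / caption>"},
]

def retrieve_source(query: str) -> str:
    # Stand-in scorer: term overlap with the summary (a real system
    # would use embedding similarity here).
    q = set(query.lower().split())
    best = max(index, key=lambda node: len(q & set(node["summary"].lower().split())))
    return best["source"]  # follow the link back to the original object
```

The key design choice is that the object you match on (a summary) and the object you return (the full table/image) are different, linked items.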
some tips:
- fix chunking, i.e. don't break up tables!
- page-level chunking is often good
- 5 levels of text splitting (this was a link in his slides)
- metadata extraction - adding metadata to chunks helps a lot (this has been my experience too)
- indexing with a single vector is not enough - multiple vectors for the same text can be very helpful
- a docstore (k,v) alongside the source docs also helps with caching and incremental syncing
- consider a knowledge graph
- consolidated, unified storage systems are needed (I agree - and see notes on LanceDB, apparently they're a customer)
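The multiple-vectors-per-text tip can be sketched like this (deterministic toy embedding, my own stand-in - a real setup would use an embedding model and a vector store):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Deterministic stand-in embedding; a real model would also place
    # similar (not just identical) text nearby in vector space.
    seed = sum(ord(c) for c in text) % (2**32)
    v = np.random.default_rng(seed).normal(size=32)
    return v / np.linalg.norm(v)

chunk = "Q3 revenue grew 12% year over year, driven by the APAC region."

# Multiple vectors for the same chunk: raw text, a summary, and a question
# it answers - every entry links back to the one source chunk.
views = [
    chunk,
    "summary: Q3 revenue growth",
    "question: how fast did revenue grow in Q3?",
]
vector_index = [(embed(view), chunk) for view in views]

def search(query: str) -> str:
    qv = embed(query)
    return max(vector_index, key=lambda entry: entry[0] @ qv)[1]
```

With this toy embedding only exact text matches score highly, but the structure is the point: a question-phrased query can hit the question vector and still return the full chunk.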
Chang She - CEO & Co-founder at LanceDB: Self-optimizing RAG
@changiskhan one of the original co-authors of Pandas
RAG has a demo-to-prod gap
- retrieval quality - the first 80% is easy
- continual improvement: evals
His advice:
- Don't start with vector search, start with bm25
https://docs.llamaindex.ai/en/stable/examples/retrievers/bm25_retriever/
full-text search often does better than vector search
- chunking matters
- examples: text, json, html, markdown, code
- details matter - window size, overlap, delimiters
- Langchain/LlamaIndex has many chunking processors, he said they have a blog post on this
- evals speed things up
- sample the data sent to the OpenAI embedding API to keep experimentation costs down
- embedding matters - the MTEB leaderboard on HuggingFace benchmarks embeddings for different use cases https://huggingface.co/spaces/mteb/leaderboard
- pick a good metric - NDCG is a good default for ranking/recsys https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ndcg_score.html
- fine-tuning doesn't need TBs of new data. Gave an example of a LanceDB customer who generated $10 worth of synthetic data (said he's going to share his slides)
- hybrid search - multiple recallers for different use cases, then combine, e.g.
  - filters (use explicit structure), bm25 (keywords and fuzzy search), graph (relationships)
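For the metric tip: sklearn's ndcg_score (the link above) takes graded true relevance and the ranker's scores, one row per query:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# One query, four candidate docs: graded true relevance vs. our ranker's scores
true_relevance = np.asarray([[3, 2, 0, 1]])
ranker_scores = np.asarray([[0.9, 0.7, 0.3, 0.5]])  # same ordering as the truth

perfect = ndcg_score(true_relevance, ranker_scores)     # ordering matches -> 1.0
reversed_ = ndcg_score(true_relevance, -ranker_scores)  # worst ordering -> much lower
```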
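And for the hybrid-search tip, one common way to combine recallers (not necessarily what LanceDB does) is reciprocal rank fusion:

```python
def rrf(ranked_lists, k: int = 60):
    """Reciprocal rank fusion: each recaller contributes 1/(k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]     # keyword recaller
vector_hits = ["doc1", "doc4", "doc3"]   # embedding recaller
fused = rrf([bm25_hits, vector_hits])    # doc1 wins: high in both lists
```

RRF only needs rank positions, so it sidesteps the problem of bm25 and cosine scores living on incompatible scales.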
He says the Cohere reranker worked well for them
He's defining the "context engine" as the vector db + filters + reranker
You still need metrics, data viz, still can't be 100% automated all the time (same as for recsys)
LanceDB:
- "all-in-one DB for AI"
- embedded OSS
- flexible storage & serverless options
- image, audio, video
- vector search, full-text, SQL
- can handle embedding generation for you
- has a discord for community
- and they're hiring
Felix Heide - Torc Robotics (and has an academic lab at Princeton)
generating edge case data for autonomous trucking
modules for perception, prediction, and planning
perception includes: radar, lidar, camera, maps, object tracking, weather
prediction and planning includes: fault manager, scene context, ML & analytics
images are processed to be optimized for computer vision
see Kumar et al. 2024 re: dynamic re-calibration, needed e.g. for severe vibrations on trucks
they test with social planning (agents reacting to each other)
for training e2e they need generative world simulators
they solve tracking as "inverse rendering" and "re-identification"
typical behavior models use replays of driving logs, which makes it hard to generate realistic new examples
instead they do offline reinforcement learning using ctrl-sim: https://arxiv.org/pdf/2403.19918v2
return-conditioned with exponential "tilting", where "tilting" means adjusting distance/angle between vehicle-vehicle, vehicle-goal, or vehicle-edge-of-map
they collect trajectories and then learn a policy
can create collisions this way, e.g. one vehicle speeds up when they shouldn't
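A generic sketch of the exponential-tilting idea (my reading of the standard technique, not CtRL-Sim's actual implementation): reweight trajectories by exp(kappa * return), so negative kappa up-weights low-return, collision-prone behavior:

```python
import numpy as np

def tilted_weights(returns, kappa):
    # Exponential tilting: w_i proportional to exp(kappa * R_i).
    # kappa > 0 favors high-return (safe) trajectories,
    # kappa < 0 favors low-return (adversarial) ones.
    logits = kappa * np.asarray(returns, dtype=float)
    logits -= logits.max()       # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

returns = [10.0, 0.0, -5.0]      # e.g. safe, neutral, near-collision trajectories
safe_bias = tilted_weights(returns, kappa=1.0)    # mass on the safe trajectory
crash_bias = tilted_weights(returns, kappa=-1.0)  # mass on the near-collision one
```

Sampling trajectories by `crash_bias` is one way a simulator could deliberately surface the collision-style edge cases mentioned above.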