@szeitlin
Created June 26, 2024 19:42
AIQCon_notes_June25_2024

Mo Elshenawy - Cruise

Using GenAI for data mining re: specific scenarios and use cases not well-represented in the existing data.

Paper coming out soon


VC panel

James Cham - Bloomberg Beta (Weights & Biases, Kolena, Twilio, etc.)

  • he argued that quality is subjective (I think I see his point, but I don't agree)
  • he also said there's less innovation in applications than in developer tooling

Madison Faulkner - partner @NEA, led DS teams including @FB, focuses on Series B

Eric Carlborg - Lobby Capital, focuses on Series A

Natasha Mascarenhas - reporter @TheInformation


Gordon Hart - Kolena

Quality Signals - lightweight classifiers trained very fast (he didn't really explain how)

guardrails can slow things down

Example: watermark detection on images

few-shot classifier is a good balance between training a custom model (high accuracy, more effort) and zero-shot (low effort, but also low accuracy)

Embeddings from the SigLIP embedding model can capture subtle concepts, but that nuance gets lost in typical similarity comparisons. It's much better to use the embeddings directly as input to a classifier.

Example of using a tiny classifier with few-shot learning on labeled data:

  • multilayer perceptron with a single hidden layer, ~100 parameters
  • trains in <1 second on a CPU
  • gets an F1 score of ~50% with only 10 labeled example images (5 positive, 5 negative controls)
  • with only 16 images, the F1 score goes up to 90% with no false positives
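
Here's a minimal sketch of what a classifier like this could look like, assuming you already have image embeddings (e.g. from SigLIP) as arrays; the stand-in data, hidden-layer size, and split are illustrative assumptions, not Kolena's actual setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

# Stand-in for precomputed image embeddings + 0/1 labels (e.g. watermark vs. no watermark).
X, y = make_classification(n_samples=60, n_features=64, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=16, random_state=0)

# Tiny MLP with one small hidden layer; trains in well under a second on a CPU.
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)

print("F1:", f1_score(y_test, clf.predict(X_test)))
```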

You can use this model to clean historical data and to mine and filter new data; you can also run it live as a guardrail for regression monitoring in production.

The goal is to have tooling and processes to do the right thing easily and repeatably, so you can focus on your true objectives, not side quests


Amr Awadallah - Vectara: RAG vs. large context windows

Current: co-founder and CEO. Former: CTO of Cloudera, had another company he sold to Yahoo in 2000, and has a PhD from Stanford.

Vectara is "RAG in a box". You plug in your data, and it includes the vector database, guardrails, and quality signals

He says in 5 years every device will have a GenAI chat interface

He says hallucination happens because of lossy compression (100-1000x). Train on trillions of tokens, store 0.1-1% of the original size.

12.5% is roughly the maximum compression for English (per Shannon's information theory estimate of about 1 bit of entropy per 8-bit character, English text can't be losslessly compressed much below ~12.5% of its original size)

data --> Boomerang proprietary stateless retrieval model --> vector DB --> relevance scoring with reranker model --> prompt --> LLM

Boomerang "converts language to 'meaning space'" for many different languages (he named at least a few)

They published a leaderboard on HuggingFace with hallucination rates. GPT-4 Turbo is 2.5% https://huggingface.co/spaces/vectara/Hallucination-evaluation-leaderboard (Opus isn't listed so this may be a bit out of date?)

Long context window (LCW): Anthropic has 200k tokens; Google has 2M tokens (600 pages!). He says LCW is better for holistic analysis and relationships among results, but can be slower than RAG for most things (N log N for LCW with caching, N^2 for LCW without caching, vs. log N for RAG). He's saying that RAG is just as (?) "easy to update" as LCW, because he says caching actually makes updates harder (not sure I agree with this).

The model can get distracted by too much info in LCW (this is definitely my observation). LCW is better at finding a single needle in a haystack; RAG is better for multiple-needle-finding use cases.

Vectara has:

  • hallucination detection
  • explainability
  • prompt attack protections and RBAC
  • copyright detection
  • bias and toxicity mitigation
  • free trial available https://console.vectara.com/signup

Jerry Liu - CEO @LlamaIndex

PyPDF can't do complex docs

investing time in better parsing does pay off

their new library LlamaParse had a big impact on reducing hallucinations - https://github.com/run-llama/llama_parse
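
A minimal usage sketch of LlamaParse (the file path is hypothetical, and it assumes a LLAMA_CLOUD_API_KEY is set in the environment):

```python
from llama_parse import LlamaParse  # pip install llama-parse

# Parse a complex PDF (nested tables, figures) into LLM-friendly markdown.
parser = LlamaParse(result_type="markdown")  # reads LLAMA_CLOUD_API_KEY from the environment
documents = parser.load_data("./example_report_with_tables.pdf")  # hypothetical file path
print(documents[0].text[:500])
```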

Can combine advanced parsing with hierarchical indexing and retrieval; works with multimodal data.

they do recursive retrieval, e.g. summarize chunks, tables, and images, then index those summaries and link back to the original object
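
My rough sketch of that recursive-retrieval idea in plain Python (the data structures and scoring are illustrative, not LlamaIndex's implementation): score against embeddings of the summaries, but return the original objects they link to.

```python
from dataclasses import dataclass

@dataclass
class Node:
    summary_embedding: list   # embedding of a short summary of the chunk, table, or image
    original: object          # the full chunk/table/image it links back to

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def recursive_retrieve(query_embedding, nodes, top_k=3):
    # Score against the summaries, but hand back the original objects.
    scored = sorted(nodes, key=lambda n: cosine(query_embedding, n.summary_embedding), reverse=True)
    return [n.original for n in scored[:top_k]]
```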

advanced table understanding

some tips:

  • fix chunking, i.e. don't break up tables!

  • page-level chunking is often good

  • 5 levels of text splitting (this was a link in his slides)

  • metadata extraction - adding metadata to chunks helps a lot (this has been my experience too; see the sketch after this list)

  • indexing with a single vector is not enough - multiple vectors for the same text can be very helpful

  • docstore (k,v) in addition for the source docs is also helpful for caching and incremental syncing

  • consider a knowledge graph

  • consolidated, unified storage systems are needed (I agree - and see notes on LanceDB, apparently they're a customer)
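
A toy sketch of the page-level chunking + metadata tips above (the page splitting and metadata fields are my assumptions for illustration, not a LlamaIndex pipeline):

```python
# Toy page-level chunking with metadata attached to each chunk.
def chunk_by_page(pages, doc_title, source_path):
    """pages: list of page-text strings, e.g. from a PDF parser."""
    chunks = []
    for page_num, text in enumerate(pages, start=1):
        chunks.append({
            "text": text,
            "metadata": {                 # metadata travels with the chunk into the index
                "title": doc_title,
                "source": source_path,
                "page": page_num,
            },
        })
    return chunks

chunks = chunk_by_page(["page one text...", "page two text..."], "Q2 report", "./q2.pdf")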


Chang She - CEO & Co-founder at LanceDB: Self-optimizing RAG

@changiskhan one of the original co-authors of Pandas

RAG has a demo-to-prod gap

  • retrieval quality - the first 80% is easy
  • continual improvement: evals

His advice:

  1. Don't start with vector search, start with bm25 (see the bm25 sketch after this list)

https://docs.llamaindex.ai/en/stable/examples/retrievers/bm25_retriever/

full-text search often does better than vector search

  2. Chunking matters
  • examples: text, json, html, markdown, code
  • details matter - window size, overlap, delimiters
  • Langchain/LlamaIndex have many chunking processors; he said they have a blog post on this
  • evals speed things up
  • sample the data for the OpenAI embedding API to keep experimentation costs down
  3. Embedding model matters - the MTEB leaderboard on HuggingFace has embedding benchmarks for different use cases: https://huggingface.co/spaces/mteb/leaderboard

  4. Pick a good metric - NDCG is a good default for ranking/recsys (see the NDCG sketch further below) https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ndcg_score.html

  5. Fine-tuning - you don't need TBs of new data. He gave an example of a LanceDB customer who generated $10 worth of synthetic data (said he's going to share his slides)

  6. Hybrid search - multiple recallers for different use cases, then combine (see the fusion sketch below), e.g.

  • filters (use explicit structure), bm25 (keywords and fuzzy search), graph (relationships)
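
A minimal full-text baseline for tip 1, using the rank_bm25 package as one common option (the LlamaIndex BM25Retriever linked above is another); the corpus and query are toy stand-ins:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "The quarterly report shows revenue growth in EMEA",
    "Vector databases store embeddings for similarity search",
    "BM25 is a classic keyword-based ranking function",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "keyword ranking"
scores = bm25.get_scores(query.lower().split())            # one score per document
top_docs = bm25.get_top_n(query.lower().split(), corpus, n=2)
print(scores, top_docs)
```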

He says the Cohere reranker worked well for them

He's defining the "context engine" as the vector db + filters + reranker
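
He didn't spell out how the recallers get combined, so here's a sketch using reciprocal rank fusion (RRF) as one common choice, purely an assumption on my part; the doc IDs are toy values:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Combine ranked doc-id lists from multiple recallers (bm25, vector, graph, filters).

    Each document's fused score is sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the commonly used default constant.
    """
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy usage: fuse a keyword recaller and a vector recaller, then hand the top docs to a reranker.
fused = reciprocal_rank_fusion([["doc3", "doc1", "doc7"], ["doc1", "doc9", "doc3"]])
print(fused)  # doc1 ranks first: it appears near the top of both lists
```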

You still need metrics and data viz; it still can't be 100% automated all the time (same as for recsys)
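
Toy example of the NDCG metric from tip 4, using the scikit-learn function linked above:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# True relevance grades vs. the scores your retriever/recsys produced, for one query.
true_relevance = np.asarray([[3, 2, 0, 1, 0]])
predicted_scores = np.asarray([[0.9, 0.7, 0.6, 0.2, 0.1]])

print(ndcg_score(true_relevance, predicted_scores))        # full ranking
print(ndcg_score(true_relevance, predicted_scores, k=3))   # NDCG@3
```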

LanceDB:

  • "all-in-one DB for AI"
  • embedded OSS
  • flexible storage & serverless options
  • image, audio, video
  • vector search, full-text, SQL
  • can handle embedding generation for you
  • has a discord for community
  • and they're hiring
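
Quick-start sketch based on the LanceDB docs (the table name, vectors, and query are toy values):

```python
import lancedb  # pip install lancedb

db = lancedb.connect("./my_lancedb")  # embedded, file-based: no server to run
table = db.create_table(
    "docs",
    data=[
        {"vector": [0.1, 0.2], "text": "hello world"},
        {"vector": [0.9, 0.8], "text": "goodbye world"},
    ],
)
# Vector search; full-text search and SQL-style filters are also available.
results = table.search([0.1, 0.3]).limit(1).to_pandas()
print(results)
```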

Felix Heide - Torc Robotics (and has an academic lab at Princeton)

generating edge case data for autonomous trucking

modules for perception, prediction, and planning

perception includes: radar, lidar, camera, maps, object tracking, weather

prediction and planning includes: fault manager, scene context, ML & analytics

images are processed to be optimized for computer vision

see Kumar et al. 2024 re: dynamic re-calibration, needed for e.g. severe vibrations on trucks

they test with social planning (agents reacting to each other)

for training e2e they need generative world simulators

they solve tracking as "inverse rendering" and "re-identification"

typical behavior models use replays of driving logs, but with those it's hard to generate realistic examples

instead they do offline reinforcement learning using ctrl-sim: https://arxiv.org/pdf/2403.19918v2

return-conditioned with exponential "tilting", where "tilting" means adjusting the distance/angle between vehicle-vehicle, vehicle-goal, or vehicle-edge-of-map

they collect trajectories and then learn a policy

can create collisions this way, e.g. one vehicle speeds up when they shouldn't
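
My rough reading of the exponential tilting idea, as a toy sketch: a generic return-weighted sampling of logged trajectories, not the actual CTRL-Sim implementation.

```python
import math
import random

def tilted_sample(trajectories, returns, kappa, n=5):
    """Sample logged trajectories with weight proportional to exp(kappa * return).

    kappa > 0 tilts toward high-return (well-behaved) driving;
    kappa < 0 tilts toward low-return behavior, e.g. to surface near-collisions.
    """
    weights = [math.exp(kappa * r) for r in returns]
    return random.choices(trajectories, weights=weights, k=n)

# Toy usage: returns penalize unsafe behavior, so a negative kappa over-samples risky scenes.
trajectories = ["cautious_merge", "late_brake", "cuts_off_truck"]
returns = [1.0, 0.2, -1.5]
print(tilted_sample(trajectories, returns, kappa=-2.0, n=3))
```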
