LLM Application Framework Proposal

Motivation

Improve on langchain?

Because LLMs deserve a more pleasant development experience.

Disclaimer

This is just a brain dump at this point; I haven't even finished reading langchain's official docs (I think I should do that at least, if not actually try to develop an app with langchain first). Maybe my thoughts will completely change after that.

Also, nothing against langchain personally - it got in early in the LLM scene when the patterns were still nebulous and unclear. This one is more like an attempt to consolidate something like a consensus (?) after reading netizens' comments.

Lesson from web 2.0 frameworks

  • Libraries, not framework
    • UNIX philosophy
    • Focused, does one thing well
    • However, deep integration + ecosystem does have some benefits - how to get these benefits without the downsides?
  • Avoid configuration hell
  • Declarative vs imperative
  • others?

Differences from langchain/llama-index/haystack

  • Smaller scope - LLM-enabled applications; document understanding/retrieval/question-and-answer etc. are out of scope and handled by llama-index or haystack (depending on your taste).
    • Kind of like how a web framework may provide integrations with an ORM and database but doesn't overfocus on that
    • Rationale: they seem to have vastly different design considerations (online vs cron-job-like)
  • Use the chain as the central abstraction - the current idea is that both chatbots and agents use the same chain type underneath (but the chain content is of course different); the surrounding data structures are also different
    • (Still thinking) Want to provide more flexibility at the top layer (see below), but also want to keep a "clean architecture"

Long term vision

After reading more about langchain and its surrounding ecosystem, it seems that an LLM app framework is one piece of the puzzle in a larger picture. In an ideal world, the enhanced developer experience and LLMOps stuff would be provided by specialized tooling outside the framework, but with seamless integration.

Two major parts worth highlighting are:

  • Prompt engineering as part of the LLMOps lifecycle
    • Specialized tools to do this in a disciplined way
    • IDE/UI for easy experimentation during prototype/validation phase
    • Lineage tracking of prompts
    • Automated evaluation of prompt performance (like unit tests?) in later phases of development
  • Low code platform to design app
    • UI to drag-and-drop design a prompt chain
    • Share these chains? (need a standardized format?)
    • End to end platform? (with deployment included)

High level overview

Use a layered architecture to hopefully make things clearer.

  • Application Layer
    • Cognitive Architecture
    • Chatbot/Conversational Agent?
  • Orchestration layer
    • Prompt
    • Prompt Chain and Execution Engine
  • Integration layer
    • LLM-Interface
    • VectorStore (Out of scope)
    • Tools

Brief explanation:

We focus on making a framework for developing online-type applications in which the LLM plays a central role/is in the spotlight, augmented with tool use, agentization, and a connection to a separate semantic-retrieval-plus-LLM-enhanced summarization/QA system. The core abstraction is a Prompt Chain - a computational graph with natural language text passing through it as data, where each node performs either general-purpose processing using an LLM or an action through an external tool. Execution of such a chain is scheduled through the Execution Engine. The Integration layer provides a somewhat unified interface to external dependencies. On top of Prompt Chains, LLM-enabled applications are built by enriching them with their own application-specific data structures and configs.

Some key questions:

  • Is the current design of LLM-Interface too fat?
  • How to handle cross-cutting concerns (logging/event system, etc.) without making everything too tightly coupled?
    • Eg: May want to offer the LLM-Interface as a standalone component that is useful on its own without bringing in the whole framework
  • How to strike a balance between power and simplicity for the core abstraction of Prompt Chain?
    • Reality: May need an escape hatch and support a node type that is just arbitrary code
    • Reality: Even the basic prompt node type may become complex if we need post-processing, guardrail retry/failure handling for passive prompts where the LLM's structured output fails, preprocessing input by trimming text based on a token quota (dependent on the tokenizer, which in turn depends on which LLM is used)...
  • Is/should a Prompt Chain be composable? Is it okay to force-fit an entire application into one or more Prompt Chains (so it is self-contained and the surrounding code is mostly just light adaptors)? Even with the escape hatch mentioned above, something that basically amounts to an explicitly declared computational graph with a separate execution scheduler sounds like a heavy restriction to impose - working with it will likely feel less free/easy compared to "just code".
    • On the other hand, it can be argued that imposing this discipline is precisely the point of using an application framework: it forces you to be explicit about your application logic...
  • Although vector stores etc. are out of scope, how about conversation memory stored in a vector store and retrieved to help an agent maintain identities/relationships? It feels like a gray area in between the two patterns...
  • langchain recently introduced a second type of action that is long-running/in the background... can we handle it in a unified manner in the Prompt Chain + Execution Engine abstraction? Is just scheduling it in a more flexible manner already enough?

Core concepts/Components

LLM

Unified interface. (For pragmatic reasons; we acknowledge that this is very likely to be a leaky abstraction and that different concrete providers will have differences in features supported/specific quirks, etc.)

Main types of implementations?

  • API based (OpenAI standard) (Important not to make excessive assumptions about a specific vendor, eg the lesson from AWS S3 API integrations)
  • local/FFI bindings
  • Reuse vendor provided SDK
  • Mocked (use during testing and development phase to rapidly iterate on ideas)

Cross-cutting/enhanced features it should support:

  • Retries
  • Caching
    • Simple/exact cache
    • Semantic cache (sufficiently similar query)
  • (Structured) Logging
  • Metrics
    • For controlling cost (counting token consumed)
    • For tracking/monitoring performance (token/sec)
  • Async/streaming response
  • Concurrency/Batching/Queueing
  • Configuration management
  • metadata tagging

(how about low level/logit level access?)

For logging, output can be sent to something like elasticsearch in production, but in development mode we can default to, say, storing it in a sqlite database. One benefit is a DX improvement: it allows quick experimentation followed by extracting the prompts that "work" after the fact by reviewing them.

  • Need to generate UUID for each request?
  • Integration with prompt engineering tools outside the framework?

Note that a mocked LLM instance can give canned responses driven by a special metadata tag, or even just return a message supplied in special metadata so the client can completely control it.

Some example LLM providers:

  • Hugging Face API
  • Gradio SDK
  • KoboldAI Horde
  • llama-cpp-python
  • OpenAI compatible API

For implementation, prefer composition over inheritance? So we may provide helper classes/functions that implement some of these features in a generic manner, but it is up to the individual LLM plugin/provider to use them.

For the actual interface, say something like LLMInterface.generate(prompt, LLMConfig(temp=0.7, top_p=0.95), other config...)
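
A minimal sketch of what this could look like, assuming a Protocol-style LLMInterface, an LLMConfig dataclass, and a mocked implementation (all names and fields here are illustrative, not a settled API):

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class LLMConfig:
    # Sampling parameters forwarded to the underlying provider.
    temp: float = 0.7
    top_p: float = 0.95
    max_tokens: int = 256

class LLMInterface(Protocol):
    def generate(self, prompt: str, config: LLMConfig, **metadata) -> str:
        """Return a completion for `prompt`; metadata is for tagging/logging."""
        ...

class MockLLM:
    """Canned-response implementation for tests and rapid prototyping."""
    def __init__(self, canned: dict[str, str], default: str = "(mock reply)"):
        self.canned = canned
        self.default = default

    def generate(self, prompt: str, config: LLMConfig, **metadata) -> str:
        # A special metadata tag can override the response entirely,
        # so the client can fully control the mock.
        if "forced_response" in metadata:
            return metadata["forced_response"]
        return self.canned.get(prompt, self.default)

llm: LLMInterface = MockLLM({"Hello": "Hi there!"})
print(llm.generate("Hello", LLMConfig(temp=0.7, top_p=0.95)))
```

Cross-cutting features like retries and caching could then be layered on as wrappers around any LLMInterface, in line with the composition-over-inheritance note above.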

Tools

A tool is characterized by:

  • Textual I/O + side effects
  • Event logging (for display to the end user) + end-user gating (i.e. require confirmation)
  • A way to describe its spec in prompts (see the sketch below)
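
A rough sketch of what a Tool could look like under those constraints (the class and field names are placeholders, and the calculator is a toy, not a safe sandbox):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str                      # injected into prompts as the tool's "spec"
    run: Callable[[str], str]             # textual in, textual out (side effects happen inside)
    requires_confirmation: bool = False   # end-user gating before the side effect is executed

    def spec_for_prompt(self) -> str:
        # How the tool is described to the LLM inside a prompt.
        return f"{self.name}: {self.description}"

calculator = Tool(
    name="calculator",
    description="Evaluate a basic arithmetic expression, e.g. '2 * (3 + 4)'.",
    run=lambda expr: str(eval(expr, {"__builtins__": {}})),  # illustration only, not sandboxed
)
print(calculator.spec_for_prompt())
print(calculator.run("2 * (3 + 4)"))
```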

Examples of tools:

  • Calculator
  • Web search
  • Visit webpage
  • invoke LLM/spin up LLM instance/clone
  • sandboxed code interpreter
  • file IO
  • Vector store retrieval/summarization/QA/conversational and other memory

Prompts

  • Have a way to load a collection of prompts from a directory
    • Watch the directory and auto-reload during development
  • Integrations with online prompt hubs? (many hub providers out there)
    • For development, lets you search and use prompts inside a JupyterHub notebook
  • Active vs Passive Prompts?
    • passive = just standard templating, maybe boosted with LLM-specific functions. Note that one can somewhat overcome the limitations and achieve parity with active prompts using a guardrail library + post-processing fixes (see the sketch below)
    • active = guidance/LMQL-like; can force the underlying LLM to conform, direct the LLM's output, and extract output in a structured way.
  • Should prompts be more than just the text? eg include the I/O spec.
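
To make the passive style concrete, here is a minimal sketch of a templated prompt plus a guardrail-style retry when structured output fails to parse (the prompt text, retry loop, and JSON contract are illustrative assumptions):

```python
import json
from string import Template

EXTRACT_PROMPT = Template(
    "Extract the person's name and age from the text below.\n"
    'Reply with JSON only, e.g. {"name": "Alice", "age": 30}.\n\n'
    "Text: $text"
)

def extract_person(generate, text: str, max_retries: int = 2) -> dict:
    """`generate` is any callable mapping a prompt string to the LLM's reply string."""
    prompt = EXTRACT_PROMPT.substitute(text=text)
    for _ in range(max_retries + 1):
        raw = generate(prompt)
        try:
            return json.loads(raw)  # post-processing: parse the structured output
        except json.JSONDecodeError:
            # Guardrail-style retry: restate the format requirement and try again.
            prompt += "\n\nThat was not valid JSON. Reply with JSON only."
    raise ValueError("LLM failed to produce valid JSON")

# Usage with a stand-in for the LLM call:
print(extract_person(lambda p: '{"name": "Alice", "age": 30}', "Alice is 30."))
```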

(Mentioned already elsewhere in this doc):

  • (separate project) Web UI that lets you edit a prompt chain visually/interactively
  • How about chaining? Separate class? (DAG? But what about looping/dynamic situations?)

Prompt Chain

Seems to be the central abstraction.

The simplest form is a static DAG, but it can be active + dynamic => much more sophisticated. (Can also be fully integrated.)

An LLM-enabled app can be said to be just a facade of data wrangling on top of a prompt chain to present a nice UI to the user. In short: your own fluent data format + (fully configured) prompt chain = LLM app.

But doesn't this mean a prompt chain is not composable?

We should allow more than one execution mode for a prompt chain to enable partial reuse.
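
As a strawman, a chain could be declared as a set of named nodes plus their data dependencies; PromptNode, ToolNode, and PromptChain below are purely illustrative names, not a committed design:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    inputs: list[str] = field(default_factory=list)   # names of upstream nodes

@dataclass
class PromptNode(Node):
    template: str = ""      # passive prompt template; upstream outputs fill the slots

@dataclass
class ToolNode(Node):
    tool_name: str = ""     # resolved against the tool registry at execution time

@dataclass
class PromptChain:
    nodes: dict[str, Node] = field(default_factory=dict)

    def add(self, node: Node) -> "PromptChain":
        self.nodes[node.name] = node
        return self

# A tiny retrieval-then-answer chain (static DAG):
chain = (PromptChain()
         .add(ToolNode("retrieve", tool_name="vector_store_search"))
         .add(PromptNode("answer", inputs=["retrieve"],
                         template="Answer using this context:\n{retrieve}\n\nQuestion: {question}")))
```

Composability (one of the open questions above) could then mean letting a whole PromptChain appear as a node inside another chain.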

Execution Engine

Schedules the execution of a prompt chain. Its responsibility is to figure out when to send which node to which resource (relevant when multiple resources are available; matters mostly in production?).

Some possible implementations:

  • Naive linearized scheduler - only ensures data dependencies are met via topological sorting, otherwise sends nodes to execute one by one. Useful during development? (See the sketch after this list.)
  • Parallel scheduler - Tries to optimize time-to-result by sending nodes to available resources in parallel.
    • May be helped if statistics on the typical execution time of each node are available?
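
A sketch of the naive linearized scheduler, assuming the illustrative PromptChain structure from the previous section and using Kahn's algorithm for the topological sort:

```python
from collections import deque

def linearize(chain) -> list[str]:
    """Return node names in an order that respects data dependencies (Kahn's algorithm)."""
    indegree = {name: len(node.inputs) for name, node in chain.nodes.items()}
    dependents = {name: [] for name in chain.nodes}
    for name, node in chain.nodes.items():
        for dep in node.inputs:
            dependents[dep].append(name)

    ready = deque(name for name, deg in indegree.items() if deg == 0)
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for nxt in dependents[current]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(chain.nodes):
        raise ValueError("Chain contains a cycle")
    return order

def run_naive(chain, execute_node) -> dict[str, str]:
    """Execute nodes one by one; `execute_node(node, upstream_outputs) -> str`."""
    outputs: dict[str, str] = {}
    for name in linearize(chain):
        node = chain.nodes[name]
        outputs[name] = execute_node(node, {dep: outputs[dep] for dep in node.inputs})
    return outputs
```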

Cognitive Architecture

Ability to specify in a declarative fashion?

Many variations, but generally:

  • Internal/mental states
  • External/world states
  • prompt chain for decision making => loop

Variation examples:

  • critic agent (responsible for cross-checking the base agent's thinking and course-correcting if necessary)
  • society of agents
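
A bare-bones sketch of the decision-making loop; the state dictionaries, callbacks, and stopping condition are placeholders rather than a committed design:

```python
def run_agent(decide, act, max_steps: int = 10):
    """Minimal agent loop: `decide(internal, world) -> action | None`, `act(action, world) -> observation`."""
    internal_state = {"scratchpad": []}   # internal/mental state: reasoning so far
    world_state = {"observations": []}    # external/world state: what the agent knows

    for _ in range(max_steps):
        action = decide(internal_state, world_state)   # typically one prompt-chain run
        if action is None:                             # the chain decided it is done
            break
        observation = act(action, world_state)         # tool call / side effect
        internal_state["scratchpad"].append(action)
        world_state["observations"].append(observation)
    return internal_state, world_state
```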

Chatbot

Provide a data format:

  • Collections of conversations isolated by session
    • Each one is two sides taking turns speaking... but
    • what about saying more than one thing in one turn?
    • what about multi-party conversations?
    • Internal thoughts (not shown to the end user) that the bot used to derive its answer/other internal system messages

and provide data manipulation methods, e.g. so one can feed the conversation into a vector store as memory (see the sketch below).
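
A first cut at such a data format, with a visibility flag covering internal thoughts/system messages and a free-form speaker field leaving room for multi-party conversations (field names are guesses at this stage):

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Message:
    speaker: str                 # "user", "assistant", or a named party in a multi-party chat
    content: str
    visible: bool = True         # False for internal thoughts / internal system messages
    timestamp: datetime = field(default_factory=datetime.now)

@dataclass
class ConversationSession:
    session_id: str
    messages: list[Message] = field(default_factory=list)   # more than one message per turn is fine

    def add(self, speaker: str, content: str, visible: bool = True) -> None:
        self.messages.append(Message(speaker, content, visible))

    def transcript(self, include_hidden: bool = False) -> list[Message]:
        # Manipulation helper, e.g. for feeding visible turns into a vector store as memory.
        return [m for m in self.messages if include_hidden or m.visible]
```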

Cross-cutting concepts/Other components

Configuration management

Just like a typical web 2.0 app mostly?

  • Command line arguments
  • ENV variables
  • config file

Catch: we do want it to get out of the way during development and let people just spin up an app and play immediately... (see the sketch below)
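
One possible way to keep the precedence simple (command line > ENV > config file > built-in dev defaults) while letting people spin up an app with zero setup; the option and variable names are made up for illustration:

```python
import argparse
import json
import os

DEV_DEFAULTS = {"llm_provider": "mock", "log_store": "sqlite"}   # works with zero configuration

def load_config(argv=None) -> dict:
    parser = argparse.ArgumentParser()
    parser.add_argument("--llm-provider")
    parser.add_argument("--config-file")
    args = parser.parse_args(argv)

    config = dict(DEV_DEFAULTS)                                  # 1. built-in dev defaults
    if args.config_file:
        with open(args.config_file) as f:
            config.update(json.load(f))                          # 2. config file
    if os.environ.get("LLM_PROVIDER"):
        config["llm_provider"] = os.environ["LLM_PROVIDER"]      # 3. ENV variables
    if args.llm_provider:
        config["llm_provider"] = args.llm_provider               # 4. command line arguments
    return config

print(load_config([]))   # => {'llm_provider': 'mock', 'log_store': 'sqlite'}
```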

Server endpoints integrations

TODO - a wider question of scope: should we add a fourth "Access Layer" on top of the application layer and provide various "app delivery options"...

Utilities

DocumentChunk

  • Split by token limits
  • Overlapping window (see the sketch below)
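
A minimal sketch of the chunking utility; the token count is approximated here by whitespace-split words, whereas the real implementation would use the tokenizer of whichever LLM is targeted:

```python
def chunk_document(text: str, max_tokens: int = 256, overlap: int = 32) -> list[str]:
    """Split `text` into overlapping windows of at most `max_tokens` words."""
    if max_tokens <= overlap:
        raise ValueError("max_tokens must be larger than overlap")
    words = text.split()
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

# Example: 10 words, windows of 4 with an overlap of 1.
print(chunk_document("one two three four five six seven eight nine ten", max_tokens=4, overlap=1))
```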

Developer Experience (DX)

  • Dev mode => should support hot-reload
  • Ability to dump cached LLM responses to a mock class, similar provenance => lower friction during the tinkering phase

A possible workflow example:

Start a casual experiment in a Jupyter notebook with a session, feeding different prompts to the LLM to tune them. When a response is satisfactory, you can click a button on a Gradio interface, or execute a special function, that saves the previous request + response pair to a repository (a separate local sqlite file?) scoped to the session. You can then further edit the request/response pair if needed, and then dump the repo into a mock LLM instance (see the sketch below).
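
A sketch of the final "dump the repo into a mock LLM instance" step, assuming the saved request/response pairs live in a local sqlite table (the table and column names are made up for illustration):

```python
import sqlite3

def load_mock_responses(db_path: str, session_id: str) -> dict[str, str]:
    """Read saved (prompt, response) pairs for one session from the local sqlite repo."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT prompt, response FROM saved_pairs WHERE session_id = ?",
            (session_id,),
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)

# Feed the pairs into something like the MockLLM sketched in the LLM section above:
# llm = MockLLM(load_mock_responses("prompts.db", session_id="exp-01"))
```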

Other unrelated stuff:

  • Spring Boot-like way to quickly bootstrap an app with common config for a bunch of components? Use a fluent API approach, like
app = (AppBootstrap().useLLM('openai', key=<your key>)
                     .usePromptChain('retrieval-chatbot')
                     .addConfig(...))

Multi-programming-language

Note that langchain has both Python and JS versions. One benefit is the potential to run the whole app logic on the frontend (it just calls the LLM API).
