Middle-out compression = intelligence

Middle-Out Compression: How AI Minds Really Remember

The Core Truth: Memory IS Compression

Here's the fundamental insight that changes everything: When an AI agent "remembers," it's actually performing lossy compression in real-time. This isn't a bug - it's the feature that makes intelligence possible.

Expanding Upward: The Big Picture

Why This Matters for AI Systems

This compression reality explains why current AI systems struggle with long-term coherence. Every context window is a compression boundary. Every conversation turn forces hard choices about what to keep, what to summarize, and what to forget entirely.

The Industry Split

Two camps have emerged around this compression challenge:

  • Team Single-Mind: Build better compression into one agent (Letta, Cognition)
  • Team Multi-Mind: Distribute compression across multiple agents (Anthropic)

Both are solving the same fundamental problem: how to fit infinite information into finite context windows.

Expanding Downward: The Mechanical Reality

Letta's Layered Architecture

Think CPU cache hierarchy, but for memory:

Core Memory (L1 Cache)

  • Always present, never compressed
  • System persona, key instructions
  • Zero distortion, minimal rate

Message Buffer (L2 Cache)

  • Recent conversation in full fidelity
  • Fixed capacity with overflow to archive
  • High fidelity, moderate rate

Archival Memory (L3 Cache)

  • Compressed summaries and embeddings
  • Practically infinite capacity
  • High distortion, infinite rate

Recall Memory (Decompression Buffer)

  • On-demand retrieval from archive
  • Selective rehydration of compressed data
  • Variable fidelity based on query quality
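
To make the hierarchy concrete, here is a minimal Python sketch of the four layers. It illustrates the pattern only, not Letta's actual API; the class and method names (`MemoryHierarchy`, `remember`, `recall`) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryHierarchy:
    """Illustrative four-layer memory loosely mirroring the Letta pattern (names are hypothetical)."""
    core: list[str] = field(default_factory=list)     # L1: persona and key instructions, never compressed
    buffer: list[str] = field(default_factory=list)   # L2: recent messages kept at full fidelity
    archive: list[str] = field(default_factory=list)  # L3: lossy summaries; the originals are gone
    buffer_capacity: int = 8                           # fixed L2 size; overflow triggers compression

    def remember(self, message: str, summarize) -> None:
        """Append a message; when the buffer overflows, compress the oldest turn into the archive."""
        self.buffer.append(message)
        while len(self.buffer) > self.buffer_capacity:
            oldest = self.buffer.pop(0)
            self.archive.append(summarize(oldest))     # the lossy compression step

    def recall(self, query: str, top_k: int = 3) -> list[str]:
        """Recall layer: crude keyword retrieval of archived summaries (a vector store in practice)."""
        terms = set(query.lower().split())
        scored = [(len(terms & set(s.lower().split())), s) for s in self.archive]
        return [s for score, s in sorted(scored, reverse=True)[:top_k] if score > 0]
```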

The Compression Cascade

  1. New information enters message buffer
  2. Overflow triggers summarization (lossy compression)
  3. Summaries archived to vector store
  4. Future queries decompress relevant fragments
  5. Agent assembles context from all layers
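
Continuing the sketch above, the cascade is simply those layers wired together on every turn. `summarize_with_llm` is a stand-in for whatever summarization model or prompt performs the lossy compression:

```python
def summarize_with_llm(text: str) -> str:
    """Placeholder for a real summarization call; truncation here stands in for lossy compression."""
    return text[:80] + ("..." if len(text) > 80 else "")

def build_context(memory: MemoryHierarchy, user_message: str) -> str:
    """Steps 1-5: ingest, overflow-compress, then assemble context from every layer."""
    memory.remember(user_message, summarize_with_llm)      # steps 1-3: ingest, overflow, archive
    recalled = memory.recall(user_message)                  # step 4: decompress relevant fragments
    return "\n".join(
        ["[CORE] " + c for c in memory.core]                # step 5: zero-distortion layer first
        + ["[RECALLED] " + r for r in recalled]             # rehydrated summaries
        + ["[RECENT] " + m for m in memory.buffer]          # full-fidelity recent turns
    )
```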

The Rate-Distortion Engineering

This is rate-distortion theory in action. The context window is a fixed budget of tokens, so the more source information you try to cover, the fewer tokens each piece gets:

  • High rate (more source information packed into the window) = high distortion (less detail survives for each piece)
  • Low rate (less source information included) = low distortion (what's included is kept at near-full fidelity)

Letta's genius is creating a distortion ladder: each memory layer accepts different levels of information loss to preserve what matters most at that timescale.
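
A back-of-the-envelope illustration with made-up numbers: spreading a fixed 8,000-token window over more and more conversation history leaves fewer tokens, and therefore less detail, per turn.

```python
WINDOW = 8_000    # fixed context budget in tokens (illustrative number)
AVG_TURN = 400    # average size of an uncompressed conversation turn (illustrative number)

for turns_covered in (10, 50, 200):
    tokens_per_turn = WINDOW / turns_covered
    compression_ratio = AVG_TURN / tokens_per_turn  # how much each turn must shrink (<1 means no compression needed)
    print(f"{turns_covered:>4} turns -> {tokens_per_turn:6.0f} tokens/turn, "
          f"~{compression_ratio:4.1f}x compression required")
```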

Multi-Agent: Parallel Compression

Anthropic's approach distributes the compression workload:

  • Spawn subagents with independent context windows
  • Each compresses a different slice of the problem space
  • Lead agent aggregates compressed outputs
  • System covers more ground than single context allows

The Advantage: Parallel processing of information streams
The Cost: Coordination overhead and potential inconsistency
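
A minimal sketch of the pattern, not Anthropic's implementation: each subagent compresses one slice of the problem in its own context, and the lead agent aggregates the compressed outputs. `compress_slice` and `lead_agent` are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

def compress_slice(slice_name: str, documents: list[str]) -> str:
    """Stand-in for a subagent with its own context window: reads a slice, returns a compressed report."""
    # In a real system this would be an LLM call; here we fake it with a one-line digest.
    return f"{slice_name}: {len(documents)} docs, key points elided"

def lead_agent(task_slices: dict[str, list[str]]) -> str:
    """Lead agent: fan out compression in parallel, then aggregate the compressed views."""
    with ThreadPoolExecutor() as pool:
        reports = pool.map(lambda kv: compress_slice(*kv), task_slices.items())
    # Aggregation is where the critique below bites: each report is locally coherent,
    # but nothing guarantees the reports are globally consistent with one another.
    return "\n".join(reports)

# Example: three research slices fanned out to subagents
print(lead_agent({"papers": ["a", "b"], "code": ["repo1"], "benchmarks": ["suite"]}))
```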

The Cognition Critique: Compression Without Coordination Fails

Walden Yan's key insight: Multiple agents compressing independently create context fragmentation. Each agent's compressed view might be locally coherent but globally inconsistent.

Their Solution: One agent, one context, with dedicated compression models that maintain a coherent thread of reasoning.

The Meta-Insight: Compression AS Cognition

The profound realization is that memory compression isn't just storage optimization - it's thinking itself:

  • Summarizing conversations forces understanding of what matters
  • Querying archives requires reasoning about relevance
  • Choosing what to forget demonstrates priority judgment
  • Deciding what to recall shows contextual awareness

Human parallel: We don't record everything. We generalize experiences, forget details, but can recall specifics when contextually triggered. Our "intelligence" emerges from this selective compression-decompression cycle.

The Future Convergence

The optimal solution likely combines both approaches:

  • Shared Architectural Memory: Multi-agents connected to common Letta-style memory hierarchy
  • Sophisticated Single Agents: Better internal compression with enormous contexts
  • Hybrid Strategies: Context-dependent switching between single and multi-agent modes

The Engineering Challenge

Building effective AI minds means engineering the art of forgetting. The question isn't how to remember everything - it's how to forget intelligently while maintaining the ability to reconstruct what matters when it matters.

Bottom Line: In artificial minds as in human ones, memory is meaningful precisely because it's imperfect. Intelligence emerges not from perfect recording, but from prioritized, lossy compression that stays alive and adaptive.

The compression IS the cognition. The forgetting IS the intelligence. The middle-out approach shows us that the most important layer isn't the top or bottom - it's the dynamic boundary where compression decisions get made in real-time.
