Old Alignment Forum post

Epistemic Status: Highly speculative. I speak informally whenever possible, but use technical terms to point at things when no common language is satisfactory. This isn't meant to imply certainty or prescriptiveness: the intention here is to gesture at something powerful with the rhetorical tools I have available. For brevity I factor out uncertainty into this warning and present all conjecture as fact, so insert "I suspect" before everything that sounds like an assertion.

TL;DR: A rough account of the meaning of reflection in multilevel inductors

Motivation

The reference "implementation" for logical induction has two components

  1. A market of bids between polynomial-time traders, and

  2. The polynomial-time traders themselves

I hope to build intuition that these are two fundamentally distinct types of computation, existing in distinct mathematical universes that nevertheless interact on reflection.
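
To fix intuitions, here is a minimal toy sketch of that two-component shape. The names and the price-update rule are my own illustrative simplifications, not the construction from the logical induction paper:

```python
from typing import Dict, List, Protocol

# Hypothetical, simplified interfaces -- illustrative only.
Prices = Dict[str, float]   # sentence name -> market price in [0, 1]
Trades = Dict[str, float]   # sentence name -> shares to buy (+) or sell (-)


class Trader(Protocol):
    """Component 2: a (polynomial-time) trader.

    Given the history of market prices, it outputs a bundle of trades."""
    def trade(self, price_history: List[Prices]) -> Trades: ...


def run_market(traders: List[Trader], price_history: List[Prices]) -> Prices:
    """Component 1: the market layer.

    The real construction finds prices approximately stable against the
    aggregate of all traders; here we just nudge prices toward net demand."""
    last = price_history[-1] if price_history else {}
    demand: Dict[str, float] = {}
    for t in traders:
        for sentence, qty in t.trade(price_history).items():
            demand[sentence] = demand.get(sentence, 0.0) + qty
    return {s: min(1.0, max(0.0, last.get(s, 0.5) + 0.01 * q))
            for s, q in demand.items()}
```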

Stepping back from the concrete logical induction problem for a moment, informally consider traders as agents holding some "beliefs" and using a "strategy" to inform their bidding "behavior" in the market. Viewed externally, other entities only see their behavior pattern. We can interpret the entire market as an intelligent agent trying to make accurate predictions using the wisdom of the crowd. The individual traders are then special types of subagents, concurrently mesa-optimizing for their own reward. In some sense they're trying to predict the behavior of the entire market in order to exploit it, despite being limited (this evidently shares some intuition with HCH, but it's too early to explore that now).

This frame so far is just an evocative interpretation: A true reference inductor contains all possible bounded strategies, some of which may be "trying to predict" and others behaving wildly - the market layer sorts out which are more reasonable. But the reference inductor is intractably slow, so we may wish to speed it up dramatically by running fewer participants in the market, possibly at some bounded accuracy cost. Taking the evocative interpretation seriously, can we devise a sort of agent which is definitely "trying to predict", despite being a limited fragment of the market?

Reflection

Good traders might like to "reflect" on the beliefs of other traders, so they can detect and exploit structural bias. We as researchers might also like to reflect on traders for transparency reasons, so it's worth understanding exactly what's happening here. Instead of the asymmetry between a global market and limited traders, we may consider the behavior of traders themselves as an "open market of ideas", based on some shared foundational beliefs we ascribe to the trader itself.

Taking an operational view, consider the traders as something like RL agents, with some parameters representing their beliefs and an algorithm for turning those parameters into a strategy. A strategy makes bids to the market based on visibility of other traders' bids; the trader then updates its beliefs in response to winning or losing those bids, paying out to or collecting payment from the market.
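
A toy caricature of such a trader, to make the operational reading concrete (the sigmoid parameterization and the reinforcement rule are illustrative assumptions, not any reference algorithm):

```python
import numpy as np

class RLTrader:
    """Toy trader: beliefs are parameters; a fixed algorithm turns them into bids."""

    def __init__(self, n_propositions: int, lr: float = 0.05, seed: int = 0):
        rng = np.random.default_rng(seed)
        # "Beliefs": one logit per proposition, mapped to a credence in [0, 1].
        self.logits = rng.normal(size=n_propositions)
        self.lr = lr

    def credences(self) -> np.ndarray:
        return 1.0 / (1.0 + np.exp(-self.logits))

    def bid(self, market_prices: np.ndarray) -> np.ndarray:
        # Strategy: buy where our credence exceeds the market price, sell where
        # it falls short.  Other traders see only these bids, not the logits.
        return self.credences() - market_prices

    def update(self, bids: np.ndarray, payouts: np.ndarray) -> None:
        # Reinforce beliefs in the direction of bids that paid off, and away
        # from bids that lost money -- a crude learning rule, nothing more.
        self.logits += self.lr * np.sign(bids) * payouts
```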

By "beliefs" we evocatively suggest traders parameters contain propositions, and might for example take the form of a belief network. Traders "learn" these beliefs by interacting with the market, but viewing the entire market as an agent, we can also interpret learning as "reflecting" the market's beliefs down into individual traders. On the other hand, when traders affect the market by bidding in it, they're teaching the market something about their local beliefs, "reifying" them into the portfolio. We informally call the entire capability of reifying or reflecting "Reflection"

Belief/Behavior Duality

Behaviours are things that "actually happen in the world", while models/propositions are primarily platonic objects. We commonly assert propositions "about" behaviour in the world, so how can we embed such pointers in a belief network when the two live in fundamentally different universes? Conversely, the world contains embedded agents, so actual behaviours in the world depend on the logical truth of computations they're running (e.g. go left if 847631487509320457 is prime).

Taking a "God's Eye View", we can relate the two with a compatibility relation like M×S→⟨⋅,⋅⟩[0,1], interpreted loosely as "the degree to which m:M is consistent with world s:S . This is almost like asking P(M×S′) but notice the types are different: the joint probability assumes we've already lifted the world into a model, but compatibility lets us work with "unlifted" worlds directly. As intuitive edge-cases, notice that If ⟨m,s⟩=1 then s eventually takes all the money in a world where m is true. If ⟨m,s⟩=0 then s will definitely go bankrupt in such a world. This gives some external "meaning" to the beliefs of an agent, in the absence of a tangible mapping from parameters to behaviours, but note that compatibility is not generally something we compute but rather an idealized limit of logical truth.

Ex: The stock market doesn't have hidden parameters somewhere that we can read off to figure out what it "believes" about the world. The efficient market hypothesis suggests its beliefs will be embodied directly in the pricing of assets (i.e. an asset will cost X if the market believes it's worth X), but this is only true close to equilibrium, and one might want to ask about the beliefs of a non-equilibrium system.

Using a push-pull formalism, we can push the compatibility kernel forward to models or behaviours by marginalizing: the "truth" of a model is its compatibility with all possible worlds, and the "fitness" of a behaviour is its compatibility with all possible beliefs (its capacity to adapt to reality). We can also pull-push a behaviour (encoded as the indicator (s:S)↦1) into a belief M→[0,1] characterizing the "minimal model embodied by s". Conversely, pull-pushing an indicator (m:M)↦1 generates the "fittest behaviour assuming m". Generalizing indicator behaviours (s:S)↦1 to arbitrary functions S→[0,1] can be interpreted internally as a sort of "utility" or "oughtness". On the other hand, any likelihood L : M→[0,1] is equivalent to a weighted mixture of models, (∑_{m:M} L(m)×m)↦1. So this all amounts to a means of translating between probability and utility through an underlying "ProbUtil".
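
As an entirely toy illustration, with finite M and S the push-pull operations reduce to sums over a compatibility matrix. The numbers and the uniform weighting below are made up:

```python
import numpy as np

# Toy compatibility kernel K[i, j] = <m_i, s_j> for 3 models and 4 behaviours.
K = np.array([
    [1.0, 0.7, 0.2, 0.0],
    [0.3, 0.9, 0.5, 0.1],
    [0.0, 0.2, 0.8, 1.0],
])

# Pushforward to models: the "truth" of each model is its average
# compatibility over all behaviours (uniform weighting, for simplicity).
truth = K.mean(axis=1)            # shape (3,)

# Pushforward to behaviours: the "fitness" of each behaviour is its average
# compatibility over all models.
fitness = K.mean(axis=0)          # shape (4,)

# Pull-push a single behaviour s_j (the indicator s_j |-> 1) into a belief
# over models: the "minimal model embodied by s_j", here just the column K[:, j].
def minimal_model(j: int) -> np.ndarray:
    return K[:, j]

# Pull-push a single model m_i into the "fittest behaviour assuming m_i":
# score every behaviour by its compatibility with m_i and take the best one.
def fittest_behaviour(i: int) -> int:
    return int(np.argmax(K[i, :]))

# A likelihood L : M -> [0,1] acts as a weighted mixture of models; pushing it
# through K gives a utility-like ("oughtness") scoring of behaviours.
L = np.array([0.2, 0.5, 0.3])
oughtness = L @ K                 # shape (4,): one score per behaviour
```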

Mysteries

This picture leaves a number of mysteries which together constitute something like a research agenda:

  • The reflective story I've laid out moves up or down a level every time we reify or reflect, but agents expect to be both subject and object under reflection (e.g. reflecting on my own beliefs), so there's some sort of knot-tying that needs to happen at the top level. I expect this to be related to the boundary problem, and to give a localized meaning to the compatibility measure.

  • Boundary problem: We assume here some fixed collection of traders with well-specified input and output interfaces, but the underlying reality is that agents emerge naturally from selection/free-energy arguments, superagents and subagents overlap, and superagents may spin up mesa-agents accidentally. I believe generalizing existing agent-based frameworks to this fuzzy reality is the most fundamental problem in AI safety. The general approach suggested by this post is to consider (not-necessarily-agenty) behaviours as a natural space in which to draw such boundaries, and to consider behaviours "agenty" when they embody a large amount of belief under reflection.

  • Bounded computation: Nested agent markets are not obviously efficient the way polynomial-by-definition traders are; can we make any guarantees about resource use under reflection? It seems like there should be a notion of "linear" behaviour, with nonlinear complexity introduced only by crossing reflection boundaries. This should be straightforward to check once everything else is formalized.

  • Formal categories: Push-pull through ProbUtil evidently gives an adjunction, but what are the morphisms transported in each case? I.e., what are the morphisms for a category of belief networks and for a category of backprop learners? Some hints exist in backprop-as-a-functor / open-games approaches and Categorical Probability Theory, but this needs to be grounded out more. I expect both to have quite a lot of additional structure (e.g. double categories) to work with.

  • Collapsing Beliefs: The 'Utility ↔ Probability' translation evidently gives an adjoint, letting us construct a Reflection Monad to collapse multi-level agents into a single market of simple agents, but the actual construction depends sensitively on the categories we choose for each.

  • Value Trajectories:

The Behaviour/Utility side of the adjoint gives a notion of intention for each agent: at each step an agent moves up its oughtness gradient, which locally looks like a direction in agent space (an action is equivalent to a self-modification under this paradigm). That is, this view of intention only talks about immediate actions, with no reference to final goal states; those can be imputed via integration but are otherwise unnatural.
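
In the simplest differentiable caricature (assuming, purely for illustration, that agent space is a parameter vector a and "oughtness" a differentiable function U on it), the intention at each step is just the local ascent direction:

```latex
\[
  a_{t+1} \;=\; a_t \;+\; \eta\, \nabla_{a} U(a_t).
\]
```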

As an agent learns, its values cannot remain stationary because of ontological shift: the space of possible actions changes every moment. We can however ask for a notion of "minimal drift", a preservation of intention from the perspective of the agent. My intuition leads towards a sort of affine connection between agents with different knowledge states, but there's clearly a lot of formalization work before this makes sense.

Given such a connection, we could ask questions about how value preservation over time corresponds to belief drift over time, as well as compute an interaction sequence to "synthesize" two agents to an overlapping ontology where value transfer is possible and automatic.

There's a lot more to say about each of these mysteries, but they're self-contained enough to warrant separate discussion - considering one seriously does not immediately require the rest although I believe they're ultimately aspects of the same general problem.


Here is my take:
Value is a function of the entire state space, and can't be neatly decomposed as a sum of subgames.
Rather (dually), value on ("quotient") subgames must be confluent with the total value on the joint game.
