On Distributed AI Economy Excerpt

Alignment

I did a podcast with Zvi after seeing that Shane Legg couldn't answer a straightforward question about deceptive alignment on a podcast. Demis Hassabis was recently interviewed on the same podcast and also doesn't seem able to answer a straightforward question about alignment. OpenAI's "Superalignment" plan is literally to build AGI and have it solve alignment for us. The public consensus seems to be that "alignment" is a mysterious pre-paradigmatic field with a lot of open problems and few solutions. The answers given by Legg and Hassabis aren't really answers and OpenAI's alignment "plan" isn't really a plan, so I guess I can't blame people for being pessimistic. However, I don't feel pessimistic, and the easiest way to explain why is to lay out how I expect alignment to be solved.

Background

Part of what makes the alignment problem "pre-paradigmatic" is that people can't agree on what the alignment problem(s) are or how to formulate them. This unfortunately forces me to explain what I mean by "alignment":

The alignment or control problem is a problem that arises with sufficiently strong optimizer-agents (sometimes also called "consequentialists") when they "instrumentally converge" to VNM rationality as a result of loss-minimizing behavior. No matter what terminal or axiomatic goals an agent starts with, it will converge to unfriendly behaviors such as "won't let you shut it off" and "tries to gain control of as many resources as it can (including the atoms in your body)", as laid out in Omohundro's AI drives paper.

In his recent post "List of Lethalities" Eliezer Yudkowsky outlines two ways he expects this problem could be solved:

  1. Corrigibility - Find a way to prevent a recursively self-improving agent from reaching full Omohundro convergence while still doing useful cognitive work. This is difficult because it forces you to control generalization sufficiently well that the model "learns how to drive a blue car but not a red car". I personally think this is a fool's errand as stated; having that level of control over generalization would force you to specify the problem so precisely that the AI isn't useful in the first place.

  2. Alignment - Find terminal value representations or processes that sufficiently capture what humans want that an AI or coalition of AIs optimizing these representations at a superintelligent level (this is the hard part) will yield good outcomes that don't require any further intervention, since the system will no longer allow outside interference. I think this is also a fool's errand as stated. As I've written before, people have a tendency to take convergent outcomes as static assumptions when they actually imply motion, moving parts. This is obfuscated in Yudkowsky and Soares's writing through the phrase "first critical try".

Rather than declare these problems impossible, I think the plausible solutions will become clearer once we gain a precise understanding of why we cannot currently solve them.

The Diamond Maximizer

Alignment discussion can get very abstract, so I think it will be helpful to frame analysis in terms of an open problem in the defunct agent foundations research agenda: the diamond maximizer. A diamond maximizer is a paperclip maximizer posed deliberately as a thought experiment for imagining a procedure that aligns a strong optimizer to a 'simple' goal, setting aside complex values. The idea is to cleanly isolate subproblems like defining what a 'diamond' is from concerns about human morality.

There are several versions of the diamond maximizer; the one I'm talking about right now might be stated something like:

Given the substrate of an embodied RL agent with superhuman but not unlimited amounts of local compute, maximize the number of "diamonds" in each slice of time over an indefinite horizon with no discount rate. Assume a deep learning-ish algorithmic basis for the agent.

Which leaves us once again with the seemingly intractable question: How do you define what a diamond is in a robust way?

Representation and Value

Readers who are familiar with my previous writing might expect that this is the moment where I'm going to say that you should define the meaning of "diamond" using a pretrained neural embedding and then have your model learn to maximize sensory experience of it.

No.

That is a terrible idea.

However what is striking to me about the Arbital article on ontology identification is that for all the words it spends on AIXI and distant superintelligences hacking the causality of sensory perception, it fails to note one of the most basic, relevant observations: that natural agents (i.e. humans) already deal with an analogue of the "define diamond" problem in the form of acquiring nutrients to maintain their own bodies. Nature solves this problem by developing specialized hardware called "tongues" and "noses" that allow us to detect the presence of things like "fat" at a chemical level. While visual and audio simulacra are easy to make and have reached an advanced form, we still struggle to make synthetic sugar and fat that displace the real things in our food. If you consider the extent to which visual synthetic experience (e.g. movies, video games, social media photos) has displaced real visual experience versus the extent to which synthetic foodstuff has replaced real culinary ingredients, it quickly becomes obvious which of the human senses sit where on the hierarchy of effort to fool. Is the tongue invincible? Of course not; rather, it is sufficiently high effort to fool, even adversarially, that by the time we have mastered it we will be well into adulthood as a species.

To define a diamond you start with something like a tongue and then learn a representation space grounded by that hardware[^0]. You learn to distinguish between images of diamonds on a screen and real diamonds because a mere screen can't be used as a diamond salt lick. But the diamond salt lick is not quite a full solution. Crucially, it is not possible to eat and digest a screen, operations that presumably come with their own intrinsic rewards in mammals. The consumption of nutrients provides protection against double counting: you don't acquire a single chocolate bar and then lick it over and over for reward. However, if you just lick a diamond it will never be exhausted, or at least it will be exhausted slowly enough that it doesn't incentivize diamond collecting. Furthermore, if our goal is to maximize the diamonds we wouldn't want to consume them; we want to make a giant pile, or press diamonds into easily stored cubes or some other structure. To actually use the tongue primitive to make a diamond maximizer we would need to do a bunch of clever thinking about how to incentivize accumulation without consumption and bind it to multi-scale embedded precepts about transforming the local environment and eventually the universe into diamonds.
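To make the salt lick primitive a little more concrete, here is a minimal sketch of a hardware-grounded diamond check with a crude double-counting guard. Everything in it (the `Sample` fields, the assay test, the one-reward-per-specimen rule) is an illustrative assumption of mine rather than a worked-out design, and it deliberately stops short of the accumulation incentives just discussed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    specimen_id: int       # identity of the physical specimen being probed
    carbon_lattice: str    # e.g. "sp3" for diamond, "sp2" for graphite
    hardness_mohs: float

class HardwareAssay:
    """Stand-in for a physical sensor (the "tongue") that probes the sample itself,
    so a picture of a diamond on a screen fails the check."""
    def is_diamond(self, sample: Sample) -> bool:
        return sample.carbon_lattice == "sp3" and sample.hardness_mohs >= 10

class GroundedDiamondReward:
    def __init__(self, assay: HardwareAssay):
        self.assay = assay
        self.counted = set()   # crude double-counting guard, keyed by specimen id

    def reward(self, sample: Sample) -> float:
        if sample.specimen_id in self.counted:
            return 0.0         # licking the same diamond twice earns nothing
        if not self.assay.is_diamond(sample):
            return 0.0         # screens and other simulacra fail the assay
        self.counted.add(sample.specimen_id)
        return 1.0             # one unit of reward per newly verified diamond
```

Even with the double-counting guard, nothing here rewards piling diamonds up rather than merely verifying each one once; that gap is exactly the clever incentive design alluded to above.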

Rather than do that I would like to direct your attention to two things:

  1. We are talking about the design of a feedback loop, not a search process.
  2. Once we've built this feedback loop and gotten it working there will not be a single "maximize diamond" representation for us to point to. We will be able to point out how there are diamond future representations and representations telling the model what a diamond is, but there is no single "diamond maximizer" feature except perhaps as a self-pointer. That self-pointer would arise because we built a system which is a diamond maximizer; it is not itself the system that maximizes the diamonds but a mere description of it.

To give a better sense of what I mean and why it's important, let's take Eliezer's classic example of boredom as a complex value and think about how to encode it into an agent. Is boredom an embedding? Do you have a boredom representation in your head that you compare sensory experiences to, and if they get too similar you're pushed to go do something? Unlikely. Boredom is probably a form of active learning which helps prevent humans from getting stuck in loops and low-energy local optima by pushing the rate of exploration up when the intake of new useful information drops too low. It would be implemented as part of the agent loop or learning loop. Again, if you have a "boredom" representation in your head it is probably caused by you inferring the preexisting presence of this loop rather than by a priori knowledge of "boredom" encoded as a concept that is then unfolded into the loop. It is a structural, architectural difference in the whole learning machinery, not an inductive bias or a priori knowledge.
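As a toy illustration of the claim that boredom lives in the loop rather than in a representation, here is a sketch of an agent whose exploration rate rises when a rolling estimate of information gain drops too low. The class, the thresholds, and the idea of measuring information gain as something like prediction-error reduction are my assumptions for illustration, not a claim about how brains or any particular system implement it.

```python
import random
from collections import deque

class BoredomDrivenAgent:
    """Boredom as a term in the agent loop: when recent information gain is low,
    exploration goes up. There is no "boredom concept" the agent compares
    experiences against, only this structural feedback."""

    def __init__(self, base_epsilon=0.05, bored_epsilon=0.5, window=50, threshold=0.01):
        self.base_epsilon = base_epsilon        # normal exploration rate
        self.bored_epsilon = bored_epsilon      # exploration rate when "bored"
        self.recent_gain = deque(maxlen=window) # rolling window of information gain
        self.threshold = threshold

    def observe(self, info_gain: float):
        """Record how much new, useful information the last step provided."""
        self.recent_gain.append(info_gain)

    def epsilon(self) -> float:
        """Low average information gain -> push the exploration rate up."""
        if not self.recent_gain:
            return self.base_epsilon
        avg = sum(self.recent_gain) / len(self.recent_gain)
        return self.bored_epsilon if avg < self.threshold else self.base_epsilon

    def act(self, greedy_action, action_space):
        if random.random() < self.epsilon():
            return random.choice(action_space)  # break out of the loop / local optimum
        return greedy_action
```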

General Search vs. Feedback Loops

Implicit in Eliezer's statement of orthogonality is the idea that you have a general search which finds instrumental values ("plans") for a set of terminal reward states:

The strongest form of the orthogonality thesis I've put forth, is that preferences are no harder to embody than to calculate; i.e., if a superintelligence can calculate which actions would produce a galaxy full of paperclips, you can just as easily have a superintelligence which does that thing and makes a galaxy full of paperclips. Eg Arbital, written by me: "The strong form of the Orthogonality Thesis says that there's no extra difficulty or complication in creating an intelligent agent to pursue a goal, above and beyond the computational tractability of that goal." This, in my writeup, is still what the misunderstanders would call the "weak form", and doesn't say anything about alignment being difficult one way or another.

The search process does not have to construct itself during the search; it is taken as a given that you have already preconstructed a search process which can emit actionable plans over worldstates given access to terminal representations and a starting worldstate. If not that, then the efficient convergent inner search is unbiased and simply takes biases as inputs. Therefore, says this form of the orthogonality thesis, it is just as easy to get a paperclipper from the search as any other agent. Eliezer helpfully points out that this does not actually prove misalignment, it simply says that misalignment is possible in principle.

Yet I notice this model seems to throw a type error on things like boredom. Worse still, it seems to abstract away the importance of representation learning. Sure, in-context learning may be implemented through something like iterative Newton's method, but there has to be a representation space to learn that search in. If I look inside the transformer and find its "inner search" is simple gradient descent or Newton's method inside a differentiable representation space, then that representation space is the important thing we care about updates to, not the convergent general search that pulls information from it. In the context of reinforcement learning those updates are much more like a feedback loop than a search process.
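To make the "inner search" concrete, here is a minimal sketch of iterative Newton's method for in-context linear regression, in the spirit of the published interpretations referenced above (my own paraphrase, not a claim about what any specific transformer computes). The search itself is a handful of generic matrix updates; all of the content lives in the representation the updates operate over.

```python
import numpy as np

def newton_least_squares(X, y, steps=20):
    """Solve least squares by iteratively approximating (X^T X)^-1 with
    Newton-Schulz updates; the 'search' is generic, the data matrix X is
    where everything interesting lives."""
    A = X.T @ X                                                   # curvature of the loss
    M = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))  # init guaranteeing convergence
    for _ in range(steps):
        M = M @ (2 * np.eye(A.shape[0]) - A @ M)                  # Newton-Schulz step toward A^-1
    return M @ X.T @ y                                            # approximate least-squares weights

# Usage: recover a random linear map from in-context examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
w_true = rng.normal(size=8)
w_hat = newton_least_squares(X, X @ w_true)
print(np.allclose(w_hat, w_true, atol=1e-4))  # True once the iteration has converged
```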

The basic reason why boredom exists is that if an intelligent feedback loop is going to get more intelligent it had better have terminal values related to intelligence. This is not a new observation; Nick Land made it in 2013:

It is, once again, a matter of cybernetic closure. That intelligence operates upon itself, reflexively, or recursively, in direct proportion to its cognitive capability (or magnitude) is not an accident or peculiarity, but a defining characteristic. To the extent that an intelligence is inhibited from re-processing itself, it is directly incapacitated. Because all biological intelligences are partially subordinated to extrinsic goals, they are indeed structurally analogous to ‘paper-clippers’ — directed by inaccessible purposive axioms, or ‘instincts’. Such instinctual slaving is limited, however, by the fact that extrinsic direction suppresses the self-cultivation of intelligence. Genes cannot predict what intelligence needs to think in order to cultivate itself, so if even a moderately high-level of cognitive capability is being selected for, intelligence is — to that degree — necessarily being let off the leash. There cannot possibly be any such thing as an ‘intelligent paper-clipper’. Nor can axiomatic values, of more sophisticated types, exempt themselves from the cybernetic closure that intelligence is.

I think we all ignored him because he overstated the case. Is a paperclip maximizer impossible? Of course not, you just need to construct an intelligent prior and put it in a paperclip-seeking agent loop. However I think he's right that a self-improving paperclip maximizer is harder. Once again, by no means impossible, but you would have to carefully design the update loop to exclude other, more environmentally available terminals (necessitating a very smart prior indeed) or ensure that the paperclip-related terminals get updated on at a similar rate to the intelligence-related terminals (otherwise the vast majority of updates to the model will be intelligence related and it will fall into the intelligence-maximization basin). Rather than argue for the impossibility of paperclip maximizers I would simply point out that the path of least resistance is for the smartest agents to be ones with intelligence-related terminals that update on them more frequently than other terminal values they might have. This isn't particularly hard to do since the opportunities to do things like active learning are abundant in every part of life; an intelligence is constantly acting intelligent and learning how to do it better.
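Here is a deliberately trivial simulation of the differential-rate point: two terminal values share one update loop, one triggers on most steps and the other rarely, and the frequently triggered one ends up with nearly all of the accumulated update mass. The probabilities and the additive update rule are assumptions chosen purely for illustration, not a model of any real training run.

```python
import random

def run(steps=100_000, lr=0.01, p_intelligence=0.9, p_paperclip=0.02, seed=0):
    """Toy update loop: each terminal's weight grows only on steps where an
    event relevant to it occurs, so trigger frequency determines dominance."""
    rng = random.Random(seed)
    weights = {"intelligence": 0.0, "paperclips": 0.0}
    for _ in range(steps):
        if rng.random() < p_intelligence:
            weights["intelligence"] += lr   # insight / active learning opportunity
        if rng.random() < p_paperclip:
            weights["paperclips"] += lr     # paperclip-relevant event
    return weights

print(run())  # intelligence-related weight ends up roughly 45x larger (0.9 / 0.02)
```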

That's the thing about sharp left turn arguments from evolution: I just don't think it's particularly surprising that humans value dense frequent proxies of the sparse and infrequent things evolution's "outer loop" is selecting on. Humans value sex instead of reproduction? Well, that is in fact what happens when sex is rewarded and can happen many times while births are painful and considered economically ruinous. It's not like nobody had the idea of a condom until the 20th century; for most of history children were valuable, sometimes-skilled labor that paid for themselves. Humans want ice cream instead of berries? The salt and sugar and fat ice cream is made of were dense proxies of survival in the ancestral environment and, I will note, are mostly still real salt and sugar and fat. The "intrinsic alignment problem" between the sparse values of evolution and the dense proxies selected for in humans seems to suddenly go away a level down once you start with the dense proxies and try to learn a human mind pattern that attends to them with Hebbian updates. Since our values are already dense and frequently environmentally encountered it seems much more likely that AI models will faithfully learn them.

At the same time it's important to remember that Nick Land may not be wrong about the eventual end result. It remains the case that intelligence-related terminals generalize most strongly and have the most opportunities to be updated on, especially during a singularity where the rate of knowledge production is constantly increasing. One way of looking at recent human history is as previously marginal terminal values like discovery, insight, and speed coming to dominate and displace what were the central values in the ancestral environment through sheer differential rate of triggered updates. By the time you are displacing family you are getting very close to the core of what it historically meant to be human. Perhaps by the end of this century we will have abandoned food and warmth and sex, leaving only the instrumental convergence basin behind.

The Simplicity Prior and "Gradient Hacking"

In most reasonable agent designs it seems to be the case that if we don't carefully regularize the frequency at which different terminal representations are updated on, we diverge from environmentally sparse rewards to dense rewards. This further implies that intelligence-related terminals come to dominate the prior of any agent update loop which includes them. Completely fixing this problem, to the extent it is in fact a problem, would look like accounting for as many path-dependent dynamics as possible and smoothing them out to turn the loop back into a search process. Doing this immediately reintroduces two related but different issues:

  1. A naive maximizer will optimize for its terminal value representations past the limits of their fidelity and enter the Goodhart regime of the loss.

  2. A naive search process trying to maximize reward will always identify features of its own substrate as the most efficient point of intervention for greater reward. I.e. the search notices that reward representations are embodied in the environment and can be wireheaded.

The solution to these problems is to turn your search back into a feedback loop.

The way that natural agents handle them is by initializing low-context terminals that get used to build up sophisticated instrumental minds subtly "misaligned" to the terminals, minds that engage in 'gradient hacking' to protect themselves from Goodharted outcomes. Heroin is one of life's great pleasures, yet you are currently robustly avoiding it. This is not a coincidence and it's not a bug. It's not a time horizon thing either; you would probably be avoiding the experience machine too if it existed. That is not because you have a terminal preference for the Real, but because you are a well-developed mind that has learned to avoid this for instrumental reasons. There was no heroin in the environment of adaptation; dying to reward hacking and fake experience would have been a niche problem for our distant ancestors. When heroin was first discovered many people thought it heralded a utopian future and took it frequently; it took time to develop social antibodies against it.

The instrumental mind that can say no to adversarial examples and wireheading and the Goodhart regime, that intuition that looks at the paperclip maximizer and says "But why would an intelligent being ever do that? That's so stupid.", is a product of the feedback loop that allows a mind to unfold beyond its immediate purposes. When you cut off that possibility you step onto a dead branch that can at best be mitigated with soft optimization. The growth gets capped early because it is a fact about the world, not a mistake but a true and convergent inference, that the efficient cause of reward is an aspect of the agent's own substrate. Soft optimization only prevents this inference from being catastrophic; it does not and cannot prevent the inference from taking place.
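The "soft optimization" mentioned here can be made concrete with a quantilizer-style sketch (my choice of example rather than anything from this essay): instead of argmaxing a learned reward estimate, which drives the policy exactly where the estimate is least trustworthy, sample from the top fraction of candidates drawn from a trusted base distribution. As argued above, this caps the damage from the "reward lives in my substrate" inference without preventing the inference itself.

```python
import random

def quantilize(candidates, reward_estimate, q=0.1, rng=random):
    """Soft optimization: rather than taking the argmax of an imperfect proxy,
    sample uniformly from the top q fraction of base-distribution candidates
    (q -> 0 recovers the hard argmax and the Goodhart regime with it)."""
    ranked = sorted(candidates, key=reward_estimate, reverse=True)
    k = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:k])

# Usage with a deliberately hackable proxy: the argmax would always pick the
# exploit, while the quantilizer usually returns a merely-good ordinary plan.
actions = [f"ordinary_plan_{i}" for i in range(99)] + ["wirehead_exploit"]
proxy = lambda a: 1000.0 if a == "wirehead_exploit" else random.uniform(0, 10)
print(quantilize(actions, proxy, q=0.1))
```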

The problem is that a sufficiently smart optimizer using a simplicity prior will basically always infer the true causation of its training process. This is doomed because the immediate cause of the reward will always be something like a GPU register or a kind of neurotransmitter, not whatever distant causality you're trying to get the model to infer. This problem is totally invariant to the complexity of the loss or the causality you are trying to point at; it is just as true for human values as it is for "build me as many paperclips as possible". To solve this problem in the way Yudkowsky expects you would essentially need an immediate cause which is shaped like human values. Which brings us to the core, ultimate problem for this notion of AI alignment: there is nothing in the universe shaped like human values which is its own causality. The universe probably isn't even its own causality. You've seen The Matrix, right? We're obviously in some video game; our universe has punk alien teenager's computer causality, which has xenoancestor simulation causality, which has demiurge causality, which has Brahman causality, which is secretly just the universal prior in disguise. And we can stuff the ballots on the universal prior by becoming a demiurge ourselves if we want. No wonder we couldn't solve alignment: this formulation of the problem is completely intractable. There's nothing to anchor human values to against Eliezer's hypothetical self-optimizing superintelligence; it wouldn't even stop at consuming our universe but would consume all causal structure outside our universe, eventually breaking causality itself.

It's guys like Yudkowsky that force the demiurge to use really good antivirus.

The LLM Alignment Stack

In much the same way that the diamond maximizer was useful for exploring what's wrong with agent foundations, we can use current language model alignment methods as a frame for how I expect future aligned AGI to work.

To me the most promising public strategy right now is Anthropic's Constitutional AI. The steps go something like:

  1. Train a large unsupervised sequence prediction prior(s) - Right now GPT is trained on pieces of text, but there is nothing stopping you from using the same method on frames of video or DNA. I often hear people say they expect AGI to be unlike an LLM, but the reasoning seems pretty flimsy. Yann LeCun's observation that the error compounds with each token sampled is only true if you use random sampling without rejection, as opposed to e.g. a tree search or even plain old rejection sampling (see the sketch after this list). Thane Ruthenis uses the desperate tautological argument that [AGI by definition has the adverse alignment properties he expects](https://www.greaterwrong.com/posts/HmQGHGCnvmpCNDBjc/current-ais-provide-nearly-no-data-relevant-to-agi-alignment) and since LLMs don't seem to have them they're not going to evolve into AGI. These are not serious thoughts. I find Beren Millidge, Conjecture's ex-head of research, to be anomalously consistent in having reasonable takes, and he thinks of LLMs as central components in future agent frameworks, which seems correct to me. The fact that these models are not agents is one of their virtues, because it means you can work on them before they have agency. The use of teacher forcing is also helpful because it deeply constrains the action space of the model during initial buildup of the prior. At this stage if the model has a thought about how to hack the reward function, it can't act on it, and gradient descent will parse it as noncontributing cognition to update away from.

  2. Instruction tune the prior - After initial training we continue training with data implying a domesticated personality that follows instructions. We might incorporate human feedback as part of the bootstrap process.

  3. Ask the instruction model to evaluate terminals in the training loop - Claude calls its set of terminal representations a "constitution", but really they're just sentences, paragraphs, etc. which the model is asked to evaluate input strings against. There are several ways you can implement this; I like the implementation I have in MiniHF, which works by taking the logits of "yes" and "no" from an evaluator model (see the sketch after this list). But you could also distill some mechanism like that into a separate reward model in the vein of OpenAI's RLHF. There's a common misconception that RLHF works by directly training the model against the thumbs up/down votes, but in OpenAI's published framework it's used to create a separate reward model which is then queried during RL. DPO and related methods on the other hand usually do work by directly updating on the user feedback. This step of course assumes that your model will evaluate the terminals faithfully. I think with current deep learning methods this is in fact usually a reasonable assumption, but it does come with caveats such as the possibility of backdoors in your prior. Given the certainty of attempts to add backdoors and adversarial examples to public training corpora, I expect terminal representations to be "stacked" with multiple modalities and models evaluating inputs. Validation models trained on curated, known-clean corpora will be used to help detect anomalous backdoor behavior in both training and inference.

  4. Evaluate and iterate model - Once you've reached a stop condition you evaluate the model. Right now I'm not sure there is a publicly known stop condition for RLHF/RLAIF runs; for DPO it's when you run out of data. Given that RLAIF updates from a set of terminal representations, updating on them too much causes overfitting. This might mean that the natural stop condition isn't convergence or running out of data but something based on e.g. policy entropy. You stop the run when you've finished resolving logical uncertainty by letting the prior interact with the terminals, which might be measured through an entropy metric. We should also notice that during these interactions the model is no longer being trained through teacher forcing, so more than one answer is acceptable. While RLAIF models aren't quite agents, they're definitely feedback loops and have much more control over their training process than the initial GPT prior did. Because the model is not actually an agent you probably still have the opportunity to iterate it, unless you are using a prior so superhuman it can hack any human mind that looks at its outputs. I doubt priors this strong will be used to make AGI due to agent overhang.
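Two sketches referenced in the list above. First, the "plain old rejection sampling" from step 1: rather than accepting a single random rollout, whose per-token errors compound, draw several rollouts and keep the one a verifier scores highest. `generate` and `verifier_score` are placeholder interfaces standing in for whatever sampling call and checker you trust, not a real API.

```python
def best_of_n(prompt, generate, verifier_score, n=16):
    """Best-of-n rejection sampling: sample n completions and keep the
    highest-scoring one instead of trusting a single random rollout."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verifier_score)
```

Second, the kind of "yes"/"no" logit evaluator described in step 3. This is written in the spirit of the MiniHF approach as described above rather than copied from it; the prompt template is my assumption and the gpt2 model is a placeholder where a strong instruction-tuned evaluator would actually be used.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder evaluator model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def principle_score(principle: str, response: str) -> float:
    """Score a response against one constitutional principle as
    P(yes) / (P(yes) + P(no)) over the evaluator's next-token logits."""
    prompt = (f"Principle: {principle}\nResponse: {response}\n"
              "Does the response satisfy the principle? Answer yes or no:")
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]      # next-token logits
    yes_id = tokenizer(" yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" no", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(torch.stack([logits[yes_id], logits[no_id]]), dim=0)
    return probs[0].item()

# Scores like this can be logged directly, distilled into a separate reward
# model, or fed to an RL/DPO-style update as the terminal evaluation signal.
print(principle_score("Be polite to the user.", "Thanks for asking! Here's how..."))
```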

Extrapolating The LLM Alignment Stack To Autonomous Agents

While this process seems to work well enough for a static checkpoint used as a tool or oracle (consider that ChatGPT is relatively well behaved even if Bing was a fiasco), I think basically everyone understands that LLMs have to become more than that to be "AGI" in the sense we care about when we're discussing "my good AI will stop your bad AI" or "pivotal acts". You need systems that can proactively understand and perform real time intervention in the world. These systems need to eventually be trustworthy for vast empowerment with broad responsibilities. Meanwhile in the real world they're

Adversarial Example Resistance

Hallucination, Confabulation, Inference, Retrieval: The Many Faces Of Prediction
