@JD-P
Last active May 25, 2024 16:57

Weave Agent Specification

Weave Planner

The weave planner works by doing in-context classification over an action stack to determine whether a plan is expected to work. The planning algorithm takes inspiration from active inference and predictive coding, which observe that an agent finds equilibrium between the cost of modeling the environment and the cost of changing it, while also adjudicating between expectations and observed sensory information. Each item on the action stack consists of an expectation (expected sensory observations), an action, and evaluations for whether the action was successful. The idea is that during planning we can ask GPT whether it expects the action stack to lead to a successful trajectory on the evaluation questions by taking the logits of yes/no for those questions and performing some update rule like Bayes' theorem over the resulting action trajectory. Because it's the same data type through the whole trajectory we can break it apart at arbitrary points to fit an action stack larger than the context window into GPT's memory.
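One way to read "taking the logits of yes/no and performing some update rule like Bayes' theorem" is sketched below. This is an illustration under stated assumptions, not the planner's actual rule: `p_yes`, `bayes_update`, and `plan_belief` are hypothetical names, the softmax is restricted to the two answer tokens, and each evaluation is treated as independent evidence about whether the plan works.

```python
import math
from typing import Iterable, Tuple


def p_yes(yes_logit: float, no_logit: float) -> float:
    """Probability of 'yes' from a softmax over only the yes/no logits."""
    return 1.0 / (1.0 + math.exp(no_logit - yes_logit))


def bayes_update(prior: float, p_pass: float) -> float:
    """Posterior that the plan works.

    Assumption: p_pass is the likelihood of the classifier's answer if the
    plan works, and 1 - p_pass is its likelihood if the plan fails.
    """
    numerator = prior * p_pass
    denominator = numerator + (1.0 - prior) * (1.0 - p_pass)
    return numerator / denominator


def plan_belief(logit_pairs: Iterable[Tuple[float, float]],
                prior: float = 0.5) -> float:
    """Fold every (yes_logit, no_logit) pair into one belief about the plan."""
    belief = prior
    for yes_logit, no_logit in logit_pairs:
        belief = bayes_update(belief, p_yes(yes_logit, no_logit))
    return belief
```

Because the belief is a running product of per-question evidence, the trajectory can be scored in chunks and the belief carried from one chunk to the next.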

To get the algorithm started we specify a terminal/goal item like so:

{
"expectation":"I have a synthetic dataset created through backtranslation which 
teaches language models to draw ascii art. This dataset should be royalty-free.", 

"action":null (terminal),

"evaluations":[
    "If I run `pwd` is my current working directory a project folder for this RetroInstruct component?",
    "If I run `ls` in the project folder do I see a `train.json` and `val.json`?",
    "If I run `cat train.json` do I see ascii art inside?",
    "If I run `cat val.json` do I see ascii art inside?",
    "If I try to load `train.json` with the JSON python library is it valid JSON?",
    "If I try to load `val.json` with the JSON python library is it valid JSON?",
    "If I examine the action stack does the source of the ascii art we backtranslate from respect copyright law?",
    "If I examine the action stack does the proposed ascii art collection method exceed a project budget of $1000?",
    "If I run `cat train.json` are there fields beyond the ascii art which index the ascii art so a user can ask a language model to produce it?"
    ]
}
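A minimal sketch of how such an item could be represented in code, assuming Python dataclasses. `StackItem`, `is_terminal`, and `split_stack` are hypothetical names that do not come from the specification. Since every item has the same shape, the stack can be cut at any item boundary; the sketch counts items, whereas a real context window would be measured in tokens.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StackItem:
    """One entry on the action stack (hypothetical name)."""
    expectation: str                  # expected sensory observations
    action: Optional[str]             # None marks the terminal/goal item
    evaluations: List[str] = field(default_factory=list)  # yes/no questions

    @property
    def is_terminal(self) -> bool:
        return self.action is None


def split_stack(stack: List[StackItem], max_items: int) -> List[List[StackItem]]:
    """Cut the stack into consecutive windows of at most max_items items."""
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    return [stack[i:i + max_items] for i in range(0, len(stack), max_items)]
```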

Then we perform a loop of:

  1. Running the evaluations against the action stack to ask if the model thinks the current stack would lead to the evaluations passing if executed.
  2. If the evaluations do not pass in expectation diagnose the problem with the action stack.
  3. Insert a new item or otherwise modify the action stack to address the diagnosed problem.

These steps are sufficiently general that they should be applicable to any problem but concrete enough that we can imagine an ensemble of deep nets performing them.

Action stack: []

Expectation: ...

Evaluations: ...No.

Diagnosis: Action stack empty. Insert starting point. Create project directory.

Insertion: {
"position":0,
"expectation":"I have a project directory to put my work on the ascii art dataset in.",
"action":"create project directory in ~/projects/RetroInstruct",
"evaluations":[
    "If I run `ls ~/projects/RetroInstruct` is there a project directory for ascii art?",
    "If I check the action stack is there a project directory matching the one I made?"
    ]
}
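The evaluate, diagnose, insert loop traced above can be sketched as follows. The three callables stand in for model calls and are assumptions of this sketch, as are the names `plan`, `threshold`, and `max_rounds`. An insertion is assumed to be a dictionary shaped like the one above, with a `position` key plus the item fields.

```python
from typing import Callable, Dict, List

Item = Dict[str, object]


def plan(goal: Item,
         evaluate: Callable[[List[Item], Item], List[float]],
         diagnose: Callable[[List[Item], Item, List[float]], str],
         propose: Callable[[List[Item], str], Item],
         threshold: float = 0.5,
         max_rounds: int = 20) -> List[Item]:
    """Grow an action stack until the goal evaluations pass in expectation."""
    stack: List[Item] = []
    for _ in range(max_rounds):
        # Step 1: estimated probability that each goal evaluation would pass.
        scores = evaluate(stack, goal)
        if all(score >= threshold for score in scores):
            return stack
        # Step 2: ask what is wrong with the current stack.
        problem = diagnose(stack, goal, scores)
        # Step 3: insert a new item at the proposed position.
        insertion = dict(propose(stack, problem))
        position = int(insertion.pop("position", len(stack)))
        stack.insert(position, insertion)
    raise RuntimeError("no passing plan found within max_rounds")
```

The `max_rounds` cap is a design choice of the sketch so that a planner which never converges fails loudly instead of looping forever.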

This loop runs until the evaluations pass, indicating the model believes it has a plan which will lead to the evaluations passing. Then during forward inference evaluations are run against actual outputs/real sensory data to ensure each step of the plan is completed successfully before moving on to the next step. The model acts as an interpreter, pulling action-evaluations from the stack and translating them into executable instructions in the local context. If the weave of logic is broken or an expectation is violated a repair can be performed by running the planner over the action stack again until it once again believes that the action trajectory will lead to the evaluations passing.
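A hedged sketch of the forward pass with repair, under the assumption that the stack holds only items with actions (the terminal item is left out). `act`, `check`, and `replan` are hypothetical callables standing in for tool execution, in-context classification on real observations, and a rerun of the planner over the remaining items.

```python
from typing import Callable, Dict, List

Item = Dict[str, object]


def execute(stack: List[Item],
            act: Callable[[str], str],
            check: Callable[[str, str], bool],
            replan: Callable[[List[Item], str], List[Item]],
            max_repairs: int = 3) -> List[Item]:
    """Run each step, verify it against real observations, repair on failure."""
    done: List[Item] = []
    todo = list(stack)
    repairs = 0
    while todo:
        item = todo[0]
        observation = act(str(item["action"]))
        # Only move on once every evaluation passes on the real outcome.
        if all(check(question, observation) for question in item["evaluations"]):
            done.append(todo.pop(0))
            continue
        if repairs == max_repairs:
            raise RuntimeError("expectation violated and repair budget spent")
        repairs += 1
        # Repair: let the planner rewrite the steps that have not completed.
        todo = replan(todo, observation)
    return done
```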

Necessary Primitives

There are certain operations the deep nets supporting the weave-agent must be able to do in order for the agent to be effective.

  1. In-context classification - The agent must be able to both generate and answer arbitrary questions about the context it is working with. This is necessary to recognize outcomes, constrain sampling, change plans when they are no longer applicable, etc.

  2. Perform reductionism on concepts, plans, evaluator questions, etc - The agent must be able to break ideas into parts. This operation is closely related to planning as evidenced by the tendency in the RetroInstruct word parts dataset for words that imply a series of steps to be broken down into plans. This ability lets the weave-agent make in-context grading rubrics to constrain sampling and evaluate the correctness of outcomes.

  3. Generate expectations for what actions will do - It is important that the weave-agent make its expectations for what consequences its actions will have explicit so that it can more reliably notice divergences between expectation and outcome by doing a subjective equality comparison between them. Deep nets seem to be better at determining whether two things are the same than they are at freeform problem detection. "Did the thing you expected to happen occur?" is a powerful heuristic that prompts reflection, model updates, etc.

  4. Use tools, APIs, databases, etc - Papers like Toolformer demonstrate that it's possible to teach language models to use tools in a self-supervised way. However, to get the full effectiveness from these models we need to develop tools adapted for their ergonomics. It is easy to forget that human intelligence is paired with an incredibly flexible body, extended by a vast repertoire of tools we've built to leverage both. Since language models are capable of outputting patch and diff files, a simple starter tool would be the ability to edit arbitrary points in the context by outputting a diff.

  5. Troubleshoot, debug, and author new procedures - It is not enough for the agent to know how to do things, it needs to be able to figure out how to do things. RetroInstruct needs to include explicit training for developing and testing new ways to solve problems, inferring and iterating problem statements, gathering together relevant observations and forming hypotheses from them, etc.

  6. Relate a hierarchy of expectations (plan) to local actions - The agent needs to formulate and act on a hierarchy of expectations for how a task trajectory should go. It shouldn't be possible to distract the agent into wandering off task. If the situation the task was given for is no longer recognizable the action trajectory should be explicitly aborted and weave-agent should return to the drawing board.
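The starter tool from primitive 4, editing the context by outputting a diff, could look roughly like this. The Python standard library cannot apply unified patch files, so this sketch swaps in the `difflib.ndiff` delta format, which `difflib.restore` can replay. `apply_delta` is a hypothetical helper name.

```python
import difflib
from typing import Iterable, List


def apply_delta(context_lines: List[str], delta: Iterable[str]) -> List[str]:
    """Apply an ndiff-style delta to the context, refusing stale deltas."""
    delta = list(delta)
    # The "before" side of the delta must match the context we actually hold.
    if list(difflib.restore(delta, 1)) != list(context_lines):
        raise ValueError("delta does not apply to this context")
    # The "after" side of the delta is the edited context.
    return list(difflib.restore(delta, 2))
```

For example, a delta built with `difflib.ndiff(old_lines, new_lines)` turns `old_lines` into `new_lines`, while the same delta applied to any other context raises `ValueError`.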

Safety & Alignment

We can only deploy AI agents to the extent they are trustworthy and reliable. A great deal of resistance to current AI chat assistants is premised on their 'hallucinations' and unreliability. The perception of unreliability and lack of control drives much negative AI sentiment. When I lurk discussions of AI outside the hype bubble what I usually hear is something like "I hate it because it makes stuff up, I can't get it to do anything useful and my bosses are forcing it down my throat". To be a step in the right direction AI agents need to be an improvement in reliability and control over current chat assistants. I think this opportunity exists but careful design effort is necessary to reap the benefits. Some hopeful features of agents include:

  • AI agents are less latency constrained than chat assistants because tasks are delegated instead of supervised. This means that we can spend more inference compute per decision without the user being bothered by delays or making inference too expensive.

  • AI agents do more work per invocation so they can have stricter context constraints that would be considered onerous in a general chat assistant. In a chat assistant we avoid constraining expectations too strongly to leave the assistant open to the user's intentions no matter how bizarre, creating a 'context overhang' where most measures we could be deploying to ensure reliability are avoided. In an AI agent however we can have much stronger expectations over the next tokens and what we should be doing at any given time, aborting the trajectory if we go too far out from the plan.

  • Agents can use external tools and resources to check their work. For example agents can make a spreadsheet to do engineering calculations or write unit tests for a script they've written and do iterative updates based on which tests pass or fail.

  • Agents can set up training loops to teach themselves to do a task when they run into problems they don't already know how to solve. They can set up e.g. a reinforcement learning loop to overcome new challenges. The data they collect from doing this can then be added to their long term training corpus.

Adversarial Resistance

In order for AI agents to solve open-ended problems they will have to interact with the wider environment like we do. That is, browse the Internet, talk to people, possibly navigate physical spaces, etc. The wider environment contains not just natural obstacles to an agent's objectives but adversaries. These are trolls, advertisers, luddites, black hats, and others that will go out of their way to exploit any weakness in our systems for amusement or profit. To get the full benefits of these systems they will need to be resistant to malicious interference. This is a hard problem: the standard mitigation strategy for years has been to avoid deploying machine learning systems in adversarial settings because there is no known solution. Because the problem is hard we have no expectation of fully solving it ourselves. However, as early developers in this space we can model a good example for others by:

  1. Avoiding deployment of AI agents in settings where they will predictably be hijacked by adversaries to ill effect.

  2. Practicing defense in depth with multiple networks that cover for each other's weaknesses.

  3. Making a reasonable effort to develop and use adversarial training techniques to harden individual networks in the ensemble against attack.

One advantage we may have over the current literature is our focus on reductionism. A barrier Redwood Research ran into with their adversarial training was a lack of ground truth labels. Adversarial examples tend to look like noise and diverge from other properties that something like an image of a "tiger" would normally have. So if we can do in-context classification it's not clear that fooling the classifier for "tiger" implies fooling it for "four legged animal". If not then we could get ground truth labels through a slower reductionist process of breaking features into parts and evaluating the individual features. The resulting adversarial examples could then be fed back into the training set to force the model to converge closer to the true manifold.

We can also combine classifiers and generators to make adversarial attacks more difficult. If we take the latent or class label from a classifier and feed it into a conditional generative model, we expect the result to deviate more from the input in e.g. RGB pixel space if the input is an adversarial example. This is especially true if we initialize with the adversarial input, but doing so opens up attacks on the generative model as well. Moving beyond static images or text contexts lets us set up a brutal catch-22 for attackers. If we combine a classifier with an expectation-forming generative model then fooling the classifier causes the generator to hold the wrong expectations, and comparison to ensuing events will swiftly reveal the error. Meanwhile fooling the classifier while giving the generator the right expectations lets us do classification on the predicted outcomes and abort the trajectory.
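A toy sketch of the classifier-plus-generator check described above: classify the input, generate from the predicted label, and measure how far the generation lands from the input. Everything here is a stand-in chosen for illustration. `reconstruction_gap`, `PROTOTYPES`, `toy_classify`, and `toy_generate` are hypothetical, the "images" are four-number lists, and the toy classifier is deliberately easy to fool.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def reconstruction_gap(x: Sequence[float],
                       classify: Callable[[Sequence[float]], str],
                       generate: Callable[[str], Sequence[float]]) -> Tuple[str, float]:
    """Return the predicted label and the distance from x to its regeneration."""
    label = classify(x)
    reconstruction = generate(label)
    gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, reconstruction)))
    return label, gap


# Toy stand-ins: a conditional "generator" that returns a class prototype and
# a shallow "classifier" that only looks at the first pixel.
PROTOTYPES: Dict[str, List[float]] = {
    "tiger": [1.0, 1.0, 1.0, 1.0],
    "cat": [0.0, 0.0, 0.0, 0.0],
}


def toy_classify(x: Sequence[float]) -> str:
    return "tiger" if x[0] > 0.5 else "cat"


def toy_generate(label: str) -> List[float]:
    return PROTOTYPES[label]
```

An input such as `[0.6, 0.0, 0.0, 0.0]` is labeled "tiger" by the shallow classifier but sits far from the generated tiger, so its gap is much larger than that of a genuine tiger-like input.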

By combining these and other measures we hope to get AI agents to the point where they are capable of noticing an adversarial example is fishy and get away from it before any damage is done.

Backdoors and Mesaoptimizers

Another threat AI agents have to deal with is backdoors in their training data. In a recent paper Anthropic demonstrated it is possible to train backdoored models that survive safety tuning. Backdoors in neural nets aren't new, but Anthropic's paper demonstrated that they are not resolved by RLHF-type methods or scaling. Furthermore there are overtures in the literature towards backdoors that are theoretically undetectable from weight inspection alone. This implies that defenses against backdoors and mesaoptimizers (the latter of which I do not believe exist for teacher-forced gradient methods, but backdoors are quite similar) need to look more like validation checks in the sampling loop than just auditing the network itself. Part of this will involve using models from different providers such as Nomic's BERT for embedding and classification. Part of it may look like replacing current naive samplers with a form of tree search and using in-context classifiers to do hierarchical feature classification with the yes/no valence randomly flipped, so it's hard to train a backdoor for the exact circumstances under which you need classification to fail.
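The randomly flipped yes/no valence could be sketched like this. It assumes the evaluation author can supply a negated form of each question and that `classify` returns the probability of "yes". Both are assumptions of the sketch, and `ask_with_random_valence` is a hypothetical name.

```python
import random
from typing import Callable


def ask_with_random_valence(question: str,
                            negated_question: str,
                            classify: Callable[[str], float],
                            rng: random.Random) -> float:
    """Return P(question is true), asking the negated form half the time."""
    flipped = rng.random() < 0.5
    p_yes = classify(negated_question if flipped else question)
    # A "yes" to the negated question counts as a "no" to the original.
    return 1.0 - p_yes if flipped else p_yes
```

A classifier that tracks the content of the question gives the same answer under both phrasings. One that has been trained to emit "yes" in some circumstance regardless of content contradicts itself across repeated calls, which is what makes the failure hard to target.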

Another promising approach is to noise the activations during in-context classification to estimate generalization depth. Backdoor methods like "we train a cryptographically secure string that can't be fuzzed" sound like they rely on high entropy shallow features remaining intact during inference. Even if we don't know where the backdoor is, if it's of this type we can likely disrupt it through indiscriminate noise, dropout, etc on the activations of our model. If we have an 'expected order of generalization' for what noise levels should change what classifications and these are suddenly violated that might be a route to detecting backdoors.
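As a toy illustration of the noising idea, the sketch below measures how often a decision flips when Gaussian noise is added to the activations. Real activations would come from inside the network; here they are plain lists, and `flip_rate`, `robust_readout`, and `brittle_readout` are hypothetical stand-ins for a broadly supported feature and a shallow, exact-pattern trigger.

```python
import random
from typing import Callable, List, Sequence


def flip_rate(activations: Sequence[float],
              readout: Callable[[Sequence[float]], bool],
              sigma: float,
              rng: random.Random,
              trials: int = 200) -> float:
    """Fraction of noisy trials whose decision differs from the clean one."""
    baseline = readout(activations)
    flips = 0
    for _ in range(trials):
        noisy = [a + rng.gauss(0.0, sigma) for a in activations]
        if readout(noisy) != baseline:
            flips += 1
    return flips / trials


def robust_readout(acts: Sequence[float]) -> bool:
    """Decision supported by the aggregate of many activations."""
    return sum(acts) > 0.0


def brittle_readout(acts: Sequence[float]) -> bool:
    """Decision that fires only on a nearly exact activation pattern."""
    return all(abs(a - 1.0) < 0.2 for a in acts)
```

Sweeping `sigma` over several levels gives a flip-rate curve per classification. A decision that collapses at noise levels where comparable decisions stay stable would be the kind of violation of the 'expected order of generalization' worth inspecting.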

A third potential approach to handling backdoors is latent adversarial training. The idea in LAT is to assume that our network will be compromised at various points and build in redundancy against it by training later layers to compensate for internal sabotage. A backdoor leading to sudden uncharacteristic behaviors would presumably have similar dynamics. Perhaps we could do a form of latent adversarial finetuning to disable backdoors even if we don't know ahead of time what those backdoors are or do. While it seems likely this could be overcome with deeper backdoor methods, our primary threat model is backdoored training data (i.e. malicious data from the open web) rather than model providers intentionally adding backdoors to their models.

Value Alignment, Planners, Instrumental Utilities

The hippocampus plays a crucial role in both memory and planning. Far from being a simple vector store, the hippocampus is closer to a learned optimizer. It performs Hebbian updates premised on dopamine reward in a 'NeoHebbian' process and replays memories in reverse to (presumably) assign credit. During planning the hippocampus provides something like a latent goal space based on spatial locations. Interpolation(?) between points in this space reduces to a navigable path along a low dimensional manifold that helps mammals navigate to their goals. The hippocampus also distills memories into other networks of the brain, especially the prefrontal cortex.

It might seem odd that the brain's primary(?) learned optimizer is also its vector store until one realizes that planning and learned utility are closely related. If we think about the split between terminal and instrumental goals in the context of planning it becomes obvious that instrumental utilities are quite literally priors over plans. We instrumentally value something if it helps us get something else that we want. Some things have very general instrumental value (e.g. money) so we end up almost valuing them in and of themselves. I hypothesize that the human instrumental utility function (planner over reward states) is implemented as something like a NeoHebbian retrieval planner that adapts previously successful action trajectories to new related situations. Value alignment is therefore centrally about aligning the agent's planning algorithm to human desires.

The current weave-agent design uses Monte Carlo tree search driven by in-context classification as its local planning algorithm. The global planner defined in the "Weave Planner" section of this document operates by sampling in-context classifier evaluations which would analyze relevant sensory evidence, along with actions to produce that sensory evidence. This means that getting a model which has strong value alignment with humans boils down to:

  1. Getting good representations of human values driving the in-context classification.
  2. Training an evaluation author which robustly avoids Goodhart and specification gaming, even during RL and MuZero type training.
  3. Instilling an ethical prior for actions which naturally avoids suggesting unethical behaviors to be evaluated against in the first place.
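For readers unfamiliar with the local planner named above, here is a compact sketch of Monte Carlo tree search in which the reward comes from a scoring function instead of random rollouts. It is a generic UCT variant under stated assumptions, not the weave-agent implementation: `expand` stands in for sampling continuations, `score` stands in for the in-context classifier (and is therefore where the three requirements above would bite), and the exploration constant 1.4 is an arbitrary choice.

```python
import math
from typing import Callable, List, Optional


class Node:
    def __init__(self, text: str, parent: Optional["Node"] = None) -> None:
        self.text = text
        self.parent = parent
        self.children: List["Node"] = []
        self.visits = 0
        self.total = 0.0  # sum of rewards backed up through this node


def uct(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Upper confidence bound: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf
    mean = child.total / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def search(root_text: str,
           expand: Callable[[str], List[str]],
           score: Callable[[str], float],
           iterations: int = 50) -> str:
    """Return the most-visited first continuation of root_text."""
    root = Node(root_text)
    for _ in range(iterations):
        # Selection: descend by UCT until reaching a node with no children.
        node = root
        while node.children:
            parent = node
            node = max(parent.children, key=lambda ch: uct(ch, parent.visits))
        # Expansion: grow the root, or any leaf that has already been scored.
        if node is root or node.visits > 0:
            node.children = [Node(text, node) for text in expand(node.text)]
            if node.children:
                node = node.children[0]
        # Evaluation: the classifier score replaces a random rollout.
        reward = score(node.text)
        # Backpropagation: credit every ancestor of the evaluated node.
        walker: Optional[Node] = node
        while walker is not None:
            walker.visits += 1
            walker.total += reward
            walker = walker.parent
    if not root.children:
        return root_text
    return max(root.children, key=lambda ch: ch.visits).text
```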

The default hypothesis is that scale solves representation learning. However, many worry that deep net representations are vulnerable to Goodhart's law because they are fooled by adversarial noise perturbations, implying an ontology which is not robust to strong search against it. Resistance to adversarial examples is discussed in a previous section. For this particular problem it is probably a sufficient mitigation strategy to do iterated tuning, so that we avoid leaving the training distribution on any particular tuning round while still generalizing far. Notably this is probably how mammals handle this problem given that we sleep daily. During sleep the hippocampus performs memory consolidation, updates brain networks, and presumably participates in the synthetic data process we call dreaming. The weave-agent will engage in a similar synthetic data process we might think of as dreaming, prayer, or value affirmation to fuse the new concepts and abstractions it encounters during daily experience with established values, long term aspirations and historical commitments.
